Monday, July 1, 2013

PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute

This is a guest post by Jon Vlachogiannis, Founder/CTO of BugSense, and Panagiotis Papadomitsos, Head of Infrastructure at BugSense.

There has been a lot of speculation and assumption about whether PRISM exists and whether it is cost effective. I don't know whether it exists or not, but I can tell you whether it could be built.

Short answer: It can.

If you believe it would be impossible for someone with access to a social "datapool" to find out more about you in that tsunami of data (if they really want to track you down), you need to think again.

Devices, apps and websites are transmitting data. Lots of data. The questions are: could that data be compiled and searched, and how costly would it be to search for your targeted data? (Hint: it is not $4.56 trillion.)

Let's experiment and try to build PRISM by ourselves with a few assumptions:

  • Assumption 1: We have all the appropriate "data connectors" that will provide us with data.
  • Assumption 2: These connectors provide direct access to social networks, emails, mobile traffic etc.
  • Assumption 3: Even though there are commercially available solutions that might perform better for data analysis, we are going to rely mostly on open source tools.

With those assumptions, how much would it cost us to have PRISM up and running and to find information about a person in less than a minute?

Let’s begin with what data is generated every month that might contain information about you.

DATA

Facebook: 500 TB/day * 30 = 15 PB/month (source)

Twitter: 8 TB/day * 30 = 240 TB/month (source)

Email/other info: ~193 PB/month. Google said it processed 24 PB per day back in 2008; five years later, let's assume that figure is 8 times bigger = 192 PB, roughly 1/3 of which is real user information (source)

Mobile traffic/machine-to-machine exchanges/vehicles etc.: 4,000 TB/day = ~117 PB/month (source)

Total data: ~325 PB/month
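As a quick sanity check on the arithmetic above, here is a back-of-envelope script (the daily rates and monthly estimates are the figures quoted above; the 30-day month is an assumption):

```python
# Back-of-envelope monthly data volumes, in PB (1 PB = 1000 TB).
DAYS = 30

facebook_pb = 500 / 1000 * DAYS   # 500 TB/day -> 15 PB/month
twitter_pb = 8 / 1000 * DAYS      # 8 TB/day -> 0.24 PB/month
email_pb = 193                    # ~193 PB/month, estimate above
mobile_pb = 117                   # ~117 PB/month (4,000 TB/day), estimate above

total_pb = facebook_pb + twitter_pb + email_pb + mobile_pb
print(f"Total: ~{total_pb:.0f} PB/month")   # Total: ~325 PB/month
```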

Hardware Costs

The prices below correspond to renting off-the-shelf servers from commercial high-end datacenters (assuming the data is stored in a distributed filesystem architecture such as HDFS). This is a worst-case scenario: it does not include the potential discounts for renting such a high volume of hardware and traffic, or for acquiring the hardware outright (a higher initial investment but lower recurring costs). The hardware configuration used for the cost calculations in this case study is a 2U chassis with dual Intel hexa-core processors, 16 GB of RAM, and 30 TB of usable space with hardware-level redundancy (RAID 5).

We'll need about 20K servers, housed in 320 46U racks. The server hardware comes to about €7.5M/month (including servers for auxiliary services); racks, electricity and traffic add about €0.5M/month (including auxiliary devices and networking equipment).

Total hardware cost per year for 3.75 EB of data storage: €168M

Development Costs

  • 3 (top-notch) developers -> €1.5M per year
  • 5 administrators -> €1.5M per year
  • 8 more supporting developers -> €2M per year
  • Total developer costs -> ~$5M per year (assuming average fully loaded developer pay of $500K per year) ≈ €3.74M

Total personnel costs: ~€4M per year

Total Hardware & Personnel Costs: €12M per month (€144M per year) = $187M per year

Software

On the software side, the two main components necessary are:

  • A Stream (in-memory) Database to alert on specific events or patterns taking place in real time, and to perform aggregations and correlations.
  • A MapReduce system (like Hadoop) to further analyze the data.

Now that we know the cost of finding anything about you, how would it be done?

The data is "streamed" to the Stream Database from the data connectors (social networks, emails etc.), aggregated, and saved to HDFS so that a MapReduce system can analyze it offline.
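To make the data flow above concrete, here is a toy sketch of the stream stage (this is not the actual PRISM or BugSense architecture; the event shape, field names and the alert rule are all made up): ingest events, keep rolling per-user aggregates, raise alerts on a watched pattern, and periodically flush batches to storage standing in for HDFS.

```python
import json
from collections import defaultdict

# Hypothetical alert rule: flag users whose messages contain watched words.
WATCH_KEYWORDS = {"wire", "transfer"}

class StreamStage:
    def __init__(self, batch_size=3):
        self.counts = defaultdict(int)   # rolling events-per-user aggregate
        self.batch = []
        self.batch_size = batch_size
        self.alerts = []
        self.flushed = []                # stands in for writes to HDFS

    def ingest(self, event):
        self.counts[event["user"]] += 1
        if WATCH_KEYWORDS & set(event.get("text", "").lower().split()):
            self.alerts.append(event["user"])
        self.batch.append(event)
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        # A real system would append the batch to HDFS for the offline
        # MapReduce jobs; here we just keep the serialized batch in memory.
        self.flushed.append(json.dumps(self.batch))
        self.batch = []

stage = StreamStage()
for ev in [
    {"user": "alice", "text": "lunch at noon"},
    {"user": "bob",   "text": "wire the funds"},
    {"user": "alice", "text": "ok"},
]:
    stage.ingest(ev)

print(stage.counts["alice"], stage.alerts)   # 2 ['bob']
```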

(BugSense does exactly the same thing with crashes coming from 520M devices around the globe, using fewer than 10 servers running LDB, so we know this is both feasible and cost efficient. Yup, 10 servers for 520M devices. In real time.)

Next, we'd run a new search query on the entire dataset. How long would that take?

We could use Hive to run a more SQL-ish query on our dataset, but this might take a lot of time: data "jobs" need to be mapped, read and processed, and the results need to be sent back and "reduced"/aggregated on the main machine.

To speed this up, we can create a small program that saves data in columnar format in a radix tree (as KDB and Dremel do) so that searching is much faster. How much faster? Probably less than 10 seconds over 400 TB for simple queries. That translates (very naively) to less than 10 seconds to find information about you.
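To make the "columnar format plus radix tree" idea concrete, here is a minimal sketch (a plain character trie rather than a compressed radix tree, and an in-memory list standing in for the column files): one column's values are stored contiguously, and a prefix tree maps each value to the row positions where it appears, so a point lookup never scans the column.

```python
# Minimal sketch: one column stored contiguously, plus a character trie
# (a simplified radix tree) mapping each value to its row positions.

class Trie:
    def __init__(self):
        self.children = {}
        self.rows = []          # row ids where this key occurs

    def insert(self, key, row_id):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.rows.append(row_id)

    def lookup(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return []       # key never occurs in the column
        return node.rows

# The "user" column of a toy event table, stored column-wise.
user_column = ["alice", "bob", "alice", "carol", "bob", "alice"]

index = Trie()
for row_id, user in enumerate(user_column):
    index.insert(user, row_id)

print(index.lookup("alice"))   # [0, 2, 5]
```

A production system would compress shared prefixes into radix-tree edges and keep the column on disk, but the access pattern is the same: walk the key, then read only the matching row positions.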

Do you think that PRISM can be built using a different tech stack?


Reader Comments (15)

Dual Intel 8 core, 16 GB of RAM, 30 TB of usable space combined with hardware-level redundancy (RAID5) for $375/month. Please show me where I can get that kind of prices!!! hs1.8xlarge instance on AWS costs 4 times more and has similar spec. Also RAID5 is not an acceptable storage for any serious load. I think your calculations are order of magnitude lower than they should be.

July 1, 2013 | Unregistered Commentercurious

I like your free, infinite bandwidth.

July 1, 2013 | Unregistered CommenterMcCrab

If you guys have any questions, we are more than happy to answer them! Ping me at @jonromero!

July 1, 2013 | Unregistered CommenterJon Vlachogiannis

[/home/snowden]
# cat prismdata.txt | grep "$targetname"


---
that could work..

July 1, 2013 | Unregistered Commenterunixhead

I think the development costs are way underestimated. I have an experience with projects that were probably much smaller and less complex, and there were dozens of developers working just on the integration (ESB and this sort of stuff) + there were multiple dba's and database developers and it wasn't even as volume intensive as the potential PRISM (hundreds gb a month). And I'm not even talking about UI, reporting, BI and so on. And now add that you are doing all of this in extremely constrained environment due to security and other reasons, recruiting is going to be extremely slow and difficult (they apparently failed in this part).

July 1, 2013 | Unregistered CommenterTomas

Could you please advise on where the avg developer makes 500k/y?

July 1, 2013 | Unregistered CommenterSergey

Nice work on calculations. I am sure hardware, software and development costs will be much higher than given assumptions.

July 2, 2013 | Unregistered CommenterRaju Bhupathi

DATA

Facebook: 500 TB/day * 30 = 1.5 PT/month (source)

Twitter: 8 TB/day * 30 = 240 TB/month 8 TB/day (source)

Email/Other info: 193PT/month Google says 24 PB per day (2008). Five years later lets assume this is 8 times bigger = 192 PB. Now, real user information is 1/3 = 64 PT/day (source)

Mobile traffic/machine­to­machine exchanges/vehicles etc: 4000 TB per day = 117 PB/month (source)

Total Data =~312PB month

are you kidding????!!!!

July 2, 2013 | Unregistered CommenterHao Zhong

the technology was built with spying in mind from the ground up
do a google search of "amdocs phone records" this has been going on for years
do a lil research and as the smart people i know you to be you will see what's going on..= ) have a nice day

July 2, 2013 | Unregistered Commenterwethecom

Nice article.

Who cares if the cost are off or not. The premise was if it was possible or not. And given the information, its certainly possible. As for cost. Lets not forget that this people printed more than a trillion dollars of debt (as they "borrow" from the FED to print the money) per year. So even if the figure where 1 billion. Is still certainly well within the capacity of the USA government.

They had the means, the opportunity and certainly the motive. And there is a witness. The only thing left is the smoking gun.

Problem is that the suspect is investigating himself and decided that the witness is the guilty party. You got more chance of a meteor hitting the White House than of ever proving what really happened (or not).

July 2, 2013 | Unregistered Commenterrxantos

Hey Curious,

That's not too far off the cost of a dedicated server - the following is £329 per month
2 x 8 core Intel 2.66
128GB RAM
36TB storage (12 x 3TB SATA 3)
There are some really cheap options and you can always talk providers into extras. Also you can talk them into discounts/custom builds - if you're spending hundreds per month they look after you. If you are spending millions per month they will do anything for you :)

July 2, 2013 | Unregistered Commenter3Dhendo

$500k per year for a developer seems a little high!
If everything else is similarly up in price then it becomes something any country could afford (or large company)

Big problem is still pulling the real information out.
For example Email/text
'The main course is at 8pm, make sure you cover the sweet trolly

In open text does this mean evening meal starts at 8 or confirm the time for the attack to start, make sure you cover our exit

Meaning is everything, you still need the context of the information - you still can't beat a real agent on the ground!

July 2, 2013 | Unregistered Commenterbrianm

Nice article. Couple of other points to consider.

This doesn't take into consideration encrypted communications. I imagine there would be a similar sized cluster (maybe a couple of supercomputers) for breaking encryption.

The prices are quite low as well. This program was top secret; there is automatically an inflated cost of everything to store and process data in top secret networks.

I'd imagine that the NSA would not be comfortable renting components and would opt to perform the majority of work in-house (or at least through contractors).

Backup / retention has not been considered.

Dev / Test environments has not been considered.

July 3, 2013 | Unregistered CommenterChris

Hey guys, thanks for the replies!

First of all, these are not AWS servers. And yes, the sane thing to do is to build a datacenter. Probably in Utah. I wanted to demonstrate how someone that doesn't want to build a datacenter can do it. Trust me, you can get a great discount if you start renting all these machines - as many people mentioned.

About the salaries: For a great (and also "trusted") engineer, 500k is nothing if you consider insurance, taxes and the other costs a company needs to pay. At the end of the day, maybe around 250k ends up in his pocket (people on Hacker News suggest even less).

Thanks again for reading this and commenting! Feel free to "bug" me at @jonromero!

July 4, 2013 | Unregistered CommenterJon Romero

The NSA released Accumulo. Might be related :)

July 9, 2013 | Unregistered CommenterG
