PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute
This is a guest post by Jon Vlachogiannis, Founder/CTO of BugSense, and Panagiotis Papadomitsos, Head of Infrastructure at BugSense.
There has been a lot of speculation about whether PRISM exists and whether it would be cost effective. I don't know whether it exists, but I can tell you whether it could be built.
Short answer: It can.
If you believe that someone with access to a social "data pool" could never find out more about you in the tsunami of data (if they really wanted to track you down), you need to think again.
Devices, apps and websites are transmitting data. Lots of data. The questions are: could that data be compiled and searched, and how costly would it be to search it for your targeted data? (Hint: it is not $4.56 trillion.)
Let's experiment and try to build PRISM by ourselves with a few assumptions:
- Assumption 1: We have all the appropriate "data connectors" that will provide us with data.
- Assumption 2: These connectors provide direct access to social networks, emails, mobile traffic etc.
- Assumption 3: Even though there are commercially available solutions that might perform better for data analysis, we are going to rely mostly on open source tools.
With those assumptions, how much would it cost us to have PRISM up and running and to find information about a person in less than a minute?
Let’s begin with what data is generated every month that might contain information about you.
Data
Facebook: 500 TB/day * 30 = 15 PB/month (source)
Twitter: 8 TB/day * 30 = 240 TB/month (source)
Email/Other info: 193 PB/month. Google reported processing 24 PB per day in 2008; five years later, let's assume this is 8 times bigger = 192 PB/day, of which real user information is about 1/3 = 64 PB/day (source)
Mobile traffic/machine-to-machine exchanges/vehicles etc.: 4,000 TB per day ≈ 117 PB/month (source)
Total Data ≈ 325 PB/month
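The total is just back-of-the-envelope arithmetic over the figures above. Here it is as a tiny Python sketch, assuming a 30-day month and taking the email and mobile figures as cited rather than re-deriving them:

```python
# Back-of-the-envelope monthly data volume, using the per-source figures cited above.
DAYS = 30          # assumed days per month
TB_PER_PB = 1000   # decimal units are fine for rough estimates

monthly_pb = {
    "facebook": 500 * DAYS / TB_PER_PB,  # 500 TB/day  -> 15 PB/month
    "twitter":  8 * DAYS / TB_PER_PB,    # 8 TB/day    -> 0.24 PB/month
    "email":    193,                     # ~193 PB/month, as cited above
    "mobile":   117,                     # ~4,000 TB/day, as cited above
}

total = sum(monthly_pb.values())
print(f"Total: ~{total:.0f} PB/month, ~{total * 12 / 1000:.1f} EB/year")
# -> Total: ~325 PB/month, ~3.9 EB/year
```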
Hardware Costs
The prices below correspond to renting off-the-shelf servers from commercial high-end datacenters (assuming the data is stored in a distributed filesystem architecture such as HDFS). This is a worst-case scenario that does not include potential discounts for renting such a high volume of hardware and traffic, or for buying the hardware outright (which incurs a higher initial investment but lower recurring costs). The hardware configuration used for calculating costs in this case study is a 2U chassis with dual Intel hexa-core processors, 16 GB of RAM, and 30 TB of usable space with hardware-level redundancy (RAID-5).
We'll need about 20K servers, housed in 320 46U racks. The server hardware is calculated to cost about €7.5M/month (including servers for auxiliary services). The racks, electricity and traffic are calculated to cost about €0.5M/month (including auxiliary devices and networking equipment).
Total hardware cost per year, for roughly 3.9 EB of data generated per year: €168M
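As a rough sanity check on the server count, here is a minimal sizing sketch, assuming the cluster only needs to hold a rolling window of recent raw data rather than a full year's worth (the rolling-window assumption and the numbers below are illustrative only):

```python
# Rough capacity check for the cluster described above.
SERVERS = 20_000
USABLE_TB_PER_SERVER = 30        # 2U box with RAID-5, as described above
INGEST_PB_PER_MONTH = 325        # from the data section

usable_pb = SERVERS * USABLE_TB_PER_SERVER / 1000
print(f"Usable capacity: ~{usable_pb:.0f} PB")                                      # ~600 PB
print(f"Retention at full ingest: ~{usable_pb / INGEST_PB_PER_MONTH:.1f} months")   # ~1.8 months
```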
Development Costs
- 3 (top-notch) developers -> 1.5M per year
- 5 administrators -> 1.5M per year
- 8 more supporting developers -> 2M per year
- Developer costs -> $1M-5M per year (assumes avg developer pay of $500K per year) = €3.74M
Total personnel costs: €4M
Total Hardware & Personnel Costs: €12M per month (€144M per year) = $187M per year
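For completeness, the dollar figure assumes a mid-2013 exchange rate of roughly 1.30 USD per EUR:

```python
# Yearly cost in dollars, assuming a mid-2013 exchange rate of ~1.30 USD per EUR.
EUR_PER_MONTH = 12e6        # total hardware & personnel, per month
USD_PER_EUR = 1.30          # assumed exchange rate

eur_per_year = EUR_PER_MONTH * 12
usd_per_year = eur_per_year * USD_PER_EUR
print(f"~EUR {eur_per_year / 1e6:.0f}M per year ~= USD {usd_per_year / 1e6:.0f}M per year")
# -> ~EUR 144M per year ~= USD 187M per year
```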
Software
On the software side, the two main components necessary are:
- A stream (in-memory) database to alert on specific events or patterns taking place in real time, and to perform aggregations and correlations.
- A MapReduce system (like Hadoop) to further analyze the data.
Now that we know the cost of finding anything about you, how would it be done?
The data is "streamed" to the stream database from the data connectors (social networks, emails, etc.), aggregated, and saved to HDFS so that a MapReduce system can analyze it offline.
(BugSense does exactly the same thing with crashes coming from 520M devices around the globe, with fewer than 10 servers, using LDB, so we know this is both feasible and cost efficient. Yup, 10 servers for 520M devices. In real time.)
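Here is a minimal sketch of that pipeline in Python: events arrive from the connectors, are aggregated in memory per user, and the aggregates are periodically flushed to files that a MapReduce job would later pick up. The field names and file layout are placeholders, not a description of any real system:

```python
import json
import time
from collections import defaultdict

# In-memory, per-user aggregates (a stand-in for the stream database).
aggregates = defaultdict(lambda: {"events": 0, "sources": set()})

def handle_event(event: dict) -> None:
    """Aggregate a single incoming event, keyed by a (hypothetical) user_id field."""
    agg = aggregates[event["user_id"]]
    agg["events"] += 1
    agg["sources"].add(event["source"])

def flush(path: str) -> None:
    """Dump the aggregates to a file for offline MapReduce analysis
    (in a real deployment this would land on HDFS)."""
    with open(path, "w") as out:
        for user, agg in aggregates.items():
            out.write(json.dumps({"user_id": user,
                                  "events": agg["events"],
                                  "sources": sorted(agg["sources"])}) + "\n")
    aggregates.clear()

# Example with fake events:
handle_event({"user_id": "alice", "source": "twitter"})
handle_event({"user_id": "alice", "source": "email"})
flush(f"/tmp/aggregates-{int(time.time())}.jsonl")
```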
Next, we'd run a new search query on the 325 PB dataset. How long would that take?
We could use Hive to run a more SQL-like query on our dataset, but this might take a long time, because the "jobs" have to be mapped out across the cluster, the data has to be read and processed, and the results have to be sent back and "reduced"/aggregated on the main machine.
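To make that concrete, the lookup would look roughly like the sketch below: a Hive-style query on top, and the map/reduce steps it boils down to underneath. The table and column names are made up for illustration:

```python
# A Hive-style query for "everything about one user"; names are invented for the example.
HIVE_QUERY = """
SELECT source, event_time, payload
FROM raw_events
WHERE user_id = 'target_user'
"""

# Under the hood this becomes a MapReduce job that touches every record, roughly:
def map_phase(record: dict):
    # Every mapper reads its slice of the raw data and emits only matching records.
    if record.get("user_id") == "target_user":
        yield ("target_user", record)

def reduce_phase(key: str, records: list) -> dict:
    # The reducer collects everything emitted for that key on one machine.
    return {"user_id": key, "matches": len(records), "records": records}
```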
To speed this up, we can create a small program that saves data in a columnar format in a radix tree (as KDB and Dremel do), so searching is much faster. How much faster? Probably less than 10 seconds over 400 TB for simple queries. That translates (very naively) to less than 10 seconds to find information about you.
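A toy illustration of why that helps: instead of scanning every row, a per-column store plus an index on the user identifier jumps straight to the matching rows. Real columnar engines and radix trees are far more sophisticated; here a plain dictionary index stands in for the radix tree and the column layout is invented for the example:

```python
from collections import defaultdict

# Toy columnar layout: one list per column, plus an index from user_id to row numbers.
columns = {"user_id": [], "source": [], "payload": []}
index = defaultdict(list)

def append_row(user_id: str, source: str, payload: str) -> None:
    row = len(columns["user_id"])
    columns["user_id"].append(user_id)
    columns["source"].append(source)
    columns["payload"].append(payload)
    index[user_id].append(row)

def lookup(user_id: str):
    """Touch only the matching rows instead of scanning the whole dataset."""
    return [(columns["source"][r], columns["payload"][r]) for r in index[user_id]]

append_row("alice", "twitter", "hello world")
append_row("bob", "email", "lunch?")
append_row("alice", "mobile", "gps ping")
print(lookup("alice"))   # [('twitter', 'hello world'), ('mobile', 'gps ping')]
```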
Do you think that PRISM can be built using a different tech stack?