How Can the Large Hadron Collider Withstand One Petabyte of Data a Second?

Why is there something rather than nothing? That's the kind of question the Large Hadron Collider in CERN is hopefully poised to answer. And what is the output of this beautiful 17-mile long, 6 billion dollar wabi-sabish proton smashing machine? Data. Great heaping torrents of Grand Canyon sized data. 15 million gigabytes every year. That's 1000 times the information printed in books every year. It's so much data 10,000 scientists will use a grid of 80,000+ computers, in 300 computer centers , in 50 different countries just to help make sense of it all.

How will all this data be collected, transported, stored, and analyzed? It turns out, using what amounts to sort of Internet of Particles instead of an Internet of Things.

Two good articles have recently shed some electro-magnetic energy in the human visible spectrum on the IT aspects of the collider: LHC computing grid pushes petabytes of data, beats expectations by John Timmer on Ars Technica and an overview of the Brookhaven's RHIC/ATLAS Computing Facility. These articles explain the system well, so I won't repeat it all here, I'll just touch on some of the highlights.

For Main Detectors Generate the Data

First we need to know what produces the data: detectors that implement particular science experiments. The detectors are sensors that "watch" the collision for various attributes and send those samples on for analysis. In principle these are no different than a temperature sensor, but in practice some of the detectors have more parts than the Space Station.

There are four main detectors:

ATLAS - one of two general purpose detectors. ATLAS will be used to look for signs of new physics, including the origins of mass and extra dimensions.
CMS - the other general purpose detector will, like ATLAS, hunt for the Higgs boson and look for clues to the nature of dark matter.
ALICE - will study a "liquid" form of matter called quark–gluon plasma that existed shortly after the Big Bang.
LHCb - equal amounts of matter and antimatter were created in the Big Bang. LHCb will try to investigate what happened to the "missing" antimatter.

I love the image conjured up by Physicist Brian Cox when says how these four giant experiments, ATLAS, CMS, LHCb and ALICE, “photograph” the resulting miniature “big bangs.” To give an idea of the scale here, more than 2,000 other physicists from 37 countries will work on ATLAS. That's BIG science.

Summary

Some of the interesting factoids about the system, taken from the above mentioned articles, and from some great comments, are:

More data is produced by the detectors than can be stored. 2,000 CPUs at each of the four detectors filter the readings so only the interesting collisions are kept. This reduces the flow of data from a petabyte a second to a mere gigabyte per second.
- Uninteresting data is dropped. The format of the data is such that it is already pretty well compressed.
- Data flows through special-purpose hardware buffers until filtering decisions are made. They're reconstructed and reformatted in large computing clusters for further decisions. Final writes to disk are done in parallel, and wind up being some multiple of hundreds of MB/s -- well within the capacity of parallel RAIDs. Though the data stream is serial, events are discrete -- so it's easy to buffer writes across a number of systems that feed into the same large storage pool.
Data is streamed to the main compute facility on a dedicated 10Gbps connection.
Data is stored on 50PBs of tape storage on 20PBs worth of disk storage.
Disks are managed as a cloud service. Arrays of drives hook up to Linux boxes using a JBOD (Just a Bunch Of Disks) configuration. Each box has a 1Gbps connection. Any CPU in the storage cluster can read data from any other disk.
The grid is set up in a tiered system to distribute data down a tree of sites:
- Tier-0: the Large Hadron Collider located at CERN. CERN receives raw data from the ATLAS detector, performs a first-pass analysis, and then distributes it among 10 Tier-1 locations, also known as regional centers.
- Tier-1 level connects to Tier-0 over a dedicated 10Gbps optical network path.
- Only a fraction of the raw data will be stored, processed, and analyzed at Tier-1 facilities.
- Tier-1 facilities distribute derived data to Tier-2 computing facilities.
- Tier-2 facilities provide data storage and processing capacities for more in-depth user analysis and simulation. There are over one hundred Tier-2 data centers.
Brookhaven's RHIC/ATLAS Computing Facility is the only Tier-1 computing facility for ATLAS in the United States. Some of its vital stats are:
- 3000 CPUs dedicated to crunching Atlas data.
- 2 million gigabytes of distributed storage.
- Data is crunched on 24x7.
- The amount of data the comes in from CERN would fill a CD per second.
The infrastructure is in a state of constant rearchitecture. Disks fill, sites go down, and jobs fail, but it all could have gone a lot worse.
A physicist submitting a job to the grid:
- Feeds a configuration file with the specifics of what to do to a program that deals with the grid. Most of the magic is done by these programs: they take the configuration file, find the data (which may be anywhere in the world), run on it and then bring the output back to the user. Physicists generally don't see the grid unless something goes wrong. The program takes care of everything.
- A "job" in the context of LHC computing is something submitted to a single (virtual) machine. One job is only used to run over a very small dataset. Hundreds or even thousands of jobs are used for larger problems. Keep in mind that there is not just one dataset. Filters decide whether or not to accept an event, they also decide on which dataset it should go based on which particles it contains. It is possible to run over all of them, but very few analyses require this.
- You do not want to be running over a whole dataset every time because that is very hardware intensive. For each of the big experiments, the LHC writes out a few hundred MB/s and all of that data has to be analyzed. Instead, you run over it once and select the only the data that is interesting for your analysis and thereafter your jobs only look at that.
The network works so well that they've been able to invert the typical execute the job where the data is paradigm which says it's more efficient to send code to where the data is than it is to send data to the code. MapReduce works on this idea. With fast efficient networks they've found that they can stream data anywhere in the network in real-time.
Oracle is used by CERN to store collision event metadata, the major requirement being rapid synchronization across a global network.
Data is an just an accumulation of independent similar events. Loss of an event or even an entire tape is not ‘serious’. One copy of any piece of data will normally do which reduces costs considerably.

A Few Lessons Learned

Human can do cool things when mind, money, and will are all aligned. When a lot of money and a lot of genius people work together on a project they care deeply about, they can do wonders. The Human Genome Project and The Manhattan Project show that as well. Maybe we need fewer wars against and more projects for?
Reduce the flow of data at the source. Filter out unwanted data before it hits the main part of your system. This applies to sensor, logging, everything that is chatty.
Move data to code on fast networks. It's quite surprising that a fast network makes it possible to send data to code instead of code to data. This is exactly opposite of current best practices. Will data centers follow suite?
Buy new, don't fix old. Improvements in performance per Watt have caused CERN to no longer sign hardware support contracts longer than three years. Machines run until they die. They have a very high utilization of equipment (‘duty cycle’, 7 x 24 x 365). Replacing hardware makes more sense because of the lower cost and the power savings of new hardware.
Large pools of expert users and staff promotes complexity.
Many can appear as one. A highly distributed system, composed of resources from over the world, using many different technologies, under many different spheres of political control, can be abstracted out to appear as one system to users. It's quite an accomplishment.