Here’s a question: what is the universe made of? Not an easy one to answer, and a little daunting to think about. Ask a chemist and they’ll say chemicals or atoms, while a nuclear physicist will tell you about protons, neutrons, and the atomic nucleus.
If you haven’t had enough, take a trip to CERN in Switzerland, ask one of its 10 000 particle physicists, and you’ll hear about quarks, gluons, leptons, neutrinos, virtual particles, and the Standard Model. Which mostly sounds made up.
To put your minds at ease, I’ll tell you that the particle physicists at CERN are putting their Standard Model to the test. They aim to better understand how fundamental particles interact by crashing them into each other at very high energies and studying what comes out. The problem is, they’ve reached such incredibly small length scales that they need record-breaking collision energies to uncover new physics.
Enter the Large Hadron Collider (LHC), which is more than a prop name from The Big Bang Theory. The LHC is a massive circular particle accelerator with the power to take us down to the level of quarks. It’s the largest scientific experiment in the world, large enough to bring protons to about 99.9999991% of the unattainable speed of light.
At full speed, bunches of protons circulate the LHC’s 27 km ring in opposite directions, completing more than 11 000 laps per second. The bunches are carefully synchronised and focused to overlap at certain locations, where detectors surround the beam pipe in segmented layers to track the trajectories of the different reaction products.
With 600 million particle collisions per second, the amount of data generated is unmanageable. It’s the biggest of big data. Obviously, though, managing the data is crucial. It’s a complicated process where physicists reconstruct the events and extract the physics from the data.
They do this by collaborating over the LHC Computing Grid, starting from the tier-0 data centre on site at CERN. Data is shared in series down three more tiers, distributing the processing. At each consecutive tier, the data is further mined and aggregated before being made available.
Thus, the size of the datasets decreases while their relevance, demand, and availability increase. The data is finally reduced to a graph which provides us with greater understanding.
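To make the tiered-reduction idea concrete, here is a toy sketch in Python. It is loosely inspired by the tier model described above, not CERN’s actual software; the event fields, cut, and bin width are invented for illustration.

```python
import random

random.seed(42)

# "Tier 0": raw events straight off the detector -- large and noisy.
raw_events = [{"energy": random.gauss(100, 15), "valid": random.random() > 0.1}
              for _ in range(100_000)]

# Tier 1: apply quality cuts, keeping only valid events.
tier1 = [e for e in raw_events if e["valid"]]

# Tier 2: aggregate into a histogram -- far smaller, far more relevant.
bins = {}
for e in tier1:
    bucket = int(e["energy"] // 10) * 10
    bins[bucket] = bins.get(bucket, 0) + 1

# Tier 3: the final "graph" -- a compact summary any analyst can read.
summary = dict(sorted(bins.items()))
print(f"{len(raw_events)} raw events -> {len(tier1)} filtered -> {len(summary)} histogram bins")
```

Each tier shrinks the dataset while raising its relevance: 100 000 raw rows end up as a couple of dozen histogram bins.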
This case study goes to show that physicists may have more in common with the business world than you think. We both deal with big data. Though, where we rely on cutting-edge open source, you would more likely require the data governance, production stability, and reliable support that enterprise software provides. That said, we shouldn’t dismiss the fact that your business still needs leading-edge data science to manage your data.
Consider SAP, an enterprise-ready software suite. How would it weigh in against CERN’s data landscape? Let’s take a look. SAP Event Stream Processor (ESP) offers a programmable buffer for on-the-fly treatment of inbound events, so that we can handle large data streams in real time. With that first step covered, we then need to get the data to the analysts, which can be done using SAP Landscape Transformation (SLT) to automate the data replication between system landscapes. This achieves the tiered architecture in short order.
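The “programmable buffer” idea is easier to see in code. Below is a minimal sketch of on-the-fly stream treatment in plain Python; the `StreamBuffer` class and its methods are invented for illustration and are not SAP ESP’s API.

```python
from collections import deque
import time

class StreamBuffer:
    """Keeps a rolling window of recent events and a running aggregate."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.events = deque()          # (timestamp, value) pairs
        self.running_sum = 0.0

    def ingest(self, value, now=None):
        """Accept one inbound event and update aggregates on the fly."""
        now = time.monotonic() if now is None else now
        self.events.append((now, value))
        self.running_sum += value
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            _, old = self.events.popleft()
            self.running_sum -= old

    def rate(self):
        """Events per second inside the current window."""
        return len(self.events) / self.window

# Simulate 500 events arriving over half a second.
buf = StreamBuffer(window_seconds=1.0)
for i in range(500):
    buf.ingest(1.0, now=i * 0.001)
print(buf.rate(), buf.running_sum)
```

The point is that aggregates are maintained as events arrive, so downstream consumers never see the raw firehose, only the rolling summary.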
Sounds simple enough, but really, moving data around and making physical copies at each aggregation step is inefficient. That’s why, as a physicist, I am particularly excited about SAP HANA. Take a look at how this system can follow LHC data management principles to help businesses deal with their big data issues.
HANA’s in-memory platform can reproduce CERN’s tiered data landscape very elegantly inside a single device, with a beastly amount of RAM and parallel cores to accelerate an intuitive modelling approach based on optimised abstraction. It delivers the data so fast that you would think you were querying a physically prepared dataset sitting in RAM.
Direct, Efficient Storage & Delivery
HANA stores “tier-0” data once, in its raw form, in or close to memory, which data modellers can then prepare and distribute with different degrees of granularity (preparedness) by creating analytic views and analytic privileges.
As a result, serving data to differently skilled analysts with appropriate levels of data detail is made easy and efficient. Better yet, for the experienced data wrangler, there is direct access to the raw data and generous processing power. This brings data transformation back from the batch processes of ETL and data warehousing to where it belongs: an integral part of the analysis workflow.
This is a natural facet of HANA. When an analysis workflow reveals an interesting or useful arrangement of the data, a new view can be made available, with no overhead.
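The principle behind these virtual views can be sketched in a few lines of Python. The table, view functions, and field names below are invented for illustration; this shows the idea of views computed on demand over one stored copy, not HANA’s actual API.

```python
# The single stored "tier-0" table: raw data, kept once.
raw_sales = [
    {"region": "EU", "product": "A", "amount": 120.0},
    {"region": "EU", "product": "B", "amount": 75.0},
    {"region": "US", "product": "A", "amount": 200.0},
    {"region": "US", "product": "B", "amount": 50.0},
]

def detailed_view():
    """For the data wrangler: full raw rows, streamed without copying."""
    return (row for row in raw_sales)

def regional_view():
    """For the business analyst: amounts aggregated per region,
    computed on demand from the same raw table."""
    totals = {}
    for row in raw_sales:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

print(regional_view())  # -> {'EU': 195.0, 'US': 250.0}
```

Because each view is just a definition over the raw table, publishing a new arrangement of the data really does cost nothing beyond writing it down: no extract, no physical copy.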
The value grows as more analysts are able to explore the virtually prepared dataset. This productivity compounds down the virtual tiers, until the business analyst is presented with datasets that are manageable with spreadsheet know-how.
The data-centric convergence of business operations onto the HANA platform makes large amounts of interesting business data available with the agility of analysis that I have grown used to in the scientific world. It’s clear that the merging of scientific trends into the business world allows us to take advantage of our shared problems, and more importantly, our shared solutions.
Let’s face it: the technology landscape is evolving fast, and effective data management has become both urgent and complex. In this area, SAP is clearly on the right track, and HANA is our opening into the discoveries waiting in that data. This is precisely what pulled me out of the physics lab.
Jake Bouma - Repurposed Nuclear Physicist at Britehouse