Paper: Nanocubes for Real-Time Exploration of Spatiotemporal Datasets
How do you turn Big Data into fast, useful, and interesting visualizations? Using R and a technology called Nanocubes. The visualizations are stunning and amazingly responsive, and almost as interesting as the technology behind them.
David Smith wrote a great article explaining the technology and showing a demo by Simon Urbanek of a visualization that uses 32 TB of Twitter data. It runs smoothly and interactively on a single machine with 16 GB of RAM. For more information and demos, go to nanocubes.net.
David Smith sums it up nicely:
Despite the massive number of data points and the beauty and complexity of the real-time data visualization, it runs impressively quickly. The underlying data structure is based on Nanocubes, a fast data structure for in-memory data cubes. The basic idea is that nanocubes aggregate data hierarchically, so that as you zoom in and out of the interactive application, one pixel on the screen is mapped to just one data point, aggregated from the many that sit "behind" that pixel. Learn more about nanocubes, and try out the application yourself (modern browser required) at the link below.
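To make that hierarchical-aggregation idea concrete, here is a minimal Python sketch (a toy of my own, not the authors' implementation): each point is counted into one web-map-style tile at every zoom level, so whichever tile a pixel maps to already holds a precomputed count.

```python
# Toy sketch of hierarchical spatial aggregation (not the nanocube algorithm):
# every point is counted into one tile per zoom level, so a query at any zoom
# can read a single precomputed count per tile/pixel.
from collections import defaultdict

MAX_LEVEL = 8  # assumed maximum zoom depth for this toy example


def tile_at(lon, lat, level):
    """Map a lon/lat point to an (x, y) tile index at the given zoom level."""
    n = 2 ** level
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((90.0 - lat) / 180.0 * n), n - 1)
    return x, y


def build_counts(points):
    """Aggregate points into per-tile counts at every zoom level."""
    counts = defaultdict(int)  # (level, x, y) -> number of points in that tile
    for lon, lat in points:
        for level in range(MAX_LEVEL + 1):
            counts[(level, *tile_at(lon, lat, level))] += 1
    return counts


if __name__ == "__main__":
    pts = [(-74.0, 40.7), (-73.9, 40.8), (2.35, 48.85)]  # two NYC points, one Paris point
    counts = build_counts(pts)
    print(counts[(0, 0, 0)])                      # zoomed all the way out: 3
    print(counts[(3, *tile_at(-74.0, 40.7, 3))])  # zoomed in: the 2 NYC points
```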
Abstract from Nanocubes for Real-Time Exploration of Spatiotemporal Datasets:
Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop’s main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
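As a rough illustration of the aggregate queries the abstract describes, the sketch below (again a toy, not the paper's data structure) precomputes counts keyed by spatial tile, hour of day, and a hypothetical categorical attribute, then answers heatmap and histogram queries from those counts. The real nanocube shares structure across dimensions so the cube fits in main memory even for billions of entries; this flat dictionary makes no attempt at that.

```python
# Toy "cube" of precomputed counts over (spatial tile, hour, category).
# The schema (hour of day, device category) is an assumption for illustration only.
from collections import defaultdict


def build_cube(records, level=6):
    """records: iterable of (lon, lat, hour, category) tuples."""
    n = 2 ** level
    cube = defaultdict(int)
    for lon, lat, hour, cat in records:
        x = min(int((lon + 180.0) / 360.0 * n), n - 1)
        y = min(int((90.0 - lat) / 180.0 * n), n - 1)
        cube[((x, y), hour, cat)] += 1
    return cube


def heatmap(cube):
    """Counts per spatial tile, aggregated over time and category."""
    out = defaultdict(int)
    for (tile, _hour, _cat), c in cube.items():
        out[tile] += c
    return dict(out)


def histogram(cube, tile):
    """Per-category counts restricted to a single spatial tile."""
    out = defaultdict(int)
    for (t, _hour, cat), c in cube.items():
        if t == tile:
            out[cat] += c
    return dict(out)


if __name__ == "__main__":
    data = [(-74.0, 40.7, 9, "android"),
            (-74.0, 40.7, 9, "iphone"),
            (2.35, 48.85, 17, "iphone")]
    cube = build_cube(data)
    tiles = heatmap(cube)
    print(tiles)                               # counts per tile (a heatmap)
    print(histogram(cube, next(iter(tiles))))  # category histogram for one tile
```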