Three Fast Data Application Patterns
This is a guest post by John Piekos, VP of Engineering at VoltDB. I understand this is a little PRish, but I think the ideas are solid.
The focus of many developers and architects in the past few years has been on Big Data, specifically mining historical intelligence from the Data Lake (usually a Hadoop stack containing terabytes to petabytes of data).
Now, product architects are asking how they can use this business intelligence for competitive advantage. Application developers have come to see the value of acting in real time on streams of fast data; by combining stream processing with OLAP reporting wisdom, they can realize the benefits of both fast data and Big Data. As a result, a new set of application patterns has emerged: applications designed to capture value from fast-moving streaming data before it reaches Hadoop.
At VoltDB we call this new breed of applications “fast data” applications. Their goal is not just to push data into Hadoop as quickly as possible, but to capture real-time value from the data the moment it arrives.
Because traditional databases historically haven’t been fast enough, developers have been forced to go to great lengths to build fast data applications: complex multi-tier systems, often involving a handful of tools and a dozen or more servers. However, a new class of database technology, especially NewSQL offerings, has changed this equation.
If you have a relational database that is fast enough, highly available, and able to scale horizontally, building fast data applications becomes less esoteric and much more manageable. Three application patterns have emerged as the core dataflows for implementing real-time applications. These patterns, enabled by new, fast database technology, are:
Real-time Analytics
Real-time Decision Engine
Fast Data Pipeline
Let’s take a look at the characteristics of each of these fast data application patterns and how a NewSQL database can improve and simplify building applications.
Real-time Analytics
This application pattern processes streaming data from one or many sources and performs real-time analytic computations on that fast data. Today this pattern is often combined with Big Data analytics, producing insight from both fast data and Big Data.
In this pattern, the value captured from the fast data stream is primarily real-time analytics. The streaming engine tracks counts and metrics derived from each message. Applications tap these stored results to display dashboard state and, possibly, raise real-time alerts.
Important features in this application pattern include:
Pre-built connectors are required to easily feed the streams of data into the database.
The database needs to compute real-time analytics by maintaining materialized views that are updated as each message arrives (see the sketch following this list).
Standard reporting tools such as Tableau or MicroStrategy can use SQL to query real-time state and compute ad hoc analytics.
The database needs the ability to perform analytics and aggregations over time windows of data, such as by the second, minute, hour, day, etc.
The database needs the ability to discard, or age out, data after it is processed. VoltDB accomplishes this with “capped tables”: a table constraint limits the number of rows the table may hold, and when that limit is reached, delete statements execute automatically to remove older rows.
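To make the last few points concrete, here is a minimal sketch of this pattern using VoltDB’s Java client API: a materialized view aggregated by minute, a capped table, and an ad hoc SQL query a dashboard might run. All table, view, and column names (clicks, clicks_per_minute, campaign_id) are hypothetical, and the DDL in the comments is a sketch of the capped-table and materialized-view syntax; check the VoltDB documentation for the exact form in your release.

```java
// Hypothetical schema, loaded separately as DDL:
//
//   CREATE TABLE clicks (
//     ts          TIMESTAMP NOT NULL,
//     campaign_id BIGINT    NOT NULL,
//     LIMIT PARTITION ROWS 1000000 EXECUTE (           -- capped table:
//       DELETE FROM clicks ORDER BY ts ASC LIMIT 1000) -- age out oldest rows
//   );
//   PARTITION TABLE clicks ON COLUMN campaign_id;
//
//   -- Materialized view, updated by the database on every insert:
//   CREATE VIEW clicks_per_minute (min_ts, campaign_id, total) AS
//     SELECT TRUNCATE(MINUTE, ts), campaign_id, COUNT(*)
//     FROM clicks
//     GROUP BY TRUNCATE(MINUTE, ts), campaign_id;

import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;

public class DashboardPoller {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // Ad hoc SQL over the view: the last ten minutes of per-campaign counts.
        VoltTable rows = client.callProcedure("@AdHoc",
            "SELECT min_ts, campaign_id, total FROM clicks_per_minute " +
            "WHERE min_ts >= DATEADD(MINUTE, -10, NOW) " +
            "ORDER BY min_ts DESC, total DESC;").getResults()[0];

        while (rows.advanceRow()) {
            System.out.printf("%s  campaign=%d  clicks=%d%n",
                rows.getTimestampAsTimestamp("min_ts"),
                rows.getLong("campaign_id"),
                rows.getLong("total"));
        }
        client.close();
    }
}
```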
Real-time Decision Engine
This application pattern processes inbound requests from many clients, perhaps tens of thousands simultaneously, and returns a low-latency response or decision to each client. This is a classic OLTP application pattern, but running at scale against high-velocity incoming data.
Scaling to support high-velocity, per-event transactions enables applications that evaluate campaign, policy, authorization, and other business logic to respond in real time, in milliseconds. In this pattern, the business value is providing “smart,” or calculated, responses to high-velocity requests. Applications that use this model today include digital ad tech campaign balance processing and ad choice (based on precomputed user segmentation or other mined heuristics), smart electrical grids, and telecom billing and policy decisioning.

In all cases, the database processes incoming requests at exceptionally high rates. Each incoming request runs a transaction that calculates a decision and returns a response to the calling application. For example, an incoming telecom request (a new Call Data Record) may need to decide, “Does this user have enough balance to process this call?” A digital ad platform may ask, “Which of my ads should I serve to this mobile device, based on available campaign balance?”
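Here is a minimal sketch of that telecom decision as a VoltDB Java stored procedure. The schema and names (accounts, CheckBalance) are hypothetical; the point is that the read, the decision, and the balance update all happen inside one ACID transaction.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical decision-engine procedure, registered in DDL with e.g.:
//   CREATE PROCEDURE FROM CLASS CheckBalance;
//   PARTITION PROCEDURE CheckBalance ON TABLE accounts COLUMN account_id;
public class CheckBalance extends VoltProcedure {

    public final SQLStmt getBalance = new SQLStmt(
        "SELECT balance FROM accounts WHERE account_id = ?;");

    public final SQLStmt deduct = new SQLStmt(
        "UPDATE accounts SET balance = balance - ? WHERE account_id = ?;");

    // Returns 1 to approve the request, 0 to reject it.
    public long run(long accountId, long cost) {
        voltQueueSQL(getBalance, accountId);
        VoltTable balance = voltExecuteSQL()[0];

        // Unknown account or insufficient funds: reject.
        if (balance.getRowCount() == 0 || balance.asScalarLong() < cost) {
            return 0;
        }

        // Sufficient balance: deduct and approve, atomically.
        voltQueueSQL(deduct, cost, accountId);
        voltExecuteSQL(true);
        return 1;
    }
}
```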
Important features in this application pattern include:
ACID transactions. In this pattern, the decision engine (the database) updates state, often a balance of some kind (usually monetary), in a consistent and durable manner. Consistency is critical: responses (decisions) are based on data that must be correct.
Low-latency responses. These applications need a response in real time; often the budget for database processing is in the single-digit millisecond range (a client-side sketch follows this list).
Durable data. In this pattern the database is often a system of record, at least for a time window. Should the system go down, the data must be durable and recoverable.
Ability to use standard tooling, i.e. SQL, to query and compute real-time state. Real-time dashboards capturing the state of the system (balances, transactions per second), as well as the ability to query state ad hoc, are important to this application pattern.
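On the client side, request rates this high are usually driven through VoltDB’s asynchronous client API, so tens of thousands of calls can be in flight without each one waiting on a round trip. A minimal sketch, reusing the hypothetical CheckBalance procedure from above:

```java
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.ProcedureCallback;

public class DecisionClient {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // Fire requests asynchronously; the callback receives each decision.
        for (long accountId = 0; accountId < 10_000; accountId++) {
            client.callProcedure(new ProcedureCallback() {
                @Override
                public void clientCallback(ClientResponse response) {
                    if (response.getStatus() != ClientResponse.SUCCESS) {
                        System.err.println(response.getStatusString());
                        return;
                    }
                    long approved = response.getResults()[0].asScalarLong();
                    // Act on the decision: serve the ad, connect the call, etc.
                }
            }, "CheckBalance", accountId, 100L);
        }

        client.drain();  // wait for all outstanding responses
        client.close();
    }
}
```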
Fast Data Pipeline
This application pattern processes streaming data from one or many sources and performs real-time ETL (Extract, Transform, Load) on the data, delivering the result to a historical archive. In a streaming data pipeline, incoming data may be sessionized, enriched, validated, de-duped, aggregated, counted, discarded, cleansed, etc. by the database before being delivered to the Data Lake.
Important features in this application pattern include:
Pre-built connectors to feed the streams of data into the pipeline. VoltDB includes connectors to import Kafka streams and relational data, and supports loading results from Hive and Pig via HadoopOutputFormat.
The ability to process data as part of the ETL workflow: to aggregate, de-dupe, or transform it. VoltDB transactions, implemented as Java stored procedures, let developers transactionally execute SQL combined with Java business logic to process each message individually or in aggregate (see the sketch after this list).
Pre-built export connectors to stream data downstream to a historical archive as fast as it arrives. VoltDB includes export connectors to stream data to Kafka, RabbitMQ, Hadoop, Vertica, Netezza, or any relational data store via JDBC.
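A minimal sketch of the processing step, again as a hypothetical Java stored procedure: each incoming event is de-duped against already-processed IDs, then written both to a queryable table and to an export table whose rows stream to whichever downstream connector is configured. The table names and the EXPORT TABLE line in the comments are assumptions; the exact export DDL varies by VoltDB version.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

// Hypothetical pipeline procedure. Assumed DDL:
//   CREATE TABLE events (
//     event_id  VARCHAR(36) NOT NULL PRIMARY KEY,
//     device_id VARCHAR(64) NOT NULL,
//     ts        BIGINT      NOT NULL
//   );
//   PARTITION TABLE events ON COLUMN event_id;
//   CREATE TABLE events_out (...);  -- same columns as events
//   EXPORT TABLE events_out;        -- inserted rows stream to the configured
//                                   -- connector (Kafka, Hadoop, JDBC, ...)
public class IngestEvent extends VoltProcedure {

    public final SQLStmt seen = new SQLStmt(
        "SELECT COUNT(*) FROM events WHERE event_id = ?;");

    public final SQLStmt keep = new SQLStmt(
        "INSERT INTO events (event_id, device_id, ts) VALUES (?, ?, ?);");

    public final SQLStmt forward = new SQLStmt(
        "INSERT INTO events_out (event_id, device_id, ts) VALUES (?, ?, ?);");

    // Returns 1 if the event was kept and forwarded, 0 if it was a duplicate.
    public long run(String eventId, String deviceId, long ts) {
        // De-dupe: discard events already processed.
        voltQueueSQL(seen, eventId);
        if (voltExecuteSQL()[0].asScalarLong() > 0) {
            return 0;
        }

        // Keep a queryable copy and forward a copy downstream, atomically.
        voltQueueSQL(keep, eventId, deviceId, ts);
        voltQueueSQL(forward, eventId, deviceId, ts);
        voltExecuteSQL(true);
        return 1;
    }
}
```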
Applications using this pattern often process continuous streams of data that must be validated, transformed, and archived in some manner. One example is processing device IDs (usually in the form of cookies): the pipeline computes segmentation intelligence, producing correlation data to be used by advanced decisioning applications, often in the digital ad tech arena.
Common Requirements for Fast Data Applications
The fast data processing layer must have the following properties across all use cases:
High ingestion (write) rate. The pipeline must ingest data at historically challenging transaction rates; hundreds of thousands to millions of transactions per second are not uncommon.
Real-time analytics. Analyzing incoming data in real time enables users to derive seasonal patterns, statistical summaries, scoring models, recommendation models, rankings, leaderboards, and other artifacts for use in user-facing applications.
Real-time decisions. Enabling real-time transactions against new incoming data makes it possible to respond to users, or to drive downstream processes, based on the output of the analysis and the context and content of the present event.
In addition, the three patterns require a system architected to deliver:
High availability. The pipeline processing engine, in this case VoltDB, can survive machine loss or (most) networking failures, whether from routine maintenance, environmental issues, or errors, and continue to operate correctly.
Elastic, horizontal scalability. As throughput increases, more machines can be added to the pipeline processing engine to accommodate the additional traffic without interrupting the running system.
Related Articles
- Lambda Architecture
- In Memory Data Grid Technologies
- The Architecture Twitter Uses To Deal With 150M Active Users, 300K QPS, A 22 MB/S Firehose, And Send Tweets In Under 5 Seconds
- Are Cloud Based Memory Architectures The Next Big Thing?
- The Performance Of Distributed Data-Structures Running On A "Cache-Coherent" In-Memory Data Grid
- 35+ Use Cases For Choosing Your Next NoSQL Database
- For a fast data pipeline code example demonstrating click stream processing based on user segmentation, using VoltDB as the ingestion and processing engine, see https://github.com/VoltDB/app-fastdata.
- For a real-time decisioning (high velocity transactions) code example demonstrating an ‘American Idol’-like voting system, with per-vote validation, see https://github.com/VoltDB/voltdb/tree/master/examples/voter.