Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions

Constructing a scalable risk analysis solution is a fascinating architectural challenge. If you come from Financial Services, you are sure to appreciate that. But even architects from other domains should find the challenges interesting, and the architectural patterns of my suggested solution useful in their own work.

Recently I held a webinar on architecting scalable, near-real-time risk analysis solutions, based on experience gathered with Financial Services customers. Given the great interest in the webinar, I would like to share the highlights with you here.

From an architectural point of view, risk analysis is both a data-intensive and a compute-intensive process, with elaborate orchestration logic on top. Data volumes in this domain are massive and ever-increasing, and so is the demand to reduce response time. These trends are intensified by the global financial regulatory reforms set in motion following the late-2000s financial crisis, which mandate reducing exposure to risk by shortening settlement cycles.

Traditional architectures were based on overnight batch processing using compute grids and relational databases. But can the traditional architecture meet the new demand for near-real-time processing? Experience with many customers shows that it fails to meet the challenge: data becomes stale quickly; relational databases become bottlenecks due to disk and network constraints; predefined queries are too rigid; remote execution of pre-processing and post-processing on the data (such as formatting data inputs or aggregating calculation results) is too slow; and implementing a home-grown orchestration layer is too cumbersome and inefficient.

Constructing a massively-scalable near-real-time risk analysis solution requires a new architectural approach. The correct way to look at the problem is as real-time analytics on big data. It's important to realize that intraday data changes very frequently but is of limited volume, whereas historical data changes much less frequently but is of much higher volume. A good architecture should accommodate these inherent differences, employing a multi-tiered design with an in-memory data grid for intraday data and a NoSQL database for historical data, plus a processing layer that unifies the two datastores so they appear as one for querying purposes.
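
To make the unified-view idea concrete, here is a minimal Java sketch. The IntradayGrid, HistoricalStore, and Position types are hypothetical stand-ins for the in-memory data grid, the NoSQL back end, and the domain data; the point is only that callers issue one query while the facade merges hot intraday data with colder historical data behind the scenes.

```java
// Minimal sketch of "one logical view over two stores"; all type names are hypothetical.
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UnifiedPositionQuery {

    /** Intraday data: small, hot, fully held in memory. */
    static class IntradayGrid {
        private final Map<String, List<Position>> byBook = new ConcurrentHashMap<>();
        List<Position> query(String book) {
            return byBook.getOrDefault(book, List.of());
        }
        void write(String book, Position p) {
            byBook.computeIfAbsent(book, k -> new ArrayList<>()).add(p);
        }
    }

    /** Historical data: large, cold, served from a NoSQL store (stubbed here). */
    interface HistoricalStore {
        List<Position> query(String book, LocalDate from, LocalDate to);
    }

    record Position(String book, String instrument, double notional, LocalDate asOf) {}

    private final IntradayGrid intraday;
    private final HistoricalStore historical;

    UnifiedPositionQuery(IntradayGrid intraday, HistoricalStore historical) {
        this.intraday = intraday;
        this.historical = historical;
    }

    /** Callers see a single query; the split between the two tiers stays hidden. */
    List<Position> positions(String book, LocalDate from) {
        List<Position> result =
                new ArrayList<>(historical.query(book, from, LocalDate.now().minusDays(1)));
        result.addAll(intraday.query(book));   // merge today's hot data on top
        return result;
    }
}
```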

Another challenge in risk calculations is streaming the results back to clients as they arrive, and dispatching ticks and other events, which arrive at a high rate, back to the UI. I find Event-Driven Architecture (EDA) highly suitable for handling these use cases. Supporting asynchronous data fetch and being able to treat a data mutation as an event that can be dispatched are some of the characteristics I look for when implementing such architectures.
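
As an illustration of that event-driven flavor, here is a minimal sketch, assuming a hypothetical TickEvent type and a plain in-process listener registry rather than a real messaging fabric. The shape is the same either way: a data mutation is published as an event and pushed to subscribers as it happens, instead of being polled for.

```java
// Minimal event-driven sketch; TickEvent and the listener registry are illustrative only.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class TickDispatcher {

    record TickEvent(String instrument, double price, long timestamp) {}

    private final List<Consumer<TickEvent>> subscribers = new CopyOnWriteArrayList<>();

    /** A UI session (or downstream calculator) registers interest once. */
    public void subscribe(Consumer<TickEvent> subscriber) {
        subscribers.add(subscriber);
    }

    /** Writing a tick doubles as publishing an event to every subscriber. */
    public void onTickWritten(TickEvent tick) {
        for (Consumer<TickEvent> s : subscribers) {
            s.accept(tick);   // push the update rather than have clients poll
        }
    }

    public static void main(String[] args) {
        TickDispatcher dispatcher = new TickDispatcher();
        dispatcher.subscribe(t -> System.out.println("UI update: " + t));
        dispatcher.onTickWritten(new TickEvent("EURUSD", 1.0842, System.currentTimeMillis()));
    }
}
```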

The risk calculations are usually accompanied by ETL pre-processing of the data to align its format with industry standards, and post-processing of calculation results for aggregation. This pre-processing and post-processing logic should execute very efficiently given the high rate of incoming data. Remote invocation of this logic on the data is too cumbersome. The ability to execute this logic co-located with the data, preferably on the very same VM, is what I look for in such architectures.
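
The following sketch illustrates the "ship the code to the data" style of aggregation under stated assumptions: the Partition and RiskResult types are hypothetical, and in a real grid the per-partition step would run inside the partition's own VM. Only the small partial aggregates cross the network; the client merely merges them.

```java
// Minimal sketch of co-located aggregation; Partition and RiskResult are illustrative types.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ColocatedAggregation {

    record RiskResult(String book, double exposure) {}

    /** One data partition; in a grid this step would execute next to the data it owns. */
    interface Partition {
        List<RiskResult> localResults();
    }

    /** Runs next to the data: aggregates locally, returns only per-book sums. */
    static Map<String, Double> aggregateLocally(Partition partition) {
        return partition.localResults().stream()
                .collect(Collectors.groupingBy(RiskResult::book,
                        Collectors.summingDouble(RiskResult::exposure)));
    }

    /** Runs on the client: merges the small partial aggregates from all partitions. */
    static Map<String, Double> merge(List<Map<String, Double>> partials) {
        return partials.stream()
                .flatMap(m -> m.entrySet().stream())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Double::sum));
    }
}
```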

For such challenges I find Elastic Application Platforms to be a suitable tool, providing both the in-memory data grid and the ability to execute business logic and messaging co-located with the data, scaling out by sharding the data and achieving high availability through redundant synchronous replicas. For my implementation I used GigaSpaces XAP, which in addition to all of that provided easy integration with the back-end NoSQL database holding the historical data, making it easy to host the end-to-end big-data system as one cohesive solution.
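
To show the sharding-plus-replication shape (this is not the GigaSpaces XAP API, just the idea), here is a minimal sketch with a hypothetical Partition abstraction: a routing key such as a book id pins each entry, and any logic co-located with it, to one partition, and a write is mirrored synchronously to that partition's backup before it is considered done.

```java
// Minimal sketch of routing by a sharding key with a synchronous backup; illustrative only.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardedRiskStore {

    static class Partition {
        final Map<String, Object> primary = new HashMap<>();
        final Map<String, Object> backup  = new HashMap<>();   // stands in for the replica
        synchronized void write(String key, Object value) {
            primary.put(key, value);
            backup.put(key, value);   // synchronous replication: done only after both copies
        }
    }

    private final List<Partition> partitions = new ArrayList<>();

    ShardedRiskStore(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) partitions.add(new Partition());
    }

    /** The routing key (e.g. a book id) decides which partition owns the entry. */
    private Partition route(String routingKey) {
        return partitions.get(Math.floorMod(routingKey.hashCode(), partitions.size()));
    }

    void writePosition(String bookId, String positionId, Object position) {
        route(bookId).write(positionId, position);
    }
}
```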

This is of course just the basic architecture. There are many further challenges: how to integrate the compute grid with the data grid, how to co-locate the orchestration logic with the data and scale it in a highly-available manner, how to extend the architecture to multi-site deployment, how to onboard such a system to the cloud while preferably keeping the solution vendor-agnostic, and so forth.

You can read a more in-depth discussion on such architectures here.