Cassandra NoSQL Data Model Design 
Monday, November 13, 2017 at 8:56AM
Shannen Wehrman

We at Instaclustr recently published a blog post on the most common data modelling mistakes that we see with Cassandra. This post was very popular and led me to think about what advice we could provide on how to approach designing your Cassandra data model so as to come up with a quality design that avoids the traps.

There are a number of good articles around that with rules and patterns to fit your data model into: 6 Step Guide to Apache Cassandra Data Modelling and Data Modelling Recommended Practices.

However, we haven’t found a step by step guide to analysing your data to determine how to fit in these rules and patterns. This white paper is a quick attempt at filling that gap.

Phase 1: Understand the data

This phase has two distinct steps that are both designed to gain a good understanding of the data that you are modelling and the access patterns required.

Define the data domain

The first step is to get a good understanding of your data domain. As someone very familiar with relation data modelling, I tend to sketch (or at least think) ER diagrams to understand the entities, their keys and relationships. However, if you’re familiar with another notation then it would likely work just as well. The key things you need to understand at a logical level are:

• What are the entities (or objects) in your data model?
• What are the primary key attributes of the entities?
• What are the relationships between the entities (i.e. references from one to the other)?
• What is the relative cardinality of the relationships (i.e. if you have a one to many is it one to 10 or one to 10,000 on average)?

Basically, these are the same things you’d expect in from logical ER model (although we probably don’t need a complete picture of all the attributes) along with a complete understanding of the cardinality of relationships that you’d normally need for a relational model. An understanding of the demographics of key attributes (cardinality, distribution) will also be useful in finalising your Cassandra model. Also, understand which key attributes are fixed and which change over the life of a record.

Define the required access patterns

The next step, or quite likely a step carried out in conjunction with step 1, is to understand how you will need to access your data:

Phase 2: Understand the entities

This phase has two specific steps designed to gain an understanding of both the primary and secondary entities associated with the data.

Identify primary access entities

Now we’re moving from analysing your data domain and application requirements to starting to design your data model. You really want to be pretty solid on steps 1 and 2 before moving on to this stage.

The idea here is to denormalize your data into the smallest number of tables possible based on your access patterns. For each lookup by key that your access patterns require, you will need a table to satisfy that lookup. I’ve coined the term primary access entity to describe the entity your using for the lookup (for example, a lookup by client id is using client as the primary access entity, a lookup by server and metric name is using a server-metric entity as the primary access entity).

The primary access entity defines the partition level (or grain if you’re familiar with dimensional modelling) of the resulting denormalized table (i.e. there will be one partition in the table for each instance of the primary access entity).

You may choose to satisfy some access patterns using secondary indexes rather than complete replicas of the data with a different primary access entity. Keep in mind that columns in include in a secondary index should have a significantly lower cardinality than the table being indexed and be aware of the frequency of updates of the indexed value.

For the example access patterns above, we would define the following primary access entities:

Allocate secondary entities

The next step is to find a place to store the data that belongs to entities that have not been chosen as primary access entities (I’ll call these entities secondary entities). You can choose to:

For some secondary entities, there will only be one related primary access entity and so there is no need to choose where and which direction to push. For other entities, you will need to choose will need to choose which primary access entities to push the data into.

For optimal read performance, you should push a copy of the data to every primary access entity that is used as an access path for the data in the secondary entity.

However, this comes at an insert/update performance and application complexity cost of maintaining multiple copies the data. This trade-off between read performance and data maintenance cost needs to be judged in the context of the specific performance requirements of your application.

The other decision to be made at this stage is between using a cluster key or a multi-value type for pushing up. In general:

Note that these rules are probably oversimplified but serve as a starting point for more detailed consideration.

Phase 3: Review & Tune

The last phase provides an opportunity to review the data model, test and to tune as necessary.

Review partition & cluster keys

Entering this stage, you have all the data you need to store allocated to a table or tables and your tables support accessing that data according to your required access patterns. The next step is to check that the resulting data model makes efficient use of Cassandra and, if not, to adjust. The items to check and adjust at this stage are:

Test and tune

The final step is perhaps the most important – test your data model and tune it as required. Keep in mind that issues like partitions or rows growing too large or tombstones building up in a table may only become visible after days (or longer) of use under real-world load. It’s therefore important to test as closely as possible to real-world load and to monitor closely for any warning signs (the nodetool cfstats and cfhistograms commands are very useful for this).

At this stage you may also consider tuning some of the settings that effect the physical storage of your data. For example:

A Worked Example

To illustrate this, I’ll walk through a basic example based on building a database to store and retrieve log messages from multiple servers. Note this is quite simplified compared to most real-world requirements.

Step 1: Define the data domain

Defining the data model domain Instaclustr Data Model design

The previous ER diagram illustrated the data domain. We have:

Step 2: Define the required access patterns

We need to be able to:

Step 3: Identify primary access entities

There are two primary access entities here – source and source type. The cardinality (~20) of source type makes it a good candidate for a secondary index so we will use source as the primary access entity and add a secondary index for source type.

Step 4: Allocate secondary entities

In this example, this step is relatively simple as all data needs to roll into the log source primary access entity. So we:

The end result is that would be a single table with a partition key of source ID and a clustering key of (message time, message type).

Step 5: Review partition and cluster keys

Checking these partition and cluster keys against the checklist:

So, we need to address the unbound partition size. A typical pattern to address that in time series data such as this is to introduce a grouping of time periods into the cluster key. In this case 10,000 messages per day is a reasonable number to include in one partition so we’ll use day as part of our partition key.

The resulting Cassandra table will look some like:

 

Conclusion

Hopefully, this process and basic example will help you start to get familiar with Cassandra data modelling. We’ve only covered a basic implementation that fits well with Cassandra, however there are many other examples on the web which can help you work through more complex requirements. Instaclustr also provides our customers with data modelling review and assistance, so get in touch with us if you need some hands-on assistance.

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.