In our previous post, we looked at how data is partitioned in a Cassandra cluster using a partition key. In this blog we will look at how we can define a partition key, and what is the significance of clustering keys

1.0 Introduction to Primary Key, Partition Key and Clustering keys

Primary keys in Cassandra are exactly same as what they are in other databases – it identifies a unique row in the table.

BUT in Cassandra a primary key has addition role

  • Specify the partition keys
  • Specify the clustering keys

PRIMARY KEY
= PARTITION KEY + CLUSTERING KEYS
= Unique in the table

1.1 Partition Keys

Partition Keys are the first column specified in the Primary Key

  • They are used to partition the data in the Cassandra cluster
  • They can be either single column or composed of multiple columns.
  • Restrictions in Query
    • All columns specified as partition keys are required in the select queries
    • We cannot use any other operator other than = for the columns specified as partition key

1.2 Clustering Keys

Clustering Keys are the columsn in the primary key that comes after the `Partition keys’

  • They are used to sort the data within a parition
  • Any EQ (=) or range operators like >, <=, etc can be used on the columns specified as Clustering keys
  • All conditions before a non-EQ (< , > ) condition must be an EQ condition.

2.0 Some examples of Primary Keys, and restrictions in Select queries.

Suppose we have a movies table with following columns

columndatatype
iduuid
yearint
titletext
ratingdecimal
directortext

Let’s see some impact of creating different type of PRIMARY keys for this table


Example 1: Only single partition key

PRIMARY KEY ( id )
  • id is the partition key
  • No clustering columns specified in the Primary Key.
  • Results in 1 row partition in the cluster, as id is unique

Now let’s see some possible queries on the table defined with above primary key


Example 2: Partition key and clustering keys

PRIMARY KEY( director, year, rating, id)

OR

PRIMARY KEY( (director), year, rating, id)
  • The first key by default becomes the Partition key, and remaining keys are the Clustering keys.
  • director is the Partition Key
  • year, rating, id – are the Clustering keys.
  • Data within a partition is sorted by year, rating and id (in order)
  • id column as been added as a clustering key, to add uniqueness to the PRIMARY KEY – as there can be more than 1 movie with same director, year and same rating.

Now let’s see some possible queries on the table defined with above primary key

  • Only EQ operator (=) can be specified for partition key.
  • Both EQ (=) or non-EQ (< , >) operators can be applied on the Clustering columns.
  • Only EQ conditions must come before any non-EQ conditions.

Example 3: Composite Partition Key

PRIMARY KEY ( (year, director), rating, id)
  • Composite partition key of year and director
  • Clustering keys are rating and id

In above example, each year and director combination is one partition (in this example almost all partition have only 1 row atmost)

Now let’s see some possible queries on the table defined with above primary key


3.0 Change sort order using With Clustering Order

We have seen that all data within a partition is sorted in ASCENDING order, as specified by the clustering columns.

What if we wanted the data to be sorted in specific order instead of ASCENDING order always?

Cassandra provides us a way to overrride the sorting order by specifying With Clustering Order, when creating the table.

Lets see some examples for this primary key given below

PRIMARY KEY( (director), year, rating, id)

Example 1: Sort by Year Descending and Rating Ascending

We can do so using following create table statement

CREATE TABLE movies (
    id uuid, 
    year int,
    title text,
    rating decimal,
    director text,
    PRIMARY KEY( (director), year, rating, id)
) WITH CLUSTERING ORDER BY ( year desc);

Since we want to change sort order for first clustering key (year), we are not required to specify the following clustering keys if their sort order is not changing.

All three options below are same.

WITH CLUSTERING ORDER BY ( year desc)

OR

WITH CLUSTERING ORDER BY ( year desc, rating asc)

OR

WITH CLUSTERING ORDER BY ( year desc, rating asc, id asc)

Example 2: Sort by Year Ascending, but Rating Descending

We can do so using following create table statement

CREATE TABLE movies (
    id uuid, 
    year int,
    title text,
    rating decimal,
    director text,
    PRIMARY KEY( (director), year, rating, id)
) WITH CLUSTERING ORDER BY ( year asc, rating desc);

Since we want to change sort order for second clustering key (rating), we must specify sort order for all preceding keys (e.g. year)

Below clustering order WILL NOT work

WITH CLUSTERING ORDER BY ( rating desc)


4.0 Override query restrictions using Allow Filtering

We have seen many examples where queries will not work because

  • They did not specify the partition key
  • OR Partition Key used a non-EQ condition
  • OR Preceding clustering keys have used a non-EQ condition

or similar restrictions on queries because of the way we have specified the PRIMARY KEY

Cassandra provides us a way to override these restrictions, and query data using ALLOW FILTERING

Lets take the example of following PRIMARY KEY

PRIMARY KEY( (director), year, rating, id)

ALL Following queries which were not working earlier will now work if we specify ALLOW FILTERING in the query.

select * from movies where year = 2014 ALLOW FILTERING;

select * from movies where director = 'Russo' and rating > 8.0 ALLOW FILTERING;

This feature should be used with lot of caution, as it causes Cassandra to scan all partitions.

Ideally this should not be used or if possible then only use on small data sets.