Horizontal Partitioning in Azure Cosmos DB
In Azure Cosmos DB, partitioning is what allows you to massively scale your database, not just in terms of storage but also throughput. You simply create a container in your database, and let Cosmos DB partition the data you store in that container, to manage its growth.
This means that you just work with the one container as a single logical resource where you store data. And you can just let the container grow and grow, without worrying about scale, because Cosmos DB creates as many partitions as needed behind the scenes to accommodate your data.
These partitions themselves are the physical storage for the data in your container. Think of partitions as individual buckets of data that, collectively, is the container. But you don’t need to think about it too much, because the whole idea is that it all just works automatically and transparently. So when one partition gets too full, Cosmos DB automatically creates a new partition, and frees up space in the growing partition by splitting its data, and moving some of it over into the new partition.
This technique scales the storage of your container, but the partitions themselves are also replicated internally, and so this also ensures availability of your container. Furthermore, Cosmos DB will split partitions even well before they grow too full, which also scales throughput because there are now more partitions available to satisfy a larger workload.
Life for a container begins with a single partition, and when you start storing data to the container, it gets written to that partition. Then you write some more data, which continues to fill the partition, until at some point, a new partition gets created to store the additional data. And this continues ad infinitum, resulting in horizontal scale-out that gives you a virtually unlimited container for your database:
But the question then becomes, how does Cosmos DB know which data is stored in which partitions? I mean, if your container grows to hundreds or thousands of partitions, how does Cosmos DB know where to look for data when you run a query? Having a piece of data stored arbitrarily in any given partition within the collection means that Cosmos DB needs to check each individual partition when you query. This doesn’t seem very efficient, which is why Cosmos DB can be much smarter about this, and that’s where you come in. Because your job in all of this – and really your only job – is to define a partition key for the container.
Selecting a Partition Key
The single most important thing you need to do when you create a container is to define its partition key. This could be any property in your data, but choosing the best property will scale your data for storage and throughput in the most efficient way. And conversely, some properties will be a poor choice, and will thus impede your ability to scale. So even though you don’t need to handle partitioning yourself, understanding how Cosmos DB internally partitions your data, combined with an understanding of how your data will be most typically accessed, will help you make the right choice.
Partition key values are hashed
For each item you write to the container, Cosmos DB calculates a hash on the property value that you’ve designated as the partition key, and it uses that hash to determine which physical partition the item should be stored in. This means that all the items in your container with the same value for the property you’ve chosen for the partition key will always be stored physically together in the same partition, guaranteed. They will never, ever, be spread across multiple partitions.
Physical partitions host multiple logical partitions
However, this certainly does not mean that Cosmos DB creates one partition for each unique partition key value. Your container can grow elastically, but the partitions themselves have a fixed storage capacity. So one physical partition per partition key (i.e., logical partition) is far from ideal, because you’d potentially wind up having Cosmos DB create far more partitions than are actually necessary, each with a large amount of free space that will never be used.
But as I said, Cosmos DB is much smarter than that, because it also uses a range algorithm to co-locate multiple partition keys within a single partition. This is totally transparent to you however – your data gets partitioned by the partition key that you choose, but then which physical partition is used to store an item, and what other partition keys (logical partitions) happen to be co-located on that same partition – well that’s all insignificant and irrelevant as far as we’re concerned.
Two primary considerations
So now you understand that all the items with the same partition key will always live together in the same partition, which makes it easier to figure out what property in your data might or might not be a good candidate to use for the partition key.
First, queries will be most efficient if they can be serviced by looking at one partition, as opposed to spreading the query to work across multiple partitions. That’s why, typically, you always want to supply the partition key value with any query you want to run, so that Cosmos DB can go directly to the partition where it knows all the data that could possibly be returned by your query lives. You can certainly override this, and tell Cosmos DB to query across all the partitions in your collection, but you would definitely want to pick a partition key where such a query would be the exception, rather than the general rule, which would be that most typical queries in your application (say 80% or more) could be scoped to a single partition key value.
The same is true of updates, because – as I’ll blog about soon – you can write server-side stored procedures that update multiple documents in a single transaction, so that all the updates succeed or fail together as a whole, and this is possible only across multiple documents that have the same partition key value, which is another thing to think about when defining the partition key for a container.
You also want to choose a value that won’t introduce bottlenecks for storage or throughput. Something that can be somewhat uniformly distributed across partitions, so that storage and throughput can be somewhat evenly spread within the container.
Let’s dig a little deeper with a few different scenarios. Here I’ve got a container that uses city as the partition key:
Like I’ve been explaining, all the items for any city are stored together within the same partition, and each partition is co-locating multiple partition keys. It’s not a perfect distribution – and it never will be – but at least at this point, it seems we can somewhat evenly distribute our data within the container using city as the partition key. Furthermore, any query within a given city can be serviced by a single partition, and multiple items in the same city can be updated in a single transaction using a stored procedure.
Now, what happens when we need to grow this data? Say we start getting more data for Chicago, but can already see the partition hosting Chicago is starting to run low on headroom. So when you start adding more Chicago data, Cosmos DB will at some point – automatically and transparently – split the partition. That is, it will create a new partition, and move roughly half the data from the Chicago partition into the new partition. In this case, Berlin got moved out as well as Chicago, and now there is sufficient room for Chicago to continue growing:
But is city still the best choice to use here? If you have a few cities that are significantly larger than others, or that are queried significantly more often than others, then perhaps a finer granularity might be better, like zip code, for example. That will yield many more distinct values than city, and ideally, you want a partition key with many distinct values – like hundreds or thousands, at a minimum.
Choosing the Right Partition Key
Ultimately, only you will be able to make the best choice, because the best choice will be driven by the typical data access patterns in your particular application. I have already explained that the choice you make determines if a query can be serviced by a single partition – which is preferred, and which should be the most common case – or if multiple partitions need to be looked at. And it also determines which items can be updated together in a single transaction, using a stored procedure – and which cannot.
So with that in mind, let’s see what would make a good partition key for the different kinds of applications you might be building.
User Profile Data
Typical social networking applications store user profile data, and very often the user ID is a good partition key choice for this type of application. It means that you can query across all the posts of a given user, and Cosmos DB can service the query by looking at just the one partition where all of that user’s posts are stored. And, for example, you could write a stored procedure that updates the location property of multiple posts belonging to any user, and those updates would occur inside of a transaction – they would all commit, or they would all rollback, together.
IoT applications deal with the so-called Internet of Things… where you’re storing telemetry data from all sorts of devices – it could be refrigerators, or automobiles – whatever. The device ID is typically used for IoT apps, so for example, if you’re storing information about different cars, you would go with the VIN number, the vehicle identification number that uniquely identifies each distinct automobile. Queries against any particular car, or device, can then be serviced by a single partition, and multiple items for the same car or device can then be updated together in a transaction by writing a stored procedure.
In a multi-tenant architecture, you provide storage for different customers, or tenants. Each tenant needs to be isolated, so the tenant ID seems like a natural candidate to choose for the partition key. Because then each tenant can run their own queries against a single partition, and transactionally update their own data in that partition using a stored procedure, completely isolated from other tenants.
You also need to think about how your application writes to the container, because you want to make sure that they get spread somewhat uniformly across all the partition within the container. Say you’re building a social networking app. I mentioned that user ID might be a good choice, but what about creation date? Well, creation date is a bad choice – for any type of application, really – because of the way the application is going to write data.
So from a storage perspective, we do get nice uniformity by using the creation date for the partition key, and, as time marches on, Cosmos DB will automatically add new partitions for new data, as you’ve already seen:
But the problem is that – when the application writes new data, the writes will always be directed to the same partition, based on whatever day it is. This results in what’s called a hot partition, where we have a bottleneck that’s going to quickly consume a great deal more of the reserved throughput you’ve provisioned for the container. Specifically, Cosmos DB evenly distributes your provisioned throughput across all the physical partitions in the container. So in this case where there are four partitions, if you are provisioning (and thus paying for) 4,000 RU/sec (request units per second), you’ll only get the performance of about 1,000 RU/sec:
That’s why user ID is a much better choice. Because then, writes get directed to different partitions, depending on the user. So throughout the day, as user profile data gets written to the container, those writes are being much more evenly distributed, with user ID as the partition key.
And then you need to consider that, in some scenarios, you can’t settle on one good partition key for a single container. In these cases, you’ll create multiple containers in the database, each with different partition keys, and then store different data in each container, based on usage.
The multi-tenant scenario makes a perfect example:
We’re partitioning by tenant ID, but clearly tenant number 10 is much bigger than all the other tenants. That could be because all the other tenants are, let’s say, small mom-and-pop shops, while tenant number 10 is some huge Fortune 500 company like Toyota. Compared to tenant number 10, all the other tenants have modest storage and throughput requirements.
This too creates a hot partition. First, look at storage. Partitions have a fixed size limit, which happens to be 10 gigabytes today, though that will likely change to 100 gigabytes in the future. But that limit is almost irrelevant, because independent of that, there is very uneven distribution of throughput. Tenant number 10 is much busier than the other tenants, so most writes are going to that one partition. And this is to the detriment of all the other tenants, because the reserved throughput is provisioned for the entire container. Furthermore, given the significantly larger volume of data for tenant number 10, it’s likely that large tenant will benefit from partitioning on some fine-grained property within their own data, and not just on the entire tenant ID, which is the partition key for the container.
And that’s a perfect example of when you’ll create a second container for your database. In this case, we can dedicate the entire second container to tenant number 10, and – let’s say it’s Toyota – that container can be partitioned by VIN number:
As a result, we have good uniformity across the entire database. One container handles small-sized tenants. This container is partitioned by tenant ID, and might be provisioned for relatively low throughput that gets shared by all the tenants in the container. And the other individual container handles the large tenant, Toyota. This container is partitioned by VIN number, and might even reserve more throughput for all of Toyota than the first container reserves for all smaller clients combined.