Introduction to Notion's Growth
Notion is one of the fastest-growing software platforms ever created. But how does it manage to store billions of database entries so efficiently?
Let's dive into how Notion works under the hood.
How Notion Works Under the Hood
Every time you open a Notion page, each block on it is backed by its own row in the database. For example, a checkbox row stores a unique ID, the type of block it is, the child blocks it references, and the parent block it belongs to. Fun fact: if you use the arrow keys in Notion, you navigate block by block. A decent-sized page can have 200 to 400 blocks.
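To make that concrete, here is a minimal sketch of what one of those block rows might look like; the field names are illustrative, not Notion's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """Illustrative block row; not Notion's real schema."""
    id: str                                            # unique ID for this block
    type: str                                          # "text", "to_do", "heading", ...
    properties: dict                                   # e.g. the checkbox label and checked state
    content: list[str] = field(default_factory=list)   # IDs of child blocks, in order
    parent_id: Optional[str] = None                    # block or page this block belongs to
    workspace_id: str = ""                             # workspace the block lives in

todo = Block(
    id="b7f1c2d4",
    type="to_do",
    properties={"title": "Buy milk", "checked": False},
    parent_id="page_123",
    workspace_id="ws_42",
)
print(todo)
```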
In 2020, Notion had a few million users. Four years later, that number has skyrocketed to 100,000,000 users, each creating tens of thousands of blocks. Since Notion is online-only, any downtime means users lose access to their notes. And as late as 2021, Notion was handling all of this with a single Postgres database.
Single Database Limitations and Scaling Challenges
Once Notion hit about 20,000,000,000 blocks, things started to slow down. They needed a solution fast. After careful consideration, they decided to implement sharding.
Sharding Approach Explanation
Sharding is a common way to scale a database. Instead of continuously upgrading one big host machine, sharding splits the data across many smaller machines, which reduces the traffic and heavy queries each machine has to handle. The trade-off is that it requires custom application code to determine which machine holds a given user's data.
Notion's engineers decided to shard the block table and everything related to it, such as workspaces, discussions, and comments. They partitioned the data by workspace ID, since each block belongs to exactly one workspace. Their one giant database became 32 separate database instances, each hosting 15 logical shards, for a total of 480 logical shards.
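Routing then becomes a small piece of application code. The sketch below shows one plausible way to map a workspace ID onto those 480 logical shards and 32 instances; the exact hash function and shard-to-instance layout are assumptions for illustration, not Notion's implementation.

```python
import hashlib

NUM_LOGICAL_SHARDS = 480   # logical shards, as described above
NUM_INSTANCES = 32         # physical Postgres instances
SHARDS_PER_INSTANCE = NUM_LOGICAL_SHARDS // NUM_INSTANCES  # 15

def logical_shard_for(workspace_id: str) -> int:
    """Hash the workspace ID so every block in a workspace lands on the
    same logical shard (illustrative scheme, not Notion's)."""
    digest = hashlib.md5(workspace_id.encode()).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def instance_for(shard: int) -> int:
    """Map a logical shard to the physical instance that hosts it."""
    return shard // SHARDS_PER_INSTANCE

workspace = "9f3c2a9e-1234-4a7b-9c1d-abcdef012345"
shard = logical_shard_for(workspace)
print(f"workspace -> logical shard {shard} on instance {instance_for(shard)}")
```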
Database Migration Strategy
Notion used a combination of methods. With double writes, incoming data is written to both the old and the new databases. They also kept an audit log that a catch-up worker periodically replayed to fill in any data the new shards were missing. Backfilling the entire existing Notion database took about three days on a machine with 96 CPUs. After verifying that the two systems matched, Notion went down for about five minutes to switch over.
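Here is a minimal, in-memory sketch of the double-write plus catch-up idea, with plain Python dicts standing in for the old and new databases; it illustrates the technique rather than Notion's actual pipeline.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the real systems.
old_db: dict[str, dict] = {}   # single Postgres database (source of truth)
new_db: dict[str, dict] = {}   # new sharded database
audit_log: deque[str] = deque()

def save_block(block: dict) -> None:
    """Write to the old database and record the block ID in an audit log
    so a catch-up worker can copy the write to the new shards later."""
    old_db[block["id"]] = block
    audit_log.append(block["id"])

def catch_up_worker() -> None:
    """Drain the audit log and upsert any missing or stale rows into the
    new database. Re-copying a row is safe, so replays are idempotent."""
    while audit_log:
        block_id = audit_log.popleft()
        new_db[block_id] = old_db[block_id]

save_block({"id": "b1", "type": "to_do", "workspace_id": "ws_42"})
catch_up_worker()
assert new_db["b1"] == old_db["b1"]   # the two systems now agree on this row
```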
Data Lake Implementation
Postgres was serving both online users and offline workloads like machine learning and analytics. To take that load off the live databases, Notion first built a dedicated analytics pipeline: data was extracted from the Postgres shards, loaded into Snowflake using Fivetran, and transformed into a single table for analytics.
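That "single table" step is essentially a big merge of the per-shard tables landed in the warehouse. Purely for illustration, and assuming hypothetical table names like raw.blocks_shard_000, the transformation could be generated like this:

```python
# Generate a statement that unions the per-shard block tables into one
# analytics table. Schema and table names here are assumptions.
NUM_LOGICAL_SHARDS = 480

shard_tables = [f"raw.blocks_shard_{i:03d}" for i in range(NUM_LOGICAL_SHARDS)]
union_sql = "\nUNION ALL\n".join(f"SELECT * FROM {table}" for table in shard_tables)
create_sql = f"CREATE OR REPLACE TABLE analytics.blocks AS\n{union_sql};"

print(create_sql.splitlines()[0])                  # CREATE OR REPLACE TABLE analytics.blocks AS
print(f"... unions {len(shard_tables)} shard tables ...")
```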
Building a Custom Data Lake Solution
Notion built a custom data lake with Amazon S3 as the storage layer, feeding systems like Elasticsearch, vector databases, and a key-value store. Apache Spark handles the heavy data transformations, Apache Kafka streams changes out of Postgres, and Apache Hudi simplifies writing those incremental updates into S3 and managing the resulting data pipelines.
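On the consumption side, analytics jobs can read the Hudi-managed tables straight out of S3 with Spark. The sketch below assumes the Hudi Spark bundle is on the classpath and uses a made-up bucket path and column name:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Read the block table that Kafka + Hudi keep up to date in S3, then run a
# simple aggregation. Path and column names are illustrative assumptions.
spark = SparkSession.builder.appName("block-analytics-sketch").getOrCreate()

blocks = spark.read.format("hudi").load("s3://example-notion-lake/block/")

blocks_per_workspace = (
    blocks.groupBy("workspace_id")   # assumed column name
          .count()
          .orderBy(F.desc("count"))
)

blocks_per_workspace.show(10)   # the ten largest workspaces by block count
```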
Facing New Scaling Challenges
Two years later, some shards were hitting 90% utilization and PgBouncer was running into its connection limits, so Notion had to re-shard, growing the database fleet from 32 to 96 machines. They used Postgres logical replication to copy the existing data onto the new machines, and built tooling to keep incoming data in sync during the transition.
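One nice property of keeping 480 logical shards is that only the shard-to-machine mapping changes during a re-shard; which logical shard a workspace lives on stays the same. A small sketch (the contiguous layout is an assumed convention):

```python
# 480 logical shards spread over more machines: 15 per instance with 32
# machines, 5 per instance with 96. A workspace's logical shard never changes,
# only the instance that hosts it.
NUM_LOGICAL_SHARDS = 480

def instance_for(logical_shard: int, num_instances: int) -> int:
    shards_per_instance = NUM_LOGICAL_SHARDS // num_instances
    return logical_shard // shards_per_instance

shard = 137
print("before:", instance_for(shard, 32))   # hosted on one of 32 machines
print("after: ", instance_for(shard, 96))   # hosted on one of 96 machines
```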
Testing and Migration Process
Notion tested the new system with dark reads: requests fetched the same data from both the old and new databases, compared the results, and logged any mismatches, while users were still served from the old database. They then transitioned one database at a time, with no data loss and no downtime for users.
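A dark read is simple to express in code. The helper below is a hypothetical sketch: it always answers from the old database, and for a small sample of requests it also queries the new one and logs any discrepancy.

```python
import logging
import random

log = logging.getLogger("dark-reads")

def get_block(block_id: str, old_db: dict, new_db: dict, sample_rate: float = 0.01):
    """Serve from the old database; compare against the new one for a sample
    of requests and log mismatches. Hypothetical sketch, not Notion's code."""
    old_result = old_db.get(block_id)        # what the user actually receives
    if random.random() < sample_rate:        # only dark-read a small fraction
        new_result = new_db.get(block_id)
        if new_result != old_result:
            log.warning("dark read mismatch for block %s", block_id)
    return old_result
```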
Conclusion
Scaling a platform like Notion is a monumental task that requires a team of skilled engineers. Notion's innovative approach to sharding, database migration, and custom data lake solutions has allowed them to keep up with their rapid growth. For more detailed insights, check out the links below.