Scaling Databases: Approaches and Best Practices
Examples from Social Media Platforms like Instagram, Facebook, or Twitter
As online platforms like Instagram, Facebook, or Twitter grow, managing database performance becomes crucial to ensure a smooth user experience. These platforms handle massive amounts of data, including user-generated content (such as photos, videos, and posts), and need to scale to accommodate millions (or even billions) of active users. This article will explore strategies for scaling databases, from simple solutions to more advanced techniques, and provide examples of how a social media platform might apply each approach.
Basic Approaches to Scale Databases
Before jumping into complex solutions like sharding, it’s important to consider simpler, more manageable approaches to scaling:
1. Optimize Database Queries to Reduce Load
One of the first steps in scaling is optimizing queries to ensure that every database call is as efficient as possible. Inefficient queries can increase latency and resource usage, especially when they access large tables or run frequently.
Example Optimization: Suppose you have a query that fetches user details and their last 10 posts. Initially, it may look like this:
SELECT users.*, posts.*
FROM users
JOIN posts ON users.id = posts.user_id
WHERE users.id = 123
ORDER BY posts.created_at DESC
LIMIT 10;
This query joins the entire users and posts tables and sorts the result every time it runs, which can be slow for large datasets. To optimize, consider:
Selecting Specific Columns: Only fetch required fields rather than SELECT *:
SELECT users.name, users.email, posts.content
FROM users
JOIN posts ON users.id = posts.user_id
WHERE users.id = 123
ORDER BY posts.created_at DESC
LIMIT 10;
Indexing Frequently Queried Columns: Add an index on user_id and created_at in the posts table to make retrieval faster (more on indexing next).
Avoiding Unnecessary Joins: If possible, store a reference to the last post in the users table to avoid repeated joins for frequently requested data.
Using Pagination: Return results in pages (e.g., with LIMIT and a cursor on created_at) rather than fetching everything at once.
By refining these types of queries, you reduce resource use and improve response time.
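The pagination advice above can be sketched with keyset (cursor-based) pagination, which stays fast as users scroll because it filters on the last seen value instead of using a growing OFFSET. A minimal sketch using SQLite for illustration (the schema mirrors the examples above, and it assumes created_at values are unique per user):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "content TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO posts (user_id, content, created_at) VALUES (?, ?, ?)",
    [(123, f"post {i}", f"2024-01-{i:02d}") for i in range(1, 31)],
)

def fetch_page(user_id, before=None, page_size=10):
    """Keyset pagination: filter on the last seen created_at instead of OFFSET."""
    if before is None:
        return conn.execute(
            "SELECT content, created_at FROM posts WHERE user_id = ? "
            "ORDER BY created_at DESC LIMIT ?",
            (user_id, page_size),
        ).fetchall()
    return conn.execute(
        "SELECT content, created_at FROM posts WHERE user_id = ? AND created_at < ? "
        "ORDER BY created_at DESC LIMIT ?",
        (user_id, before, page_size),
    ).fetchall()

page1 = fetch_page(123)                       # newest 10 posts
page2 = fetch_page(123, before=page1[-1][1])  # next 10, no OFFSET scan
```

Unlike OFFSET, the cursor query reads only the rows it returns, so page 1,000 costs the same as page 1.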
2. Use Indexing to Speed Up Data Retrieval
Indexes help the database locate data quickly without scanning every row. Think of them as an organized structure that allows fast lookups. While they can speed up reads, they can slow down writes, so careful planning is essential.
Creating an Index: Suppose you frequently search by user_id and created_at in the posts table:
CREATE INDEX idx_user_created ON posts (user_id, created_at DESC);
This index allows faster filtering by user_id and sorting by created_at, reducing the work each query performs.
Choosing the Right Fields to Index: Prioritize fields that are frequently used in WHERE, JOIN, or ORDER BY clauses. Avoid indexing fields with low selectivity (e.g., fields with only a few unique values), as they won’t provide much performance gain.
Indexes significantly reduce read times, but they require maintenance with each insert or update, so balance your use of indexes with the frequency of updates.
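To confirm an index is actually being used, inspect the query plan; a quick SQLite sketch against the posts example (SQLite's EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_user_created ON posts (user_id, created_at DESC)")

# EXPLAIN QUERY PLAN shows whether the optimizer picks the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM posts WHERE user_id = 123 ORDER BY created_at DESC"
).fetchall()
print(plan[0][-1])  # e.g. a SEARCH using idx_user_created rather than a full scan
```

If the plan reports a full table scan instead, the index definition doesn't match the query's filter and sort columns.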
3. Caching for Frequently Accessed Data
Caching stores frequently requested data in memory, allowing for faster retrieval. Use caching for static or rarely changing data, such as configuration settings or popular content.
Example Using Redis for Caching: Suppose your application frequently queries user profile data. Cache this data in Redis for faster access: each time a profile is requested, the application first checks Redis before querying the database, reducing database load for popular profiles.
import redis
cache = redis.Redis(host='localhost', port=6379)
user_id = 123
cache_key = f"user_profile:{user_id}"
# Check if data is in cache
profile = cache.get(cache_key)
if not profile:
    # If not, fetch from the database and store in the cache
    profile = db.get_user_profile(user_id)  # Example database query
    cache.setex(cache_key, 3600, profile)  # Cache for 1 hour
4. Vertical Scaling (Adding More Resources)
Vertical scaling involves upgrading hardware resources, such as CPU, RAM, or storage, on your existing database server. It’s often a straightforward approach for improving performance, especially for small-to-medium workloads.
How to Scale Up: Suppose your application is hosted on AWS RDS, and you notice performance bottlenecks during peak hours. You can upgrade your instance type to one with more CPU and memory directly through the AWS console.
When to Use: Vertical scaling is beneficial for quick fixes, especially if your application hasn’t reached the maximum resource capacity of a single machine. However, it has limitations in terms of maximum hardware capabilities, making it a temporary or supplementary solution.
5. Read-Write Splitting (Replication)
Read-write splitting involves using a master-slave architecture where the master database handles write operations, and replica databases handle read queries. This is an effective way to distribute the load across multiple servers.
Consider a platform like Facebook that could use read-write splitting to ensure that high-frequency queries (like loading a user’s news feed or timeline) are handled by read replicas, while the master database manages writes, such as posting new status updates or uploading photos.
Master-Slave (Read-Write) Setup in MySQL:
Master Database: Handles both read and write operations.
Slave Database: Handles only read operations, replicating data from the master.
import mysql.connector
from mysql.connector import Error
# Define configurations for master and slave databases
master_config = {
'host': 'master-db-host', # Master DB host
'user': 'your-username',
'password': 'your-password',
'database': 'social_media'
}
slave_config = {
'host': 'slave-db-host', # Slave DB host
'user': 'your-username',
'password': 'your-password',
'database': 'social_media'
}
# Function to write to the master (for write operations like posting status, uploading photos)
def write_to_master(query, params=None):
    connection = None
    try:
        # Connect to the master database (for writes)
        connection = mysql.connector.connect(**master_config)
        cursor = connection.cursor()
        cursor.execute(query, params)  # Execute the write query
        connection.commit()
        print("Write operation successful on master database.")
    except Error as e:
        print(f"Error writing to master database: {e}")
    finally:
        if connection is not None and connection.is_connected():
            cursor.close()
            connection.close()

# Function to read from the slave (for read operations like fetching news feed or timeline)
def read_from_slave(query, params=None):
    connection = None
    try:
        # Connect to the slave database (for reads)
        connection = mysql.connector.connect(**slave_config)
        cursor = connection.cursor()
        cursor.execute(query, params)  # Execute the read query
        return cursor.fetchall()
    except Error as e:
        print(f"Error reading from slave database: {e}")
        return []
    finally:
        if connection is not None and connection.is_connected():
            cursor.close()
            connection.close()
# Example usage for read-write splitting
# Write to master database (e.g., posting a new status update)
write_query = "INSERT INTO posts (user_id, content, created_at) VALUES (%s, %s, NOW())"
write_params = (12345, "This is my new status update.")
write_to_master(write_query, write_params)
# Read from slave database (e.g., fetching posts for a user's news feed)
read_query = "SELECT content, created_at FROM posts WHERE user_id = %s ORDER BY created_at DESC LIMIT 10"
read_params = (12345,)
posts = read_from_slave(read_query, read_params)
# Display fetched posts
for post in posts:
    print(f"Post: {post[0]} | Date: {post[1]}")
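In application code, the read/write decision is easy to centralize in a small routing layer rather than choosing a connection by hand at each call site. A minimal sketch (the round-robin replica policy and the string placeholders standing in for connections are illustrative assumptions):

```python
import itertools

class QueryRouter:
    """Routes writes to the master and reads round-robin across replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def pick(self, query):
        # Treat anything that is not a SELECT as a write
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.master

router = QueryRouter(master="master-db", replicas=["replica-1", "replica-2"])
print(router.pick("INSERT INTO posts VALUES (1)"))  # master-db
print(router.pick("SELECT * FROM posts"))           # replica-1
print(router.pick("SELECT * FROM posts"))           # replica-2
```

A real router also has to handle replication lag, for example by pinning a session to the master for a short window after it writes.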
6. Load Balancing for Distributing Traffic
Load balancers distribute incoming requests across multiple servers, preventing any single database server from becoming overwhelmed by traffic. This ensures stability and improved response times, especially during high-traffic periods.
Using a Load Balancer: Most cloud providers (e.g., AWS, Google Cloud) offer managed load balancers that can balance requests to multiple read replicas. By configuring a load balancer in front of your replicas, requests are distributed based on availability or load, enhancing fault tolerance.
Example Setup: In AWS, you can create an Application Load Balancer, specify your read replicas as targets, and route requests to them. This spreads the traffic load, making your system more resilient to traffic spikes.
7. Materialized Views for Precomputed Data
Materialized views store the result of a complex query as a separate table, allowing repeated access to precomputed data without recalculating it every time. This is especially useful for queries with high computational cost.
Creating a Materialized View: Suppose you need a daily sales summary:
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT date(created_at) AS sale_date, SUM(amount) AS total_sales
FROM transactions
GROUP BY sale_date;
Now, instead of recalculating the sum for each query, you can read from daily_sales_summary, which is refreshed periodically.
Usage Tip: Regularly refresh materialized views to ensure data accuracy, especially if the underlying data changes frequently. Some databases support automatic refresh intervals, or you can refresh manually using a scheduled job.
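The refresh pattern can be sketched by maintaining the summary as an ordinary table rebuilt by a job, which is also a common fallback on databases without native materialized views; SQLite is used here purely for illustration, with names following the daily_sales_summary example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO transactions (amount, created_at) VALUES (?, ?)",
    [(10.0, "2024-01-01"), (5.0, "2024-01-01"), (7.5, "2024-01-02")],
)
conn.execute(
    "CREATE TABLE daily_sales_summary (sale_date TEXT PRIMARY KEY, total_sales REAL)"
)

def refresh_summary():
    """Rebuild the precomputed summary, as a scheduled refresh job would."""
    conn.execute("DELETE FROM daily_sales_summary")
    conn.execute(
        "INSERT INTO daily_sales_summary "
        "SELECT date(created_at), SUM(amount) FROM transactions GROUP BY date(created_at)"
    )

refresh_summary()
rows = conn.execute("SELECT * FROM daily_sales_summary ORDER BY sale_date").fetchall()
print(rows)  # [('2024-01-01', 15.0), ('2024-01-02', 7.5)]
```

Reads hit the small precomputed table; only the refresh job pays the aggregation cost.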
8. Connection Pooling to Manage Database Connections
Connection pooling allows applications to reuse database connections rather than opening and closing a new connection each time. This reduces the overhead associated with establishing connections, which becomes significant as traffic increases.
Setting Up a Connection Pool: In many frameworks (like Django or Node.js), you can configure a connection pool to maintain a certain number of open connections, which are shared among requests. In Python with psycopg2, the built-in psycopg2.pool module can manage a connection pool:
from psycopg2 import pool

# Set up the connection pool (min 1, max 20 connections)
db_pool = pool.SimpleConnectionPool(1, 20, user="user", password="password",
                                    host="localhost", port="5432", database="mydb")

# Get a connection from the pool
connection = db_pool.getconn()

# Use the connection
cursor = connection.cursor()
cursor.execute("SELECT * FROM users;")
cursor.close()

# Return the connection to the pool instead of closing it
db_pool.putconn(connection)
Benefits: Connection pooling reduces latency caused by connection setup and teardown, making it easier for the database to handle high traffic without overwhelming the server with new connections.
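The pooling idea is independent of psycopg2; a minimal sketch of a pool built on a thread-safe queue, using SQLite connections purely for illustration:

```python
import queue
import sqlite3

class ConnectionPool:
    """Hands out pre-opened connections and takes them back for reuse."""

    def __init__(self, size, factory):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # open all connections up front

    def get(self, timeout=5):
        # Blocks until a connection is free, bounding concurrent DB load
        return self._pool.get(timeout=timeout)

    def put(self, conn):
        self._pool.put(conn)

pool_ = ConnectionPool(3, lambda: sqlite3.connect(":memory:"))
conn = pool_.get()
print(conn.execute("SELECT 1").fetchone())  # (1,)
pool_.put(conn)  # return the connection instead of closing it
```

The bounded queue doubles as a back-pressure mechanism: when all connections are checked out, callers wait instead of opening more.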
9. Database Throttling to Control Load Peaks
Database throttling controls the rate at which requests are sent to the database, preventing overload during peak traffic times. This can help prevent bottlenecks, especially when handling sudden traffic spikes.
Throttling Implementation: Set up application-level throttling logic to queue non-essential or less urgent queries when the load is high. For example, if your web server handles request queuing, you can defer some requests to maintain database stability.
Example in Practice: Use a rate-limiting library or tool that enforces a maximum number of queries per second. Many managed databases or services like AWS RDS have built-in throttling options to limit resource usage during high-demand times.
Throttling gives you control over your database’s workload, ensuring that high-priority queries are not delayed by non-critical ones during peak times.
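Application-level throttling is commonly implemented as a token bucket: each query consumes a token, and tokens refill at a steady rate, allowing short bursts while capping sustained load. A minimal sketch (the rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, then sustains `rate` requests/second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens in proportion to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should queue or reject the query

bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(6)]
print(results)  # the burst of 5 passes; the sixth call is rejected
```

In practice the rejected calls would be queued or retried with backoff rather than dropped, so low-priority work yields to critical queries during spikes.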
10. Archiving Old Data
Archiving involves moving older, less frequently accessed data to a separate storage system. By keeping frequently used data (hot data) in the primary database and moving rarely accessed data (cold data) elsewhere, you can improve overall system performance.
For a platform like Instagram, older posts or media that have not been interacted with for a long time could be archived in a separate database or external storage system (like AWS S3). This ensures that the operational database remains lean and responsive, while users can still access older content if needed.
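The hot/cold split can be sketched as a periodic job that moves rows older than a cutoff into an archive table; SQLite is used for illustration, and in production the archive would typically live in a separate database or object store:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT, created_at TEXT)"
)
conn.execute(
    "CREATE TABLE posts_archive (id INTEGER PRIMARY KEY, content TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO posts (content, created_at) VALUES (?, ?)",
    [("old post", "2015-06-01"), ("recent post", "2024-06-01")],
)

def archive_before(cutoff):
    """Move cold rows to the archive table, keeping the hot table lean."""
    conn.execute(
        "INSERT INTO posts_archive SELECT * FROM posts WHERE created_at < ?",
        (cutoff,),
    )
    conn.execute("DELETE FROM posts WHERE created_at < ?", (cutoff,))

archive_before("2020-01-01")
hot = conn.execute("SELECT content FROM posts").fetchall()
cold = conn.execute("SELECT content FROM posts_archive").fetchall()
print(hot, cold)  # [('recent post',)] [('old post',)]
```

With the cold rows gone, the hot table's indexes shrink and routine queries touch far fewer pages.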
11. Table Partitioning
Table partitioning involves splitting a large table into smaller, more manageable pieces while still keeping them on the same server. Partitioning can be based on different criteria, such as time, user ID, or another logical grouping.
For a platform like Twitter, which stores massive amounts of tweets, partitioning the Tweets Table by time (e.g., daily or monthly partitions) can improve performance. This helps with queries that filter by date range and ensures that the table remains manageable as data grows.
CREATE TABLE tweets (
    tweet_id BIGINT,
    user_id INT,
    tweet_content TEXT,
    tweet_date DATE,
    likes INT,
    retweets INT,
    PRIMARY KEY (tweet_id, tweet_date)  -- MySQL requires the partition key in the primary key
)
PARTITION BY RANGE COLUMNS (tweet_date) (
    PARTITION p2023_01 VALUES LESS THAN ('2023-02-01'),  -- January 2023
    PARTITION p2023_02 VALUES LESS THAN ('2023-03-01'),  -- February 2023
    PARTITION p2023_03 VALUES LESS THAN ('2023-04-01'),  -- March 2023
    PARTITION p2023_04 VALUES LESS THAN ('2023-05-01'),  -- April 2023
    PARTITION p_future VALUES LESS THAN (MAXVALUE)       -- Tweets from May 2023 onward
);
SELECT * FROM tweets
WHERE tweet_date BETWEEN '2023-01-01' AND '2023-01-31';
Benefits: Partitioning is particularly useful for large tables with predictable query patterns (e.g., time-based data). It improves both read and write performance by containing relevant data within smaller subsets of the overall table.
12. Horizontal Scaling (Sharding) to Split Data Across Servers
Sharding: When and How to Use It
Sharding is a method of splitting a large database into smaller, more manageable pieces (shards). While powerful, it introduces additional complexity.
Here’s when and how to approach sharding:
Sharding Strategies
A. Range-Based Sharding
Range-based sharding divides data based on a specified range of values. For example, user data can be split into different ranges, such as users with IDs 1-100,000 on one shard, and users with IDs 100,001-200,000 on another.
Consider a social media platform that has millions of users and posts. Range-based sharding could be used to distribute user data across multiple servers, for instance, user IDs 1–1,000,000 on one database and 1,000,001–2,000,000 on another. This ensures that no single database is overloaded, especially when handling high traffic.
Alternatively, user data could be split by follower-count range to distribute load more evenly, since users with millions of followers generate far more traffic:
Shard 1: Users with 0–10,000 followers
Shard 2: Users with 10,001–100,000 followers
Shard 3: Users with 100,001–1,000,000 followers
Shard 4: Users with over 1,000,000 followers
This approach ensures that servers handling high-traffic influencers or celebrities aren't overloaded and can better handle requests such as fetching posts or user profile data.
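Looking up the shard for a user then reduces to finding which bucket a value falls into; a small sketch using the follower-count boundaries above:

```python
import bisect

# Upper bounds (inclusive) of shards 1-3; shard 4 is open-ended
BOUNDARIES = [10_000, 100_000, 1_000_000]

def shard_for(follower_count):
    """Map a follower count to a 1-based shard number."""
    return bisect.bisect_left(BOUNDARIES, follower_count) + 1

print(shard_for(5_000))      # 1
print(shard_for(50_000))     # 2
print(shard_for(2_000_000))  # 4
```

Because the boundaries are an ordered list, rebalancing means editing that list and migrating only the rows whose bucket changed.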
B. Key-Based Sharding
In key-based sharding, a hash function is used to determine how data is distributed across shards. The hash function is applied to a key (such as user ID or post ID) to decide which shard the data should reside on.
Key-based sharding could be applied to tweet data for a platform like Twitter. A tweet could be assigned to a specific shard based on its tweet ID, using a hash function to ensure an even distribution of tweets across multiple databases.
A platform like Facebook could use key-based sharding to distribute the vast number of user interactions, such as likes, comments, and posts, based on post ID or comment ID. Each post or comment could be assigned to a shard using a hash function to ensure the data is distributed evenly.
Shard 1: Posts with post IDs hashed between 0–999
Shard 2: Posts with post IDs hashed between 1000–1999
Shard 3: Posts with post IDs hashed between 2000–2999
This method helps balance the load for interactions like "likes" and "comments," ensuring that Facebook can scale effectively as the number of posts grows.
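Key-based assignment can be sketched with a stable hash; note that Python's built-in hash() is randomized per process, so a checksum like CRC32 is used instead (the shard count of 3 follows the example above):

```python
import zlib
from collections import Counter

NUM_SHARDS = 3

def shard_for(post_id):
    """Stable hash of the key, reduced to a 0-based shard number."""
    return zlib.crc32(str(post_id).encode()) % NUM_SHARDS

# The same key always lands on the same shard, across processes and restarts
assert shard_for(42) == shard_for(42)

# Keys spread roughly evenly across the shards
counts = Counter(shard_for(i) for i in range(10_000))
print(counts)  # roughly a third of the keys per shard
```

A plain modulo remaps most keys when NUM_SHARDS changes, which is why production systems often layer consistent hashing on top of this idea.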
C. Hybrid Sharding
Hybrid sharding combines both range-based and key-based strategies. This can help avoid some of the issues inherent in using a single approach, particularly when the data set has both hot and cold data.
On platforms like Facebook, hybrid sharding could be used for user content.
For example, posts might be partitioned by date (range-based), while user data might be sharded by user ID (key-based). This would ensure that both recent posts and user data are efficiently distributed across multiple servers.
User Data (Key-Based Sharding): Users can be distributed across shards by hashing their user ID. For example, with two shards:
Shard 1: Users whose hashed ID maps to bucket 0
Shard 2: Users whose hashed ID maps to bucket 1
Post Data (Range-Based Sharding): Posts can be distributed based on when they were created. For example:
Shard 1: Posts created before 2018
Shard 2: Posts created between 2018 and 2022
Shard 3: Posts created after 2022
Avoiding Common Pitfalls
When considering database sharding, keep the following points in mind:
Re-Sharding is Costly: Once you’ve chosen a sharding strategy, it’s difficult to change. Ensure you choose your sharding keys carefully to avoid hotspots and future issues.
Don’t Overcomplicate: Start with simpler solutions like replication and partitioning before resorting to sharding. Complex solutions should only be implemented when simpler ones no longer suffice.
Data Integrity: With sharding, maintaining consistency and ensuring transactions across multiple shards becomes more complex. Be mindful of the potential pitfalls and make use of distributed transaction handling tools when necessary.
13. Explore NoSQL Databases
For platforms with unstructured data or high write throughput, NoSQL databases like Cassandra or MongoDB can provide native support for horizontal scaling and sharding.
Consider a platform like Twitter, which handles large volumes of unstructured data (tweets, images, etc.). A NoSQL database such as Cassandra could be used to store tweet data, ensuring high availability and scalability by distributing data across multiple nodes.
14. Denormalization: Improving Read Performance to Handle Heavy Traffic and Large Datasets
Denormalization involves adding redundancy by storing related data together in the same table, reducing the need for joins, and speeding up query performance. This approach can be especially helpful in read-heavy, high-traffic applications where quick data retrieval is more important than storage efficiency.
When to Use Denormalization
Use denormalization for read-heavy workloads where frequent joins are slowing down response times. For instance, in a reporting system or e-commerce app, where users constantly access related information, denormalization can reduce query complexity and improve speed.
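The trade-off can be shown directly: copy the author's name onto each post row at write time so the hot read path skips the join entirely. A minimal SQLite sketch (the schema is illustrative); the cost is that a username change must then be propagated to every copied row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized: author_name is copied onto each post at write time
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "author_name TEXT, content TEXT)"
)
conn.execute(
    "INSERT INTO posts (user_id, author_name, content) VALUES (?, ?, ?)",
    (123, "alice", "hello"),
)

# Read path: no join against a users table is needed
row = conn.execute("SELECT author_name, content FROM posts WHERE id = 1").fetchone()
print(row)  # ('alice', 'hello')
```

This trades write-time bookkeeping and extra storage for cheaper reads, which is the right trade only when reads dominate.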
15. Using Database-as-a-Service (DBaaS) Solutions for Scalability
Database-as-a-Service (DBaaS) solutions like Amazon RDS, Google Cloud SQL, or Azure SQL Database provide managed scaling options that can automatically adapt to your application’s needs.
Example of DBaaS Scaling: In AWS RDS, you can enable Auto Scaling, allowing the service to adjust resources like storage or memory based on demand:
Go to your RDS instance in the AWS console.
Configure Auto Scaling under instance settings.
Set up parameters for scaling, such as thresholds for CPU and memory usage.
Benefits: DBaaS solutions handle much of the operational work, like replication, load balancing, backups, and scaling. This approach can be ideal for applications that expect growth or unpredictable spikes in traffic.
DBaaS solutions offer both vertical and horizontal scaling features, allowing applications to grow without manual intervention.
16. Database Monitoring and Performance Tuning for Proactive Scaling
Constantly monitoring and tuning your database helps you anticipate scaling needs before they impact performance. Tools like New Relic, Datadog, and Grafana provide insights into query performance, resource usage, and system bottlenecks.
Monitoring Queries and Usage: Set up monitoring to observe slow-running queries, table sizes, and CPU/memory usage trends.
Example with Query Analysis: Use EXPLAIN ANALYZE in SQL to inspect query execution plans; the output breaks down execution time and helps pinpoint bottlenecks:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123 AND created_at > NOW() - INTERVAL '1 month';
Regular Tuning: Regularly tune indexes, remove redundant ones, and update configurations (e.g., cache sizes or max connections) based on workload trends.
Proactive monitoring and tuning allow you to spot potential issues and prepare scaling actions, ensuring smooth database performance as your data and traffic grow.
Conclusion
Scaling a database for platforms with heavy traffic and high load demands, such as social media or e-commerce, is a complex but essential task. By initially focusing on simpler scaling techniques—like vertical scaling, read-write splitting, and partitioning—these platforms can efficiently manage the data demands that come with rapid growth. When more advanced scaling is required, sharding (whether range-based, key-based, or hybrid) can be implemented to further distribute the load, albeit with increased complexity. Tools like Vitess or NoSQL databases can alleviate some of the challenges of manual sharding, and caching can significantly improve performance by reducing direct database load.
Ultimately, the objective is to maintain a balance between performance and simplicity, ensuring the platform scales seamlessly to handle millions or even billions of users, while delivering a fast and uninterrupted experience under heavy load.