Example

Skype Plans for PostgreSQL to Scale to 1 Billion Users

Skype uses PostgreSQL as their backend database. PostgreSQL doesn't get enough run in the database world so I was excited to see how PostgreSQL is used "as the main DB for most of [Skype's] business needs." Their approach is to use a traditional stored procedure interface for accessing data and on top of that layer proxy servers which hash SQL requests to a set of database servers that actually carry out queries. The result is a horizontally partitioned system that they think will scale to handle 1 billion users.

Skype's goal is an architecture that can handle 1 billion plus users. This level of scale isn't practically solvable with one really big computer, so our masked superhero horizontal scaling comes to the rescue.

Hardware is dual or quad Opterons with SCSI RAID.

Followed common database progression: Start with one DB. Add new databases partitioned by functionality. Replicate read-mostly data for better read access. Then horizontally partition data across multiple nodes..

In a first for this blog anyway, Skype uses a traditional database architecture where all database access is encapsulated in stored procedures. This allows them to make behind the scenes performance tweaks without impacting frontend servers. And it fits in cleanly with their partitioning strategy using PL/Proxy.

PL/Proxy is used to scale the OLTP portion of their system by creating a horizontally partitioned cluster:

- Database queries are routed by a proxy across a set of database servers. The proxy creates partitions based on a field value, typically a primary key.
- For example, you could partition users across a cluster by hashing based on user name. Each user is slotted into a shard based on the hash.
- Remote database calls are executed using a new PostgreSQL database language called plproxy. An example from Kristo Kaiv's blog:


First, code to insert a user in a database:
CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS text AS $$
BEGIN
    PERFORM 1 FROM users WHERE username = i_username;
    IF NOT FOUND THEN
        INSERT INTO users (username) VALUES (i_username);
        RETURN 'user created';
    ELSE
        RETURN 'user already exists';
    END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

Heres the proxy code to distribute the user insert to the correct partition:
queries=#
CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS TEXT AS $$
    CLUSTER 'queries'; RUN ON hashtext(i_username);
$$ LANGUAGE plproxy;

Your SQL query looks normal:
SELECT insert_user("username");

- The result of a query is exactly that same as if was executed on the remote database.
- Currently they can route 1000-2000 requests/sec on Dual Opteron servers to a 16 parition cluster.

They like PL/Proxy approach for OLTP because:
- PL/Proxy servers form a scalable and uniform "DB-bus." Proxies are robust because in a redundant configuration if one fails you can just connect to another. And if the proxy tier becomes slow you can add more proxies and load balance between them.
- More partitions can be added to improve performance.
- Only data on a failed partition is unavailable during a failover. All other partitions operate normally.

PgBouncer is used as a connection pooler for PostgreSQL. PL/Proxy "somewhat wastes connections as it opens connection to each partition from each backend process" so the pooler helps reduce the number of connections.

Hot-standby servers are created using WAL (Write Ahead Log) shipping. It doesn't appear that these servers can be used for read-only operations.

More sophisticated organizations often uses an OLTP database system to handle high performance transaction needs and then create seperate systems for more non-transactional needs. For example, an OLAP (Online analytical processing) system is often used for handling complicated analysis and reporting problems. These differ in schema, indexing, etc from the OLTP system. Skype also uses seperate systems for the presentation layer of web applications, sending email, and prining invoices. This requires data be moved from the OLTP to the other systems.
- Initially Slony1 was used to move data to the other systems, but "as the complexity and loads grew Slony1 started to cause us greater and greater pains."
- To solve this problem Skype developed their on lighter weight queueing and replication toolkit called SkyTools.

The proxy approach is interesting and is an architecture we haven't seen previously. Its power comes from the make another level of indirection school of problem solving, which has advantages:

Applications are independent of the structure of the database servers. That's encapsulated in the proxy servers.

Applications do not need to change in response to partition, mapping, or other changes.

Load balancing, failover, and read/write splitting are invisible to applications.

The downsides are:

Reduced performance. Another hop is added and queries must be parsed to perform all the transparent magic.

Inability to perform joins and other database operations across partitions.

Added administration complexity of dealing with proxy configuration and HA for the proxy servers.

It's easy to see how the advantages can outweigh the disadvantages. Without changing your application you can slip in a proxy layer and get a lot of very cool features for what seems like a low cost. If you are a MySQL user and this approach interests you then take a look at MySQL Proxy, which accomplishes something similar in a different sort of way.

An Unorthodox Approach to Database Design : The Coming of the Shard

PostgreSQProducts - Scaling infinitely with PL/Proxy

PL/Proxy

Heroku also uses PostgreSQL.

MySQL Proxy

PostgreSQL cluster: partitioning with plproxy (part I) by Kristo Kaiv'.

PostgreSQL cluster: partitioning with plproxy (part II) by Kristo Kaiv'.

PostgreSQL at Skype.

Skytools database scripting framework & PgQ by Kristo Kaiv'.

PostgreSQL High Availability.