With Lavabit shutting down under murky circumstances, it seems fitting to repost an old (2009), yet still very good post by Ladar Levison on Lavabit's architecture. I don't know how much of this information is still current, but it should give you a general idea of what Lavabit was all about.
Getting to Know You
What is the name of your system and where can we find out more about it?
Note: these links are no longer valid...
What is your system for?
Lavabit is a mid-sized email service provider. We currently have about 140,000 registered users with more than 260,000 email addresses. While most of our accounts belong to individual users, we also provide corporate email services to approximately 70 companies.
Why did you decide to build this system?
We built the system to compete against the other large free email providers, with an emphasis on serving the privacy conscious and technically savvy user. Lavabit was one of the first free email companies to provide free access via POP and later via IMAP. To this day over 90 percent of our users access the system using POP or IMAP.
How is your project financed?
The project was initially financed by the founders, but now lives off money collected via advertising and paid users. Ongoing development efforts are subsidized by our consulting business; quite simply we work on the code base for Lavabit when we have slowdowns in our consulting business.
What is your revenue model?
Offer a superior product and hope that its increasing use leads to advertising revenues, and paid account upgrades.
How do you market your product?
We rely on word of mouth to grow the service. Since most of what we provide is free, we can't justify the cost of advertising (at least right now).
How long have you been working on it?
The service has been running since the summer of 2004. Originally we called the service Nerdshack, but changed the name to Lavabit at the request of our users in December of 2005.
How big is your system? Try to give a feel for how much work your system does.
Every day the system handles approximately 200,000 email messages, while rejecting another 400,000 messages as spam. Lavabit currently averages about 12,000 daily logins, of which 80 percent are via POP, 10 percent are via IMAP and 10 percent are via the webmail system. The website itself sees about 2,500 unique visitors per day, resulting in approximately 170,000 page and file requests.
Number of unique visitors?
Approximately 12,000 unique visitors per day and approximately 28,000 unique visitors per month.
Those are the old numbers as of 2009. When Lavabit shut down, the updated numbers were: 40,000 people logging in every day and sending 1.4 million messages per week.
Number of monthly page views?
3,728,686 for Jan 2009
3,929,292 for Feb 2009
These numbers only consider HTTP requests.
What is your in/out bandwidth usage?
We currently send about 70 gigabytes per day through our upstream Internet connection. See this page for a graph:
How many documents do you serve? How many images? How much data?
Our system currently handles approximately 180,000 inbound emails, and another 20,000 outbound emails per day. This translates into about 70 gigabytes of traffic.
How fast are you growing?
We see about 150 new user registrations per day.
What is your ratio of free to paying users?
We currently have approximately 1,500 actively paying customers.
What is your user churn?
Our daily login average has recently been growing by about 250 per month. We hope to grow a lot faster when our new website and webmail system launch later this year.
How many accounts have been active in the past month?
34,247 between 2/10/2009 and 3/10/2009.
How is your system architected?
What is the architecture of your system? Talk about how your system works in as much detail as you feel comfortable with.
For SMTP, POP and IMAP Connections
We use a 2-tier architecture. There is an application tier that runs our custom mail daemon and a support tier made up of NFS and MySQL servers. A hardware based load balancer (Alteon AD4) is used to split incoming SMTP, POP and IMAP connections across the 8 application servers (Dell 1650's with 4 GB of RAM). The application servers also run memcached instances.
The application servers are used to handle the bulk of the processing load. The daemon itself is a single process, multi-threaded application written in C. Currently each daemon is configured to pre-spawn 512 threads for handling incoming connections. Another 8 threads are used to asynchronously pull ads from our advertising partner's HTTP API, and perform maintenance functions. Maintenance functions involve updating in-memory tables, expiring stale sessions, log file rotation and keeping the ClamAV signatures up to date.
From an architecture standpoint, each incoming connection gets its own thread. This allows us to use blocking IO. We currently rely on the Linux kernel to evenly split the processor among the connections.
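As a rough illustration of the thread-per-connection model described above, here is a minimal Python sketch. It is not Lavabit's code (the real daemon is written in C); the function names, the toy line protocol and the greeting text are all assumptions made for the example. Each pre-spawned worker blocks in accept(), serves one client with ordinary blocking reads and writes, and then goes back to waiting, leaving the operating system to schedule the threads.

```python
import socket
import threading


def handle_connection(conn: socket.socket) -> None:
    """Serve one client using ordinary blocking reads and writes."""
    with conn, conn.makefile("rb") as reader:
        conn.sendall(b"+OK hypothetical server ready\r\n")
        for line in reader:  # blocks this thread until the client sends a line
            if line.strip().upper() == b"QUIT":
                conn.sendall(b"+OK bye\r\n")
                break
            conn.sendall(b"-ERR unknown command\r\n")


def worker(listener: socket.socket) -> None:
    """One pre-spawned thread: accept a client, serve it, repeat."""
    while True:
        try:
            conn, _addr = listener.accept()  # blocks until a client connects
        except OSError:
            return  # the listening socket was closed
        try:
            handle_connection(conn)
        except OSError:
            pass  # a broken client connection should not kill the worker


def start_workers(listener: socket.socket, count: int) -> list:
    """Pre-spawn `count` worker threads that share one listening socket."""
    threads = [
        threading.Thread(target=worker, args=(listener,), daemon=True)
        for _ in range(count)
    ]
    for thread in threads:
        thread.start()
    return threads
```

The appeal of this design is that the per-connection code reads top to bottom like a script. The cost is one thread (and its stack) per concurrent client, whether or not that client is doing anything.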
We currently call our mail daemon "lavad" and it fluently speaks SMTP, POP and IMAP. The daemon is also responsible for applying all of our business logic and interfacing with the different open source libraries we use.
When accepting messages from the outside world via SMTP, the daemon will perform the following checks:
- If the recipient is valid
- Whether the incoming IP is listed on an RBL
- If the return path can be validated using SPF (libspf2)
- Against any size or rate limits for the account
- Against the user’s gray list
- For viruses (libclamav)
- For a valid domain key signature (libdomainkeys)
- Whether the message looks like spam based on statistical token data (libdspam)
- And finally against any filters used to sort or delete messages matching a regular expression
Whether a specific check is used depends on the user’s preferences, and the account plan they have. For example, the spam filter is limited to paid users (because of the load it places on the database).
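One way to picture a check list that is gated by user preferences and account plan is the Python sketch below. Everything in it is hypothetical: the class and function names, the plan names, the size limit, and the stand-in checks (a hard-coded block list and a keyword match take the place of real RBL lookups and the statistical filter). It only shows three of the checks, and it only shows the gating logic.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Message:
    sender_ip: str
    size: int
    body: str


@dataclass
class Account:
    plan: str = "free"                  # hypothetical plan names: "free", "paid"
    max_size: int = 10_000_000          # hypothetical per-message size limit
    enabled_checks: Set[str] = field(default_factory=set)


def check_rbl(msg: Message, acct: Account) -> Optional[str]:
    blocked = {"203.0.113.7"}           # stand-in for a real RBL lookup
    return "sender IP is on a block list" if msg.sender_ip in blocked else None


def check_size(msg: Message, acct: Account) -> Optional[str]:
    return "message exceeds size limit" if msg.size > acct.max_size else None


def check_spam(msg: Message, acct: Account) -> Optional[str]:
    # Stand-in for a statistical filter: a single keyword match.
    return "looks like spam" if "cheap pills" in msg.body.lower() else None


# (name, check function, restricted to paid plans?)
PIPELINE = [
    ("rbl", check_rbl, False),
    ("size", check_size, False),
    ("spam", check_spam, True),
]


def run_checks(msg: Message, acct: Account) -> List[Tuple[str, str]]:
    """Run only the checks this account has enabled and is entitled to."""
    failures = []
    for name, check, paid_only in PIPELINE:
        if name not in acct.enabled_checks:
            continue                    # the user turned this check off
        if paid_only and acct.plan != "paid":
            continue                    # the plan does not include this check
        reason = check(msg, acct)
        if reason is not None:
            failures.append((name, reason))
    return failures
```

Returning the full list of failures, rather than stopping at the first one, leaves the decision about what to do with the message to a later step.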
Depending on the outcome of the different checks, the user can choose to label the message, reject it, or in some cases delete it silently. If the message needs to be bounced, a bounce message is only sent if a) the return path can be verified using SPF, or b) the sender is verified using domain keys and the sender matches the return path.
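The bounce rule can be written as a small predicate. This is a sketch with hypothetical parameter names. Comparing the addresses case-insensitively is an assumption, not something the text specifies.

```python
def should_send_bounce(spf_verified: bool, dk_verified: bool,
                       dk_sender: str, return_path: str) -> bool:
    """Bounce only when the return path is trustworthy enough to reply to."""
    if spf_verified:
        return True
    # Otherwise require a verified domain key signature AND a signer that
    # matches the return path (case-insensitive match is an assumption).
    return dk_verified and dk_sender.lower() == return_path.lower()
```

The point of the rule is to avoid backscatter: a bounce sent to a forged return path lands in the inbox of someone who never sent the message.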
As a final step, the message is encrypted using ECC (if applicable), compressed using LZO and then stored on the NFS server.
For POP connections, the process is relatively simple. The user authenticates and requests a message. The daemon loads the message, checks the hash to make sure the data hasn’t been corrupted, decompresses the data, and then decrypts it (if applicable) before sending it along to the client.
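The store and load steps can be sketched in a few lines of Python. Several things here are assumptions: zlib stands in for LZO (which is not in the Python standard library), SHA-256 stands in for whatever hash the daemon uses, the hash is taken over the stored (compressed) bytes, and the encryption step is left out entirely. The function names are hypothetical.

```python
import hashlib
import zlib
from typing import Tuple


def pack_message(data: bytes) -> Tuple[bytes, str]:
    """Compress a message and return the stored blob plus its hash."""
    blob = zlib.compress(data)                    # stand-in for LZO
    return blob, hashlib.sha256(blob).hexdigest()


def unpack_message(blob: bytes, expected_hash: str) -> bytes:
    """Verify the stored blob before decompressing it."""
    if hashlib.sha256(blob).hexdigest() != expected_hash:
        raise ValueError("stored message failed its integrity check")
    return zlib.decompress(blob)
```

Checking the hash before decompressing means a corrupted file is rejected with a clear error instead of failing somewhere inside the decompressor.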
Because we need the plain text password to decrypt a user’s private key, we don’t support secure password authentication. We decided to support SSL instead (which encrypts everything; not just the password). We handle the SSL encryption at the application tier rather than on the load balancer because we feel the application tier is easier to scale.
On a side note, our failure to support secure password authentication hasn’t stopped people from clicking the Secure Password Authentication checkbox in Outlook, and creating a support nightmare for us. Outlook doesn’t enable SMTP authentication by default either, so that creates another support nightmare for us. If any mail client developers read this, please start making port 587 the default instead of port 25, and auto detect SMTP authentication.
When retrieving messages for users that have the statistical spam filter enabled or users who have selected a plan with advertising, the daemon will also insert a small text signature. The signature will have a link for training the server side spam filter and/or a small text advertisement.
For IMAP connections, the daemon also presents messages in folders and allows server side searches of messages. Currently searches are handled by reading in all of the message data from disk, which results in a large performance hit if the folder is large. If the user connects multiple times using the same credentials, the connections will share a centralized copy of their mailbox state, which also creates lock contention issues. Search is certainly one area where we need to improve.
For outbound messages, the daemon will authenticate the user's credentials against the database, apply any sending limits for the account (to prevent abuse), check whether the From address matches an email address associated with the credentials provided (to prevent spoofing), and finally the daemon checks whether the message contains a virus. Assuming all of the checks are passed, the message is cleaned up, signed using domain keys, and then relayed via our internal network to a Postfix server that handles relaying it to the final destination.
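Two of the outbound checks, the sending limit and the From address match, lend themselves to a short sketch. The sliding-window limiter below is one common way to implement a rate limit and is an assumption, since the text does not say how limits are enforced. All names and the example limits are hypothetical.

```python
import time
from collections import deque
from typing import Dict, Iterable, Optional


class SendLimiter:
    """Allow at most `max_messages` per user within a sliding time window."""

    def __init__(self, max_messages: int, window_seconds: float) -> None:
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._sent: Dict[str, deque] = {}   # user -> timestamps of recent sends

    def allow(self, user: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self._sent.setdefault(user, deque())
        # Forget sends that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_messages:
            return False
        recent.append(now)
        return True


def from_address_allowed(from_address: str, owned: Iterable[str]) -> bool:
    """True if the From address is one of the addresses tied to the login."""
    return from_address.strip().lower() in {a.lower() for a in owned}
```

For example, `SendLimiter(max_messages=100, window_seconds=3600)` would cap a user at 100 messages in any rolling hour. In a multi-server deployment the counters would have to live somewhere shared, which this sketch does not attempt.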
The daemon uses pools for sharing anything it can, including MySQL connections, ClamAV instances, cURL instances (for pulling ads), Memcached instances, libspf2 instances, etc. To keep deployments simple, we compile all of the open source libraries we rely upon into a single archive that is then dynamically loaded at runtime. We don’t compile the libraries directly into our application because doing so would require us to release the daemon under the GPL, and we don’t rely on dynamic linking since we don’t want any of the key libraries to be automatically updated by the operating system without us knowing.
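The pooling idea is independent of what is being pooled. Here is a minimal sketch of a fixed-size pool that threads borrow from and return to. The class name and the factory argument are hypothetical, and a real pool would also need to deal with resources that have gone stale (for example, a dropped database connection).

```python
import queue
from contextlib import contextmanager


class Pool:
    """A fixed-size pool of reusable resources shared between threads."""

    def __init__(self, factory, size: int) -> None:
        self._items: queue.Queue = queue.Queue()
        for _ in range(size):
            self._items.put(factory())   # create every resource up front

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until a resource is free; raises queue.Empty on timeout.
        item = self._items.get(timeout=timeout)
        try:
            yield item
        finally:
            self._items.put(item)        # always hand the resource back
```

A thread would use it as `with pool.borrow() as conn: ...`, so the resource is returned even if the body raises. Because the pool is smaller than the number of threads, it also acts as a cap on how many connections the daemon opens to a backend.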
For HTTP Connections
Like inbound mail connections, HTTP connections are split between two servers using the load balancer. Apache is currently used to handle the requests. While most of the website is static XHTML files, our registration engine is written in C (with libgd for the CAPTCHA images, and libcurl for processing credit cards using the PayFlow Pro HTTP API). All of our C applications rely on the Apache CGI interface.
The preferences portal is currently written in Perl and the webmail system is based on a popular open source client and is written in PHP. We modified the webmail system to fit more smoothly into our site. The webmail client currently connects to the mail system using IMAP, with each web server getting a dedicated IMAP server.
What particular design/architecture/implementation challenges does your system have?
The Big Problem
While it is very easy to set up a mail system that reliably handles email for a few thousand users, it is incredibly difficult to scale that same system beyond a single server. This is because most email servers were originally designed for use on a single server. If you grow these same systems beyond a single server you typically need to use a database and/or an NFS server to keep everything synchronized between the different nodes. And while it is possible to build large database and NFS instances, it is also very expensive, and depending on the setup it can be very inefficient.
If you want to avoid the single database or NFS server problem, you can do so by adding a lot of complexity. For example, if you wanted to implement a very large Cyrus system the typical solution is to use LDAP for authentication, and then use an IMAP/POP reverse proxy to intercept incoming connections and forward them to the specific Cyrus server for that user. The problem with managing a system like this is the relatively high number of critical pieces that can fail. The following image visualizes what a system like this might look like:
For a full write up on this design, see http://www.linuxjournal.com/article/9804
The problem with systems designed this way is the large number of critical services. If a Cyrus server goes down, then all of the users hosted on that system are offline. You can mitigate this risk with a failover system that is periodically rsync’ed with the master, or by using a SAN, but these options are either inefficient or expensive. (There is at least one medium sized free/paid email company that uses a system like this, and presumably this is why the limits on their free plan are so low.)
At one point Yahoo Mail relied on a very large NetApp device to centrally store mail. And while the NetApp devices scale well, they are also very expensive. It was this high cost that kept Yahoo Mail from matching Gmail’s 1GB quota for almost a year. When they finally completed their move to a distributed architecture in 2007, they began offering unlimited storage.
It is precisely because of how easy it is to implement small mail systems, and so difficult to implement large mail systems that the Internet sees literally hundreds if not thousands of free email companies start and then fail each year. If the system becomes popular they have no easy way to scale it, so they are forced to either stop accepting free accounts, or shut down the system. Only a small number of companies have the financial and technical resources to build systems from the ground up to support a large user base. It is also why Lavabit followed the example of the large providers, and implemented a completely custom platform; it was the only way we could support 100k+ users with a cost basis low enough to make the business profitable.
The key to keeping any large system manageable and cost effective is to keep it simple. The fewer critical failure points you have in the system, the better things will run.
Current Problems with Our Implementation
We have locking issues when multiple IMAP connections try to access the same mailbox; only one thread is currently allowed access to the mailbox at a time, which can present a problem if the user makes a request that takes a long time to process (e.g., search or bulk fetches via IMAP).
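To make the contention concrete, here is a small sketch of a one-lock-per-mailbox scheme. It is an assumption about the general shape of such a design, not a description of the daemon's internals, and the names are hypothetical.

```python
import threading
from typing import Dict


class MailboxLocks:
    """Hands out exactly one lock per mailbox name."""

    def __init__(self) -> None:
        self._guard = threading.Lock()            # protects the dictionary
        self._locks: Dict[str, threading.Lock] = {}

    def lock_for(self, mailbox: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(mailbox, threading.Lock())
```

A connection would wrap each command in `with locks.lock_for(mailbox): ...`. While one connection holds the lock for a slow search, every other connection to the same mailbox waits, even for a cheap request, while connections to other mailboxes are unaffected. A readers-writer lock, which lets read-only commands proceed together, is one commonly used way to reduce this kind of contention.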
Reading in the full message when only the header is needed has caused a performance problem. We also need to implement code for indexing messages and processing searches using an index instead of reading in all of the data for each search request. Unfortunately we don’t have dedicated search gurus on hand to help with this like our competitors.
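For readers unfamiliar with the idea, an index-based search can be sketched as an inverted index: a map from each word to the set of messages containing it. The sketch below is a toy with hypothetical names. It matches whole words only, whereas IMAP search is defined in terms of substring matching, so a real implementation would need more than this.

```python
import re
from collections import defaultdict
from typing import Dict, Set

_TOKEN = re.compile(r"[a-z0-9]+")


class MessageIndex:
    """Toy inverted index: word -> ids of the messages that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, message_id: int, text: str) -> None:
        for token in set(_TOKEN.findall(text.lower())):
            self._postings[token].add(message_id)

    def search(self, query: str) -> Set[int]:
        """Return ids of messages containing every word in the query."""
        tokens = _TOKEN.findall(query.lower())
        if not tokens:
            return set()
        result = None
        for token in tokens:
            ids = self._postings.get(token, set())
            result = set(ids) if result is None else result & ids
        return result
```

The index is updated once when a message arrives, and a search then touches only the posting sets for the query words instead of reading every message in the folder from disk.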
The statistical spam filter currently stores all of its token and signature data in the database. Checking the tokens in a message against a database using SQL is very inefficient. And because of how difficult it is to scale MySQL databases, doing this for 200,000+ messages a day would be very expensive. At some point we will change the way token data is stored so that the load is more evenly distributed across the cluster and then we will be able to offer the filter to everyone.
The webmail system does not keep IMAP connections open. As a result data is freed and then reloaded frequently. Also because the whole message is loaded when the webmail system is only requesting the header, a lot of unnecessary data is often pulled from the NFS servers.
Naturally, we’re working to fix all of these issues.
What did you do to meet these challenges?
We implemented a custom mail system, which was designed from the ground up to efficiently handle a large number of users. A custom platform has also allowed us to implement a lot of custom business logic that would have been difficult, if not impossible using an off-the-shelf system.
How did your system evolve to meet new scaling challenges?
We started with a single server that used Postfix and Qpopper. We used amavisd for virus and spam filtering and a custom policy daemon to make sure users didn’t send too much mail, or spoof someone else’s address. This system worked well for about 4 months; but with a few thousand users the system started to choke. We had to turn off new user registrations until we could transition onto a custom multi-server platform.
Originally there was an application for handling SMTP connections and a separate one for POP connections. Each application would spawn multiple processes with multiple threads so that if one died from a segmentation fault another could take its place. (This is how Apache and Postfix are designed for both reliability and security.) Over time we combined the protocols into a single process which spawns a larger number of threads. We could do this because over time we worked out the memory bugs. In the last year we’ve only had 4 nodes die from segmentation faults, all of which were triggered by bugs in the libraries we use. (But we also haven’t released any major changes into production in the last year.) This single daemon design has made our use of things like database connections much more efficient.
Do you use any particularly cool technologies or algorithms?
The way we encrypt messages before storing them is relatively unique. We only know of one commercial service, and one commercial product that will secure user data using asymmetric encryption before writing it to disk. Basically we generate public and private keys for the user and then encrypt the private key using a derivative of the plain text password. We then encrypt user messages using their public key before writing them to disk. (Alas, right now this is only available to paid users.)
We also think the way our system is architected, with an emphasis on being used in a cluster, is rather unique. We would like to someday release our code as free software. We haven’t yet because a) we don’t want anyone else building a competing system using our code, b) while we’ve moved more settings and logic into a configuration file over the last couple of years, there is still a lot of logic hard coded, and c) we’ve created the code specifically for CentOS, and don’t have the resources to test and support it on other operating systems right now. We’ve spent some time looking for a company to sponsor open sourcing the code, but haven’t found one yet.
What did you do that is unique and different that people could best learn from?
One of the ways to gain an advantage over your competition is to invest the time and money needed to build systems that are better than what is easily available to your competition. It is the custom platform we developed that has allowed us to thrive while many other free email companies either stopped offering their service for free, or shut down altogether.
That said, you should always start by improving the components that will make the most difference to your users, and move on from there. For Lavabit that meant starting out with a custom mail platform, but continuing to use Postfix for outbound mail, MySQL for synchronization, and NFS for file storage.
This may be a good place to note that in 2004 the only major database system with production ready cluster support was Oracle and it remains a very expensive option (way beyond the budget of a free service like ours). Since then SQL Server and MySQL have both improved/added support for clustering (replication and failover is _not_ the same as cluster support). And while the MySQL cluster implementation still needs work before developers can stop worrying about the scalability of the database, the world is getting closer to that point. Distributed caching (memcached) and distributed file systems (Lustre/GFS) have also matured since we started in 2004. Throw in cloud services like S3 into the equation and it is almost easy to implement a highly scalable website or service.
What lessons have you learned?
The only way to guarantee success is through hard work.
Why have you succeeded?
We are committed to providing a superior service and offering it on terms we think all users should be demanding. We are also committed to continually improving the service we offer.
What do you wish you would have done differently?
There are a number of areas in our platform we wish were implemented differently. In most cases we made the decisions we did because implementing them the "right" way would have taken longer.
A good example is the IO model we’re using. The asynchronous IO model used by lighttpd and memcached is more efficient than our current model, but we felt doing things this way would have taken longer while giving us little initial benefit. See this quite famous web page for a full write up on the issue:
We also wish we had finished the IMAP server earlier than October of 2007, and finished our custom webmail system by now.
What wouldn't you change?
We are happy with the decision to enter the email service business. Overcoming the challenges involved in building a reliable and scalable mail platform has been rewarding.
Personally I also enjoy knowing that the system I helped create is being used by 12,000 people each day. Sometimes I find myself thinking "there are 1,000 people connected to this system right now." I like those numbers.
How much up front design should you do?
Collectively, the engineering team has spent thousands of hours doing research to help make our mail system better. This knowledge has been invaluable not only in improving the mail system, but also in helping our professional services clients.
The bottom line is that it is easier to make changes to a design document than it is to the code. What this means is that if you don’t clearly understand how something should be implemented, it pays to write design documents first. The hours you save in the end will far outweigh the hours you spend writing the documentation.
How are you thinking of changing your architecture in the future?
We’ve had a major update to our website and our application tier in the works for almost a year already. The details involved in this update (outside of what has already been mentioned here) are still a secret. I will definitely need to update this write up when we are ready to push the development tree into production.
What infrastructure do you use?
Which programming languages does your system use?
We still have a number of legacy web applications and maintenance scripts written in Perl, and the webmail system is currently in PHP, but these are all slated for conversion to C as time allows.
Our consulting projects typically involve development in C#; and we spent a lot of time thinking about whether to implement the system in C# back in 2004. In the end, we felt that Windows and .NET would not be a good choice for a scalable mail platform. We felt that the increased performance, the lack of licensing costs, and the availability of so many open source libraries for handling mail meant that the best choice for us was to go with Linux and C.
If we had to make a similar decision today, there is a chance we would not have chosen to go with C. Given the stability and efficiency of Windows 2008, the growing amount of open source C# code on the Internet and the availability of Mono as an alternative to .NET, we may have opted for C# instead.
In our experience, the decision on what platform to choose for a project can often be broken down into simple math. Building applications in C# is typically faster than C. For us that typically means 3 to 4 times faster than using C, and about 1.5 times faster than using PHP. If you can figure out how much additional development time it will take to use one platform over another, it becomes easy to calculate whether the performance and license savings of one platform will offset the increased cost of development. In general hardware and software are cheap compared to development time, so the number of applications which can justify being built in C or C++ is getting very small.
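The "simple math" can be made concrete with a break-even calculation. The function name and all of the example figures below (hours, hourly rate, monthly savings) are made up for illustration; only the 3x development-time multiplier comes from the text.

```python
def break_even_months(baseline_hours: float, slowdown_factor: float,
                      hourly_rate: float, monthly_savings: float) -> float:
    """Months of hardware/license savings needed to repay extra dev time.

    baseline_hours  -- estimated effort on the faster-to-develop platform
    slowdown_factor -- how many times longer the other platform takes
    hourly_rate     -- cost of one hour of development
    monthly_savings -- hardware and license savings per month on the
                       slower-to-develop platform
    """
    if monthly_savings <= 0:
        raise ValueError("no savings means the extra cost is never repaid")
    extra_cost = baseline_hours * (slowdown_factor - 1) * hourly_rate
    return extra_cost / monthly_savings


# Hypothetical project: 500 hours in C#, 3x as long in C, $100 per hour,
# and $2,000 per month saved on servers and licenses by using C.
months = break_even_months(500, 3, 100, 2000)   # 100,000 / 2,000 = 50.0
```

With these made-up figures the extra 1,000 hours cost $100,000, which takes 50 months of savings to recover. Plugging in your own estimates is what turns the platform debate into arithmetic.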
On a side note, we think the productivity gap between IIS/.NET/SQL Server/Visual Studio and Apache/PHP/MySQL/Eclipse is largely a result of how well the Microsoft tools have been integrated with each other.
How many servers do you have?
We have 14 servers dedicated to the mail platform. We have 1 server dedicated to monitoring, and another 11 servers used for website hosting. Most of the websites we host were developed by the Lavabit team, so we don’t consider it part of our core business.
How is functionality allocated to the servers?
We move services from one server to another as necessary. Our long term goal is to create a global pool of servers that can handle everything, so fewer clusters are needed, and the load is distributed more evenly.
How are the servers provisioned?
We typically buy servers on eBay, and then install and configure them ourselves.
What operating systems do you use?
The mail system is currently using CentOS 4. The application servers use the 32 bit version, and the database and storage servers use the 64 bit version.
We also use Windows 2003 for hosting other things not related to the mail service.
Which web server do you use?
Which database do you use?
Do you use a reverse proxy?
Our load balancer will route connections from the same IP to the same node, which tends to make caching easier, but we do not use a "true" reverse proxy to route connections based on user credentials.
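Source-IP affinity of this kind is often implemented by hashing the client address onto the list of nodes. The sketch below shows that general technique. It is not a description of how the Alteon load balancer works internally, and the function and node names are hypothetical.

```python
import hashlib
import ipaddress
from typing import Sequence


def pick_node(client_ip: str, nodes: Sequence[str]) -> str:
    """Map a client IP address to one node, the same node every time."""
    packed = ipaddress.ip_address(client_ip).packed   # 4 or 16 raw bytes
    digest = hashlib.sha256(packed).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

Because the choice depends only on the address and the node list, repeat connections from one client reach the same server and find its cache warm. The trade-off is that adding or removing a node changes the mapping for many clients at once, and users behind a shared NAT address all land on the same node.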
Do you collocate, use a grid service, use a hosting service, etc?
We’ve created our own platform. We lease space inside a colo facility to host all of our equipment.
What is your storage strategy?
The storage servers have 3ware RAID cards, and use SATA drives. We are currently using the ext3 file system, and share files via NFS. The database servers have PERC 4 cards and use SCSI.
How much capacity do you have?
Overall, we are only using about 10 percent of our total processing power, and about 25 percent of our currently subscribed bandwidth. The two areas where we currently have issues are the IMAP servers dedicated to handling webmail connections, and disk throughput on the NFS servers. Throughput on the NFS servers is limited by the IO controller cards, and the file system in use (ext3). Ultimately we feel there is a lot of room for growth available to us by making our code more efficient.
The servers in the "earth" cluster are only averaging about 8 percent utilization. This 8 node cluster is used to handle inbound SMTP, POP and IMAP connections, and run memcached instances. A typical CPU graph for a server in that cluster looks like:
As you can see the servers allocated to the "earth" cluster still have plenty of room for growth. In contrast, if you compare this to a CPU graph from one of the servers in the "mars" cluster you will notice one of our current capacity problems. The "mars" cluster is made up of the two nodes dedicated to handling IMAP connections from the webmail system, and the CPU graphs for both nodes look like: