Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.
Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?
-- Pair of ServerIrons
---- Squid Caches
------ NetApps
---- PHP App Servers
------ Storage Manager
------ Master-master shards
------ Dual Tree Central Database
------ Memcached Cluster
------ Big Search Engine
- The Dual Tree structure is a custom set of changes to MySQL that allows scaling by incrementally adding masters without a ring architecture. This allows cheaper scaling because you need less hardware than master-master setups, which always require double the hardware.
- The central database includes data like the 'users' table, which holds primary user keys (a few different IDs) and a pointer to which shard a user's data can be found on.
- Shards: My data gets stored on my shard, but the record of performing an action on your comment is stored on your shard (when making a comment on someone else's blog, for example).
- Global Ring: It's like DNS; you need to know where to go and who controls where you go. On every page view they calculate where your data is at that moment in time (a rough sketch of this lookup appears a few bullets below).
- PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)
- Slice of the main database
- Active Master-Master Ring Replication: there are a few drawbacks in MySQL 4.1, such as honoring commits in Master-Master. AutoIncrement IDs are automated to keep it Active-Active.
- Shard assignments are from a random number for new accounts
- Migration is done from time to time so certain power users can be moved off a shard. It needs to be balanced if someone has a lot of photos: 192,000 photos and 700,000 tags take about 3-4 minutes to move. Migration is done manually.
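To make the global ring lookup concrete, here is a minimal sketch in Python (Flickr's real version is roughly 10 lines of PHP, per the bullet above). The users table columns, the memcached key format, and the hostnames are assumptions for illustration, not Flickr's actual names:

```python
import memcache   # python-memcache client; any memcached client would do
import MySQLdb    # assumes a connection to the central database is available

cache = memcache.Client(["memcache1.example.com:11211"])

def get_shard_id(user_id, central_db):
    """Return the shard a user's data lives on: check memcached first,
    then fall back to the users table in the central database."""
    key = "shard_of:%d" % user_id
    shard_id = cache.get(key)
    if shard_id is None:
        cur = central_db.cursor()
        cur.execute("SELECT shard_id FROM users WHERE user_id = %s", (user_id,))
        shard_id = cur.fetchone()[0]
        cache.set(key, shard_id)
    return shard_id
```

Because the shard assignment is looked up on every page view, migrating a user to another shard only means updating that one pointer (and invalidating the cache entry).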
- Pulls the photo owner's account from cache to get the shard location (say, shard-5)
- Pulls my information from cache to get my shard location (say, shard-13)
- Starts a "distributed transaction" to answer the questions: Who favorited the photo? What are my favorites? (The two-transaction write pattern is sketched a few bullets below.)
- On every page load, the user is assigned to a bucket
- If a host is down, go to the next host in the list; if all hosts are down, display an error page. They don't use persistent connections; they build a connection and tear it down, so every page load tests the connection (see the sketch below).
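Here is a hedged sketch of that per-page-load connection logic in Python (the real code is PHP); the host lists, database name, and bucket naming are made up:

```python
import MySQLdb

# Hypothetical mapping from a user's bucket/shard to an ordered list of hosts.
BUCKET_HOSTS = {
    "shard-13": ["db13a.example.com", "db13b.example.com"],
}

def connect(bucket):
    """Try each host in order; there are no persistent connections, so every
    page load builds a fresh connection and thereby tests it."""
    for host in BUCKET_HOSTS[bucket]:
        try:
            return MySQLdb.connect(host=host, db="flickr")
        except MySQLdb.OperationalError:
            continue              # this host is down, try the next one
    raise RuntimeError("all hosts down for %s" % bucket)   # render the error page

# usage on a page load: conn = connect("shard-13"); ...; conn.close()
```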
- A lot of data is stored twice. For example, a comment is part of the relation between the commenter and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out-of-sync data: open transaction 1, write commands; open transaction 2, write commands; commit the 1st transaction if all is well; commit the 2nd transaction if the 1st committed. But there is still a chance for failure when a box goes down during the 1st commit (see the sketch below).
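A minimal sketch of that two-transaction ordering, using Python and two DB-API connections (one per shard); the table and column names are invented. Note the residual failure window the bullet mentions: if a box dies between the two commits, the copies are briefly out of sync.

```python
def store_comment(commenter_db, commentee_db, commenter_id, commentee_id, body):
    """Write the comment to both shards, then commit in order:
    the 2nd transaction is committed only if the 1st committed."""
    c1 = commenter_db.cursor()
    c2 = commentee_db.cursor()

    # With autocommit off, each execute below opens a transaction on its
    # own shard connection.
    c1.execute("INSERT INTO comments_made (commenter_id, commentee_id, body) "
               "VALUES (%s, %s, %s)", (commenter_id, commentee_id, body))
    c2.execute("INSERT INTO comments_received (commenter_id, commentee_id, body) "
               "VALUES (%s, %s, %s)", (commenter_id, commentee_id, body))

    commenter_db.commit()   # commit the 1st transaction if all writes succeeded
    commentee_db.commit()   # commit the 2nd only after the 1st went through
```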
- Two search back-ends: the shards (35k qps across a few shards) and Yahoo!'s (proprietary) web search
- An owner's single-tag search or a batch tag change (say, via Organizr) goes to the shards due to real-time requirements; everything else goes to Yahoo!'s engine (probably about 90% behind the real-time goodness). A small routing sketch follows below.
- Think of it as having Lucene-like search
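A hypothetical sketch of that routing decision (the function name and flags are made up; only the rule itself comes from the bullets above):

```python
def route_search(requester_id, photo_owner_id, is_batch_tag_change=False):
    """Owner tag searches and batch tag changes (e.g. via Organizr) go to the
    shards for real-time results; everything else goes to the Yahoo!-powered
    engine, which lags behind the live data."""
    if requester_id == photo_owner_id or is_batch_tag_change:
        return "shards"         # read-your-own-writes, real-time
    return "yahoo_search"       # eventually consistent

# e.g. searching your own photos by tag hits the shards
assert route_search(requester_id=42, photo_owner_id=42) == "shards"
```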
- EMT64 w/RHEL4, 16GB RAM
- 6-disk 15K RPM RAID-10.
- Data size is at 12 TB of user metadata (these are not photos, just InnoDB ibdata files; the photos are a lot larger).
- 2U boxes. Each shard has ~120 GB of data.
- ibbackup on a cron job that runs across the various shards at different times. Hot backup to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover something went wrong a few days later: something like 1-, 2-, 10- and 30-day backups (see the retention sketch below).
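A small sketch of what that staggered retention might look like; the kept ages come from the bullet above, while the pruning logic itself is an assumption:

```python
KEEP_AGES_DAYS = (1, 2, 10, 30)   # the staggered backups mentioned above

def backups_to_keep(backup_ages_days):
    """Given the ages (in days) of existing backups, pick the youngest backup
    that is at least as old as each staggered slot."""
    keep = set()
    for target in KEEP_AGES_DAYS:
        old_enough = [age for age in backup_ages_days if age >= target]
        if old_enough:
            keep.add(min(old_enough))
    return sorted(keep)

# With daily backups aged 1..35 days, this keeps the 1, 2, 10 and 30 day copies:
print(backups_to_keep(range(1, 36)))   # -> [1, 2, 10, 30]
```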
- Tags do not fit well with traditional normalized RDBMS schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters that save the results into MySQL, because some relationships are so complicated to calculate that doing it live would absorb all the database CPU cycles.
- Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc.) at the same time. Everything is active; nothing will ever be idle.
- What is the maximum something that every server can do?
- How close are you to that maximum, and how is it trending? (A quick headroom check is sketched below.)
- MySQL (disk IO?)
- SQUID (disk IO? or CPU?)
- memcached (CPU? or network?)
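A toy headroom check along those lines; the numbers, ceiling, and linear trend are made up, and the point is only to compare current load and its trend against each resource's known maximum:

```python
def headroom_report(resource, current, maximum, weekly_growth_pct):
    """How close a resource (disk IO, CPU, network...) is to its ceiling, and
    roughly how many weeks of the current linear trend until it gets there."""
    used_pct = 100.0 * current / maximum
    if weekly_growth_pct > 0:
        weeks_left = (maximum - current) / (current * weekly_growth_pct / 100.0)
    else:
        weeks_left = float("inf")
    return "%s: %.0f%% of max, ~%.1f weeks of headroom at the current trend" % (
        resource, used_pct, weeks_left)

# Made-up example: a MySQL box doing 900 IOPS against a 1200 IOPS ceiling,
# growing 5% per week
print(headroom_report("MySQL disk IO", 900, 1200, 5.0))
```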
- Do you have event-related growth? For example: a disaster, a news event.
- Flickr gets 20-40% more uploads on the first work day of the year than any previous peak the previous year.
- 40-50% more uploads on Sundays than the rest of the week, on average