Google Talk is Google's instant communications service. Interestingly the IM messages aren't the major architectural challenge, handling user presence indications dominate the design. They also have the challenge of handling small low latency messages and integrating with many other systems. How do they do it?
- People ask about how many IMs do you deliver or how many active users. Turns out not to be the right engineering question.
- Hard part of IM is how to show correct present to all connected users because growth is non-linear: ConnectedUsers * BuddyListSize * OnlineStateChanges
- A linear user grown can mean a very non-linear server growth which requires serving many billions of presence packets per day.
- Have a large number friends and presence explodes. The number IMs not that
big of deal.
- Lab tests are good, but don't tell you enough.
- Did a backend launch before the real product launch.
- Simulate presence requests and going on-line and off-line for weeks
and months, even if real data is not returned. It works out many of the
kinks in network, failover, etc.
- Divide user data or load across shards.
- Google Talk backend servers handle traffic for a subset of users.
- Make it easy to change the number of shards with zero downtime.
- Don't shard across data centers. Try and keep users local.
- Servers can bring down servers and backups take over. Then you can bring up new servers and data migrated automatically and clients auto detect and go to new servers.
- Different systems should have little knowledge of each other, especially when separate groups are working together.
- Gmail and Orkut don't know about sharding, load-balancing, or fail-over, data center architecture, or number of servers. Can change at anytime without cascading changes throughout the system.
- Abstract these complexities into a set of gateways that are discovered at runtime.
- RPC infrastructure should handle rerouting.
- Everything is abstracted, but you must still have enough knowledge of how they work to architect your system.
- Does your RPC create TCP connections to all or some of your servers? Very different implications.
- Does the library performance health checking? This is architectural implications as you can have separate system failing independently.
- Which kernel operation should you use? IM requires a lot connections but few have any activity. Use epoll vs poll/select.
- Smooth out all spoke in server activity graphs.
- What happens when servers restart with an empty cache?
- What happens if traffic shifts to a new data center?
- Limit cascading problems. Back of from busy servers. Don't accept work when sick.
- Isolate in emergencies. Don't infect others with your problems.
- Have intelligent retry logic policies abstracted away. Don't sit in hard 1msec retry loops, for example.
- Add fault tolerance to every component of the system. Everything fails.
- Add ability to profile live servers without impacting server. Allows continual improvement.
- Collect metrics from server for monitoring. Log everything about your system so you see patterns in cause and effects.
- Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.
- Make sure binaries are both backward and forward compatible so you can have old clients work with new code.
- Build an experimentation framework to try new features.
- Give engineers access to product machines. Gives end-to-end ownership. This is very different than many companies who have completely separate OP teams in their data centers. Often developers can't touch production machines.