Amir Salihefendic, founder of Todoist and Wedoist, has written an insightful post, "How to create very reliable web services", on strategies for building reliable web services and the tools needed to make it happen.
- Monitor everything in real time. Build a clear picture of what's going on at any time, both past and present. Keep a log of errors and key application metrics. Visualize response times and other metrics for every computer in your network. Tools: statsd, Graphite, Pingdom, Cacti, Nagios.
- Be proactive. Don't optimize prematurely, but don't wait for a crisis to start optimizing either. Anticipate problems before they happen by monitoring and thoroughly understanding your system. Think now about how you would scale your system if load increased by several orders of magnitude.
- Be notified when crashes happen. Use tools like Pingdom and crash_hound to send notifications when problems do occur.
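As a rough illustration of the monitoring and notification points above, here is a minimal Python sketch using only the standard library. It is not taken from the article. It assumes a statsd daemon listening on its default UDP port (8125) on localhost and uses the plain statsd line format `name:value|type`. The names `format_metric`, `send_metric`, `timed`, and `check_and_notify` are hypothetical helpers invented for this example, and `notify` stands in for whatever alerting channel you actually use (email, SMS, a tool like crash_hound).

```python
import socket
import time
from contextlib import contextmanager

# Assumption: a statsd daemon is listening on its default UDP port on localhost.
STATSD_ADDR = ("127.0.0.1", 8125)


def format_metric(name, value, metric_type):
    """Build a statsd line: <name>:<value>|<type> ("c" = counter, "ms" = timer)."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, addr=STATSD_ADDR):
    """Fire-and-forget UDP send; a monitoring failure should not break the request."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(line.encode("ascii"), addr)
    except OSError:
        pass


@contextmanager
def timed(name, send=send_metric):
    """Record the response time of a block, plus an error count if it raises."""
    start = time.monotonic()
    try:
        yield
    except Exception:
        send(format_metric(f"{name}.errors", 1, "c"))
        raise
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send(format_metric(f"{name}.response_time", elapsed_ms, "ms"))


def check_and_notify(check, notify):
    """Run a health check; call notify(message) if it fails or raises."""
    try:
        ok = check()
    except Exception as exc:
        notify(f"check raised: {exc!r}")
        return False
    if not ok:
        notify("check failed")
    return bool(ok)
```

A request handler could wrap its body in `with timed("web.index"):` so that response times and error counts reach Graphite through statsd, while a periodic job could call `check_and_notify` with a function that probes the service and a `notify` callback that pages someone.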
While these are not secrets in any sense, the article is well written and has many useful details. Worth a look.