As the Russian rouble exchange rate slumped two years ago, it drove us to think of cutting hardware and hosting costs for the Mail.Ru email service. To find ways of saving money, let’s first take a look at what emails consist of.
Indexes and bodies account for only 15% of the storage size, whereas 85% is taken up by files. So, files (that is attachments) are worth exploring in more detail in terms of optimization. At that time, we didn’t have file deduplication in place, but we estimated that it could shrink the total storage size by 36% since many users receive the same messages, such as price lists from online stores or newsletters from social networks containing images and so on. In this article, I’m going to describe how we implemented a deduplication system under the supervision of Albert Galimov.