Mollom Architecture - Killing Over 373 Million Spams at 100 Requests Per Second
Tuesday, February 8, 2011 at 12:18PM
Todd Hoff in Example

Mollom is one of those cool SaaS companies every developer dreams of creating when they rack their brains looking for a viable software-as-a-service startup. Mollom profitably runs a useful service—spam filtering—with a small group of geographically distributed developers. Mollom helps protect nearly 40,000 websites from spam, including one of mine, which is where I first learned about Mollom. In a desperate attempt to stop spam on a Drupal site, where every other form of CAPTCHA had failed miserably, I installed Mollom in about 10 minutes and it immediately started working. That's the out-of-the-box experience I was looking for.

From the time Mollom opened its digital inspection system they've rejected over 373 million spams, and in the process they've learned that a stunning 90% of all messages are spam. This spam torrent is handled by only two geographically distributed machines that handle 100 requests/second, each running a Java application server and Cassandra. So few resources are necessary because they've created a very efficient machine learning system. Isn't that cool? So, how do they do it?

To find out I interviewed Benjamin Schrauwen, cofounder of Mollom, and Johan Vos, GlassFish and Java enterprise expert. Proving software knows no national boundaries, Mollom HQ is located in Belgium (other good things from Belgium: Hercule Poirot, chocolate, waffles).

Statistics

Platform

What is Mollom?

Mollom is a web service for filtering out various types of spam from user generated content: comments, forum posts, blog posts, polls, contact forms, registration forms, and password request forms. Spam determination is not only based on the posted content, but also on the past activity and reputation of the poster. Mollom's machine learning algorithms act as your 24x7 digital moderator, so you don't have to. 

How is it used?

Applications like Drupal, for example, integrate Mollom using a module that installs itself into content-editing integration points so content can be checked for spam before being written to the database. The process looks like:
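The integration flow described above can be sketched roughly as follows. This is a minimal illustration, not Mollom's actual client API: `check_content`, its keyword rules, and the post/database shapes are all stand-ins, assumed purely for the example. A real client would send the content plus poster details to Mollom's service and receive a classification back.

```python
# Hypothetical sketch: intercept a submission, ask a spam-check service
# for a verdict, and only write ham to the database.

def check_content(post):
    """Stand-in for a call to a content-check API such as Mollom's.
    Here we classify on trivial keyword rules purely for illustration;
    the real service uses machine learning plus poster reputation."""
    spam_words = {"viagra", "casino", "cheap pills"}
    text = post["body"].lower()
    return "spam" if any(word in text for word in spam_words) else "ham"

def handle_submission(post, database):
    """The module's hook: runs before the post is persisted."""
    if check_content(post) == "ham":
        database.append(post)   # accepted: written to the database
        return True
    return False                # rejected: never stored

# Usage
db = []
handle_submission({"body": "Great article, thanks!"}, db)   # accepted
handle_submission({"body": "Buy cheap pills now"}, db)      # rejected
```

The key design point is that the check happens synchronously in the write path, so spam never reaches the site's database at all.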

Dashboard

Mollom includes a pretty nifty dashboard for each account that shows you how much ham has been accepted and how much spam has been rejected. The amount of spam that one sees in the graph is really depressing. 

Operations Process

Installation. For Drupal, installation is quite easy. Install it like any other module. Create an account on the Mollom website. Get a pair of security keys, configure these keys into the module, and select which parts of the system you want to protect with Mollom. That's about it.

Daily. I check regularly to see if spam has gotten through. It's not 100% accurate, so some spam does get through, but very little. If spam does get through there's a way to tell Mollom that the post was really spam and should be deleted. This is what you would have to do anyway, but in the process you are helping train Mollom's machine learning algorithms about what is spam.
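The feedback loop described above can be sketched like this. Everything here is an assumption for illustration: `send_feedback`, its payload, and the data structures are hypothetical stand-ins, not Mollom's actual API. The point is that deleting a spam post locally doubles as a training signal for the classifier.

```python
# Hypothetical sketch of the moderator feedback loop: deleting a post
# that slipped through also reports the correction back to the filter.

def send_feedback(service_log, post_id, label):
    """Stand-in for reporting a moderator correction to the spam service."""
    service_log.append({"post_id": post_id, "label": label})

def mark_as_spam(site_db, service_log, post_id):
    """Delete the post locally (as you would anyway) and, as a side
    effect, tell the service it misclassified this post as ham."""
    site_db.pop(post_id, None)
    send_feedback(service_log, post_id, "spam")

# Usage
posts = {42: "W1n a fr33 prize!!!"}
feedback = []
mark_as_spam(posts, feedback, 42)
```

This design means moderators never do extra work to train the filter; the training data falls out of the moderation they were already doing.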

Allows for anonymous user interaction. With a good spam checker it's possible to run a site where people can interact anonymously, which many users of certain types of sites strongly prefer. Once you require registration, engagement goes way down—and registration doesn't stop spammers anyway.

Not Everything is Rosy

Dealing with false positives is Mollom's biggest downside. Spam detection is a difficult balancing act between rejecting ham and accepting spam. Mollom's machine learning algorithms seem to work quite well, but sometimes a good post is rejected and the poster gets the dreaded: Your submission has triggered the spam filter and will not be accepted. Currently there is no recourse. Few things anger a user more than having their glorious comment rejected as spam when it's obvious, to a human, that it's not. A user will only try a few times to get around the problem before they simply give up and walk away.

Worse, there is currently no way to correct these mistakes. To protect the machine learning algorithms from being gamed, Mollom does not let you submit a wrongly rejected piece of content as an example that should have been accepted, though they are working on adding this in the future.

It's a tough decision. Static CAPTCHA systems—systems that only require a user to pass a test before submitting content—simply do not work once a site has been targeted for serious attack. User registration doesn't work. Moderating every post imposes a heavy burden, especially for "hobby" sites, given that a site can receive thousands of spams a day. And spam completely kills a site, so the risk of angering some users has to be balanced against ending up with no users at all because of a bombed-out site.

Business Model

Architecture

Future Directions

Lessons Learned 

Related Articles  

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.