advertise
« The Rise of the Virtual Cellular Machines | Main | Going global on EC2 »
Monday
May102010

Sify.com Architecture - A Portal at 3900 Requests Per Second

Sify.com is one of the leading portals in India. Samachar.com is owned by the same company and is one of the top content aggregation sites in India, primarily targeting Non-resident Indians from around the world. Ramki Subramanian, an Architect at Sify, has been generous enough to describe the common back-end for both these sites. One of the most notable aspects of their architecture is that Sify does not use a traditional database. They query Solr and then retrieve records from a distributed file system. Over the years many people have argued for file systems over databases. Filesystems can work for key-value lookups, but they don't work for queries, using Solr is a good way around that problem. Another interesting aspect of their system is the use of Drools for intelligent cache invalidation. As we have more and more data duplicated in multiple specialized services, the problem of how to keep them synchronized is a difficult one. A rules engine is a clever approach.

Platform / Tools

  • Linux
  • Lighty
  • PHP5
  • Memcached
  • Apache Solr
  • Apache ActiveMQ / Camel
  • GFS (clustered File System)
  • Gearman
  • Redis
  • Mule with ActiveMQ.
  • Varnish
  • Drools

Stats

  • ~150 million page views a month.
  • Serves 3900 Request / second.
  • Back-end is runs on 4 blades hosting about 30 VMs. 

Architecture

  • The system is completely virtualized. We have put to use most of VMs capabilities also, like we move VMs across blades when one blade is down or when the load needs to be redistributed. We have templatized the VMs and so we can provision systems in less than 20 minutes. It is currently manual, but in the next version of the system we are planning on automating the whole provisioning, commissioning, de-commissioning, moving around VMs and also auto-scaling.
  • No Databases
  • 100% Stateless
  • RESTful interface supporting: XML, JSON, JS, RSS / Atom
  • Writes and reads have different Paths. 
    • Writes are queued, transformed and routed through ActiveMQ/Camel to other HTTP services. It is used as an ESB (enterprise service bus).
    • Reads, like search, are handled from PHP directly by the web-servers.
  • Solr is used as an indexing / searching engine. If somebody asks for a file giving the key, it is directly served out of storage. If somebody says "give me all files where author=Todd," it hits Solr and then storage. Queries are performed using Apache Solr as our search-engine and we have a distributed setup for the same.
  • All files are stored in the clustered file system (GFS). Queries hit Solr and it returns the data that we want. If we need full data, we hit the storage after fetching the ids from the search. This approach makes the system completely horizontally scalable and there is zero dependency on a database. It works very well for us after the upgrade to latest version. We just run 2 nodes for the storage and we can add few more nodes if need be.
  • Lighty front ends GFS. Lighty is really very good for serving static files. It can casually take 8000+ requests per second for the kind of files we have (predominantly small XMLs and images). 
  • All of the latest NoSQL databases like CouchDB, MongoDB, Cassandra, etc. would just be replacements for our storage layer. None of them are close to Solr/Lucene in search capability. MongoDB is the best in the lot in terms of querying but the "contains" and the like searches needs to be done with a regex and that is a disaster with 5 million docs! We believe our Distributed file-system based approach more scalable than many of those NoSQL database systems for storage at this point.

Future

  • CouchDB or Hadoop or Cassandra for Event analytics (user clicks, real time graphs and trends).
  • Intelligent Cache invalidation using Drools. Data will be pushed through a queue and a Drools engine will determine which URLs need to be invalidated. It will go clear them in our cache engine or Akamai. The approach is like this. Once a query (URL) hits our backend, we will log that query. The logged query will then be parsed and pushed into the Drools system. The system would take that input and create rules dynamically into the system if it is not already existing. That's part A. Then our Content Ingestion system will keep pushing all content it is getting into a Drools queue. Once the data comes in, we will fire all the rules against the content. For every matched rule, generate the URLs and we will give a delete request to the cache servers (Akamai or Varnish) for those URLs. Thats part B. Part B is not as simple as mentioned above. There will be many different cases. For example, we support "NOW", greater than, less than, NOT, etc in the query, those will really give us big headache. 
    • There are mainly 2 reasons we are doing all this, very high cache-hit rates and almost immediate updates to end-users. And remember the 2 reasons have never got along well in the past!
    • I think it will perform well and scale. Drools is really good at this kind of problem. Also on analysis, we figured out the queries are mostly constant across many days. For example, we have close to 40,000 different queries a day and it will be repeating every day in almost same pattern. Only the data will change for that query. So, we could setup multiple instances and just replicate the rules in different systems, that way we can scale it horizontally too. 
  • Synchronous reads, but fewer layers, less PHP intervention and socket connections.
  • Distributed (write to different Shards) and asynchrounous writes using Queue/ESB(Mule).
  • Heavy caching using Varnish or Akamai.
  • Daemons for killing crons and stay more close to real-time.
  • Parallel and background processing using Gearman and automatic process additions for auto-scaling processing.
  • Realtime distribution of content using Kaazing or eJabberd to both end users and internal systems.
  • Redis for caching digests of content to determine duplicates.
  • We are looking at making the whole thing more easily administrable and turn on VMs and process from within the app-admin. We have looked at Apache Zookeeper and looking at RESTful APIs provided by VMWare and Xen and to do the integration with our system. This will enable us to do auto-scaling.
  • The biggest advantage we have is the bandwidth in the data center has not a constraint as we are ISPs ourselves. I'm looking at ways to use that advantage in the system and see how we can build clusters that can process huge amounts of content quickly, in parallel.

Lessons Learned

  • ActiveMQ proved disastrous many times! Very poor socket handling. We use to hit the TCP socket limits in less than 5 minutes from a restart. Though its claimed that its fixed in 5.0 and 5.2, it wasn't working for us. We tried in many ways to make it live longer, like a day at least. We hacked around by deploying old libraries with new releases and made it stay up longer. After all that, we deployed two MQs (message queues) to make sure at least the editorial updates of content is going through OK.
    • Later we figured out that problem was not only that, but using topics was also a problem. Using Topic with just four subscribers would just make MQ hang in a few hours. We killed the whole Topic based approach after huge hair loss and moved them all to a queue. Once the data comes in to the main queue, we push the data in to four different queues. Problem fixed. Of course over period of 15 days or something, it will throw some exception or OOME (out of memory error) and will force us to restart. We are just living with it. In the next version, we are using Mule to handle all of this and clustering at the same too. We are also trying to figure out a way to get out of the dependency in the order of messages, that will make it easier to distribute.
  • Solr
    • Restarts. We have to keep restarting it very frequently. Don't really know the reason yet, but because its has redundancies we better placed than the MQ. We have gone to the extent of automating the restarts by doing a query and if there is no response or time-outs, we restart Solr.
    • Complex Queries. For complex queries the query response time is really poor. We have about 5 million docs and lot of queries do return in less than a second, but when we have a query with a few "NOT"s and many fields and criteria, it takes 100+ secs. We worked around this by splitting the query into more simpler ones and merging the results in PHP space.
    • Realtime. Another serious issue we have is that the Solr does not reflect the changes committed in real-time. It takes anywhere between 4 mins to 10 mins! Given the industry we are in and the competition, 10 mins late news makes us irrelevant. Looked at Zoie-Solr plugin but our Ids are alpha-numeric and Zoie doesn't support that. We are looking at fixing that ourselves in Zoie.
  • GFS Locking issue. This used to be very serious issue for us. GFS will lock down the whole cluster and it will make our storage completely inaccessible. There was an issue with GFS 4.0 and we upgraded to 5.0 and it seems to be fine from then.
  • Lighty and PHP do not get along very well. Performance wise both are good but Apache/PHP is more stable. Lighty goes cranky some times with PHP_FCGI process hanging and CPU usage goes to 100%.

I'd really like to thank Ramki for taking the time write about how their system works. Hopefully you can learn something useful from their experience that will help you on your own adventures. If you would like to share the architecture for your fabulous system, both paying it forward and backward, please contact me and we'll get started.

 

Reader Comments (17)

Quite Interesting Info !!! Thx for Sharing !!! 3900 Requests per second is awesome Lot.

What is the Targeted Response Time ?

May 10, 2010 | Unregistered CommenterRaghuraman

Ramki,

Lighty used to have memory leaks. Nginx or Cherokee might be worth looking at. Consider MogileFS as a GFS replacement -- GFS is more of a POSIX file-system replacement but MogileFS might be "enough" and should be more scalable long term. Solr -- perhaps you need to shard your search engine, but it might be more work than it's worth. ;-) With ActiveMQ, have you considered transferring some of your lighter queuing needs over to Redis? It has atomic constructs that can be used for some types of queuing, and it will almost certainly be faster.

Thanks for such a great writeup!

Jamie

May 10, 2010 | Unregistered CommenterJamie

Exciting post! It's good to see someone build things practically rather than following the buzzwords.

What these "NoSQL" document stores add to your straightforward distributed file system is exactly the ability to keep searches in sync with the data. It's interesting to see that you actually pay a high price for this feature in comparison to "stupid" document stores (a file system). To me it seems that the lynchpin that keeps this working is that delay you get between Solr searches and your data. How much time are you willing to allow?

With this in mind, it seems like a good idea to me to use CouchDB as your distributed file system. I can't believe it would be any slower than whatever tool you are using. And... continue to use Solr! Fall back to CouchDB "views" only when you need synchronous searches.

This seems like almost the opposite of "NoSQL" wisdom, but it seems like the good practical conclusion from your experience.

May 10, 2010 | Unregistered CommenterTal

Are you handling 3900 Request / second directly on your 4 blade system or does that also include Akamai traffic?
How do you handle load-balancing??

"Back-end is runs on 4 blades hosting about 30 VMs. "

what kind of 'front-end' infrastructure do you have?
What kind of VM are you running? VMWare? Xen? KVM?

May 10, 2010 | Registered Commentermxx

Sounds like they're using HP Blade Matrix with VMware.

May 11, 2010 | Unregistered CommenterSal D.

"150 million page views a month"
"Serves 3900 Request / second."

Do the maths: 3900 * 60 *60 = 14.040.000 request per hour
At 8 hours per day : 14.040.000*8 = 112.000.000 per day
So, there is definitely something wrong with one of the values above

May 11, 2010 | Unregistered Commenterrb

Raghuraman: What is the Targeted Response Time ?

Ramki: We don't have a need to perform beyond thousands of rps in our backend system as many parts of the frontend system is less faster. Afterall, the weak link in the system defines the final reqs/s! It would just be brilliant if we can get the front-end to be around the same numbers!

Jamie:

Lighty used to have memory leaks. Nginx or Cherokee might be worth looking at.

Ramki: Lighty has been performing great for us till now though there are cases where lighty gives us headache like the ones mentioned above. Nginx is a nice alternative, we can try that.

Jamie: Consider MogileFS as a GFS replacement -- GFS is more of a POSIX file-system replacement but MogileFS might be "enough" and should be more scalable long term.

Ramki: The first version of the system was running in MogileFS and soon became a big problem for us. Given to me, storing the meta info of files in DB was a disaster. In just couple of weeks, the DB records were close 1.5 million. The queries were becoming very slow... But I loved the idea of WebDAV based approach, so we tried without MogileFS but our LB couldn't support HTTP 1.1 then, so we killed the whole idea and moved it to file system.

Jamie: Solr -- perhaps you need to shard your search engine, but it might be more work than it's worth. ;-)

Ramki: Yes, we have this on our cards.

Jamie: With ActiveMQ, have you considered transferring some of your lighter queuing needs over to Redis? It has atomic constructs that can be used for some types of queuing, and it will almost certainly be faster.

Ramki: We are kinda too deep into ActiveMQ. It mainly does routing, service abstractions and guarantees delivery for us. I wouldn't bet on redis for those. Hope I answered your question.

Tal: What these "NoSQL" document stores add to your straightforward distributed file system is exactly the ability to keep searches in sync with the data. It's interesting to see that you actually pay a high price for this feature in comparison to "stupid" document stores (a file system). To me it seems that the lynchpin that keeps this working is that delay you get between Solr searches and your data. How much time are you willing to allow?

Ramki: As mentioned in the post, the DFS is used as a key-value store. And we have setup MQ to write to storage first and then to Solr.While pushing into Q makes it async, the movement of data is sequential inside the Q. That way, there is no way you can get search to return something thats not there in the Storage. Also, given that the storage nodes are in local neighborhood, the data is reflected immediately across. Hope I have answered your question.


Tal: With this in mind, it seems like a good idea to me to use CouchDB as your distributed file system. I can't believe it would be any slower than whatever tool you are using. And... continue to use Solr! Fall back to CouchDB "views" only when you need synchronous searches.

Ramki: We are trying with CouchDB for analytics. We are just around 1000 reqs/ sec in dev env. With the storage and lighty setup we are able to do 8000+ reqs/s casually. Ofcourse, 1000 reqs / s is very good number too.

Tal: This seems like almost the opposite of "NoSQL" wisdom, but it seems like the good practical conclusion from your experience.

Ramki: Not sure if it is opposite of NoSQL. In principle we are closer to it. For eg: there is no "sql" :), there is almost no schema. I say almost because Solr forces us to have some schema, Distributed and key-value like storage.

Maxim: Are you handling 3900 Request / second directly on your 4 blade system or does that also include Akamai traffic?

Ramki: No Akamai here. Its all our backend which is not on akamai. The moment we get our caching engine and rules approach for url invalidation, we will be way way beyond 3900 r/s. Our dev test was 8500+ r/s with varnish. But remember, these numbers may not be exactly what we will get from the outside world, given to the latencies...

Maxim: How do you handle load-balancing??

Ramki: This is handled in a mind-bending / brain-numbing way now :). We have limited resources, but we need "No SPOF" and so we kinda loop within the same LB for different layers. We will probably have to live with this approach given the cost of H/W LBs. We are looking at S/W LB as alternatives but we need blades to deploy them!! :)

Tal: what kind of 'front-end' infrastructure do you have?

Ramki: We are migrating to VM based environments here too. But in like few months, we will be completely in VM and running on top of the system that we building now. The old system will be retired.


Hope I have answered the questions.

Please do keep the suggestions coming. I would love to have the community's feedback.

-ramki

May 11, 2010 | Unregistered CommenterRamki

Hi,
One thing I have investigated is the 3900 req per sec.

Your daily page value is 1544943 (http://www.websiteoutlook.com/www.sify.com).

so assuming 5 page view per user will give around 300K user per day or 12,500 user per hr OR 208 user per min.

or roughly 3 user per sec hitting the server.

so my question is can 3 users can generate 3900 request per sec ?

May 11, 2010 | Unregistered CommenterGK

RB & GK: As mentioned earlier, this whole article is about the backend and not the frontend. As you can see in my comment above "It would just be brilliant if we can get the front-end to be around the same numbers!"

The backend does not only get hit from the frontends. There are lot of Direct API calls to the backend.

So, the calculations above do not translate exactly to backend hits.

That said, we are working on the frontend already, hope we will hit numbers close to the backend.

If you guys do have some inputs on the same, it will be wonderful.

-ramki

May 11, 2010 | Unregistered CommenterRamki

The spread of tools is very interesting. What is your recommendation for a file system replicator ? How do you do that ?
We have a Windows SAN storage in one city. We want to replicate the files added and removed in real-time to another Windows box in another city. The files should be replicated only when completely written.

May 12, 2010 | Unregistered CommenterMohan Radhakrishnan

Jamie: "Lighty used to have memory leaks."

Used to have... in the past... some time ago... what is the point of that?
Apache had some problems in the past, nginx had some problems in the past.
Software is not perfect and gets fixed through updates.
I run a bunch of lightys which handle about the same amount of requests (8000 req/s) for static files without sweating.
And a few hundret req/s for PHP without problems aswell.

What a lot of people mistake for memory leaks is that Lighty buffers requests from backends such as PHP. So if you send 10mbyte data from PHP though Lighty, it will use 10mbyte. And it will re-use this memory for later requests.
Why it does that? Well, it helps speed up requests because you only have a limited number of PHP backends which would otherwise get occupied by slow clients. It frees PHP "slots" faster.
I'll take that behaviour over less ram usage any time.
The only thing to keep in mind is to not send big files over PHP through lighty which is a bad idea for several other reasons too anyways (use X-sendfile header for that or serve them directly).

So instead of going by "uh but I heard someone had a problem in the past.." in order to make decisions, one has to look at how it works *right now* and for your usecase.

Lesson I want to bring across is: use the tool that serves your task best right now (and in future of course).

I have seen those 100% CPU cases in PHP before, it's an odd one and in some cases updating PHP helped. Unfortunately PHP's FastCGI interface is not as stable as the mod_php one. But it workes reasonably well.

Ramki: Thanks for sharing your architecture and lessons learned, The database-less approach I found particularily interesting and refreshing. Good insight into the NoSQL idea probably not being a better fit. Kudos.

May 12, 2010 | Unregistered Commenterfrost

GK,

I'm not sure about the websiteoutlook stats but they appear a little low (by factor of 10) for a couple of sites I know audited numbers for.

Anyway 150 million page views per month (3 times the websiteoutlook figure) is an average of 57 page views a second. 3900 hits/second is 68 per page view ( Your 3 (really 3.57) users x 5 pages/visit x factor 3 undercount ) . The front page has a good 80 elements and even with a good cache rate it doesn't seem a big multiplier to take account of peak periods and a front end request giving rise to multiple back end requests.

The broad numbers look pretty plausible to me.

May 12, 2010 | Unregistered CommenterSimon

About the people doing pageview vs request calculations:

It's pretty common to count a pageview as a fully rendered page for a user which in turn might translate to anywhere between 1 and 50 requests against a backend, depending on the amount of backend information sources needed to render the page.

You can't calculated with them because you'd be calculating with apples and pears.
the views/month are a total over 30 days, the 3900 req/s are a peak value at a certain moment

Based on the amount of views and some assumptions on usage you can calculate an approximate average concurrent users which will never ever come close to the amount of concurrent users at peak.

Which version of GFS are you running GFS1 or GFS2 ?? And how much data are you storing in the filesystem ?
We used to store much of our data in GFS(1) but moved away from it due to scaleability issues.

Main problems where locking, inability to scale out the filesystem without downtime and bad/non-existing support for replication/redundancy. Since then we've grown to a whopping data volume of approx. 1 PetaByte.
I would also be curious about the underlying storage hardware you use to store the GFS data on, iscsi, fiber, AoE ?

Ramon

May 12, 2010 | Unregistered Commenterramon

Mohan Radhakrishnan: What is your recommendation for a file system replicator ? How do you do that ? We have a Windows SAN storage in one city. We want to replicate the files added and removed in real-time to another Windows box in another city. The files should be replicated only when completely written.

Ramki: We have been trying to solve this problem for quite some time now. As of now, we are running on only one datacenter, and the GFS cluster is mounted on 2-3 nodes in the local network and so every write is immediately visible to other nodes. Given that, we do not have a replication problem.

But we are already looking at options for deploying the application across two cities in India which are about 800 miles away and we want an Active-Active setup. With the context, we asked the storage providers to give real-time replication between the 2 cities. As ISPs, our latencies are relatively small (I thought), about 40 ms. None of the storage providers could do near real-time even with that latency. There was one company which said we can do it and we asked them to prove it. They deployed the box and it was a catastrophic failure! Ofcourse, this is all async replication, sync replication would be best way to bring down both your sites simultaneously! Jus kidding! :) Sync locks the source filesystem, syncs to the other site, other site locks the file system, write, confirms to source, source site releases the lock. In this duration the client who wrote the file is waiting... You get the point I guess.

To conclude:
- No storage providers can do it if your latency is more than 5 ms and real-time. And even if the latency is less for you, ask them to prove it first.
- Build your own replication system. Be async.

What we are looking at doing now is:

We already push the content to MQ. Now the content will be pushed to the other Queue in other city too. That way the data is replicated in both places almost immediately. And then we will accept the slight delays that will be there. I'm thinking the delay will be less than 5 secs which is way better than Storage providers recommended 10-15 mins replication. We have to test this ourselves anyway. One big advantage we have is, even the delete of a file is XML request, so that goes through the Q too.

For big binary files (Videos & Images), we take a slightly different approach. All of it written to temp directory and a daemon moves the file to the appropriate directory. This will give us little headache when we have 2 cities I think. Don't have a thought other than ESB / MTOM based approach and push the meta-data through queue... No concrete thought yet.


Ramon: Which version of GFS are you running GFS1 or GFS2 ?? And how much data are you storing in the filesystem ?
We used to store much of our data in GFS(1) but moved away from it due to scaleability issues.

Main problems where locking, inability to scale out the filesystem without downtime and bad/non-existing support for replication/redundancy. Since then we've grown to a whopping data volume of approx. 1 PetaByte.


Ramki: We ran into the same issues with GFS1 and we are on GFS2 now. But we still have the scaling problem when we have to run the nodes on VMs. Not more than 2 VMs are supported, 3rd node added the cluster sometime hangs. Maybe that was also GFS1 issue and we are living with fear of hanging... Not really sure why this is. Would you have some insight into it? The current data size is about 1.5 Tera. We havenot migrated our old content to the new system yet.

Ramon: I would also be curious about the underlying storage hardware you use to store the GFS data on, iscsi, fiber, AoE ?

Ramki: iSCSI

Hope that answers.

-ramki

May 13, 2010 | Unregistered CommenterRamki

You are also providing email to the users. Do you also keep authentication records without a Database?

Also, which mail server do you guys use?

May 16, 2010 | Unregistered CommenterShahid Butt

Hi Ramki,

Was wondering whether you guys looked at AWS (http://aws.amazon.com) or not? Would running on them simplify a lot of the VM management work you guys are doing?

They also provide a load balancing and auto scaling solution (http://aws.amazon.com/autoscaling/) and queue service (http://aws.amazon.com/sqs/).

May 24, 2010 | Unregistered CommenterTushar

Hi Tushar,

We looked at AWS for other projects. Biggest problem we have is the latency. We cannot live with that kind of latency.

Other than that:
- We have our own pipes
- We have our own Datacenters across cities.
- We already have many parts of our infra.

So it doesn't really make sense in our case.

-ramki

May 27, 2010 | Unregistered CommenterRamki

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>