advertise
« Sponsored Post: Surge, BioWare, Tungsten, deviantART, Aconex, Hadapt, Mathworks, AppDynamics, ScaleOut, Membase, CloudSigma, ManageEngine, Site24x7 | Main | Stuff The Internet Says On Scalability For June 24, 2011 »
Monday
Jun272011

TripAdvisor Architecture - 40M Visitors, 200M Dynamic Page Views, 30TB Data

This is a guest post by Andy Gelfond, VP of Engineering for TripAdvisor. Andy has been with TripAdvisor for six and a half years, wrote a lot of code in the earlier days, and has been building and running a first class engineering and operations team that is responsible for the worlds largest travel site. There's an update for this article at An Epic TripAdvisor Update: Why Not Run On The Cloud? The Grand Experiment

For TripAdvisor, scalability is woven into our organization on many levels - data center, software architecture, development/deployment/operations, and, most importantly, within the culture and organization. It is not enough to have a scalable data center, or a scalable software architecture. The process of designing, coding, testing, and deploying code also needs to be scalable. All of this starts with hiring and a culture and an organization that values and supports a distributed, fast, and effective development and operation of a complex and highly scalable consumer web site.

Stats as of 6/2011

  • Over 40M monthly unique visitors (Comscore), 20M members, 45M reviews and opinions
  • Over 29 points of sale, 20 languages
  • Our mobile offerings are available on iPhone, iPad, Android, Nokia, Palm, and Windows Phone, attracting 10M monthly users
  • Over 200M dynamic page requests per day, with all static assets such as js, css, images, video, etc served via a CDN
  • Over 2.5B distributed operations (services, database, memcached) are performed each day to satisfy the page requests
  • Over 120GB of log files (compressed) are streamed in each day
  • 30TB of data on Hadoop, projected to hit 100TB early next year- Daily patches for "need to go out today" features/fixes
  • Never, ever, have planned downtime, close to 4 9's of uptime
  • Separate deployment in Beijing to support daodao.com
  • Weekly release cycle, with daily "patches". Development cycle can be one day, can be one week, can be a month
  • Over two dozen small teams (slightly more than 100 engineers) working on over 50 simultaneous projects at a time, with 20-30 of these projects being deployed each week
  • Teams include: SEM, SEO, Content management, CRM, Mobile web, mobile apps, social networking (FB), hotels, restaurants, forums, travel lists, video platform, flights, vacation rentals, business listings, content distribution, API, China, APAC, sales and marketing campaigns, data center operations, dev ops, analytics and warehousing, QA

General Architecture

  • Open Source Linux, Apache, Tomcat, Java, Postgres, Lucene, Velocity, Memcached, JGroups
  • Keep it really simple - easier to build for, debug, deploy, maintain, and operate
  • Bank of very simple, stateless web servers, running simple Java and Velocity templates
  • Each "functional" area (media, members, reviews, travel lists, etc) is packaged as a service
  • Bank of "services" - each service is high level, business oriented API optimized for over the wire performance
  • Assume things fail. Either have plenty nodes in a cluster (web servers, service machines), or have true N+1 redundancy (databases and the datacenter itself)

Flow of control

  • Flow of control is very simple: request URL's are parsed, content is collected from various services, and then applied to a template.
  • Our URL structure has been very well thought out, and has been stable for a very long time, even with site redesigns and code refactorings.
  • Requests are routed through the load balancer to a web server. Requests are distributed at random within a cookie based "pool", a collection of servers. Pools are used for deployment management and A/B testing. Requests are essentially stateless - there is "session" information stored in the browser cookie. Logged in users will fetch additional state from the database.
  • A Java servlet parses the url and cookie info and determines the content it needs, making calls to the various services. The service API's are defined as Java interfaces, and a servlet will make anywhere from 0 to a dozen service requests. Service calls typically take from 2ms to 15ms. Service calls go through a software load balancer with tuneable retry logic that can be defined on a per method basis.
  • Each service has an API that is optimized for the business and/or usage pattern - for example, fetch the reviews for a member, or the reviews for a location. Most services are essentially large, intelligent caches in front of the database, for example, local LRU, memcached, database calls, in memory partial trees/graphs. Jgroups is occasionally used to keep caches in sync when needed. Most services run 4 instances on different shared/physical machines, sometimes service instances are given their own machines.
  • The set of content returned from the service calls is massaged and is organized by the Java code into sets of objects that are passed to a Velocity template (kind of like a weak JSP)
  • There is a variety of logic to select different forms of the servlet and/or the Velocity templates based on context which could include POS, language, features, etc. There is also infrastructure to select different code and template paths for A/B and other testing
  • Our web servers are arranged in "pools" to allow for A/B testing, and control over deployment. Our weekly release process will deploy to a small pool and let the code run for a few hours to make sure everything is working properly before deploying to all pools

Tech

  • Redundant pairs of BigIP, Cisco routers
  • Bank of 64 stateless web servers, Linux/Apache/Tomcat running Java servlets, processing 200M+ requests a day
  • Bank of 40 stateless service machines running ~100 instances of ~25 services, Jetty/Java, processing 700M+ requests a day
  • Postgres, running on 6 pairs (DRDB) machines, processing 700M+ queries a day
  • Memcached, 3 separate clusters, running on web servers, 350GB+. Configurable pools of Spy memcache client - asynchronous, NIO to scale requests. Optimization and other changes were done to spy java client to achieve scale and reliability.
  • Lucene, wrapped in our service infrastructure, 50M documents, 55GB indexes 8M searches a day, daily incremental updates, complete regeneration once a week. - Hide quoted text -
  • JGroups, when there is no other choice, for state synchronization between service machines.
  • Hadoop, 16 node cluster with 192TB raw disk storage, 192CPU, 10GB network
  • Java, Python, Ruby, PHP, Perl, etc used for tools and supporting infrastructure
  • Monitoring - cacti, nagios, custom.
  • Two data centers, different cities, in N+1 configuration, one taking traffic in R/W mode, the other in sync and in R/O mode, ready for fail-over 24/7, regular quarterly swap of active
  • Total of about 125 machines in each datacenter, all standard Dell equipement
  • Logging - 120GB (compressed) application logs, apache access logs, redundant logging of financially sensitive data - both streaming (scribe) and non streaming

Development

  • Fully loaded Mac or Linux machines, SVN, Bugzilla, 30" monitors
  • Hundreds of virtualized dev servers. Dedicated one per engineer, plus extras on demand
  • Weekly deployment cycle - entire code base gets released every Monday
  • Daily patches for those "need to get out today" features/fixes
  • Over 50 concurrent projects being worked on at a time by ~100 engineers, ~25 get deployed each week
  • Engineers work end to end - Design, Code, Test, Monitor. You design something, you code it. You code something, you test it.
  • Engineers work across entire stack - HTML, CSS, JS, Java, scripting. If you do not know something, you learn it.
  • Release process is shared and rotated among various senior engineers and engineering managers
  • Numerous test frameworks available: Unit, Functional, Load, Smoke, Selenium, Load, and a test lab. Its your code, choose what works best for you.
  • Numerous mechanisms to help you deliver top quality features: Design Review, Code Review, Deployment Review, Operational Review

Culture

  • TripAdvisor engineering can be best compared to running two dozen simultaneous startups, all working on the same code base and running in a common distributed computing environment. Each of these teams has their own business objectives, and each team is able to, and responsible for, all aspects of their business. The only thing that gets in the way of delivering your project is you, as you are expected to work at all levels - design, code, test, monitoring, CSS, JS, Java, SQL, scripting.
  • TripAdvisor engineering is organized by business function - there are over two dozen "teams", each responsible for working directly with their business counterparts on SEM, SEO, Mobile, Commerce, CRM, Content, Social Applications, Community, Membership, Sales and Marketing Solutions, China, APAC, Business Listings, Vacation Rentals, Flights, Content Syndication, and others. We do not have the traditional Architect, Coder, QA roles
  • Our Operations team is one team that is responsible for the platform that all of these other teams use: Datacenter, Software infrastructure, DevOps, Warehousing. You can think of Operations as our internal "AWS", delivering our 24/7 distributed compute infrastructure as a service, with code/dev/test/deploy all rolled into one. This team includes two technical operations engineers and two site operatons engineers who are responsible for the datacenters and software infrastructure.
  • Each of the teams operates in the way that best fits their distinct business and personal needs, our process is best described as "post agile/scrum".
  • Culture of responsibility - you own your project end to end, and are responsible for design, coding, testing, monitoring. Most projects are 1-2 engineers.
  • Log and measure - tons of data, lots of metrics
  • Hackers Week - every engineer gets one week per year to work on any project they want. You can team up with others to tackle larger projects
  • Engineering swaps. Pairs of engineers from different teams will swap positions for a few weeks to distribute knowledge and culture
  • Web Engineering Program. A new program for engineers who want to work a few months in many different teams.
  • Summer Fridays - time shift your weeks during the summer and free up your Fridays
  • Yearly charity day - the entire company goes out and contributes their day to a local charity - painting, gardening, etc
  • TripAdvisor Charitable Foundation. Funded with several million dollars, employees can apply for grants to a charitable organization that they are personally involved in.

Random thoughts on what we have learned, how we work

  • Scalability starts with your culture - how you hire, who you hire, and the expectations that you set.
  • Engineers. Hire smart, fast, flexible engineers who are willing to do any type of work, and are excited to learn new technologies. We do not have "architects" - at TripAdvisor, if you design something, your code it, and if you code it you test it. Engineers who do not like to go outside their comfort zone, or who feel certain work is "beneath" them will simply get in the way.
  • Hire people who get enjoyment out of delivering things that work. Technology is a means to an end, not an end to itself.
  • You own your code and its effects - you design, you test, you code, you monitor. If you break something, you fix it.
  • It is better to deliver 20 projects with 10 bugs and miss 5 projects by two days than to deliver 10 projects that are all perfect and on time.
  • Encourage learning and pushing the envelope - everyone who works here will make a number of mistakes the first few months, and will continue to occasionally do so over time. The important thing is how much you have learned and to not make the same mistakes over and over again.
  • Keep designs simple and focused on the near term business needs - do not design too far ahead. For example, we have rewritten our members functionality as we scaled from tens of thousands, to millions, to tens of millions. We would have done a poor job of designing for tens of millions when all we needed was tens of thousands. We delivered sooner with less problems at each stage, with the cost of the rewrites is small in comparison to what we learned at each stage.
  • You can go very far in scaling a database vertically, especially if you minimize use of joins, and especially if you can fit everything into RAM.
  • Shard only when necessary, and keep it simple. Our largest single table has well over 1B rows, and it easily scaled vertically until we needed to update/insert tens of millions of rows a day, hitting write IOP limits. At that point, we sharded it at the service level by splitting it into 12 tables, and currently run it on 2 machines each with 6 tables. We can easily scale it to 3, 4 ,6, and then 12 machines without changing our hash algorithm or data distribution, simply by copying tables. We have not had any measurable performance degradation (read or write), and the code to do this was small, easy to understand, and easy to debug. With over 700M database operations a day, we are not anywhere close to sharding any other tables.
  • Avoid joins when possible. Our content types (member, media, reviews, etc) are in separate databases, sometimes on shared machines, sometimes on its own machine. It is far better to do two queries (get the set of reviews with their member ids, then get all of the member from this set of ids and merge it at the app level) than do a join. By having the data in different databases, it is easy to scale to one machine per database. It is also easier to keep your content type scalable - we can add new content types in a very modular manner as each content type stands alone. This also aligns well with our service oriented approach, where a service is supported by a database.
  • Put end to end responsibility on a single engineer. When one person owns everything (CSS, JS, Java, SQL, scripting), there is no waiting, bottlenecks, scheduling conflicts, management overhead, or distribution of "ownership". More projects/people can be added in a modular way without affecting everyone else.
  • Services. Having a known set of chunky (optimized for the wire) protocols that are aligned to the business and usage patterns makes assembling pages easier, and allows you to scale out each service according to business needs. Big increase in search traffic ? Add more search servers. Also makes you think more carefully about usage patterns and your business.
  • Hardware - keep it really really simple. No fancy hardware - no SAN, proprietary devices (aside for networking equipment) - we run all commodity Dell. Assume any components will fail at any time and either have N+1 design or ample resources to make up for failures. Our bank of web servers can handle significant numbers failing - up to 50%. Databases are N+1 with duplicate hot (DRDB based) failover. Services run multiple instances on multiple machines. Load balancer and router each have a hot spare and auto-failover. We have two entire datacenters in different cities, one in active R/W mode handling all the traffic, the other in up to date, R/O mode ready for traffic at any time. We "fail over" every three months to insure the "backup" is ready at all times, and to provide for our continuous/incremental datacenter maintenance.
  • Software - keep it really, really, really simple. There are systems you do not want to write yourself - Apache, Tomcat, Jetty, Lucene, Postgres, memcached, Velocity. We have built everything else ourselves - distributed service architecture, web framework, etc. It is not hard to do, and you understand and control everything.
  • Process. Less is better. You need to use source code control. You need to be a good code citizen. You need to deliver code that works. You need to communicate your designs and their level of effort. Not much else is "needed" or "required". If you are smart, you will get your design reviewed, your code reviewed, and you will write tests and appropriate monitoring scripts. Hire people who understand that you want these things because they help you deliver better products, not because they are "required". If you make a mistake, and you will, own up to it, and get it fixed. It is also important to find your own mistakes, not rely on others to find them for you.
  • Design Reviews. All engineers are invited to a weekly design review. If you have a project that is going to impact others (database, memory usage, new servlets, new libraries, anything of significance) you are expected to present your design at design review and discuss it. This is not only a great way to provide guidance and oversight over the entire system, it is a great way for everyone to learn from each other and be aware of what is going on.

I'd really like to thank Andy Gelfond for this amazingly useful description of what they are doing over at TripAdivsor. Awesome job. With such attention to detail it's easy to see why TripAdvisor is so dang useful. Thanks. TripAdvisor is hiring.

Related Articles

 

 

Reader Comments (25)

Postgres, running on 6 pairs (DRDB) machines, processing 700M+ queries a day

It's DRBD (http://en.wikipedia.org/wiki/DRBD), right ?

June 27, 2011 | Unregistered CommenterProgga

Engineers work end to end - Design, Code, Test, Monitor. You design something, you code it. You code something, you test it.
Engineers work across entire stack - HTML, CSS, JS, Java, scripting. If you do not know something, you learn it.
The only thing that gets in the way of delivering your project is you, as you are expected to work at all levels - design, code, test, monitoring, CSS, JS, Java, SQL, scripting.

don't you end up with dissimilar code all over the place?
are you allowing your dev to design the way your apps interact with end users? don't you know that dev rarely design good user interfaces?
if the same dev is also responsible for supporting their code throughout its lifetime, don't you eventually get into a situation where that dev/team will do nothing but fix bugs in their existing code and don't have enough time to work on new projects?

June 27, 2011 | Registered Commentermxx

DRBD is correct, thank you Progga

Hi mxx,

Our system is not perfect - every team needs to decide what is important to them and make choices. Yes, we do end up with duplicate code sometimes, but it is not as bad as you would think. We depend on engineers to do the right thing, and we do a lot of code reviews. Dev's are not always responsible for their code during its "lifetime", you are responsible for your project,and various people may come in and do things with your code - this has its pros and cons. System works well for us, and, as importantly, keeps people engaged - you work across the entire stack, and you do your own software design and testing.

And, no, devs do not do UI or graphic design, that ended many years ago after we got enough wtf's on what our site looked like. "design" is the software design. Other teams do UI/graphics.

Andy

June 27, 2011 | Registered CommenterAndy Gelfond

1) Presumably database writes are replicated from the R/W datacenter to the R/O datacenter (through Postgresql 9.0 replication?) How do you handle the replication lag? For example, if I submit a hotel review, it gets written to the R/W datacenter. Now if I visit the page for that hotel again, I may be routed to the R/O datacenter, which might not yet have my newly submitted review because of replication lag. In this case the user might think his review has been lost. How do you handle that?

2) Why do you regenerate the Lucene index every week?

Thanks.

June 27, 2011 | Unregistered CommenterAndy

Hi Andy,

We use Postgres 8.x. Replication is done through an older query replication approach (similar to slony) which we have upgraded over the years (for example, by batching up queries for performance) . The active data center replicates to both the other data center and our "local" office datacenter. There is a lag - it could take a few minutes for everyone to catch up.

However, the data centers are in N+1 configuration - only one (the R/W) is getting traffic. The other (R/O) is live and can take traffic, but DNS points only to the active one - so we do not run into the problem you mentioned as we have two datacenters for reliability, not for scale.

We looked into log and/or streaming replication (I believe this is what Postgres 9 does), but it is not as reliable as we would like - corruption in the data transfer can have a significant effect. With query replication, not so, and with query replication we can also manually tweak the data if needed (which we have needed to do a few times). We also have a datacenter in Beijing which has poor latency and connectivity to it - query replication works better for us given its situation.

Lucene index regeneration - we found that incremental updates, over time, create index fragmentation and performance degradation, so we do a full rebuild once a week to keep it "clean". Also, index changes, whether incremental or full rebuild, are done offline, and we re-start lucene to use new index each day - we never were able to get "real-time" to work. We run a small bank of lucene-wrapped-as-a-service, so there is no downtime doing it this way. It is quite possible that there is a better way of doing this - we have tried a number of approaches, and this one has worked the best so far.

June 28, 2011 | Registered CommenterAndy Gelfond

"Each service has an API that is optimized for the business and/or usage pattern - for example, fetch the reviews for a member, or the reviews for a location. Most services are essentially large, intelligent caches in front of the database, for example, local LRU, memcached, database calls, in memory partial trees/graphs. Jgroups is occasionally used to keep caches in sync when needed. Most services run 4 instances on different shared/physical machines, sometimes service instances are given their own machines."

And:

"
Bank of 40 stateless service machines running ~100 instances of ~25 services, Jetty/Java, processing 700M+ requests a day

Postgres, running on 6 pairs (DRDB) machines, processing 700M+ queries a day
"

I'm a little confused: The services do lots of caching and yet the number of requests per day seems to match the number of queries per day on the databases. I'm guessing that perhaps a large number of the queries to the DB are not related to these services then?

June 28, 2011 | Unregistered CommenterDan Creswell

1. How did you initially load test and choose Tomcat for this type of load ?
2. How do you currently load test your entire stack ?
3. What CI setup and unit tests do you have ?
4. Do you monitor application events using JMX or something else ?Do you transmit log events to a browser over HTTP for monitoring the progress of your batch programs.

Thanks.

June 28, 2011 | Unregistered CommenterMohan Radhakrishnan

Hi there,

Thanks a lot for this awesome description!
Coming from a Web company I was pretty surprised to see Java as a backend. How happy are you with your choice of Java as a backend language? I really like it but I am having troubles to fight against the temptation of more "hype" languages such as Ruby, Python, Scala, JS, etc.

I a word like in thousands would you do it again based on Java and why or why not?

June 28, 2011 | Unregistered Commentermisterdom


are you allowing your dev to design the way your apps interact with end users? don't you know that dev rarely design good user interfaces?
if the same dev is also responsible for supporting their code throughout its lifetime, don't you eventually get into a situation where that dev/team will do nothing but fix bugs in their existing code and don't have enough time to work on new projects?

Developers would generally be handed a mockup from the graphic design and product marketing team. "Here's how it should look". An engineer's responsibility generally ends a week after release when it is clear that there aren't any immediate problems (especially performance problems) related to it. At that point you get to work on something else. Which well indeed might be a bug fix or feature enhancement to somebody else's project. All engineers are expected to be able to work in any are of the code base. There is much less longterm code ownership than I have seen in other places. Because of that, developers end up seeing enough of each others code that the code doesn't end up to dis-similar to read or work with.

June 28, 2011 | Unregistered Commentersteveo

Andy.

Thanks for the explanation.

>> we never were able to get "real-time" to work.

What type of real-time Lucene implementation did you try but couldn't get it to work?

Lucene 2.9 added NRT (Near Real Time). Are you using an earlier version or does Lucene NRT simply not work for you? Would love to learn more about your experience here.

June 28, 2011 | Unregistered CommenterAndy

Andy, you mention that your ops team includes two tech engineers and two site engineers - surely this can't be the whole team?! Could you perhaps provide more insight into the size of the team and how it is made up?

Thanks for a great write up.

Robbo

June 28, 2011 | Unregistered CommenterRobbo

Andy.

Thats an awesome stuff. One of the questions that I had, are you running an 4 9's uptime infrastructure completely with open source architecture? And How do you make sure your application response times are on target. Also, curious to know how do you monitor your infrastructure?

June 28, 2011 | Unregistered CommenterKrishnakanth

I will try to answer as many questions as I can:

Dan:
The number of service calls and database calls being approximately the same is coincidence, and the number of database calls is more of a guess (we log db calls in our Java code), but it is approximately correct. Many of our web pages make multiple service calls and many of our services make multiple database calls, but there are also about 1B+ memcached calls a day that keep the load of the database. Since we cache objects, not queries, many of these objects may represent several database calls. If we did not use memcached, we would probably do 2B+ queries a day. What I have always found interesting is how difficult it is to correlate all the numbers we collect from different source.

Mohan:
- When we started (11 years ago) we went with the standard Linux/Apache/Tomcat/Java/Postgres stack. I was not around then, but I don't think too much structured process was put around the decision, and I don;t think there were many other open source options at that time
- Load testing is an interesting area. We have different test harnesses and tests, a test lab, and sometimes we use our secondary datacenter. Sometimes we use the live datacenter (knocking out 1/2 of the web servers for instance to see what happens). Load testing is done where we feel it is needed, and to be honest, much more of an art. One thing we have learned is that you can never truly load test, even by replaying your log files. I am a big believer in assuming everything fails and making sure you have a way to recover.
- Not familiar with "CI"
- Monitoring is done through Nagios, Cacti, and a lot of custom scripts that parse our logs and display data. I can drill down by servlet, day, and hour and get the total time, cpu, database, service call data (both elapsed time and # of calls) for a page request.

misterdom:
- I am very happy with Java, especially for services. Great thread and state support is essential, imho, for building a performant service. I am not sure how you implement shareable in-process caches and handle threading with these other languages. Our service machines can handle many thousands of requests per second and still run at 15% load. If I were to do things again, I would consider some of the scripting languages for front end work (some of our experiments in this area have shown no real benefits with a number of downsides), but for server work threads and being able to keep state (shared memory) are critical for our use cases.

steveo:
Thanks SteveO :-)

Andy:
We tried incrementally updating the index while it was in use via the api, but had a lot of issues. This was a number of years ago, and I do not remember the specifics. It would be great for us to have NRT, but not critical.

Robbo:
We used to have one tech ops engineers for our two datacenters, but found we needed to add another one :-)
We have openings for two more, as we have plans for bigger build outs.

So, we have:
2 tech ops (datacenter) people for both datacenters and our development environment (lots and lots of xen servers)
2 "site ops" engineers who are responsible for the software infrastructure.
Release engineering is done by our site ops engineers and various technical managers (need to give them something interesting to do)
There are 10+ engineers from around the various teams that have access to the site and help with patches, releases, software development, databases, datacenter, etc.
It is important to how we work that the software teams literally have real "skin in the game" regarding operations.

June 28, 2011 | Registered CommenterAndy Gelfond

"Dan:
The number of service calls and database calls being approximately the same is coincidence, and the number of database calls is more of a guess (we log db calls in our Java code), but it is approximately correct. Many of our web pages make multiple service calls and many of our services make multiple database calls, but there are also about 1B+ memcached calls a day that keep the load of the database. Since we cache objects, not queries, many of these objects may represent several database calls. If we did not use memcached, we would probably do 2B+ queries a day. What I have always found interesting is how difficult it is to correlate all the numbers we collect from different source."

Thanks Andy.

Agreed, correlation is a key challenge in building understanding of our systems. I actually feel there's always going to be some approximation here as the numbers are produced from different layers with different amounts of context and it's the context we use to tie things together. Even simple stuff like using system clocks to determine ordering or the collection of relevant data in a time period is problematic.

Another challenge is capturing the information without impacting user perceived performance. One either avoids capturing data or accepts that some of it may be lost. I tend to favour the latter by default, to have some information is better than none.

Big topic, worthy of beer discussion!

Best,

Dan.

June 29, 2011 | Unregistered CommenterDan Creswell

Hi Krishnakanth:

We use a variety of monitoring techniques, including Gomez for external "third party" monitoring, and custom scripts and tools.
Our "4 9's" is not a uniform, across the board measurement. There is functionality that is important to our business (basic site being up, commerce, and reviews), which is really where the measure comes in. Our site is very complex a lot of different content types and features. If the basic site, commerce, and reviews are running, I consider the site "up". More importantly, so does my boss :-) I am not sure what the uptime is for everything else, but it is probably around 3 nines.

Regarding open source vs "commercial" solutions - fact is, they all fail. Never trust any component, especially SAN's. The more the component claims to be "fully reliable" the more trouble you can get yourself into because you end up trusting it. Design for failure.

Uptime is achieved in a number of ways:

Three basic rules: Simplicity, design around the fact that all components fail, and keep the software engineers and managers on the hook and in the loop for operational responsibility.

- We keep things very simple - commodity hardware, and just the basics in software (Linus/Apache/Tomcat/Java/Postgres/Lucene/Memcached/jgroups). We use only simple systems/software that we understand how they work at a deep level, have been field proven, have good operational support, and more importantly, we know where the warts are so we can manage around them.
- Keep operations as simple and visible (and automated) as possible, but make sure that you can intervene at any step.
- A lot of our engineers are knowledgable in operations, are "on call", and help with the live site and its operations. We have a very small ops team, but the "virtual" ops team is pretty large, and includes many of our managers. There are few engineers on the team that do not feel the "heat" of the live site.
- Assume everything fails. We not only have doubles (N+1 configuration) of all the "key" parts (routers, load balancers, databases, and other "critical" machines), but also banks of web servers and services that can tolerate multiple machines being down. And the entire datacenter is doubled in N+1 configuration. Not only has this saved us a couple of time (which you appreciate when you have a double BigIP failure and can get through this with 5-10 minutes of downtime) , but allows us to do serious machines/network maintenance and upgrades in safety.

June 29, 2011 | Registered CommenterAndy Gelfond

Hi Andy,

how strict is your 3-tiered architecture (web->services->db)? Is that enforced to keep the architecture simple or have you tried reducing the tiers in some places for more local simplicity / manageability?

How do developers work? Getting up some of those 25 services locally, a local database and develop a web server against them?

You said you'd load-balance your services: How do web and service layer communicate? (http, RMI or even Corba?) Why did you decide to software-load-balance in favour of hardware?

Thanks for sharing all this!
Stefan

June 30, 2011 | Unregistered CommenterStefan

Hi Stefan,

Good questions -

Our 3 tier architecture is not enforced. We like to see all db calls coming from services. Unfortunately, due to mostly legacy reasons, and the fact that not all of our major functionality is in services, and sometimes there are some simply things we need to do that are not worth putting together a service, web servers do make database calls. One of those "nice to have projects" that might get bumped in priority when it becomes important to address. We do have, however, what we call "DBR" db read databases. These are read only databases that get updated once a day, and are in a load balanced co figuration (I think we have 3 of them) since we have moved (I believe) all writable operations to services. So writs go to services, most reads do, and there is some data that gets read directly.

We have some common banks of services and databases, so if you are working only on the web tier, all you need is a front end. If you are working on a service, all you need (apart from either a web server or a test harness) is to run your own version of a service - your personal ini file can reach out and mix and match to services anywhere. Changes to service interfaces are required to be backwards compatible so this works out pretty well.

If you need to add/change a table, you do it to one of the shared databases, as these changes are also required to be backwards compatible. It is very rare that someone needs to run their own database server. For projects that for some kind of schema change that is not backwards compatible, there is a whole other process that gets kicked into action because now we are facing a huge live site deployment challenge. This happens occasionally.

Services are XML/HTTP. This was intended to be simple, allow us to see what is going on, and generate test cases easy. However, imho, XML becomes as opqaue as binary formats once you have anything worthwhile to encode. So we now serialize via xstream (huge pros/cons there). Service calls are defined by Java interfaces, and we have a proxy that will take method calls and dispatch them to a server based on configuration settings. Settings determine retry logic down to the method level (how long to wait, how many retires on one machine, how many machines to retry). Load balancing is done by generating ini file settings - for example, a given service will map typically to 4-8 machines, and different web servers will have different orders. If one is not working, it tries the next, and after time, will reset to the original configuration.

A little clunky (it was developed over the years), has its issues, but is very reliable and very very easy to tweak a machine by editing its config file. We can easily test new code for a service on the site by spinning up a new service machine and tweaking one web server to hit it. For local development, this is also very useful.

We went with software load balancing party due to historical incremental development, but also our load balancing needs are complex, need to be tuned, sometimes in ad hoc ways, need to be modified by developers for local use, and we also want to be able to turn logging on to provide details of what is going on. We have had discussions about moving this to a load balancer, but we do not understand how we can do this in a data driven (ini file) manner, allow each developer to have their own settings, or allow a web server to call out to any HTTP service (for example, not on the same network). There is also tremendous fear about changing code on a load balancer and how to properly test it.

June 30, 2011 | Unregistered CommenterAndy Gelfond

Nice post, thanks for your share.

July 1, 2011 | Registered CommenterRamon Li

Thanks, Andy,

especially your load-balancer insight is quite interesting. I am working for a quite similar site, but real estate and German-only, probably a tenth of your page views and we do have hardware-loadbalancers. The result is that this topic is a (distant) blackbox for developers and a sole operational thing, even with devops hardly understood by any dev.

Additionally there is a fair amount of issues related to "a developer could not test properly because development environment does not have Big IP load balancers" that you probably are able to avoid this way. I can only recommend to you to stick with that decision.

Best
Stefan

July 1, 2011 | Unregistered CommenterStefan

Design Review practice seems very interesting.
I agree we don't need "powerpoint" architects still some dev are more experienced. How do you manage of not having big ego clashes resulting in patchwork design?

July 2, 2011 | Unregistered CommenterUberto

Hi Andy,

You've mentioned your content is separated by content type (member, media, reviews, etc), what about your 20 languages? do you have one database per language? that said, do you have 20 databases for each content type? or do you have just one database (including all languages) for each content type?

Best,

Matias

July 6, 2011 | Unregistered CommenterMatias

Hi Uberto,

The type of project you work on and/or get to lead depends on you level experience and knowledge of our systems - this way, if you are new or just out of school, you either get to own your own small project, or work with a more senior engineer on a larger project. As you learn, you get "bigger" projects.

I look at design review the same way as code reviews - you want to do them not because they are expected, but because you truly feel they are good ways to help you deliver the best code. I find that the more successful engineers are the ones who actively seek out code and design reviews.

Regarding egos. A lot of this gets filtered out during the interview process by making it clear how we work - a lot of "very senior" engineers don't want to code or test, and they will move on to other companies. I find that when someone is forced discuss their designs in front of their peers, and that they need to do their own coding and testing, designs tend to be simpler, more practical, and cleaner.

While it is somewhat common that people are asked to go back and reconsider some aspect of their design, it is very rare for there to be a true disagreement. I can think of only twice where I have had to actually step in and make a decision - the team invariably comes to a decision that I either am happy with (I would probably do it the same way) or I am comfortable enough (maybe not the way I would do it, but it works well enough, at least imo :-) ).

I think that this has to do with two things: one is that everyone really wants to do the right thing and people have genuine respect for each others opinions, and we have a tremendous breadth and depth of experience to draw upon.

The second is that our "designs" are focused on approach/operations, which keeps things practical. Fact is, you can make any language work and scale - Python, Perl, Java, PHP, maybe even Ruby :-) - what we focus on is memory usage, where the code lives (web server, service, offline), where the data is stored (database, text files, log files, memory), what goes over the wire, what kind of caching (local, local LRU, memcached, etc), algorithmic complexity, and what the data schema is. We rarely get into classes and class structure, or specific code. And if you want to introduce new tech, you need to do your homework on both the tech and why our current tech cannot be made to work for your use case.

And yes, we sometimes end up with "differing designs" in different parts of the site. It is not as bad as you would think, and in the end, I prefer giving teams flexibility in their approach (within certain boundaries that have been worked out over the years) especially given that the teams need to feel true ownership of their work. Besides, it is surprising what you learn from having different approaches.


Hi Matlias,

Content is not split out by language, it is, however, tagged by language and POS origin.

July 9, 2011 | Registered CommenterAndy Gelfond

Hi Andy,

Thanks for answering.
I got it now, when you said "avoid joins" you're talking about joins between databases, joins between different content type, but obviously inside of each content type database there are a lot of tables (containing all info tagged by POS, language, etc) on which you have to do joins. That said, it's very expensive to do joins between different databases and even more if they are on different machines.
Then, you have your content available in many languages in many locations (maybe?) because you are using DRDB, wich lets you have, as you said, duplicate hot failover and at the same time maybe less latency to different countries because of your servers location. Am I correct?

July 14, 2011 | Unregistered CommenterMatias

Hi Andy,

regarding of content, did you implemented something for change data capture? could you comment something of that? It will be very interesting to know how do you deal with content versioning having thousands or millions of items (hotels, reviews, etc).

Best,

Matias

July 14, 2011 | Unregistered CommenterMatias

I think it's a good idea to allow for some divergent design and code duplication. Otherwise you'd end up with rigid standards, which would not allow your apps/code to evolve anymore. Saw this happening elsewhere, that's why I'm saying. Once rules become more important than creative, live thinking, evolution stops being tied to business.

Besides, the ugly face of DRY is strong coupling. IMO the optimum always lies somewhere in between extremes.

October 22, 2014 | Unregistered CommenterAnonymous Coward

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>