Java

Todd Hoff's picture

Yandex Architecture

Update: Anatomy of a crash in a new part of Yandex written in Django. Writing to a magic session variable caused an unexpected write into an InnoDB database on every request. Writes took 6-7 seconds because of index rebuilding. Lots of useful details on the sizing of their system, what went wrong, and how they fixed it.

Yandex is a Russian search engine with 3.5 billion pages in their search index. We only know a few fun facts about how they do things, nothing at a detailed architecture level. Hopefully we'll learn more later, but I thought it would still be interesting. From Allen Stern's interview with Yandex's CTO Ilya Segalovich, we learn:

Mailinator Architecture

Update: A fun exploration of applied searching in How to search for the word "pen1s" in 185 emails every second. When indexOf doesn't cut it you just trie harder.
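
The "trie harder" idea can be sketched as follows. This is a minimal illustration, not Mailinator's actual code: the class name KeywordTrie and its methods are hypothetical, and the scan simply restarts from every position rather than using a more advanced algorithm such as Aho-Corasick. It shows how one pass over a trie checks many keywords at once, where indexOf would need one pass per keyword.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal keyword trie (hypothetical sketch): finds which of many keywords occur in a text. */
class KeywordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String keyword; // non-null when a keyword ends at this node
    }

    private final Node root = new Node();

    /** Adds one keyword to the trie, one character per level. */
    void add(String keyword) {
        Node node = root;
        for (int i = 0; i < keyword.length(); i++) {
            node = node.children.computeIfAbsent(keyword.charAt(i), c -> new Node());
        }
        node.keyword = keyword;
    }

    /** Returns every keyword occurrence in the text, ordered by start position. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            // Walk the trie for as long as the text keeps matching some keyword prefix.
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.keyword != null) {
                    hits.add(node.keyword);
                }
            }
        }
        return hits;
    }
}
```

The cost per start position is bounded by the longest keyword, regardless of how many keywords are stored.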

Has a drunken friend ever inspired you to create a first of its kind internet service that is loved by millions, deemed subversive by thousands, all while handling over 1.2 billion emails a year on one rickety old server? That's how Paul Tyma came to build Mailinator.

Mailinator is a free no-setup web service for thwarting evil spammers by creating throw-away registration email addresses. If you don't give web sites your real email address they can't spam you. They spam Mailinator instead :-)

I love design with a point-of-view and Mailinator has a big giant hairy one: performance first, second, and last. Why? Because Mailinator is free and that allows Paul to showcase his different perspective on design. While competitors buy big iron to handle load, Paul uses a big idea instead: pick the right problem and create a design to fit the problem. No more. No less. The result is a perfect system architecture sonnet, beauty within the constraints of form.

How does Mailinator carry out its work as a spam busting super hero?

Google Architecture

Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build a higher-performing, higher-scaling infrastructure to support their products. How do they do that?

Future of EJB3 !! ??

What is the future of EJB3 in the industry, given the current trends?
There are a lot of arguments about EJB3 being heavyweight...
Also, what could be the alternatives to EJB3?
How about scalability, persistence, performance and other factors?

Database-Clustering: a8cjdbc - update: version 1.3

The new version of a8cjdbc removes some limitations. Clobs and Blobs are now supported, and there are some fixes for handling binary data. This version was also fully tested with Postgres and MySQL.

Since version 1.3 there is also a free trial version available for download. Check it out and test it yourself...

Take a look at: http://www.activ8.at/homepage/en/a8cjdbc.php

I've downloaded the latest version and set up an environment with one virtual database and two database backends.
I tried a "non-real-life scenario": the first backend was a Postgres node, the second was a MySQL node.
Everything works fine (failover, recovery log, etc.) with two different backend database types.
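
To illustrate the general idea of one virtual database in front of two backends, here is a rough sketch. It is not a8cjdbc's implementation or API: the Backend interface and VirtualDatabase class are hypothetical, and a real driver would wrap JDBC connections and handle transactions. The sketch mirrors each write to all healthy backends, stops using a backend that fails, and keeps a per-backend recovery log of the writes it missed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical backend; a real one would wrap a JDBC connection. */
interface Backend {
    void execute(String sql) throws Exception;
}

/** Sketch of a virtual database that mirrors writes to several backends. */
class VirtualDatabase {
    private final List<Backend> backends;
    private final List<List<String>> recoveryLogs = new ArrayList<>(); // missed writes per backend
    private final boolean[] down; // true while a backend is out of rotation

    VirtualDatabase(List<Backend> backends) {
        this.backends = backends;
        this.down = new boolean[backends.size()];
        for (int i = 0; i < backends.size(); i++) {
            recoveryLogs.add(new ArrayList<>());
        }
    }

    /** Applies a write to every healthy backend and returns how many accepted it. */
    int write(String sql) {
        int applied = 0;
        for (int i = 0; i < backends.size(); i++) {
            if (down[i]) {
                continue;
            }
            try {
                backends.get(i).execute(sql);
                applied++;
            } catch (Exception e) {
                down[i] = true; // fail over: stop using this backend
            }
        }
        if (applied == 0) {
            throw new IllegalStateException("no backend available");
        }
        // Record the write for every backend that is currently down.
        for (int i = 0; i < backends.size(); i++) {
            if (down[i]) {
                recoveryLogs.get(i).add(sql);
            }
        }
        return applied;
    }

    /** Replays missed writes to a recovered backend and puts it back in rotation. */
    void recover(int index) throws Exception {
        List<String> log = recoveryLogs.get(index);
        for (String sql : log) {
            backends.get(index).execute(sql);
        }
        log.clear();
        down[index] = false;
    }
}
```

Because the sketch only forwards SQL strings, nothing in it depends on both backends being the same database product, which is the point of the mixed Postgres and MySQL test.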

So check out the trial version, test the clustered driver yourself, and let me know about your experience with a8cjdbc.
As I only tested MySQL and Postgres (and the non-real-life scenario with two different backend types), maybe someone else has experience with other databases?

greetings
Wolfgang

Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?

eBay Architecture

Update: eBay Serves 5 Billion API Calls Each Month. Aren't we seeing more and more traffic driven by mashups composed on top of open APIs? APIs are no longer a bolt on, they are your application. Architecturally that argues for implementing your own application around the same APIs developers and users employ.

Who hasn't wondered how eBay does their business? As one of the largest, most heavily loaded websites in the world, it can't be easy. And the subtitle of the presentation hints at how creating such a monster system requires true engineering: Striking a balance between site stability, feature velocity, performance, and cost.

You may not be able to emulate how eBay scales their system, but the issues and possible solutions are worth learning from.

Flickr Architecture

Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

a8cjdbc - Database Clustering via JDBC

Practically no software project nowadays could survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data usually grows exponentially. So you start moving the DBMS to a separate server to gain more speed and capacity. That is all good and healthy, but you do not gain any extra safety for this business data. You might be backing up your database once a day so that if the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose?

LinkedIn Architecture

Hi,

An interesting post on LinkedIn architecture:

http://furiouspurpose.blogspot.com/2007/11/qcon-linkedin-architecture.ht...

ID generator

Hi,

I would like feedback on an ID generator I just made. What positive and negative effects do you see with it? It's programmed in Java, but could just as easily be written in any other typical language. It's thread safe and does not use any synchronization. When testing it on my laptop, I was able to generate 10 million IDs in about 15 seconds, so it should be more than fast enough.

Take a look at the attachment (I had to rename it from IdGen.java to IdGen.txt to attach it).

IdGen.java
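
The attachment itself is not reproduced here. One common way to get a thread-safe generator without synchronized blocks is an atomic counter, so the sketch below assumes that approach; it is not the poster's IdGen.java, and the class name AtomicIdGenerator and its methods are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical lock-free ID generator backed by an atomic counter. */
class AtomicIdGenerator {
    private final AtomicLong next;

    /** Starts the sequence at the given first ID. */
    AtomicIdGenerator(long firstId) {
        this.next = new AtomicLong(firstId);
    }

    /** Returns a new ID; safe to call from many threads without locking. */
    long nextId() {
        return next.getAndIncrement();
    }
}
```

IDs from this sketch are unique only within one JVM and restart from the first ID when the process restarts, so a generator shared across servers would need something more, such as a per-node prefix.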

Paper: Dynamo: Amazon’s Highly Available Key-value Store

Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated.

Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues.

Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions of customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of an RDBMS, it's an important technology to understand.

From the abstract you can get a feel for what the paper is about:

Why most large-scale Web sites are not written in Java

There is a lot of information in the blogosphere describing the architecture of many popular sites, such as Google, Amazon, eBay, LinkedIn, TypePad, Wikipedia and others.

I've summarized this issue in a blog post here

I would really appreciate your opinion on this matter.

Secrets to Fotolog's Scaling Success

Fotolog, a social blogging site centered around photos, grew from about 300 thousand users in 2004 to over 11 million users in 2007. Though they initially experienced the inevitable pains of rapid growth, they overcame their problems and now manage over 300 million photos, with 800,000 new photos added each day. Generating all that fabulous content are 20 million unique monthly visitors and a volunteer army of 30,000 new users each day. They did so well a very impressed suitor bought them out for a cool $90 million. That's scale meets success by anyone's standards. How did they do it?

Amazon Architecture

This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included.

Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles.

GoogleTalk Architecture

Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small low latency messages and integrating with many other systems. How do they do it?

FeedBurner Architecture

FeedBurner is a news feed management provider launched in 2004. FeedBurner provides custom RSS feeds and management tools to bloggers, podcasters, and other web-based content publishers. Services provided to publishers include traffic analysis and an optional advertising system.