Excellent article on using Hadoop in Amazon's services environment to solve real problems for very little money. It's excellent because it shows how the stack works together and it actually seems like something a real human could do.
Just thought I'd drop a brief suggestion to anyone building a large mail system. Our solution for scaling mail pickup was to develop a sharded architecture in which accounts are spread across a cluster of servers, each with IMAP/POP3 capability. We then put a cluster of reverse proxies (Perdition) in front of the backend IMAP/POP3 servers. The benefit of this approach is that you can simply use round-robin or HA load balancing on the Perdition servers that end users connect to, and admins can easily move accounts around on the backend storage servers without affecting end users. Perdition manages routing users to the appropriate backend servers and has MySQL support. What we also liked about this approach was that it has no dependency on a distributed or networked filesystem, so there is less chance of corruption or data consistency issues. When an individual server reaches capacity, we just offload users to a less-used server. If any server goes offline, it only affects the fraction of users assigned to that server. Best, Erik Osterman
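To make the routing idea above concrete, here is a minimal sketch of the user-to-backend lookup that a proxy tier performs. It is not Perdition's code or configuration: the `AccountDirectory` class, its method names, and the server names are all hypothetical, and an in-memory dict stands in for the lookup table (the comment mentions MySQL for that role).

```python
class AccountDirectory:
    """Hypothetical lookup table: which backend mail server holds each account."""

    def __init__(self):
        # user name -> backend server name (a database table in a real deployment)
        self._backend_for = {}

    def assign(self, user, backend):
        """Place a new account on a backend server."""
        self._backend_for[user] = backend

    def route(self, user):
        """Return the backend a proxy should forward this user's session to."""
        return self._backend_for[user]

    def move(self, user, new_backend):
        """Point an existing account at a different backend.

        Only the mapping changes; the address end users connect to stays the same.
        """
        if user not in self._backend_for:
            raise KeyError(user)
        self._backend_for[user] = new_backend

    def users_on(self, backend):
        """List the accounts that an outage of one backend would affect."""
        return sorted(u for u, b in self._backend_for.items() if b == backend)
```

Because every proxy consults the same table, any proxy can serve any user, which is why plain round-robin in front of the proxy tier is enough.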
All APIs are different. At its core, an API provides direct access deep into a web service (lower case - a service that is provided on the web) and turns it into a Web Service (upper case) that people can use as a building block. What makes it an API is the infrastructure that sits in front of it, attracts developers to use it, secures it from misuse and provides the metrics and management needed to turn an internal web service into a Web Service that is managed through an effective distribution channel and provides strategic and/or financial benefit. While each API is different, the infrastructure I have described is consistent across virtually all of them, so it is neither economical nor effective to reinvent the wheel for each API someone wants to release. It is similar to the concept of an adserver - all websites have different content and functionality, but the concept of selecting and serving an ad, tracking it, and targeting it is pretty consistent across sites; as a result, there are many sites that use a handful of adserver providers. In addition to allowing companies to focus on their core business without having to build peripheral, non-core services, using a third-party service that is focused on providing that service allows you to benefit from ongoing development and enhancement, and from features that would be prohibitively expensive to build for just a single provider. As for an example, check out sites such as developer.trulia.com or developer.compete.com, our first two customers (we have many more, but I like to give props to our early adopters). In addition to documentation and community, they have developer key issuance, instant self-service developer provisioning, usage and rate throttling, and tracking.
What you don't see, but our clients enjoy, is a dashboard where they can assign different access levels, rates or limits to each developer on a key-by-key basis, customize error messages and other API parameters, and see detailed reports of API usage on a developer-by-developer or overall basis. Building all of that takes time and money; we offer it as an instantly-deployable on-demand service with no up-front investment, and our customers seem to find it an excellent value. Oren Michels, CEO Mashery
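As an illustration of the per-key rate limits described above, here is a minimal sketch of a fixed-window throttle keyed by developer key. It is not Mashery's implementation; the `KeyRateLimiter` class, its parameters, and the fixed-window policy are assumptions chosen for brevity.

```python
import time


class KeyRateLimiter:
    """Hypothetical fixed-window throttle with an optional per-key limit override."""

    def __init__(self, default_limit, window_seconds=60.0, clock=time.monotonic):
        self.default_limit = default_limit  # calls allowed per window by default
        self.window = window_seconds
        self.clock = clock                  # injectable so the logic can be tested
        self.limits = {}                    # developer key -> custom limit
        self._windows = {}                  # developer key -> (window start, count)

    def set_limit(self, key, limit):
        """Give one developer key its own limit, key by key."""
        self.limits[key] = limit

    def allow(self, key):
        """Return True and count the call if the key is still under its limit."""
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired; start counting afresh.
            start, count = now, 0
        limit = self.limits.get(key, self.default_limit)
        if count >= limit:
            self._windows[key] = (start, count)
            return False
        self._windows[key] = (start, count + 1)
        return True
```

A production service would also record each decision for usage reports and would likely use a smoother policy such as a token bucket, but the key-by-key override is the same idea.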
Hi, can someone point me to some good resources on how to build a multilanguage website? The only resource I have found is this: http://www.indiawebdevelopers.com/technology/multilanguage_support.asp Thanks! P.S. Great site ;)
MogileFS is an open source distributed filesystem. Its properties and features include: Application level, No single point of failure, Automatic file replication, Better than RAID, Flat Namespace, Shared-Nothing, No RAID required, Local filesystem agnostic.
memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Danga Interactive developed memcached to enhance the speed of LiveJournal.com, a site which was already doing 20 million+ dynamic page views per day for 1 million users with a bunch of webservers and a bunch of database servers. memcached dropped the database load to almost nothing, yielding faster page load times for users, better resource utilization, and faster access to the databases on a memcache miss.
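The way a cache takes load off the database can be sketched with the common cache-aside pattern: look in the cache first and query the database only on a miss. This sketch uses a plain dict as a stand-in for memcached and makes no memcached client calls; the `CacheAside` class and the `load_from_db` callback are hypothetical names.

```python
class CacheAside:
    """Hypothetical cache-aside wrapper; a dict stands in for the memcached cluster."""

    def __init__(self, load_from_db):
        self._cache = {}
        self._load = load_from_db  # function that performs the real database query
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Serve from the cache when possible; fall back to the database on a miss."""
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value   # populate so the next read skips the database
        return value

    def invalidate(self, key):
        """Drop a cached entry after the underlying row changes."""
        self._cache.pop(key, None)
```

With this pattern the database sees one query per key until the entry is invalidated, which is the effect described above for read-heavy dynamic pages.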
As a developer, you are aware of the increasing concern amongst developers and site architects that websites be able to handle the vast number of visitors that flood the Internet on a daily basis. Scalable Internet Architecture addresses these concerns by teaching you both good and bad design methodologies for building new sites and how to scale existing websites to robust, high-availability websites. Primarily example-based, the book discusses major topics in web architectural design, presenting existing solutions and how they work. Technology budget tight? This book will work for you, too, as it introduces new and innovative concepts to solving traditionally expensive problems without a large technology budget. Using open source and proprietary examples, you will be engaged in best practice design methodologies for building new sites, as well as appropriately scaling both growing and shrinking sites. Website development help has arrived in the form of Scalable Internet Architecture.
I currently use BerkeleyDB as an embedded database (http://www.oracle.com/database/berkeley-db/), a decision which was initially brought on by learning that Google used BerkeleyDB for their universal sign-on feature. Lustre looks impressive, but their white paper cites 800 files created per second as a good number. However, BerkeleyDB on my Mac mini does 200,000 row creations per second, and can be used as a distributed file system. I'm having I/O scalability issues with BerkeleyDB on one machine, and am about to implement their distributed replication feature (and go multi-machine), which in effect makes it work like a distributed file system, but with local access speeds. That's why I was looking at Lustre. The key feature difference between BerkeleyDB and Lustre is that BerkeleyDB keeps a complete copy of all the data on each computer, which makes it not a viable solution for massive database applications. However, if you have < 1TB (i.e., one disk) of total possible data, it seems to me that a replicated local key/value database is the fastest solution. I haven't found much discussion of people using this kind of technology for highly scalable web sites. Over the years, I've had extremely good performance results with dbm files, and have found that nothing beats local data, access through C APIs, and btree or hash table implementations. I have never tried replicated/redundant versions of this approach, and I'm curious if others have, and what your experience has been.
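For readers unfamiliar with the local key/value approach described above, here is a minimal sketch using Python's standard `dbm` module. It is a stand-in for the dbm files mentioned in the comment, not BerkeleyDB itself, and it does not cover replication; the function names are hypothetical.

```python
import dbm


def write_rows(path, rows):
    """Store (key, value) string pairs in a local dbm file, creating it if needed."""
    with dbm.open(path, "c") as db:
        for key, value in rows:
            # dbm stores bytes, so encode explicitly.
            db[key.encode("utf-8")] = value.encode("utf-8")


def read_row(path, key):
    """Look up one key in the local file; return None if it is absent."""
    with dbm.open(path, "r") as db:
        raw = db.get(key.encode("utf-8"))
        return None if raw is None else raw.decode("utf-8")
```

Every read is a local file lookup with no network hop, which is the property the comment credits for its speed. Throughput will depend on which dbm backend Python selects on a given machine.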
In the Amazon Services architecture article the podcast mentions Mashery. I went to their site at http://www.mashery.com/, but I can't quite figure out what it is. They want to:
"Unleash and manage channels for your API responsibly with Mashery’s combination of security, usage, access management, tracking, metrics, commerce, performance optimization and developer community tools." An example would help, because I am not getting it.
Update: Speed up Apache - how I went from F to A in YSlow. Good example of using YSlow to speed up a website with solid code examples. Every layer in the multi-layer cake that is your website contributes to how long a page takes to display. YSlow, from Yahoo, is a cool tool for discovering how the ingredients of your site's top layer contribute to performance. YSlow analyzes web pages and tells you why they're slow based on the rules for high performance web sites. YSlow is a Firefox add-on integrated with the popular Firebug web development tool. YSlow gives you:
* Performance report card
* HTTP/HTML summary
* List of components in the page
* Tools including JSLint