| Product |
Practically any software project nowadays could not survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data usually grows larger exponentially. So you start moving the DBMS to a separate server to gain more speed and capacity. Which is all good and healthy but you do not gain any extra safety for this business data. You might be backing up your database once a day so in case the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose?
Appliances that:
* Ensures Non-Stop application availability
* Improves network and server maintainability
* Delivers Enterprise-grade gigabit content switching
* Offers true Application Acceleration
* Provides maximum throughput at minimal cost
Adam Wiggins of Heroku presented at the lollapalooza that was theCloud Computing Demo Night. The idea behind Heroku is that you upload a Rails application into Heroku and it automatically deploys into EC2 and it automatically scales using behind the scenes magic. They call this "liquid scaling." You just dump your code and go. You don't have to think about SVN, databases, mongrels, load balancing, or hosting. You just concentrate on building your application. Heroku's unique feature is their web based development environment that lets you develop applications completely from their control panel. Or you can stick with your own development environment and use their API and Git to move code in and out of their system.
For website developers this is as high up the stack as it gets. With Heroku we lose that "build your first lightsaber" moment marking the transition out of apprenticeship and into mastery. Upload your code and go isn't exactly a heroes journey, but it is damn effective.
I must confess to having an inherent love of Heroku's idea because I had a similar notion many moons ago, but the trendy language of the time was Perl instead of Rails. At the time though it just didn't make sense. The economics of creating your own "cloud" for such a different model wasn't there. It's amazing the niches utility computing will seed, fertilize, and help grow. Even today when using Eclipse I really wish it was hosted in the cloud and I didn't have to deal with all its deployment headaches. Firefox based interfaces are pretty impressive these days. Why not?
Adam views their stack as:
Update 2: A HSCALE benchmark finds HSCALE "adds a maximum overhead of about 0.24 ms per query (against a partitioned table)." Future releases promise much improved results.
Update: A new presentation at An Introduction to HSCALE.
After writing Skype Plans for PostgreSQL to Scale to 1 Billion Users, which shows how Skype smartly uses a proxy architecture for scaling, I'm now seeing MySQL Proxy articles all over the place. It's like those "get rich quick" books that say all you have to do is visualize a giraffe with a big yellow dot superimposed over it and by sympathetic magic giraffes will suddenly stampede into your life. Without realizing it I must have visualized transparent proxies smothered in yellow dots.
One of the brightest images is a wonderful series of articles by Peter Romianowski describing the evolution of their proxy architecture. Their application is an OLTP system executing 200 million transaction per month, tables with more than 1.5 billion rows, and a 600 GB total database size. They ran into a wall buying bigger boxes and wanted to move to a sharded architecture. The question for them was: how do you implement sharding?
Update: InfoQ interview on Hypertable Lead Discusses Hadoop and Distributed Databases. Hypertable differs from HBase in that it is a higher performance implementation of Bigtable.
Skrentablog gives the heads up on Hypertable, Zvents' open-source BigTable clone. It's written in C++ and can run on top of either HDFS or KFS. Performance looks encouraging at 28M rows of data inserted at a per-node write rate of 7mb/sec.
The Isilon IQ family of clustered storage systems was designed from the ground up to meet the needs of data-intensive enterprises and high-performance computing environments. By combining Isilon's OneFS® operating system software with the latest advances in industry-standard hardware, Isilon delivers modular, pay-as-you-grow, enterprise-class clustered storage systems. OneFS, with TrueScale™ technology, powers the industry's first and only storage system that enables linear or independent scaling of performance and capacity. This new flexible and tunable system, featuring a robust suite of clustered storage software applications, provides customers with an "out of the box" solution that is fully optimized for the widest range of applications and workflow needs.
* Scales from 4 TB ti 1 PB
* Throughput of up to 10 GB per seond
* Linear scaling
* Easy to manage
There's a new clustered file system on the spindle: Kosmos File System (KFS). Thanks to Rich Skrenta for turning me on to KFS and I think his blog post says it all. KFS is an open source project written in C++ by search startup Kosmix. The team members have a good pedigree so there's a better than average chance this software will be worth considering.
After you stop trying to turn KFS into "Kentucky Fried File System" in your mind, take a look at KFS' intriguing feature set:
Lustre® is a scalable, secure, robust, highly-available cluster file system. It is designed, developed and maintained by Cluster File Systems, Inc.
The central goal is the development of a next-generation cluster file system which can serve clusters with 10,000's of nodes, provide petabytes of storage, and move 100's of GB/sec with state-of-the-art security and management infrastructure.
Lustre runs on many of the largest Linux clusters in the world, and is included by CFS's partners as a core component of their cluster offering (examples include HP StorageWorks SFS, and the Cray XT3 and XD1 supercomputers). Today's users have also demonstrated that Lustre scales down as well as it scales up, and runs in production on clusters as small as 4 and as large as 25,000 nodes.
The latest version of Lustre is always available from Cluster File Systems, Inc. Public Open Source releases of Lustre are available under the GNU General Public License. These releases are found here, and are used in production supercomputing environments worldwide.
If you want to adopt a shard architecture, but don't want to start from scratch, you may want to consider Hibernate's sharding system.
Hibernate Shards is a framework that is designed to encapsulate and minimize this complexity by adding support for horizontal partitioning to Hibernate Core.
Hibernate Shards key features:
* Standard Hibernate programming model - Hibernate Shards allows you to continue using the Hibernate APIs you know and love: SessionFactory, Session, Criteria, Query. If you already know how to use Hibernate, you already know how to use Hibernate Shards.
* Flexible sharding strategies - Distribute data across your shards any way you want. Use one of the default strategies we provide or plug in your own application-specific logic.
* Support for virtual shards - Think your sharding strategy is never going to change? Think again. Adding new shards and redistributing your data is one of the toughest operational challenges you will face once you've deployed your shard-aware application. Hibernate Sharding supports virtual shards, a feature designed to simplify the process of resharding your data.
* Free/open source - Hibernate Shards is licensed under the LGPL (Lesser GNU Public License)
3PAR Remote Copy is a uniquely simple and efficient replication technology that allows customers to protect and share any application data affordably. Built upon 3PAR Thin Copy technology, Remote Copy lowers the total cost of storage by addressing the cost and complexity of remote replication.
Common Uses of 3PAR Remote Copy:
Affordable Disaster Recovery: Mirror data cost-effectively across town or across the world.
Centralized Archive: Replicate data from multiple 3PAR InServs located in multiple data centers to a centralized data archive location.
Resilient Pod Architecture: Mutually replicate tier 1 or 2 data to tier 3 capacity between two InServs (application pods).
Remote Data Access: Replicate data to a remote location for sharing of data with remote users.
Akamai transparently mirrors content (usually media objects such as audio, graphics, animation, video) stored on customer servers. Though the domain name is the same, the IP address points to an Akamai server rather than the customer's server.
In addition to image caching, Akamai provides services which accelerate dynamic and personalized content, J2EE-compliant applications, and streaming media.
Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
This service allows you to link directly to files at a cost of 15 cents per GB of storage, and 20 cents per GB transfer.
Update 30: Amazon SimpleDB - A distributed, highly-scalable, light-weight, query-able, attribute store by Sebastian Stadil. It introduces the CAP theorem and the basics of SimpleDB. Sebastian does a lot of great work in the AWS world and in what must be his limited free time, runs the AWS Meetup group.
AWStats is a free powerful and featureful tool that generates advanced web, streaming, ftp or mail server statistics, graphically. This log analyzer works as a CGI or from command line and shows you all possible information your log contains, in few graphical web pages. It uses a partial information file to be able to process large log files, often and quickly. It can analyze log files from all major server tools like Apache log files (NCSA combined/XLF/ELF log format or common/CLF log format), WebStar, IIS (W3C log format) and a lot of other web, proxy, wap, streaming servers, mail servers and some ftp servers.
Cacti is a network statistics graphing tool designed as a frontend to RRDtool's data storage and graphing functionality. It is intended to be intuitive and easy to use, as well as robust and scalable. It is generally used to graph time-series data like CPU load and bandwidth use.
The frontend is written in PHP; it can handle multiple users, each with their own graph sets, so it is sometimes used by web hosting providers (especially dedicated server, virtual private server, and colocation providers) to display bandwidth statistics for their customers. It can be used to configure the data collection itself, allowing certain setups to be monitored without any manual configuration of RRDtool.
From their website:
Simply put, Capistrano is a tool for automating tasks on one or more remote servers. It executes commands in parallel on all targeted machines, and provides a mechanism for rolling back changes across multiple machines. It is ideal for anyone doing any kind of system administration, either professionally or incidentally.
* Great for automating tasks via SSH on remote servers, like software installation, application deployment, configuration management, ad hoc server monitoring, and more.
* Ideal for system administrators, whether professional or incidental.
* Easy to customize. Its configuration files use the Ruby programming language syntax, but you don't need to know Ruby to do most things with Capistrano.
* Easy to extend. Capistrano is written in the Ruby programming language, and may be extended easily by writing additional Ruby modules.
If you are trying to create highly available file systems, especially across data centers, then ChironFS is one potential solution. It's relatively new, so there aren't lots of experience reports, but it looks worth considering. What is ChironFS and how does it work?
From http://directory.fsf.org/project/collectd/ :
'collectd' is a small daemon which collects system information every 10 seconds and writes the results in an RRD-file. The statistics gathered include: CPU and memory usage, system load, network latency (ping), network interface traffic, and system temperatures (using lm-sensors), and disk usage. 'collectd' is not a script; it is written in C for performance and portability. It stays in the memory so there is no need to start up a heavy interpreter every time new values should be logged.
From their website:
There are a number of times in which you find yourself needing performance data. These can include benchmarking, monitoring a system's general heath or trying to determine what your system was doing at some time in the past. Sometimes you just want to know what the system is doing right now. Depending on what you're doing, you often end up using different tools, each designed to for that specific situation. Features include:
From their website:
DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1.
DRBD takes over the data, writes it to the local disk and sends it to the other host. On the other host, it takes it to the disk there.
eAccelerator is a free open-source PHP accelerator, optimizer, and dynamic content cache. It increases the performance of PHP scripts by caching them in their compiled state, so that the overhead of compiling is almost completely eliminated. It also optimizes scripts to speed up their execution. eAccelerator typically reduces server load and increases the speed of your PHP code by 1-10 times.
From their website:
FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a Cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE or Jumpstart for SUN Solaris. FAI can also be used for configuration management of a running system.
You can take one or more virgin PCs, turn on the power and after a few minutes Linux is installed, configured and running on all your machines, without any interaction necessary. FAI it's a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment.
FastStats Log Analyzer enables you to:
* Determine whether your CPC advertising is profitable:
Are you spending $0.75 per click on Google or Overture, but only receiving $0.56 per click in revenue?
* Tune site traffic patterns:
FastStats's Hyperlink Tree View feature lets you visually see how traffic flows through your web site.
* High-performance solution for even the busiest web sites:
Our software has been clocked at over 1000 MB/min. Other popular log file analysis tools (we won't name names), run at 1/40th the speed.
We've been in the business for over 6 years, delivering value, quality, and good customer service to our clients. Our products are used for data mining at some of the world's busiest web sites -- why not give FastStats a try at your web site? FastStats log file analysis supports a wide variety of web server log files, including Apache logs and Microsoft IIS logs.
Flickr offers a free basic account with limited upload bandwidth and limited storage. Download bandwidth is unlimited. Upgrading to a paid Pro account for $25/year removes all upload and storage restrictions. Flickr's terms of use warn that "professional or corporate uses of Flickr are prohibited", and all external images require a link back to Flickr.