Database

Alternative Memcache Usage: A Highly Scalable, Highly Available, In-Memory Shard Index

While working with Memcache the other night, it dawned on me that it’s usage as a distributed caching mechanism was really just one of many ways to use it. That there are in fact many alternative usages that one could find for Memcache if they could just realize what Memcache really is at its core – a simple distributed hash-table – is an important point worthy of further discussion.

To be clear, when I say “simple”, by no means am I implying that Memcache’s implementation is simple, just that the ideas behind it are such. Think about that for a minute. What else could we use a simple distributed hash-table for, besides caching? How about using it as an alternative to the traditional shard lookup method we used in our Master Index Lookup scalability strategy, discussed previously here.

How to Organize a Database Table’s Keys for Scalability

The key (no pun intended) to understanding how to organize your dataset’s data is to think of each shard not as an individual database, but as one large singular database. Just as in a normal single server database setup where you have a unique key for each row within a table, each row key within each individual shard must be unique to the whole dataset partitioned across all shards.

There are a few different ways we can accomplish uniqueness of row keys across a shard cluster. Each has its pro’s and con’s and the one chosen should be specific to the problems you’re trying to solve.

Deploying MySQL Database in Solaris Cluster Environments

MySQL™ database, an open source database, delivers high performance and reliability while keeping costs low by eliminating licensing fees. The Solaris™ Cluster product is an integrated hardware and software environment that can be used to create highly-available data services. This article explains how to deploy the MySQL database in a Solaris Cluster environment. The article addresses the following topics:

* "Advantages of Deploying MySQL Database with Solaris Cluster" on page 1 discusses the benefits provided by a Solaris Cluster deployment of the MySQL database.
* "Overview of Solaris Cluster" on page 2 provides a high-level description of the hardware and software components of the Solaris Cluster.
* "Installation and Configuration" on page 8 explains the procedure for deploying the MySQL database on a Solaris Cluster.

This article assumes that readers have a basic understanding of Solaris Cluster and MySQL database installation and administration.

Todd Hoff's picture

Product: Amazon's SimpleDB

Update 34: Apparently Amazon pulled this article. I'm not sure what that means. Maybe time went backwards or something? Amazon dramatically drops SimpleDB pricing to $0.25 per GB per month from $1.50 per GB. This puts SimpleDB on par with Google App Engine. They also announced a few new features: a SQL-like SELECT API as well as a Batch Put operation to streamline uploading of multiple items or attributes. One of the complaints against SimpleDB is that programmers end up writing too much code to do simple things. These features and a much cheaper price should help considerably. And you can store lots of data now. GAE is still capped.
Update 33: Amazon announces Elastic Block Store (EBS), which provides lots of normal looking disk along with value added features like snapshots and snapshot copying. But database's may find EBS too slow. RightScale tells us Why Amazon’s Elastic Block Store Matters.
Update 32: You can now get all attributes for a property when querying. Previously only the ID was returned and the attributes had to be returned in separate calls. This makes the programmer's job a lot simpler. Artificial levels of parallelization code can now be dumped.
Update 31: Amazon fixes a major hole in SimpleDB by adding the ability to sort query results. Previously developers had to sort results by hand which was a non-starter for many. Now you can do basic top 10 type queries with ease.
Update 30: Amazon SimpleDB - A distributed, highly-scalable, light-weight, query-able, attribute store by Sebastian Stadil. It introduces the CAP theorem and the basics of SimpleDB. Sebastian does a lot of great work in the AWS world and in what must be his limited free time, runs the AWS Meetup group.

SQL Server 2008 Database Performance and Scalability

Microsoft SQL Server 2008 incorporates the tools and technologies that are necessary to implement relational databases, reporting systems, and data warehouses of enterprise scale, and provides optimal performance and responsiveness.
With SQL Server 2008, you can take advantage of the latest hardware technologies while scaling up your servers to support server consolidation. SQL Server 2008 also enables you to scale out your largest data solutions.

This white paper describes the performance and scalability capabilities of Microsoft® SQL Server® 2008 and explains how you can use these capabilities to:

* Optimize performance for any size of database with the tools and features that are available for the database engine, analysis services, reporting services, and integration services.

* Scale up your servers to take full advantage of new hardware capabilities.

* Scale out your database environment to optimize responsiveness and to move your data closer to your users.

Read the entire article about SQL Server 2008 Database Performance and Scalability at MyTestBox.com - web software reviews, news, tips & tricks.

Database question for upcoming project

We will be developing an RIA that will have a lot of database access. Think something like a QuickBooks but with about 50 transactions entered per hour per user. Users will be in the system for 7 to 9 hours a day and there will be around 20,000 users, all logged in at the same time.

Reporting will be done just like a QuickBooks style app plus a lot of extra things you don't do in QuickBooks.

Our operations is familiar with W2003 Server and MS SQL Server so they are recommending we stick with that. I originally requested Linux and PostgreSQL.

How far can a single database server get me? If we have a 4 processor, 8 core, 128gb server, how far am I going to get before I need to shard or do something else? I know there are a lot of factors involved but in general for this size of a site, what should the strategy be?

I've read almost all articles on this website but most of the applications are not RIA type of apps with this type of usage or they are architectures for sites with millions of users which we also won't have.

Todd Hoff's picture

Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)

Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads.
Update: interesting related thread on Lamda the Ultimate.

A really fascinating paper bolstering many of the anti-RDBMS threads the have popped up on the intertube lately. The spirit of the paper is found in the following excerpt:

In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:
* Disk oriented storage and indexing structures
* Multithreading to hide latency
* Locking-based concurrency control mechanisms
* Log-based recovery

Todd Hoff's picture

The Mother of All Database Normalization Debates on Coding Horror

Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow.

Here are a few of the pithier points (summarized):

Todd Hoff's picture

How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale

Update 3: ReadWriteWeb says Google App Engine Announces New Pricing Plans, APIs, Open Access. Pricing is specified but I'm not sure what to make of it yet. An image manipulation library is added (thus the need to pay for more CPU :-) and memcached support has been added. Memcached will help resolve the can't write for every read problem that pops up when keeping counters.
Update 2: onGWT.com threw a GAE load party and a lot of people came. The results at Load test : Google App Engine = 1, Community = 0. GAE handled a peak of 35 requests/second and a sustained 10 requests/second. Some think performance was good, others not so good. My GMT watch broke and I was late to arrive. Maybe next time. Also added a few new design rules from the post.
Update: Added a few new rules gleaned from the GAE Meetup: Design By Explicit Cost Model and Puts are Precious.

How do you structure your database using a distributed hash table like BigTable? The answer isn't what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don't. You--pause for dramatic effect--duplicate data instead of normalize it. *shudder*

Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation. I don't know how that decision was made, but it must have gone against every fiber in their relational bones...

Todd Hoff's picture

Rumors of Signs and Portents Concerning Freeish Google Cloud

Update 2: Rumor no more. Google Jumps Head First Into Web Services With Google App Engine. The quick and dirty of it: developers simply upload their Python code to Google, launch the application, and can monitor usage and other metrics via a multi-platform desktop application. There were 10,000 developer slots open and of course I was too late. More as the cobra strikes.
Update: TechCrunch reports Google To Launch BigTable As Web Service next week. It competes with Amazon's SimpleDB. Though it won't be truly comparable until they also release an EC2 and S3 equivalent. An internet hit for each data access is a little painful. As Jimmy says in Goodfellas, "That's the way. You don't take no sh*t from nobody. "

First Dave Winer hallucinates a pig on the mean streets of Walnut Creek that told him Google's long foretold cloud offering will be free for bloggers of "modest needs." GigaOM then says a free cloud service is how Google could eat Amazon's bacon for lunch.

The reason for this free cloud buffet is said to be the easier integration of acquisitions who must presumably be in the Google cloud to be taken out. All the free stuff Google offers earns almost no money. They make money on search. Hosting every last CPU cycle on earth has to be costly. What's the return? Cheaper integration of new startups that will also provide no new revenue?

Perhaps I am simply not clever enough to see the revolutionary brilliance in this line of thought. Though I would be quite pleased to have Google shareholders subsidize my projects.

Folknologist thinks Google may keep costs down by requiring developers to code to a Cloud Virtual Machine based on Java byte codes...

Scaling Out MySQL

This post covers two main options for scaling-out MySql and compare between them. The first is based on data-base clustering and the second is based on In Memory clustering a.k.a Data Grid. A special emphasis is given to a pattern which shows how to scale our existing data base without changing it through a combination of Data Grid and data base as a background service. This pattern is referred to as Persistency as a Service (PaaS). It also address many of the fequently asked question related to how performance, reliability and scalability is achieved with this pattern.

Database War Stories #3: Flickr

[Tim O'Reilly] Continuing my series of queries about how "Web 2.0" companies used databases, I asked Cal Henderson of Flickr to tell me "how the folksonomy model intersects with the traditional database. How do you manage a tag cloud?"

Todd Hoff's picture

Microsoft's New Database Cloud Ready to Rumble with Amazon

Update: Zdnet says Ozzie signals Microsoft’s surrender to the cloud. CD ROMs are to the internet as the internet is to the cloud and Microsoft aims to scratch and claw its way into this paradigm shift as well.

The gloves are off. The tag line for Microsoft's new SQL Server Data Service is Your Data, Any Place, Any Time. Thems fighten' words. Microsoft is itch'n for a fight! Who will be Amazon's second?

The service description:
SQL Server Data Services (SSDS) are highly scalable, on-demand data storage and query processing utility services. Built on robust SQL Server database and Windows Server technologies, these services provide high availability, security and support standards-based web interfaces for easy programming and quick provisioning.

Sounds like a fast uppercut aimed squarely at SimpleDB's jaw. As a developer what do you need to know?

limit on the number of databases open

Have a few doubts.. here are the qs
1) is there any limit on the number of databases that can be accessed simultaneously? (MySQL)

2) will it be a problem to scale in the future if there are large number of small databases(2-5 MB) each?

Too many databases

Hi,

I am using drupal for my clients website, and was thinking is it possible to host all ( about 500) of them on the same server(maybe VPS or dedicated).
Here is the situation..... Each clients website has a database with about 50 tables each, all the databases are small in size about 2-5 MB .... and the websites are low traffic websites with say.. 50 hits/day on avg.... that means about 2000 queries/db/day ..... (avg 40 queries per hit)....

Wanted to know if it is possible to have so many databases about 500 on the same server?

what are the things that i should look into if i should make this happen?

Database-Clustering: a8cjdbc - update: version 1.3

The new version of a8cjdbc finished some limitations. Now Clobs and Blobs are supported, and some fixes using binary data. The version was also fully tested with Postgres and mySQL.

Since Version 1.3 there is also a free trail version for download available. Check it out and test yourself...

Take a look at: http://www.activ8.at/homepage/en/a8cjdbc.php

I've downloaded the latest version and setup a environment with one virtual database and two database backends.
I tried to make a "non real life szenario": The first backend was a Postgres node, the second was a mySQL node.
Everything works fine - failover - recoverylog, etc... with to different backend database types.

So check out the trial version and test yourself the clustered driver and give me some results about your experience with a8cjdbc.
As I only tested mySQL and Postgres (and the non real life szenario with two different backend types) - maybe someone else have experiences with out databases?

greetings
Wolfgang

a8cjdbc - update verision 1.3

The new version of a8cjdbc finished some limitations. Now Clobs and Blobs are supported, and some fixes using binary data. The version was also fully tested with Postgres and mySQL.

Since Version 1.3 there is also a free trail version for download available. Check it out and test yourself...

Take a look at: http://www.activ8.at/homepage/en/a8cjdbc.php

I've downloaded the latest version and setup a environment with one virtual database and two database backends.
I tried to make a "non real life szenario": The first backend was a Postgres node, the second was a mySQL node.
Everything works fine - failover - recoverylog, etc... with to different backend database types.

So check out the trial version and test yourself the clustered driver and give me some results about your experience with a8cjdbc.
As I only tested mySQL and Postgres (and the non real life szenario with two different backend types) - maybe someone else have experiences with out databases?

greetings
Wolfgang

a8cjdbc - Database Clustering via JDBC

Practically any software project nowadays could not survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data usually grows larger exponentially. So you start moving the DBMS to a separate server to gain more speed and capacity. Which is all good and healthy but you do not gain any extra safety for this business data. You might be backing up your database once a day so in case the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose?

Todd Hoff's picture

Paper: Dynamo: Amazon’s Highly Available Key-value Store

Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated.

Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues.

Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of a RDBMS, it's an important technology to understand.

From the abstract you can get a feel for what the paper is about:

Todd Hoff's picture

Product: Sequoia Database Clustering Technology

Sequoia is a transparent middleware solution offering clustering, load balancing and failover services for any database. Sequoia is the continuation of the C-JDBC project. The database is distributed and replicated among several nodes and Sequoia balances the queries among these nodes. Sequoia handles node and network failures with transparent failover. It also provides support for hot recovery, online maintenance operations and online upgrades.

Syndicate content