Alan Watts once observed that ever since we accepted Descartes' separation of mind and body, we've been trying to smash them back together again, when really they were never separate to begin with. The database normalization-denormalization dualism has the same Möbius-shaped reverberations as Descartes' error. We separate data into a million jagged little pieces and then spend all our time stooping over, picking them up, and joining them back together again. Normalization has been standard practice for decades now. But times are changing. Many mega-website architects are concluding Watts was right: the data was never separate to begin with. And, even more radically, we may need to store multiple copies of it.
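To make the trade-off concrete, here is a minimal sketch of the two shapes of the same data, using sqlite3 purely for illustration; the table and column names are hypothetical, not from any system mentioned here.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Normalized: user data lives in exactly one place; reads need a join.
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
    INSERT INTO users  VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 'book');
""")
joined = db.execute("""
    SELECT orders.item, users.name
    FROM orders JOIN users ON users.id = orders.user_id
""").fetchall()

# Denormalized: each order carries a copy of the user's name, so a read is a
# single lookup on one table (or one shard), at the cost of keeping the
# duplicated name consistent when it changes.
db.executescript("""
    CREATE TABLE orders_wide (id INTEGER PRIMARY KEY, user_name TEXT, item TEXT);
    INSERT INTO orders_wide VALUES (10, 'alice', 'book');
""")
flat = db.execute("SELECT item, user_name FROM orders_wide").fetchall()

print(joined, flat)   # same answer, two very different storage shapes
```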
The Great Data Ownership Wars: The Database vs. The Application
A not so subtle clue as to who won the data wars is to look at the words we use. Data that are split up are considered "normal." Those who keep their data whole are considered "de-normal." All right, that's not what those words mean, but it was too good to pass up. :-) Traditionally the database owns the data. Referential integrity, triggers, stored procedures, and everything else that keeps the data safe and whole lives in the database. Applications are prevented from screwing up the data. And this makes sense, until you scale. Centralizing all behavior in the database won't mega-scale the way the web does, which is why eBay went completely the other way. eBay maintains data integrity through a service layer that encapsulates all data access. The service layer handles referential integrity, managing replicated copies, doing joins, and so on. It's more error prone than having the database do all this work, but you are able to scale past what even the highest-end databases can handle. All this sharding and denormalizing and duplicating feels so wrong at some level because it's so different from what we were all taught. And unless you are a really large website you probably don't need to worry about this level of complexity. But it's a really fascinating and unexpected evolution in design. Scaling to handle the world wide web requires techniques and strategies that are often at odds with our years of experience. It will be fun to see where it all leads.
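Here is a hypothetical sketch of the idea, not eBay's actual code: the application-level service layer, rather than the database, owns integrity across shards, keeps denormalized copies, and does "joins" as point lookups. All names and the in-memory "shards" are invented for illustration.

```python
class OrderService:
    def __init__(self, user_shards, order_shards):
        self.user_shards = user_shards      # lists of dict-like "databases"
        self.order_shards = order_shards

    def _shard(self, shards, key):
        # Trivial modulo "sharding" just for this sketch.
        return shards[key % len(shards)]

    def place_order(self, user_id, order_id, item):
        # Referential integrity is checked in code, not by a foreign key.
        user = self._shard(self.user_shards, user_id).get(user_id)
        if user is None:
            raise ValueError("unknown user %r" % user_id)
        # Write a denormalized copy of the user's name with the order so that
        # reads never need a cross-shard join.
        self._shard(self.order_shards, order_id)[order_id] = {
            "user_id": user_id,
            "user_name": user["name"],   # duplicated on purpose
            "item": item,
        }

    def order_with_user(self, order_id, user_id):
        # An application-level "join": two point lookups instead of SQL JOIN.
        order = self._shard(self.order_shards, order_id)[order_id]
        user = self._shard(self.user_shards, user_id).get(user_id)
        return order, user

# Toy usage with in-memory dicts standing in for database shards.
users = [{}, {1: {"name": "alice"}}]
orders = [{}, {}]
svc = OrderService(users, orders)
svc.place_order(1, 10, "book")
print(svc.order_with_user(10, 1))
```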
We're implementing a content-oriented website that will get massive public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes. The solution we came up with is a separate service that re-indexes the contents of the database at regular intervals. This is a complex and not optimal solution, though, since we would like content to be indexed and searchable in real time. Could you point me to some examples or articles I could review to design a solution for this context?
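One common pattern for the "real time" part is to index at write time rather than re-crawling the database on a schedule: the code path that saves content also feeds the search index. The sketch below uses a toy in-memory inverted index as a stand-in for a real engine (Lucene/Solr, Sphinx, etc.); all names are illustrative.

```python
from collections import defaultdict

class SearchIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

index = SearchIndex()
database = {}   # stand-in for MySQL/Oracle

def save_article(doc_id, text):
    # One code path does both writes, so new content is searchable immediately
    # instead of waiting for the next batch re-index.
    database[doc_id] = text
    index.add(doc_id, text)

save_article(1, "Scaling search for a content site")
print(index.search("scaling"))   # {1}
```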
Hey, I have a website that I would like to scale. Right now we have 10 servers but this does not scale well. I know how to deal with my Apache web servers but have problems with the SQL servers. I would like to "scale out" and add servers as we need them. We have over 100 GB of data in MySQL and we tried to keep around 20 GB per server. It works well, except that if a server goes down then 1/5 of the users can't access the website. We could use replication, but we would need at least to double the SQL servers to replicate each one. And maybe in the future that won't be enough; we might need 3 slaves per master... I don't really like that idea. I would prefer to have 8 servers that all deal with the data from the 5 servers we have right now, and then we could add new servers when we need them. I looked at NFS, but that does not seem to be a good idea for SQL servers? Can you confirm?
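For the "add servers when we need them" part, one common technique is consistent hashing, which spreads keys over N MySQL servers so that adding a server moves only a fraction of the data. The sketch below is illustrative only, and it does not replace replication, which you still need so that one failed box does not take 1/N of your users offline.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, replicas=100):
        self.ring = []                    # sorted list of (point, server)
        for server in servers:
            for i in range(replicas):     # virtual nodes smooth the spread
                self.ring.append((self._hash(f"{server}:{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        point = self._hash(key)
        index = bisect.bisect(self.ring, (point,)) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["db1", "db2", "db3", "db4", "db5"])
print(ring.server_for("user:42"))   # which MySQL server owns this user's rows
```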
Is there any alternative to LIKE '%...%' OR LIKE '%...%' in MySQL if you have to offer partial string matching on a large dataset?
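One widely used alternative is a FULLTEXT index, which matches words (and word prefixes in boolean mode) rather than arbitrary substrings; in the MySQL of this era FULLTEXT required MyISAM tables, and external engines such as Sphinx or Lucene/Solr are other options. The sketch below assumes the MySQL-python (MySQLdb) driver, and the table, column, and connection details are hypothetical.

```python
import MySQLdb   # assumes the MySQL-python driver is installed

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="site")
cur = conn.cursor()

# One-time setup: add a full-text index over the searchable columns.
cur.execute("ALTER TABLE articles ADD FULLTEXT idx_search (title, body)")

# Query: boolean mode supports prefix matching ('data*' matches database,
# datacenter, ...), which covers many partial-match needs without LIKE '%...%'.
cur.execute(
    "SELECT id, title FROM articles "
    "WHERE MATCH(title, body) AGAINST (%s IN BOOLEAN MODE)",
    ("+data*",),
)
print(cur.fetchall())
```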
Playing like the big boys may be getting cheaper. The big boys, like YouTube, farm out the serving of their most popular videos to a third-party CDN. A lot of people were surprised YouTube didn't serve all their content themselves, but it makes sense. It allows them to keep up with demand without a large hit for infrastructure build-out, much like leasing computers instead of buying them. The problem has been that CDNs are expensive. Om Malik reports in Akamai & the CDN Price Wars that this may be changing. CDN service could be becoming affordable enough that you might consider using one as part of your scaling strategy. Akamai, once the clear leader in the CDN field, is facing strong competition from the likes of Limelight Networks, Level 3, Internap, CDNetworks, Panther Express and EdgeCast Networks. This commoditization may be bad for their stock prices, but it's good for website builders looking for new scaling strategies. EdgeCast, for example, passes on the cost savings when their bandwidth costs drop. Other services lock you into fixed-cost contracts. So competition is good. New cheaper, faster, and easier possibilities for scaling your website are coming online. Maybe CDNs can help you.
We are currently building a high-traffic portal like MySpace. What QPS should we keep in mind and develop the site for, so that it can scale as the traffic grows?
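There is no single right number; a back-of-envelope calculation from your own traffic targets is the usual starting point. Every number in the sketch below is an assumption to be replaced with your own estimates.

```python
daily_page_views = 10_000_000      # assumed target page views per day
requests_per_page = 10             # assumed dynamic requests behind each page view
peak_factor = 3                    # assumed peak-hour traffic vs. the daily average

average_qps = daily_page_views * requests_per_page / 86_400   # seconds per day
peak_qps = average_qps * peak_factor
print(f"average ~{average_qps:.0f} QPS, plan for ~{peak_qps:.0f} QPS at peak")
```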
Cacti is a network statistics graphing tool designed as a frontend to RRDtool's data storage and graphing functionality. It is intended to be intuitive and easy to use, as well as robust and scalable. It is generally used to graph time-series data like CPU load and bandwidth use. The frontend is written in PHP; it can handle multiple users, each with their own graph sets, so it is sometimes used by web hosting providers (especially dedicated server, virtual private server, and colocation providers) to display bandwidth statistics for their customers. It can be used to configure the data collection itself, allowing certain setups to be monitored without any manual configuration of RRDtool.
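For a sense of what Cacti automates, here is a rough sketch of the manual RRDtool workflow it wraps: create a round-robin database, feed it samples, and render a graph. It assumes the rrdtool binary is installed and runs on a Unix host; the file and data-source names are illustrative.

```python
import os
import subprocess
import time

# One-time creation: a GAUGE data source sampled every 300s, kept for ~1 day.
subprocess.run(["rrdtool", "create", "load.rrd", "--step", "300",
                "DS:load:GAUGE:600:0:U",
                "RRA:AVERAGE:0.5:1:288"], check=True)

# Periodic update, e.g. from cron: store the current 1-minute load average.
load1 = os.getloadavg()[0]
subprocess.run(["rrdtool", "update", "load.rrd",
                f"{int(time.time())}:{load1}"], check=True)

# Render a PNG of the last day, which is roughly what Cacti's PHP frontend
# does for you per user and per graph set.
subprocess.run(["rrdtool", "graph", "load.png", "--start", "-1d",
                "DEF:l=load.rrd:load:AVERAGE",
                "LINE1:l#0000ff:load"], check=True)
```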
This scalability strategy is brought to you by Erik Osterman: My recommendation for anyone dealing with explosive growth on a limited budget with lots of cachable content (e.g. content capable of returning valid expiration headers) is to employ a reverse proxy as mentioned in this article. In the last week, we had a site get AP'd, sending 100K unique visitors to a single IIS server in under 5 hours. It took out the IIS server. Placing a single Squid in front of the server handled the entire onslaught with a max server load of 0.10 on a modest Intel Pentium 4 at 3 GHz. It's trivial to implement for anyone interested...
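The application's side of this strategy is simply to emit explicit expiration headers so the reverse proxy can serve repeat hits from its cache instead of passing them to the origin. A minimal sketch using Python's standard-library WSGI server; the app and the max-age value are illustrative assumptions.

```python
from wsgiref.simple_server import make_server
from wsgiref.handlers import format_date_time
import time

MAX_AGE = 300   # let the proxy reuse each response for five minutes

def app(environ, start_response):
    body = b"<html><body>popular article</body></html>"
    headers = [
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(body))),
        # These two headers are what makes the content "cachable":
        ("Cache-Control", f"public, max-age={MAX_AGE}"),
        ("Expires", format_date_time(time.time() + MAX_AGE)),
    ]
    start_response("200 OK", headers)
    return [body]

if __name__ == "__main__":
    # Put the reverse proxy (e.g. Squid) in front of this origin server.
    make_server("", 8000, app).serve_forever()
```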
Excellent article on using Hadoop in Amazon's services environment to solve real problems for very little money. It's excellent because it shows how the stack works together and it actually seems like something a real human could do.