(Please bare with me, I'm a new, passionate, confident and terrified programmer :D ) Background: I'm pre-launch and 1 year into the development of my application. My target is to be able to eventually handle millions of registered users with 5-10% of them concurrent. Up to this point I've used auto-increment to assign unique identifiers to rows. I am now considering switching to a non-sequential strategy. Oh, I'm using the LAMP configuration. My reasons for avoiding auto-increment: 1. Complicates replication when scaling horizontally. Risk of collision is significant (when running multiple masters). Note: I've read the other entries in this forum that relate to ID generation and there have been some great suggestions -- including a strategy that uses auto-increment in a way that avoids this pitfall... That said, I'm still nervous about it. 2. Potential bottleneck when retrieving/assigning IDs -- IDs assigned at the database. My reasons for being nervous about non-sequential IDs: 1. To guarantee uniqueness, the IDs are going to be much larger -- potentially affecting performance significantly My New Strategy: (I haven't started to implement this... I'm waiting for someone smarter than me to steer me in the right direction) 1. Generate a guaranteed-unique ID by concatenating the user id (1-9 digits) and the UNIX timestamp(10 digits). 2. Convert the resulting 11-19 digit number to base_36. The resulting string will be alphanumeric and 6-10 characters long. This is, of course, much shorter (at least with regard to characters) then the standard GUID hash. 3. Pass the new identifier to a column in the database that is type CHAR() set to binary. My Questions: 1. Is this a valid strategy? Is my logic sound or flawed? Should I go back to being a graphic designer? 2. What is the potential hit to performance? 3. Is a 11-19 digit number (base 10) actually any larger (in terms of bytes) than its base-36 equivalent? I appreciate your insights... and High Scalability for supplying this resource!
I am looking for a way to distribute files over servers in different physical locations. My main concern is that I have bandwidth limitations on each location, and wish to spread the bandwidth load evenly. Atm. I just have 1:1 copies of the files on all servers, and have the application pick a random server to serve the file as a temp fix... It's a small video streaming service. I want to spoonfeed the stream to the client with a max bandwidth output, and support seek. At present I use php to limit the network stream, and read the file at a given offset sendt as a get parameter from the player for seek. It's psuedo streaming, but it works. I have been looking at MogileFS, which would solve the storage part. With MogileFS I can make use of my current php solution as it supports lighttpd and apache (with mod_rewrite or similar). However I don't see how I can apply MogileFS to check for bandwidth % usage? Any reccomendations for how I can solve this?
At the core of the new real-time web, which is really really old, are continuous queries. I like how this paper proposed to handle dynamic demand and dynamic resource availability by making the underlying system adaptable, which seems like a very cloudy kind of thing to do. Abstract:
The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. The scalability of these dataflows is limited by their constituent, stateful operators – e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Longrunning CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producerconsumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case
A software-based distributed caching system such as memcached is an important piece of today's largest Internet sites that support millions of concurrent users and deliver user-friendly response times. The distributed nature of memcached design transforms 1000s of servers into one large caching pool with gigabytes of memory per node. This blog entry explores single-instance memcached scalability for a few usage patterns.
Table below shows out-of-the-box (no custom OS rewrites or networking tuning required) performance with 10G networking hardware and one single-socket UltraSPARC T2-based server with 8 cores and 8 threads per core (64 threads on a chip)...
Object Size / Ops/Sec / Bandwidth
100 bytes / 530,000 / 1.2 Gb/s
2048 bytes / 370,000 / 6.9 Gb/s
4096 bytes / 255,000 / 9.2 Gb/s
Check out the link for more details!
Film buffs will recognize Django as a classic 1966 spaghetti western that spawned hundreds of imitators. Web heads will certainly first think of Django as the classic Python based Web framework that has also spawned hundreds of imitators and has become the gold standard framework for the web. Mike Malone, who worked on Pownce, a blogging tool now owned by Six Apart, tells in this very informative EuroDjangoCon presentation how Pownce scaled using Django in the real world. I was surprised to learn how large Pounce was: hundreds of requests/sec, thousands of DB operations/sec, millions of user relationships, millions of notes, and terabytes of static data. Django has a lot of functionality in the box to help you scale, but if you want to scale large it turns out Django has some limitations and Mike tells you what these are and also provides some code to get around them. Mike's talk-although Django specific--will really help anyone creating applications on the web. There's a lot of useful Django specific advice and a lot of general good design ideas as well. The topics covered in the talk are:
Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph. and Sherpa: Hosted Data Serving. Update: Kevin Burton points out Hadoop now has a blog and an introductory video staring Beyonce. Well, the Beyonce part isn't quite true. Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders: For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy It's hard work. And it needs to be commoditized, just like the hardware has been... Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters. The obvious question of the day is: should you build your website around Hadoop? I have no idea. There seems to be a few types of things you do with lots of data: process, transform, and serve. Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on. That is: figure out affinity, categorization, popularity, click throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does Google. But does it make sense for your website? If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store. YouTube would seem like it could use a distributed job mechanism, like you can build with Amazon's services. With that you could create thumbnails, previews, transcode media files, and so on. When they have Hbase up and running that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, highly performing data store. That's an exciting prospect for me. I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?
Making the world's knowledge computableToday's Wolfram|Alpha is the first step in an ambitious, long-term project to make all systematic knowledge immediately computable by anyone. You enter your question or calculation, and Wolfram|Alpha uses its built-in algorithms and growing collection of data to compute the answer.
Answer Engine vs Search EngineWhen Wolfram|Alpha launches later today, it will be one of the most computationally intensive websites on the internet. The Wolfram|Alpha computational knowledge engine is an "answer engine" that is able to produce answers to various questions such as
- What is the GDP of France?
- Weather is Springfield when David Ortiz was born
- 33 g of gold
- LDL vs. serum potassium 150 smoker male age 40
- life expectancy male age 40 finland
- highschool teacher median wage
- Abour 10,000 CPU cores at launch
- 10+ trillion of pieces of data
- 50,000+ types of algorithms
- Able to handle about 175 million queries per day
- 5+ million lines of symbolic Mathematica code
The Computers Powering Computable KnowledgeThere is no way to know exactly how much traffic to expect, especially during the initial period immediately following the launch, but the Wolfram|Alpha team is working hard to put reasonable capacity in place. As Stephen writes in the Wolfram|Alpha blog Alpha will run in 5 distributed colocation facilities. What computing power have they gathered in these facilities for launch day? Two supercomputers, just about 10,000 processor cores, hundreds of terabytes of disks, a heck of a lot of bandwidth, and what seems like enough air conditioning for the Sahara to host a ski resort. One of their launch partners, R Systems, created the world’s 44th largest supercomputer (per the June 2008 TOP500 list - it is listed as 66th per the latest Top500 list). They call it the R Smarr. It will be running Wolfram|Alpha on launch day! R Smarr has a Sum Rmax of 39580 GFlops using Dell DCS CS23-SH, QC HT 2.8 GHz computers, 4608 cores, 65536 GB of RAM and Infiniband interconnect. Dell is another of the launch partners with a data center full of quad-board, dual-processor, quad-core Harpertown servers. What does it all add up to? The ability to handle 175 million queries (yielding maybe a billion) per day—over 5 billion queries (encompassing around 30 billion calculations) per month.
The Launch of Wolfram|AlphaWatch a live webcast of the Wolfram|Alpha system being brought online for the first time on
- Friday, May 15, beginning at 7pm CST
The First Killer App of The New Kind of ScienceThe Genius behind Wolfram|Alpha is Stephen Wolfram. He is best know for his ambitious projects: Mathematica and A New Kind of Science (NKS). May 14, 2009 marks the 7th anniversary of the publication of his book A New Kind of Science. Stephen explains is his blog post: But for me the biggest thing that’s happened this year is the emergence of Wolfram|Alpha. Wolfram|Alpha is, I believe, going to be the first killer app of NKS.
StatusThat it should be possible to build Wolfram|Alpha as it exists today in the first decade of the 21st century was far from obvious. And yet there is much more to come. As of now, Wolfram|Alpha contains 10+ trillion of pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains. Built with Mathematica—which is itself the result of more than 20 years of development at Wolfram Research—Wolfram|Alpha's core code base now exceeds 5 million lines of symbolic Mathematica code. Running on supercomputer-class compute clusters, Wolfram|Alpha makes extensive use of the latest generation of web and parallel computing technologies, including webMathematica and gridMathematica.
How Mathematica Made Wolfram|Alpha Possible?Wolfram|Alpha is a major software engineering development to make all systematic knowledge immediately computable by anyone. It is developed and deployed entirely with Mathematica—in fact, Mathematica has uniquely made Wolfram|Alpha possible. Here's why.
- Computational knowledge and intelligence
- High-performance enterprise deployment
- One coherent architecture
- Smart method selection
- Dynamic report generation
- Database connectivity
- Built-in, computable data
- High-level programming language
- Efficient text processing and linguistic analysis
- Wide-ranging, automated visualization capabilities
- Automated importing
- Development environment
- About Wolfram|Alpha
- Wolfram|Alpha Blog
- Introduction to Wolfram|Alpha screencast
- A New Kind of Science
- Wolfram Research - Mathematica
- How Mathematica Made Wolfram|Alpha Possible?
- The Secret behind the Computational Engine in Wolfram|Alpha
- More info expected after the launch... stay tuned!
An interesting post on DataCenterKnowledge!
- 1&1 Internet: 55,000 servers
- Rackspace: 50,038 servers
- The Planet: 48,500 servers
- Akamai Technologies: 48,000 servers
- OVH: 40,000 servers
- SBC Communications: 29,193 servers
- Verizon: 25,788 servers
- Time Warner Cable: 24,817 servers
- SoftLayer: 21,000 servers
- AT&T: 20,268 servers
- iWeb: 10,000 servers
- How about Google, Microsoft, Amazon, eBay, Yahoo, GoDaddy, Facebook? Check out the post on DataCenterKnowledge and of course here on highscalability.com!
Is there any type of server technology that allows visitors to a website to become part of the server? Like with bittorrent, users share some of their bandwidth, so would this be possible with web servers where a person goes to a website, downloads and runs the software which makes their internet connection and cpu and hdd become part of the web server?
GemFire Enterprise is in-memory distributed data management platform that pools memory (and CPU, network and optionally local disk) across multiple processes to manage application objects and behavior. With the 6.0 release, GemFire has reached a stage of maturity in its evolution. GemStone touts this version as the true 'best of breed' distributed caching technology, solving scalability issues in all industries.