The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.
Art of Distributed
Part 1: Rethinking about distributed computing modelsI ‘m getting a lot of questions lately about the distributed computing, especially distributed computing model, and MapReduce, such as: What is MapReduce? Can MapReduce fit in all situations? How we can compares it with other technologies such as Grid Computing? And what is the best solution to our situation? So I decide to write about the distributed computing article in two parts. First one about the distributed computing model and what is the difference between them. In the second part I will discuss the reliability, and distributed storage systems. Download the article in PDF format. Download the article in MS Word format. I wait for your comments, and questions, and I will answer it in part two.
We are planning to be the first company to do a one million user load test and are looking for someone willing to be the first to have been subjected to such a test! Is YOUR site scalable enough? How do you KNOW? http://capcalblog.blogspot.com. Randy Hayes CapCal
The abstract for the talk given by Bob Ippolito, co-founder and CTO of Mochi Media, Inc:
Building large systems on top of a traditional single-master RDBMS data storage layer is no longer good enough. This talk explores the landscape of new technologies available today to augment your data layer to improve performance and reliability. Is your application a good fit for caches, bloom filters, bitmap indexes, column stores, distributed key/value stores, or document databases? Learn how they work (in theory and practice) and decide for yourself.Bob does an excellent job highlighting different products and the key concepts to understand when pondering the wide variety of new database offerings. It's unlikely you'll be able to say oh, this is the database for me after watching the presentation, but you will be much better informed on your options. And I imagine slightly confused as to what to do :-) An interesting observation in the talk is that the more robust products are internal to large companies like Amazon and Google or are commercial. A lot of the open source products aren't yet considered ready for prime-time and Bob encourages developers to join a project and make patches rather than start yet another half finished key-value store clone. From my monitoring of the interwebs this does seem to be happening and existing products are starting to mature. From all the choices discussed the column database Vertica seems closest to Bob's heart and it's the product they use. It supports clustering, column storage, compression, bitmapped indexes, bloom filters, grids, and lots of other useful features. And most importantly: it works, which is always a plus :-) Here's a summary of some of the points talked about in the presentation:
Distributed Database Musings
Simple Key-Value Store
Memcached Key-Value Stores
Tokyo Cabinet/Tyrant Key-Value Store
Redis Data Structure Store
CouchDB Document Database
MongoDB Document Database
MonetDB Column Database
LucidDB Column Database
Vertica Column Database
The GigaOM Network today announces its second Structure conference after the runaway success of the 2008 event. The Structure 09 conference returns to San Francisco, Calif., on June 25th, 2009. Structure 09 (http://structureconf.com) is a conference designed to explore the next generations of Internet infrastructure. Over a year ago, The GigaOM Network Founder Om Malik saw that the platforms on which we have done business for over a decade were starting to provide diminishing returns, and smart money was seeking new options. Structure 09 looks at the changing needs and rapid growth in the Internet infrastructure sector, and this year's event will consider the impact of the global economy. "I cannot remember a time when a new technology had so much relevance to our industry as cloud computing does in the current economic climate," said The GigaOM Network Founder Om Malik. "We all need to find ways to leverage what we have and cut costs without compromising future options. Infrastructure On Demand and Cloud Computing are very strong avenues for doing so and we will look for what practicable advice we can bring to our audience." "Structure 08 was a great experience for our audience and partners, and I am very pleased to be bringing it back again this year," said Malik. "Along with GigaOM Lead Writer Stacey Higginbotham and the conference program committee, I am bringing together what I intend to be one of the most authoritative programs for the cloud computing and Internet infrastructure space." The GigaOM Network is also announcing early speaker selections. Confirmed speakers include: Marc Benioff - Chairman and CEO, Salesforce.com Paul Sagan, President and CEO, Akamai Werner Vogels, CTO, Amazon Russ Daniels, VP and CTO, Cloud Services Strategy, HP Raj Patel, VP of Global Networks, Yahoo! Jonathan Heiliger, VP, Technical Operations, Facebook Greg Papadopoulos, CTO and EVP - Research and Development, Sun Microsystems Jack Waters, President, Global Network Services and CTO, Level 3 Communications Michael Stonebraker, PhD, CTO and Co-Founder, Vertica Systems David Yen, EVP and GM, Data Center Business Group, Juniper Networks Vijay Gill, VP Engineering, Google Yousef Khalidi, Distinguished Engineer, Microsoft Corporation Tobias Ford, Assistant VP, IT, AT&T Richard Buckingham, VP of Technical Operations, MySpace Lew Tucker, VP and CTO, Cloud Computing, Sun Microsystems Lloyd Taylor, VP Technical Operations, LinkedIn Michael Crandell, CEO and Founder, RightScale Jim Smith, General Partner, MDV-Mohr Davidow Ventures Bryan Doerr, CTO, Savvis Doug Judd, Principal Search Architect, Zvents Brandon Watson, Director, Azure Services Platform, Microsoft Jeff Hammerbacher, Chief Scientist, Cloudera Jason Hoffman, PhD, CTO, Joyent Mayank Bawa, CEO, Aster Data James Urquhart, Market Manager, Cloud Computing and Infrastructure, Cisco Systems Kevin Efrusy, General Partner, Accel Lew Moorman, CEO and Founder, Rackspace Joe Weinman, Strategy and Business Development VP, AT&T Business Solutions Peter Fenton, General Partner, Benchmark Capital David Hitz, Founder and Executive Vice President, NetApp James Lindenbaum, Co-Founder and CEO, Heroku Joseph Tobolski, Director of Cloud Computing, Accenture Steve Herrod, CTO and Sr. VP of R&D, VMware Further Details can be found at the Structure 09 Website http://events.gigaom.com/structure/09/ High Scalability readers can register with a $50 discount at http://structure09.eventbrite.com/?discount=HIGHSCALE
Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times.
FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries whereas other optimal methods can not.It's not all just map-reduce and add more servers until your attic is full.
The Presentations of the MySQL Conference & Expo 2009 held April 20-23 in Santa Clara is available on the above link.
- Beginner's Guide to Website Performance with MySQL and memcached by Adam Donnison
- Calpont: Open Source Columnar Storage Engine for Scalable MySQL DW by Jim Tommaney
- Creating Quick and Powerful Web Applications with MySQL, GlassFish, and NetBeans by Arun Gupta
- Deep-inspecting MySQL with DTrace by Domas Mituzas
- Distributed Innodb Caching with memcached by Matthew Yonkovit and Yves Trudeau
- Improving Performance by Running MySQL Multiple Times by MC Brown
- Introduction to Using DTrace with MySQL by Vince Carbone
- MySQL Cluster 7.0 - New Features by Johan Andersson
- Optimizing MySQL Performance with ZFS by Allan Packer
- SAN Performance on a Internal Disk Budget: The Coming Solid State Disk Revolution by Matthew Yonkovit
- This is Not a Web App: The Evolution of a MySQL Deployment at Google by Mark Callaghan
There are a lot of questions about the server components, and how to choice and/or build perfect server with consider the power consumption. So I decide to write about this topic.
- What kind of components the servers needs
- The Green Computing and the Servers components.
- How much power the server consume.
- Choice the right components: Processors, HDD, RAID, Memory
- Build Server, or buy?
Hello highscalability world. I just discovered this site yesterday in a search for a scalability resource and was very pleased to find such useful information. I have some questions regarding distributed caching that I was hoping the scalability intelligentsia trafficking this forum could answer. I apologize for my lack of technical knowledge; I'm hoping this site will increase said knowledge! Feel free to answer all or as much as you want. Thank you in advance for your responses and thank you for a great resource! 1.) What are the standard benchmarks used to measure the performance of memcached or mySQL/memcached working together (from web 2.0 companies etc)? 2.) The little research I've conducted on this site suggests that most web 2.0 companies use a combination of mySQL and a hacked memcached (and potentially sharding). Does anyone know if any of these companies use an enterprise vendor for their distributed caching layer? (At this point in time I've only heard of Jive software using Coherence). 3.) In terms of a web 2.0 oriented startup, what are the database/distributed caching requirements typically needed to get off the ground and grow at a fairly rapid pace? 4.) Given the major players in the web 2.0 industry (facebook, twitter, myspace, PoF, Flickr etc, I'm ignoring google/amazon here because they have a proprietary caching layer) what is the most common, scalable back-end setup (mySQL/memcached/sharding etc)? What are its limitations/problems? What features does said setup lack that it really needs? Thank you so much for your insight!
We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up.
In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time.
Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
Read more about this study here (PDF - you can download also)