SmugMug's CEO & Chief Geek Don MacAskill smugly (hard to resist) gushes over finally finding, after a long and arduous quest, their "best bang-for-the-buck storage array." It's the Dell MD3000. His in-depth explanation of why he prefers the MD3000 should help anyone with their own painful storage deliberations. His key points are: the price is right; DAS via SAS, 15 spindles at 15K RPM each, 512MB of mirrored battery-backed write cache; you can disable read caching; you can disable read-ahead prefetching; the stripe sizes are configurable up to 512KB; the controller ignores host-based flush commands by default; and they support an ‘Enhanced JBOD’ mode. His reasoning for the desirability of each option is astute, and he even gives you the configuration settings for carrying each one out. This is not your average CEO. Don also speculates that a three-tier system using flash (system RAM + flash storage + RAID disks) is a possible future direction. Unfortunately, flash may not be the dream solution it has been thought to be. StorageMojo talks about this in Flash vs disk at DISKCON 2007.
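As a rough illustration of the kind of tuning Don describes, here is what disabling read cache and prefetch might look like through SMcli, the command-line side of the MD3000's management software. This is a sketch only: the virtual disk name and controller IP are made up, and the parameter names are assumptions borrowed from SANtricity-style CLIs, so verify everything against Dell's documentation.

```
# Hypothetical SMcli session -- parameter names are assumptions,
# not verified against the MD3000 documentation.

# Disable read caching and read-ahead prefetch on a virtual disk
# named "web1" (name and controller IP are made up):
SMcli 192.168.1.50 -c "set virtualDisk [\"web1\"] readCacheEnabled=FALSE cacheReadPrefetch=FALSE;"

# Stripe (segment) size is chosen at creation time, here 512KB:
SMcli 192.168.1.50 -c "create virtualDisk diskGroup=1 capacity=500GB segmentSize=512 userLabel=\"web1\";"
```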
My company is developing a centralized web platform to service our clients. We currently use about 3Mb/s on our uplink at our ISP serving web pages for about 100 clients. We'd like to offer them statistics that mean something to their businesses and have been contemplating writing our own statistics code to handle the task. All statistics would be gathered at the page-view level, and we're implementing an HttpModule in ASP.Net 2.0 to handle the gathering of the data. That said, I'm curious to hear comments on writing this data (~500 bytes of log data per page request). We need to write this data somewhere and then build a process to aggregate it into a warehouse application used in our reporting system. Google Analytics is out of the question because we do not want our hosting infrastructure dependent upon a remote server. WebTrends et al. are too expensive for our clients. I'm thinking of a couple of options:

1) Write log data directly to a SQL Server 2000 db and have a Windows Service come in periodically to summarize and aggregate the data to the reporting server. I'm not sure this will scale under higher load, and the aggregation process may time out because of the number of inserts being sent to the table.

2) Write the log data to a structure in memory on the web server and periodically flush the data to the db (a sketch of this pattern follows the post). The fear here is that the web server goes down and we lose all the data in memory. Other fears are that the IIS processes and worker threads might mangle one another when contending for the shared in-memory structure.

3) Don't use memory and write to a file instead. Save the file handle as an application variable and use it for all accesses to the file. I'm not sure about threading issues here either, and I'm reluctant to use anything which might corrupt a log file under load.

4) Add comment data to the IIS logs. This theoretically should remove the threading issues, but leaves me to think that the data would not be terribly useful once it's in the IIS logs.

The major driver here is that we do not want to use any of the web sites and canned reports built into 90% of all statistics platforms. Our users shouldn't have to "leave" the customer care portal we're creating just to see stats for their sites. IFrames are not an option. I'm looking for a solution that's not overly complex or expensive and that will give me access to the data we need to record on page views. It has to scale with volume. Thoughts are appreciated. Derek
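Option 2, for what it's worth, is essentially a producer/consumer buffer with a timed flush, and the threading worries mostly dissolve if the buffer is a lock-free queue drained by a single background thread. The question is about ASP.Net, but the shape is the same in any runtime; here is a minimal Java sketch of the pattern (the class and method names are made up, and the database write is stubbed out):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of option 2: request threads enqueue ~500-byte log entries,
// a single background task drains the queue to the database on a timer.
// PageViewLogger and writeBatchToDb are made-up names for illustration.
public class PageViewLogger {
    private final ConcurrentLinkedQueue<String> buffer =
        new ConcurrentLinkedQueue<String>();
    private final ScheduledExecutorService flusher =
        Executors.newSingleThreadScheduledExecutor();

    public PageViewLogger() {
        // Flush every 30 seconds; the interval bounds how much you lose on a crash.
        flusher.scheduleAtFixedRate(new Runnable() {
            public void run() { flush(); }
        }, 30, 30, TimeUnit.SECONDS);
    }

    // Called from request-handling threads; lock-free, never blocks on I/O.
    public void log(String entry) {
        buffer.add(entry);
    }

    private void flush() {
        List<String> batch = new ArrayList<String>();
        String entry;
        while ((entry = buffer.poll()) != null) {
            batch.add(entry);
        }
        if (!batch.isEmpty()) {
            writeBatchToDb(batch); // one batched INSERT instead of one per page view
        }
    }

    private void writeBatchToDb(List<String> batch) {
        // A JDBC batch insert would go here; stubbed out in this sketch.
    }
}
```

The trade-off is exactly the one named in the question: anything still buffered when the process dies is lost, so the flush interval is the knob that bounds your data loss.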
There's a new clustered file system on the spindle: Kosmos File System (KFS). Thanks to Rich Skrenta for turning me on to KFS; I think his blog post says it all. KFS is an open source project written in C++ by search startup Kosmix. The team members have a good pedigree, so there's a better than average chance this software will be worth considering. After you stop trying to turn KFS into "Kentucky Fried File System" in your mind, take a look at KFS' intriguing feature set:
Sequoia is a transparent middleware solution offering clustering, load balancing and failover services for any database. Sequoia is the continuation of the C-JDBC project. The database is distributed and replicated among several nodes and Sequoia balances the queries among these nodes. Sequoia handles node and network failures with transparent failover. It also provides support for hot recovery, online maintenance operations and online upgrades.
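The "transparent" part means applications talk to the cluster through an ordinary JDBC driver rather than to any one database node. Below is a minimal sketch of what that looks like; the driver class name and URL scheme follow my recollection of Sequoia's conventions and should be checked against its documentation, and the host and database names are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: connecting to a Sequoia-fronted cluster over JDBC.
// Driver class and URL scheme are assumptions from Sequoia-era docs;
// hosts, database name, and credentials are made up.
public class SequoiaExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.continuent.sequoia.driver.Driver");
        // Listing several controllers lets the driver fail over between them.
        Connection conn = DriverManager.getConnection(
            "jdbc:sequoia://node1,node2/mydb", "user", "secret");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT 1");
        while (rs.next()) {
            System.out.println(rs.getInt(1)); // queries are balanced across backends
        }
        conn.close();
    }
}
```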
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.
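To give a feel for the per-node footprint, joining a machine to a monitored cluster is mostly a matter of a short gmond configuration. The fragment below follows the Ganglia 3.x config style; the cluster name is made up and the multicast address is the conventional Ganglia default, but treat the exact key names as assumptions to check against your version:

```
# Hypothetical gmond.conf fragment (Ganglia 3.x style); key names
# should be verified against your version's documentation.
cluster {
  name = "web-frontends"   # made-up cluster name
}
udp_send_channel {
  mcast_join = 239.2.11.71 # Ganglia's conventional multicast group
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
tcp_accept_channel {
  port = 8649              # gmetad polls nodes here for XML cluster state
}
```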
If you have a lot of static content to store and you aren't looking forward to setting up and maintaining your own giganto SAN, maybe you can push off a lot of the heavy lifting to a CDN? Jesse Robbins at O'Reilly Radar posts that you have a lot more options now because the number of Content Distribution Networks has doubled since last year. In fact, Dan Rayburn says there are now 28 CDN providers in the market. Hopefully you can find reasonable pricing at one of them. Other than easing your burden, why might a CDN work for you? Because it makes your site faster, and customers like that. How can a CDN so dramatically improve your site's performance? Steve Souders, author of High Performance Web Sites: Essential Knowledge for Front-End Engineers, lists using a CDN as one of his "Thirteen Simple Rules for Speeding Up Your Web Site." About CDNs Steve says:
Remember that 80-90% of the end-user response time is spent downloading all the components in the page: images, stylesheets, scripts, Flash, etc. This is the Performance Golden Rule, as explained in The Importance of Front-End Performance. Rather than starting with the difficult task of redesigning your application architecture, it's better to first disperse your static content. This not only achieves a bigger reduction in response times, but it's easier thanks to content delivery networks. ... At Yahoo!, properties that moved static content off their application web servers to a CDN improved end-user response times by 20% or more. Switching to a CDN is a relatively easy code change that will dramatically improve the speed of your web site.

It's at least worth looking into if you're after a performance boost or are concerned about storing so many buckets of bits.
Hi, Can someone teach me how to implement network switch failover? We are paranoid about single points of failure. For example, you have: a dozen web servers -> switch -> DB cluster. That switch is a SPOF. How does one implement dual switches in a failover fashion?
Hi, Every application server has its own session management implementation for supporting high scalability. But an application architect/developer has to design and implement the application to make the best use of it. What are the guiding principles and patterns for session state management? The WebSphere system management Redbook mentions that "session management performance is optimum when session data per user is around 2KB. It degrades if session data is more than that." I have the following questions:

1. How do you measure session data per user?

2. It is generally recommended that you keep all the session state in a database and keep only the keys in the HttpSession object. Then every time a web request is processed, session data is fetched from the database. This way the data stays in memory only while the request is processed, and the actual data in HttpSession is very small (only a few keys; see the sketch after this post). What is the general practice? At what point should you switch from keeping data in HttpSession to the database? Do websites like Amazon or eBay follow this?

3. Is there any open source framework which helps you do session management in the way mentioned in point 2?

Thanks, Unmesh
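As a sketch of what question 2's pattern looks like in servlet terms: only a small key lives in HttpSession, and the bulky state is loaded from and written back to the database on each request. Cart and CartDao below are hypothetical stand-ins for your own domain objects, not any standard API:

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;
import java.util.UUID;

// Sketch of "keys in HttpSession, data in the database".
// Cart and CartDao are made-up names for illustration.
class Cart {
    final String id;
    Cart(String id) { this.id = id; }
}

interface CartDao {
    Cart findById(String id); // fetch the bulky state from the database
    void save(Cart cart);     // write it back
}

public class CartService {
    private final CartDao dao;

    public CartService(CartDao dao) { this.dao = dao; }

    public Cart loadCart(HttpServletRequest request) {
        HttpSession session = request.getSession();
        // Only a small key lives in the session, keeping per-user
        // session data well under the 2KB guideline.
        String cartId = (String) session.getAttribute("cartId");
        if (cartId == null) {
            cartId = UUID.randomUUID().toString();
            session.setAttribute("cartId", cartId);
            return new Cart(cartId);
        }
        return dao.findById(cartId); // bulky state fetched per request
    }

    public void saveCart(Cart cart) {
        dao.save(cart); // persisted before the response completes
    }
}
```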
This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included. Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles. Site: http://amazon.com
I have a few Apache servers (around 11 atm) serving a small amount of data (around 44 GB right now). For some time I have been using rsync to keep all the content equal on all servers, but the amount of data has been growing, and rsync takes too much time to "compare" all the data from source to destination, and creates a lot of I/O. I have been taking a look at MogileFS; it seems a good and reliable option, but as the FUSE module is not finished, we would have to rewrite all our apps, and that's not an option atm. Any ideas? I just want a real-time, non-resource-hungry alternative to rsync. If I get more features along the way, they are welcome :) Why do I prefer a distributed file system instead of NAS + NFS?

- I'd need 2 NAS boxes if I don't want a single point of failure, and NAS hardware is expensive.
- Non-shared hardware: every server has its own local disks.
- As files are replicated, I can save a lot of money; RAID is not a must.

Thanks in advance for your help, and sorry for my English :)
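For context, the job being described is presumably something like the cron-driven command below (paths and host names are made up). The pain point is inherent to the approach: rsync has to walk the entire tree on every run just to discover what changed, so scan time and I/O grow with the data even when almost nothing has changed:

```
# Hypothetical per-server sync job; paths and hosts are made up.
# rsync walks and compares the whole tree each run, which is where
# the time and I/O go as the data set grows.
rsync -a --delete /var/www/ web02:/var/www/
```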