My first post, please be gentle. I know it is long. You are all like doctors - the more info, the better the diagnosis. ----------- What is the best way to store a list of all of your friends in the memcached cache (a simple boolean saying “yes this user is your friend”, or “no”)? Think Robert Scoble (26,000+ “friends”) on Twitter.com. He views a list of ALL existing users, and in this list, his friends are highlighted. I came up with 4 possible methods: --store in memcache as an array, a list of all the "yes" friend ID's --store your friend ID's as individual elements. --store as a hash of arrays based on last 3 digits of friend's ID -- so have up to 1000 arrays for you. --comma-delimited string of ID's as one element I'm using the second one because I think it is faster to update. The single array or hash of arrays feels like too much overhead calculating and updating – and even just loading – to check for existence of a friend. The key is FRIEND[small ID#]_[big ID#]. The value is 1. This way there are no dupes. (I add u as friend, it always adds me as ur friend...I remove u, u remove me). Store with it 2 additional flags: One denotes start of entries. One denotes end of entries. As friends are added, the end flag position relative to new friends will become meaningless, but that is ok (I think). To see if someone is your friend, the system checks if both start and end flags exist. If both exist, it can check for existence of friend ID - if exists, then friend. Start flag is required. If start flag is pushed out of cache, we must assume some friends were also pushed out. Currently, the system loads from DB in a daemon in the background after you log in (if two flags are not already set). Until the two flags are set, it does db lookups. There is no timeout on the data in cache. Adding/removing friends to your account adds/removes to/from memcache - so, theoretically, it might never have to pre-load anything. Downside of my method is if the elements span multiple servers and one dies, you loose some of your friends (that's the upside of using arrays). I don't know how to resolve if the lost box didn't contain either of the flags -- in that case, the users' info will NEVER get refreshed. This is my concern. Any ideas? Thanks so much!!!
It's surprising that the blogosphere hasn't picked up the biggest difference in pricing: Google's datastore is less than a tenth of the price of Amazon's SimpleDB while offering a better API.If money matters to you then the burn rate under GAE could be convincingly lower. Let's compare the numbers: GAE pricing: * $0.10 - $0.12 per CPU core-hour * $0.15 - $0.18 per GB-month of storage * $0.11 - $0.13 per GB outgoing bandwidth * $0.09 - $0.11 per GB incoming bandwidth SimpleDB Pricing: * $0.14 per Amazon SimpleDB Machine Hour consumed * Structured Data Storage - $1.50 per GB-month * $0.100 per GB - all data transfer in * $0.170 per GB - first 10 TB / month data transfer out (more on the site) Clearly Google priced their services to be competitive with Amazon. We may see a response by Amazon in the near feature, but the database storage cost for GAE is dramatically cheaper at $0.15 - $0.18 per GB-month vs $1.50 per GB-month. Interestingly, Google's price is the same as Amazon's S3 (file storage) pricing. Google seems to think of database storage as more like file storage. That makes a certain amount of sense because BigTable is a layer on the Google File System. File system pricing may be the more appropriate price reference point. On SimpleDB a 1TB database costs $1,500/month and BigTable costs in the $180/month range. As you grow into ever larger data sets the difference becomes even more compelling. If you are a startup your need for funding just dropped another notch. It's hard to self-finance many thousands of dollars a month, but hundreds of dollars is an easy nut to make. Still, Amazon's advantage is they support application clusters that can access the data for free within AWS. GAE excels at providing a scalable two tier architecture for displaying web pages. Doing anything else with your data has to be done outside GAE, which kicks up your bandwidth costs considerably. How much obviously depends on your application. But if your web site is of the more vanilla variety the cost savings could be game changing.
Justin.tv is looking to hire a Scaling Engineer to help scale their video cluster, IRC server, web app, monitoring and search services. I've never seen this job title before. A quick search that showed only a few previous instances of it being used. Has anyone else seen Scaling Engineer as a job title before? It's a great idea. Scaling is certainly a worthy specialty of it's own. Why there's a difficult lingo, obscure tools, endlessly subtle concepts, a massive body of knowledge to master, and many competing religious factions. All a good start. Next I see a chain of Scalability Universities. Maybe use all those Starbucks that are closing down. Contact me for franchise opportunities :-)
Now you can buy more cores on EC2 without adding more machines:
Hi, I want to implement a search engine with lucene. To be scalable, I would like to execute search jobs asynchronously (with a job queuing system). But i don't know if it is a good design... Why ? Search results can be large ! (eg: 100+ pages with 25 documents per page) With asynchronous sytem, I need to store results for each search job. I can set a short expiration time (~5 min) for each search result, but it's still large. What do you think about it ? Which design would you use for that ? Thanks Mat
The following technical Webinar could be of interest to the community. WHO:
- Farhan "Frank" Mashraqi, Director of Business Operations and Technical Strategy, Fotolog Inc
- Monty Taylor, Senior Consultant, Sun Microsystems
- Jimmy Guerrero, Sr Product Marketing Manager, Sun Microsystems - Database Group
- Designing and Implementing Scalable Applications with Memcached and MySQL web presentation.
- Thursday, May 29, 2008, 10:00 am PST, 1:00 pm EST, 18:00 GMT
- The presentation will be approximately 45 minutes long followed by Q&A.
Hi, We're looking for a highly scalable way of scanning documents being uploaded and downloaded from our web application. I believe services like gmail and hotmail are using bespoke solutions from companies like Trend, but are there some quality "off the shelf" products out there that can easily be scaled out and have a "loose" API (HTTP based) for application integration? Once again, thanks for any input.
Customer: - Name - Country Product: - Code - Name - Description Purchases: - Reference to Product Entity - Reference to Customer Entity - Date of orderAnyone from a relational background would look at this schema and give it a big thumbs up. With a little effort we can imagine the original physical purchase order that has now been normalized into three different tables. To recreate the original purchase order a join on purchases, produce and customer is needed. Read speed is not optimized, safety is optimized. Here’s what the same schema looks like optimized for reading:
Purchase: - Customer Name - Customer Country - Product Code - Product Name - Purchase Order Number - Date Of OrderThe three original tables have been folded into one entity. Now a purchase order can be read in one get operation. No join necessary. Notice how the entity looks more like an original purchase order. It is also what would probably be cached and is what our model would probably look like. But what if you want to update a product name or a customer name? Those attributes are duplicated in all entities. Here’s where the protection offered by the relational model comes in. Only one entity needs updating in a normalized model. In BigTable you have to remember everywhere a customer name and product name and change every instance to new values. It’s not a simple, safe, or reliable approach. But it does optimize for read speed and scalability. For an application with a high proportion of updates to reads this approach wouldn’t make sense. But on the web reads usually dominate. How often do you really change a customer name or a product name? Seldom. How often do you read them? All the time. Designing to scale for reads and taking the pain on writes takes some getting used to. It’s a massive change to standard relational tactics. But this is what it takes to scale web applications, even if it feels a little strange at first.
Update 2: EBay's Randy Shoup spills the secrets of how to service hundreds of millions of users and over two billion page views a day in Scalability Best Practices: Lessons from eBay on InfoQ. The practices: Partition by Function, Split Horizontally, Avoid Distributed Transactions, Decouple Functions Asynchronously, Move Processing To Asynchronous Flows, Virtualize At All Levels, Cache Appropriately. Update: eBay Serves 5 Billion API Calls Each Month. Aren't we seeing more and more traffic driven by mashups composed on top of open APIs? APIs are no longer a bolt on, they are your application. Architecturally that argues for implementing your own application around the same APIs developers and users employ. Who hasn't wondered how eBay does their business? As one of the largest most loaded websites in the world, it can't be easy. And the subtitle of the presentation hints at how creating such a monster system requires true engineering: Striking a balance between site stability, feature velocity, performance, and cost. You may not be able to emulate how eBay scales their system, but the issues and possible solutions are worth learning from. Site: http://ebay.com
What's Inside?This information was adapted from Johannes Ernst's Blog
This website has been a great resource for helping me to understand the successful (and failed) scalable network designs from organizations that have actually done it, but I haven't seen any explicite explanations about secure remote administration of these systems. I understand that the *nix people love to SSH, and the windows gang has their RDP, but how does one go about creating a network architecture that both allows one to manage their systems and does its best to avoid hacker interest? As I imagine, no big website will have the SSH/RDP/FTP ports open on the web server, so how is it that they go about remotely administering their geographically diverse groups of servers securely?