Hi, I'm interested in people's thoughts on the best choice for a database clustering solution. I have a database that is mostly varchars and numbers and doesn't store any binary data at all. The workload is about 70% reads and 30% writes, though we're using memcached at the moment so the database isn't really hit that hard. We're currently using MySQL with m/cluster, but are interested in a new solution. Possible candidates so far are unicluster (which doesn't seem mature yet) or DRBD. Has anyone had a similar experience and can make any suggestions? Thanks
This presentation by Michael Radwin describes why Yahoo! standardized on PHP going forward. It describes how, after a review of all the web technologies including Yahoo!'s own internal ones, PHP was chosen. It shows that not only technical reasons, but also business and development processes, were taken into account.
In March, 2000, I did a talk about how we scaled with semi-static files while splitting data from presentation. For dynamic pages we used mod_perl doing an internal redirect with the XML on the style templates. Since then Apache 2.0 contains the concept of filters to allow for similar functionality.
Colm MacCarthaigh, Network Architect at Joost, gave this presentation at the UK Network Operators' Forum Meeting in Manchester on April 3rd, 2007.
Perdition is a fully featured POP3 and IMAP4 proxy server. It is able to handle both SSL and non-SSL connections and redirect users to a real-server based on a database lookup. Perdition supports modular database access. ODBC, MySQL, PostgreSQL, GDBM, POSIX Regular Expression and NIS modules ship with the distribution. The API for modules is open, allowing arbitrary modules to be written to access any data store. Perdition has many uses, including creating large mail systems where an end-user's mailbox may be stored on one of several hosts, integrating different mail systems together, migrating between different email infrastructures, and bridging plain-text, SSL and TLS services. It can also be used as part of a firewall. The use of Perdition to scale mail services beyond a single box is discussed in high capacity email.
Another scalability strategy brought to you by Erik Osterman: Just thought I'd drop a brief suggestion to anyone building a large mail system. Our solution for scaling mail pickup was to develop a sharded architecture whereby accounts are spread across a cluster of servers, each with IMAP/POP3 capability. Then we use a cluster of reverse proxies (Perdition) speaking to the backend IMAP/POP3 servers. The benefit of this approach is that you can simply use round-robin or HA load balancing on the Perdition servers that end users connect to (e.g. admins can easily move accounts around on the backend storage servers without affecting end users). Perdition manages routing users to the appropriate backend servers and has MySQL support. What we also liked about this approach was that it had no dependency on a distributed or networked file system, so there is less chance of corruption or data consistency issues. When an individual server reaches capacity, we just offload users to a less used server. If any server goes offline, it only affects the fraction of users assigned to that server.
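The routing idea above can be sketched in a few lines. This is only an illustration of a user-to-backend lookup table, not Perdition's actual code or API; the host names, addresses, and function names are hypothetical, and a plain dictionary stands in for the MySQL table the proxies would query.

```python
# Hypothetical account-to-backend map. In the setup described above this
# would be a database table (for example in MySQL) queried by each proxy.
user_map = {
    "alice@example.com": "mail1.internal:143",
    "bob@example.com": "mail2.internal:143",
    "carol@example.com": "mail1.internal:143",
}


def lookup_backend(user, table):
    """Return the backend server holding this user's mailbox, or None."""
    return table.get(user)


def move_account(user, new_backend, table):
    """Reassign a user to another backend.

    Clients keep connecting to the same proxy address, so the move is
    invisible to them once the mailbox data has been copied.
    """
    table[user] = new_backend


def affected_users(failed_backend, table):
    """List the users who lose service if one backend goes offline."""
    return sorted(u for u, b in table.items() if b == failed_backend)
```

Because every proxy answers the same lookup, any proxy can serve any user, which is why plain round-robin in front of the proxies is enough.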
From a reader:
> Was reading through your very interesting/useful site. Most of the architectures are non-J2EE. Does that mean that there aren't enough websites that are scalable (with a YouTube-like userbase) built with J2EE tech? Would like to know if there are any and their architecture as well.
eBay uses Java, but in a very pragmatic way. They use servlets, an application server, the JDK, and they do the rest themselves. They skip JSP, entity beans, and JMS. When you need to scale, putting all your eggs in one basket is a risky strategy. Why use JSP when you can do better? Why use entity beans when you can do better? Use servlets because they are a very effective way of handling HTTP requests. Use Java because it is fast, runs everywhere, and has a boatload of libraries you can use to build your custom system. Probably the major reason J2EE is absent is simply LAMP. LAMP is just so incredibly functional for most 2-tier shared-nothing sites that they don't need a better infrastructure for writing an application tier. Personally, I'm pretty excited about GWT, which uses Java and servlets. We'll see if that starts to take off a little bit more.
Hi, I am interested in some experienced advice for choosing switches for a colocated 2-tier architecture. I have the hardware chosen for the webservers, app servers, and db servers, but need some advice on the network switch in between: colocation port -> firewall (load balancer) -> 2+ web servers (app servers) -> gigabit switch -> DB server (possibly a cluster for future expansion). I am just starting out, and I wonder which rackmount gigabit switch to select for the private LAN between the app servers and the DB servers. Do I need a managed switch for that? Cisco switches are the best, but they are the most expensive... I am looking at possibly using Dell/Netgear gigabit switches. Thanks for any input
Amazon's EC2 sounds good, but how do you make use of all that throbbing CPU power? A few companies are stepping up to fill the how-to gap. Elastra provides unlimited on-demand creation of MySQL and PostgreSQL instances for $0.50/server/hour. They contend their clusters perform "nearly" as well as a local database deployed using local storage. RightScale says they "enable you to run your entire web business on Amazon Web Services with reliability, scalability and performance – and pushbutton control of complex system administration tasks." This includes web servers, DNS, and MySQL services. Prices start at $500 a month. Later I'll write more about these and other related services like 3tera, but these services are the canary in the coal mine, the face of change, the bellwether of the new data center. How we build scalable web sites is about to change.
This JoelOnSoftware thread asks the age-old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done. To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging death knell: it's an intermittent problem and customers are pissed. Fix it. Now. So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few network switches and routers along the way. Where did the fault happen? What went wrong? All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all. What you need to be able to do is trace through all relevant logs, pull together a time line of all relevant operations, and see what happened. And this is where trace/info etc. is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function.
You also need to know anything else interesting it did. So there are really no logging levels. You need to log everything that will help you diagnose any future problem. What you really need is a time machine, but you don't have one. But if you log enough state you can mimic a time machine. This is what will allow you to follow a request from start to finish and see if what you expect to be happening is actually happening. Did an interface drop a packet? Did a reply time out? Is a mutex on perma-lock? So many things can go wrong. Over time systems usually evolve to the point of logging everything. They start with little or no logging. Then problem by problem they add more and more logging. But the problem is that the logging isn't systematic or well thought out, which leads to poor coverage and poor performance. Logs are where you find anomalies. An anomaly is something unexpected, like operations happening that you didn't expect, in a different order than expected, or taking longer than expected. Anomalies have always driven science forward. Finding and fixing them will help make your system better too. They expose flaws you might not otherwise see. They broaden your understanding of how your system really responds to the world. So step back and take a look at what you need to debug problems in the field. Don't be afraid to add what you need to see how your system actually works. For example, every request needs to have assigned to it a globally unique sequence number that is passed with every operation related to the request, so all work for a request can be tied together. This will allow you to trace the comment add from the client all the way through the system. Usually when looking at log data you have no idea what work maps to which request. Once you know that, debugging becomes a lot easier. Every hop a request takes should log meta information about how long the request took to process, how big the request was, and what the status of the request was.
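A minimal sketch of the request ID and per-hop metadata idea follows. Everything in it is an assumption made for illustration: the function names are hypothetical, a random UUID stands in for the globally unique sequence number, and an in-memory list stands in for whatever log transport a real system would use.

```python
import json
import time
import uuid


def new_request_id():
    """Create an ID at the edge of the system; a random UUID is used here
    as a stand-in for a globally unique sequence number."""
    return uuid.uuid4().hex


def log_hop(log, request_id, hop, handler, payload):
    """Run one hop of a request and record its duration, size and status.

    `log` is any list-like sink, `handler` is the work done at this hop,
    and `payload` is the request body as bytes.
    """
    start = time.monotonic()
    status = "ok"
    try:
        return handler(payload)
    except Exception:
        status = "error"
        raise
    finally:
        # The record is written whether the hop succeeded or failed.
        log.append(json.dumps({
            "request_id": request_id,
            "hop": hop,
            "duration_ms": round((time.monotonic() - start) * 1000, 3),
            "request_bytes": len(payload),
            "status": status,
        }))


def timeline(log, request_id):
    """Pull every record for one request, in the order it was logged."""
    records = [json.loads(line) for line in log]
    return [r for r in records if r["request_id"] == request_id]
```

Passing the same `request_id` to each hop (spam filter, web server, database, and so on) is what lets `timeline` reassemble a single comment post out of logs that would otherwise be an undifferentiated stream.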
Logging this per-hop metadata will help you pinpoint latency issues and any outliers that happen with big messages. If you do this correctly you can simulate the running of the system completely from this log data. I am not being completely honest when I say there are no debugging levels. There are two levels: system and developer. System is logging everything you need to log to debug the system. It is never turned off. There is no need. System logging is always on. Developers can add more detailed log levels for their code that can be turned on and off on a module by module basis. For example, if you have a routing algorithm you may only want to see the detailed logging for that on occasion. The trick is that there are no generic info-type debug levels. You create a named module in your software with a debug level for tracing the routing algorithm. You can turn that on when you want and only that feature is impacted. I usually have a configuration file with initial debug levels. But then I make each process have a command port hosting a simple embedded web server and telnet processor so you can change debug levels and other settings on the fly through the web or telnet interface. This is pretty handy in the field and during development. I can hear many of you saying this is too inefficient. We could never log all that data! That's crazy! Not true. I've worked on very sensitive high performance real-time embedded systems where every nanosecond was dear and they still had very high levels of logging, even in driver land. It's in how you do it. You would be right if you logged everything within the same thread directly to disk. Then you are toast. It won't ever work. So don't do that. There are lots of tricks you can use to make logging fast enough that you can do it all the time: