Today, most banks have migrated their internal software development from C/C++ to the Java language because of well-known advantages in development productivity (Java Platform), robustness & reliability (Garbage Collector) and platform independence (Java Bytecode). They may even have gotten better throughput performance through the use of standard architectures and application servers (Java Enterprise Edition). Among the few banking applications that have not been able to benefit yet from the Java revolution, you find the latency-critical applications connected to the trading floor. Why? Because of the unpredictable pauses introduced by the garbage collector which result in significant jitter (variance of execution time). In this post Frederic Pariente Engineering Manager at Sun Microsystems posted a summary of a case study on how the use of Sun Real Time JVM and GigaSpaces was used in the context of of a customer proof-of-concept this summer to ensure guaranteed latency per message under 10 msec, with no code modification to the matching engine.
hi, What are the practices followed, tools used to measure session memory requirement per user? Thanks, Unmesh
Every day brings news of either more failures of the financial systems or out-right fraud, with the $50 billion Bernard Madoff Ponzi scheme being the latest, breaking all records. This post provide a technical overview of a solution that was implemented for one of the largest banks in China. The solution illustrate how one can use Excel as a front end client and at the same time leverage cloud computing model and mapreduce as well as other patterns to scale-out risk calculations. I'm hoping that this type of approach will reduce the chances for seeing this type of fraud from happening in the future.
Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: For archiving small (less than 4KB) or medium-size data items (<100MB) in real-time so that the data can survive K - 1 disk breaks, where K is the desired number of replicas, without any downtime, in a manner that scales to terabytes of data. In addition to storing, Ringo should be able to retrieve individual or small sets of data items with low latencies (<10ms) and provide a convenient on-disk format for bulk data access. Ringo is compatible with the map-reduce framework Disco and it was started at Nokia Research Center Palo Alto.
This article is a primer, intended to shine some much needed light on the logical, process oriented implementations of database scalability strategies in the form of a broad introduction. More specifically, the intent is to elaborate on the majority of these implementations by example.
The SHOP.COM Cache System is now available at http://code.google.com/p/sccache/ The SHOP.COM Cache System is an object cache system that... * is an in-process cache and external, shared Cache * is horizontally scalable * stores cached objects to disk * supports associative keys * is non-transactional * can have any size key and any size data * does auto-GC based on TTL * is container and platform neutral It was built in-house at SHOP.COM (by me) and has powered our website for years. We are open-sourcing it in the hope that it will be useful to others and to get some help in its maintenance. This is our first open source attempt and we'd appreciate any help and comments.
I thought with the job situation these days that people might be interested in some open jobs at Facebook. Here's what's available:
Facebook is hiring! We are looking for a Systems Engineer/Architect and Site Reliability Engineer. I have attached the job descriptions below. If you are interested, please contact Michelle Bostock mbostock-at-facebook.com. Thanks and Happy Holidays! Systems Architect Palo Alto, CA Description Facebook is seeking a seasoned Systems Architect to join the Operations team. The position is full-time and is based in our main office in downtown Palo Alto and will report to the Manager of Systems Operations. Responsibilities * Analyze application flow and infrastructure design to improve performance and scalability of the site * Collaborate on design of services infrastructure from servers to networking * Monitor, analyze, and make recommendations as appropriate to improve site stability and availability * Evaluate hardware and software technologies to improve site efficiency and performance * Troubleshoot and solve issues with hardware, applications, and network components * Lead team efforts from design to implementation, prioritize tasks and resources while interacting with Engineering and Operations * Document current and future configuration processes and policies * Participate in 24x7 on-call support Requirements * B.S. in Computer Science or equivalent experience * 4+ years of experience in Operations with large web farms * Extensive knowledge of web architecture and technologies, including Linux, Apache, MySQL, PHP, TCP/IP, security, HTTP, LDAP and MTAs * Strong background/interest in application and infrastructure design * Scripting and programming skills * Excellent verbal and written communication skills
Site Reliability Engineer Palo Alto, CA Description Facebook is seeking talented operations engineers to join the Site Reliability Engineering team. The ideal candidate will have strong communication skills, a passion for tinkering with Linux, and an almost insane fondness for fast-paced, seat-of-your-pants troubleshooting and crisis management. The position is full-time and is based in our main office in downtown Palo Alto. This position reports to the Manager of Site Reliability Engineering. Responsibilities * Monitor the stability and performance of the website * Remotely troubleshoot and diagnose hardware problems * Debug issues with Linux software, applications and network * Resolve technical challenges encountered in LAMP technologies * Develop and maintain monitoring tools and automation systems * Predict and respond to utilization variances across multiple datacenters * Identify and triage all outage related events * Facilitate communication, coordinate escalation, and work with subject matter experts to implement critical fixes * Automate and streamline processes * Track issues and run reports Requirements * 2-3 years+ Linux support/sys admin experience in an Internet operations environment * BA/BS in Computer Science or a related field, or equivalent experience * Working knowledge of Linux, Cisco, TCP/IP, Apache and mySQL * Experience working with network management systems and monitoring tools, such as Nagios, Ganglia and Cacti * Competency in Shell, PHP, Perl or Python. C is a plus * Solid understanding of web services architecture and commonly employed technologies * A sense of urgency in responding to and resolving critical issues that relate to the performance of the site and/or core infrastructure * Excellent verbal and written communication skills * Participation in a shifted coverage schedule, including working nights and on-call rotations
How to scale MySQL on a 32 core system with 256 threads? Diagonal scalability in a box. An impressive benchmark that achieved more than 79,000 SQL queries per second on a single 4 RU server! Is this real? If so what is the role of good old horizontal scalability? The goals of the benchmark:
- Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440
- Do it 21st century style i.e. with MySQL and ZFS , not 20th century style i.e with OraSybInf... and VxFS
- Do it with minimal tuning i.e as close as possible as out-of-the-box
Our latest strategy is taken from a great post by Paul Saab of Facebook, detailing how with changes Facebook has made to memcached they have:
...been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.
To scale Facebook has hundreds of thousands of TCP connections open to their memcached processes. First, this is still amazing. It's not so long ago you could have never done this. Optimizing connection use was always a priority because the OS simply couldn't handle large numbers of connections or large numbers of threads or large numbers of CPUs. To get to this point is a big accomplishment. Still, at that scale there are problems that are often solved.
Some of the problem Facebook faced and fixed:
When you read Paul's article keep in mind all the incredible number of man hours that went into profiling the system, not just their application, but the entire software hardware stack. Then add in the research, planning, and trying different solutions to see if anything changed for the better. It's a lot of work. Notice using a nifty new parallel language or moving to a cloud wouldn't have made a bit difference. It's complete mastery of their system that made the difference.
A summary of potential strategies:
You can find their changes on github, the hub that says "git."
This is an interesting and still relevant research paper by Jim Gray, Prashant Shenoy at Microsoft Research that examines the rules of thumb for the design of data storage systems. It looks at storage, processing, and networking costs, ratios, and trends with a particular focus on performance and price/performance. Jim Gray has an updated presentation on this interesting topic: Long Term Storage Trends and You. Robin Harris has a great post that reflects on the Rules of Thumb whitepaper on his StorageMojo blog: Architecting the Internet Data Center - Parts I-IV.