Learn How to Exploit Multiple Cores for Better Performance and Scalability

InfoQ has an excellent talk by Brian Goetz on the new features being added to Java SE 7 that will allow programmers to fully exploit our massively multi-processor future. While the talk is about Java, it's really more general than that, and there's a lot here for everyone to learn.

Brian starts with a short, coherent, and compelling explanation of why programmers can't expect to be saved by ever faster CPUs and why we must learn to exploit the strengths of multiple core computers to make our software go faster.

Some techniques for exploiting multiple cores are given in an equally short, coherent, and compelling explanation: why divide and conquer is the secret to multi-core bliss, how fork-join works, how the Java approach differs from map-reduce, and lots of other juicy topics.

The multi-core "problem" is only going to get worse. Tilera founder Anant Agarwal estimates that by 2017 embedded processors could have 4,096 cores, server CPUs might have 512 cores, and desktop chips could use 128 cores. Some disagree, saying this is too optimistic, but Agarwal maintains the number of cores will double every 18 months.

An abstract of the talk follows, though I would highly recommend watching the whole thing. Brian does a great job.

Why is Parallelism More Important Now?

  • Coarse grain concurrency was all the rage for Java 5. The hardware reality has changed: the number of cores is increasing, so applications must now search for fine grain parallelism (fork-join).
  • As hardware becomes more parallel, with more and more cores, software has to find techniques that expose more and more parallelism to keep the hardware busy.
  • Clock rates increased exponentially over the last 30 years or so. That allowed programmers to be lazy, because a faster processor would soon be released that saved your butt. There wasn't a need to tune programs.
  • That wait-for-a-faster-processor game is up. Around 2003 clock rates stopped increasing. We hit the power wall: faster processors require more power, more power requires thinner conductor lines on the chip, and the thinner lines can't dissipate the increased power without overheating, which affects the resistance characteristics of the conductors. So you can't keep increasing clock rate.
  • The fastest Intel CPU 4 or 5 years ago ran at 3.2 GHz. Today it's about the same or even slower.
  • It's easier to build 2.6 GHz or 2.8 GHz chips. Moore's law wasn't repealed, so we can still cram more transistors on each wafer. More processing power can be put on a chip, which leads to putting more and more processing cores on a chip. This is multicore.
  • Multicore systems are the trend. The number of cores will grow at an exponential rate for the next 10 years: 4 cores at the low end, with 256-core (Sun) and 800-core (Azul) systems at the high end.
  • More cores per chip instead of faster chips. Moore's law has been redirected to multicore.
  • The problem is that it's harder to make a program go faster on a multicore system. A faster chip will run your program faster, but if you have 100 cores your program won't go faster unless you explicitly design it to take advantage of those cores.
  • There's no free lunch anymore. You must now be able to partition your program so it can run faster on multiple cores, and you must be able to keep doing that as the number of cores keeps growing.
  • We need a way to specify programs so they can be made parallel as topologies change by adding more cores.
  • As hardware evolves, platforms must evolve to take advantage of the new hardware. We started off with coarse grain tasks, which was sufficient given the number of cores. This approach won't work as the number of cores increases.
  • We must find finer-grained parallelism. Sorting and searching data are good examples: the opportunities are around the data. Data to be sorted can be chunked, the chunks sorted in parallel, and the results brought back together with a merge. Searching can be done in parallel by searching subregions of the data and merging the results.
  • Parallel solutions use more CPU in aggregate because of the coordination needed and because the data has to be handled more than once (the merge). But the result comes back sooner because the work is done in parallel. This adds business value: faster is better for humans.
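As a concrete illustration of the chunk-sort-merge idea above, here is a minimal sketch using the fork-join classes that eventually shipped in java.util.concurrent. The class name and threshold value are illustrative choices, not from the talk:

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Divide-and-conquer merge sort: recursively split the array until a chunk
// is small enough to sort sequentially, then merge the sorted halves.
class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 8_192; // below this, sequential wins
    private final int[] a;
    private final int lo, hi; // sorts the half-open range a[lo..hi)

    ParallelMergeSort(int[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(a, lo, hi); // small chunk: plain sequential sort
            return;
        }
        int mid = (lo + hi) >>> 1;
        invokeAll(new ParallelMergeSort(a, lo, mid),   // fork both halves,
                  new ParallelMergeSort(a, mid, hi));  // wait for them to finish
        merge(mid);
    }

    // Merge the sorted ranges a[lo..mid) and a[mid..hi) back into a[lo..hi).
    private void merge(int mid) {
        int[] left = Arrays.copyOfRange(a, lo, mid);
        for (int i = 0, j = mid, k = lo; i < left.length; k++) {
            if (j == hi || left[i] <= a[j]) a[k] = left[i++];
            else                            a[k] = a[j++];
        }
    }

    static void sort(int[] a) {
        new ForkJoinPool().invoke(new ParallelMergeSort(a, 0, a.length));
    }
}
```

Note the aggregate-cost trade-off from the bullet above: the merge touches the data a second time, but wall-clock time drops because the chunks sort in parallel.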

    What has Java 7 Added to Support Parallelism?

  • Example problem is to find the max number from a list.
  • The coarse grained threading approach is to use a thread pool, divide up the numbers, and let the task pool compute the subproblems. A shared task pool slows down as the number of tasks increases, which forces the work to be more coarse grained. There's no way to load balance. The code is ugly and doesn't match the problem well. The runtime is dominated by how long the longest subtask takes to run. And you had to decide up front how many pieces to divide the problem into.
  • The solution uses divide and conquer: divide the set into pieces recursively until the problem is so small that the sequential solution is more efficient, sort the pieces, then merge the results. It's O(n log n), but the problem is parallelizable. It scales well and can keep many CPUs busy.
  • Divide and conquer uses fork-join to fork off subtasks, wait for them to complete, and then join the results. A typical thread pool solution is not efficient here: it creates too many threads, and creating threads is expensive and uses a lot of memory.
  • This approach is portable because it's abstract: it doesn't know how many processors are available, and it's independent of the topology.
  • The fork-join pool is optimized for fine grained operations, whereas the thread pool is optimized for coarse grained operations. It's best used for problems without IO: just CPU-bound computations that tend to fork off subproblems. It allows data to be shared read-only and used across different computations without copying.
  • This approach scales nearly linearly with the number of hardware threads.
  • The goals for fork-join: avoid context switches; have as many threads as hardware threads and keep them all busy; minimize queue lock contention for data structures; avoid a common task queue.
  • The implementation uses work stealing. Each thread has a work queue that is a double-ended queue. Each thread pulls work from the head of its own queue and processes it. When there's nothing to do, it steals work from the tail of another thread's queue. There's no contention for the head because only one thread accesses it, and contention on the tail is rare because stealing is infrequent: the stolen work is large, so it takes the thief a long time to process. The process starts with one task, which breaks up the work; other threads steal work and continue the same process. The load balances without central coordination, with few context switches and little coordination.
  • The same approach also works for graph traversal, matrix operations, linear algebra, modeling, and game playing (generate moves and evaluate the results). Latent parallelism can be found in a lot of places once you start looking.
  • Higher level operations like ParallelArray are supported. You can specify filtering, transformation, and aggregation operations. It's not a generalized in-memory database, but it has a very transparent cost model: it's clear how many parallel operations are happening, so you can look at the code, quickly see what's a parallel operation, and know the cost.
  • It looks like map-reduce, except this scales across a multicore system within a single JVM, whereas map-reduce scales across a cluster. The strategy is the same: divide and conquer.
  • The idea is to make specifying parallel operations so easy you wouldn't even think of the serial approach.
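The find-max example from the talk, rewritten as a fork-join task, might look like this minimal sketch (the class name and threshold are illustrative; RecursiveTask and ForkJoinPool are the java.util.concurrent classes that arrived with this work):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Find the maximum of an array with divide and conquer: a task either scans
// its small slice sequentially or forks a subtask for one half, computes the
// other half itself, and joins the two answers.
class MaxFinder extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 10_000; // below this, a loop is faster
    private final int[] a;
    private final int lo, hi; // searches the half-open range a[lo..hi)

    MaxFinder(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override
    protected Integer compute() {
        if (hi - lo <= THRESHOLD) {
            int max = Integer.MIN_VALUE;
            for (int i = lo; i < hi; i++) max = Math.max(max, a[i]);
            return max;
        }
        int mid = (lo + hi) >>> 1;
        MaxFinder left = new MaxFinder(a, lo, mid);
        left.fork();                                        // queue left half for stealing
        int rightMax = new MaxFinder(a, mid, hi).compute(); // do right half in this thread
        return Math.max(left.join(), rightMax);
    }

    static int max(int[] a) {
        return new ForkJoinPool().invoke(new MaxFinder(a, 0, a.length));
    }
}
```

Notice there's no up-front decision about how many pieces to create: the recursion adapts to the input size, and idle workers steal the forked halves, which is exactly the load balancing the shared-thread-pool version lacked.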

    Related Articles

  • The Free Lunch Is Over - A Fundamental Turn Toward Concurrency in Software By Herb Sutter
  • Intuition, Performance, and Scale by Dan Pritchett
  • "Multi-core Mania": A Rebuttal by Ted Neward
  • CPU designers debate multi-core future by Rick Merritt
  • Multicore puts screws to parallel-programming models by Rick Merritt
  • Challenges in Multi-Core Era – Part 1 and Part 2 by Gaston Hillar
  • Running multiple processes to understand multicore CPUs power.
  • Learning to Program all Over Again by Vineet Gupta