
Linus: The whole "parallel computing is the future" is a bunch of crock.

Linus Torvalds in his usual politically correct way made a typically understated statement about “pushing the whole parallelism snake-oil” that generated almost no response whatsoever.

Well, not quite. His comment on Avoiding ping pong has generated hundreds of responses, both on the original post and on Reddit.

The contention:

The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage. Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics).

Nobody is ever going to go backwards from where we are today. Those complex OoO [Out-of-order execution] cores aren't going away. Scaling isn't going to continue forever, and people want mobility, so the crazies talking about scaling to hundreds of cores are just that - crazy. Why give them an ounce of credibility?

Where the hell do you envision that those magical parallel algorithms would be used?

The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.

So give up on parallelism already. It's not going to happen. End users are fine with roughly on the order of four cores, and you can't fit any more anyway without using too much energy to be practical in that space. And nobody sane would make the cores smaller and weaker in order to fit more of them - the only reason to make them smaller and weaker is because you want to go even further down in power use, so you'd still not have lots of those weak cores.

Give it up. The whole "parallel computing is the future" is a bunch of crock.

An interesting question to ponder on the cusp of a new year. What will programs look like in the future? Very different than they look today? Or pretty much the same?

From the variety of replies to Linus it's obvious we are in no danger of arriving at consensus. There was the usual discussion of the differences between distributed, parallel, concurrent, and multithreading, with each succeeding explanation more confusing than the last. The general gist being that how you describe a problem in code is not how it has to run. Which is why I was not surprised to see a mini-language war erupt.

The idea is that parallelization is a problem only because of the old-fashioned languages that are used. Use a better language and parallelization of the design can be separated from the runtime and it will all just magically work. There are echoes here of how datacenter architectures are now utilizing schedulers like Mesos to treat entire datacenters as a programmable fabric.

One of the more interesting issues raised in the comments was a confusion over what exactly is a server? Can a desktop machine that needs to run fast parallel builds be considered a server? An unsatisfying definition of a not-server may simply be a device that can comfortably run applications that aren't highly parallelized. 

I pulled out some of the more representative comments from the threads for your enjoyment. The consensus? There is none, but it's quite an interesting discussion...


In other words: "4 cores should be enough for anybody"


Parallelism ≠ Multithreading ≠ Distributed computing

just sayin

Linus Torvalds:

I can imagine people actually using 60 cores in the server space, yes. I don't think we'll necessarily see it happen on a huge scale, though. It's probably more effective to make bigger caches and integrate more of the IO on the server side too.

On the client side, there are certainly still workstation loads etc that can use 16 cores, and I guess graphics professionals will be able to do their photoshop and video editing faster. But that's a pretty small market in the big picture. There's a reason why desktops are actually shrinking.

So the bulk of the market is probably more in that "four cores and lots of integration, and make it cheap and low-power" market.

But hey, predicting is hard. Especially the future. We'll see. 

Patrick Chase:

While you're right that the specific issue Gabriele raised can be mitigated by a different language choice, that isn't generally true. There have been no fundamental technological breakthroughs that make parallelism easier/cheaper to achieve than it was, say, a decade ago. No magical compilers, no breakthrough methodologies or languages, and therefore no substantive mitigation of Amdahl's law.

While there is certainly pressure from the semiconductor process side that limits the pace of improvement of serial performance, I don't see anything that would actually shift the equilibrium/optimum in favor of smaller cores.

I therefore think 'anon' had it about right: We'll continue to muddle along by improving per-core performance where we can and increasing parallelism where we must. It's not a terribly sexy vision, but sometimes reality works that way. 

Maynard Handley:

We have not even really started on this re-education of programmers so that they do things in a better way (where "better" means using abstractions that are a better match for parallel programming). Our languages, APIs, and tools are still in an abysmal state, like we're using Fortran and so trying to force recursion or pointers into the language is like pulling teeth. Our tools don't have the sort of refactoring that today makes us not care about needing to add a new parameter to a chain of function calls. etc etc.

Patrick Chase:

It all depends on volume and rate of change. When volumes are low and/or the algorithm set is unstable you use commodity HW, of which the flavor of the decade is the GPU. When the volumes get higher and the algorithms more stable it pays to do custom HW (the fixed costs of a moderately complex ASIC on a modern process are O($10M) or higher when all is said and done, so you can do the math and figure out when it makes sense).

If the volumes are high but the algorithms are only partially stable, then it can make sense to do an ASIC that incorporates both fixed-function HW and programmability in the form of DSPs/GPUs/whatever. That's why Qualcomm sprinkles those 'Hexagon' doohickeys in all of their Snapdragons.


It is possible that Linus is both right and wrong at the same time. In fact I'd observe that people have been expecting for a while that fundamentally new languages were going to be necessary to really take advantage of the parallel hardware, so it's not like this is a new thought. We don't just need languages that make it "possible, if you put a lot of work into it and fundamentally fight the language's default structure", we need languages that make it the "easy default case to parallelize where possible". We are, at best, just starting to walk through that door. We do not have these languages in hand. (And I say this despite some fairly extensive experience in the languages you'd consider today's best options. We're just starting.)


Considering he's talking about MASSIVE parallelism (i.e. hundreds of cores) in a general purpose CPU I'd tend to agree with this.

Massively parallel GPUs, and the like, he's good with. 

Gabriele Svelto:

In today's mobile world "fast enough" might not be the end of it all as far as optimization goes. Since most computing devices now run on batteries making something that's fast enough for the user perception faster is actually an important goal: it allows you to save power. In this respect certain parallel methods provide a bad trade-off: all else being equal they can improve the overall speed of execution at the cost of increased computation (and often communication) over an equivalent serial approach. Now in practice not everything outside of computation is equal so you've got fixed costs to amortize too and overall faster execution might be more efficient nevertheless; but this just to point out that there's more variables to take into account nowadays when deciding if a workload needs to be sped up or not.


Parallelism is also still important on mobile, where a lot of stuff is going. The reason being is that getting the complete job done in a shorter time so you can go into power-save mode is nicer on the battery than partially using the CPU for longer. So you saturate it with parallel tasks to get everything done sooner.


But massive parallelism is not important on mobile. I don't think anyone is arguing against anything but massive parallelism (at least he mentioned "hundreds of processors" in his rant, IIRC). Believe me, I went and read it in the same mindset as you espouse, in that I expected another mindless rant from Linus and was more than ready to slap it down. But when I got to that part I had to agree.

My argument isn't that most apps aren't parallel, my argument is that most apps don't NEED to be. There are certainly apps that do need to be.

I believe Linus is basically arguing against software development managers who gratuitously slap parallelism into their app, just because it becomes a feature that they can sell up the management chain. 


Despite all the hoopla, compilers mostly suck at optimization. About the only thing that does a worse job of optimization is the average programmer…

Higher level expression of programs is a good idea. Good enough that a ten fold reduction in performance is absolutely worth it in 99% of cases. And much of the computing world is built on far worse performance than that. Just consider the vast use of scripting languages in web servers. 


Servers of all sorts, games, artificial intelligence, image / video processing, analytics, simulation and modelling

I'm struggling to think of any application other than simple GUI's where parallel processing couldn't be helpful (if it was easy)


We have waited. We have seen. The parallel everything idea really took hold when clock speeds stopped increasing 10 years ago. Since then we have seen multiple cores with shared cache. It is tempting since the cache takes a lot of space and synchronizing dedicated caches without slowing them down is difficult. The problem is that each core then doesn't run the same speed as a single core would. It has to fight over the limited shared cache and suffers more cache misses and slows down.

Give me a modern single core processor with a huge cache. It will run circles around a similarly sized 4-core processor and won't require special programming techniques.


There has been a trend for increasingly versatile GPGPU. I think there was a case for a sliding scale between a single-thread optimised machine and the massively parallel GPU - Sony tried to fill something in the middle with CELL (admittedly flawed in many ways), but increasingly general GPGPU-capable GPUs and SIMD+multicore CPUs took that space from either end and did a better job. Making use of 'gather' instructions, for example, is still an exercise in parallelism.


Depending on size of files and what you were doing with them, you likely would have gotten even better gains from using async IO instead of making a worker thread per core.


Linus speaks his mind in an unkind way, but I agree with much of what he says.

In my opinion, it's not that parallelism is necessarily bad; it's that programmers adopt it because they believe it will make their task run faster, without being aware of the overhead which comes with it - task switching, cache thrashing, lock contention, etc. And parallelism is harder to model mentally, so whole new categories of bugs arise. If your needs are well-suited to parallel implementations, it's worth doing.


haha, It absolutely is covered. A large number of small, clearly independent tasks is the "trivial" (I use that word lightly) case where parallelization is obvious... I hope I'm not being too liberal with Linus's words when I interpret this sort of thing as what he meant by "server side". If he has 4 cores, he can get all four blasting on 1/4 of the workload without worrying about changing the workload itself, since the units of work are (presumably) mostly independent of each other.

The post does not say "parallel is bad", it says that you should respect the law of diminishing returns. My gut feeling is that we are at or near "peak parallel" for the time being, until some large innovation comes along that turns computing on its head. I couldn't begin to speculate what that would be, but I think it would have to be fundamental. Put another way: http://xkcd.com/619/


They're related. Threading is often a means of achieving parallelism.

Parallelism could be multiple discrete processes all working on the same task. Even if those processes are running on entirely different systems. But in general, parallelism is any situation where the work is broken up into multiple workflows that execute at the same time.

Threading is a technique where a single process has multiple paths of execution, and there's some shared memory and separate stacks and all that jazz. Threading normally implies concurrency, but it isn't a rule. For example, running a multi-threaded program on a CPU with a single core probably means you're not parallel.

In short:

Parallel = Multiple things happening at once for a common goal

Threaded = Often a specific implementation of parallelism, but not necessarily.


Ummm. What? So if I want to do some "simulation or other scientific work" I should need to go out and buy/rent a server? Problem solved? No. This stuff might be better done on the server/cluster/cloud/whatever right now but the extent to which that's true only points to a weakness in the current "consumer" computers offerings.

If we redefine things so "consumer" computers are only for browsing the web, watching videos, and or playing some games then sure... but then appliances such as phone/tablet is probably good enough for that at this point so who really needs a "consumer" computer? It's also not the kind of computer use I'm interested in and I think there's a real market for computers that are good for... computing...


I am no expert, but I think the point Linus is trying to make here is that parallelizing work at the CPU core level has already reached a point of diminishing returns when dealing with desktop software, which is an unpredictable heterogenous workload.

In specific applications, where the code is written for it, such as graphics, scientific research, compiling, etc., there are gains to be made, but probably not enough to push the industry as a whole toward more and more cores at the cost of heat, electricity, processor lifespan, etc.

You can already buy machines with multiple multi-core CPUs if you need it for your application, but those are edge cases and the hardware is still relatively expensive.


The problem is parallelisation != threading, yet all the time people say things like "you shouldn't deal with parallelisation because you'll have to deal with locks, etc".

There are plenty of really good abstractions over parallelisation out there that make it much easier to work with, such as actor systems, Future monads, whatever Go's thing is called, etc.

To me it seems people talk about bad experiences with something when they haven't fully explored it. It's like when people say static typing is too restrictive when they only tried Java


Multitasking is a completely different thing than writing parallel code. With multitasking, the CPU and OS are executing completely different programs on separate threads. In parallel code, the same program is executing the same task on multiple threads.


Imagine a shipwright in 1800, confidently predicting that neither larger nor faster ships are possible... due to the limitations of wood and sails, ignorant of what his descendants will do with metal, steam, composites, combustion...

Similarly, we aren't still making our CPUs out of metal gears... Don't confuse a shortage of opportunities for incremental improvement with a shortage of any opportunities.


Doesn't he realise we're stuck around 3.5GHz, and the only reason we went to multi core for the home user was to get more power out of a stalled clock speed processor?

There's faster cores, but the trash rate of them makes them unaffordable.


Usually there are trade-offs involved with hitting massive levels of parallelism that involve sacrificing per-thread performance in favor of squeezing a higher number of simpler cores onto the same piece of silicon. Most consumers (myself included) would probably prefer the former over the latter for everyday computing needs.


Linus, you forgot about science. My colleagues and I routinely use 12 to hundreds of cores, because it is the fastest way for us to do our work.


That's true given how we have gone about putting software together for the last 30-50 years. However, when we reach the physical atomic limit of how small we can make transistors - also the limit for power and die-size (i.e. the end of Moore's Law), we will need to drastically change how we engineer software and software engineering processes if we are to see any further significant evolution in computing - even low-end consumer computing.

Something worth noting - take one core running at, say, 800 MHz - it's gonna get hot and consume a bit of power. Take 8 of those same cores each running at 100 MHz and they are not going to get as hot as their single cousin - though both (theoretically) can execute the same volume of execution cycles - but the 8 cores together will incur less thermal losses - thus will draw less power. This alone is worth learning how to better parallelize software. And yes, there are thread overheads to consider, too - but doesn't it make sense to put effort into learning how to minimise these overheads?
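The arithmetic behind the 800 MHz versus 8 x 100 MHz comparison above can be sketched with the standard dynamic power relation (power proportional to capacitance times voltage squared times frequency). This is only a rough model: the function name and the `voltage_scales` switch are hypothetical, the assumption that supply voltage can drop linearly with frequency is an idealization, and leakage power and threading overhead are ignored.

```python
def relative_dynamic_power(freq_mhz, cores=1, ref_freq_mhz=800.0, voltage_scales=True):
    """Dynamic power relative to one core at ref_freq_mhz, using P ~ C * V^2 * f.

    Assumption (idealized): when voltage_scales is True, supply voltage is
    lowered in proportion to frequency. Real chips have a voltage floor.
    """
    scale = freq_mhz / ref_freq_mhz
    voltage = scale if voltage_scales else 1.0
    return cores * voltage ** 2 * scale


if __name__ == "__main__":
    one_fast = relative_dynamic_power(800.0, cores=1)
    eight_slow = relative_dynamic_power(100.0, cores=8)
    eight_slow_fixed_v = relative_dynamic_power(100.0, cores=8, voltage_scales=False)
    print("1 core  @ 800 MHz:", one_fast)
    print("8 cores @ 100 MHz, voltage scaled:", eight_slow)
    print("8 cores @ 100 MHz, voltage fixed: ", eight_slow_fixed_v)
```

Under this model the saving comes entirely from the lower voltage: with the voltage held fixed, eight slow cores draw the same dynamic power as the one fast core.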


You also have to remember Amdahl's law. If you read further he also comments about using 4-6 cores for the average user makes a lot of sense. He also comments further on, that he is involved in a project that is working on optimizing performance for multiple cores, and how much work it takes to just squeeze a few more %'s of performance, and for a lot of use cases the effort doesn't justify the small gains.

Linus isn't saying, oh we should forget parallel completely, he's combating the idea that parallel will save everything, and we should just port over everything to parallel.
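Amdahl's law, mentioned in the comment above, is easy to put into numbers. The sketch below is illustrative only: the function name is made up and the 90% parallel fraction is an assumed figure, not a measurement of any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only parallel_fraction of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: 90% of the runtime parallelizes perfectly.
    for cores in (1, 2, 4, 8, 64, 1024):
        print(cores, "cores ->", round(amdahl_speedup(0.90, cores), 2), "x")
```

With a 10% serial portion the speedup can never reach 10x no matter how many cores are added, which is the diminishing-returns effect the commenters keep circling around.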


It's also worth noting that parallel programming is a fairly narrow and specific subset of multi-core computing. In this case, similar type of problems are broken up and processed in parallel. Not many algorithms can exploit this. Whereas concurrent programming is more immediately useful for a broader selection of problems. In this case, multi-core computing takes form as a collection of non-blocking and asynchronous design patterns. I think concurrent programming will become increasingly dominant in the future.


I generally rather dislike Linus, but to be fair, he does make a good point here. The main application for lots of smaller cores is when you talk about very distributed systems, up to the point where you're trying to simulate a neural net. And at that point you're not going to be using 200,000 PCs. He's just saying that in the user space, you'd rather have 4-8 fast cores over 30 slow cores, so there's not all that much of an inherent advantage to parallelize past that. Things are only so parallelizable before you start hitting bottlenecks anyway.


I'm torn. On one hand, I love how callously dismissive you are of the trinkets of the software industry, with all its smug, self-important delusions... on the other, if stupidly simple UIs hadn't been built, this would right now be a usenet group with a couple hundred readers.


I will vote your post up, down, touch the ground, swipe, pinch zoom, double tap.


I mean, if we keep going down this route we'll end up writing reactive applications by using an interpreted scripting language to modify a markup syntax for static documents and then bootstrap a rendering style sheet over that mess to get it to show up right, while using a static file transfer protocol to issue commands to the file server to run its own interpreted script on a virtual machine running on an operating system that runs on another virtual machine that abstracts away the hardware to run an entire OS stack on top of another host OS running the same goddamn kernel. We wouldn't want to go down THAT road would we?


Bigger caches are of limited value once the working set is larger than the cache size[1]. In the low-latency space we often do crazy things to keep the entire application in cache but this is not the mainstream. Moving to large pages and having L2 support for these large pages can be more significant than actual cache size as we can now see for some large memory applications running on Haswell.

Rather than going parallel it is often more productive moving to cache friendly or cache oblivious (actually very cache friendly) data structures. It is very easy to make the argument that if a small proportion of the effort that went into Fork-Join and parallel streams was spent on providing better general purpose data structures, i.e. Maps and Trees, that are cache friendly then mainstream applications would benefit more than they do from FJ and parallel streams that are supposedly targeted at "solving the multi-core problem". It is not that FJ and parallel streams are bad. They are really an impressive engineering effort. However it is all about opportunity cost. When we choose to optimise we should choose what gives the best return for the investment.

A lot can be achieved with more coarse-grained parallelism/concurrency. The Servlet model is a good example of this, or even how the likes of PHP can scale on the server side. Beyond this, pipelining is often a more intuitive model that is well understood and practised extensively by our hardware friends.

When talking about concurrent access to data structures it is very important to separate query from mutation. If data structures are immutable, or support concurrent non-blocking reads, then these can scale very well in parallel and can be reasoned about. Concurrent update to any remotely interesting data structure, let alone a full model, is very, very complex and difficult to manage. Period. Leaving aside the complexity, any concurrent update from multiple writers to a shared model/state is fundamentally limited, as shown by the Universal Scalability Law (USL)[2]. As an industry we are kidding ourselves if we think it is a good idea to have concurrent update from multiple writers to any model when we expect it to scale in our increasingly multi-core world. The good thing is that the majority of application code that needs to be developed is queries against models that do not mutate that often.
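To make the USL limit concrete, here is a minimal sketch of its usual form, relative capacity C(N) = N / (1 + a(N - 1) + bN(N - 1)), where a models contention and b models coherency cost. The function names and the coefficient values are assumptions chosen for illustration, not measurements of any real system.

```python
import math


def usl_throughput(n, contention, coherency):
    """Relative capacity at n concurrent writers/cores under the USL."""
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


def usl_peak(contention, coherency):
    """Concurrency level at which USL throughput peaks (requires coherency > 0)."""
    return math.sqrt((1.0 - contention) / coherency)


if __name__ == "__main__":
    # Assumed coefficients: 5% contention, 0.1% coherency penalty.
    contention, coherency = 0.05, 0.001
    for n in (1, 4, 16, 31, 64, 128):
        print(n, round(usl_throughput(n, contention, coherency), 2))
    print("throughput peaks near n =", round(usl_peak(contention, coherency), 1))
```

Unlike Amdahl's law, which only flattens out, the coherency term makes throughput fall once the peak is passed, so adding more writers to shared state can make things slower in absolute terms.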

A nasty consequence of our industry desire to have concurrent access to shared state is that we also do it synchronously and that spreads to a distributed context. We need to embrace the world as being asynchronous as a way to avoid the latency constraints in our approaches to design and algorithms. By being asynchronous we can be non-blocking and set our applications free to perform better and be more resilient due to enforced isolation. Bandwidth will continue to improve at a great rate but latency improvements are leveling off.

My New Year's wishlist to platform providers would be the infrastructure to enable the development of more cache friendly, immutable, and append only data structures, better support for pipeline concurrency, non-blocking APIs (e.g. JDBC), language extensions to make asynchronous programming easier to reason about (e.g. better support for state machines and continuations), and language extensions for writing declarative queries (e.g. LINQ[3] for C# which could be even better). Oh yes, and don't be so shy about allowing lower level access from the likes of Java, we are well beyond writing applets that run in browser sandboxes these days!


Reader Comments (22)

A [mis]quote comes to mind:

I think there is a world market for maybe five computers.

As to one of Linus' claims:

The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.

IBM seems to care about a flavor of parallelism that's not "graphics" and doesn't seem to be doomed only for "server side": http://www.research.ibm.com/articles/brain-chip.shtml

December 31, 2014 | Unregistered Commenter Tristan Slominski

The world has seen parallel computers before. There are some problems where they work extremely well, but one runs into a few problems, like the need for separate memory and communication between the CPUs. How are the CPUs configured to talk to each other? In many cases you end up with a hardware configuration that is no longer general purpose. The Transputer and the Connection Machine both worked on these problems 30 years ago.

January 1, 2015 | Unregistered Commenter Dogzilla

Parallelism is hardly useful without specialisation of cores, because only then would it make sense to optimize for it.

January 2, 2015 | Unregistered Commenter David Hofstee

I wonder whether there is some scope to integrate more of glibc directly into the hardware.

1. Transistors are now effectively free (we can have as many as we like for a trivial increase in die area, and as long as they are idle, they draw nearly zero power).

2. Functions such as printf() can take a surprisingly long time to execute: 10s of microseconds.

So, why not create dedicated hardware for some of the tedious tasks and core functions. Imagine the tradeoff: add another 100k transistors, but get single-cycle execution of parts of glibc.

Examples that occur to me are: printf(), atoi(), strtof(), md5(), various string functions, numpy.loadtxt(), grep, SIMD instructions that could process arrays in 1k chunks, ...

Of course, the code would have to be bug-free, and it would tie the implementation to a very specific version of the function - but it might still make a good trade-off, especially if it allowed us to bring clock-speeds down to radically save power. In practice, we might not have a full glibc function library in hardware, but would implement large chunks of the functions, such that execution of (say) printf with 10 variables could be done in a few CPU cycles.

January 2, 2015 | Unregistered Commenter RichardN

Linus is correct to keep Linux in the realm of the average users' platforms. It is senseless to relinquish control of Linux to a select few with a massively parallel system and expect the average user to show any benefits. There is significant risk of a set of average contributors mucking up the added layer of software and wreaking parallel havoc everywhere.

Keeping Linux on conventional systems is a smart choice. The hardware's lanes are paved in stone and will not differ much between systems. The external bridging hardware must be where much of the agony comes from, so who is proposing customized bridging 100's of CPUs?

The smarter choice is to force hardware to improve their power utilization (who needs to power every bit of a 64 x 64 add/subtract?) and to pave more lanes in their CPUs so that the common enthusiast can continue contributing to Linux.

January 2, 2015 | Unregistered Commenter Jim Swartzendruber

He seems perfectly logical in context. Linux is generally a desktop or user system.
For an end user, they're generally interested in speed and most things do come down to the raw speed of a few cores.
It does seem kind of stupid to put too much work optimizing the Linux desktop experience for a bunch of weak, tiny cores, unless it's a virtualized desktop I guess.
Even then I dislike virtualized desktops with a bunch of weak tiny cores because many things happen in sequence and would be going so much faster with a few really good cores. Maybe if they improved that experience, but that seems to be how it is: Linux or Windows, a bunch of weak cores just can't compete with 3-4 really nice ones for day to day use, except in very narrow situations that lend themselves to it, like graphics or things you would specifically use a server or distributed processing for.

January 2, 2015 | Unregistered Commenter A

Paradoxically, Linus is writing this with a well-functioning brain able to perform amazing parallel tasks (like programming) on a power budget of only around 30 W. The brain's unconscious data processing capability, about a million times the conscious processing capability (the one more or less working like a sequential computer), is highly parallel and distributed. Many people wish we could build computers that work like the brain with matching energy and space requirements, but we are still far from this goal.

January 2, 2015 | Unregistered Commenter DanielP

Just because we are caught in a sequential programming mindset does not mean that there is no room for parallel programming. If you are looking at a two-dimensional array of data and think of a nested loop you ARE caught in a sequential programming mindset. Additionally, famous people, including Dijkstra, have pooh-poohed some algorithms that are inefficient when executed sequentially, to the point where researchers and programmers are not even looking any more for good parallel execution. Take bubble sort. Not sure it was Dijkstra, but somebody suggested forbidding it. Yes, on a sequential computer bubble sort is indeed inefficient, but guess what. If communication does matter and if you are using a massively parallel architecture (i.e., not 4 cores), bubble sort becomes quite efficient because you only need to talk to your data neighbors. Likewise there are AI algorithms that can be shown to behave really well when conceptualized and executed in parallel. Collaborative Diffusion is an example.


January 2, 2015 | Unregistered Commenter Alexander Repenning
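The bubble sort remark in the comment above can be illustrated with odd-even transposition sort, a well-known parallel-friendly relative of bubble sort (named here as a stand-in; the commenter did not specify a variant). The sketch below simulates it sequentially in Python. The point is the structure: within one phase every compare-exchange touches a disjoint pair of neighbors, so on hardware with one processing element per item all of them could run at the same time.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating compare-exchange phases between neighboring elements.

    This is a sequential simulation. On a machine with one processing element
    per item, all pairs within a single phase could be handled simultaneously,
    since each pair only involves two adjacent positions.
    """
    data = list(values)
    n = len(data)
    for phase in range(n):  # n phases are enough to sort n items
        start = phase % 2   # even phases pair (0,1),(2,3)...; odd phases pair (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


if __name__ == "__main__":
    print(odd_even_transposition_sort([5, 1, 4, 2, 8, 0, 3]))
```

Run sequentially this is still quadratic work, but with n processing elements the n phases take roughly linear time while each element only ever talks to its immediate neighbors.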

"4 cores should be enough for anybody" sounds a lot like "640 KB should be enough for anybody". On a desktop PC both claims have some merit for most tasks that do not involve a lot of computation, and besides, most desktop software does not exploit multiple cores effectively. Some applications benefit massively from millions of cores.

January 2, 2015 | Unregistered Commenter Dan the Man

"power tends to be proportional to clockspeed cubed" - so you must minimize clock speed on low power devices. We need more concurrent, SIMD, and lock-free code; and C is a pretty horrible language for expressing this - never mind the unending security nightmare of blind compilers and defenseless runtimes.

January 2, 2015 | Unregistered Commenter Rob Fielding

Linus may have a point, I don't really care about this issue personally. But why does he have to be such a dick about his opinions? Jesus, between him and Stallman, it's a wonder anybody pays attention to the Open Source Wise Old Men at all when they open their pie holes. They are absolutely insufferable.

January 2, 2015 | Unregistered Commenter Timothy

Linus is absolutely right. Let me provide some theoretical reasons for his comments. As a reminder,

In elementary queuing theory, we look at several models, such as multiple input queues and multiple servers. With multiple input queues, we have the phenomenon of putting the next input onto the shortest queue, after taking care of queue hopping.

The next case we discover is that a "single input queue multiple server" design is more efficient from the time required for the individual service request. On average, the service time (arrival, queue time, service time) is minimal, relative to multiple input queues.

We then studied the situation, which is better, to have n servers at speed x, or 2 servers at speed n*x? If you think about it, as soon as you add extra servers, to process inputs, these servers start tripping over each other. It's the idea of a single booth, where you approach a server, and then you move to the side to allow the next server to tackle the next input. The servers interfere with each other because the access to the single queue is one single point of entry.

Taking this to the next level, ideally, there should be different queues for different services. A queue for instant responses, and a queue for longer responses.

I won't dwell on this further, except to say that any textbook on queuing theory and statistical arrival patterns proves that, in the long run, what works best is the absolute fewest number of the fastest servers. Is four cores at speed X better than 2 cores at speed 2X? No.
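The textbook result referred to above can be checked numerically. The sketch below assumes the simplest model (an M/M/c queue: Poisson arrivals, exponential service times, one shared queue) and uses the standard Erlang C formula. The arrival rate and total speed are made-up numbers, and the model ignores the contention on the shared queue mentioned earlier.

```python
from math import factorial


def mean_response_time(arrival_rate, service_rate, servers):
    """Mean time in system (queueing + service) for an M/M/c queue."""
    offered = arrival_rate / service_rate      # offered load a = lambda / mu
    utilisation = offered / servers            # rho, must be < 1 for stability
    if utilisation >= 1:
        raise ValueError("unstable queue: arrival rate exceeds total capacity")
    # Erlang C: probability that an arriving request has to wait.
    tail = offered ** servers / (factorial(servers) * (1 - utilisation))
    head = sum(offered ** k / factorial(k) for k in range(servers))
    p_wait = tail / (head + tail)
    mean_queue_time = p_wait / (servers * service_rate - arrival_rate)
    return mean_queue_time + 1 / service_rate


ARRIVAL_RATE = 3.0   # requests per second (made-up number)
TOTAL_SPEED = 4.0    # combined service rate of all servers (made-up number)

for servers in (1, 2, 4):
    per_server = TOTAL_SPEED / servers
    w = mean_response_time(ARRIVAL_RATE, per_server, servers)
    print(f"{servers} server(s) at speed {per_server}: mean response {w:.3f} s")
```

With these numbers the mean response time comes out at about 1.00 s for one server, 1.14 s for two and 1.51 s for four, so for the same total capacity the fewer, faster servers win under this model. Real workloads with highly variable job sizes can behave differently.
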

Linus, You are right,

Happy New Year.

Leslie (Montreal Quebec)

January 3, 2015 | Unregistered Commenter Leslie Satenstein

I'll wait until someone who isn't a raving ideologue says it. Linus simply isn't reliable.

January 3, 2015 | Unregistered Commenter mk

The problem of using a modern multi-core CPU (or many-core GPU) efficiently lies in developing algorithms that are well suited to parallelisation. If your code uses sequential algorithms, then it will run very inefficiently on these machines. Moreover, in this case, the more cores your processor has, the smaller the portion of its computing power you are using.

But that doesn't mean that "parallel computing is a bunch of crock". It just means that you don't want to use (or to develop) good parallel algorithms.
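The point about sequential code wasting more of the machine as cores are added can be put into numbers. This is a rough sketch using Amdahl's law, which the comment does not name; the function names and the 95% figure are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only `parallel_fraction` of the work can be spread over cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def utilisation(parallel_fraction, cores):
    """Fraction of the machine's total compute power actually doing useful work."""
    return amdahl_speedup(parallel_fraction, cores) / cores


# Compare purely sequential code with code that is 95% parallel.
for cores in (2, 4, 16, 64):
    sequential = utilisation(0.0, cores)
    mostly_parallel = utilisation(0.95, cores)
    print(f"{cores:>2} cores: sequential {sequential:.1%}, 95% parallel {mostly_parallel:.1%}")
```

For purely sequential code the utilisation is simply 1/cores, so a 64-core machine would be left about 98% idle. Even the 95% parallel code uses a shrinking share of the machine as the core count grows.
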

A good illustration of this situation is solving a system of equations. It is quite a typical problem. Let's speak about a system of linear equations (systems of non-linear equations are usually solved by iterating corresponding systems of linear equations).

For solving a system of linear equations there are a lot of standard methods of linear algebra described in thousands of books. The only problem is that all these standard methods (like Gauss elimination, for instance) are not applicable for parallelisation. As a result, when the matrix of the system of linear equations becomes big enough, the calculations require a lot of time and a lot of memory.

As an example of an industry where this problem arises in full force, take parametric CAD (Computer Aided Design). Inside all modern commercial CAD applications (like AutoCAD and Inventor from Autodesk, SolidWorks from Dassault, Solid Edge from Siemens, and many others) there is a part of the code called the "solver", which resolves systems of constraint equations. And none of these glorious applications can use the full power of modern multi-core processors. They solve the corresponding systems of equations in a single thread. Therefore, the regeneration of geometric models (if they have even a few hundred geometric constraints) is very slow and unstable, and very often a system of constraint equations turns out to be unsolvable.

This situation has persisted in the CAD industry for more than 20 years. You ask "why?" Because the algorithms implemented in the solvers of all commercial CAD applications (and, by the way, this is a market with annual sales of about $10 B) were developed 30 years ago without parallelisation in mind.

We at Cloud Invent do believe that "parallel computing is the future" (and not only for the CAD industry). And we are doing our best to implement this "future" already in 2015.

Happy New Year to everybody
(especially for those who believe in parallel computing)

Nick Sidorenko
CEO of Cloud Invent

January 4, 2015 | Unregistered Commenter Nick Sidorenko

Introspek has a point when he mentions task switching. I recently played with Python's parallelize Library. I wrote a simple prime number sieve on my 2-core MacBook Pro, then parallelised it. The parallel version ran much slower on 2 cores than on one, and much slower than the non-parallelised version, which I currently put down to the cost of task switching. Since it was a private project I have not yet had time to try other implementations, or indeed problems, where the cost of computation in each node would significantly outweigh the cost of task switching.
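The original code is not shown, so the following is only a sketch of the kind of experiment described, not a reconstruction of it. It uses `concurrent.futures` from the standard library rather than the library mentioned above, and the function names and segment sizes are made up. The idea is a segmented sieve in which each segment is one task, so that shrinking `segment_size` makes per-task overhead dominate.

```python
from concurrent.futures import ThreadPoolExecutor
from math import isqrt


def base_primes(limit):
    """Plain sieve of Eratosthenes: all primes <= limit."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(flags) if flag]


def count_in_segment(lo, hi, primes):
    """Count primes in [lo, hi) by crossing off multiples of the base primes."""
    flags = bytearray([1]) * (hi - lo)
    for p in primes:
        if p * p >= hi:
            break
        start = max(p * p, ((lo + p - 1) // p) * p)
        for multiple in range(start, hi, p):
            flags[multiple - lo] = 0
    return sum(flags)


def count_primes(limit, segment_size, workers=1):
    """Count primes <= limit, submitting one task per segment."""
    if limit < 2:
        return 0
    primes = base_primes(isqrt(limit))
    bounds = [(lo, min(lo + segment_size, limit + 1))
              for lo in range(2, limit + 1, segment_size)]
    if workers == 1:
        return sum(count_in_segment(lo, hi, primes) for lo, hi in bounds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: count_in_segment(b[0], b[1], primes), bounds))


print(count_primes(10_000, segment_size=64, workers=1))
print(count_primes(10_000, segment_size=64, workers=2))
```

Both calls give the same count. Timing them (for example with `time.perf_counter`) on a larger limit is left to the reader: on standard CPython, threads do not run this kind of pure-Python loop in parallel, so the two-worker version is typically no faster, and with very small segments the scheduling overhead per task tends to outweigh the work itself. A process pool avoids that limitation but adds startup and data-transfer costs of its own.
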

Many years ago I had the pleasure of working with the Distributed Array Processor (now defunct), which had 1024 or, in larger models, 4096 single-bit processors (can we call them cores?), each with 4 nearest-neighbour connections. This proved very efficient in a surprising number of tasks, especially when its variety of Fortran was extended to allow arbitrary-size arrays with under-the-hood optimisations in the compiler. I recall programming parallel sorting and FFT algorithms (I am sure I could not do that again today, at least not so easily) as well as computer vision algorithms. The professor who ran that department found that if the speed of the processors were to increase by a factor of three, then data routing became a bottleneck.

As long as you recall that the hardware architecture can significantly affect performance, and that data routing, not data instructions, becomes the feature to work on, then you can get real speedups with parallelisation, but task switching can be a bottleneck.

January 5, 2015 | Unregistered Commenter Alex Kashko

Linux has many incarnations, examples are:
1.) Android: lots of Android devices, and that trend looks to only be going up (it's already in the billions)
2.) Linux servers: plenty of Linux servers powering the web

These 2 classes represent very different hardware classes:
1.) Android devices' primary concern is energy consumption
2.) Linux servers may use 16-64 cores already (Jan 2015)

We have 2 Linux camps with very different needs:
1.) Android devices do not need 100 CPU cores; the power consumption is prohibitive
2.) Linux servers can benefit from 100s of cores (e.g. serving 100s of users web-pages concurrently)

Chipmakers are making chips to serve these 2 very different ends of the CPU spectrum:
1.) Android devices: low-energy few-core (1-4) CPUs
2.) server-side: many-core CPUs (Jan 2015: 16 cores are commodity, 32 & 64 can't be far off)

And I am assuming that if Linus' comments stand, then Linux will become more tailored to the Android devices, and this means optimizations that 16+ cores need will be placed on the back burner.

This may be the signal of an OS branch; I don't see any way that Linux can properly serve both camps' needs. An OS that runs efficiently on 16/32/64 cores and is not obsessed with power consumption is a very different OS from one optimized for minimal power consumption on 1-4 energy-efficient cores.

January 5, 2015 | Unregistered Commenter Russell Sullivan

Systems that use fuzzy logic rely on rule-based knowledge bases containing banks of crisp or fuzzy rules and the data that fires them. These systems currently use priority flags on the rules to fire them in serial order, but they were designed to run in parallel, with many rules firing together and adding to the final output. These systems have been waiting for years for the advent of viable parallel computing. FuzzyCLIPS is still around, representing basic AI. Now the time has finally come for a surge in the use of knowledge bases again; the next few years should be revealing. Neural networks also ache for available parallel computing. Linus seems to be incorrect in this instance. Task scheduling for efficiency across cores still needs to be solved, of course, for a general-purpose parallel machine.

John H in New Zealand

April 28, 2015 | Unregistered Commenter John Hetherington

He fails to understand that the corporate world doesn't run on Linux as much as we would all wish it would. There are several dilemmas here in regard to parallelization, the biggest one being programmers actually making use of multiple processes. The other is the choice of operating systems that come as standard; Windows by default adds 50% drag time on any level of hardware across the board. On the other hand, Windows does have a small but growing repository of software that does meet certain demand, but imo it's still not enough, which comes back to the programming point. On the other hand, Linux has a vast library of software, but 70% - 80% of it has its own goals and strict adherences or is just plain crapware. We can whine and fuss all we like all day, but until devs start getting serious with how they write software that addresses "cores, threads, processes", the whole argument is just moot. It's like arguing for the sake of arguing.

September 23, 2015 | Unregistered Commenter Meh

Linus's imagination is a bit stuck. Sure, he has no choice but to concede that lots of parallel cores are good for graphics, but beyond that he can't imagine anything else. That's the thing about tools: some are made for a purpose, and some are made looking for a purpose. I do advanced audio processing on multi-core CPUs and sometimes four full cores is not enough. 128 small purpose-driven cores would be great for large convolutions. NEON/SIMD in ARM processors is just not enough, and neither is four cores of NEON/SIMD. SIMD with more lanes running on more cores all sharing the same memory would be great.

April 14, 2016 | Unregistered Commenter Chuck

I would have to say there is a minor window where he is flat out wrong. In the past we would have a room with 120 machines with just a single core, doing parallel computing. Or in a really souped up setting 60 dual core machines doing multiprocessing and parallel computing, (which is actually even more programming work).

In this small window, though, consider nowadays. You could take your single parallel computing problem and purchase one machine with 144 threads. Now you just make your program a multi-threaded beast, on one machine, one system, with no worries about fiber optic cables and switches, network cables, network data transfer, network latency in job processing or ordering, not to mention shared data storage.

Just one machine with a backup power supply, one system to configure, unless you want to purchase two "mainframes" to have side by side, to have a clone of your work.

Considering that you don't have to jump through the hoops of networking all of your cores and working in parallel, I would say a 144 core multi-threaded beast is a great platform for an older parallel computing problem.

That is the small window, but normal computing for the masses, wouldn't use multiple cores daily unless they wanted to, say, distribute a voice search across the internet returning multiple results...example...

May 6, 2016 | Unregistered Commenter Thomas Aldershof II

Linus is being pretty accurate here. If you have two processors, one with 100 cores of 1 MIPS each, and another with 4 cores of 10 MIPS each, if you run a non-parallel program, the 4-core processor will beat the snot out of the 100-core processor (10x speed, ideally).

Now, if we run a parallel program, ideally, the 100-core processor will run faster. But it will only run about 2.5x faster, compared to the 10x that the 4-core managed over it previously.

Combine this with the fact that most programs (being consumer programs) cannot make good use of 100 cores, and the 4-core processor is usually faster than the 100-core one.
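The arithmetic above can be spelled out in a few lines. This sketch assumes an idealised model in which the parallel part of a program scales perfectly across cores and the serial part runs on a single core. The two chips are the ones from the example; the function names are made up.

```python
def run_time(work, parallel_fraction, cores, mips_per_core):
    """Seconds to execute `work` million instructions on the given chip."""
    serial = work * (1 - parallel_fraction) / mips_per_core
    parallel = work * parallel_fraction / (cores * mips_per_core)
    return serial + parallel


def break_even_fraction(many, few):
    """Parallel fraction at which both chips take the same time."""
    serial_gap = 1 / many["mips_per_core"] - 1 / few["mips_per_core"]
    parallel_gap = (1 / (few["cores"] * few["mips_per_core"])
                    - 1 / (many["cores"] * many["mips_per_core"]))
    return serial_gap / (serial_gap + parallel_gap)


MANY_SMALL = {"cores": 100, "mips_per_core": 1}   # 100 cores of 1 MIPS each
FEW_BIG = {"cores": 4, "mips_per_core": 10}       # 4 cores of 10 MIPS each

for fraction in (0.0, 0.5, 0.9, 0.99, 1.0):
    many = run_time(100, fraction, **MANY_SMALL)
    few = run_time(100, fraction, **FEW_BIG)
    print(f"parallel fraction {fraction:.2f}: 100-core {many:.2f} s, 4-core {few:.2f} s")

print(f"break-even parallel fraction: {break_even_fraction(MANY_SMALL, FEW_BIG):.3f}")
```

At a parallel fraction of 0 the 4-core chip is 10x faster, and at 1.0 the 100-core chip is 2.5x faster, matching the figures above. Under this model the 100-core chip only pulls ahead once more than roughly 98.4% of the program is parallel.
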

August 6, 2016 | Unregistered Commenter Jason

I think parallel computing makes big data faster; we only need to do it the proper way. I agree with Linus.

August 12, 2018 | Unregistered Commenter Jules Irenge
