
I've spent a lot of time studying parallel compute over the past few years, trying to understand how GPU-algorithms are so fast.

Use barriers. Many data-structures are safe if:

1. Everyone is reading at the same time.

2. Only one thread is writing to any particular location.

How do you accomplish these two conditions? Again, use barriers. Take the GPU merge-path algorithm, for example. (GPU Merge-path is just a parallel implementation of the "merge pass" of Merge-sort, which normally takes O(N) time on sequential computers):

1. All threads read their relevant values from the array, and perform a binary search along their diagonal (read-only over the data-structure). Binary search is O(lg(n))

2. Barrier, all threads wait for all other threads to be done reading.

3. Write the "merge path".

4. Barrier, wait for everyone to finish writing.

5. Everyone reads the "merge path", which is all you need to figure out the "final location" of any particular value in your merge-sort in O(lg(n)) time.

6. Everyone writes their value "magically" to the correct, sorted, position of the array. Because everyone is writing to a different array location, there's no contention or race-conditions.

Since every step is at most O(lg(n)), your Merge-path algorithm executes in O(lg(n)) time, where n is both the data-size and the parallelism factor (ex: a 10,000-element array with 10,000 processors), which is achievable on modern GPUs.
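To make the pattern concrete, here's a minimal single-block CUDA sketch of steps 1, 2, and 6: every thread binary-searches its diagonal (reads only), a __syncthreads() barrier waits for all reads to finish, then each thread writes its element to a distinct slot. The names (mergeRunsInPlace, diagonalSearch) are mine, not from any library, and it assumes both sorted runs fit in a single thread block:

    // Merges two sorted runs of one array in place, one thread per output.
    // Assumes one block with m + n threads (m + n <= 1024).
    #include <cstdio>
    #include <cuda_runtime.h>

    // Step 1: binary search along diagonal d of the merge path. Returns how
    // many elements of A precede the path's crossing of that diagonal.
    __device__ int diagonalSearch(const int* A, int m,
                                  const int* B, int n, int d) {
        int lo = max(0, d - n), hi = min(d, m);
        while (lo < hi) {
            int a = (lo + hi) / 2;        // candidate split: a from A, d-a from B
            if (A[a] < B[d - a - 1]) lo = a + 1;
            else                     hi = a;
        }
        return lo;
    }

    __global__ void mergeRunsInPlace(int* data, int m, int n) {
        const int* A = data;              // first sorted run
        const int* B = data + m;          // second sorted run
        int d = threadIdx.x;
        int v = 0;
        if (d < m + n) {                  // step 1: everyone reads, nobody writes
            int a = diagonalSearch(A, m, B, n, d);
            int b = d - a;
            v = (b >= n || (a < m && A[a] <= B[b])) ? A[a] : B[b];
        }
        __syncthreads();                  // step 2: barrier, all reads are done
        if (d < m + n) data[d] = v;       // step 6: one distinct slot per thread
    }

    int main() {
        int h[8] = {1, 3, 5, 7, 2, 4, 6, 8};  // two sorted runs of length 4
        int* dev;
        cudaMalloc(&dev, sizeof h);
        cudaMemcpy(dev, h, sizeof h, cudaMemcpyHostToDevice);
        mergeRunsInPlace<<<1, 8>>>(dev, 4, 4);
        cudaMemcpy(h, dev, sizeof h, cudaMemcpyDeviceToHost);
        for (int x : h) printf("%d ", x);     // prints: 1 2 3 4 5 6 7 8
        cudaFree(dev);
        return 0;
    }

Note that the in-place write is only safe *because* of the barrier: without it, one thread could overwrite data[d] while another thread's binary search is still reading it.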

Bonus points: all SIMD computation is innately barrier based. That's what SIMD means after all: all simd-lanes execute the same instruction at any given time. If your processor is on a "load" instruction, everyone's reading. If the processor is on a "store" instruction, everyone is writing. (Modern GPUs are MIMD though, and require explicit "barrier" instructions and/or kernel invoke calls if the SIMD-units from other compute-units are cooperating)

Your "only" job as the programmer, is to therefore, just ensure that those writes are to all different memory locations. A difficult job for sure, but doable on a wide variety of algorithms.

------------

Very, very few GPU-algorithms use mutexes or even atomics (!!!). The "bread and butter" is thread-barriers plus contention-free writes: everyone writes to a different location in memory, with barriers to ensure the other threads are done reading or done writing.



> Use barriers. Many data-structures are safe if:

> 1. Everyone is reading at the same time.

> 2. Only one thread is writing to any particular location.

I think there is a missing point here:

3. Reads and writes do not happen at the same time.


If rules 1. and 2. are followed, 3. is implicitly true


Oh, right. If everyone is reading there is nobody left to write :)

It seems like quite a restriction, though.


See: https://news.ycombinator.com/item?id=31480323

> The easiest way to understand 100,000 concurrent threads is by making all 100,000 threads do the same thing and execute the same code.

If all your threads are off doing different things at different times, and executing different sets of code, you might have problems understanding your algorithm.

It's a restriction for sure, but it does lead to far simpler "thinking". There's still a very rich space of parallel algorithms that can be explored within this restricted space.

In effect: you probably *don't* want to write general-purpose "desynchronized" code unless you really have to. Keeping all 100,000 threads synchronized and in roughly the same spot of the code makes the algorithm far easier to reason about.


How can reads and writes happen at the same time if everyone is reading at the same time?


Doesn't this only apply when all the threads are doing the same thing but on different data?


That's not a bad restriction in practice.

The easiest way to understand 100,000 concurrent threads is by making all 100,000 threads do the same thing and execute the same code.

EDIT: Works on the small scale too. If you're using a CPU, then you can assign exactly 128 threads to be your data-structure's "workers". You access the data-structure sequentially, but all manipulations are done in parallel by the 128 "worker" threads in the background.

Assuming a 64-core / 128-thread CPU like an AMD EPYC or whatever.

Normally, your data-structure's "commands / frontend" aren't the bottleneck; instead it's the data-structure's "work" that bottlenecks. As long as that work happens in parallel, you're probably fine.

------

Bonus points: this leads to "obvious" NUMA-locality. If your NUMA-cluster is just 8 cores / 16 threads, you can have a NUMA-local set of 16 workers with affinity set. You scale to the level of your L3 cache (or whatever arbitrary NUMA-node you want), and different threads can have their own "local workers" to choose from.
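Here's a rough host-side C++ sketch of that "sequential frontend, parallel workers" shape. All names are mine; for brevity it spawns workers per command instead of keeping a persistent pool of 128, and the affinity call (pthread_setaffinity_np) is Linux-specific:

    #include <pthread.h>
    #include <sched.h>
    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    // The frontend calls this sequentially; the "work" fans out in parallel.
    void parallelApply(std::vector<int>& data,
                       const std::function<void(int&)>& op,
                       unsigned nWorkers) {
        std::atomic<size_t> next{0};
        std::vector<std::thread> workers;
        for (unsigned w = 0; w < nWorkers; ++w) {
            workers.emplace_back([&] {
                // Dynamic partitioning: each index is claimed exactly once,
                // so every write goes to a distinct location.
                for (size_t i = next.fetch_add(1); i < data.size();
                     i = next.fetch_add(1))
                    op(data[i]);
            });
            // Pin worker w to core w so a group of workers stays on one
            // NUMA node / L3 slice (Linux only).
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(w, &set);
            pthread_setaffinity_np(workers.back().native_handle(),
                                   sizeof set, &set);
        }
        for (auto& t : workers) t.join();
    }

    int main() {
        std::vector<int> v(1 << 20, 1);
        parallelApply(v, [](int& x) { x *= 2; }, 16);  // 16 "local" workers
        return 0;
    }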


> That's not a bad restriction in practice.

Maybe, but it surely is inconvenient!

When our program can use a pipeline (or several pipelines) for concurrency, the way GPU shaders work, then sure, that's a good way to go. However, the majority of what threads do is not pipeline-friendly work.

When we want our program to do different things at the same time (which is what we mostly use threads for), trying to do it in a pipeline-like manner won't work.

For example, serving multiple http requests at the same time - some will be a POST to an API endpoint, some will be a GET for a static file, some controller code will make multiple requests to different backend microservices (each request in a different thread), etc.

Or maybe some client code, that makes several concurrent requests for different resources to different destinations, while each thread still needs to update the UI (progress meter, for example).

Or a parser for input from some upstream processes: you want to parse in parallel if at all possible.

Or the program in a microcontroller that cannot miss bytes coming in on one bus while it is sending bytes out on another bus.


Concurrent =/= parallel. CPU thread safety mostly manifests [or let's say is much more challenging] in the context of concurrent, not parallel, algorithms. I'm not aware of anyone doing concurrent programming on the GPU (because it is a poor fit).


> I'm not aware of anyone doing concurrent programming on the GPU (because it is a poor fit).

It's routine in good Vulkan-based games. One or more CPU threads are telling the GPU what to render, while other CPU threads are concurrently loading new content into the GPU for rendering on later frames. (You're driving down a road. Ahead of you, new content has to be loaded before you get there. Behind you, the memory containing content you can no longer see is released.)

This is complicated to make work. Unreal Engine does it, I think. Rend3->WGPU->Vulkan is supposed to do it, but right now there are locking delays which cause content updating to impact rendering. That's being worked on at the WGPU level.

This is one of those little problems being solved on the way to a big-world Metaverse.


Interesting, thanks.


Any concurrent algorithm, by definition, has branches of computation that are independent of each other, and is thus parallel.

Any parallel algorithm, by definition, is a (trivially) concurrent algorithm.


The usual way that the difference is presented is that concurrency is a semantic property of your program independent of its runtime performance while parallelism is a runtime property of your program independent of its semantics.

In particular, a basic property of concurrent programs is non-determinism. It must always be available (can't be offline while serving one response) to service multiple requests in non-deterministic order (and often will give responses in a non-deterministic order as well even for a deterministic order of requests, or at least leave such an option available in its semantics). There are almost always other invariants that must be preserved (which is what it means for a concurrent program to be correct), but generally the invariants are lax enough that a concurrent program does not make guarantees about being deterministic. This has no direct relationship with its runtime implementation. For example, you could implement a concurrent program in a single-threaded fashion just by interleaving and jumping back and forth between servicing different requests. This is effectively how something like green threads or async-await works.

Parallelism on the other hand is a property of the runtime. Indeed, often parallel programs are meant to be entirely deterministic in their semantics (e.g. map-reduce-type computations), where the answer will always be the same no matter what. You just might have faster runtime. But parallelism has no inherent link to non-determinism in the way that concurrency does, because after all determinism/non-determinism is a semantic property.
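As a toy illustration of that last point, here's a single-threaded C++ sketch (all names mine) that services two "requests" concurrently by interleaving their steps on one OS thread, roughly the way green-thread or async/await schedulers do:

    #include <cstdio>
    #include <deque>
    #include <functional>

    int main() {
        std::deque<std::function<bool()>> tasks;  // returns true when finished
        for (int id = 0; id < 2; ++id) {
            int step = 0;
            tasks.push_back([id, step]() mutable {
                printf("request %d: step %d\n", id, step);
                return ++step == 3;               // pretend each takes 3 steps
            });
        }
        // Round-robin "scheduler": jumps back and forth between requests,
        // yet only one thing ever executes at a time.
        while (!tasks.empty()) {
            auto t = std::move(tasks.front());
            tasks.pop_front();
            if (!t()) tasks.push_back(std::move(t));
        }
        return 0;
    }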


> But parallelism has no inherent link to non-determinism in the way that concurrency does, because after all determinism/non-determinism is a semantic property.

I don't know if I agree with this.

If we consider the map example, the order of elements to which the function is applied must be irrelevant, so you must have non-determinism at the most fundamental level. So map is still a concurrent algorithm, isn't it?

I mean, I get it, the meaning of the words is different, but I can't find any difference beyond "concurrency is a semantic property, parallelism is runtime manifestation of concurrency", which only makes them even more equivalent-ish.


> If we consider the map example, the order of elements to which the function is applied must be irrelevant, so you must have non-determinism at the most fundamental level.

This is not true. Parallelism does not require non-determinism (you can e.g. require as part of your interface that the list you are map-reducing be pre-sorted; then, using that sort key, the order of operations is entirely determined, and you could even pin certain sort keys to certain processor cores; this is in fact how some parallel interfaces for certain parallelism libraries operate, minus the pinning).

But that's only partially related to the larger point, which is that any discussion about the semantics vs the implementation of a program depends on where you draw the line between the semantics of your program and its runtime behavior. For different environments that line will be at different places. For most applications, runtime and memory usage are usually not considered part of its semantics (otherwise e.g. the notion of porting an application or refactoring wouldn't exist, since both of those would be changing the application's performance and therefore its semantics and would be essentially the same as changing the application's interface). In hard real-time environments, on the other hand, usually runtime and memory usage are considered part of its semantics, so the division between parallelism and concurrency is less relevant there.

> I can't find any difference beyond "concurrency is a semantic property, parallelism is runtime manifestation of concurrency", which only makes them even more equivalent-ish.

A very practical difference is that if you tell me a program is concurrent but non-parallel, I have a good idea of the outline of what that program does. If you tell me that a program is parallel but not concurrent, I also have a good idea of what that program is. And the two are very different and require very different engineering trade-offs.

Examples of the former include things such as chatbots, web servers, databases (usually the latter two will also sprinkle in parallelism), etc. Examples of the latter include compilers, physics simulations, rendering engines, etc.

And different tools are tuned for different use cases. If you use a tool meant for building concurrent, but not necessarily parallel, programs, you'll have a bad time building a parallel but not concurrent program with it, and vice versa. E.g. Node JS is great for concurrent, but not parallel, programs. You will have a horrible time building a parallel, but not concurrent, program with it. None of its trade-offs will make sense. A similar case holds for something like Erlang (it at least has some support for parallelism, but it's not its strong suit). On the flip-side, Fortran is great for parallel but not concurrent programs, but you're not going to find a lot of support for concurrent programs.


Yet we do distinguish the two types by distinct names (and algorithmic and data structure treatment) for a reason, precisely because your proposed equivalence offers little analytical and practical benefit.


> equivalence offers little analytical and practical benefit

I disagree, but that's irrelevant.

Considering things equivalent is the default if their properties are the same - in the case of concurrency/parallelism, I think I have clearly demonstrated how every concurrent algorithm is parallel, and vice versa.

If you're claiming that a distinction should be drawn, the burden of proof is on you. What is the fundamental difference between concurrency and parallelism?


> I think I have clearly demonstrated how every concurrent algorithm is parallel, and vice versa.

You have not "clearly demonstrated" anything. You are merely pointing out a weak truism that holds in parts, which is the banal truth that all algorithms feature non-contended segments for a given executing thread. The fact that the distinction between these two types is not 'crisp' and precise does not mean a meaningful distinction does not exist.

If your position was valid, we would call e.g. a multi-producer multi-consumer queue a "parallel" data structure but we certainly don't. Sure, the producers are all producing in parallel before contending to enqueue, and the consumers do their thing in parallel after they contend to dequeue. That parallelism is a banal fact in this case, given that what we really need to address are the concurrent attempts to update the queue. Conversely, applying a filter to an image by tiling and performing the application in parallel has some minor contention bits but the main business is doing something in parallel.


Aside from the appeal to authority ("we" say X, therefore X is right), you bring up some good points (although in a fairly unpleasant tone, which I don't appreciate in the slightest). But you still haven't answered my question - what is the fundamental difference between concurrency and parallelism?


Took a short walk and was thinking about this. This is my take on the matter.

The most fundamental difference between parallel and concurrent processing is that for the latter we need not have the same number of physical actors (threads) as the required logical actors [for meaningful/practical gains]. In other words, we can (and do, as in 'green threads') emulate the synchronized activity of nL logical actors with nR real physical actors, where 1 <= nR <= nL. Of course we can also emulate nL parallel activities with only 1 real actor, but it is of no practical value.

So Amdahl's Law is very much about parallelism but it has absolutely nothing to do with 'concurrency'. [Whereas Queueing Theory is of relevance to analyzing concurrent algorithms.]
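[For reference, Amdahl's Law: if a fraction p of the work can be parallelized across n processors, the speedup is

    S(n) = 1 / ((1 - p) + p/n)

which stays below 1/(1 - p) no matter how large n gets.]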

This is because in parallel algorithms, the concurrent actors are all doing the same thing (see SIMD), whereas in concurrent algorithms, the concurrent actors may be doing entirely different things (see typical GUI component of an OS).

This fact also informs why arrays and array processing languages are such good fits for parallel programming.

Then, it seems to me that parallel processing is orchestrated (deterministic) in nature, whereas concurrent processing is all about synchronized pseudo-chaos. [So parallelism is a simple, well ordered, type of concurrency. Think of it as symmetric and synchronized concurrency.]

Finally, the 'fuzzy' distinguishing matter is a sense of the degree of contention for mutating ops on registers. A concurrent algorithm is pretty much all about dealing with that: highly contended R/W registers.

p.s. so more examples of making this distinction: fork-join is about parallelism. green-threads/fibers are about concurrency. continuations are also about concurrency. Go is a language tuned for concurrency. Chapel is one tuned for parallelism.

p.p.s. things can be "embarrassingly parallel", but dealing effectively with concurrency is never something to be embarrassed about. /g

https://en.wikipedia.org/wiki/Embarrassingly_parallel


I see.

I think I now better understand the motivation behind the initial comment of yours that I criticized. Thanks for taking the time to explain.


NP. Actually it was good of you to insist on a clear definition. I would say applicability of Amdahl's Law is a decisive litmus test.


As I understand it, parallelism describes tasks making progress simultaneously, while concurrency describes tasks that can be broken into smaller, independent tasks (that could still be executed serially).

For example, I could have a map-reduce problem executing sequentially on a single-core CPU. Each map function would operate on the data sequentially, then the reduce step. The process is concurrent because I could work on any map step in any order, or even stop midway and work on another task set. Once all steps are complete I have my output. But since only one task was ever making progress at a time, the computation wasn't parallel; it only becomes parallel once I add another execution unit.

Consider multitasking on early operating systems: a user may have several programs running concurrently, but early CPUs could only do one thing at a time, hence there was no parallelism. Due to this architecture there are situations where all execution units are blocked and only one of them can make progress (e.g. reading from a spinning disk), making CPU synchronization primitives a "poor fit" for GPUs.

In other words, while any parallel algorithm, by definition, is a (trivially) concurrent algorithm, the inverse is not true, and is still a harder problem.


The "elevator algorithm" makes hard drives (and tapes) more-and-more efficient the more work is queued up for them.

Let's say you have 10 floors on an elevator (floor#0 through floor#9). You can imagine that I/O requests are handled by the elevator moving to a specific floor, and then reading/writing to that floor, and then moving to another floor.

Physically, this correlates to the positions a hard-drive arm takes over various sectors of the drive: you can think of floor#0 as the "inner ring" of the hard drive, and floor#9 as the "outer ring". The further apart the floors are, the longer it takes for the arm to physically move across the hard drive.

----------------------

Now let's say a bunch of requests have come in: read floor#5, write floor#9, read floor#1, write floor#2, write floor#7. Executing these 5 requests in arrival order means the arm visits 5->9->1->2->7. This is clearly inefficient.

If all 5 requests were known at the same time, the hard drive can instead reorder the requests into 1->2->5->7->9, which moves the hard drive arm a much shorter distance. This is the innate benefit of "concurrency" on 1980s-era computers.

You wanted multiple threads to buffer up as many requests to the hard drive as possible. The more "load" you give to the hard drive arm, the more overall efficiency is achieved, because the hard drive controller can optimize those read/write requests and minimize the movement of the arm. (Ex: rearranging the "floor visit pattern" of the elevator algorithm to minimize movements of the hard-drive control arm.)
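Here's a minimal C++ sketch of that reordering idea (elevatorOrder is my name for it): sort the pending tracks, service everything at or past the arm on the way out, then the rest on the way back - one SCAN sweep:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    std::vector<int> elevatorOrder(std::vector<int> tracks, int arm) {
        std::sort(tracks.begin(), tracks.end());
        auto split = std::lower_bound(tracks.begin(), tracks.end(), arm);
        std::vector<int> order(split, tracks.end());       // sweep outward
        order.insert(order.end(),
                     std::make_reverse_iterator(split),    // then sweep back
                     tracks.rend());
        return order;
    }

    int main() {
        // The five requests above, with the arm starting at the inner ring:
        for (int t : elevatorOrder({5, 9, 1, 2, 7}, 0))
            printf("%d ", t);                              // prints: 1 2 5 7 9
        return 0;
    }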

----------------------

So we can see that "concurrency", or "multithreaded" programming, was about maximizing I/O performance rather than CPU performance.

From this perspective, highly CPU-inefficient structures, such as Mutexes, are not really a penalty. The CPU is so much faster than the hard drive that it doesn't matter. Mutexes, task-switching, etc. etc. Spend all the CPU cycles you want on this, the CPU is obviously not the bottleneck here.

In contrast, today is 2020. We have solid state drives that are much much faster (still benefiting from parallel requests though!!), and the CPU itself can be the bottleneck.

Not only do we want to fill up our I/O systems with a huge number of parallel requests (modern SSDs perform far, far better at high queue depths), we also want to take advantage of multi-CPU clusters through parallel programming techniques.

-------------

As such, my particular usage of the words are:

* Concurrency is the set of techniques for maximizing I/O usage, even at high cost to CPU-time / CPU-power (mutexes, semaphores, pthreads, task-switching). If you have a technique that can "invent new I/O requests out of nothingness", that's ideal in the concurrency mindset. (Agents, golang-threads, etc. etc.)

* Parallelism is the set of techniques for maximizing CPU-usage, which therefore must be as low-cost in CPU-power as possible (thread barriers, SIMD processing, spinlocks, thread-pools, thread-affinity, NUMA processing).

All of those golang-threads that generate more I/O requests are worthless in the scope of CPU-limited "parallel compute". But they're extremely helpful in I/O-limited "concurrency". The more I/O requests that hit the controller simultaneously, the better those I/O devices perform. This is true for all forms of I/O, be it a GPU, network adapter, hard drive, tape drive, or solid state drive.


Sorry about the tone!(?)


I apologize for my defensiveness. I tend to get defensive and emotional when under stress. I don't like that about myself.



