Separation of Compute, Control and Transfer
A while ago I spoke with Patrick Viry, CEO of Ateji. We shared ideas on GPGPU, OpenCL and programming in general. While talking about the strengths of Ateji PX (a Java-like language for parallel programming), he made a remark which I found important and interesting: separation of transfer.
Separating focus areas increases effectiveness, but is said to be for experts only. For example, the concept of loops is well known to programmers, so that seems reason enough to make it the starting point for anything involving repetition. Current lower-level GPGPU languages are kernel-host languages: they describe what has to be done at one coordinate (or group of coordinates) in the data, instead of looping over the whole data. From what I see, this idea is being abandoned in higher-level languages instead of being turned into a design pattern.
I would like to discuss these separations to see what higher-level languages built on top of these low-level languages could and/or should focus on.
I think the image of a tree helps a lot to illustrate this: you have the nutrients coming in through the roots, the transport going through the trunk and all the complex stuff happening in separate leaves.
Separation of compute
With loops, one can choose to loop over (i, j) of the input data or over (k, l) of the result data. In OpenCL one is forced to describe the computation as seen from each (k, l) of the result data (with some exceptions). An important difference is that you cannot keep temporary answers and use them in the next iteration. If iterations are needed, the data has to be computed in several steps.
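To make the difference concrete, here is a minimal sketch in plain Python (not OpenCL). The names `blur_kernel`, `launch` and `blur_n_steps` are made up for this illustration. The kernel computes exactly one result element (k, l) from read-only input, and the `launch` helper, a stand-in for what an OpenCL runtime would do, calls it once per result coordinate in shuffled order to show that no invocation depends on another. Iterations become separate launches, each reading the complete result of the previous one.

```python
import random

def blur_kernel(src, k, l):
    """Compute ONE result element (k, l) from read-only input.

    No state is carried over from any other (k, l).
    """
    rows, cols = len(src), len(src[0])
    total, count = 0.0, 0
    for dk in (-1, 0, 1):
        for dl in (-1, 0, 1):
            i, j = k + dk, l + dl
            if 0 <= i < rows and 0 <= j < cols:
                total += src[i][j]
                count += 1
    return total / count

def launch(kernel, src):
    """Hypothetical stand-in for a kernel launch over all result coordinates.

    The order is shuffled on purpose: the result must not depend on it.
    """
    rows, cols = len(src), len(src[0])
    coords = [(k, l) for k in range(rows) for l in range(cols)]
    random.shuffle(coords)
    dst = [[0.0] * cols for _ in range(rows)]
    for k, l in coords:
        dst[k][l] = kernel(src, k, l)
    return dst

def blur_n_steps(src, steps):
    """Iterations are separate launches; each step reads the previous full result."""
    for _ in range(steps):
        src = launch(blur_kernel, src)
    return src
```

Because the kernel never sees a half-finished result, the runtime is free to run the coordinates in any order or all at once, which is where the scaling comes from.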
By forcing the programmer to think this way, through the separation of the single computation from the repetition, the code can be optimized and scaled more easily. And exactly this is what is abandoned in higher-level GPGPU languages. In the overview below, the low-level languages are all on the left, and most of the higher-level ones on the right. The exception is ArrayFire, which stays close to the Matlab concept.
| Node-wise | Functional | Iterative & directives |
Node-wise (the computation is described per group of data elements) has the advantage of scaling, but takes some time to implement for programmers who are used to loops. Functional programming solves a lot, but is not applicable to all kinds of problems. I do like this approach, though, and it could be a good direction. Unlooping is a very well-explored research area, but it is still not optimal when it comes to scaling. Hence its importance.
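As a rough illustration of the three styles, here is the same small computation (a*x + y per element) written three ways in plain Python. The function names are made up for this sketch, and none of this is the syntax of any actual GPGPU language; it only shows where the repetition lives in each style.

```python
def saxpy_node(a, x, y, i):
    """Node-wise: describes the work for ONE index i only."""
    return a * x[i] + y[i]

def saxpy_nodewise(a, x, y):
    """The repetition lives outside the kernel (here: a stand-in launcher)."""
    return [saxpy_node(a, x, y, i) for i in range(len(x))]

def saxpy_functional(a, x, y):
    """Functional: the repetition is expressed as a map over the data."""
    return list(map(lambda xi, yi: a * xi + yi, x, y))

def saxpy_iterative(a, x, y):
    """Iterative: a classic loop; a directive-based language would
    annotate this loop and leave the parallelization to the compiler."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out
```

All three give the same answer; the difference is who owns the loop: the runtime, the `map`, or the programmer.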
I think the strength of separating the computation from the rest is undervalued.
Separation of transfer
When using GPUs, and also when reading a file from disk, part of the time the whole operation takes is spent transferring data. This needs scheduling. Scheduling data transfers makes up most of the host code in OpenCL.
| In host-code | Explicit | By compiler |
| | Ateji PX | C++ AMP |
How Ateji PX does this explicit transfer scheduling is explained here. The choice for most new higher-level languages is to leave it to the compiler to find out, even if explicit transfers could be faster.
Please let me know if it can be done with e.g. OpenACC's async.
While OpenCL and CUDA have room to improve this separation, only Ateji PX did not abandon it. Forcing programmers to define transfers explicitly increases the overall speed. If the programmer does not concern himself too much with where things happen, then next year's hardware with low transfer speeds could perform best. This decreases the potential growth of many types of dedicated co-processors, such as FPGAs.
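To show what explicit transfer scheduling buys, here is a minimal sketch in plain Python. It is neither Ateji PX syntax nor an OpenCL API: `transfer` is a hypothetical stand-in for a host-to-device copy (the sleep only models latency), and `compute` stands in for a kernel. The overlapped version starts the transfer of the next chunk before computing on the current one, which is the kind of decision a programmer can state explicitly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):
    """Hypothetical stand-in for a host-to-device copy; sleep models latency."""
    time.sleep(0.01)
    return list(chunk)

def compute(chunk):
    """Stand-in for a kernel: sum of squares of one chunk."""
    return sum(v * v for v in chunk)

def run_serial(chunks):
    """Transfer, then compute, one chunk after the other."""
    return [compute(transfer(c)) for c in chunks]

def run_overlapped(chunks):
    """Explicitly schedule the next transfer while computing the current chunk."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()              # wait for the current chunk
            pending = pool.submit(transfer, nxt)  # start the next transfer now
            results.append(compute(ready))        # compute while it is in flight
        results.append(compute(pending.result()))
    return results
```

Both versions return the same results; only the schedule differs. In this toy the gain is small because `compute` is trivial, but the structure is the point: the transfer and the computation are separate steps that can be ordered on purpose.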
Some words before finishing here…
Even though the old ways of programming have shown their scaling limits, most higher-level languages that aim to replace OpenCL and CUDA keep trusting the old paradigms. The good thing that CUDA and OpenCL brought is exactly this separation of computation and transfer. It could help us all enter the multi-core era. CUDA and OpenCL could still use a lot of optimization, such as automatic pinning of memory, optimization of simple kernels written without coalescing and manual caching, etc. I invite language and compiler designers to focus on that, instead of making the new way of programming look like what we are already used to.
What do you think? More or less separation to cope with scaling on multi-core processors?