There are two overviews I use during my training sessions that I would like to share with you. Normally I draw them on a whiteboard, but having them in digital form has its advantages.

Transfer speeds per bus

The image below gives an idea of theoretical transfer speeds, so you can see how a fast network (1 GB of data in 10 seconds) compares to GPU memory (1 GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a GPU need to work together to reach that maximum transfer rate. Also, I have not used very precise benchmarking methods to arrive at these figures.
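To make the comparison concrete, here is a minimal sketch of the arithmetic behind those two examples. The bandwidths are simply the ones implied by the text (1 GB in 10 s and 1 GB in 0.01 s); the names `BANDWIDTH_GB_PER_S`, `transfer_seconds` and `speed_ratio` are made up for this illustration, and latency is ignored.

```python
# Bandwidths implied by the two examples in the text, in GB/s.
BANDWIDTH_GB_PER_S = {
    "fast network": 0.1,   # 1 GB in 10 s
    "GPU memory": 100.0,   # 1 GB in 0.01 s
}


def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Theoretical time to move size_gb over a bus, ignoring latency."""
    return size_gb / bandwidth_gb_per_s


def speed_ratio(slow_bus, fast_bus):
    """How many times faster fast_bus is than slow_bus."""
    return BANDWIDTH_GB_PER_S[fast_bus] / BANDWIDTH_GB_PER_S[slow_bus]


if __name__ == "__main__":
    for name, bandwidth in BANDWIDTH_GB_PER_S.items():
        print(f"{name}: {transfer_seconds(1.0, bandwidth):.2f} s per GB")
    ratio = speed_ratio("fast network", "GPU memory")
    print(f"GPU memory is about {ratio:.0f}x faster")
```

In other words, the two ends of this comparison are roughly three orders of magnitude apart.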

Next, we zoom in on the slower buses. All the good stuff is on the left and all the buses to avoid are on the right. What should be clear is that a read from or write to an SSD will make the software very slow if you use write-through instead of write-back.

What is important to see is that data locality makes a big difference. Take a look at the image and then follow along with me. When using GPUs, each of the following can increase speed on the same hardware: keeping hard disks out of the computation queue, avoiding transfers to and from the GPU, and increasing the number of computations per byte of data. When an algorithm does a lot of data operations, such as transposing a matrix, it is better to have a GPU with fast memory access. When the number of operations matters most, clock speed and cache speed are most important.
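A rough way to see the effect of these three measures is to add up the time per bus and the compute time. The sketch below assumes transfers and computation do not overlap, and all numbers (`DISK_GB_PER_S`, `PCIE_GB_PER_S`, `GPU_GFLOPS`) are hypothetical placeholders, not measurements from the image.

```python
def job_seconds(size_gb, bus_bandwidths_gb_per_s, ops_per_byte, device_gflops):
    """Time to push size_gb over every bus in turn and then compute on it.

    Assumes no overlap between transfers and computation.
    """
    transfer = sum(size_gb / bw for bw in bus_bandwidths_gb_per_s)
    # GB * (ops/byte) / (Gops/s) = seconds
    compute = size_gb * ops_per_byte / device_gflops
    return transfer + compute


def achieved_gflops(size_gb, bus_bandwidths_gb_per_s, ops_per_byte, device_gflops):
    """Useful operations per second once transfer time is included."""
    seconds = job_seconds(size_gb, bus_bandwidths_gb_per_s, ops_per_byte, device_gflops)
    return size_gb * ops_per_byte / seconds


# Hypothetical numbers, for illustration only.
DISK_GB_PER_S = 0.1
PCIE_GB_PER_S = 8.0
GPU_GFLOPS = 1000.0

if __name__ == "__main__":
    scenarios = [
        ("disk + bus to GPU", [DISK_GB_PER_S, PCIE_GB_PER_S], 10),
        ("bus to GPU only", [PCIE_GB_PER_S], 10),
        ("data already on GPU", [], 10),
        ("bus to GPU, 10x more ops per byte", [PCIE_GB_PER_S], 100),
    ]
    for label, buses, ops_per_byte in scenarios:
        t = job_seconds(1.0, buses, ops_per_byte, GPU_GFLOPS)
        g = achieved_gflops(1.0, buses, ops_per_byte, GPU_GFLOPS)
        print(f"{label}: {t:.3f} s, {g:.0f} GFLOPS achieved")
```

With these placeholder values the disk dominates everything else, and raising the operations per byte brings the achieved rate closer to what the device can do.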

What you don’t see in this image is how much time an operation takes on the CPU or GPU itself once the data is available. These “transfer speeds” are directly related to the FLOPS you actually achieve, whereas the theoretical maximum is simply frequency times number of cores times number of operations per core per clock cycle. That explains why the maximum theoretical FLOPS do not translate into very high FLOPS in all real-life software: data transfer is part of reality.
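The relation between peak FLOPS and transfer speed can be sketched with a simple roofline-style bound. That framing is my own addition here, not something taken from the image. The device figures (1 GHz, 1000 cores, 2 operations per core per cycle) are hypothetical; the 100 GB/s bandwidth is the GPU-memory example from above.

```python
def peak_flops(frequency_hz, cores, ops_per_core_per_cycle):
    """Theoretical maximum: frequency x cores x operations per core per cycle."""
    return frequency_hz * cores * ops_per_core_per_cycle


def attainable_flops(peak, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline-style bound: compute cannot outrun the data feeding it."""
    return min(peak, bandwidth_bytes_per_s * ops_per_byte)


# Hypothetical device: 1 GHz, 1000 cores, 2 operations per core per cycle.
PEAK = peak_flops(1_000_000_000, 1000, 2)
# 100 GB/s, i.e. 1 GB in 0.01 s as in the GPU-memory example.
BANDWIDTH = 100_000_000_000

if __name__ == "__main__":
    for ops_per_byte in (1, 5, 20, 100):
        flops = attainable_flops(PEAK, BANDWIDTH, ops_per_byte)
        print(f"{ops_per_byte:>3} ops/byte: {flops / 1e9:.0f} GFLOPS "
              f"({100 * flops / PEAK:.0f}% of peak)")
```

Under these assumptions the device only reaches its peak once the algorithm does at least 20 operations per byte read; below that, the memory bus sets the limit.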

Operations and Data-size

This image shows the optimal OpenCL hardware for a given number of operations per byte and a given data size. It is a relative representation of where to find the best type of device. So if you have a lot of data (and thus transfers), there comes a point where you are better off using an APU (AMD Fusion or Intel Sandy Bridge). If the operations per byte are low, it might be best to just use a CPU (with OpenCL).
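One way to ask the question for your own algorithm is to estimate the break-even number of operations per byte between a discrete GPU (fast, but behind a bus) and a device that shares host memory, such as a CPU or APU. The sketch below is a deliberately crude linear model with no overlap of transfer and compute. The parameter `bus_crossings` (how often each byte crosses the bus) and all numbers in the example are assumptions for illustration, not values read from the image.

```python
def seconds_per_gb_discrete(ops_per_byte, gpu_gflops, bus_gb_per_s, bus_crossings):
    """Time per GB on a discrete GPU: bus transfers plus compute."""
    return bus_crossings / bus_gb_per_s + ops_per_byte / gpu_gflops


def seconds_per_gb_shared(ops_per_byte, gflops):
    """Time per GB on a device that shares host memory: compute only."""
    return ops_per_byte / gflops


def break_even_ops_per_byte(gpu_gflops, shared_gflops, bus_gb_per_s, bus_crossings):
    """Operations per byte above which the discrete GPU becomes faster."""
    if gpu_gflops <= shared_gflops:
        raise ValueError("the discrete GPU must be the faster compute device")
    return (bus_crossings * shared_gflops * gpu_gflops
            / (bus_gb_per_s * (gpu_gflops - shared_gflops)))


if __name__ == "__main__":
    # Hypothetical hardware: 2000 GFLOPS GPU behind an 8 GB/s bus,
    # versus a 100 GFLOPS device that shares host memory.
    for crossings in (2, 20):
        limit = break_even_ops_per_byte(2000.0, 100.0, 8.0, crossings)
        print(f"{crossings} bus crossings per byte: "
              f"discrete GPU wins above {limit:.1f} ops/byte")
```

In this model, more transfers push the break-even point up, which is the same trade-off the image expresses: with lots of transfers or few operations per byte, a device without the bus in between becomes attractive.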

As it really depends on the actual hardware (for instance GPUs on APUs are currently not really that powerful), use this image only to ask yourself the right questions for your algorithm. More on this in the upcoming new “hardware buying guide”.

I hope these images gave you an easy insight into how things work in the world of OpenCL. If you want more, have a look at our more extensive training program.

StreamHPC communications

Theoretical transfer speeds visualised


