The OpenCL power: offloading to the CPU (AVX+SSE)

Say you have some data that is needed as input for a larger kernel, but first needs a little preparation to get it aligned in memory (a small kernel with random reads). Unluckily the efficiency of such a kernel is very low, so there is no speed-up or even a slowdown. Programming a GPU is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA-programmers) once it has been decided to use accelerators: just use the CPU. The main problem is not the kernel that has been optimised for the GPU, but that all supporting code (like the host-code) needs to be rewritten to be able to use the CPU.

Why use the CPU for vector-computations?

The CPU has support for computing vectors. Each core has a 256-bit wide vector unit. This means a double4 (a vector of four 64-bit floats) can be computed in one clock-cycle. So a 4-core CPU at 3.5 GHz goes from 3.5 billion operations per second on one core to 14 billion when using all 4 cores, and to 56 billion when using double4 vectors. When using a float8, it doubles to 112 billion. Using MAD-instructions (Multiply+Add), where the hardware offers them, this can be doubled again to 224 billion. These are theoretical peak numbers.
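The arithmetic above can be written down as a small helper. This is only a sketch of the peak calculation from the text; the function name `peak_ops_per_second` and its parameters are made up for illustration, and real code will rarely reach these numbers.

```cpp
#include <cstdint>

// Theoretical peak operations per second, following the arithmetic in the text.
// All parameters are illustrative inputs, not measurements.
constexpr std::uint64_t peak_ops_per_second(std::uint64_t clock_hz,
                                            unsigned cores,
                                            unsigned vector_bits,
                                            unsigned element_bits,
                                            bool multiply_add) {
    // Number of elements processed per vector instruction:
    // 256/64 = 4 for double4, 256/32 = 8 for float8, 64/64 = 1 for scalar code.
    const unsigned lanes = vector_bits / element_bits;
    // A multiply-add is counted as two operations per lane.
    const unsigned ops_per_lane = multiply_add ? 2u : 1u;
    return clock_hz * cores * lanes * ops_per_lane;
}
```

With a clock of 3.5 GHz this reproduces the series from the text: one scalar core, four scalar cores, four cores with double4, with float8, and with float8 plus multiply-add.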

Say we have this CPU with 4 cores and AVX/SSE, and the code below:

int* a = ...;
int* b = ...; 
for (int i = 0; i < M; i++)
   a[i] = b[i]*2;

How do you classify the accelerated version of the above code? As a parallel computation or as a vector-computation? Is it an operation using an M-wide vector, or is it using M threads? The answer is both: vector-computations are a subset of parallel computations, so vector-computations can be run in parallel threads too. This is interesting, as it means the code can run both on the AVX units and on the various cores.

If you have written the above code, you’d secretly hope the compiler finds out by itself that it can run on all hyper-threaded cores and use all the vector-extensions the CPU has. To have the code make use of the separate cores, you have various options like normal threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need to use vector-enhanced programming languages like OpenCL.
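To make the "both" answer concrete, here is a minimal sketch in plain C++ without OpenCL. The M elements are split into chunks over normal threads, and each chunk is a simple loop that a compiler may turn into SSE/AVX instructions. The names `scale_chunk` and `scale_parallel` are invented for this example, and whether the inner loop really gets vectorised depends on the compiler and its flags.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Inner loop over one contiguous chunk. Iterations are independent of each
// other, so an optimising compiler may emit SSE/AVX code for it.
void scale_chunk(int* a, const int* b, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        a[i] = b[i] * 2;
}

// Split the m elements over at most num_threads threads (the "parallel" part);
// every thread runs the chunk loop above (the "vector" part).
void scale_parallel(int* a, const int* b, std::size_t m, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (m + num_threads - 1) / num_threads;  // round up
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(m, t * chunk);
        const std::size_t end = std::min(m, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back(scale_chunk, a, b, begin, end);
    }
    for (auto& w : workers) w.join();  // wait until all chunks are done
}
```

In OpenCL you would write only the body of the loop as a kernel and leave the split over cores and vector lanes to the runtime; this sketch does the split by hand to show the two levels.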

To learn more about the difference between vectors and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around at this site (over 100 articles and a growing knowledge-section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code-review.

How to use the CPU instead of the GPU

To make use of the vector/media-extensions AVX and SSE, you need well-optimised code that the compiler can map onto AVX. OpenCL is one option; others are programming languages that compile to OpenCL, writing AVX-instructions directly, using Assembly and using special extensions.

As you might have programmed the GPU-code in OpenCL, off-loading small parts of the code to AVX+SSE is easy. The code below shows how to create a context for a CPU. Actually only one constant is different: CL_DEVICE_TYPE_CPU.

std::cout << "trying to create a CPU context." << std::endl;
context = clCreateContextFromType(contextProperties,
          CL_DEVICE_TYPE_CPU, NULL, NULL, &errNum);
if (errNum != CL_SUCCESS) {
    std::cerr << "Failed to create an OpenCL CPU context." << std::endl;
    return NULL;
} else { ...

To run OpenCL-CPU software, you do need AMD GPU drivers or Intel OpenCL drivers (included in SDK). For developing OpenCL for CPUs you need the AMD or Intel SDK.

When (not) to choose the CPU or GPU

In most cases, when the computations per byte are high enough to reach maximum occupancy, discrete GPUs are the best way to go. With the new PCIe 3.0 x16 the bandwidth is very high, but latency is the main problem here. Latency can be hidden if three conditions hold: you do batch-processing, the total transfer-time is equal to or less than the compute-time, and not all memory of the GPU is in use.

In the case of embedded GPUs (such as AMD A-series and Intel Ivy Bridge) you have other trade-offs. The GPU is not as strong as a discrete GPU, but the transfer-time is far less (especially due to lower latency) and the embedded GPU is more powerful than the AVX/SSE on the CPU. AMD claims over 15 GB/s on their A8 APU, which is as fast as PCIe 3.0 x16. The limiting factor here is the transfer-speed to and from main memory rather than the PCIe bus of discrete GPUs: the memory-bandwidth starts to be the limit (20-25 GB/s on desktop CPUs up to 37 GB/s on high-end server-CPUs).

Every byte in main memory that is used as input or output of a compute-kernel running on the GPU went through the CPU. So the moment you only need to do a few computations, the transfer to the GPU and back takes longer than it would have taken the CPU to handle it on its own. If your main problem when programming a GPU is getting a high enough occupancy, then the CPU might be a viable option. This happens more often than you think: I’ve seen acceleration-propositions being turned down, only because the problem would not be a fit for the GPU.
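The trade-off described above can be put into a rough cost model. The sketch below is not a measured result. The struct `OffloadEstimate` and all its fields are hypothetical names, and the model assumes one transfer in and one transfer out, each paying a fixed latency. You have to fill in the numbers from your own measurements.

```cpp
#include <algorithm>

// Rough cost model, all times in seconds. Every field is an assumption
// supplied by the caller, ideally taken from measurements.
struct OffloadEstimate {
    double bytes_in;     // bytes copied to the GPU
    double bytes_out;    // bytes copied back to main memory
    double bandwidth;    // sustained transfer bandwidth in bytes per second
    double latency;      // fixed cost per transfer
    double gpu_compute;  // kernel time on the GPU
    double cpu_compute;  // time for the same work on the CPU
};

// One transfer in and one transfer out, each paying the fixed latency once.
double transfer_time(const OffloadEstimate& e) {
    return 2.0 * e.latency + (e.bytes_in + e.bytes_out) / e.bandwidth;
}

// Single run: transfers and compute happen one after another.
double gpu_total_time(const OffloadEstimate& e) {
    return transfer_time(e) + e.gpu_compute;
}

// Batch-processing with transfers overlapping compute: in steady state each
// batch costs the larger of the two, so transfers are hidden as long as
// transfer_time <= gpu_compute.
double gpu_time_per_batch_overlapped(const OffloadEstimate& e) {
    return std::max(transfer_time(e), e.gpu_compute);
}

// True when keeping the work on the CPU beats a single round trip to the GPU.
bool cpu_is_faster(const OffloadEstimate& e) {
    return e.cpu_compute < gpu_total_time(e);
}
```

For a small preparation kernel the transfer term dominates and `cpu_is_faster` returns true even when the GPU kernel itself is several times faster than the CPU version.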

Above is a description of how it is now. With embedded GPUs becoming more powerful and having faster inter-connects, you cannot make assumptions about transfer-times based on the current state of the various architectures. It is best to run a small benchmark when the software runs for the first time or when new hardware or drivers are found.
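Such a first-run benchmark could look like the sketch below. It only uses the C++ standard library; the helpers `time_once` and `pick_fastest` are invented for this example. Each candidate would be a callable that runs the same workload on one device (for example the CPU path or the GPU path, including the transfers).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Time a single call of f in seconds, using a monotonic clock.
double time_once(const std::function<void()>& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Run every candidate `repeats` times and return the index of the candidate
// with the lowest best-case time. Returns 0 when there are no candidates.
std::size_t pick_fastest(const std::vector<std::function<void()>>& candidates,
                         int repeats) {
    std::size_t best_index = 0;
    double best_time = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        double t = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repeats; ++r)
            t = std::min(t, time_once(candidates[i]));  // keep the best run
        if (t < best_time) {
            best_time = t;
            best_index = i;
        }
    }
    return best_index;
}
```

The result can be stored together with the hardware and driver version, so the benchmark only has to be repeated when one of those changes.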


Currently no time – will be done later.

Using a simple computation:

  • unoptimised with GCC and LLVM
  • optimised with GCC and LLVM
  • OpenCL on CPU
  • OpenCL on GPU

E P: Shouldn’t the numbers (3.5, 14, …) be in billions rather than millions?

streamcomputing: True, thanks. Will recheck everything when I do the benchmarks.