A simple sum example, from CPU to OpenMP to OpenCL to GPU

World's biggest ball pit is a sea filled with one million balls of fun. The minimum joy-level for GPUs.


Porting code to OpenMP, OpenCL and the GPU can be very rewarding, as long as you know what you're doing. As I'll show in the example below, OpenCL on an AMD HD7970 can run code 363 times faster than OpenMP on a 4-core Intel i7-2600 CPU. You can also get an easy 4x speedup on the CPU itself. And for the marketeers who read this: "Unbelievable! Over 1440 times speedup with OpenCL!", which is sort of true if you compare against a single CPU core. And this is the tip of the iceberg! There are many other tips and tricks to get faster code. Have fun reading!

Say we have the following Fortran code (can't read F90? You might understand the OpenCL code below better):
do mm=1,M
  do nn=1,N
     vec2(nn) = vec2(nn) + vec1(nn);
  end do
end do

Here vec1 and vec2 are of type 'real' (float), and vec2 holds preinitialised data. This translates to an OpenCL kernel that we run M times (that is, the global size is M):

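__kernel void sum (const int size, __global float * vec1, __global float * vec2) {
  int ii = get_global_id(0);
  // N is assumed to be a compile-time constant supplied when the kernel is built
  for (int x=0; x < N; x++) {
    // bounds check, evaluated on every iteration
    if (ii < size) vec2[ii] += vec1[ii];
  }
}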

For M we take 5000000 (5 million). As it is unfair to compare to code running on a single core, a basic OpenMP directive/pragma was added: "!$OMP PARALLEL DO". The OpenCL kernel for the CPU and GPU are exactly the same, but we kindly ask the driver to suggest a local workgroup size, which is different for the two devices.
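
As a rough C analogue of what the OpenMP version does, the sketch below parallelises the per-element loop with a single pragma. It follows the kernel's indexing (one outer iteration per element, with the inner loop repeating the addition), which is an assumption about the intended loop nest; the function name `sum_omp` and its parameters are made up for illustration.

```c
/* Hypothetical C analogue of the OpenMP version: add vec1 into vec2
 * `repeats` times for each of `size` elements. If OpenMP is not enabled
 * at compile time, the pragma is ignored and the loop runs on one core
 * with the same result. */
void sum_omp(int size, int repeats, const float *vec1, float *vec2)
{
    /* Each outer iteration touches only element ii, so iterations are independent. */
    #pragma omp parallel for
    for (int ii = 0; ii < size; ii++) {
        for (int x = 0; x < repeats; x++) {
            vec2[ii] += vec1[ii];
        }
    }
}
```

Because every iteration of the outer loop writes to a different element, no locking or reduction clause is needed here.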

Changing the N from 1 to 10000 gives the following results (CPU is Intel i7-2600 and GPU is AMD HD7970):

N        CPU (single core)   OpenMP on CPU   OpenCL on CPU   OpenCL on GPU
1        0.048 s             0.058 s         0.112 s         0.008 s
10       0.389 s             0.204 s         0.116 s         0.011 s
100      3.857 s             1.008 s         0.351 s         0.052 s
1000     38.470 s            9.227 s         2.797 s         0.413 s
10000    387.696 s           97.672 s        27.325 s        4.140 s

We started with 387 seconds on the CPU (single core). This speeds up to 97 seconds when using OpenMP on the CPU (using all 4 cores, so ~4x faster), and to 27 seconds when using OpenCL on the CPU (another ~4x speedup due to AVX). This potential of OpenCL on CPUs is often forgotten/neglected. Yes, you can do AVX/SSE with OpenMP, but it isn't your daily pragma.

You see that when the kernel has more to do, the relative overhead of the initialization gets smaller. Comparing OpenCL on the GPU to OpenMP on the CPU, the speedup is 23x for N = 10k and 19x for N = 100. For N = 1 the speedup is only 7x. For N over 5000, the speedup stays around the same.
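
The speedups quoted here are plain ratios of the measured times. A small C helper (the name `speedup` is made up for this sketch) makes the arithmetic explicit, using the OpenMP-on-CPU and OpenCL-on-GPU timings from the results above:

```c
/* Speedup of a run relative to a baseline: baseline time divided by new time. */
double speedup(double baseline_seconds, double new_seconds)
{
    return baseline_seconds / new_seconds;
}

/* OpenMP on CPU versus OpenCL on GPU, timings taken from the results above:
 *   N = 1:     speedup(0.058, 0.008)    is 7.25
 *   N = 100:   speedup(1.008, 0.052)    is about 19.4
 *   N = 10000: speedup(97.672, 4.140)   is about 23.6
 */
```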

A little GPU optimization

To show how tricky GPU optimization can be, here is one example. You might have seen that the check “if (ii < size)” is done in every loop iteration. This line is needed to be sure that no data outside vec2's bounds is altered (a buffer overflow). But what if we put it outside the loop, like this:

__kernel void sum(const int size, __global float * vec1, __global float * vec2) {
  int ii = get_global_id(0);
  // bounds check hoisted out of the loop: evaluated once per work-item
  if (ii < size) {
    for (int x=0; x < N; x++) {
      vec2[ii] += vec1[ii];
    }
  }
}

For N=10k the time on the CPU is about the same, but on the GPU the time goes from 4.1 seconds to 0.269 seconds! That is an extra speedup of about 15x, resulting in a total speedup of 363x over OpenMP on the CPU!

You'll see that the GPU is much more sensitive to optimizations than the CPU! This is the trick of GPU programming: understanding the hardware well enough to know how the software needs to be written. Adding a directive/pragma could have generated the first kernel, leaving a 15x speedup on the table. It could also have generated the second kernel, but you won't be able to tell if you don't know how to read intermediate code.
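
To see what the bounds check protects against, the two kernel variants can be emulated on the CPU in plain C. This is only a sketch: it assumes the host rounds the global size up to a multiple of a work-group size (a common pattern, not something stated in this post), and the helper names `round_up`, `kernel_check_inside`, `kernel_check_hoisted` and `launch` are made up. Both variants compute the same result; they differ only in how often the comparison runs.

```c
/* Round `size` up to the next multiple of `local` (assumed host-side padding). */
int round_up(int size, int local)
{
    return ((size + local - 1) / local) * local;
}

/* Variant 1: bounds check inside the loop, evaluated `repeats` times per work-item. */
void kernel_check_inside(int ii, int size, int repeats,
                         const float *vec1, float *vec2)
{
    for (int x = 0; x < repeats; x++) {
        if (ii < size) vec2[ii] += vec1[ii];
    }
}

/* Variant 2: bounds check hoisted, evaluated once per work-item. */
void kernel_check_hoisted(int ii, int size, int repeats,
                          const float *vec1, float *vec2)
{
    if (ii < size) {
        for (int x = 0; x < repeats; x++) {
            vec2[ii] += vec1[ii];
        }
    }
}

/* Emulate a launch: call the kernel once per global id, padding ids included. */
void launch(void (*kernel)(int, int, int, const float *, float *),
            int global_size, int size, int repeats,
            const float *vec1, float *vec2)
{
    for (int ii = 0; ii < global_size; ii++) {
        kernel(ii, size, repeats, vec1, vec2);
    }
}
```

With size = 5 and a work-group size of 4, the emulated launch covers 8 ids; ids 5 to 7 pass through either variant without touching memory. A sequential emulation like this says nothing about GPU timing, it only shows that the two kernels are functionally equivalent.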

Want to learn more? Buy a book or attend one of our trainings in Europe. On request we also provide trainings worldwide.

  • Arseny Kapoulkine

    With all due respect, these results don’t make any sense to me. Maybe the code/numbers are just wrong and the logic is sane.

    First, is M 5*10^6 or 5*10^7? You say it’s 5 million but the number you quoted is 50 million.

    Second, your GPU code does not do the same thing your CPU code does. Your CPU code does M passes through the array, on every pass it performs a summation of two arrays; your GPU code essentially does N passes through the array, and on every pass performs summation of the two. As a consequence, your GPU kernel code is trivially optimizable into vec2[ii] += vec1[ii] * N (your CPU code is optimizable into vec2[i] += vec1[i] * M after loop rearrangement).

    Your CPU code as stated is not a reduction and is trivially vectorizable; some C++ compilers (sorry, F90 compilers. Seriously?..) should be able to do it automatically by now.

    Finally (disclaimer: I am not an expert on GPGPU), how does removing a 100% coherent branch on a scalar GPU make the code 15x faster? Also note that any CPU compiler that’s good enough will do loop-invariant code motion and extract your branch out of the loop body. Honestly I would have expected a GPU compiler to do the same, but if there is a performance difference it’s clearly not doing that.

    • streamcomputing

      There were requests to have a simple example. For years I said: just try and see for yourself, but not many people actually did. Now I hope people will do this same experiment with other code and see if they can make OMP and OCL versions out of plain C or Fortran code.

      I used the words “The OpenCL kernel for the CPU and GPU are exactly the same” and not “The OpenCL code for the CPU and GPU do exactly the same thing”. You described only a part of the answer; this blog post would be endlessly long if I discussed all of it. So here is my line of simplification, but I write on this blog almost weekly.

      The code is not coherent for the GPU – it is coherent for you. I just wanted to show a simple example and make a few points. One is that the word “should” often means “I am so frustrated if it doesn’t work like I expected” – so just don’t use pragmas/directives without checking the generated code.