Performance can be measured as Throughput, Latency or Processor Utilisation

40225151 - fiber optic cable
Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, then we might not be able to answer that question. It depends on the data and the metric.

In this article I’ll give an overview of different ways to describe speed and what metrics are used. I focus on two types of data-utilisations:

  • Transfers. Data-movements through cables, interconnects, etc.
  • Processors. Data-processing. with data in and data out.

Both are important to select the right hardware. When we help our customers select the best hardware for their software,an important part of the advice is based on it.

Transfer utilisation: Throughput

How many bytes gets processed per second, minute or hour? Often a metric of GB/s is used, but even MB/day is possible. Alternatively items per second is used, when relative speed is discussed. An alternative word is bandwidth, which described the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch-process, or that results are given continuously. The throughput is the same, but the so called latency is very different.

Transfer utilisation: Latency

What is the time between the data-offering and the results? Or what is the reaction time? It is measured in time (often nanoseconds (ns, a billionth of a second), microsecond (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, its still called latency but more often it’s called “processing time”

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

  1. Reaction time: hardware/software noticing there is a job
  2. Transport time: it takes time to copy data, especially when we talk GBs
  3. Process time: computing the data can

When latency is most important we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency between context-switching from and to the GPU is a lot lower than when discrete GPUs are used).

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor possibly is the bottleneck of the system. The metric we use for this balance is “”FLOPS per byte”. This means that the less data is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.

resizedimage600300-rooflineai (1)

The below image shows how the above algorithms on the roofline-model. You see that for many processors you need to have at least 4 FLOPS per byte to hit the frequency-wall, else you’ll hit the bandwidth-wall.


This is why HBM is so important.

Processors utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but then on system level.

For FPGAs this latency can be very low (10s of nanoseconds) when data-cables are directly connected to the FPGA-chip. Such FPGAs are on a board with i.e. a network-port and/or a DisplayPort-port.

GPUs depend on how well they’re connected to the CPU. As this is a subject on its own, I’ll discuss in another post.

Determining the theoretical speed of a system

A request “Make this hardware as fast as possible” is a lot easier (and cheaper) to solve than “Make this hardware as fast as possible on hardware X”. This is because there is no one fastest hardware (even though vendors make believe us so), there is only hardware most optimal for a specific algorithm.

When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge. internship/externship

Our internship is according the description: it’s a rather complex homepage which should look good on your CV (if you manage to build it).

Want to help build an important website?’s components have been designed and partly built, but still a lot of work needs to be done. We’re seeking an intern (or “extern” when not in Amsterdam) who can help us build the site. This internship is not about GPUs!

To complete the tasks, the following is required:

  • Technical expertise:
    • HTML5, CSS
    • PHP
    • Javascript
    • jQuery
    • Node.js
    • Mediawiki
    • XSLT
  • Can-do mentality
  • Able to plan own work
  • Good communication-skills
  • Available for 3 to 6 months

We don’t expect you know all tools, so we will guide you in learning new tools and techniques. Write us a “email of interest” to, and write what you can and what your objectives for an internship would be.

We’re looking forward to see your letter!

AMD is back!

AMD_Logo-and-wordmark-1024x768For years we haven been complaining on this blog what AMD was lacking and what needed to be improved. And as you might have concluded from the title of this blogpost, there has been a lot of progress.

AMD is back! It will all come together in the beginning of 2017, but you’ll see a lot of progress already the coming weeks and months.

AMD quietly recognised and solved various totally new problems in HPC, becoming the hidden innovator everybody needed.

This blog is to give an overview of how AMD managed to come back and what it took to get to there. Their market cap supports it, as you can see.

AMD’s market cap is back at 2012 levels (source)

Continue reading “AMD is back!”

ISC lunch discussion: Portable Open Standards in HPC

ISC-HPC-logoAre you around at ISC and have an opinion on portable open standards? Then you should join the discussion with other professionals at ISC. Some suggestions for discussions:

  • (non)-preference for open standards like OpenCL, OpenMP, HSA and OpenACC.
  • Portability versus performance.
  • Using scripting languages in HPC, like Python.

Below info might change, because a map is never reality! Make sure you check this page or our Twitter channel the evening before.

  • Where: ISC in Frankfurt, at the catering area near the lounge areas.
  • What: Food, HPC and Open Standards.
  • When: Tuesday 21 June, 12:00-13:00


See you there!

Let’s meet at ISC in Frankfurt

ISC-HPC-logoVincent Hindriksen will be walking around at ISC from 20 to 22 June. With me I bring our latest brochure, some examples of great optimisations and some Dutch delicacies. Also we will also have some exciting news with an important partner – stay tuned!

It will be a perfect time to discuss how StreamComputing can help you solve tough compute problems. Below is a regularly updated schedule of my time at ISC.

Get in contact to schedule a meeting.

If you’d like to talk technologies and bits&bytes, we’re trying to make a get-together – date&time TBD.

An introduction to Grid-processors: Parallella, Kalray and KnuPath

gridWe have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I’d like you to give a quick introduction to grid-processors.

Grid-processors are different from GPUs. Where a multi-core GPU gets its strength from being able to compute lots of data in parallel (SIMD data-parallellism), a grid-processors is able to have each core do something differently (MIMD, task-based parallelism). You could say that a grid-processor is a multi-core CPU, where the number of cores is at least 16, and the cores are only connected to their neighbours. The difference with full-blown CPUs is that the cores are smaller (like the GPU) and thus use less power. The companies themselves categorise their processors as DSPs or Digital Signal Processors, but most popular DSPs only have 1 to 8 cores.

For the context, there are several types of bus-configurations:

  • single bus: like the PCIe-bus in a PC or the iMX6.
  • ring bus: like the XeonPhi till Knights Corner, and the Cell processor.
  • star bus: a central communication core with the compute-cores around.
  • full mesh bus: each core is connected to each core.
  • grid bus: all cores are connected to their direct neighbours. Messages hop from core to core.

Each of them have their advantages and disadvantages. Grid-processors get great performance (per Watt) with:

  • video encoding
  • signal processing
  • cryptography
  • neural networks

Continue reading “An introduction to Grid-processors: Parallella, Kalray and KnuPath”

Download all OpenCL header files and build your own OpenCL library

opencl-logoOpenCL header files

When you develop professional software, it is a best practise to have external header files fixed, to have versioning under full control. This way you don’t get the surprises when your colleague has another OpenCL SDK installed. Luckily the Khronos Group has put all version of the OpenCL header files on Github, so you can easily download the targeted OpenCL version.

Download a zip of the header files here:

If you found problems in one of these, you can directly communicate with the working group by submitting an issue on Github.

OpenCL.lib /

But wait, there is more!

You can build your own ICD, as the sources are open (licence). OpenCL version 2.1 is implemented, but it is fully backwards compatible to OpenCL 1.0. You can assume that the vendors use this code for their own, so you can safely use this code in your project.

Get the project from Github.

Heterogeneous Systems Architecture – memory sharing and task dispatching

HSA-logoWant to get an overview of what Heterogeneous Systems Architecture (HSA) does, or want to know what terminology has changed since version 1.0? Read further.

Back in 2012 the goals for HSA were high. The group tried to design a system where CPU and GPU would work together in an efficient way. In the 2013/2014 time-frame you’ll find lots of articles around the web, including on our blog, describing the capabilities of HSA. Unfortunately with the 1.0 specifications most terminologies have been changed.

In March 2015 the HSA Foundation released the final 1.0 specifications. It does not discuss hUMA (Heterogeneous Uniform Memory Access) nor hQ (Heterogeneous Queuing). These two techniques had undergone so many updates, that new terminologies were used.

In this blog post, we’ll present you an updated description of the two most important problems tackled by HSA: memory sharing and task dispatching.

We’ll be tuning the below description, so feedback is always welcome – focus is on clarity, not on completeness.

What is an HSA System?

Where the original HSA goals focused more on SoCs with CPU and GPU cores, now any compute core can be used. The reason was that modern SoCs are much more complex than just a CPU and GPU – integrated DSPs and video-decoder are found on many processors. HSA thus now (officially) supports truly heterogeneous architectures.hsa_mem_arch_3

The idea is that any heterogeneous processor can be designed by the principles of HSA. This will bring down design costs and enable more exotic configurations from different vendors.

And interesting fact about the HSA-specifications is that it only specifies goals, not how it must be implemented. This makes it possible to implement the specifications in software instead of hardware, making it possible to upgrade older hardware to HSA.

Why is HSA important?

A simple question: “will there be more CPUs with embedded GPU or discrete GPUs?”. A simple answer: “there are already more integrated GPUs than discrete ones”. HSA defines those chips with mixed processors.

CPUs with embedded GPUs used to be not much more than the discrete GPUs with shared memory we know from cheap laptops in the 00’s. When the GPU got integrated, each vendor started to create solutions for inter-processor dispatching (threading extended to heterogeneous computing), course-grained sharing (transferring ownership between processor units) and fine grained sharing (atomics working with all processor units).

The HSA Foundation

Sometimes an industry makes bigger steps by competing and sometimes by collaborating

AMD recognised the need for a standard. As AMD wanted to avoid the problems with introducing 64 bit into X86 and therefore initiated the HSA foundation. The founding members are AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung and Texas Instruments. NVidia and Intel are awkwardly absent.

Memory Sharing

HSA uses a relaxed memory model, which has full memory coherence (data guaranteed to be the same for all processes on all cores) and is pageable (subsets can be reserved by programs).

The below write-up is heavily simplified to give an overview how memory sharing is designed under HSA. If you want to know more, read chapter 5 from the HSA book.

Inter-processor memory-pointer sharing – Unified Addressing

The most important part is the unified memory model (previously referred to as “hUMA”), which makes programming the memory-interactions in a heterogeneous processor with CPU-cores, GPU-cores and DPS-cores comparable to a multi-core CPU.

Like other modern memory models, HSA defines various segments, including global, shared and private. A difference is that flat addressing is used. This means that each address pointer is unique: you don’t have an address 0 for private and an address 0 for global. Flat addressing simplifies optimisation operations for higher level languages. Ofcourse you still need to be aware that each segment size is limited and there will be consequences when defining larger memory chunks than is available in the segment.

When you have created a memory object and want the DSP or GPU continue to work on it, then you can use the same pointers without any translations.

Inter-processor cache coherency

In HSA-systems global memory is coherent without the need for explicit cache maintenance. This means that local caches are synchronised and/or that caches are shared. For more information, read this blog from ARM.

Fine grained memory – Atomic Operations

HSA allows protecting memory segments to be atomicly accessed. This makes it possible to have multiple threads running on different cores of different processor units, all accessing the same memory in a safe manner.

Small and large consequtive memory segments can be reserved for sharing, from very fine to coarse grained. All threads that have access to that segement are notified when atomic operations are done.

Fine Grained Shared Virtual Memory (HSA compatibility for discrete GPUs)

AMD has done some efforts to extend HSA to discrete GPUs. We’ll see the real advantages with dispatching, but it also works to create a cleaner memory management.

The so called “Fine Grained Shared Virtual memory” makes it possible use HSA with discrete GPUs that have HSA-support. Because it’s virtual and data is continuously transferred between GPU and the HSA-processor, the performance is ofcourse lower than when using real shared memory. You can compare it to NVidia’s Unified Virtual Memory, and it also has been planned to be in OpenCL 2.0 for a long time.


HSA defines in detail how a task gets into the queue of a worker thread. Below is an overview of how queues, threads and tasks are defined and are named under HSA.


Before HSA 1.0 we only spoke of “Heterogeneous Queue” (hQ). This is now further developed to “User Mode Queues”. A User Mode Queue holds the list of tasks for that specific (group of) processor cores, resides in the shared memory and is allocated at runtime.

Such task is described in a language called “Architected Queueing Language” (AQL), and is called an “AQL package”.

Agents and Kernel Agents

HSA threads run on one or a group of processor cores. These threads are called “Agents” and come in two variations: normal Agents and Kernel Agents. A Kernel Agents is an Agent that has a User Mode Queue and can execute kernels that work on a segment of memory. A normal Agent doesn’t have a queue and can only execute simple tasks.

If a normal agent cannot run kernels, but can run tasks, then what can it actually do? Here are a few examples:

  • Allocate memory, or other tasks only the host can do.
  • Send back (intermediate) data to the host – for example progress indication.

If you compare to OpenCL, an agent is the host (which creates the work) and kernel agents are the kernels (which can issue new threads under OpenCL 2.0).

AQL packages: communicating dispatch tasks

There are different types of the AQL (Architected Queueing Language) packets, of which these are the most important:

  • Agent dispatch packet: contains jobs for normal agents.
  • Kernel dispatch packet: contains jobs for kernel agents.
  • Vendor-specific packet: between processors of the same vendor there can be more freedoms.

In most cases, we’ll be talking about kernel dispatch packages.

The Doorbell signal: low latency dispatching

HSA dispatching is extremely fast and power-efficient due to the implementation of a “doorbell”. The doorbell of an agent is signalled when a new tasks is available, making it possible to take immediate action. A problem in OpenCL is the high dispatch times for GPUs without a doorbell – up to the millisecond range, as we have measured. For HSA-enabled GPUs the response-time before a kernel starts running is in the microseconds range.

Context switching

Threads can move from one core to another core – the task will be removed from the current queue and added to another queue. This can even happen when the thread is in running state.

StreamComputing’s position

The solution simply works and makes faster code – we have done a large project with it last year.

It seems that almost the whole embedded processor industry believes in it. AMD (CPU+GPU), ARM (CPU+GPU), Imagination (GPU), Mediatek, Qualcomm (GPU), Samsung and Texas Instruments (DSP) are founders. Companies like Analog Devices, CEVA, Sony, VIA, S3, Marvell and Cadence have later joined the club. Important Linux clubs like Linaro and Canonical are also seen.

The system-on-a-chip only will get more traction, and we see HSA as an enabler. Languages like OpenCL and OpenMP can be compiled down to HSA, so it just takes switching the compiler. HSA-capable software can be written in a more efficient manner, as now can be assumed that memory can efficiently be shared and dispatching new threads is really fast.

The most noticeable processors from NVIDIA, AMD and Intel

AMD-Intel-NVidia10 years ago we had CPUs from Intel and AMD and GPUs from ATI and NVidia. There was even another CPU-makers VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to shortly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently. 


As NVidia doesn’t have X86, they mostly focuses on GPUs and bet on POWER and ARM for CPU. They already sell their Pascal-architecture in small numbers.

2017 will all be about their Pascal-architecture.

kepler-k80Tesla K80 (Kepler)

  • The GPU is not simply 2 x K40 (GK110B GPUs), the chip is actually different (GK210)
  • It is the Nvidia GPU with the largest private memory size (used in kernels): 255.

This is the GPU for lazy programmers and for actually complex code: kernels can use double the registers.

Pascal P100 (Pascal)

  • 20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
  • 16 GB HBM2 (720 GB/s).
  • NVlink up to 64 GB/s effectively (20% of the 80 GB/s is protocol-overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers a bidirectional 16 GB/sec up and 16 GB/sec down. Compared to 12 GB/s PCIe3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal-GPUs, and not between the GPU and CPU yet.
  • OpenPOWER support coming, to compete with Intel.

Now only available in a $129.000 costing server with 8 of these (making the price of each P100 $15.000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up-to-speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is also very interesting too. A server with 8 GPUs gives you 80 TFLOPS – double that, if you only need Half Precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

  • The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
  • The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

The two best-sold GPUs from NVidia, which are not server-grade. What interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it’s more recent.

Tegra X1

  • 10 Watts

While not well-accepted in the car industry (uses too much power and no OpenCL), they are well-accepted in the car-entertainment industry.


Known for the strongest OpenCL-developers since 2012. With HSA-capable Fiji-GPUs, they now got to their third GPGPU-architecture after “VLIW” and “GCN” – fully driven by their HSA-initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

  • 32GB GDDR5 global memory

The GPU’s processor is the same as the FirePro S9150, which has been the unknown best DP-performer of the past years. The GPU got the top 1 spot using air-cooled solutions, only to be surpassed by oil-submersed solutions. The S9170 builds on top of this and adds an extra 16GB of memory.

The S9170 is the GPU with the largest amount of memory, solving problems that use a lot of memory and are bandwidth limited – think calculations on oil&gas and weather, which now don’t fit on GPUs.

FireProS9300X2Radeon Nano and FirePro S9300X2 (Fiji)

  • Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
  • S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
  • Nano 175 Watt, S9300X2 300 Watt
  • Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs brings the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

  • CPU with built-in GPU
  • About one TFLOPS
  • TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallel and data-parallel software runs best on such processor. This CPU+GPU has the most TFLOPS available on the market.


After years of “Peter and the wolf” stories, they seem to finally have gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are at the horizon.

Their focus is still on customers who focus on test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

  • 55MB cache, 22 cores
  • AVX 2.0 (256 bit vector operations)
  • DDR4 (60 GB/s)

Not well-known, but this CPU is very capable to run complex HPC-code for the price of an high-end GPU. It could reach about 0.64 GFLOPS DP peak, when fully using all cores and AVX 2.0.

XeonPHi_KNL_socketXeonPhi Knights landing

  • Available in socket and PCI version
  • AVX 512 (512 bit vector operations)
  • 16 GB HBM (over 400GB/s), up to 348 GB DDR4 (60 GB/s).
  • Currently (?) not programmable with OpenCL

After years of okish XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor, and can then be improved step-by-step. The only think not to be liked is the lack of benchmarks – so above numbers are all on paper.


  • Task-parallel processor
  • Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I’m still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCI on this processor.

Iris GPUs

  • CPU with built-in GPU
  • 0.7 TFLOPS SP

As these GPUs are included in almost all CPU that Intel sells, these are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on the TCO (Total Costs of Ownership). At StreamComputing we have experience with many of the devices above, but also various embedded hardware that compete with the above processors on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

9 questions on OpenCL’s future answered at IWOCL

IWOCL-logoDuring the panel discussion some very interesting questions were asked, I’d like to share with you.

Should the Khronos group poll the community more often about the future of OpenCL?

I asked it on twitter, and this is the current result:khronos-community-feedback

Khronos needs more feedback from OpenCL developers, to better serve the user base. Tell the OpenCL working group what holds you back in solving your specific problems here. Want more influence? You can also join the OpenCL advisory board, or join Khronos with your company. Get in contact with Neil Trevett for more information.

How to (further) popularise OpenCL?

While the open standard is popular at IWOCL, it is not popular enough at universities. NVidia puts a lot of effort in convincing academics that OpenCL is not as good as CUDA and to keep CUDA as the only GPGPU API in the curriculum.

Altera: “OpenCL is important to be thought at universities, because of the low-level parts, it creates better programmers”. And I agree, too many freshly graduated CS students don’t understand malloc() and say “The compiler should solve this for me”.

The short answer is: more marketing.

At StreamComputing we have been supporting OpenCL with marketing (via this blog) since 2010. 6 years already. We are now developing the website to continue the effort, while we have diversified at the company.

How to get all vendors to OpenCL 2.0?

Ofcourse this was a question targeted at NVidia, and thus Neil Trevett answered this one. Use a carrot and not a stick, as it is business in the end.

Think more marketing and more apps. We already have a big list:opencl-library-ecosphere

Know more active(!) projects? Share!

Can we break the backwards compatibility to advance faster?

This was a question from the panel to the audience. From what I sensed, the audience and panel are quite open to this. This would mean that OpenCL could make a big step forward, fixing the initial problems. Deprecation would be the way to go the panel said. (OpenCL 2.3 deprecates all nuisances and OpenCL 3.0 is a redesign? Or will it take longer?)

See also the question below on better serving FPGAs and DSPs.

Should we do a specs freeze and harden the implementations?

Michael Wong (OpenMP) was clear on this. Learn from C++98. Two years were focused on hardening the implementations. After that it took 11 years to restart the innovation process and get to C++11! So don’t do a specs freeze.

How to evolve OpenCL faster?

Vendor extensions are the only option.

At StreamComputing we have discussed a lot about it, especially fall-backs. In most cases it is very doable to create slower fall-backs, and in other cases (like with special features on i.e. FPGAs) it can be the only option to make it work.

How to get more robust OpenCL implementations?

Open sourcing the Vulkan conformance tests was a very good decision to make Vulkan more robust. Khronos gets a lot of feedback on the test cases. It will be discussed soon to what extend this also can be done for OpenCL.

Test-cases from open source libraries are often used to create more test cases.

How to better support FPGAs and DSPs?

Now GPUs are the majority and democracy doesn’t work for the minorities.

An option to better support FPGAs and DSPs in OpenCL is to introduce feature sets. A lesson learnt from Vulkan. This way GPU vendors don’t need to spend time implementing features that they don’t find interesting.

Do we see you at IWOCL 2017?

Location will be announced later. Boston and Toronto are mentioned.

Comparing Syntax for CUDA, OpenCL and HiP

Both CUDA and OpenCL are well-known GPGPU-languages. Unfortunately there are some slight differences between the languages, which are shown below.

You might have heard of HiP, the language that AMD made to support both modern AMD Fiji GPUs and CUDA-devices. CUDA can be (mostly automatically) translated to HiP and from that moment your code also supports AMD high-end devices.

To give an overview how HiP compares to other APIs, Ben Sanders made an overview. Below you’ll find the table for CUDA, OpenCL and HiP, slightly altered to be more complete. The languages HC and C++AMP can be found in the original.

Term CUDA OpenCL HiP
Device int deviceId cl_device int deviceId
Queue cudaStream_t cl_command_queue hipStream_t
Event cudaEvent_t cl_event hipEvent_t
Memory void * cl_mem void *
Grid of threads grid NDRange grid
Subgroup of threads block work-group block
Thread thread work-item thread
Scheduled execution warp sub-group (warp, wavefront, etc) warp
Thread-index threadIdx.x get_local_id(0) hipThreadIdx_x
Block-index blockIdx.x get_group_id(0) hipBlockIdx_x
Block-dim blockDim.x get_local_size(0) hipBlockDim_x
Grid-dim gridDim.x get_global_size(0) hipGridDim_x
Device Kernel __global__ __kernel __global__
Device Function __device__ N/A. Implied in device compilation __device__
Host Function __host_ (default) N/A. Implied in host compilation. __host_ (default)
Host + Device Function __host____device__ N/A. __host____device__
Kernel Launch <<< >>> clEnqueueNDRangeKernel hipLaunchKernel
Global Memory __global__ __global __global__
Group Memory __shared__ __local __shared__
Private Memory (default) __private (default)
Constant __constant__ __constant __constant__
Thread Synchronisation __syncthreads barrier(CLK_LOCAL_MEMFENCE) __syncthreads
Atomic Builtins atomicAdd atomic_add atomicAdd
Precise Math cos(f) cos(f) cos(f)
Fast Math __cos(f) native_cos(f) __cos(f)
Vector float4 float4 float4

You see that HiP borrowed from CUDA.

The discussion is ofcourse if all alike APIs shouldn’t use the same wordings. A best thing would be to mix for the best, as CUDA’s “shared” is much more clearer than OpenCL’s “local”. OpenCL’s functions on locations and dimensions (get_global_id(0) and such) on the other had, are often more appreciated than what CUDA offers. CUDA’s “<<< >>>” breaks all C/C++ compilers, making it very hard to make a frontend of IDE-plugin.

I hope you found the above useful to better understand the differences between CUDA and OpenCL, but also to see how HiP comes into the picture.

Meet us in April

9017503_mThe coming month we’re travelling every week. This generates are a lot of opportunities where you can meet the StreamComputing team! For appointments, send an email to

  • Meet us at ParallelCon (6 April 2016, Heidelberg, Germany). Besides the crash course (see below), we also have a talk on Vulkan.
  • Crash Course OpenCL @ ParallelCon (8 April 2016, Heidelberg, Germany). This is part of the conference – you can still buy tickets!
  • Meet us in Toronto (11 April 2016, Toronto, Canada). In Toronto for business, with time for appointments.
  • Meet us at IWOCL (19 April 2016, Vienna, Austria). The event-of-the-year for all OpenCL. So ofcourse we’re there.
  • Meet us in Grenoble (25 April 2016, Grenoble, France). For a training we’re there the whole week. On Thursday and Friday there is time for appointments.

We’re happy to talk business and about technology. Also giving presentations at your company is an option.

This information was previously communicated via the newsletter and on LinkedIn.

AMD’s infographic on HBM

If one company is bad at bragging, then it’s AMD. It’s two main competitors are a lot better in that – NVIDIA even bragged about their upcoming GPUs having HBM. So I was surprised that recently I encountered a nice infographic, where AMD was actually bragging. And they deserved to do it!

I wanted to have comparisons with Intel/Micron’s HBC, but I leave that for another post as the good information is often a year old.

Very close to the processor

It’s using a high-speed bus on the substrate.


And yes, it really matters to be closer to the processor.


HBM versus GDDR5

  • Bus width from 32-bit tot 1024-bit
  • Clockspeed down. We need to wait for how it’s calculated.
  • Bandwidth up a lot. We can expect 1TB/s for GPUs now
  • Required voltage 14% down, which saves a lot of energy



Better GB/s per Watt

So for maintaining 320GB/s you would need 30 Watts. Now you need 9 Watts. As the reduction in power for the Radeon NANO is almost 100W, you understand that this tells only part of the power-reductions made possible.


A lot smaller

Yes, 94% less surface area. Only part of the reason is the stacking.


Standards AMD has pioneered

HBM has been engineered and implemented by AMD, made a standard by JEDEC and put into sylicon by Hynix.


And finally some bragging! AMD has made many standards we use daily, but never knew it was AMD technology.

  • Mantle. The predecessor of Vulkan, DirectX 12 and more
  • GDDR 1 to 5. Now being replaced by HBM, and not GDDR6
  • Wake-on-LAN. You never knew! Intel and IBM made it into a standard, but AMD introduced the Magic Packet in 1995.
  • DisplayPort Adaptive-Sync. Previously known as FreeSync.
  • X86-64. The reason why you find “amd64” packages in Linux.
  • Integrated Memory Controllers.
  • On-die GPUs.
  • Consumer Multicore CPU, the Athlon 62 X2.
  • HSA. Not in the list, probably because it’s a recent advancement.

Want to see the full infographic? Click here.

An example of real-world, end-user OpenCL usage

We ask all our customers if we could use their story on our webpage. For competition reasons, this is often not possible. The people of CEWE Stiftung & Co. KGaA were so kind to share his experience since he did a OpenCL training with us and we reviewed his code.

Enjoy his story on his experience from the training till now!

This year, the CEWE is planning to implement some program code of the CEWE Photoworld in OpenCL. This software is used for the creation and purchase of photo products such as the CEWE Photobook, CEWE Calendars, greeting cards and other products with an installation base of about 10 million throughout Europe. It is written in Qt and works on Windows, Mac and Linux.


In the next version, CEWE plans to improve the speed of image effects such as the oil painting filter, to become more useful in the world of photo manipulation. Customers like to some imaging effects to improve photo products, to get even more individual results, fix accidentally badly focused photos and so on.

Continue reading “An example of real-world, end-user OpenCL usage”

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

random_300Random numbers are important elements in stochastic simulations, but they also show up in machine learning and applications of Monte Carlo methods such as within computational finances, fluid dynamics and molecular dynamics. These are classical fields in high-performance computing, which StreamComputing has experience in.

A common problem when porting traditional software in these fields to parallel computing environments is the generation and reproducibility of random numbers. Questions that arise are:

  • Performance: How can we efficiently obtain random numbers when they are classically generated in a serial fashion?
  • Quality: How can we make sure that random numbers generated in a parallel environment still fulfil statistical randomness requirements?
  • Verification: How can we be sure that the parallel implementation is correct?

We consider verification from the viewpoint of producing identical results among different software implementations. This is often an important matter for our customers, and we have given them guidance on how to address this issue when random numbers are involved.

In this first part of our two-part blog series, we will briefly address some common pitfalls in the generation of random numbers in parallel environments and suggest suitable random-number generation libraries for OpenCL and CUDA. In the second part – on the blog soon – we will discuss solutions for reproducibility in the presence of random numbers.


Random numbers in computer software are typically obtained via a deterministic pseudo-random number generator (PRNG) algorithm. The output of such an algorithm is not truly random but pseudo-random (i.e., it appears statistically random), though we will simply say “random” for simplicity. We do not consider truly random numbers, which may be derived from physical phenomena such as radioactive decay, because we want the output of a random number generator to be reproducible.

PRNGs traditionally offered to application developers fail within the parallel setting. One reason is that these algorithms usually only support the sequential generation of random numbers based on some initial (seed) value (e.g., consider the standard C rand() function), so work items on a parallel device would need to block for getting exclusive access to the generator, which clearly impacts efficiency.

Some applications may require only a moderate amount of random numbers. In this case, we found it feasible to precompute the required set of random numbers and hold them in global memory. We call this the table-based approach. Other applications in turn may need to efficiently create a huge amount of random numbers. In this case, it may be necessary to equip each work item with its own PRNG seed. One potential problem with this approach is the use of weak PRNGs such as linear congruential generators (LCGs), which remain popular due to their speed and simplicity. In parallel settings, correlations between output sequences are aggravated and the quality of the application output may be severely affected, so LCGs should not be used at all. Another problem is the use of a small seed or a small PRNG’s internal state space. In this case, we may expect that the probability of two work items creating the same random sequence is quite high. Indeed, if we would randomly seed via srand(), the chance is already 50% for two out of approximately 77,000 work items creating entirely the same random number sequence! So we may either need a PRNG with a larger seed space and internal state, or one with a larger state and some mechanism to subslice the PRNG’s output sequence into non-overlapping “substreams”, with one substream per work item. The Mersenne Twister is highly acclaimed but requires a memory state of approximately 2.5 KB per work item in a parallel setting, and substreams are difficult to implement. While good PRNGs with a small internal state and flexible substream support exist (e.g., MRG32k3a), there are also “index-based” PRNGs, which are often more elaborate to compute but do not maintain any state. Such state-less PRNGs take an arbitrary index and a “key” as input and return a random number corresponding to the index in its random output sequence (which depends on the key chosen). Index-based PRNGs are very useful in parallel computing environments, and we will show how we use them for reproducibility in the second part of this blog.

The choice of an appropriate PRNG may not be easy and ultimately depends on the application scenario. Luckily, there is choice! CUDA offers a set of PRNGs via its cuRAND library, and OpenCL applications can benefit from the clRNG library that AMD has released last year. Both cuRAND and clRNG offer a state-based interface with substream support. For index-based algorithms, the Random123 library provides high-quality PRNG implementations for both OpenCL and CUDA.

So far, we have discussed how we can safely generate random numbers in the GPU and FPGA context, but we cannot control the order in which parallel, concurrent work items create random numbers. This makes it difficult to verify the parallel implementation since its output may be different from that of the serial, original code. So the question is, in the presence of random numbers, how can we easily verify that our parallel code implements not only a faithful but a correct port of the serial version? This is addressed in part two – continue reading.

New: OpenCL Crash Courses

opencl-logoTo see if OpenCL is the right choice for your project, we now only ask one day of your time. This enables you to quickly assess the technology without a big investment.

Throughout Europe we give crash courses in OpenCL. After just one day you will know:

  • The models used to define OpenCL.
  • If OpenCL is an option for your project.
  • How to read and understand OpenCL code.
  • Code simple OpenCL programs.
  • Differences between CPUs, GPUs and FPGAs.

There are two types: GPU-oriented and FPGA-oriented. We’ve selected Altera FPGAs and AMD FirePro GPUs to do the standard trainings.

No events

If you are interested to get a certain crash course in another city than currently scheduled, fill in the below form to get notified when the crash course of your city of choice.

We will add more dates and places continuously. If you want to host an OpenCL crash course event, get in contact.

Note: crash courses are intended to get you in contact with software accelerators, so it doesn’t replace a full training.

Atomic operations for floats in OpenCL – improved

Atomic floats
Atomic floats

Looking around in our code and on our intranet, I found lots of great and unique solutions to get the most performance out. Ofcourse we cannot share all of those, but there are some code-snippets that just have to get out. Below is one of them.

In OpenCL there is only atomic_add or atomic_mul for ints, and not for floats. Unfortunately there are situations there is no other way to implement the algorithm, than with atomics and floats. Already in 2011 Igor Suhorukov shared a solution to get atomic functions for floats, using atomic_cmpxchg(). Here is his example for atomic add:

inline void AtomicAdd_g_f(volatile __global float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, 
                               newVal.intVal) != prevVal.intVal);

Unfortunately this implementation is not guaranteed to produce the correct results because OpenCL does not enforce global/local memory consistency across all work-items from all work-groups. In other words, a read from the buffer source is not guaranteed to actually perform a read from the specified global buffer; it could, for example, return the value stored in a local cache. For more details check the chapter 3.3 – “Memory Model”, subchapter “Memory Consistency” of the OpenCL specification.
To be sure that the implementation for AtomicAdd_g_f is correct we need to use the value returned by the function atomic_cmpxchg. This guarantees that the actual value stored in global memory is returned.

As it seems our improved version is hidden deep in the code of GROMACS, here’s the code you should use:

   _INLINE_ void atomicAdd_g_f(volatile __global float *addr, float val)
           unsigned int u32;
           float        f32;
       } next, expected, current;
   	current.f32    = *addr;
   	   expected.f32 = current.f32;
           next.f32     = expected.f32 + val;
   		current.u32  = atomic_cmpxchg( (volatile __global unsigned int *)addr, 
                               expected.u32, next.u32);
       } while( current.u32 != expected.u32 );

As was mentioned in Suhorukov’s blog post, you can change the global to local, and implement the other operations likewise:

Atomic_mul_g_f(): next.floatVal = expected.f32 * operand;
Atomic_mad_g_f(source, operand1, operand2): next.f32 = mad(operand1, operand2, expected.f32);
Atomic_div_g_f(): next.f32 = expected.f32 / operand;


Performance of 5 accelerators in 4 images

runningIf there would be one rule to get the best performance, then it’s avoiding data-transfers. Therefore it’s important to have lots of bandwidth and GFLOPS per processor, and not simply add up those numbers. Everybody who has worked with MPI, knows why: transferring data between processors can totally kill the performance. So the more is packed in one chip, the better the results.

In this short article, I would like to quickly give you an overview of the current state for bandwidth and performance. You would think the current generation accelerators is very close, but actually it is not.

The devices in the below images are AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel XeonPhi 7120P (16GB) and Intel Xeon 2699 v3 (18 core CPU). I doubted about selecting a K40 or K80, as I wanted to focus on a single GPU only – so I took both. Dual-GPU cards have an advantage when it comes to power-consumption and physical space – both are not taken into consideration in this blog. Neither efficiency (actual performance compared to theoretical maximum) is included, as this also needs a broad explanation.

Each of these accelerators runs on X86-OpenMP and OpenCL

The numbers

The bandwidth and performance show where things stand: The XeonPhi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest on double precision GFLOPS.

Xeon Phi gets to 350 GB/s, followed by the FirePro with 320 GB/s and K40 with 288 GB/s. NVidia’s K80 is only as 240 GB/s, where DDR gets only 50 -60 GB/s.


The FirePro leaves the competition far behind with 2530 GFLOPS (Double Precision). The K40 and K80 get 1430 and 1450, followed by the CPU at 1324 and the Xeon Phi at 1208. Notice these are theoretical maximums and will be lower in real-world applications.


If you have OpenCL or OpenMP code, you can optimise your code for a new device in a short time. Yes, you should have written it in OpenCL or openMP, as now the competition can easily outperform you by selecting a better device.


Lowest prices in the Netherlands, at the moment of writing:

  • Intel Xeon 2699 v3: € 6,560.
  • Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
  • NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
  • NVidia Tesla K40: € 4,070
  • AMD FirePro S9150: € 3,500

Some prices (like the K40) have one shop with a low price, where others are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the “costs per GB/s” is only half the story. Currently the accelerators only have 16GB. Soon a 32GB FirePro will be available in the shops, the S9170, to move up in this space of memory hungry HPC applications.


Costs per GB/s
Where the four accelerators are around €11 per GB/s, the Xeon takes €131 (see note above). Note that the K40 with €14.13 is expensive compared to the other accelerators.


For raw GFLOPS the FirePro is the cheapest, followed by the K80, XeonPhi and then the K40. While the XeonPhi and K40 are twice as expensive as the FirePro, the Xeon is clearly the most expensive as it is 3.5 times as expensive as the FirePro.

If costs are an issue, then it really makes sense to invest some time in making your own Excel sheets for several devices and include costs for power usage.

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others – we can help you by optimising your code and performing meaningful benchmarks.

PHD position at university of Newcastle

emcAt the university of Newcastle they use OpenCL for researching the performance balance between software and hardware. This resource management isn’t limited to shared memory systems, but extends to mixed architectures where batches of co-processors and other resources make it much more complex problem to solve. They chose OpenCL as it gives both inter-node and intra-node resource-management.

Currently they offer a PhD position and seek the brilliant mind that can solve the heterogeneous puzzle like a chess player. It is a continuation of years of research and the full description is in the PDF below.

Continue reading “PHD position at university of Newcastle”

SC15 news from Monday

SC15Warning: below is raw material, and needs some editing.

Today there was quite some news around OpenCL, I’m afraid I can’t wait till later to have all news covered. Some news is unexpected, some is great. Let’s start with the great news, as the unexpected news needs some discussion.

Khronos released OpenCL 2.1 final specs

As of today you can download the header files and specs from The biggest changes are:

  • C++ kernels (still separate source files, which is to be tackled by SYCL)
  • Subgroups are now a core functionality. This enables finer grain control of hardware threading.
  • New function clCloneKernel enables copying of kernel objects and state for safe implementation of copy constructors in wrapper classes. Hear all Java and .NET folks cheer?
  • Low-latency device timer queries for alignment of profiling data between device and host code.

OpenCL 2.1 will be supported by AMD. Intel was very loud with support when the provisional specs got released, but gave no comments today. Other vendors did not release an official statement.

Khronos released SPIR-V 1.0 final specs

SPIR-V 1.0 can represent the full capabilities of OpenCL 2.1 kernels.

This is very important! OpenCL is not the only language anymore that is seen as input for GPU-compilers. Neither is OpenCL hostcode the only API that can handle the compute shaders, as also Vulkan can do this. Lots of details still have to be seen, as not all SPIRV compilers will have full support for all OpenCL-related commands.

With the specs the following tools have been released:

SPIRV will make many frontends possible, giving co-processor powers to every programming language that exists. I will blog more about SPIRV possibilities the coming year.

Intel claims OpenMP is up to 10x faster than OpenCL

The below image appeared on Twitter, claiming that OpenMP was much faster than OpenCL. Some discussion later, we could conclude they compared apples and oranges. We’re happy to peer-review the results, putting the claims in a full perspective where MKL and operation mode is mentioned. Unfortunately they did not react, as <sarcasm>we will be very happy to admit that for the first time in history a directive language is faster than an explicit language – finally we have magic!</sarcasm>

Left half is FFT and GEMM based, probably using Intel’s KML. All algorithms seems to be run in a different mode (native mode) when using OpenMP, for which intel did not provide OpenCL driver support for.

We get back later this week on Intel and their upcoming Xeon+FPGA chip, if OpenCL is the best language for that job. It ofcourse is possible that they try to run OpenMP on the FPGA, but then this would be big surprise. Truth is that Intel doesn’t like this powerful open standard intruding the HPC market, where they have a monopoly.

AMD claims OpenCL is hardly used in HPC

Well, this is one of those claims that they did not really think through. OpenCL is used in HPC quite a lot, but mostly on NVidia hardware. Why not just CUDA there? Well, there is demand for OpenCL for several reasons:

  • Avoid vendor lock-in.
  • Making code work on more hardware.
  • General interest in co-processors, not specific one brand.
  • Initial code is being developed on different hardware.

Thing is that NVidia did a superb job in getting their processors in supercomputers and clouds. So OpenCL is mostly run on NVidia hardware and a therefore the biggest reason why that company is so successful in slowing the advancement of the standard by rolling out upgrades 4 years later. Even though I tried to get the story out, NVidia is not eager to tell the beautiful love story between OpenCL and the NVidia co-processor, as the latter has CUDA as its wife.

Also at HPC sites with Intel XeonPhi gets OpenCL love. Same here: Intel prefers to tell about their OpenMP instead of OpenCL.

AMD has few HPC sites and indeed there is where OpenCL is used.

No, we’re not happy that AMD tells such things, only to promote its own new languages.

CUDA goes AMD and open source

AMD now supports CUDA! The details: they have made a tool that can compile CUDA to “HiP” – HiP is a new language without much details at the moment. Yes, I have the same questions as you are asking now.

AMD @ SC15-page-007

Also Google joined in and showed progress on their open source implementation of CUDA. Phoronix is currently the best source for this initative and today they shared a story with a link to slides from Google on the project. the results are great up: “it is to 51% faster on internal end-to-end benchmarks, on par with open-source benchmarks, compile time is 8% faster on average and 2.4x faster for pathological compilations compared to NVIDIA’s official CUDA compiler (NVCC)”.

For compiling CUDA in LLVM you need three parts:

  • a pre-processor that works around the non-standard <<<…>>> notation and splits off the kernels.
  • a source-to-source compiler for the kernels.
  • an bridge between the CUDA API and another API, like OpenCL.

Google has done most of this and now focuses mostly on performance. The OpenCL community can use this to use this project to make a complete CUDA-to-SPIRV compiler and use the rest to improve POCL.

Khronos gets a more open homepage

Starting today you can help keeping the Khronos webpage more up-to-date. Just put a pull request at and wait until it gets accepted. This should help the pages be more up-to-date, as you can now improve the webpages in more ways.

More news?

AMD released HCC, a C++ language with OpenMP built-in that doesn’t compile to SPIRV.

There have been tutorials and talks on OpenCL, which I should have shared with you earlier.

Tomorrow another post with more news. If I forgot something on Sunday or Monday, I’ll add it here.

OpenCL at SC15 – the booths to go to

SC15This year we’re unfortunately not at SuperComputing 2015 for reasons you will hear later. But we haven’t forgotten about the people going and trying to find a share of OpenCL. Below is a list of companies having a booth at SC15, which was assembled by the guys of IWOCL and we completed with some more background information.


The first place to go to is booth #285 and meet Khronos to hear where to go at SC15 to see how OpenCL has risen over the years. More info here. Say hi from the StreamComputing team!

OpenCL on FPGAs

Altera | Booth: #462. Expected to have many demos on OpenCL. See their program here. They have brought several partners around the floor, all expecting to have OpenCL demos:

  • Reflex | Booth: #3115.
  • BittWare | Booth #3010.
  • Nallatech | Booth #1639.
  • Gidel | Booth #1937.

Xilinx | Booth: #381. Expected to show their latest advancements on OpenCL. See their program here.

Microsoft | Booth: #1319. Microsoft Bing is accelerated using Altera and OpenCL. Ask them for some great technical details.

ICHEC | Booth #2822. The Irish HPC centre works together with Xilinx using OpenCL.

Embedded OpenCL

ARM | Booth: #2015. Big on 64 bit processors with several partners on the floor. Interesting to ask them about the OpenCL-driver for the CPU and their latest MALI performance.

Huawei Enterprise | #173. Recently proudly showed the world their OpenCL capable camera-phones, using ARM MALI.


Below are the three companies that promise at least 1 TFLOPS DP per co-processor.

Intel | Booth: #1333/1533. Where they spoke about OpenMP and forgot about OpenCL, Altera has brought them back. Maybe they share some plans about Xeon+FPGA, or OpenCL support for the new XeonPhi.

AMD | Booth: #727. HBM, HSA, Green500, HPC APU, 32GB GPUs and 2.2 TFLOPS performance – enough to talk about with them. Also lots of OpenCL love.

NVidia | Booth: #1021. Every year they have been quite funny when asked about why OpenCL is badly supported. Please do ask them this question again! Funniest answer wins something from us – to be decided.


You’ll find OpenCL in many other places.

ArrayFire | Booth #2229. Their library has an OpenCL backend.

IBM | Booth: #522. Now Altera joined Intel, IBM’s OpenPower has been left with NVidia for accelerators. OpenCL could revive the initiative.

NEC | Booth: #313. The NEC group has accelerated PostgreSQL with OpenCL.

Send your photos and news!

Help us complete this post with news and photos, to complete this post. We’re sorry not to be there this year, so we need your help to make the OpenCL party complete. You can send via email, twitter and in the comments below. Thanks in advance!