Category Archives: Featured


A) Unified Memory Access (UMA). B) NVIDIA’s Unified Virtual Addressing (UVA), now rebranded as “Unified Memory”.

AMD, ARM vendors and Intel have been busy unifying CPU and GPU memories for years. It is not easy to design a model where two (or more) processors can access memory without deadlocking each other.

NVIDIA just announced CUDA 6 and, to my surprise, it includes “Unified Memory”. Am I missing something completely, or did they just overtake their competitors, as the name implies one physical memory? The answer is in their definition:

Unified Memory — Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

The official definition is:

Unified Memory Access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

See the difference?

The image at the right explains it differently: A) is how UMA is officially defined, and B) is how NVIDIA has redefined it.

So NVIDIA’s Unified Memory solution is engineered by marketers, not by hardware engineers. On Twitter, I seem not to be the only one who felt the need to explain that it is different from the terminology the other hardware designers have been using.

So if it is not unified memory, what is it?

It is intelligent synchronisation between CPU and GPU memory. The real question is what the difference is between Unified Virtual Addressing (UVA, introduced in CUDA 4) and this new feature.


UVA defines a single address space, where CUDA takes care of the synchronisation when the addresses are physically not in the same memory. The developer has to give ownership to either the CPU or the GPU, so CUDA knows when to sync memories. It does need cudaDeviceSynchronize() to trigger the synchronisation (see image).
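To make the ownership hand-over concrete, here is a minimal sketch of how the announced CUDA 6 managed memory is supposed to be used. The kernel `inc` is hypothetical and this has not been run on hardware; it only illustrates the pattern of one shared pointer plus an explicit sync point:

```c
// Sketch of CUDA 6 "Unified Memory" -- hypothetical kernel, untested.
__global__ void inc(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(void) {
    int n = 1024, *data;
    // One allocation visible to both CPU and GPU; the runtime
    // migrates the data behind the scenes.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; i++) data[i] = i;   // CPU writes directly

    inc<<<(n + 255) / 256, 256>>>(data, n);    // GPU uses the same pointer
    cudaDeviceSynchronize();                   // hand ownership back to the CPU

    // data is now updated -- no cudaMemcpy in either direction
    cudaFree(data);
    return 0;
}
```

The point is that the synchronisation step has not disappeared; it is the explicit per-direction copies that have.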


From AnandTech, which wrote about Unified (virtual) Memory:

This in turn is intended to make CUDA programming more accessible to wider audiences that may not have been interested in doing their own memory management, or even just freeing up existing CUDA developers from having to do it in the future, speeding up code development.

So it’s to attract new developers, and then later take care of them being bad programmers? I cannot agree, even if it makes GPU programming popular – I don’t bike on highways.

From Phoronix, which discussed the changes of NVIDIA Linux driver 331.17:

The new NVIDIA Unified Kernel Memory module is a new kernel module for a Unified Memory feature to be exposed by an upcoming release of NVIDIA’s CUDA. The new module is nvidia-uvm.ko and will allow for a unified memory space between the GPU and system RAM.

So it is UVM 2.0, but without any API changes. That’s clear then. It simply matters a lot whether it is real or virtual, and I really don’t understand why NVIDIA chose to obfuscate these matters.

In OpenCL this has to be done explicitly, by mapping and unmapping pinned memory, but it is very comparable to what UVM does. I do think UVM is a cleaner API.
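For comparison, this is roughly what the explicit map/unmap dance looks like in OpenCL host code. It is a fragment, not a complete program: the context, queue and kernel setup are assumed to exist, and error checking is stripped:

```c
/* OpenCL equivalent: explicit map/unmap of pinned (host-accessible) memory.
   Fragment only -- ctx, queue, kernel and size are assumed to be set up. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

/* Map: gives the CPU ownership of the buffer contents. */
int *host_ptr = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                          0, size, 0, NULL, NULL, &err);
/* ... fill host_ptr on the CPU ... */

/* Unmap: hands ownership back to the device before kernels touch it. */
clEnqueueUnmapMemObject(queue, buf, host_ptr, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
```

The ownership transfer is the same idea as in UVA, just spelled out call by call.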

Let me know what you think. If you have additional information, I’m happy to add this.


Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you have knowledge of the technology. Neither does it provide any information on the basics of Altera’s Stratix V or other FPGAs.

It is their first public optimisation document, so feedback sent directly to them is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.


The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution
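Several of the listed tricks show up directly in kernel source. Below is a small, hypothetical OpenCL kernel sketching three of them: `restrict` to avoid pointer aliasing, the `reqd_work_group_size` attribute to fix the work-group size at compile time, and `#pragma unroll` for loop unrolling (the unroll pragma is an Altera extension to standard OpenCL C):

```c
// Hypothetical kernel illustrating three tricks from the guide's list.
__attribute__((reqd_work_group_size(64, 1, 1)))   // known size -> smaller hardware
__kernel void sum4_scale(__global const float * restrict in,   // no aliasing
                         __global float * restrict out,
                         float factor)
{
    int gid = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll            // Altera extension: replicate the loop body
    for (int k = 0; k < 4; k++)
        acc += in[gid * 4 + k];
    out[gid] = acc * factor;
}
```

On an FPGA these hints matter more than on a GPU, since they directly shape the pipeline the compiler generates.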

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.


You can download the document here.

If you have any question on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcomed to contact us.

The past year you might not have heard much about OpenCL-on-ARM, besides the Arndale developer board. You have heard just a small portion of what has been going on.

Yesterday the (Linux) OpenCL drivers for the Chromebook (which contains an ARM MALI T604) were released, and several companies will launch products using OpenCL.

Below are a few interviews with companies who have built such products. This will give an idea of what is possible on these low-power devices. To first get an idea of what this MALI T604 GPU can do when it comes to OpenCL, here is a video from the 2013 edition of the LEAP conference we co-organised.

Understand that the whole board takes less than 11.6 Watts – that is including the CPU, GPU, memory, interconnects, networking, SD-card, power adapter, etc. Only a small portion of that is the GPU. I don’t know the exact specs, as this developer board was not targeted towards energy-optimisation goals. I do know this is less than the 225 Watts of a discrete GPU alone.

Interviews with ARM partners Read more …

Altera has been very busy adding resources, and kicked off June by opening up their OpenCL program to the general public.
Only Stratix V devices are supported, but that could change later.

Below are all pages and PDFs concerning OpenCL I’ve found while searching Altera’s website.

Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms

Altera wanted to know where they could compete with GPUs and CPUs. For a big company, their comparisons are quite honest (for instance about their limited memory-access speed), but they don’t tell everything – like the hours(!) of compilation time. The idea is that you develop on a GPU and, once it’s correct, port the working software to the FPGA.

If you don’t have any experience working with their FPGAs, it is best to ask around.


Image taken from Altera website.

Read more …


Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need for auto-tuning.

Which algorithms map best to GPUs and other vector processors? In other words: what kinds of algorithms are faster when using accelerators and OpenCL?

Professor Wu Feng and his group from Virginia Tech took a close look at which types of algorithms are a good fit for vector processors. This resulted in a document: “The 13 (computational) dwarfs of OpenCL” (2011). It became an important document here at StreamComputing, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He named these algorithmic methods “dwarfs”. With six more application areas in which GPUs and other vector-accelerated processors did well, the list was completed.

As a funny side note: in the Brothers Grimm’s “Snow White” there were 7 dwarfs, and in Tolkien’s “The Hobbit” there were 13. Read more …

Bored at work? Go start working for one of the anti-boring GPU-expert companies: StreamComputing (Netherlands, EU), Appilo (Israel) or AccelerEyes (Georgia, US).

We all look for people who know how to code GPUs. Experience is key, so you need to have at least one year of experience in GPU-programming.

Submit your CV now and don’t give up on your dream of leaving the boring times behind.


Amsterdam-based StreamComputing is a young startup with a typical Dutch, open atmosphere (gezellig). Projects are always different from the previous one, as we work in various fields. Most work is parallelisation and OpenCL coding. We are quite demanding about your knowledge of the various hardware platforms OpenCL works on (GPUs, mobile GPUs, array processors, DSPs, FPGAs).

See the jobs-page for more information and how to apply.


North-Israel-based Appilo is seeking GPU programmers for project-based contracts. Depending on the project, most of the work can usually be performed remotely.

Use the mail on the contact-page at Appilo to send your CV.


Atlanta-based AccelerEyes delivers products which are used to accelerate C, C++, and Fortran codes on CUDA GPUs and OpenCL devices. They look for people who believe in their products and want to help make them better.

See their jobs-page for more information and how to apply.