Category Archives: Featured

GPUVerify is a tool for formal analysis of GPU kernels written in OpenCL and CUDA. The tool can prove that kernels are free from certain types of defect, such as data races and barrier divergence. This is quite useful feedback for any GPU programmer.

Below you find an online version of the tool (please don’t break it!). Play around and test your kernels. Currently it is limited to OpenCL kernels. Be aware that the number of groups is not the same as the global work size (the global work size is the number of groups multiplied by the local work size).

For demo purposes some values have been pre-filled with a simple kernel – press “Check my OpenCL kernel” to see the results. Did you expect this from this kernel? Can you explain the result?

After the LEAP conference I’ll extend this article – until then I’m short on time. For now I wanted to share the online version with you, especially with the people who will attend the tutorial at LEAP. Be sure to check out the GPUVerify website and paper to learn more about this fantastic tool! Read more …

On 20 April there was a discussion between Jan Gray and David Kanter. Jan is a specialist in C++ and FPGAs (twitter, homepage), David a specialist in CPU and GPU architectures (twitter, homepage). Both know their way around the field of semiconductors. It is always a joy to follow their short discussions when they happen, and this one I’d especially like to share with you.

OpenCL on ARM: Growth-expectation of GFLOPS/Watt of mobile GPUs exceeds Moore’s law. That’s incredible!

Jan Gray: .@OpenCLonARM GFLOPS/W more a factor of almost-over Dennard Scaling. But plenty of waste still to quash. http://www.fpgacpu.org/papers/Gray_AutumnOfMooresLaw_SingularityUniversity_11-06-23.pdf

Jan Gray: .@openclonarm Scratch Dennard tweet: reduced capacitance of yet smaller devices shd improve GFLOPS/W even as we approach end of Vdd scaling.

David Kanter: @jangray @OpenCLonARM I think some companies would argue Vdd scaling isn’t dead…

Jan Gray: @TheKanter @openclonarm it’s not dead, but slowing, we’ve gone from 5V to 1V (25x power savings) and have maybe several hundred mVs to go.

David Kanter: @jangray I reckon we have at least 400mV, so ~2X; slower than ideal, but still significant

Jan Gray: @TheKanter We agree, I think.

David Kanter: @jangray I suspect that if GPU scaling > Moore’s Law then they are just spending more area or power; like discrete GPUs in the last decade

David Kanter: @jangray also, most positive comment I’ve heard from industry folks on mobile GPU software and drivers is “catastrophically terrible”

Jan Gray: @TheKanter Many ways to reduce power, soup to nuts. For ex HMC DRAM on interposer for lower energy signaling. I’m sure many tricks to come.

In a nutshell, these are all the reasons why they think mobile GPUs can outpace Moore’s law while staying under a certain power usage.

It needs some background info, so let’s start with the background of the first tweet, and then explain what has been said. Read more …
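The “25x power savings” in the exchange above presumably follows from the usual rule of thumb that dynamic power is proportional to capacitance times the square of the supply voltage times the clock frequency. A minimal sketch of that arithmetic, assuming capacitance and frequency stay the same (the function name is made up for this illustration):

```c
#include <assert.h>

/* Factor by which dynamic power drops when the supply voltage goes from
 * v_old to v_new. Assumption: capacitance and clock frequency are unchanged,
 * so only the V^2 term of P ~ C * V^2 * f matters. */
double dynamic_power_savings(double v_old, double v_new) {
    assert(v_old > 0.0 && v_new > 0.0);
    double ratio = v_old / v_new;
    return ratio * ratio;
}
```

Calling `dynamic_power_savings(5.0, 1.0)` gives 25, the figure quoted for going from 5V to 1V. Real chips also change capacitance and frequency between generations, so this only covers the voltage part of the story.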

In OpenCL, large memory objects residing in the main memory of the host or the global memory of the accelerator/GPU need special treatment. The first reason is that these memories are relatively slow. The second reason is that the mostly serial copying of objects between these two memories takes time.

In this post I’d like to discuss all the flags used when creating memory objects, and what they can do to assist in this special treatment.

This is explained on this page of clCreateBuffer in the specifications, but I think it is not really clear. The function clCreateBuffer (and the similar functions for creating images, sub-buffers, etc.) suggests that you create a special OpenCL object to be given as an argument to the kernel. What actually happens is that space is made available in the main memory of the accelerator, and optionally a link with host memory is made.

The flags are divided into three groups: device access, host access and host pointer flags.
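To make the three groups a bit more concrete, here is a small sketch in plain C that does not need an OpenCL installation. The `MEM_*` names and the `mem_flags_valid` helper are made up for this illustration; the bit values mirror the `CL_MEM_*` constants as I recall them from `cl.h`, so check them against your own header. The rules encoded are the mutual exclusions the specification describes: at most one device access flag, at most one host access flag, and `USE_HOST_PTR` cannot be combined with `ALLOC_HOST_PTR` or `COPY_HOST_PTR`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t mem_flags;              /* cl_mem_flags is a 64-bit bitfield */

/* Device access: how the kernel may use the memory object. */
#define MEM_READ_WRITE      (1u << 0)
#define MEM_WRITE_ONLY      (1u << 1)
#define MEM_READ_ONLY       (1u << 2)
/* Host pointer: how the object relates to memory on the host. */
#define MEM_USE_HOST_PTR    (1u << 3)
#define MEM_ALLOC_HOST_PTR  (1u << 4)
#define MEM_COPY_HOST_PTR   (1u << 5)
/* Host access: how the host may use the memory object. */
#define MEM_HOST_WRITE_ONLY (1u << 7)
#define MEM_HOST_READ_ONLY  (1u << 8)
#define MEM_HOST_NO_ACCESS  (1u << 9)

#define DEVICE_ACCESS_MASK (MEM_READ_WRITE | MEM_WRITE_ONLY | MEM_READ_ONLY)
#define HOST_ACCESS_MASK   (MEM_HOST_WRITE_ONLY | MEM_HOST_READ_ONLY | MEM_HOST_NO_ACCESS)

/* Counts how many bits are set in f. */
static int bits_set(mem_flags f) {
    int n = 0;
    while (f) {
        n += (int)(f & 1u);
        f >>= 1;
    }
    return n;
}

/* Returns true when the combination respects the mutual exclusions
 * within each of the three groups. */
bool mem_flags_valid(mem_flags f) {
    if (bits_set(f & DEVICE_ACCESS_MASK) > 1) return false;
    if (bits_set(f & HOST_ACCESS_MASK) > 1) return false;
    if ((f & MEM_USE_HOST_PTR) &&
        (f & (MEM_ALLOC_HOST_PTR | MEM_COPY_HOST_PTR))) return false;
    return true;
}
```

For example, `MEM_READ_ONLY | MEM_COPY_HOST_PTR` passes the check, while `MEM_READ_ONLY | MEM_WRITE_ONLY` does not. The sketch only looks at the flags themselves; it does not check the host pointer argument that the host pointer flags depend on.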

Read more …

Curved iMac has your back…

Nuno Teixeira designed a large curved monitor in 2008 and assumed it would never be made. For a “few” thousand dollars, NEC offers one to you right now. Also Samsung and LG have announced several new curved TVs at CES 2013 (with an HDMI port). We only need a workstation to go with it, where this blog article might come in handy.

So you want to start developing for OpenCL? When you focus on developing OpenCL for X86, you have these three options: CPUs, GPUs and CPUs with an embedded GPU. This article is for you and represents the current state of hardware – if you want the best hardware for your specific algorithm, the below information is probably not sufficient.

In 2013 we focus on 4 groups: servers/cloud (FirePro, Tesla, XeonPhi), workstations (discussed here), low-power devices (SoCs) and special accelerators (FPGAs and DSPs). This article does not discuss high-end accelerators of a few thousand euros, which are laid out here.

Before reading on, you need to set the goal for your workstation.

  • If you want to learn the basics of OpenCL programming, first check if your current machine has OpenCL support.
  • If you need more processing power, be sure you select the right hardware for the job. Don’t buy the most expensive hardware (FirePro, Tesla or XeonPhi), but take your time to find out which hardware supports your algorithms best. Feel free to ask us.
  • If you want to make sure your software works on various types of accelerators, you can choose between:
    • swapping PCIe cards – disadvantage is the driver hassle and the time it takes.
    • more accelerators in one machine – disadvantage is that only GPU 1 can do OpenGL/DirectX.
    • identical machines with different accelerators – disadvantage is the price.
  • If you want to focus on multi-GPU development, you need:
    • either enough power supply and a motherboard that supports many lanes,
    • or a video card with two GPUs.

This article has the goal of helping you buy a good machine for OpenCL development. Prices are as of January 2013. If you think I make the wrong suggestions, please give feedback via the comments.

My contacts at various companies can tell: I want to stay independent no matter what. No deals have been made, nor was there any outside influence, except from the friendly people of the local computer shops. I was surprised that I ended up suggesting so much AMD hardware, and I felt quite uncomfortable with it – I finally decided to keep to my first conclusions and leave the comments completely open.

Read more …

Say you have some data that needs to be used as input for a larger kernel, but needs a little preparation to get it aligned in memory (a small kernel with random reads). Unluckily, the efficiency of such a kernel is very low, and there is no speed-up or even a slowdown. When programming a GPU it is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA programmers) once it has been decided to use accelerators: just use the CPU. The main problem is not the kernel that has been optimised for the GPU, but that all supporting code (like the host code) needs to be rewritten to be able to use the CPU.

Why use the CPU for vector-computations?

The CPU has support for computing vectors. Each core has a 256-bit wide vector unit. This means a double4 (a vector of four 64-bit floats) can be computed in one clock cycle. So a 4-core CPU at 3.5 GHz goes from 3.5 billion operations per second to 14 billion when using all 4 cores, and to 56 billion when using vectors. When using a float8, it doubles to 112 billion. Using MAD instructions (Multiply+Add), this can be doubled again to 224 billion operations per second.

Say we have this CPU with 4 cores and AVX/SSE, and the below code:
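The numbers above are simple multiplications, which can be written down as a small helper. This is only a sketch of the peak arithmetic under the assumptions in the text (one vector instruction per core per clock cycle); the function name and parameters are made up for this illustration.

```c
#include <assert.h>

/* Theoretical peak in billions of operations per second.
 * ghz:           clock frequency in GHz
 * cores:         number of cores used
 * vector_bits:   width of the vector unit (256 in the example above)
 * element_bits:  size of one element (64 for double, 32 for float)
 * ops_per_instr: 1 for a plain operation, 2 for a multiply+add */
double peak_gops(double ghz, int cores, int vector_bits,
                 int element_bits, int ops_per_instr) {
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;  /* elements per vector */
    return ghz * cores * lanes * ops_per_instr;
}
```

With these parameters, `peak_gops(3.5, 4, 256, 64, 1)` gives 56, `peak_gops(3.5, 4, 256, 32, 1)` gives 112 and `peak_gops(3.5, 4, 256, 32, 2)` gives 224. These are upper bounds; memory bandwidth and instruction mix usually keep real code well below them.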

int* a = ...;
int* b = ...;
for (int i = 0; i < M; i++) {
   a[i] = b[i]*2;  // each iteration is independent of all the others
}

How do you classify the accelerated version of the above code? As a parallel computation or as a vector computation? Is it an operation using an M-wide vector, or is it using M threads? The answer is both – vector computations are a subset of parallel computations, so vector computations can be run in parallel threads too. This is interesting, as it means the code can run both on the AVX units and on the various cores.

If you have written the above code, you’d secretly hope the compiler figures out by itself that it can run on all hyper-threaded cores and all the vector extensions the CPU has. To have the code make use of the separate cores, you have various options like normal threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need to use vector-enhanced programming languages like OpenCL.
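To see how the same loop can be viewed as both, here is a sketch in plain C that splits the work by hand. The outer split hands each "worker" its own contiguous chunk (these chunks do not overlap, so each could be given to a separate thread), and the inner loop walks over a fixed number of lanes, which is the shape a vector unit handles in one go. The names `double_all`, `double_chunk` and `LANES` are made up for this illustration, and the chunks here simply run one after another.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* assumption: eight 32-bit ints per 256-bit vector */

/* Doubles the elements in [begin, end). The inner loop has a fixed
 * trip count, which makes it easy for a compiler to vectorise. */
static void double_chunk(int *a, const int *b, size_t begin, size_t end) {
    size_t i = begin;
    for (; i + LANES <= end; i += LANES)
        for (size_t lane = 0; lane < LANES; lane++)
            a[i + lane] = b[i + lane] * 2;
    for (; i < end; i++)          /* leftover elements, one at a time */
        a[i] = b[i] * 2;
}

/* Splits [0, m) into `workers` contiguous chunks and processes each. */
void double_all(int *a, const int *b, size_t m, size_t workers) {
    assert(workers > 0);
    size_t chunk = (m + workers - 1) / workers;  /* rounded up */
    for (size_t w = 0; w < workers; w++) {
        size_t begin = w * chunk;
        size_t end = begin + chunk < m ? begin + chunk : m;
        if (begin < end)
            double_chunk(a, b, begin, end);
    }
}
```

The result is the same as the original loop for any number of workers. Whether the inner loop actually becomes vector instructions depends on the compiler and its flags, which is exactly the uncertainty described above.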

To learn more about the difference between vectors and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around at this site (over 100 articles and a growing knowledge-section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code-review.

Read more …

On 15 November 2011 Altera announced support for OpenCL. The time between the announcement of OpenCL support and actually working SDKs always takes longer than expected, so I did not expect anything working on FPGAs before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed in these slides, and add some personal notes.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Read more …
