Company Blog

Say you have some data that needs to be used as input for a larger kernel, but needs a little preparation to get it aligned in memory (small kernel and random reads). Unluckily the efficiency of such kernel is very low and there is no speed-up or even a slowdown. When programming a GPU it is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA-programmers) once is decided to use accelerators: just use the CPU. Main problem is not the kernel that has been optimised for the GPU, but all supporting code (like the host-code) needs to be rewritten to be able to use the CPU.

Why use the CPU for vector-computations?

The CPU has support for computing vectors. Each core has a 256 bit wide vector computer. This mean a double4 (a vector of 4 times a 64-bit float) can be computed in one clock-cycle. So a 4-core CPU of 3.5GHz goes from 3.5 billion instructions to 14 billion when using all 4 cores, and to 56 billion instructions when using vectors. When using a float8, it doubles to 112 billion instructions. Using MAD-instructions (Multiply+Add), this can be doubled to even 224 billion instructions.

Say we have this CPU with 4 core and AVX/SSE, and the below code:

int* a = ...;
int* b = ...; 
for (int i = 0; i < M; i++)
   a[i] = b[i]*2;
}

How do you classify the accelerated version of above code? A parallel computation or a vector-computation? Is it is an operation using an M-wide vector or is it using M threads. The answer is both – vector-computations are a subset of parallel computations, so vector-computations can be run in parallel threads too. This is interesting, as this means the code can run on both the AVX as on the various codes.

If you have written the above code, you’d secretly hope the compiler finds out this automatically runs on all hyper-threaded cores and all vector-extensions it has. To have code made use of the separate cores, you have various options like normal threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need to use vector-enhanced programming languages like OpenCL.

To learn more about the difference between vectors and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around at this site (over 100 articles and a growing knowledge-section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code-review.

Read more …

During the “little” HPC-show, SC12, several vendors have launched some very impressive products. Question is who steals the show from whom? Intel got their Phi-processor finally launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

This card is the fastest card out there with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

The S10000 can be seen as a dual-GPU version of the S90000, but does not fully add up. Most obvious is the big difference in power-usage (325 Watt) and the active cooling. As server-cases are made for 225 Watt cooling-power, this is seen as a potential possible disadvantage. But AMD has clearly looked around – for GPUs not 1U-cases are used, but 3U-servers using the full width to stack several GPUs.

Read more …

NOTE: there are many contradicting sources out there, so there are mistakes in this article. Please give me feedback via twitter, mail or comments, so all the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal away the show from the Tesla K20 and FirePro S10000.

After two years of waiting Intel finally comes with an accelerator-card: the Xeon Phi. Compare it if NVIDIA would have skipped the GTX 200 series and now has presented the GTX 500 series. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute-card as we know it. As you cannot do a 1-to-1 comparison between AMD GCN architecture and NVIDIA Kepler, neither can be easily compared to the Phi. But this article should give an idea on where it is positioned.

Read more …

Two months ago I wrote about the FirePro S9000 – AMD’s answer to the K10 – and was already looking forward to this K20. Where in the gaming world, it hardly matters what card you buy to get good gaming performance, in the compute-world it does. AMD presented 3.230 TFLOPS with 6GB of memory, and now we are going to see what the K20 can do.

The K20 is very different from its predecessor, the K10. Biggest difference is the difference between the number of double precision capability and this card being more powerful using a single GPU. To keep power-usage low, it is clocked at only 705MHz compared to 1GHz of the K10. I could not find information of the memory-bandwidth.

ECC is turned on by default, hence you see it presented having 5GB. No information yet if this card also has ECC on the local memories/caches.

Read more …

For who hasn’t seen the latest addition to our knowledge base, we have added a list of all (almost) available OpenCL-SDKs. You can find it in the menu under “Knowledge Base” -> “SDKs…“.

This list shows how important OpenCL is getting, as developers now can write compute-intensive parallel software on CPUs, GPUs, ARM-based accelerators and even FPGAs. This growth of OpenCL-devices is very exciting and important news, and that’s why it has got its own section on the site.

The the current list is (in random order):

Currently looking into:

  • Intel Xeon Phi
  • Nintendo Wii U dev
  • Sony Playstation 4 Orbis
  • Vivante
  • Xilinx
  • NVidia GPUs
  • Qualcomm

The SDK of NVIDIA is on the second list, what you maybe did not unexpected. We have to wait until they have put their official statement on what they are going to do with CUDA and OpenCL.

While you are there, also check the other parts of the Knowledge Base:

  • What is… -> Explanations of terminology. Put your requests in a comment.
  • Event&Talks -> A list of events which StreamComputing attends, give talks at and helps organise. Interesting for both managers and engineers.
  • Self Study – The part of the site most visited after the blog. This is for the engineers who want to start learning programming GPUs.

This section will be updated and extended continuously with information not available anywhere else.

StreamComputing has been in the OpenCL business since 2010 as one of the few. We have been the most visible and known OpenCL-specialist ever since.

In many hard sciences focus is on formulas and text, whereas images are mainly graphs or simplified representations of researched matters. Beautiful visualisations are mainly artist’s impressions in popular media targeting hobby-scientists. When Cyrille Favreau made the first good-working version of his real-time GPU-accelerated raytracer, he saw potential in exactly this area: beautiful, realistic visualisations to be used in serious science. This resulted in software called IPV.

He chose to focus on rendering molecules of proteins and this article discusses raytracing in molecular sciences, while highlighting the features of the software.

This project has been discussed on GPU Science, but this article looks at the the software from a slightly different perspective. If you don’t want to know how the software works and what it can do, scroll down for a download-link.

Read more …

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question answered here is: “How to do you make software that performs on several platforms?”.

This article is not fully finished – I add more information during the coming months. It’s busy here.

Even in many Java-code you’ll find hard-coded filename-delimiters in the file-names, which then work on one OS only. Portability is a problem that is in various aspects of programming. Let’s look at some main goals software can have, and which portability-problems they have.

  • Functional. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
  • User-interface. This is what one sees and which is not too abstract to talk about. For example porting software to a touch-device requires a lot of rethinking of interaction-principles.
  • API and library usage. To lower development-time, existing and known APIs and libraries are used. This can work out three ways: separation of concerns, less development-time and dependency. First two being good architectural choices, the latter be a potential hazard. Changing the underlying APIs is not easy.
  • Data-types. Handling video is different from handling video-formats. If the files can be handles in the intermediate form used by the software, then adding new file-types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only corporate operating systems tend to think of their own platform only, but also competing standards. It compares a lot to what is described under APIs.
  • Hardware-performance. Optimising software for a specific platform, makes it harder to port to other platforms. And this is the subject of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have if it comes to write code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become more clear that NVIDIA has the problem in their GPGPU-language too, whereas it was used before to differentiate CUDA from OpenCL. CUDA 5.0 has many new features only available on the latest Kepler-GPUs.

Read more …

AFDS was full of talks on OpenCL. You missed them, just like me? Then you will be happy that they put many videos on Youtube!

Enjoy watching! As all videos are around 40 minutes, it is best to take a full day for watching them all. The first part is on openCL itself, second is on tools, third on OpenCL usages, fourth on other subjects.

Want more information on the AMD Fusion Developer Summit 2012, visit: http://bit.ly/AFDS_2012.

Read more …

On 15 November 2011 Altera announced support for OpenCL. The time between announcements for having/getting OpenCL-support and getting to see actually working SDKs takes always longer than expected, so to get this working on FPGAs I did not expect anything before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed on these sheets, and add some personal notes to it.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Read more …

Thursday 4 October I talk on mobile compute at Hackers&Founders Amsterdam on what mobile compute can do. The goal is to initiate new ideas for start-ups, as not many know their mobile phone and tablet is very powerful and next year can be used for compute intensive tasks.

The other talk is from Mozilla on Firefox OS (Edit: it was cancelled), which is actually reason enough to visit this Hackers&Founders Meetup. Entrance is free, drinks are not. Alternatively you could go to the Hadoop User Group Meetup at Science Park, Amsterdam.

Read more …

Let’s look at what programming is, if we only look at the brain, the code and the computer.

The idea of a program lives in the brains of a programmer. The way to get the program to the computer is using a specialism called coding. When the program coded on the computer and the program embedded as an idea in the brain are alike, the programmer is happy. When over time the difference between the brain-version and the computer-version grows, then we go for a maintenance phase – but still this is mostly from brain to computer.

When the coding-language or important coding-paradigms change, something completely different happens. In such case the program in the brain is updated or altered. Humans are not good at that, or at least not many textbooks discuss how to change from one model to another.

In this article I want to discuss one such new coding-paradigm: dependencies in parallel software.
Read more …

This is a message to GPU-programmers only.

It is a simple question, and has many answers: what are GPU-brains? How is it possible your brain can code GPUs and only few friends and colleagues understand what you are doing? Is it thinking in parallel, focusing on one kernel and having the architecture in the back of the head. Is it simple loop-unrolling? Is it a web of thoughts? Is it just cool, as not many people can do it?

Read more …

Recently AMD announced their new FirePro GPUs to be used in servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server-racks are actively cooled already. AMD partners for servers will have products ready Q1 2013 or even before. SuperMicro, Dell and HP will probably be one of the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs giving 6.46 to up to 12.92 TFLOPS or more theoretical extra performance on top of the available CPU. At StreamComputing we are happy with that, as AMD is a strong OpenCL-supporter and FirePro GPUs give much more performance than TESLAs. It also outperforms the unreleased Intel Xeon Phi in single precision and is close in double precision.

Edit: About the multi-GPU configuration

A multi-GPU card has various advantages as it uses less power and space, but does not compare to a single GPU. As the communication goes via the PCI-bus still, the compute-capabilities between two GPU cards and a multi-GPU card is not that different. Compute-problems are most times memory-bound and that is an important factor that GPUs outperform CPUs, as they have a very high memory bandwidth. Therefore I put a lot of weight on memory and cache available per GPU and core.

Read more …

If you are looking for the samples in one zip-file, scroll down.

This sentence “NVIDIA’s Industry-Leading Support For OpenCL” was proudly used on NVIDIA’s OpenCL page last year. It seems that NVIDIA saw a great future for OpenCL on their GPUs. But when CUDA began borrowing the idea of using LLVM for compiling kernels, NVIDIA’s support for OpenCL slowly started to fade instead. Since with LLVM CUDA-kernels can be loaded in OpenCL and vice versa, this could have brought the two techniques more together.

What is the cause for this decreased support for OpenCL? Did they suddenly got aware LLVM would decrease any advantage of CUDA over OpenCL and therefore decreased support for OpenCL? Or did they decide so long ago, as their last OpenCL-conformant product on Windows is from July 2010? We cannot be sure, but we do know NVIDIA does not have an official statement on the matter.

The latest action demonstrating NVIDIA’s reduced support of OpenCL is the absence of the samples in their GPGPU-SDK. NVIDIA removed them without notice or clear statement on their position on OpenCL. Therefore we decided to start a petition to get these OpenCL samples back. The only official statement on the removal of the samples was on LinkedIn:

All of our OpenCL code samples are available at http://developer.nvidia.com/opencl, and the latest versions all work on the new Kepler GPUs.
They are released as a separate download because developers using OpenCL don’t need the rest of the CUDA Toolkit, which is getting to be quite large.
Sorry if this caused any alarm, we’re just trying to make life a little easier for OpenCL developers.

Best regards,

Will.

William Ramey
Sr. Product Manager, GPU Computing
NVIDIA Corporation

Read more …

System for communicating power-efficiency of new equipment. “A” being best, “F” being worst. 2011-A is incomparable with 2012-A.

For yearly power-usage there is a rule-of-thumb which states that a device that is continuously on, costs the amount of Watt times 1.5 in Euro per year. So the computer in front of me, that takes around 107 Watt, costs me €160 a year if I would leave it on. A moderate cluster with several GPUs of a few hundred Watts each, would cost a few thousand Euros a year. I would say: very doable for most companies.

So why is the performance per Watt? There is more to a Watt than just the costs. The energy to cool a cluster is quite high, as most of the energy escapes via heat. And then there is the increase in demand for portable power. In cases you are thinking of sweeping you credit card for a top 10 supercomputer, then these energy-costs are extremely high.

In this article I try to get an overview of who is entering the 20+ GFLOPS/Watt area. All processors that do less than 20 GFLOPS/Watt, need to have other advantages to survive. And you’ll see that all the green processors are programmed with OpenCL, the technology StreamComputing is all about.

IMPORTANT: The total power used is sometimes including and sometimes excluding memory-transfers. So the comparison below is not fair. The graphics cards are including memory-transfers, while the CPUs and SoCs are not.

Read more …

Get in contact now!

We offer training in GPU-programming (OpenCL, CUDA, etc),
and consultancy-services for performance engineering.

Mail to info@streamcomputing.eu or fill in below form.

The web-form currently does not work.
Please send an e-mail while we resolve the issue.