The CPU is dead. Long live the CPU!

Scene from Gladiator in which the end of somebody’s life is decided.

Look at the computers and laptops sold at your local computer shop. Only a few systems still have a separate GPU, either as a PCI device or integrated on the motherboard. The graphics are handled by the CPU now. The Central Processing Unit as we knew it is dying.

To be clear, I will refer to the old CPU as “GPU-less CPU”, and to the new CPU (with GPU included) as plain “CPU” or “hybrid processor”. There are many names for the new CPU, each with its own history, which I will discuss in this article.

The focus is on X86. The follow-up article is on whether the king X86 will be replaced by king ARM.

Note that all of this is based on my own observations; please comment if you have additional information.

Continue reading “The CPU is dead. Long live the CPU!”

Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support

Screenshot from Intel’s “God Rays” demo

This article is still work-in-progress

Intel has just released its OpenCL CPU-drivers, version 2013 beta. Next to OpenCL 1.2 on the CPU, there is support for OpenCL 1.1 (not 1.2) on the Intel HD Graphics 4000/2500 of the 3rd generation Core processors (Windows only). The release notes mention support for Windows 7 and 8, but the download site only mentions Windows 8. Support under Linux is limited to 64 bits.

The release notes mention:

  • General performance improvements for many OpenCL* kernels running on CPU.
  • Preview Tool: Kernel Builder (Windows)
  • Preview Feature: support of kernel source code hotspots analysis with the Intel VTune™ Amplifier XE 2011 update 3 or higher.
  • The GNU Project Debugger (GDB) debugging support on Linux operating systems.
  • New OpenCL 1.2 extensions supported by the CPU device:
    • cl_khr_int64_base_atomics and cl_khr_int64_extended_atomics
    • cl_khr_fp16
    • cl_khr_gl_sharing
    • cl_khr_gl_event
    • cl_khr_d3d10_sharing
    • cl_khr_dx9_media_sharing
    • cl_khr_d3d11_sharing.
  • OpenCL 1.1 extensions that were changed in OpenCL 1.2:
    • Device Fission supports both OpenCL 1.1 EXT API’s and also OpenCL* 1.2 fission core features
    • Media Sharing support intel 1.1 media sharing extension and also the 1.2 KHR media sharing extension
    • Printf extension is aligned with OpenCL 1.2 core feature.

Check the release notes for full information.

The drivers can be found on http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk-2013/. Installation is simple. For Windows there is an installer. If you have Linux, make sure you remove any previous version of Intel’s OpenCL drivers. If you have a Debian-based Linux, use the command ‘alien’ to convert the rpm to deb, and make sure ‘libnuma1’ is installed. There are requirements for libc 2.11 or 2.12; more information on that will follow later, as Ubuntu 12.04 has libc6 2.15.

Continue reading “Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support”

Taking on OpenCL

Quote by Dr. Kelso (from the series “Scrubs”)

OpenCL is getting more and more important, and for more and more developers it is a skill worth having. At StreamHPC we saw this coming in 2010 and have been training people in OpenCL since. A few weeks ago I got a question that could be interesting for more people: how to take on OpenCL. In other words: which steps get you to learn OpenCL the quickest? Since it is almost two years ago that I last wrote on learning OpenCL, it is a good time to share more recent insights on this matter.

Taking on OpenCL takes four main steps in this order:
  1. Understanding the hardware and architectures.
  2. Thinking both in parallel and in vectors.
  3. Learning the OpenCL language itself.
  4. Profiling and debugging.

As you can see, that is quite different from learning, for instance, Java with a Pascal background. Learning VHDL for programming FPGAs comes closer, though you don’t need to tinker with timings when doing OpenCL. Let’s go through the steps.

Continue reading “Taking on OpenCL”

How expensive is an operation on a CPU?

“Programmers know the value of everything and the costs of nothing.” I saw this quote a while back and loved it immediately. The quote by Alan Perlis is originally about LISP-programmers, but only highly trained HPC-programmers seem to have obtained this basic knowledge well. In an interview with Andrew Richards of Codeplay I heard it from another perspective: programming languages were not developed in a time when cache was 100 times faster than memory. He claimed that it should be exposed to the programmer what is expensive and what isn’t. I agreed again, hence this post.

I think it is very clear that programming languages (and/or IDEs) need to be redesigned to keep up with the hardware changes of the past 5 years. I talked about that in the articles “Separation of compute, control and transfer” and “Lots of loops”. But it does not seem to be enough.

So what are the costs of each operation (on CPUs)?

This article is just to help you on your way, and most of all: to make you aware. Note it is incomplete and probably not valid for all kinds of CPUs.

Continue reading “How expensive is an operation on a CPU?”

GPGPU-day materials – teaser

Just a quick teaser. More materials (photos, sheets, videos) are coming soon.

Don’t forget to subscribe to the mailing-list of Platform Parallel Netherlands to hear about more events around parallel programming in the Netherlands.

Click on the icon at bottom-right to watch the video full-screen.

If you took photos during the day, please send them.

Music by Professor Kliq.

Below is the short version with photos only

StreamComputing is 2 years old! A personal story.

More than two years ago, on 13 January 2010, I wrote my first blog-post. Four months later StreamComputing (editor’s note: rebranded to StreamHPC in 2017) was both official and unknown. I want to share with you my personal story of how I came to start this company.

The push-factor

I wanted to create a company which was about innovative projects – something I had hardly encountered until then. In the years before, I programmed parts of A-to-B-flows, as I call them. That is software that is basically quite simple, but is tediously discussed as being very, very complex.

“Complex” software

The complexity is not the software, as you can see. It is undocumented APIs, forgotten knowledge, knowledge in the heads of unknown people, bossy and demanding people who ask in a friendly way for last-minute architecture changes, deadlines around promotion-rounds, new deadlines due to board-decisions, people being afraid of getting replaced if the software is finished, jealousy if another team makes version 2 of the software, etc. The rule of office-software is therefore understandable:

Software is either unfinished,
or turned into a platform for unintended functionality.

The fun in office-software is there for the analyst, architect or manager – the developer just puts in his earphones and makes all the requested changes (hooray for services like Spotify). But as I did not want to become a manager and wished to keep improving my development skills, I had to conclude I was on the wrong track.

Continue reading “StreamComputing is 2 years old! A personal story.”

AMD gDEBugger 6.2 for Linux

The printf-function in kernels isn’t the solution to everything, hence there are profilers and debuggers specially tailored for GPU-programming. On Windows there is a lot of choice, but mostly only if you have a paid version of Visual Studio. On Linux you have GDB, but that program is not really user-friendly for GUI-lovers.

For AMD, gDEBugger is now available for Linux again. Again, because version 5.8 by Gremedy worked on Linux, but after AMD bought the company, version 6 became Windows-only. A few weeks ago, 10 months after 6.0, the Linux binaries returned with version 6.2. It supports OpenCL 1.2, OpenGL 3.2 and quite a few extensions. As only AMD is supported, more on debugging OpenCL-applications on NVidia and Intel will follow later.

Installation is quite straightforward. For creating a menu-item, you’ll find a useful image in /opt/gDEBugger6.2.xxx/tutorial/images/.

Continue reading “AMD gDEBugger 6.2 for Linux”

NVIDIA: mobile phones, tablets and HPC (cloud)

If you want to see what is coming up in the market of consumer-technology (PC, mobile and tablet), then NVIDIA can tell you the most. The company is very flexible, and shows time after time that it really knows in which markets it currently operates and which it can enter. I sometimes strongly disagree with their marketing, but I watch them closely as they are in the most important markets that define the near future: PCs, Mobile/Tablet and HPC.

You might think I completely miss interconnects (buses between processors, devices and memory) and memory-technologies, as clouds have a large need for high-speed data-transport, but the last 20 years have shown that this is a quite stably developing market based on IP-selling to the hardware-vendors. With the acquisition of Cray’s interconnect technology, we have seen this is serious business for Intel, so things might change indeed. For this article I want to focus on NVIDIA’s choices.

Neil Trevett on OpenCL

The Khronos Group gave some talks on their technologies in Shanghai, China on the 17th of March 2012. Neil Trevett made some interesting remarks on the position of NVidia on OpenCL that I would like to share with you. Neil Trevett is both an important member of Khronos and an employee of NVidia. To be more precise, he is the Vice President Mobile Content of NVidia and the president of Khronos. I think we can take his comments seriously, but we must be very careful as these are mixed with his personal opinions.

Regular readers of the blog have seen I am not enthusiastic at all about NVidia’s marketing, but am a big fan of their hardware. I am also very positive that they are bold enough to position themselves very well in the fast-changing markets of the upcoming years. Having said that, let’s go to the quotes.

All quotes are from this video. It is best to watch from 41:50 to 45:35.

At 44:05 he states: “In the mobile space I think CUDA is unlikely to be widely adopted“, and explains: “A party API in the mobile industry doesn’t really meet market needs“. He then continues with his vision on OpenCL: “I think OpenCL in the mobile is going to be fundamental to bring parallel computation to mobile devices” and then “and into the web through WebCL“.

Also interesting at 44:55: “In the end NVidia doesn’t really mind which API is used, CUDA or OpenCL. As long as you get to use great GPUs“. He ends with a smile, as “great GPUs” refers to NVidia’s of course. 🙂

At 45:10 he states NVidia’s plans for HPC, before getting back to mobile: “NVidia is going to support both [CUDA and OpenCL] in HPC. In Mobile it’s going to be all OpenCL“.

At 45:23 he repeats his statements: “In the mobile space I expect OpenCL to be the primary tool“.

Continue reading “Neil Trevett on OpenCL”

USB-stick sized ARM-computers

Now that smartphones get more powerful and internet makes it possible to have all functionality and documents with you anywhere, the computer needs to be reinvented. You see all big IT-companies searching for how that can be, from Windows Metro to complete docking stations to replace the desktop by your phone. A turbulent market.

One of the new product types is the USB-stick sized computer. Stick one into a TV or monitor, zap in your code and you have your personal working environment. You never need to carry laptops to your hotel-room or conference, as long as a screen is available – any screen.

There are several USB-computers entering the market, but I want to introduce you to two. Both of these see a future in a strong processor in a portable device, and neither has a real product with such a strong processor yet. But you can expect that in 2013 you can have a device that can do very fast parallel processing to have a smooth Photoshop experience… on your key-ring.

Continue reading “USB-stick sized ARM-computers”

PDFs of Monday 16 April

By exception, another PDF-Monday.

OpenCL vs. OpenMP: A Programmability Debate. One moment OpenCL produces the faster code, the next moment OpenMP does. From the conclusion: “OpenMP is more productive, while OpenCL is portable for a larger class of devices. Performance-wise, we have found a large variety of ratios between the two solutions, depending on the application, dataset sizes, compilers, and architectures.”

Improving Performance of OpenCL on CPUs. Focusing on how to optimise OpenCL. From the abstract: “First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization.”

Variants of Mersenne Twister Suitable for Graphic Processors. Source-code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/

Accelerating the FDTD method using SSE and GPUs. “The Finite-Difference Time-Domain (FDTD) method is a computational technique for modelling the behaviour of electromagnetic waves in 3D space”. This is a project-plan, but it describes the theories pretty well.

Continue reading “PDFs of Monday 16 April”

5 types of loops you should avoid

In “Separation of compute, control and transfer” I talked about node-wise programming as a method we should embrace instead of trying to unroll the existing loops. In this article I get into loops and discuss a few types and how they can be run in a parallel form. Dependency is the big variable in each type: the lower the dependency on previous iterations, the better it can be parallelised. Another one is whether the iteration-dimensions are known before the loop is started.

The more you think about it, the more you find that a loop is not a loop.
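
A minimal sketch of the dependency point, written in Python for brevity (both function names are made up for this illustration). The first loop has no dependency between iterations, so each iteration could in principle be handed to its own work-item. The second carries a value from one iteration to the next, so it cannot be split up that naively.

```python
def scale_all(values, factor):
    """Independent iterations: element i only needs values[i]."""
    out = [0.0] * len(values)
    for i in range(len(values)):   # any order, or all at once, gives the same result
        out[i] = values[i] * factor
    return out


def running_sum(values):
    """Dependent iterations: element i needs the result of element i - 1."""
    out = [0.0] * len(values)
    total = 0.0
    for i in range(len(values)):   # order matters here
        total += values[i]
        out[i] = total
    return out
```

In both loops the iteration count is known before the loop starts, so only the dependency separates them.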

Continue reading “5 types of loops you should avoid”

Supporting OpenCL on your own hardware

Say you have a device which is extremely good at numerical trigonometry (including integrals, transformations, etc. to support mainly Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of trigonometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library, if it would work with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Given this information, you define new customer groups for device-only and library-only – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

Continue reading “Supporting OpenCL on your own hardware”

Separation of Compute and Transfer from the rest of the code.

What if the roots, trunk and crown of a tree were mixed up? Would it still have the advantage over other plants?

In the beginning of 2012 I spoke with Patrick Viry, former CEO of Ateji – now out of business. We shared ideas on GPGPU, OpenCL and programming in general. While talking about the strengths of his product, he made a remark which I found important and interesting: separation of transfer. This triggered me to think further – those were the times when you could not read up on modern computing, but had to define it yourself.

Separation of focus-areas is known to increase effectiveness, but is said to be for experts only. I disagree completely – the big languages just don’t have good support for defining the separation of concerns.

For example, the concept of loops is well-known to all programmers, but OpenCL and CUDA have broken with that. Instead of using huge loops, those languages describe what has to be done at one location in the data and which data is to be processed. From what I see, this new type of loop is getting abandoned in higher level languages, while it is a good design pattern.
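
To make the contrast concrete, here is a small sketch in Python (not OpenCL itself; `saxpy_loop`, `saxpy_kernel` and `run_over_range` are names invented for this illustration). The first function is the classic loop. The second only describes the work at one location `i`, and a separate stand-in for the runtime decides over which data it runs.

```python
def saxpy_loop(a, x, y):
    """Classic style: one big loop owns both the iteration and the work."""
    out = []
    for i in range(len(x)):
        out.append(a * x[i] + y[i])
    return out


def saxpy_kernel(i, a, x, y, out):
    """Kernel style: what happens at one location i, nothing more."""
    out[i] = a * x[i] + y[i]


def run_over_range(kernel, global_size, *args):
    """Stand-in for the runtime: applies the kernel to every index.
    A real OpenCL or CUDA runtime would run these indices in parallel."""
    for i in range(global_size):
        kernel(i, *args)
```

Calling `run_over_range(saxpy_kernel, len(x), a, x, y, out)` fills `out` with the same values that `saxpy_loop(a, x, y)` returns. The difference is that the kernel no longer knows, or cares, in which order the locations are visited.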

I would like to discuss separation of compute and transfer from the rest of the code, to show that this will improve the quality of code.

Continue reading “Separation of Compute and Transfer from the rest of the code.”

StreamHPC flirts with ARM

With the launch of the Twitter channel @OpenCLonARM we now officially show a strong interest in ARM for compute. And we are not the only ones, as the channel already has 80 followers (60 in 1.5 days, and 12 retweets of the welcome-message).

ARM has made tremendous progress in both technology and market-share. With ARM-64 and companies like NVidia (and maybe AMD) in the field, X86 seems to be getting a real competitor. This could happen because for a few years now computers have been fast enough, and they are not being replaced by a faster one, but by a smaller one (tablet, phone) or supplemented by an extra one. By the rules of the market, current technologies are replaced by the ones that serve those other needs. ARM is fast (enough), flexible in design, very cheap, low-power and passively cooled. The biggest obstacle seems to be only getting a standard for a docking-station to connect your mobile, tablet or watch to keyboard, mouse and large screen.

OpenCL is perfect for ARM, as it gives computation-power to the intensive computations not already covered by hardware-support. In the world of X86 this interests high performance and big data companies, whereas on ARM it interests a broader group. Without OpenCL you can already watch HD video; with OpenCL you can also encode video to MP4. This year you will certainly hear more about new possibilities of OpenCL on ARM.

What do you think? Why does Intel not sell IP to ARM-companies, as many technologies could be reused? Could Intel be the next ARM as an IP-seller, or will they stay the defender of X86 for many years to come?

streamhpc.com is not affiliated with ARM.

AccelerEyes ArrayFire

There is a lot going on along the path to GPGPU 2.0 – the libraries on top of OpenCL and/or CUDA. Among many solutions we see for example Microsoft with C++ AMP on top of DirectCompute, NVidia (and more) with OpenACC, and now AccelerEyes (best known for their Matlab-extension Jacket and libJacket) with ArrayFire.

I want to show you how easy programming GPUs can be when using such libraries – know that for using all features such as complex numbers, multi-GPU and linear algebra functions, you need to buy the full version. Prices start at $2500 for a workstation/server with 2 GPUs.

It comes in two flavours: for OpenCL (C++) and for CUDA (C, C++, Fortran). The code for both is the same, so you can easily switch – though you still see references to cuda.h you can compile most examples from the CUDA-version using the OpenCL-version with little editing. Let’s look a little into what it can do.

Continue reading “AccelerEyes ArrayFire”

Theoretical transfer speeds visualised

There are two overviews I use during my training that I would like to share with you. Normally I write them on a whiteboard, but having them in digital form has its advantages.

Transfer speeds per bus

The image below gives an idea of theoretical transfer speeds, so you know how a fast network (1GB of data in 10 seconds) compares to GPU-memory (1GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a GPU need to work together to get that maximum transfer rate. Also, I have not used very precise benchmark-methods to come to these views.
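
The two examples in the previous paragraph can be turned into bandwidths with a few lines of Python. This is only a sketch of the arithmetic: the helper names are invented here, and a decimal gigabyte (10^9 bytes) is assumed to keep the numbers round.

```python
GB = 10**9  # bytes; decimal gigabyte, an assumption made for round numbers


def bandwidth_bytes_per_s(num_bytes, seconds):
    """Bandwidth needed to move num_bytes in the given time."""
    return num_bytes / seconds


def transfer_time_s(num_bytes, bandwidth):
    """Time to move num_bytes at the given bandwidth (bytes per second)."""
    return num_bytes / bandwidth


network = bandwidth_bytes_per_s(1 * GB, 10.0)     # 1 GB in 10 s: 0.1 GB/s
gpu_memory = bandwidth_bytes_per_s(1 * GB, 0.01)  # 1 GB in 0.01 s: 100 GB/s
ratio = gpu_memory / network                      # about a factor 1000
```

With these theoretical figures, a 4 GB data set would take `transfer_time_s(4 * GB, network)`, about 40 seconds, over the network, against roughly 0.04 seconds within GPU-memory.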

We zoom into the slower bus-speeds. So all the good stuff is at the left and all buses to avoid are on the right. What should be clear is that a read from or write to an SSD will make the software very slow if you use write-through instead of write-back.

What is important to see is that localisation of data makes a big difference. Take a look at the image and then try to follow along with me. When using GPUs, the following can all increase the speed on the same hardware: not using hard-disks in the computation-queue, avoiding transfers to and from the GPU, and increasing the computations per byte of data. When an algorithm needs to do a lot of data-operations such as transposing a matrix, then it’s better to have a GPU that has fast memory-access. When the number of operations is important, then clock-speed and cache-speed are most important.

Continue reading “Theoretical transfer speeds visualised”

Do your (X86) CPU and GPU support OpenCL?

Does your computer have OpenCL-capable hardware? Read on and find out if your computer is compatible…

If you want to know what other non-PC hardware (phones, tablets, FPGAs, DSPs, etc) is running OpenCL, see the OpenCL SDK page.

For people who only want to run OpenCL-software and have recent hardware, just read this paragraph. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has supported OpenCL 1.1 since driver version 280.13, so if you need OpenCL 1.1, then make sure you have this version or later. If you want to use Intel-processors and you don’t have an AMD GPU installed, you need to download the runtime of Intel OpenCL.

If you want to know if your X86 device is supported, you’ll find answers in this article.

Often it is not clear how OpenCL works on CPUs. If you have an 8-core processor with double threading, then it is mostly understood that 16 pipelines of instructions are possible. OpenCL takes care of this threading, but also uses the parallelism provided by the SSE and AVX extensions. I talked more about this here and here. This means that an 8-core processor with AVX can compute 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel. You could see it as parallelism of parallelism. SSE is designed with multimedia-operations in mind, but has enough to be used with OpenCL. The minimum requirement for OpenCL-on-a-CPU is SSE 4.2, though.
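
The “parallelism of parallelism” can be written out as a small calculation. The sketch below is plain arithmetic in Python with an invented helper name. It counts peak lanes only and ignores double threading, since two threads on one core share that core’s vector units.

```python
def parallel_lanes(cores, vector_bytes, element_bytes):
    """Elements processed side by side: cores times SIMD lanes per core."""
    lanes_per_core = vector_bytes // element_bytes
    return cores * lanes_per_core


AVX_BYTES = 32  # 256-bit AVX registers
SSE_BYTES = 16  # 128-bit SSE registers

floats_avx = parallel_lanes(8, AVX_BYTES, 4)   # 8 cores * 8 floats  = 64
doubles_avx = parallel_lanes(8, AVX_BYTES, 8)  # 8 cores * 4 doubles = 32
floats_sse = parallel_lanes(8, SSE_BYTES, 4)   # 8 cores * 4 floats  = 32
```

So the same 8-core processor handles half as many floats per step when only SSE is available.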

A question I see often is what to do if you have multiple devices. There is no single OpenCL-package for all the available devices, so you then need to install drivers for each device. CPU-drivers are often included in the GPU-drivers.

Read on to find out exactly which processors are supported.

Continue reading “Do your (X86) CPU and GPU support OpenCL?”

Basic concepts: Function Qualifiers

Optimisation of one’s thoughts is a complex problem: a lot of interacting processes can be defined, if you think of it.

In OpenCL code you have the run-time and the compile-time of the C-code. It is very important to be clear about this when you talk about compile-time of the kernel, as this can be confusing: the kernel is compiled at run-time of the software, after the compute-devices have been queried. The OpenCL-compiler can produce better optimised code when you give it as much information as possible. One of the methods is using Function Qualifiers. A function qualifier is notated as a kernel-attribute:

__kernel __attribute__((qualifier(qualification))) void foo ( …. ) { …. }

There are three qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read about them here in the official documentation, with more examples.
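
Because the kernel is compiled while the host program runs, the host can fill in a qualifier after it knows the device. The Python sketch below only builds the kernel source as a string and calls no OpenCL binding. `reqd_work_group_size` and `vec_type_hint` are qualifiers from the OpenCL 1.x specification. The helper functions, the `scale` kernel and the size (64, 1, 1) are assumptions for this example.

```python
def kernel_attribute(qualifier, *values):
    """Formats one qualifier, e.g. reqd_work_group_size(64, 1, 1)."""
    joined = ", ".join(str(v) for v in values)
    return "__attribute__((%s(%s)))" % (qualifier, joined)


def build_kernel_source(work_group_size):
    """Returns OpenCL C source with the work-group size fixed at build time."""
    attribute = kernel_attribute("reqd_work_group_size", *work_group_size)
    return (
        "__kernel " + attribute + "\n"
        "void scale(__global float *data, const float factor) {\n"
        "    data[get_global_id(0)] *= factor;\n"
        "}\n"
    )


# The size would normally follow from querying the device; (64, 1, 1) is assumed.
source = build_kernel_source((64, 1, 1))
```

The resulting string is what would be handed to the OpenCL compiler at run-time. With `reqd_work_group_size` set, the kernel must then also be launched with exactly that local size.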

Continue reading “Basic concepts: Function Qualifiers”

Black-Scholes mixing on SandyBridge, Radeon and Geforce

Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel has described a kernel in their OpenCL optimisation-document (page 28 and further) with 3 random factors as input: S, K and T, and two configuration-constants R and V. NVidia’s kernel is easy to compare to Intel’s, while AMD chose to write down the algorithm quite differently.
So we have three different but comparable kernels in total. What will happen if we run these, all optimised for specific types of hardware, on the following devices?

  • Intel(R) Core(TM) i7-2600 CPU @3.4GHz, Mem @1333MHz
  • GeForce GTX 560 @810MHz, Mem @1000MHz
  • Radeon HD 6870 @930MHz, Mem @1030MHz
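
For reference, the textbook closed-form Black-Scholes price can be sketched in a few lines of Python. This is not any of the three vendor kernels. It assumes the usual meaning of the symbols (S stock price, K strike price, T years to expiry, R risk-free rate, V volatility) and uses the error function for the normal distribution, where the vendor kernels may use their own approximations.

```python
import math


def normal_cdf(x):
    """Cumulative standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(S, K, T, R, V):
    """European call and put price.
    S: stock price, K: strike price, T: years to expiry,
    R: risk-free rate, V: volatility."""
    sqrt_t = math.sqrt(T)
    d1 = (math.log(S / K) + (R + 0.5 * V * V) * T) / (V * sqrt_t)
    d2 = d1 - V * sqrt_t
    discounted_strike = K * math.exp(-R * T)
    call = S * normal_cdf(d1) - discounted_strike * normal_cdf(d2)
    put = discounted_strike * normal_cdf(-d2) - S * normal_cdf(-d1)
    return call, put
```

Each option is priced independently of all others, which is why the algorithm maps so easily onto one work-item per option.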

Three different architectures and three different drivers. To complete the comparison I also try to see if there is a difference when using Intel’s and AMD’s driver for CPUs.

Continue reading “Black-Scholes mixing on SandyBridge, Radeon and Geforce”

OpenCL potentials: Watermarked media for content-protection

HTML5 is the future, now that Flash and Silverlight are leaving the market and clearing the way for HTML5-video. There is one big problem: it is hard to protect the content, and before you know it the movie is on the free market. DRM is only a temporary solution and many times ends in frustration for users who just want to see the movie wherever they want.

If you look at e-books, you see a much better way to make sure PDFs don’t get all over the web: personalizing. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see – and thus easier to miss by people who want to clean the file. It therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give the buyers freedom of use. The only disadvantage is that a personalised video’s ownership cannot be transferred.

Continue reading “OpenCL potentials: Watermarked media for content-protection”