Rant: No surprise there’s a shortage of good GPU-developers

Another Monday, yet another graphics API

We could read here that software is critical for HPC – a market where accelerators/GPUs are used a lot. So all we need to do is better support GPU-developers as a whole, right? Unfortunately, something else is happening.

Each big corporation wants to have their own developers, not to be shared with the competition.

Microsoft was quite early in this with Ballmer’s “developers, developers, developers” meme. Tip of the hat to them for acting on the shortage, a shake of the head for how they acted. For .NET it was a success: it lured developers away from Java and C/C++, increasing the market share of Windows Server, SQL Server and more.

GPU-vendors want that too – they find growing the cake together too slow, so the fight starts while the cake is still tiny.

Result: GPU-developers are forced to learn many, many languages…

…if they choose not to serve only one hardware-vendor. So the developers are the ones who end up as the victims.

Large companies created their own programming languages and all kinds of APIs that only work on one specific platform or within a tight scope – knowledge and written software cannot be used outside these artificial borders. It is currently so bad that competing “standards” use different terminology for exactly the same functionality. Sometimes it’s due to IP issues and such, but sometimes this is deliberate!

Imagine if, in CPU languages, the differences were mostly that standard words like “void” were renamed to “blank”, “empty” or “null”. This is what is happening a lot in GPU languages.

With all those languages, learning takes a lot longer – this in turn delays the training of the new wave of GPU-developers. On top of that, more time is put into fanboyism (“the red/blue/green team is better”) than into educating each other.

So, what’s the difference between all those languages?

Compared to the huge variety among CPU languages – unfortunately, not much.

Here at StreamComputing we know many languages to program GPUs: OpenCL, CUDA, shader languages, OpenMP, RenderScript, Metal, C++AMP and more. We can remember them, because most concepts from one language can be implemented in the others. We actually prefer to spend our time getting to know the specifics of the different hardware, rather than memorising the different languages.

There is a lot of overlap in the various “unique” languages

GPU-languages can be divided into two groups: host-kernel languages and pragma/magic.

The languages in the first group are so much alike that one can, for the largest part, be implemented in another – unique conveniences are either hardware-dependent or easy to implement. The discussion on what’s missing is mostly about what’s not yet built in, not about capabilities.

The second group mostly serves CPU-programmers and promises to take care of the complexity of GPU-programming – until now they all succeed at a rather easy-to-parallelise N-body simulation, but not much more. We have been promised compilers that understand half-baked code since the beginning of programming – no difference here.

So what to do?

Do you really want to have this small group of GPU-developers put their time into porting between hardware architectures?

Stop reinventing the wheel to win developers. Build on top of existing languages.

For instance, SPIR-V has been created to enable higher-level languages such as Domain Specific Languages. Many libraries are already out there, ready to be integrated, and the 13 types of GPU-algorithms have been known for years.

Knowing only C, I can program SPARC, ARM, X86 and other CPUs. The same should be possible with all those different GPUs.

4-day training on OpenCL-on-FPGAs, 24 October, Amsterdam

From 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are correctly writing OpenCL code for FPGAs, learning to work with Quartus and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week, including a tour in Amsterdam on Wednesday.

See the special event-page for more information.

Porting code that uses random numbers


When we port software to the GPU or FPGA, testability is very important. Part of making the code testable is getting its functionality fully under control. As you may have guessed, run-time generated random numbers deserve special attention.

In several past projects, random numbers were generated anew on every run. Statistically the simulations were more correct, but it was impossible to be 100% sure the ported code was functionally correct. This is because two sources of variation are introduced: one due to the numbers being different and one due to differences in code and hardware.

Even if the combined variation is within the given limits, the two code-bases can differ in functionality without anyone noticing. On top of that, it is hard to keep further optimisations under control, as those can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should come first during the port.

Below are some tips we gave to these customers, and I hope they’re useful for you. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 2)

In the first part of our two-part blog series, we have discussed how parallel computing applications can best use pseudo-random number generators (PRNGs) so as to benefit from parallel computing speedups, without negatively impacting the statistical properties of the random numbers generated. We have argued that index-based PRNGs (e.g., from the Random123 library), which do not maintain any state but instead take an index and a key as input and return the random number corresponding to the index in its random output sequence, provide numerous benefits in parallel programming. This week, we discuss both an application of index-based PRNGs and another important aspect of PRNGs in parallel environments: that of reproducing the same output among different parallel implementations or among a parallel and a serial implementation. We focus on the latter and consider reproducibility for the purpose of verification, which – as we have stated last week – is an important matter for our customers.

Reproducibility

Reproducibility may be simple if random numbers are only used in the initialization phase of the application. In this case, it may be sensible to record the set of random numbers generated in the serial code, write it to a file and use this file as input to a table-based approach for random-number generation in the parallel code (as discussed in Part 1 of our blog). Indeed, this is a convenient approach for both us and our customers as we do not have to deal with PRNG implementation internals and our customer can independently verify the correctness of the parallel implementation. We have used this for several of our projects.

More complex scenarios such as stochastic simulations and Monte Carlo applications require a continuous, parallel generation of random numbers. In this case, reproducibility at first seems challenging considering the concurrent and thus unpredictable order in which parallel code may invoke PRNGs. Indeed, it may be impossible to achieve when using traditional, seed-based PRNGs. This is because we may need to access entries in the PRNG’s output sequence in an arbitrary order rather than sequentially. Essentially, we need to define a one-to-one mapping between random numbers generated in the serial code and those used in the parallel version and then be able to ask the PRNG for the first entry, the second entry, and so on, in both implementations. The random-access capability is exactly what index-based PRNGs provide. The main challenge lies in the definition of the mapping.

We may consider two options for defining the one-to-one mapping between random numbers generated in a serial code and those created in a parallel implementation. The summary of our findings is this:

  • Serial-to-parallel mapping: This amounts to an emulation of the serial random number generation in the parallel code. It generally requires minimal or no changes in the serial, original code, but may be increasingly difficult or impossible to define as the complexity of the code grows.
  • Parallel-to-serial mapping: This emulates the parallel random number generation in the serial code. It is usually easier to define than the serial-to-parallel mapping but requires deeper changes in the serial code.

We explain below what we mean by that.

Consider a simple scenario where we have some serial code that generates a new set of random numbers in each iteration of a loop using a traditional seed-based PRNG. A parallel implementation may operate by executing each iteration via a single work item. Using traditional PRNGs, we may equip each work item with its own PRNG seed.

  • Using a serial-to-parallel mapping, we may then record the PRNG state at each iteration of the loop in the serial code and initialize the PRNG state in each work item of the parallel code similarly to a table-based approach – this does not require any changes in the serial code (other than temporarily for recording the random numbers). Alternatively, we may replace the PRNG in the serial code with an index-based one and use a single, static counter for the PRNG, which we increment with each invocation of the PRNG – this is a minimal change in the serial code. We reflect the index-based PRNG in the parallel code and initialize each work item’s counter with w·n, where w is the work item index and n denotes the size of the set of random numbers created per loop. In both cases, the serial and the parallel code should produce identical outputs.
  • For a parallel-to-serial mapping, it is easier to assume that the parallel implementation uses index-based PRNGs. Let’s assume that each work item uses indices of the form w·n+j for the j-th item in the set of random numbers generated. Then the serial code is adapted to use the same PRNG and computes indices as WI(i)·n+j, where WI(i) returns the work item index corresponding to the i-th iteration of the loop. Again, the serial and parallel implementations should then produce identical outputs.

A proper mapping becomes increasingly difficult to define as the code complexity grows. However, it is usually easier to map the parallel random number generation onto the serial code. Indeed, consider the case where some random numbers are generated in a data-dependent manner, i.e., only if some condition on the input data is fulfilled. Then it is impossible to give a predefined mapping from the serial to the parallel code. We therefore prefer the serial-to-parallel mapping whenever it is easy to define (as we avoid, or at least minimise, the risk of introducing bugs in the original code by changing the PRNG generation), but resort to the parallel-to-serial mapping for more difficult cases.
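
To make the counter scheme concrete, here is a minimal OpenCL sketch of the parallel side. The mixer `index_rng` is only a toy stand-in for a proper index-based PRNG such as the Random123 generators (the function and kernel names are ours, not part of any library), but the indexing with w·n+j is exactly the mapping discussed above; a serial implementation computing the same indices reproduces the same numbers.

```c
/* Toy stateless mixer standing in for a real index-based PRNG.
 * Only here to make the indexing scheme concrete - do not use it
 * for production simulations. */
float index_rng(ulong index, ulong key)
{
    ulong x = index + key * 0x9E3779B97F4A7C15UL;
    x ^= x >> 33;  x *= 0xFF51AFD7ED558CCDUL;
    x ^= x >> 33;  x *= 0xC4CEB9FE1A85EC53UL;
    x ^= x >> 33;
    return (float)(x >> 40) * (1.0f / 16777216.0f);   /* uniform in [0,1) */
}

/* Work item w draws its j-th random number at index w*n + j, exactly the
 * one-to-one mapping described above. */
__kernel void simulate(__global float* out, const uint n, const ulong key)
{
    const ulong w = get_global_id(0);
    float acc = 0.0f;
    for (uint j = 0; j < n; ++j)
        acc += index_rng(w * (ulong)n + j, key);
    out[w] = acc;
}
```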

Strengthen our team as a remote worker (freelancer)

In the past year we’ve been working on more internal projects and therefore we’re seeking strong GPU-coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family and working with advanced technologies. You will be on the newly formed international team.

Do understand that we have extra requirements for freelancers:

  • You are able to work independently.
  • You have your own computer with an OpenCL-capable GPU.
  • You have a good internet connection (for remote access).

We offer a job in a well-known OpenCL-company with various interesting projects. You can improve your OpenCL skills and work with various hardware (modern GPUs, embedded processors, FPGAs and more).

Our hiring-procedure is as follows:

  • You send a CV and tell us why you are the perfect candidate.
  • After that you are invited for a longer online test. You show your skills on C/C++ and algorithms. You will receive a PDF with useful feedback. (3 hours)
  • We send you a GPU assignment. You need to pick out the right optimisations, code it and explain your decisions in detail. (Hopefully under 30 minutes)
  • If all goes well, you’ll have a videochat on personal and practical matters. You can also ask us anything, to find out if we fit you. (Around 1 hour)
  • If you and the company are a fit, then you’ll go to the technical round. (About 3 hours)
  • Made it to here? Expect a job-offer.

We’re looking forward to your application.

Apply for a job as OpenCL expert (freelancer) now!

Speed: Throughput, Latency and Processor Utilisation

Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, we might not be able to answer that question: it depends on the data and on the metric.

In this article I’ll give an overview of different ways to describe speed and what metrics are used. I focus on two types of utilisation:

  • Transfers. Data-movements through cables, interconnects, etc.
  • Processors. Data-processing, with data in and data out.

Both are important when selecting the right hardware. When we help our customers select the best hardware for their software, an important part of the advice is based on these metrics.

Transfer utilisation: Throughput

How many bytes get processed per second, minute or hour? Often GB/s is used as the metric, but even MB/day is possible. Alternatively, items per second is used when relative speed is discussed. A related word is bandwidth, which describes the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch-process, or that results are given continuously. The throughput is the same, but the so called latency is very different.

Transfer utilisation: Latency

What is the time between offering the data and receiving the results? Or: what is the reaction time? It is measured in time, often in nanoseconds (ns, a billionth of a second), microseconds (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, it is still called latency, but more often it’s called “processing time”.

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

  1. Reaction time: hardware/software noticing there is a job
  2. Transport time: it takes time to copy data, especially when we talk GBs
  3. Process time: computing the data itself takes time, depending on the algorithm and the processor
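
As a rough illustration, reaction time and process time can be observed per kernel with OpenCL’s event profiling. The fragment below is only a sketch: it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel plus work sizes, and transfer time can be measured the same way on the events returned by clEnqueueWriteBuffer/clEnqueueReadBuffer.

```c
/* Sketch: separating reaction time from process time for one kernel.
 * Assumes 'queue' (with CL_QUEUE_PROFILING_ENABLE), 'kernel',
 * 'global_size' and 'local_size' already exist. Timestamps are in ns. */
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong queued, start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &start,  NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &end,    NULL);

printf("reaction time : %llu ns\n", (unsigned long long)(start - queued));
printf("process time  : %llu ns\n", (unsigned long long)(end - start));
clReleaseEvent(evt);
```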

When latency is most important we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency of context-switching to and from the GPU is a lot lower than with discrete GPUs).

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor is possibly the bottleneck of the system. The metric we use for this balance is “FLOPS per byte”. This means that the less data is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.


The image below shows how such algorithms map onto the roofline model. You can see that for many processors you need at least 4 FLOPS per byte to hit the frequency-wall; below that you’ll hit the bandwidth-wall.

[Figure: the roofline model]

This is why HBM is so important.
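
For readers who like to see the arithmetic, here is a small sketch of the roofline bound: attainable performance is the minimum of the compute peak and the memory bandwidth multiplied by the arithmetic intensity (FLOPS per byte). The peak and bandwidth numbers in the example are illustrative, not tied to a specific processor.

```c
#include <stdio.h>

/* Roofline sketch: a kernel is capped either by the processor's peak compute
 * or by memory bandwidth times its arithmetic intensity (FLOPS per byte). */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double flops_per_byte)
{
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
}

int main(void)
{
    /* Illustrative device: 5000 GFLOPS peak, 320 GB/s memory bandwidth. */
    printf("1 FLOP/byte : %.0f GFLOPS\n", roofline_gflops(5000, 320, 1.0));  /* bandwidth-wall */
    printf("16 FLOP/byte: %.0f GFLOPS\n", roofline_gflops(5000, 320, 16.0)); /* compute-wall   */
    return 0;
}
```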

Processor utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but then on system level.

For FPGAs this latency can be very low (tens of nanoseconds) when data cables are directly connected to the FPGA chip. Such FPGAs sit on a board with, for example, a network port and/or a DisplayPort port.

GPUs depend on how well they’re connected to the CPU. As this is a subject on its own, I’ll discuss it in another post.

Determining the theoretical speed of a system

A request like “Make this software as fast as possible” is a lot easier (and cheaper) to fulfil than “Make this software as fast as possible on hardware X”. This is because there is no single fastest hardware (even though vendors want us to believe so); there is only hardware that is most optimal for a specific algorithm.

When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge.

OpenCL.org internship/externship

Our internship lives up to the description: it’s a rather complex homepage which should look good on your CV (if you manage to build it).

Want to help build an important website? OpenCL.org’s components have been designed and partly built, but still a lot of work needs to be done. We’re seeking an intern (or “extern” when not in Amsterdam) who can help us build the site. This internship is not about GPUs!

To complete the tasks, the following is required:

  • Technical expertise:
    • HTML5, CSS
    • PHP
    • Javascript
    • jQuery
    • Node.js
    • Mediawiki
    • XSLT
  • Can-do mentality
  • Able to plan own work
  • Good communication-skills
  • Available for 3 to 6 months

We don’t expect you to know all the tools, so we will guide you in learning new tools and techniques. Write us an “email of interest” to info@streamcomputing.eu, telling us what you can do and what your objectives for an internship would be.

We’re looking forward to seeing your letter!

AMD is back!

For years we have been complaining on this blog about what AMD was lacking and what needed to be improved. And as you might have concluded from the title of this blogpost, there has been a lot of progress.

AMD is back! It will all come together in the beginning of 2017, but you’ll already see a lot of progress in the coming weeks and months.

AMD quietly recognised and solved various totally new problems in HPC, becoming the hidden innovator everybody needed.

This blog gives an overview of how AMD managed to come back and what it took to get there.

Hardware

HSA – mixing different flavours of silicon in a single chip, an AMD initiative

In 2006 AMD bought ATI and told the world they would integrate the GPU and CPU, under the code-name “Fusion”. The problem was that combining the two different worlds was a lot harder than anybody expected. From shared silicon to heating issues, the technology was certainly not ready back then.

AMD took the lead in fixing the problems that come with these very heterogeneous processors: memory-sharing and task-dispatching. The HSA capabilities can (soon) also be found on processors from ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung and Texas Instruments. HSA-optimised software will therefore also result in performance improvements on non-AMD processors.

HSA goes further than combining CPU and GPU

Nowadays X86 CPUs are full of specialised silicon, such as an H.264 encoder/decoder. ARM processors are even more of an exotic collection of special-purpose silicon. With the rise of fabless, IP-selling design companies, the “exoticness” of processors will only increase. Thanks to HSA it is possible to design special-purpose silicon and have it integrated into processors of other vendors, as long as HSA design principles are followed.

It comes as no surprise that HSA is getting a lot of recognition.

HBM – reducing the memory bottleneck, an AMD invention

Isn’t it interesting that the memories on all of NVidia’s GPU boards have been designed by AMD engineers? HBM is the next step up after GDDR. The bandwidth of HBM2 can reach up to 1 TB/s per GPU, using drastically less power.

The new memory is really different: it is asynchronous, smaller and faster – a step comparable to the one from OpenGL to Vulkan (which evolved from Mantle, also an AMD invention). Read more on HBM here.

HSA-enabled high-performance APUs, an AMD product

Look at the specs of the upgraded PlayStation and Xbox and you’ll see what the new APUs will deliver: around 5 TFLOPS from a single chip.

On the CPU side a lot has happened. You might have heard of Zen – the new CPU architecture that’s coming soon. After more than a decade, AMD finally leapfrogs Intel again – remember the Athlon, which was leapfrogged by Intel’s Core architecture? What of course also helps is the absence of illegal competition practices by Intel for many years.

Upcoming APUs will be over 5 TFLOPS and thus directly competing with discrete GPUs.

HSA-GPUs – the third generation AMD/ATI GPGPU

There was still a lot to do to get from a programmable GPU to a real co-processor. This took three generations:

  • The first GPUs capable of GPGPU used “VLIW”, which were pretty hard to program.
  • The second generation used a scalar architecture, starting with the HD 7000 series and ending with Hawaii, called Graphics Core Next (GCN).
  • The third generation of GPGPUs, starting with Fiji, consists of HSA-capable GPUs with GDDR5 or HBM. This generation is now officially labelled “ROCm”.

Starting with Radeon R9 Fury, Radeon R9 Nano and FirePro S9300x2, AMD’s GPU architecture was evolved to have compute-performance, power-efficiency and HSA-capabilities in one package.

The upcoming Polaris GPUs bring down the costs – the RX 460 will cost around €100 and the RX 490 around €300, and they are to be launched in two weeks. The main competitor’s GPUs cost double(!) the price.

Software

Open Source – enabling more possibilities

One thing that holds back innovation in broad, complex areas is closed-source software. If the keeper of the software disagrees with a direction, progress is delayed. A bug in a driver can also be very costly, as only the keeper knows how best to work around it – think of doing something very different from deep learning on NVidia GPUs, or from VR on AMD GPUs.

AMD boldly decided that everything should be open source, as far as patents allow. This includes the (Linux) drivers, as described further below. This could mean that the hardware can follow the drivers instead of the other way around.

Check out this list on GPUOpen.com to find out what is open sourced for HPC. The list is pretty long.

New drivers built from the ground up

AMD has had a bad name when it comes to drivers. Drastic measures were needed, and the complete driver stack was built up from scratch: Crimson.

The new driver is built on top of HSA, and thus needs at least Fiji. Because of that, the support for older hardware has been reduced to the most critical updates: “AMD Radeon R5 235X, Radeon R5 235, Radeon R5 230, Radeon™ R5 220, Radeon HD 8470, Radeon HD 8350, Radeon HD 8000 (D/G variants), Radeon HD 7000 Series (HD 7600 and below), Radeon HD 6000 Series, and Radeon HD 5000 Series Graphics products have been moved to a legacy support model and no additional driver releases are planned“. This will come as bad news for owners of that hardware. I understood this was necessary, to be able to fully focus on third-generation GPUs. I hope that HSAIL (3rd gen GPGPUs) can be converted to AMDIL (2nd gen GPGPUs) in the future, but that seems to be quite a task.

The good news is that the new driver is much more stable and very performant on 3rd gen GPUs.

Open source Linux driver

As said above, the Linux drivers are open source now. You can find the Linux kernel driver on Github. For HPC (mostly Linux) this is very important, as bug fixing is under full control of the HPC software builders (like StreamComputing).

Follow Phoronix for updated benchmarks of the new Linux driver.

ROCm – easier programming of “exotic” hardware

There are two important parts here: HCC and HIP.

HCC is a C++ compiler inspired by C++AMP and C++14. It offers the following modes (taken from the GPUOpen-website):

  • C++ AMP: Microsoft C++ AMP is a C++ accelerator API with support for GPU offload. This mode is compatible with version 1.2 of the C++ AMP specification.
  • C++ Parallel STL: HCC provides an initial implementation of the parallel algorithms described in the ISO C++ Extensions for Parallelism, which enables parallel acceleration for certain STL algorithms.
  • OpenMP: HCC supports OpenMP 3.1 on CPU. The support for OpenMP 4.x accelerator offloading is currently in development.

The HIP-tool converts CUDA code to HIP (with some restrictions), HIP to HSA (via HCC compiler) and HIP to NVidia PTX (via NVCC compiler). This way it’s possible to indirectly run your CUDA-code on AMD FirePro (3rd gen GPGPUs).

I will write more on ROCm very soon, so I’ll leave it at this for now.

AMD’s return is important for the industry

Not an AMD fan? No worries, because it is also good for you.

Both Intel and NVidia have been asking for more and more money for their hardware over the years. NVidia now charges about $10,000 for a GPU, and also Intel has slowly increased their CPU-prices. We can simply blame AMD for not democratising the industry for years.

Innovation has also been slower-paced due to AMD’s lack of competitiveness – even if you disagree here, you will agree that AMD’s return will bring about an increase in innovation.

ISC lunch discussion: Portable Open Standards in HPC

Are you around at ISC and have an opinion on portable open standards? Then you should join the discussion with other professionals at ISC. Some suggestions for discussions:

  • (non)-preference for open standards like OpenCL, OpenMP, HSA and OpenACC.
  • Portability versus performance.
  • Using scripting languages in HPC, like Python.

Below info might change, because a map is never reality! Make sure you check this page or our Twitter channel the evening before.

  • Where: ISC in Frankfurt, at the catering area near the lounge areas.
  • What: Food, HPC and Open Standards.
  • When: Tuesday 21 June, 12:00-13:00


See you there!

Let’s meet at ISC in Frankfurt

Vincent Hindriksen will be walking around at ISC from 20 to 22 June. I’ll bring our latest brochure, some examples of great optimisations and some Dutch delicacies. We will also have some exciting news together with an important partner – stay tuned!

It will be a perfect time to discuss how StreamComputing can help you solve tough compute problems. Below is a regularly updated schedule of my time at ISC.

Get in contact to schedule a meeting.

If you’d like to talk technologies and bits & bytes, we’re trying to organise a get-together – date & time TBD.

An introduction to Grid-processors: Parallella, Kalray and KnuPath

We have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I’d like to give you a quick introduction to grid-processors.

Grid-processors are different from GPUs. Where a multi-core GPU gets its strength from being able to compute lots of data in parallel (SIMD data-parallelism), a grid-processor is able to have each core do something different (MIMD, task-based parallelism). You could say that a grid-processor is a multi-core CPU where the number of cores is at least 16, and the cores are only connected to their neighbours. The difference with full-blown CPUs is that the cores are smaller (like on a GPU) and thus use less power. The companies themselves categorise their processors as DSPs or Digital Signal Processors, but most popular DSPs only have 1 to 8 cores.

For the context, there are several types of bus-configurations:

  • single bus: like the PCIe-bus in a PC or the iMX6.
  • ring bus: like the XeonPhi till Knights Corner, and the Cell processor.
  • star bus: a central communication core with the compute-cores around.
  • full mesh bus: each core is connected to each core.
  • grid bus: all cores are connected to their direct neighbours. Messages hop from core to core.

Each of them has its advantages and disadvantages. Grid-processors get great performance (per Watt) with:

  • video encoding
  • signal processing
  • cryptography
  • neural networks

Continue reading “An introduction to Grid-processors: Parallella, Kalray and KnuPath”

Download all OpenCL header files and build your own OpenCL library

OpenCL header files

When you develop professional software, it is best practice to pin external header files, so that you have versioning under full control. This way you don’t get surprises when a colleague has a different OpenCL SDK installed. Luckily the Khronos Group has put all versions of the OpenCL header files on Github, so you can easily download the targeted OpenCL version.
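
A minimal sketch of what this pinning can look like in practice, assuming the downloaded Khronos headers live inside your own repository; the version macros and the deprecation define below are provided by the Khronos cl.h headers themselves.

```c
/* Sketch: guard against a different header version being picked up from an
 * installed SDK instead of the pinned copy in your repository. */
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS  /* compile 1.2-style code cleanly against 2.x headers */
#include <CL/cl.h>

#ifndef CL_VERSION_1_2
#error "Unexpected OpenCL headers: this project targets the OpenCL 1.2 headers or newer."
#endif
```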

Download a zip of the header files here:

If you found problems in one of these, you can directly communicate with the working group by submitting an issue on Github.

OpenCL.lib / libOpenCL.so

But wait, there is more!

You can build your own ICD loader, as the sources are open (licence). OpenCL version 2.1 is implemented, and it is fully backwards compatible with OpenCL 1.0. You can assume that the vendors base their own loaders on this code, so you can safely use it in your project.

Get the project from Github.

Heterogeneous Systems Architecture – memory sharing and task dispatching

Want to get an overview of what Heterogeneous Systems Architecture (HSA) does, or want to know what terminology has changed since version 1.0? Read further.

Back in 2012 the goals for HSA were high. The group tried to design a system where CPU and GPU would work together in an efficient way. In the 2013/2014 time-frame you’ll find lots of articles around the web, including on our blog, describing the capabilities of HSA. Unfortunately with the 1.0 specifications most terminologies have been changed.

In March 2015 the HSA Foundation released the final 1.0 specifications. It does not discuss hUMA (Heterogeneous Uniform Memory Access) nor hQ (Heterogeneous Queuing). These two techniques had undergone so many updates, that new terminologies were used.

In this blog post, we’ll present you an updated description of the two most important problems tackled by HSA: memory sharing and task dispatching.

We’ll be tuning the below description, so feedback is always welcome – focus is on clarity, not on completeness.

What is an HSA System?

Where the original HSA goals focused more on SoCs with CPU and GPU cores, now any compute core can be used. The reason is that modern SoCs are much more complex than just a CPU and a GPU – integrated DSPs and video-decoders are found on many processors. HSA thus now (officially) supports truly heterogeneous architectures.

The idea is that any heterogeneous processor can be designed by the principles of HSA. This will bring down design costs and enable more exotic configurations from different vendors.

An interesting fact about the HSA specifications is that they only specify goals, not how these must be implemented. This makes it possible to implement the specifications in software instead of hardware, which in turn makes it possible to upgrade older hardware to HSA.

Why is HSA important?

A simple question: “will there be more CPUs with embedded GPU or discrete GPUs?”. A simple answer: “there are already more integrated GPUs than discrete ones”. HSA defines those chips with mixed processors.

CPUs with embedded GPUs used to be not much more than the discrete GPUs with shared memory we know from cheap laptops in the 00’s. When the GPU got integrated, each vendor started to create solutions for inter-processor dispatching (threading extended to heterogeneous computing), coarse-grained sharing (transferring ownership between processor units) and fine-grained sharing (atomics working across all processor units).

The HSA Foundation

Sometimes an industry makes bigger steps by competing and sometimes by collaborating

AMD recognised the need for a standard. As AMD wanted to avoid the problems it had when introducing 64-bit into X86, it initiated the HSA Foundation. The founding members are AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung and Texas Instruments. NVidia and Intel are awkwardly absent.

Memory Sharing

HSA uses a relaxed memory model, which has full memory coherence (data guaranteed to be the same for all processes on all cores) and is pageable (subsets can be reserved by programs).

The below write-up is heavily simplified to give an overview how memory sharing is designed under HSA. If you want to know more, read chapter 5 from the HSA book.

Inter-processor memory-pointer sharing – Unified Addressing

The most important part is the unified memory model (previously referred to as “hUMA”), which makes programming the memory interactions in a heterogeneous processor with CPU-cores, GPU-cores and DSP-cores comparable to programming a multi-core CPU.

Like other modern memory models, HSA defines various segments, including global, shared and private. A difference is that flat addressing is used. This means that each address pointer is unique: you don’t have an address 0 for private and an address 0 for global. Flat addressing simplifies optimisations for higher-level languages. Of course you still need to be aware that each segment size is limited, and there will be consequences when defining larger memory chunks than are available in the segment.

When you have created a memory object and want the DSP or GPU to continue working on it, you can use the same pointers without any translation.
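
For OpenCL programmers, the closest thing to this flat addressing is the generic address space of OpenCL C 2.0, where one unqualified pointer can refer to global, local or private data. A minimal sketch (our own example, not taken from the HSA specification):

```c
/* OpenCL C 2.0 sketch: one helper works on __global, __local and __private
 * data alike, close in spirit to the flat addressing described above. */
float sum2(const float *p)                 /* unqualified pointer = generic address space */
{
    return p[0] + p[1];
}

__kernel void demo(__global const float *g, __global float *out)
{
    __local float l[2];
    float priv[2] = { 3.0f, 4.0f };

    if (get_local_id(0) == 0) {            /* one work item fills local memory */
        l[0] = g[0];
        l[1] = g[1];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    out[get_global_id(0)] = sum2(g) + sum2(l) + sum2(priv);
}
```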

Inter-processor cache coherency

In HSA-systems global memory is coherent without the need for explicit cache maintenance. This means that local caches are synchronised and/or that caches are shared. For more information, read this blog from ARM.

Fine grained memory – Atomic Operations

HSA allows memory segments to be protected for atomic access. This makes it possible to have multiple threads running on different cores of different processor units, all accessing the same memory in a safe manner.

Small and large consecutive memory segments can be reserved for sharing, from very fine to coarse-grained. All threads that have access to such a segment are notified when atomic operations are performed.

Fine Grained Shared Virtual Memory (HSA compatibility for discrete GPUs)

AMD has made some efforts to extend HSA to discrete GPUs. We’ll see the real advantages with dispatching, but it also helps create cleaner memory management.

The so-called “Fine-Grained Shared Virtual Memory” makes it possible to use HSA with discrete GPUs that have HSA support. Because it’s virtual and data is continuously transferred between the GPU and the HSA processor, the performance is of course lower than with real shared memory. You can compare it to NVidia’s Unified Virtual Memory, and it has also been planned for OpenCL 2.0 for a long time.
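
In OpenCL 2.0 terms, this is what fine-grained SVM buffers give you. Below is a hedged host-side sketch; it assumes an existing context, queue and kernel, and a device that reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER in its CL_DEVICE_SVM_CAPABILITIES.

```c
/* Host-side OpenCL 2.0 sketch: fine-grained shared virtual memory lets the
 * CPU and a capable GPU dereference the same pointers. */
size_t count = 1024, global_size = 1024;
float *shared = (float*)clSVMAlloc(context,
                                   CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                   count * sizeof(float), 0);

for (size_t i = 0; i < count; ++i)      /* the CPU writes directly, no map/unmap needed */
    shared[i] = (float)i;

clSetKernelArgSVMPointer(kernel, 0, shared);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clFinish(queue);                        /* afterwards the CPU reads the results in place */

clSVMFree(context, shared);
```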

Dispatching

HSA defines in detail how a task gets into the queue of a worker thread. Below is an overview of how queues, threads and tasks are defined and are named under HSA.

Queueing

Before HSA 1.0 we only spoke of “Heterogeneous Queue” (hQ). This is now further developed to “User Mode Queues”. A User Mode Queue holds the list of tasks for that specific (group of) processor cores, resides in the shared memory and is allocated at runtime.

Such a task is described in a language called the “Architected Queuing Language” (AQL), and is called an “AQL packet”.

Agents and Kernel Agents

HSA threads run on one processor core or on a group of them. These threads are called “Agents” and come in two variations: normal Agents and Kernel Agents. A Kernel Agent is an Agent that has a User Mode Queue and can execute kernels that work on a segment of memory. A normal Agent doesn’t have a queue and can only execute simple tasks.

If a normal agent cannot run kernels, but can run tasks, then what can it actually do? Here are a few examples:

  • Allocate memory, or other tasks only the host can do.
  • Send back (intermediate) data to the host – for example progress indication.

If you compare it to OpenCL, an agent resembles the host (which creates the work) and kernel agents resemble the devices that run the kernels (and which can issue new threads under OpenCL 2.0).

AQL packages: communicating dispatch tasks

There are different types of the AQL (Architected Queueing Language) packets, of which these are the most important:

  • Agent dispatch packet: contains jobs for normal agents.
  • Kernel dispatch packet: contains jobs for kernel agents.
  • Vendor-specific packet: between processors of the same vendor there can be more freedoms.

In most cases, we’ll be talking about kernel dispatch packets.
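
To give a feel for what such a packet contains, here is a simplified, illustrative struct. It is not the exact layout from the HSA runtime headers (see hsa_kernel_dispatch_packet_t there), but it lists the kind of information the packet carries.

```c
#include <stdint.h>

/* Simplified sketch of a kernel dispatch packet - field names and layout are
 * illustrative only, not the exact HSA runtime definition. */
typedef struct {
    uint16_t header;                 /* packet type, barrier bit, acquire/release fences */
    uint16_t setup;                  /* number of grid dimensions                        */
    uint16_t workgroup_size[3];      /* work-group size, in work items                   */
    uint32_t grid_size[3];           /* total grid size, in work items                   */
    uint32_t private_segment_size;   /* private (per work item) memory needed            */
    uint32_t group_segment_size;     /* group (per work-group) memory needed             */
    uint64_t kernel_object;          /* handle to the finalised kernel code              */
    uint64_t kernarg_address;        /* address of the kernel argument block             */
    uint64_t completion_signal;      /* signalled when the kernel has finished           */
} kernel_dispatch_packet_sketch;
```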

The Doorbell signal: low latency dispatching

HSA dispatching is extremely fast and power-efficient due to the implementation of a “doorbell”. The doorbell of an agent is signalled when a new task is available, making it possible to take immediate action. A problem in OpenCL is the high dispatch time for GPUs without a doorbell – up to the millisecond range, as we have measured. For HSA-enabled GPUs the response time before a kernel starts running is in the microsecond range.

Context switching

Threads can move from one core to another core – the task will be removed from the current queue and added to another queue. This can even happen when the thread is in running state.

StreamComputing’s position

The solution simply works and results in faster code – we did a large project with it last year.

It seems that almost the whole embedded processor industry believes in it. AMD (CPU+GPU), ARM (CPU+GPU), Imagination (GPU), MediaTek, Qualcomm (GPU), Samsung and Texas Instruments (DSP) are founders. Companies like Analog Devices, CEVA, Sony, VIA, S3, Marvell and Cadence joined the club later. Important Linux organisations like Linaro and Canonical are also present.

The system-on-a-chip will only get more traction, and we see HSA as an enabler. Languages like OpenCL and OpenMP can be compiled down to HSA, so it just takes switching the compiler. HSA-capable software can be written more efficiently, as it can now be assumed that memory can be shared efficiently and that dispatching new threads is really fast.

The most noticeable processors from NVIDIA, AMD and Intel

Ten years ago we had CPUs from Intel and AMD, and GPUs from ATI and NVidia. There was even another CPU-maker, VIA, and GPU-makers S3 and Matrox. Things are different now. Below I briefly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently. 

NVidia

As NVidia doesn’t have X86, they mostly focus on GPUs and bet on POWER and ARM for the CPU side. They already sell their Pascal architecture in small numbers.

2017 will all be about their Pascal-architecture.

Tesla K80 (Kepler)

  • The GPU is not simply 2 x K40 (GK110B GPUs), the chip is actually different (GK210)
  • It is the NVidia GPU with the largest private memory size available in kernels: 255 registers per thread.

This is the GPU for lazy programmers and for actually complex code: kernels can use double the registers.

Pascal P100 (Pascal)

  • 20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
  • 16 GB HBM2 (720 GB/s).
  • NVlink up to 64 GB/s effectively (20% of the 80 GB/s is protocol-overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers a bidirectional 16 GB/sec up and 16 GB/sec down. Compared to 12 GB/s PCIe3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal-GPUs, and not between the GPU and CPU yet.
  • OpenPOWER support coming, to compete with Intel.

Currently only available in a $129,000 server containing 8 of them (making the price of each P100 about $15,000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up to speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is very interesting too. A server with 8 GPUs gives you 80 TFLOPS – double that if you only need half precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

  • The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
  • The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

The two best-sold GPUs from NVidia, which are not server-grade. What is interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it’s more recent.

Tegra X1

  • 0.5 TFLOPS SP (GPU), 1 TFLOPS HP
  • 10 Watts

While not well accepted in the automotive industry (it uses too much power and has no OpenCL), it is well accepted in the in-car entertainment industry.

AMD

Known for the strongest OpenCL developers since 2012. With the HSA-capable Fiji GPUs they have now arrived at their third GPGPU architecture, after “VLIW” and “GCN” – fully driven by their HSA initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

  • 32GB GDDR5 global memory
  • 2.5 TFLOPS DP, 5 TFLOPS SP

The GPU’s processor is the same as in the FirePro S9150, which has been the unknown best DP performer of the past years. That GPU took the number one spot among air-cooled solutions, only to be surpassed by oil-submerged solutions. The S9170 builds on top of this and adds an extra 16GB of memory.

The S9170 is the GPU with the largest amount of memory, solving problems that use a lot of memory and are bandwidth limited – think calculations on oil&gas and weather, which now don’t fit on GPUs.

Radeon Nano and FirePro S9300X2 (Fiji)

  • Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
  • S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
  • Nano 175 Watt, S9300X2 300 Watt
  • Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs bring the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The Nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

  • CPU with built-in GPU
  • About one TFLOPS
  • TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallel and data-parallel code runs best on such a processor. Of all CPUs with a built-in GPU on the market, this one has the most TFLOPS available.

Intel

After years of “Peter and the Wolf” stories, they seem to have finally gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are on the horizon.

Their focus is still on customers who focus on test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

  • 55MB cache, 22 cores
  • AVX 2.0 (256 bit vector operations)
  • DDR4 (60 GB/s)

Not well known, but this CPU is very capable of running complex HPC code for the price of a high-end GPU. It can reach about 0.64 TFLOPS DP peak when fully using all cores and AVX 2.0.

XeonPhi Knights Landing

  • Available in socket and PCI version
  • 3 TFLOPS DP, 6 TFLOPS SP
  • AVX 512 (512 bit vector operations)
  • 16 GB HBM (over 400GB/s), up to 384 GB DDR4 (60 GB/s).
  • Currently (?) not programmable with OpenCL

After years of okish XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor, and can then be improved step by step. The only thing not to like is the lack of benchmarks – so the above numbers are all on paper.

Xeon+FPGA

  • Task-parallel processor
  • Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I’m still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCI on this processor.

Iris GPUs

  • CPU with built-in GPU
  • 0.7 TFLOPS SP

As these GPUs are included in almost all CPUs that Intel sells, they are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on the TCO (Total Cost of Ownership). At StreamComputing we have experience with many of the devices above, but also with various embedded hardware that competes with the above processors on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

9 questions on OpenCL’s future answered at IWOCL

During the panel discussion some very interesting questions were asked, which I’d like to share with you.

Should the Khronos group poll the community more often about the future of OpenCL?

I asked it on Twitter, and the current result is shown below. [image: Twitter poll on Khronos community feedback]

Khronos needs more feedback from OpenCL developers, to better serve the user base. Tell the OpenCL working group what holds you back in solving your specific problems here. Want more influence? You can also join the OpenCL advisory board, or join Khronos with your company. Get in contact with Neil Trevett for more information.

How to (further) popularise OpenCL?

While the open standard is popular at IWOCL, it is not popular enough at universities. NVidia puts a lot of effort into convincing academics that OpenCL is not as good as CUDA and into keeping CUDA as the only GPGPU API in the curriculum.

Altera: “OpenCL is important to be taught at universities; because of the low-level parts, it creates better programmers”. And I agree – too many freshly graduated CS students don’t understand malloc() and say “the compiler should solve this for me”.

The short answer is: more marketing.

At StreamComputing we have been supporting OpenCL with marketing (via this blog) since 2010. 6 years already. We are now developing the website opencl.org to continue the effort, while we have diversified at the company.

How to get all vendors to OpenCL 2.0?

Of course this was a question targeted at NVidia, and thus Neil Trevett answered this one. Use a carrot and not a stick, as it is business in the end.

Think more marketing and more apps. We already have a big list: [image: the OpenCL library ecosystem]

Know more active(!) projects? Share!

Can we break the backwards compatibility to advance faster?

This was a question from the panel to the audience. From what I sensed, the audience and panel are quite open to this. This would mean that OpenCL could make a big step forward, fixing the initial problems. Deprecation would be the way to go, the panel said. (OpenCL 2.3 deprecates all nuisances and OpenCL 3.0 is a redesign? Or will it take longer?)

See also the question below on better serving FPGAs and DSPs.

Should we do a specs freeze and harden the implementations?

Michael Wong (OpenMP) was clear on this. Learn from C++98. Two years were focused on hardening the implementations. After that it took 11 years to restart the innovation process and get to C++11! So don’t do a specs freeze.

How to evolve OpenCL faster?

Vendor extensions are the only option.

At StreamComputing we have discussed this a lot, especially fall-backs. In most cases it is very doable to create slower fall-backs, and in other cases (like with special features on, for example, FPGAs) it can be the only option to make it work.

How to get more robust OpenCL implementations?

Open sourcing the Vulkan conformance tests was a very good decision to make Vulkan more robust. Khronos gets a lot of feedback on the test cases. It will be discussed soon to what extent this can also be done for OpenCL.

Test-cases from open source libraries are often used to create more test cases.

How to better support FPGAs and DSPs?

Now GPUs are the majority and democracy doesn’t work for the minorities.

An option to better support FPGAs and DSPs in OpenCL is to introduce feature sets. A lesson learnt from Vulkan. This way GPU vendors don’t need to spend time implementing features that they don’t find interesting.

Do we see you at IWOCL 2017?

Location will be announced later. Boston and Toronto are mentioned.

Comparing Syntax for CUDA, OpenCL and HiP

Both CUDA and OpenCL are well-known GPGPU-languages. Unfortunately there are some slight differences between the languages, which are shown below.

You might have heard of HiP, the language that AMD made to support both modern AMD Fiji GPUs and CUDA-devices. CUDA can be (mostly automatically) translated to HiP and from that moment your code also supports AMD high-end devices.

To show how HiP compares to other APIs, Ben Sanders made an overview. Below you’ll find the table for CUDA, OpenCL and HiP, slightly altered to be more complete. The languages HC and C++AMP can be found in the original.

Term | CUDA | OpenCL | HiP
Device | int deviceId | cl_device_id | int deviceId
Queue | cudaStream_t | cl_command_queue | hipStream_t
Event | cudaEvent_t | cl_event | hipEvent_t
Memory | void * | cl_mem | void *
Grid of threads | grid | NDRange | grid
Subgroup of threads | block | work-group | block
Thread | thread | work-item | thread
Scheduled execution | warp | sub-group (warp, wavefront, etc.) | warp
Thread-index | threadIdx.x | get_local_id(0) | hipThreadIdx_x
Block-index | blockIdx.x | get_group_id(0) | hipBlockIdx_x
Block-dim | blockDim.x | get_local_size(0) | hipBlockDim_x
Grid-dim | gridDim.x | get_num_groups(0) | hipGridDim_x
Device Kernel | __global__ | __kernel | __global__
Device Function | __device__ | N/A (implied in device compilation) | __device__
Host Function | __host__ (default) | N/A (implied in host compilation) | __host__ (default)
Host + Device Function | __host__ __device__ | N/A | __host__ __device__
Kernel Launch | <<< >>> | clEnqueueNDRangeKernel | hipLaunchKernel
Global Memory | __global__ | __global | __global__
Group Memory | __shared__ | __local | __shared__
Private Memory | (default) | __private | (default)
Constant | __constant__ | __constant | __constant__
Thread Synchronisation | __syncthreads | barrier(CLK_LOCAL_MEM_FENCE) | __syncthreads
Atomic Builtins | atomicAdd | atomic_add | atomicAdd
Precise Math | cos(f) | cos(f) | cos(f)
Fast Math | __cos(f) | native_cos(f) | __cos(f)
Vector | float4 | float4 | float4

You see that HiP borrowed from CUDA.

The discussion is of course whether all these alike APIs shouldn’t use the same wording. The best thing would be to mix and match, as CUDA’s “shared” is much clearer than OpenCL’s “local”. OpenCL’s functions for locations and dimensions (get_global_id(0) and such), on the other hand, are often more appreciated than what CUDA offers. CUDA’s “<<< >>>” breaks all C/C++ compilers, making it very hard to build a frontend or IDE-plugin.
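
To make a few rows of the table concrete, here is a small OpenCL C kernel with the CUDA/HIP spelling of each construct in the comments. It is our own illustrative example and assumes it is launched with a work-group (block) size of 64.

```c
/* Minimal OpenCL C kernel, annotated with its CUDA equivalents. */
__kernel void scale(__global float *data, const float factor)   /* CUDA: __global__ void scale(...) */
{
    int i = get_global_id(0);              /* CUDA: blockIdx.x * blockDim.x + threadIdx.x */

    __local float tile[64];                /* CUDA: __shared__ float tile[64];            */
    tile[get_local_id(0)] = data[i];       /* CUDA: tile[threadIdx.x] = data[i];          */
    barrier(CLK_LOCAL_MEM_FENCE);          /* CUDA: __syncthreads();                      */

    data[i] = tile[get_local_id(0)] * factor;
}
```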

I hope you found the above useful to better understand the differences between CUDA and OpenCL, but also to see how HiP comes into the picture.

Meet us in April

The coming month we’re travelling every week. This creates a lot of opportunities for you to meet the StreamComputing team! For appointments, send an email to contact@streamcomputing.eu.

  • Meet us at ParallelCon (6 April 2016, Heidelberg, Germany). Besides the crash course (see below), we also have a talk on Vulkan.
  • Crash Course OpenCL @ ParallelCon (8 April 2016, Heidelberg, Germany). This is part of the conference – you can still buy tickets!
  • Meet us in Toronto (11 April 2016, Toronto, Canada). In Toronto for business, with time for appointments.
  • Meet us at IWOCL (19 April 2016, Vienna, Austria). The event of the year for everything OpenCL. So of course we’re there.
  • Meet us in Grenoble (25 April 2016, Grenoble, France). For a training we’re there the whole week. On Thursday and Friday there is time for appointments.

We’re happy to talk business and about technology. Also giving presentations at your company is an option.

This information was previously communicated via the newsletter and on LinkedIn.

AMD’s infographic on HBM

If one company is bad at bragging, then it’s AMD. Its two main competitors are a lot better at that – NVIDIA even bragged about their upcoming GPUs having HBM. So I was surprised when I recently encountered a nice infographic where AMD was actually bragging. And they deserved to do it!

I wanted to add comparisons with Intel/Micron’s HMC, but I’ll leave that for another post, as the good information is often a year old.

Very close to the processor

It’s using a high-speed bus on the substrate.


And yes, it really matters to be closer to the processor.


HBM versus GDDR5

  • Bus width up from 32-bit to 1024-bit
  • Clock speed down – we’ll have to wait to see how it’s calculated
  • Bandwidth up a lot – we can now expect 1 TB/s for GPUs
  • Required voltage 14% down, which saves a lot of energy

 


Better GB/s per Watt

So where sustaining 320 GB/s used to need 30 Watts, you now need 9 Watts. As the power reduction for the Radeon Nano is almost 100W, you understand that this tells only part of the story of the power savings made possible.


A lot smaller

Yes, 94% less surface area. Only part of the reason is the stacking.


Standards AMD has pioneered

HBM has been engineered and implemented by AMD, made a standard by JEDEC and put into silicon by Hynix.


And finally some bragging! AMD has created many standards we use daily, without us ever knowing they were AMD technology.

  • Mantle. The predecessor of Vulkan, DirectX 12 and more
  • GDDR 1 to 5. Now being replaced by HBM, and not GDDR6
  • Wake-on-LAN. You never knew! Intel and IBM made it into a standard, but AMD introduced the Magic Packet in 1995.
  • DisplayPort Adaptive-Sync. Previously known as FreeSync.
  • X86-64. The reason why you find “amd64” packages in Linux.
  • Integrated Memory Controllers.
  • On-die GPUs.
  • Consumer multicore CPU, the Athlon 64 X2.
  • HSA. Not in the list, probably because it’s a recent advancement.

Want to see the full infographic? Click here.

An example of real-world, end-user OpenCL usage

We ask all our customers if we may use their story on our webpage. For competition reasons, this is often not possible. The people at CEWE Stiftung & Co. KGaA were so kind as to share their experience after doing an OpenCL training with us and having their code reviewed by us.

Enjoy their story about the experience, from the training until now!

This year, the CEWE is planning to implement some program code of the CEWE Photoworld in OpenCL. This software is used for the creation and purchase of photo products such as the CEWE Photobook, CEWE Calendars, greeting cards and other products with an installation base of about 10 million throughout Europe. It is written in Qt and works on Windows, Mac and Linux.

 

In the next version, CEWE plans to improve the speed of image effects such as the oil painting filter, to become more useful in the world of photo manipulation. Customers like to use some imaging effects to improve photo products, to get even more individual results, to fix accidentally badly focused photos, and so on.

Continue reading “An example of real-world, end-user OpenCL usage”

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

Random numbers are important elements in stochastic simulations, but they also show up in machine learning and applications of Monte Carlo methods such as within computational finances, fluid dynamics and molecular dynamics. These are classical fields in high-performance computing, which StreamComputing has experience in.

A common problem when porting traditional software in these fields to parallel computing environments is the generation and reproducibility of random numbers. Questions that arise are:

  • Performance: How can we efficiently obtain random numbers when they are classically generated in a serial fashion?
  • Quality: How can we make sure that random numbers generated in a parallel environment still fulfil statistical randomness requirements?
  • Verification: How can we be sure that the parallel implementation is correct?

We consider verification from the viewpoint of producing identical results among different software implementations. This is often an important matter for our customers, and we have given them guidance on how to address this issue when random numbers are involved.

In this first part of our two-part blog series, we will briefly address some common pitfalls in the generation of random numbers in parallel environments and suggest suitable random-number generation libraries for OpenCL and CUDA. In the second part – on the blog soon – we will discuss solutions for reproducibility in the presence of random numbers.

Generation

Random numbers in computer software are typically obtained via a deterministic pseudo-random number generator (PRNG) algorithm. The output of such an algorithm is not truly random but pseudo-random (i.e., it appears statistically random), though we will simply say “random” for simplicity. We do not consider truly random numbers, which may be derived from physical phenomena such as radioactive decay, because we want the output of a random number generator to be reproducible.

PRNGs traditionally offered to application developers fail within the parallel setting. One reason is that these algorithms usually only support the sequential generation of random numbers based on some initial (seed) value (e.g., consider the standard C rand() function), so work items on a parallel device would need to block for getting exclusive access to the generator, which clearly impacts efficiency.

Some applications may require only a moderate amount of random numbers. In this case, we found it feasible to precompute the required set of random numbers and hold them in global memory. We call this the table-based approach. Other applications in turn may need to efficiently create a huge amount of random numbers. In this case, it may be necessary to equip each work item with its own PRNG seed. One potential problem with this approach is the use of weak PRNGs such as linear congruential generators (LCGs), which remain popular due to their speed and simplicity. In parallel settings, correlations between output sequences are aggravated and the quality of the application output may be severely affected, so LCGs should not be used at all. Another problem is the use of a small seed or a small PRNG’s internal state space. In this case, we may expect that the probability of two work items creating the same random sequence is quite high. Indeed, if we would randomly seed via srand(), the chance is already 50% for two out of approximately 77,000 work items creating entirely the same random number sequence! So we may either need a PRNG with a larger seed space and internal state, or one with a larger state and some mechanism to subslice the PRNG’s output sequence into non-overlapping “substreams”, with one substream per work item. The Mersenne Twister is highly acclaimed but requires a memory state of approximately 2.5 KB per work item in a parallel setting, and substreams are difficult to implement. While good PRNGs with a small internal state and flexible substream support exist (e.g., MRG32k3a), there are also “index-based” PRNGs, which are often more elaborate to compute but do not maintain any state. Such state-less PRNGs take an arbitrary index and a “key” as input and return a random number corresponding to the index in its random output sequence (which depends on the key chosen). Index-based PRNGs are very useful in parallel computing environments, and we will show how we use them for reproducibility in the second part of this blog.
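
As an illustration of the table-based approach, here is a minimal OpenCL kernel sketch: the host generates the numbers once with a trusted serial PRNG, uploads them to global memory, and each work item only reads its own pre-assigned slice. The kernel and its names are ours, purely for illustration.

```c
/* Sketch of the table-based approach: random numbers are pre-generated on the
 * host and consumed, never regenerated, on the device. */
__kernel void monte_carlo_step(__global const float *rng_table,
                               const uint numbers_per_item,
                               __global float *result)
{
    const uint gid  = (uint)get_global_id(0);
    const uint base = gid * numbers_per_item;   /* this work item's slice of the table */

    float acc = 0.0f;
    for (uint j = 0; j < numbers_per_item; ++j)
        acc += rng_table[base + j];

    result[gid] = acc;
}
```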

The choice of an appropriate PRNG may not be easy and ultimately depends on the application scenario. Luckily, there is choice! CUDA offers a set of PRNGs via its cuRAND library, and OpenCL applications can benefit from the clRNG library that AMD has released last year. Both cuRAND and clRNG offer a state-based interface with substream support. For index-based algorithms, the Random123 library provides high-quality PRNG implementations for both OpenCL and CUDA.

So far, we have discussed how we can safely generate random numbers in the GPU and FPGA context, but we cannot control the order in which parallel, concurrent work items create random numbers. This makes it difficult to verify the parallel implementation since its output may be different from that of the serial, original code. So the question is, in the presence of random numbers, how can we easily verify that our parallel code implements not only a faithful but a correct port of the serial version? This is addressed in part two – continue reading.

New: OpenCL Crash Courses

To see if OpenCL is the right choice for your project, we now only ask one day of your time. This enables you to quickly assess the technology without a big investment.

Throughout Europe we give crash courses in OpenCL. After just one day you will know:

  • The models used to define OpenCL.
  • Whether OpenCL is an option for your project.
  • How to read and understand OpenCL code.
  • How to code simple OpenCL programs.
  • The differences between CPUs, GPUs and FPGAs.

There are two types: GPU-oriented and FPGA-oriented. We’ve selected Altera FPGAs and AMD FirePro GPUs to do the standard trainings.


If you are interested in a certain crash course in a city other than those currently scheduled, fill in the form below to get notified when a crash course is planned in your city of choice.

We will add more dates and places continuously. If you want to host an OpenCL crash course event, get in contact.

Note: crash courses are intended as a first contact with software acceleration, so they don’t replace a full training.