NVIDIA enables OpenCL 2.0 beta-support

In the release notes for the NVIDIA 378.66 graphics drivers for Windows, NVIDIA mentions support for OpenCL 2.0. This is the first time in the three years since OpenCL 2.0 was launched that they have publicly spoken about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions were never mentioned in the release notes and were therefore officially unofficial.

Keep in mind that NVIDIA only started supporting OpenCL 1.2 on 3 April 2015, on their GPUs based on Kepler and newer architectures. By then OpenCL 2.0 had already been out for one and a half years (since November 2013), which is now more than three years ago.

Does it mean that you will soon be able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.


The 8 reasons why our customers had their code accelerated (by us)

Making software better and faster.

In the past six years we have helped various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised into three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these.

Reduce processing time

When people hear about Software Performance Engineering, the first thing they think about is reducing processing time. Most of our customers have hired us for this task, but all had different timing requirements.

Meeting timing requirements

The most common type of project has very specific timing requirements. Some examples:

  • When there are 10 minutes to respond, the computations cannot take a second longer.
  • For cloud applications, the available time is often less than half a second.
  • When processing data for a customer using the latest info, the processing should not take longer than “Let me get your file”.
  • When a plane lands, there should be no “Just hold a minute”.

We did not only use GPUs (OpenCL or CUDA) to make our customers’ code faster, but also redesigned algorithms to be more efficient. In a few cases we could meet the timing requirement by optimising the code alone, so there was no need to port to the GPU anymore.

Increasing user efficiency

Getting hours to minutes or minutes to seconds.

When employees are waiting for the computer to finish, this reduces efficiency. Some examples:

  • Handling customer-data at the reception, so more customers can be helped.
  • Reducing a daily batch job of 2 hours to minutes, to minimise overtime.
  • Importing new data into the system in less time, so the user can focus more on quality.

We found that users feel less powerless when pressure increases, as they have more control of the time and the software. Before the speedup they felt controlled by the software.

Increasing the responsiveness

From seconds or even minutes to milliseconds.

When a system does not react immediately, trust in the system goes down. Some examples:

  • For creative software the user needs to be in a flow and not have to wait for each small step taken. This is because slower software reduces the number of items in one’s short term memory.
  • For data analysts getting a feeling for the data takes some “wandering around”. Responsive software reduces learning time.

Besides standard performance engineering, there are more options to make the user interaction more immediate and snappy. For instance a “low resolution” result can let the user preview the results.

Reducing latency

Getting from seconds to milliseconds, or from milliseconds to microseconds.

Where responsiveness deals with users, latency describes automated systems where microseconds matter. The requirements on maximum processing time are often very strict, and real-time operating systems may be used. Some examples:

  • Real time image processing for video-streams.
  • Feedback-loops for machine control to reduce operational errors.
  • High-speed networking applications, such as finance.

We choose FPGAs when the latency needs to be in the low-microsecond range, and GPUs when the latency has to be in the millisecond range. Work we did here included porting from CPU to GPU and from GPU to FPGA.

Increase functionality and data sizes

The goal in this category is the same as in the previous one, but the problem is often described from the perspective of features and data size. This is because time is not seen as a current problem, but as a future problem.

Adding extra functionality

Same processing time, more functionality.

The processing time is described as a disadvantage, but is not a complaint: there is an understanding that the computations are intensive. On the other hand, the customers request extra features. Some examples:

  • Applying extra image improvement algorithms on the video-stream.
  • Applying an alternative algorithm to the same data.

In cases where we also improved the existing software for performance, total processing time for more data went down.

Increasing simulation/data sizes

Same processing time, tenfold the data.

Each year the data size increases more than the performance of computers does. Where a single server was enough 3 years ago, now a small cluster has to be rented.

To cope with this explosive growth we are asked to remodel the software so that it ports well to current and future high-performance processors. Some examples:

  • Promoting prototypes to production software.
  • Going from 1D data to 2D and 3D.
  • Cross-analysing 10,000 shoppers instead of 100.
  • Doing a weather-model for the whole of Europe instead of one country only.
  • Improving a stochastic model.
  • Using higher resolution data.

This is the most common type of problem we solve in this category. Especially proven-to-work models (including prototypes) are chosen to work on larger data sets than they were designed for.

Reduce operational costs

When the operation scales up, the operational costs can increase exponentially. Some of our customers identified the problem well in advance and let us reduce their operational costs before they got out of hand.

Reducing the server count

Same performance, less hardware.

Processing data can take tens to hundreds of servers, increasing power and maintenance costs and thus lowering the performance/€ or performance/$.

  • If the computations are the limiting factor, ten dual-socket servers can be replaced by a single GPU or FPGA server. After that it is much easier to double the capacity.
  • Computations can be moved to the user, by doing pre-processing on the mobile phone or desktop. By using the GPU, the device’s battery doesn’t get drained.

We’ve helped early scale-ups who identified that operational costs sky-rocket when the code is not optimised.

Reducing power usage

Same performance, fewer Watts.

It’s all about performance/Watt. Some examples:

  • A GPU can draw 200 to 300 Watt, but algorithms can take 10 times less time than on a 100 Watt CPU, so the total energy used is lower.
  • On smartphones, porting code to the GPU reduces the power usage.
  • When an FPGA can be used (e.g. for networking), there are options to replace the full server by a single 20-30 Watt FPGA.
  • Porting code to GPUs using HBM, a memory type that uses much less power.
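The first example comes down to energy being power multiplied by time, reading that bullet as "10 times less time". A minimal sketch in Python with illustrative numbers: the one-hour CPU run and the 250 Watt GPU draw are assumptions picked from the ranges above, not measurements.

```python
def energy_wh(power_watt: float, hours: float) -> float:
    """Energy in watt-hours = power draw (W) x run time (h)."""
    return power_watt * hours

# Assumed, illustrative numbers: a job that takes 1 hour on a 100 W CPU
# and finishes 10 times faster on a 250 W GPU.
cpu_energy = energy_wh(100, 1.0)
gpu_energy = energy_wh(250, 1.0 / 10)

print(f"CPU: {cpu_energy:.0f} Wh, GPU: {gpu_energy:.0f} Wh")
```

Under these assumptions the GPU run uses a quarter of the energy of the CPU run, even though the GPU draws more power while it is busy.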

Reducing power usage on portable devices has been the most common use case here.

Recognise one of these problems?

Call or email us to start discussing your problem. In a one-week effort of analysing your code and discussing with the developers, we can often provide a good indication of how much time the performance improvement will take.

Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Both programmes are explained in detail below. If you want to learn more about PRACE, click here.

2017 International Summer School on HPC Challenges in Computational Sciences – Applications due 6 March 2017

The summer school is sponsored by Compute/Calcul Canada, the Extreme Science and Engineering Discovery Environment (XSEDE), the Partnership for Advanced Computing in Europe (PRACE) and the RIKEN Advanced Institute for Computational Science (RIKEN AICS).

Graduate students and postdoctoral scholars from institutions in Canada, Europe, Japan and the United States are invited to apply for the eighth International Summer School on HPC Challenges in Computational Sciences, to be held June 25 – 30 2017, in Boulder, Colorado, United States of America.

Leading computational scientists and HPC technologists from the U.S., Europe, Japan and Canada will offer instruction on a variety of topics and also provide advanced mentoring.
Topics include:

  • HPC challenges by discipline
  • HPC programming proficiencies
  • Performance analysis & profiling
  • Algorithmic approaches & numerical libraries
  • Data-intensive computing
  • Scientific visualization
  • Canadian, EU, Japanese and U.S. HPC-infrastructures

For more details please visit:

PRACE Summer of HPC 2017 – Applications due 19 February 2017

The PRACE Summer of HPC is a PRACE outreach and training programme that offers summer placements at top HPC centres across Europe to late-stage undergraduates and early-stage postgraduate students. Up to twenty top applicants from across Europe will be selected to participate. Participants will spend two months working on projects related to PRACE technical or industrial work and produce a report and a visualisation or video of their results.

Early-stage postgraduate and late-stage undergraduate students are invited to apply for the PRACE Summer of HPC 2017 programme, to be held in July & August 2017. Consisting of a training week and two months on placement at top HPC centres around Europe, the programme affords participants the opportunity to learn and share more about PRACE and HPC, and includes accommodation, a stipend and travel to their HPC centre placement.

The programme will run from 2 July to 31 August 2017, with a kick-off training week at IT4I Supercomputing Centre in Ostrava attended by all participants. Flights, accommodation and a stipend will be provided to all successful applicants. Two prizes will be awarded to the participants who produce the best project and best embody the outreach spirit of the programme.

For more details please visit:

This is a unique chance to get in touch with HPC. If you have friends who could use HPC for their research, make sure they learn about these summer activities.

How many threads can run on a GPU?

Blocks of Threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: At a minimum, around 4 billion can be scheduled; tens of thousands can run simultaneously.

If you are used to working with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient. There are a few reasons why a GPU has been designed to handle so many threads. Read further…

NOTE: The description below is a (very) simplified model, meant to explain the basics. It is far from complete, as it would take a full book chapter to explain it all.

Funded PhD internships at StreamComputing

We have several wishes for 2017 and two of them are to make code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available and two are at StreamComputing.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.


The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2016. When you apply for one of the internships, you must update your profile page including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (€5000 for 3 months) + travel expenses (maximum €500). The main goal is to gain experience. Even if you don’t get a job after the internship, you tap into our network.


IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks off with a full-day Advanced Hands-On OpenCL tutorial, which is followed by two days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. The deadline is the beginning of February, so be sure to submit in the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.

We have been awarded the Khronos project to upgrade the OpenCL test suite to 2.2!

Some weeks ago we started implementing the Compiler Test Suite for OpenCL 2.2. The biggest improvement of OpenCL 2.2 is C++ kernels, which were originally planned for 2.1. SPIR-V 1.1 is another big improvement.

We are very happy to have a part in making OpenCL better! We find OpenCL C++ kernels very important, even if they have their limitations. Thanks to SPIR-V 1.1 it gets easier to have more (unofficial) kernel languages next to C and C++, and to get SYCL. Also, upgrading from 2.0 to 2.2 is rather easy thanks to the open source libclcxx.

Personally I found this project to also be very important for our internal knowledge building, as almost every function would be touched and discussed.

OpenCL 2.2 CTS RFQ has been awarded to StreamComputing

Khronos issued a Request For Quote (RFQ) back in September 2016 to enhance and expand the existing OpenCL 2.1 conformance tests to create an OpenCL 2.2 test suite to be used to define conformance for OpenCL 2.2 implementations. The contract has been awarded to StreamComputing. StreamComputing is a software consultancy company specialized in performance tuned software development for CPU, GPU and FPGA. A large part of their clients hires them for their OpenCL expertise.

Improvements have already been added, bugs squashed and documentation improved. We hope to continue this in the coming months!

We’ll be ready in March. Hopefully the first implementations are ready by then, as there is a test suite ready to iron out any bug discovered. Which three OpenCL drivers do you think will be first to have OpenCL 2.2? Intel, AMD, NVIDIA, ARM, Imagination, Qualcomm, TI, Intel FPGA (Altera), Xilinx, Portable OpenCL or another?

AMD gets into Machine Intelligence with “MI” range of hardware and software

Always good to have a share out of that curve.

In June we wrote “AMD is back!”. This is one of the follow-up blog posts that goes into more detail in a specific direction: AMD specifically targeting machine learning with the MI (= Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

The recently released “Radeon Instinct” series focuses purely on compute. How AMD’s new naming is organised will be discussed in a separate blog post.

For fast deep learning you need two things: extremely fast memory and lots of FLOPS at 16-bit. AMD happens to have developed HBM2, the world’s fastest memory and now available to everybody. So AMD only needed to beat the NVIDIA P100 on FLOPS, and they did: the AMD “MI25” is expected to deliver around 25 TFLOPS for 16-bit operations. If you want to know more, lots of new links show up daily on Google.

This means that AMD is beating NVIDIA’s top-range GPUs again. Add NVLink-competitor CCIX and it’s clear that AMD is a strong competitor again, as they used to be. The only problem is that much of the software is written in CUDA…

Software: porting from CUDA

AMD’s Greg Stoner, Director of Radeon Open Compute, opened up today on the current state of their software (typos fixed):

If you guys saw the Radeon Instinct launch you will find we finally announced our big push into Deep Learning. Here is good article http://www.anandtech.com/show/10905/amd-announces-radeon-instinct-deep-learning-2017

We will be delivery HIP version of Caffe, Tensorflow, Torch7, MxNet, Theano, CNTK, Chainer, all supporting our new MIOpen – our new Deep Learning solver.

Since the everyone is interested in Tensorflow

Note this will run on AMD and NVIDIA hardware


The status of Eigen is “35 out of 43”, which is a rather vague description but an indication nevertheless. Eigen is a very important part of TensorFlow. A good promise that the code will be ready when the new VEGA hardware is launched.

Also interesting is the mention of MIOpen. It has been discussed on TechReport:

This library offers a range of functions pre-optimized for execution on Radeon Instinct cards, like convolution, pooling, activation, normalization, and tensor operations. AMD says that convolution operations performed with MIOpen are nearly three times faster than those performed using the widely-used “general matrix multiplication” (GEMM) function from the standard Basic Linear Algebra Subprograms specification. That speed-up is important because convolution operations make up the majority of program run time for a convolutional neural network, according to Google TensorFlow team member Pete Warden.

The reason they can deliver so many software ports in such limited time with a small team is HIP. It makes it possible to port CUDA code to HIP, which runs on both AMD and NVIDIA.

We personally also had good experience with porting code to HIP. If you need CUDA code to be ported to AMD, know we tend to make the code faster and solve previously undiscovered bugs during the porting process.

Opinions crossing the table: Khronos for world peace

Pragmas not being mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lots of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This means that a programming language is just like human languages, as new things get named and described world-wide. This also shows the hard work the development of languages brings, as new feature requests are a constant.

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smoothly, but at least more smoothly than upgrading packages to the required versions on RedHat/CentOS. While the download says “Linux” and you’re expecting support for multiple Linux breeds, there is only official support for RedHat 6.5 (and CentOS).

Luckily it turned out to be quite possible to have a stable installation of Quartus on Ubuntu. As information on this subject was scattered around the net and even incomplete, we decided to share our howto in this blog post. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc., as most problems are due to new features and more up-to-date libraries than are provided in RedHat/CentOS.

Note 1: we did not install the FPGA on the Ubuntu machine, nor did we fully research potential problems with doing so. Installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. There are no guarantees that it all works on other versions. Let us know in the comments if it works on other versions too.

Accelerating Excel with OpenCL

One of the world’s most used software packages is far from performance-optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough for really heavy calculations. Another option is to use the much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently.

Online Tutorials are here

Online training

We’re going online with our presentations and tutorials. This makes it easy to reach more people and make our trainings more flexible.

We’re starting with short introductory trainings, but we have bigger plans. Keep an eye on our events (shared on Twitter, LinkedIn, this blog and the newsletter) to see what the offerings are. And you’re very welcome to join!

On 4 October (new date) there will be a free two-hour OpenCL 101. The target timezones are Eastern America and Europe.

Agenda Online OpenCL 101

  • Introductions (20 minutes)
    • StreamComputing
    • GPUs and parallelism
    • OpenCL
  • By example: Getting started with OpenCL (30 minutes)
  • By example: Porting a simple program to OpenCL (30 minutes)
  • Q&A in parallel (30 minutes). Ask us any question, for instance:
    • General OpenCL.
    • OpenCL on GPUs.
    • OpenCL on FPGAs.
    • What algorithms work well with GPUs, CPUs and FPGAs.
    • StreamComputing services.
  • The next steps (5 minutes).
  • Closing words (5 minutes).

Read more here…

Tutorial server

You can already test if the tutorial server works for you by looking around in our demo room. The tutorial itself will be in another room. Use your own name and password “ap“.

See you soon!

How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)

Hampstead flooding

How water moves through an area, given a certain rate of inflow, can be fully simulated. We got a request to make such a simulation faster, as even moderate simulations already took too much time. As the customer wanted more detail, larger areas and more alternative scenarios computed, the existing performance did not suffice.

The code was already ported to MPI to scale to 8 cores. This code was used as a base for creating our optimised GPU-code. Using a single GPU we managed to get a 44 to 58 times speedup over a single CPU core, which is 5 to 7 times faster than MPI on 8 to 32 CPU cores.

For larger experiments we could increase the performance advantage over MPI-code from 7 times to a total of 35 times, using multiple GPUs.

We solved both the weak-scaling problem and the mapping on GPUs

If you add the 9x speedup of the initial performance optimisation, the total is over 2600x. What used to take a year can now be done in about 3.5 hours. This clearly shows the importance of software performance engineering. Most code has already had some optimisations applied (just like here), and a 5 to 7 times speedup is still quite achievable.
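The post does not spell out how the individual speedups combine into the total. One plausible reconstruction, sketched in Python below, assumes that the upper bounds belong together (the 58x single-GPU speedup over one core corresponds to the 7x speedup over MPI) and that the 9x initial optimisation multiplies on top. Both are assumptions, not statements from the project.

```python
# Assumption: the upper bounds pair up, i.e. 58x over a single core
# corresponds to 7x over the MPI code.
initial_optimisation = 9    # speedup of the initial performance optimisation
single_gpu_vs_core = 58     # single GPU vs a single CPU core (upper bound)
single_gpu_vs_mpi = 7       # single GPU vs the MPI code (upper bound)
multi_gpu_vs_mpi = 35       # multiple GPUs vs the MPI code

# Implied speedup of the MPI code over a single core (roughly 8.3x).
mpi_vs_core = single_gpu_vs_core / single_gpu_vs_mpi

total = initial_optimisation * mpi_vs_core * multi_gpu_vs_mpi
hours_per_year = 365 * 24

print(f"total speedup: about {total:.0f}x")
print(f"one year of computing becomes about {hours_per_year / total:.1f} hours")
```

Under these assumptions the total lands just above 2600x, and a year of computing shrinks to a few hours, in line with the figures quoted above.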

Read on for some more details.

Get ready for conversions of large-scale CUDA software to AMD hardware

In the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA, such that it works on both AMD and NVIDIA hardware. The remaining 5% consists of solving ambiguity problems that arise when CUDA is used on non-NVIDIA GPUs. Once the CUDA-code has been translated successfully, software can run on both NVIDIA and AMD hardware without problems.

The target group of HIP is companies with older clusters, who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by 3 dual-GPU FirePro S9300X2 GPUs of 11 TFLOPS will give a huge performance boost for a competitive price.

The cost of making CUDA work on AMD hardware is easily paid for by the price difference when upgrading a GPU-cluster.


Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

The information you find everywhere: on Linux the current “radeon” and “fglrx” are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new and open source drivers ready, fglrx is now deprecated and will not get updates (or only very late). I can therefore get to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat have such old kernels that there is currently no issue. Fedora users simply have a problem, as they are already moving towards 4.8.
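The rule above can be turned into a quick check before upgrading. This is a minimal sketch in Python: the function names are hypothetical, the limits (kernel 4.2, Xorg server 1.17) are taken from the text, and the Xorg version string is assumed to be supplied by hand, for instance from the output of `Xorg -version`.

```python
import re

# Limits taken from the text: fglrx works up to kernel 4.2 and Xorg server 1.17.
MAX_KERNEL = (4, 2)
MAX_XORG = (1, 17)

def major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '4.2.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))

def fglrx_compatible(kernel: str, xorg: str) -> bool:
    """True if both versions are within the limits fglrx still supports."""
    return major_minor(kernel) <= MAX_KERNEL and major_minor(xorg) <= MAX_XORG

if __name__ == "__main__":
    import platform
    # platform.release() reports the running kernel version on Linux.
    print("running kernel:", platform.release())
    print(fglrx_compatible("4.2.0-42-generic", "1.17.2"))
    print(fglrx_compatible("4.4.0-31-generic", "1.18.3"))
```

The version strings in the last two lines are made-up examples of an older and a newer system, not taken from a specific release.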


CUDA Compute Capability 6.1 Features in OpenCL 2.0

On the CUDA page of Wikipedia there is a table with compute capabilities, as shown below. While double-checking support for AMD Fiji GPUs (like Radeon Nano and FirePro S9300X2) I got curious how much support is still missing in OpenCL. For Fiji it looks like there is 100% support of all features. For OpenCL 2.0, read on.

CUDA features per Compute Capability on Wikipedia


Rant: No surprise there’s a shortage of good GPU-developers

Another Monday, yet another graphics API

We could read here that software is critical for HPC – a market where accelerators/GPUs are used a lot. So all we need to do is better support all GPU-developers as a whole, right? Unfortunately something else is happening.

Each big corporation wants to have their own developers, not to be shared with the competition.

Microsoft was quite early in this with Ballmer’s “developers, developers, developers” meme. Tip of the hat to them for acting on the shortage, a shake of the head for how they acted. For .NET it was a success: it stole developers away from Java and C/C++, increasing the market share of Windows Server, SQL Server and more.

GPU-vendors want that too. They find growing the cake together too slow, so they would rather start the fight while the cake is tiny.

4-day training on OpenCL-on-FPGAs, 24-28 October, Amsterdam

From 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are correctly writing OpenCL code for FPGAs, learning to work with Quartus and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week (2 + 2 days of training, one pause day), including a tour in Amsterdam on Wednesday.

See the special event-page for more information.

Porting code that uses random numbers


When we port software to the GPU or FPGA, testability is very important. A part of making the code testable is getting its functionality fully under control. And as you may have guessed, random numbers generated at run time need special attention.

In a selection of past projects random numbers were generated on every run. Statistically the simulations were more correct, but it is impossible to make 100% sure the ported code is functionally correct. This is because there are two variations introduced: one due to the numbers being different and one due to differences in code and hardware.

Even if the combined error-variations are within the given limits, the two code-bases can have unnoticed, different functionality. On top of that, it is hard to have further optimisations under control, as that can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should take priority during the port.
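One common way to get predictable outcomes is to fix the seed for the duration of the port. The sketch below uses Python's standard `random` module and a toy `simulate` function as a stand-in for a real simulation; the function and the seed value 42 are illustrative assumptions, not taken from a customer project.

```python
import random

def simulate(rng: random.Random, steps: int = 1000) -> float:
    """Toy stand-in for a stochastic simulation: the mean of uniform samples."""
    return sum(rng.random() for _ in range(steps)) / steps

# Seeded by the system: every run differs, so a mismatch between the original
# and the ported code cannot be attributed to the port alone.
unseeded = simulate(random.Random())

# Fixed seed (42 is arbitrary): both runs see identical numbers, so any
# remaining difference comes from the code or the hardware.
reference = simulate(random.Random(42))
ported = simulate(random.Random(42))

print("reproducible:", reference == ported)
```

With the random input fixed, only one of the two sources of variation mentioned above remains, which makes a difference between code bases much easier to interpret.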

Below are some tips we gave to these customers, and I hope they’re useful for you. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.


Random Numbers in Parallel Computing: Generation and Reproducibility (Part 2)

In the first part of our two-part blog series, we have discussed how parallel computing applications can best use pseudo-random number generators (PRNGs) so as to benefit from parallel computing speedups, without negatively impacting the statistical properties of the random numbers generated. We have argued that index-based PRNGs (e.g., from the Random123 library), which do not maintain any state but instead take an index and a key as input and return the random number corresponding to the index in its random output sequence, provide numerous benefits in parallel programming. This week, we discuss both an application of index-based PRNGs and another important aspect of PRNGs in parallel environments: that of reproducing the same output among different parallel implementations or among a parallel and a serial implementation. We focus on the latter and consider reproducibility for the purpose of verification, which – as we have stated last week – is an important matter for our customers.


Reproducibility may be simple if random numbers are only used in the initialization phase of the application. In this case, it may be sensible to record the set of random numbers generated in the serial code, write it to a file and use this file as input to a table-based approach for random-number generation in the parallel code (as discussed in Part 1 of our blog). Indeed, this is a convenient approach for both us and our customers as we do not have to deal with PRNG implementation internals and our customer can independently verify the correctness of the parallel implementation. We have used this for several of our projects.

More complex scenarios such as stochastic simulations and Monte Carlo applications require a continuous, parallel generation of random numbers. In this case, reproducibility at first seems challenging considering the concurrent and thus unpredictable order in which parallel code may invoke PRNGs. Indeed, it may be impossible to achieve when using traditional, seed-based PRNGs. This is because we may need to access entries in the PRNG’s output sequence in an arbitrary order rather than sequentially. Essentially, we need to define a one-to-one mapping between random numbers generated in the serial code and those used in the parallel version and then be able to ask the PRNG for the first entry, the second entry, and so on, in both implementations. The random-access capability is exactly what index-based PRNGs provide. The main challenge lies in the definition of the mapping.
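
To illustrate what random access means here, the sketch below uses a small stateless function that hashes a key and an index to a 64-bit value. It is a SplitMix64-style mixer that we use only as a stand-in; it is not the Random123 API, and the name `rand_at` is hypothetical. The point is that asking for entries in a scrambled order yields the same values as asking for them sequentially.

```python
MASK64 = (1 << 64) - 1

def rand_at(key, index):
    """Stateless stand-in for an index-based PRNG: maps (key, index) to 64 bits."""
    z = (key + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

KEY = 2024

# Serial code: first entry, second entry, and so on.
sequential = [rand_at(KEY, i) for i in range(6)]

# Parallel code: the same entries requested in an unpredictable order.
scrambled_order = [4, 0, 5, 2, 1, 3]
random_access = {i: rand_at(KEY, i) for i in scrambled_order}

same = all(random_access[i] == sequential[i] for i in range(6))
```

A seed-based generator would have to be advanced step by step to reach entry 4 first, which is exactly what concurrent work items cannot coordinate cheaply.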

We may consider two options for defining the one-to-one mapping between random numbers generated in a serial code and those created in a parallel implementation. The summary of our findings is this:

  • Serial-to-parallel mapping: This amounts to an emulation of the serial random number generation in the parallel code. It generally requires minimal or no changes in the serial, original code, but may be increasingly difficult or impossible to define as the complexity of the code grows.
  • Parallel-to-serial mapping: This emulates the parallel random number generation in the serial code. It is usually easier to define than the serial-to-parallel mapping but requires deeper changes in the serial code.

We explain below what we mean by that.

Consider a simple scenario where we have some serial code that generates a new set of random numbers in each iteration of a loop using a traditional seed-based PRNG. A parallel implementation may operate by executing each iteration via a single work item. Using traditional PRNGs, we may equip each work item with its own PRNG seed.

  • Using a serial-to-parallel mapping, we may then record the PRNG state at each iteration of the loop in the serial code and initialize the PRNG state in each work item of the parallel code similarly to a table-based approach – this does not require any changes in the serial code (other than temporarily for recording the random numbers). Alternatively, we may replace the PRNG in the serial code with an index-based one and use a single, static counter for the PRNG, which we increment with each invocation of the PRNG – this is a minimal change in the serial code. We reflect the index-based PRNG in the parallel code and initialize each work item’s counter with w·n, where w is the work item index and n denotes the size of the set of random numbers created per loop. In both cases, the serial and the parallel code should produce identical outputs.
  • For a parallel-to-serial mapping, it is easier to assume that the parallel implementation uses index-based PRNGs. Let’s assume that each work item uses indices of the form w·n+j for the j-th item in the set of random numbers generated. Then the serial code is adapted to use the same PRNG and computes indices as WI(i)·n+j, where WI(i) returns the work item index corresponding to the i-th iteration of the loop. Again, the serial and parallel implementations should then produce identical outputs.
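
Both mappings can be sketched in a few lines of Python. As before, `rand_at` is a SplitMix64-style stand-in for an index-based PRNG (an assumption, not the Random123 API), the "work items" are plain function calls, and `WI` is a hypothetical iteration-to-work-item assignment that we chose to be a reversal so that the mapping is not trivial.

```python
MASK64 = (1 << 64) - 1

def rand_at(key, index):
    """Stateless stand-in for an index-based PRNG: maps (key, index) to 64 bits."""
    z = (key + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

KEY = 7
N_ITER = 4  # loop iterations, one work item each
N = 3       # random numbers per iteration (n in the text)

def serial_with_counter():
    """Serial code with a single counter, incremented on every PRNG call."""
    counter = 0
    rows = []
    for _ in range(N_ITER):
        row = []
        for _ in range(N):
            row.append(rand_at(KEY, counter))
            counter += 1
        rows.append(row)
    return rows

def work_item(w):
    """Parallel code: work item w uses the indices w*n + j."""
    return [rand_at(KEY, w * N + j) for j in range(N)]

def WI(i):
    """Hypothetical assignment of loop iteration i to a work item."""
    return N_ITER - 1 - i

def serial_emulating_parallel():
    """Serial code adapted to compute its indices as WI(i)*n + j."""
    return [[rand_at(KEY, WI(i) * N + j) for j in range(N)]
            for i in range(N_ITER)]

# Work items executed in reverse order to mimic unpredictable scheduling.
parallel = {w: work_item(w) for w in reversed(range(N_ITER))}

serial_rows = serial_with_counter()
serial_to_parallel_ok = all(parallel[i] == serial_rows[i] for i in range(N_ITER))

adapted_rows = serial_emulating_parallel()
parallel_to_serial_ok = all(adapted_rows[i] == parallel[WI(i)] for i in range(N_ITER))
```

In the first check the parallel code reproduces the counter of the serial loop; in the second the serial loop reproduces the indexing scheme of the work items.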

A proper mapping becomes increasingly difficult to define as the code complexity grows. However, it is usually easier to map the parallel random number generation to the serial code. Indeed, consider the case where some random numbers are generated in a data-dependent manner, i.e., only if some condition on the input data is fulfilled. Then it is impossible to give a predefined mapping from the serial to the parallel code. We therefore prefer the serial-to-parallel mapping whenever it is easy to define (as it avoids, or at least minimizes, the risk of introducing bugs in the original code by changing its random number generation) but resort to the parallel-to-serial mapping for more difficult cases.
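
The data-dependent case can be made concrete with a small sketch. The input values, the threshold and the function names below are made up for illustration, and `rand_at` is again a SplitMix64-style stand-in for an index-based PRNG. With a single serial counter, the index an iteration uses depends on how many earlier iterations passed the condition, so it cannot be fixed in advance; if every work item instead reserves its own index range (here one index per item), the serial code can simply adopt that scheme.

```python
MASK64 = (1 << 64) - 1

def rand_at(key, index):
    """Stateless stand-in for an index-based PRNG: maps (key, index) to 64 bits."""
    z = (key + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

KEY = 99
THRESHOLD = 0.5
data = [0.2, 0.9, 0.7, 0.1, 0.8]  # made-up input data

def serial_counter_indices(values):
    """Original serial scheme: the shared counter advances only when the
    condition holds, so each index depends on all earlier data."""
    counter = 0
    used = {}
    for i, x in enumerate(values):
        if x > THRESHOLD:
            used[i] = counter
            counter += 1
    return used

def parallel_run(values):
    """Parallel scheme: work item w owns index w and uses it only if needed."""
    out = {}
    for w in reversed(range(len(values))):  # arbitrary execution order
        if values[w] > THRESHOLD:
            out[w] = rand_at(KEY, w)
    return out

def serial_adapted(values):
    """Serial code changed to follow the parallel indexing (parallel-to-serial)."""
    return {i: rand_at(KEY, i) for i, x in enumerate(values) if x > THRESHOLD}

counter_indices = serial_counter_indices(data)
identical = serial_adapted(data) == parallel_run(data)
```

Indices that belong to work items failing the condition are simply never requested, which costs nothing with a stateless generator.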

Strengthen our team as a remote worker (freelancer)

In the past year we’ve been working on more internal projects and therefore we’re seeking strong GPU-coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family and working with advanced technologies. You will be on the newly formed international team.

Do understand that we have extra requirements for freelancers:

  • You have a personality for working independently.
  • You have your own computer with an OpenCL-capable GPU.
  • You have good internet (for doing remote access).

We offer a job in a well-known OpenCL-company with various interesting projects. You can improve your OpenCL skills and work with various hardware (modern GPUs, embedded processors, FPGAs and more).

Our hiring-procedure is as follows:

  • You send a CV and tell us why you are the perfect candidate.
  • After that you are invited for a longer online test, in which you show your skills in C/C++ and algorithms. You will receive a PDF with useful feedback. (3 hours)
  • We send you a GPU assignment. You need to pick out the right optimisations, code it and explain your decisions in detail. (Hopefully under 30 minutes)
  • If all goes well, you’ll have a videochat on personal and practical matters. You can also ask us anything, to find out if we fit you. (Around 1 hour)
  • If you and the company are a fit, then you’ll go to the technical round. (About 3 hours)
  • Made it to here? Expect a job-offer.

We’re looking forward to your application.

Apply for a job as OpenCL expert (freelancer) now!