Double the performance on AMD Catalyst by tweaking subgroup operations

AMD’s hardware was only used for less than half in case of scan operations in standard OpenCL 2.0.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are reduce and scan operations. Those patterns have been used in many OpenCL software and can now be implemented in a more straightforward way. The promise to the developers was that the vendors now can provide better performance using none or very little local memory. However, the promised performance wasn’t there from the beginning.

Recently, at StreamComputing we worked on improving performance of certain OpenCL kernels running specifically on AMD GPUs where we needed OpenGL-interop and thus chose Catalyst-drivers. It turned out that work-group and sub-group functions did not give the expected performance on both Windows and Linux.

Doubling the performance

We therefore created low-level code to implement our own sub-group functions. Below we present benchmark results of 4 different exclusive prefix sum (scan) operations running on AMD Radeon R9 Nano (Fiji), using Catalyst on Linux:

  • sub-group scan using local memory (red),
  • built-in sub-group function sub_group_scan_exclusive_add (orange),
  • our implementation of sub-group scan (green),
  • the built-in work-group function work_group_scan_exclusive_add (purple), and
  • as a reference a standard buffer copy operation (blue).

While analysing the results it’s important to remember that work-group scan performs prefix sum on all 256 work-items, whereas other scan functions work with wavefronts/sub-groups of 64 work-items.

Linux, Crimson Edition 15.12

As you can see our sub-group scan function is 2.3x faster compared to built-in sub-group function. Even the method with local memory was faster than the sub-group scan!

After those tests on Linux, we ran the same benchmarks for AMD Radeon R9 Nano (Fiji) on Windows, using the latest AMD drivers  – Crimson ReLive Edition 17.3.3:

Windows, Crimson ReLive Edition 17.3.3

Built-in sub-group function has turned out to be around 20% faster compared to what we got on Linux. On the other hand, the performance of the kernel which uses local memory dropped by 25%. Our sub-group scan function remained the best, getting even more GiB/s than on Linux.  However, what is very important is that those results proves that our solution is second to none when it comes to performance on both Linux and Windows.

It is important to add that on ROCm platform the performance built-in sub-group scan function and our implementation of sub-group scan were virtually the same. As ROCm doesn’t have OpenGL-interop yet, this was not possible to use.

Also you can have this in your code

Do your OpenCL kernels use work-group or sub-group functions? Would your kernels benefit from using sub-group operations? Is the expected performance lower than expected? If you answer is “yes” for at least one of those questions, then you should consider contacting us.

Not only scan on AMD Catalyst – we now have the framework to improve any specialised function on several platforms.

With the approach we used we have a freedom to implement other, more complex operations as sub-group functions, tailoring them to the requirements of algorithms and to the device, which should result in performance improvement and lower local memory usage.

The technique can be applied to platforms other than AMD and also to functions other than scan. This makes it possible to create missing functions for unsupported hardware features.

Win an OpenCL mug!

The first batch is in and you can win one from the second batch!

We’re sending a mug to a random person who subscribes to out newsletter before the end of 17 April 2017 (Central European Time). Yes, that’s a Monday.

Two winners

We’ll pick two winners: one from academia and one from industry. If you select “other” as your background, then share which category you fall in the last field.

Did you already subscribe and also want to win? I am not forgetting you – more details are in a newsletter next quarter.

More winners, by referring to a friend

If you refer a colleague, a friend or even a stranger to subscribe, you can both win a mug. Just be sure he/she remembers to mention you to me when I ask. Before you ask: the maximum referral-length is 5 (so referral of referral of referral of referral, etc) plus the one who started it.

UPDATE: If you win a mug and were not referred by somebody, you can pick a co-winner yourself. Joy should be shared.

Subscribe now!

* indicates required

Most interested in: *

Background: *

You can also use this link

Customer: “So you also do full projects?”

One of these moments when you find out that the company is not seen as I want it to be seen. Compared to generic software engineering companies, we have the advantage of creating software that is capable of processing more data.

For years we have promoted this unique advantage, for which we use OpenCL, CUDA, HIP and several other languages. Always having full projects in mind, but unfortunately not clearly communicating this.

During discussions with several existing and new customers, it became suddenly clear that we are seen as a company that fixes code, not one that builds the full code.

It became most clear that when was suggested to let us collaborate with another party, where our role would be to make sure they would not make mistakes regarding performance and code-quality.

  • Customer:   You can work with a team we hired before.
  • Us:             We also do full projects.
  • Customer:   Really?

This would mean we would be the seniors in the group, but not own the project – a suboptimal situation, as important design decisions could be ignored. Continue reading “Customer: “So you also do full projects?””

Should SPIRV be supported in CUDA?

Would you like to run CUDA-kernels on the OpenCL framework? Or Python or Rust? SPIRV is the answer! Where source-to-source translations had several limitations, SPIRV 1.1 even supports higher level languages like C++.

SPIRV is the strength of OpenCL and it will only get bigger.

Currently Intel drivers best supports SPIRV, making it the first target for the new SPIRV-frontends. It is unknown which vendor will be next – probably one which (almost) has OpenCL 2.0 drivers already, such as AMD, ARM, Qualcomm or even NVidia.

How interesting is SPIRV really?

So if SPIRV is really that important a reason to choose for OpenCL, I thought of framing it towards CUDA-devs on twitter in a special way:

Should CUDA 9 support SPIRV?

2 people voted no, 29 people voted yes.

So that’s a big yes for SPIRV-support on CUDA. And why not? Don’t we want to program in our own language and not be forced to use C or C++? SPIRV makes it possible to quickly add support in any language out there without official support of the vendor. Let wrappers handle the differences per vendor and let SPIRV be the shared language for GPU-kernels. Where do we need OpenCL or CUDA for, if the real work is defined by SPIRV?

What do you think? Leave your comment below how you see the future of GPGPU with SPIRV in town. Is 2017 the year of SPIRV?

And if you worked on a SPIRV-frontend, get in touch to continue your project on Yes, it’s empty right now, but you don’t know what’s hidden.

NVIDIA beta-support for OpenCL 2.0 on Linux

In the release notes for 378.66 graphics drivers for Windows (February 2017), NVIDIA officially spoke about supporting OpenCL 2.0 for the first time. Unfortunately, this is partial support only and, as NVIDIA said, these new [OpenCL 2.0] features are available for evaluation purposes only.

We did our own tests on a GTX 1080 on Windows and could confirm that for Windows the green team is halfway there. NVIDIA still has to implement pipes, enable non-uniform work-group sizes (this happens when in ND-range global_work_size is not divisible by the local_work_size), and fix a few bugs in device side enqueue.

Today we decided to test out NVIDIA latest driver (378.13) for 64-bit Linux and check its support for OpenCL 2.0.

NVIDIA, OpenCL 2.0 and Linux

Just like on Windows, our GTX 1080 reports that it is an OpenCL 1.2 devices. It is understandable since support for OpenCL 2.0 is only in beta stage. In the following table you’ll find an overview of the 2.0 functions supported by this Linux driver.

OpenCL 2.0 featureSupportedNotes
SVMYesOnly coarse-grained SVM is supported. Fine-grained SVM is not.
Device side enqueuePartially. Surprisingly, it
works better than
on Windows
Almost OpenCL programs with device side queue we have tested work.

Some advanced examples with multi-level device side kernel enqueuing
and/or CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP fail. When using device
side queue, it's only possible to use 1D nd-range with uniform work
groups (or without specifying local size). 2D and 3D nd-ranges
don't work.
Work-group functionsYes
PipesNoPipe functions are defined in in 378.13 drivers,
but using them cause run-time errors.
Generic address spaceYes
Non-uniform work-groupsNo
C11 AtomicsPartiallyUsing atomic_flag_* functions cause an CL_BUILD_ERROR error.
Subgroups extensionNo

The host-side functions clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties() are also present and working.

As you can see, the support for OpenCL 2.0 on Linux is almost exactly the same as on Windows. But in contrast with the Windows-drivers, we were able to successfully compile and run several more kernels that use device side queue. It may indicate that this feature is being actively developed and maybe in future drivers it will work much better – for both Linux and Windows.

What you can do to make it better

As NVIDIA only adds new functionality to OpenCL driver when requested, it is very important that they receive these requests. So when you or your employer is a paying customer, do keep requesting the features you need. Know that NVIDIA knows that lacking required functionality will be bad for their sales.

NVIDIA enables OpenCL 2.0 beta-support

In the release notes for NVIDIA 378.66 graphics drivers for Windows NVIDIA mentions support for OpenCL 2.0. This has been the first time in 3 years since OpenCL 2.0 has been launched, that they publicly speak about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any reference in release notes and were therefore officially unofficial.

You should know that only on 3 April 2015 NVIDIA finally started supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures. OpenCL 2.0 was already there for one and a half years (November 2013), now more than three years ago.

Does it mean that you will be soon able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

Continue reading “NVIDIA enables OpenCL 2.0 beta-support”

The 8 reasons why our customers had their code written or accelerated by us

Making software better and faster.

In the past six years we have helped out various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised in three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these.

Reduce processing time

When people hear about Software Performance Engineering, the first thing they think about is reducing processing time. The largest part of our customers have hired us for this task, but all had different time-requirements.

Meeting timing requirements

The most usual type of project has very specific timing-requirements. Some examples:

  • When there is 10 minutes of time to respond, the computations cannot take any second later.
  • On cloud applications, the time is more often less than half a second.
  • When processing data for a customer using the latest info, the time should not take more than “Let me get your file”.
  • When a plane lands, there should be no “Just hold a minute”

We did not only use GPUs (OpenCL or CUDA) to make our customer’s code faster, but also redesigned algorithms to be more efficient. In a few cases we could get the timing-requirement by optimising the code – there was no need to port to the GPU anymore.

Increasing user efficiency

Getting hours to minutes or minutes to seconds.

When employees are waiting for the computer to finish, this reduces efficiency. Some examples:

  • Handling customer-data at the reception, so more customers can be helped.
  • Reducing a daily batch job of 2 hours to minutes, to minimise overtime.
  • Importing new data into the system in less time, so the user can more focus on quality.

We found that users feel less powerless when pressure increases, as they have more control of the time and the software. Before the speedup they felt controlled by the software.

Increasing the responsiveness

From seconds or even minutes to milliseconds.

When a system does not react immediately, trust in the system goes down. Some examples:

  • For creative software the user needs to be in a flow and not have to wait for each small step taken. This is because slower software reduces the number of items in one’s short term memory.
  • For data analysts getting a feeling for the data takes some “wandering around”. Responsive software reduces learning time.

Besides standard performance engineering, there are more options to make the user interaction more immediate and snappy. For instance a “low resolution” result can let the user preview the results.

Reduce latency

Getting from seconds to milliseconds, or from milliseconds to microseconds.

Where responsiveness deals with users, latency describes automated systems where microseconds matter. The requirements to maximum processing time are often very strict and real-time operating systems could be used. Some examples:

  • Real time image processing for video-streams.
  • Feedback-loops for machine control to reduce operational errors.
  • High-speed networking applications, such as finance.

We choose FPGAs when the latency needs to be in the low microsecond, and GPUs when latency has to be in the millisecond range. Work we did here included porting from CPU to GPU and from GPU to FPGA.

Increase functionality and data sizes

The goal in this category is the same as the previous, but the problem is described often from the perspective of features and data size. This is because the time is not seen as a current problem, but as a future problem.

Add extra functionality

Same processing time, more functionality.

The processing time is described as a disadvantage, but is not a complaint – there is understanding the computations are intensive. On the other hand the customers request extra features. Some examples:

  • Applying extra image improvement algorithms on the video-stream.
  • Applying an alternative algorithm to the same data.

In cases where we also improved the existing software for performance, total processing time for more data went down.

Increase simulation/data sizes

Same processing time, tenfold the data.

Each year the data size increases more than the performance of computers increase. Where a single server was enough 3 years ago, now a small cluster has to be rented.

To cope with this explosive growth we are asked to remodel the software that ports well to current end future high-performance processors. Some examples:

  • Promoting prototypes to production software.
  • Going from 1D data to 2D and 3D.
  • Cross-analysing 10,000 shoppers instead of 100.
  • Doing a weather-model for the whole of Europe instead of one country only.
  • Improving a stochastic model.
  • Using higher resolution data.

This is the most common type of problem we solve in this category. Especially proven-to-work models (including prototypes) are chosen to work on larger data sets than they were designed for.

Reduce operational costs

When the operation scales up, the operational costs can increase exponentially. Some of our customers identified the problem werll in advance and let us reduce their operational costs before it got out of hand.

Reducing the server count

Same performance, less hardware.

Processing data can take 10s to 100s of servers, increasing power and maintenance costs and thus lowering the performance/€ or performance/$.

  • If the computations are the limiting factor, ten dual-socket servers can be replaced by single GPU or FPGA-server, it will be much easier to double the capacity.
  • Computations can be moved to the user, by doing pre-processing on the mobile phone or desktop. By using the GPU, the device’s battery doesn’t get drained.

We’ve helped early scale-ups who identified the problem of the operational costs sky-rocket when the code is not optimised.

Reducing power usage

Same performance, less Watt.

It’s all about performance/Watt. Some examples:

  • A GPU can take 200 to 300 Watt, but algorithms can take 10 time less than a 100 Watt CPU.
  • On smartphone, porting code to the GPU reduces the power usage.
  • When an FPGA can be used (i.e. networking), there are options to replace the full server by a single 20-30 Watt FPGA.
  • Porting code to GPUs using HBM, which uses much less memory.

Reducing power usage on portable devices has been the most common use case here.

Recognise one of these problems?

Call or email us to start discussing your problem. In a one week effort of analysing your code and discussing with the developers, we can often provide a good indication how much time the performance improvement can take.

Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Below explains both programmes in detail. If you want to learn more about PRACE, click here.

2017 International Summer School on HPC Challenges in Computational Sciences – Applications due 6 March 2017

The summer school is sponsored by Compute/Calcul Canada, the Extreme Science and Engineering Discovery Environment (XSEDE), the Partnership for Advanced Computing in Europe (PRACE) and the RIKEN Advanced Insti­tute for Computational Science (RIKEN AICS).

Graduate students and postdoctoral scholars from institutions in Canada, Europe, Japan and the United States are invited to apply for the eighth International Summer School on HPC Challenges in Computational Sciences, to be held June 25 – 30 2017, in Boulder, Colorado, United States of America.

Leading computational scientists and HPC technologists from the U.S., Europe, Japan and Canada will offer instructions on a variety of topics and also provide advanced mentoring.
Topics include:

  • HPC challenges by discipline
  • HPC programming proficiencies
  • Performance analysis & profiling
  • Algorithmic approaches & numerical libraries
  • Data-intensive computing
  • Scientific visualization
  • Canadian, EU, Japanese and U.S. HPC-infrastructures

For more details please visit:

PRACE Summer of HPC 2017 – Applications due 19 February 2017

The PRACE Summer of HPC is a PRACE outreach and training programme that offers summer placements at top HPC centres across Europe to late-stage undergraduates and early-stage postgraduate students. Up to twenty top applicants from across Europe will be selected to participate. Participants will spend two months working on projects related to PRACE technical or industrial work and produce a report and a visualisation or video of their results.

Early-stage postgraduate and late-stage undergraduate students are invited to apply for the PRACE Summer of HPC 2017 programme, to be held in July & August 2017. Consisting of a training week and two months on placement at top HPC centres around Europe, the programme affords participants the opportunity to learn and share more about PRACE and HPC, and includes accommodation, a stipend and travel to their HPC centre placement.

The programme will run from 2 July to 31 August 2017, with a kick-off training week at IT4I Supercomputing Centre in Ostrava attended by all participants. Flights, accommodation and a stipend will be provided to all successful applicants. Two prizes will be awarded to the participants who produce the best project and best embody the outreach spirit of the programme.

For more details please visit:

This is a unique chance to get in touch with HPC. If you have friends who could use HPC for their research, make sure they learn about these summer activities.

How many threads can run on a GPU?

Blocks of Threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: at a minimum around 4 billion can be scheduled, 10’s of thousands can run simultaneously.

If you are used to work with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient. There are a few reasons why a GPU has been designed to handle so many threads. Read further…

NOTE: The below description is a (very) simplified model with the purpose to explain the basics. It is far from complete, as it would take a full book-chapter to explain it all. Continue reading “How many threads can run on a GPU?”

Funded PhD internships at StreamComputing

We have several wishes for 2017 and two of them are to make code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available and two are at StreamComputing.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.


The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2016. When you apply for one of the internships, you must update your profile page including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (€5000 for 3 months) + travel expenses (maximum €500). The main goal is to gain experience. Even if you don’t get a job after the internship, you tap into our network.

Continue reading “Funded PhD internships at StreamComputing”

IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks-off with a full-day Advanced Hands-On OpenCL tutorial which is followed by two-days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. Deadline is beginning of February, so better submit the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.

We have been awarded the Khronos project to upgrade the OpenCL test suite to 2.2!

Some weeks ago we started with implementing the Compiler Test Suite for OpenCL 2.2. The biggest improvement of OpenCL 2.2 is C++ kernels, which originally was planned for 2.1. SPIRV 1.1 is another big improvement.

We are very happy to have a part in making OpenCL better! We find OpenCL C++ kernels very important, even if it has its limitations. Thanks to SPIRV 1.1 it gets easier to have more (unofficial) kernel languages next to C and C++, and to get SYCL. Also upgrading from 2.0 to 2.2 is rather easy thanks to the open source libclcxx.

Personally I found this project to also be very important for our internal knowledge building, as almost every function would be touched and discussed.

OpenCL 2.2 CTS RFQ has been awarded to StreamComputing

Khronos issued a Request For Quote (RFQ) back in September 2016 to enhance and expand the existing OpenCL 2.1 conformance tests to create an OpenCL 2.2 test suite to be used to define conformance for OpenCL 2.2 implementations. The contract has been awarded to StreamComputing. StreamComputing is a software consultancy company specialized in performance tuned software development for CPU, GPU and FPGA. A large part of their clients hires them for their OpenCL expertise.

Already improvements have been added, bugs splatted and documentation improved. We hope to continue this the coming months!

We’ll be ready in March. Hopefully the first implementations are ready by then, as there is a test suite ready to iron out any bug discovered. Which three OpenCL drivers do you think will be first to have OpenCL 2.2? Intel, AMD, NVidia, ARM, Imagination, Qualcomm, TI, Intel FPGA (Altera), Xilinx, Portable OpenCL or another?

AMD gets into Machine Intelligence with “MI” range of hardware and software

Always good to have a share out of that curve.

In June we wrote on “AMD is back!“, where this is one of the blog posts with more details in a specific direction. This post is about AMD specifically targeting machine learning with the MI ( = Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

Recently released have been the “Radeon Instinct” series, which purely focus on compute. How the new naming of AMD is organised will be discussed in a separate blog post.

For fast deep learning you need two things: extremely fast memory and lots of FLOPS at 16-bit. AMD happens to have developed HBM2, the world’s fastest memory and now available to everybody. So AMD only needed to beat the NVIDIA P100 on FLOPS, and they did: the AMD “MI25” is expected to deliver around 25 TFLOPS for 16-bit operations. If you want to know more, lots of new links show up daily on Google.

This means that AMD is beating NVIDIA’s top-range GPUs again. Add NVlink-competitor CCIX and it’s clear that AMD is a strong competitor again, as they used to. The only problem is that much of the software is written in CUDA…

Software: porting from CUDA

AMD’s Greg Stoner, Director of Radeon Open Compute, opened up today on the current state of their software (typos fixed):

If you guys saw the Radeon Instinct launch you will find we finally announced our big push into Deep Learning. Here is good article

We will be delivery HIP version of Caffe, Tensorflow, Torch7, MxNet, Theano, CNTK, Chainer, all supporting our new MIOpen – our new Deep Learning solver.

Since the everyone is interested in Tensorflow

Note this will run on AMD and NVIDIA hardware


The status of Eigen is “35 out of 43”, which is a rather vague description but an indication nevertheless. Eigen is a very important part of TensorFlow. A good promise that the code will be ready when the new VEGA hardware is launched.

Also interesting the the mention of MIOpen. It has been discussed on TechReport:

This library offers a range of functions pre-optimized for execution on Radeon Instinct cards, like convolution, pooling, activation, normalization, and tensor operations. AMD says that convolution operations performed with MIOpen are nearly three times faster than those performed using the widely-used “general matrix multiplication” (GEMM) function from the standard Basic Linear Algebra Subprograms specification. That speed-up is important because convolution operations make up the majority of program run time for a convolutional neural network, according to Google TensorFlow team member Pete Warden.

The reason why they can deliver so many software-ports in such limited time with a small team, is because of HIP. This makes it possible to port CUDA code to HIP, which runs on both AMD and NVIDIA.

We personally also had good experience with porting code to HIP. If you need CUDA code to be ported to AMD, know we tend to make the code faster and solve previously undiscovered bugs during the porting process.

Opinions crossing the table: Khronos for world peace

Pragmas not being mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lost of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This means that a programming language is just like human languages, as new things get named and described world-wide. This also shows the hard work the development of languages bring, as new feature-requests are a constant. Continue reading “Opinions crossing the table: Khronos for world peace”

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

quartusTo temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smooth – but at least smoother than upgrading packages to required versions on RedHat/CentOS. While the download says “Linux” and you’re expecting support for multiple Linux breeds, there is only official support for Redhat 6.5 (and CentOS).

Luckily it was very possible to have a stable installation of Quartus on Ubuntu. As information on this subject was squattered around the net and even incomplete, we decided to share our howto in this blogpost. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc, as most problems are due to new features and more up-to-date libraries than are provided in RedHat/CentOS.

Note1 : we did not install the FPGA on the Ubuntu-machine and neither fully researched potential problems for doing so – installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees if it all works on other version. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux”

Accelerating Excel with OpenCL

excel-openclOne of the world’s most used software is far from performance optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough with really heavy calculations. Another option is to use much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating Excel with OpenCL”

Online Tutorials are here

46188854 - beautiful smiling female student using online education service. young woman looking in laptop display watching training course and listening it with headphones. modern study technology concept
Online training

We’re going online with our presentations and tutorials. This makes it easy to reach more people and make our trainings more flexible.

We’re starting with short introductory trainings, but we have bigger plans. Keep an eye on our events (shared on Twitter, LinkedIn, this blog and the newsletter) to see what the offerings are. And you’re very welcome to join!

On 4 October (new date) there will be an OpenCL 101 of two hours for free. Target timezone is East-America and Europe.

Agenda Online OpenCL 101

  • Introductions (20 minutes)
    • StreamComputing
    • GPUs and paralellism
    • OpenCL
  • By example: Getting started with OpenCL (30 minutes)
  • By example: Porting a simple program to OpenCL (30 minutes)
  • Q&A in parallel (30 minutes). Ask us any question, for instance:
    • General OpenCL.
    • OpenCL on GPUs.
    • OpenCL on FPGAs.
    • What algorithms work well with GPUs, CPUs and FPGAs.
    • StreamComputing services.
  • The next steps (5 minutes).
  • Closing words (5 minutes).

Read more here…

Tutorial server

You can already test if the tutorial server works for you by looking around in our demo room. The tutorial itself will be in another room. Use your own name and password “ap“.

See you soon!

How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)

Hampstead flooding

How water moves through an area given a certain pace of instream, can be fully simulated. We got a request to make such simulation faster, as it took already too much time to do moderate simulations. As the customer wanted to be able to have more details, larger areas and more alternative situations computed, the current performance did not suffice.

The code was already ported to MPI to scale to 8 cores. This code was used as a base for creating our optimised GPU-code. Using a single GPU we managed to get an 44 to 58 times speedup over single core CPU, which is 5 to 7 times faster than MPI on 8 to 32 CPU cores.

For larger experiments we could increase the performance advantage over MPI-code from 7 times to a total of 35 times, using multiple GPUs.

We solved both the weak-scaling problem and the mapping on GPUs

If you add the 9x speedup of the initial performance-optimisation, the total is over 2600x. What could be done in a year, now can be done in 3.5 hours. This clearly shows the importance of software performance engineering. Most code already had some optimisations applied (just like here) and 5 to 7 times speedup is quite achievable.

Read below for some more details. Continue reading “How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)”

Get ready for conversions of large-scale CUDA software to AMD hardware

IMG_20160829_172857_croppedIn the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA, such that it works on both AMD and NVIDIA hardware. That 5% is solving ambiguity problems that one gets when CUDA is used on non-NVIDIA GPUs. Once the CUDA-code has been translated successfully, software can run on both NVIDIA and AMD hardware without problems.

The target group of HIP are companies with older clusters, who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by 3 dual-GPU FirePro S9300X2 GPUs of 11 TFLOPS will give a huge performance boost for a competitive price.

The costs of making CUDA work on AMD hardware is easily paid for by the price difference, when upgrading a GPU-cluster.

Continue reading “Get ready for conversions of large-scale CUDA software to AMD hardware”

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

GLXgearsThe information you find everywhere: on Linux the current “radeon” and “fglrx” are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new and open source drivers ready, fglrx is now deprecated and will not get updates (or very late). I therefore can get to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat has such old kernels, there currently is no issue. Fedora users simply have a problem, as they already go towards 4.8.

Continue reading “Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support”

CUDA Compute Capability 6.1 Features in OpenCL 2.0

On the CUDA page of Wikipedia there is a table with compute capabilities, as shown below. While double checking support for AMD Fijij GPUs (like Radeon Nano and FirePro S9300X2) I got curious how much support is still missing in OpenCL. For the support of Fiji it looks like there is 100% support of all features. For OpenCL 2.0 read on.

CUDA features per Compute Capability on Wikipedia

Continue reading “CUDA Compute Capability 6.1 Features in OpenCL 2.0”