What is Khronos as of today?

The Khronos Group is the organization behind APIs like OpenGL, Vulkan and OpenCL. Over one hundred companies are a member and decide together what your next year phone, camera, computer or media device will be capable of.

We’re at the right, near the bottom.

We work most with OpenCL, but you probably noticed we work with OpenGL, Vulkan and SPIR too. Currently they have the following APIs:

  • COLLADA, a file-format intended to facilitate interchange of 3D assets
  • EGL, an interface between Khronos rendering APIs such as OpenGL ES or OpenVG and the underlying native platform window system
  • glTF, a file format specification for 3D scenes and models
  • OpenCL, a cross-platform computation API.
  • OpenGL, a cross-platform computer graphics API
  • OpenGL ES, a derivative of OpenGL for use on mobile and embedded systems, such as cell phones, portable gaming devices, and more
  • OpenGL SC, a safety critical profile of OpenGL ES designed to meet the needs of the safety-critical market
  • OpenKCam, Advanced Camera Control API
  • OpenKODE, an API for providing abstracted, portable access to operating system resources such as file systems, networks and math libraries
  • OpenMAX, a layered set of three programming interfaces of various abstraction levels, providing access to multimedia functionality
  • OpenML, an API for capturing, transporting, processing, displaying, and synchronizing digital media
  • OpenSL ES, an audio API tuned for embedded systems, standardizing access to features such as 3D positional audio and MIDI playback
  • OpenVG, an API for accelerating processing of 2D vector graphics
  • OpenVX, Hardware acceleration API for Computer Vision applications and libraries
  • OpenWF, APIs for 2D graphics composition and display control
  • OpenXR, an open and royalty-free standard for virtual reality and augmented reality applications and devices
  • SPIR, a intermediate compiler target for OpenCL and Vulkan
  • StreamInput, an API for consistently handling input devices
  • Vulkan, a low-overhead computer graphics API
  • WebCL, a JavaScript binding to OpenCL within a browser
  • WebGL, a JavaScript binding to OpenGL ES within a browser on any platform supporting the OpenGL or OpenGL ES graphics standards

Too few people understand that the organization is very unique, as the biggest processor vendors are discussing collaborations and how to move the market, while they’re normally the fiercest competitors. Without Khronos it would have been a totally different world.

AMD ROCm 1.5 Linux driver-stack is out

ROCm is AMD’s open source Linux-driver that brings compute to HSA-hardware. It does not provide graphics and therefore focuses on monitor-less applications like machine learning, math, media processing, machine vision, large scale simulations and more.

For those who do not know HSA, the Heterogeneous Software Architecture defines hardware and software such that different processor types (like CPU, GPU, DSP and FPGA) can seamlessly work together and have fine-grained memory sharing. Read more on HSA here.

About ROCm and it’s short history

The driver stack has been on Github for more than a year now. Development is done internally, while communication with users is done mostly via Gitlab’s issue tracker. ROCm 1.0 was publicly announced on 25 April 2016. After version 1.0, there now have been 6 releases in only one year – the 4 months of waiting time between 1.4 and 1.5 was therefore relatively long. You can certainly say the development is done at a high pace.

ROCm 1.4 was released end of December and besides a long list of fixed bugs, it had the developer preview of OpenCL 2.0 kernel support added. Support for OpenCL was limited to Fiji (R9 Fury series) and Baffin/Ellesmere (Radeon RX 400 series) GPUs, as these have the best HSA support of current GPU offerings.

Currently not all parts of the driver stack is open source, but the binary blobs will be open sourced eventually. You might think why a big corporation like AMD would open source such important part of their offering. This makes totally sense if you understand that their most important customers spend a lot of time on making the drivers and their code work together. By giving access to the code, debugging becomes a lot easier and will reduce development time. This will result in less bugs and a shorter time-to-market for the AMD-version of the software.

The OpenCL language runtime and compiler will be open sourced soon, so AMD offers full OpenCL without any binary blob.

What does ROCm 1.5 bring?

Version 1.5 adds improved support for OpenCL, where 1.4 only gave a developer preview. Both feature-support and performance have been improved. Just like in 1.4 there is support for OpenCL 2.0 kernels and OpenCL 1.2 host-code – the tool clinfo mentions there is even some support of 2.1 kernels, but we haven’t fully tested this yet.

The command-line based administration (ROCm-SMI) adds power monitoring, so power-efficiency can be measured.
The HCC compiler was upgraded to the latest CLANG/LLVM. There also have been big improvement in C++ compatibility.

Other improvements:

  1. Added new API hipHccModuleLaunchKernel which works exactly as hipModuleLaunchKernel but takes OpenCL programming models launch parameters. And its test
  2. Added new API hipMemPtrGetInfo
  3. Added new field to hipDeviceProp_t -> gcnArch which returns 803, 700, 900, etc.,

Bug fixes:

  1. Fixed Copyright and header names
  2. Fixed issue with bit_extract sample
  3. Enabled lgamma and lgammaf
  4. Added guard for GFX8 specific intrinsics
  5. Fixed few issues with operator overloading of vector data types
  6. Fixed atanf
  7. Added guard for __half data types to work with clang version more than 3. (Will be removed eventually).
  8. Fixed 4_shfl to work only for gfx803 as hawaii don’t support permute ops

Current hardware support:

  • GFX7: Radeon R9 290 4 GB, Radeon R9 290X 8 GB, Radeon R9 390 8 GB, Radeon R9 390X 8 GB, FirePro W9100 (16GB), FirePro S9150 (16 GB), and FirePro S9170 (32 GB).
  • GFX8: Radeon RX 480, Radeon RX 470, Radeon RX 460, Radeon R9 Nano, Radeon R9 Fury, Radeon R9 Fury X, Radeon Pro WX7100, Radeon Pro WX5100, Radeon Pro WX4100, and FirePro S9300 x2.

If you’re buying new hardware, pick a GPU from the GFX8 list. FirePro S9300 X2 is currently the server-grade solution of choice.

Keep an eye on the Phoronix website, which is usually first with benchmarking AMD’s open source drivers.

Install ROCm 1.5

Where 1.4 had support for Ubuntu 14.04, Ubuntu 16.04 and Fedora 23, 1.5 added support for Fedora 24 and dropped support for Ubuntu 14.04 and Fedora 23. On other distributions than Ubuntu 16.04 or Fedora 24 it *could* work, but there are zero guarantees.

Follow the instructions on Github step-by-step to get it installed via deb or rpm. Be sure to uninstall any previous release of ROCm to avoid problems.

The part on Grub might not be clear. For this release the magic GRUB_DEFAULT line on Ubuntu 16.04 is:

GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76"

You need to alter this line with every update, else it’ll keep using the old version.

Make sure “/opt/rocm/bin/” is in your PATH when wanting to do some coding. When running the test, you should get:

/opt/rocm/hsa/sample$ sudo make
gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
/opt/rocm/hsa/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.

Also clinfo (installed from the default repo) should work.

Got it installed and tried your code? Did you see improvements? Share your experiences in the comments!

Not really ROCk-music, but this blog has been written while listening to the latest album of the Gorillaz

DHPCC++ Program known

During IWOCL a workshop takes place that discusses the opportunities that C++ brings to OpenCL-enabled processors. A well-know example is SYCL, but various other approaches are talked about.

The Distributed & Heterogeneous Programming in C/C++ Workshop just released their program:

  • HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking.
  • SYCL C++17 and OpenCL interoperability experimentation with triSYCL.
  • KART – A Runtime Compilation Library for Improving HPC Application Performance.
  • Using SYCL as an Implementation Framework for HPX Compute.
  • Adding OpenCL to Eigen with SYCL.
  • SYCL-BLAS: Leveraging Expression Trees for Linear Algebra.
  • Supporting Distributed and Heterogeneous Programming Models in ISO C++.
  • Towards an Asynchronous Data Flow Model for SYCL 2.2.
  • Communicating Execution Contexts using Channels.
  • Towards a Unified Interface for Data Movement.
  • Panel: What do you want in C++ for Heterogeneous.

As C++ is an important direction for OpenCL, we expect to see most of the discussions on programmability of OpenCL-enabled processors be done here.

The detailed program is to be released soon. For more information see the DHPCC++ webpage. At the IWOCL website you can buy tickets and passes – combining with IWOCL gives a discount.

IWOCL 2017 – all the talks

An overview of all the tutorials and talks for easy reading.

You can also download the PDF.

Heterogeneous Computing Using Modern C++ with OpenCL Devices – Rod Burns and Ruyman Reyes (Codeplay)

This hands-on session will provide an opportunity to get experience with SYCL using ComputeCpp™ Community Edition, a free to use implementation of the SYCL 1.2 standard. Attendees will be shown how to set up ComputeCpp and use it to write their own SYCL code to run on supported GPUs and CPUs.

SYCL is already able to dispatch to heterogeneous devices and it implements C++17 ParallelSTL, augmenting it with ability to dispatch to GPUs in addition to CPUs. This tutorial will demonstrate how to write parallel SYCL code and how to use the Khronos Group’s experimental Parallel STL implementation. The course outline is as follows

  • Start with a basic SYCL program that shows how to submit queues in a single task and stream-like object, comparing CPU, SYCL and OpenCL versions
  • Demonstrate how to access data across host and GPUs using buffers and accessors, the importance of life-time, and basic parallel constructs

Attendees are expected to have programming experience with C++ and a laptop either running Linux or having a VM manager installed such as VirtualBox. The required software will be provided on USB-sticks. This course is suitable for beginners, but is focused on intermediate to advanced parallel programming using C++.

Harnessing the Power of FPGAs with the Intel FPGA SDK for OpenCL- Byron Sinclair, Andrew Ling and Genady Paikin (Intel)

In this tutorial, we will introduce you to the reconfigurable hardware architecture and programming of Field Programmable Gate Arrays (FPGAs).

You will learn why FPGAs have become so popular in recent years, and understand the many advantages of using FPGAs in your HPC application. In particular, we will cover architectural features of FPGAs that make them well suited to many complex operations, including matrix multiplications and convolutions. In addition, we will introduce you to programming FPGAs using the Intel® FPGA SDK for OpenCL, and how specific OpenCL coding techniques can lead to efficient circuits implemented on the FPGA.

Finally, we will go over several case studies where FPGAs have shown very competitive performance when programmed using OpenCL, including convolutional neural nets, FFTs, and astronomy de-dispersion algorithms.

Unlock Intel GPUs for High Performance Compute, Media and Computer Vision Capabilities with Intel OpenCL Extensions – Jeff Mcallister, Biju George, Adam Herr and Ben Ashbaugh (Intel)

The keys to unlock the full performance potential of Intel GPUs for emerging workloads in general compute, media, computer vision, and machine learning are in the rich suite of Intel OpenCL extensions. These give developers direct access to unique Intel hardware capabilities, which until now have been difficult to master.
This tutorial builds step by step with multiple examples, including:

  • How to write high performance general compute applications based on the core concept of OpenCL subgroups.
  • How to use additional subgroup operations described in the Intel subgroups and media block read/write extensions.
  • Then using the framework of subgroups, we explain the device-side motion estimation extension which leverages the unique Intel GPU media sampler to accelerate motion estimation operations from OpenCL kernels.
  • Finally we explain the Video Enhancement (VEBOX) extension, which is an OpenCL host level API extension to leverage a powerful media fixed function unit to accelerate many frame level video enchancement operations.

Faster, smarter computer vision with AI and OpenCL – Uri Levy and Jeffrey Mcallister (Intel)

Learn how to use Intel machine learning and computer vision tools to get from concept to market faster for machine learning applications based on OpenCL and OpenVX. Build two example scenarios: autonomous driving with FPGA inference and a smart camera app using Intel Graphics inference. This presentation will show how a unified set of tools can reduce the complexity of developing heterogeneous machine learning apps – from training a model with input images, to creating a custom classifier, to building an optimized traditional computer vision pipeline around the classifier to create a full computer vision application

GPGPU Acceleration using OpenCL for a Spotlight SAR Simulator – Eric Balster, Jon Skeans and David Fan (University of Dayton) Marc Hoffman (US Air Force Research Laboratory)

In this paper, OpenCL is used to target a general purpose graphics processing unit (GPGPU) for acceleration of 2 modules used in a synthetic aperture radar (SAR) simulator. Two of the most computationally complex modules, the Generate Return and Back Projection modules, are targeted to an AMD FirePro M5100 GPGPU. The resulting speedup is 2.5X over multi-threaded C++ implementations of those algorithms running on an 8-core Intel I7 2.8GHz processor, 5X over singlethreaded C++ implementations, and 24X over native MATLAB implementations, on average.

Near Real-Time Risk Simulation of Complex Portfolios on Heterogeneous Computing Systems with OpenCL – Javier Alejandro Varela and Norbert Wehn (University of Kaiserslautern)

In this work, we exploit OpenCL to efficiently map the nested simulation of complex portfolios with multiple algorithms on heterogeneous computing systems. Code portability and customizations allow us to profile the kernels on different accelerating platforms, such as CPU, Intel’s Xeon Phi and GPU. The combination of OpenCL, a new bit-accurate algorithmic optimization and the extension of an existing numerical interpolation scheme allows us to achieve 1000x speedup compared to the state-of-the-art approach. Our system design minimizes costly host-device transfers and global memory, enabling complex portfolios to be easily scaled.

A Performance and Energy Evaluation of OpenCL-accelerated Molecular Docking – Leonardo Solis Vasquez and Andreas Koch (Technische Universität Darmstadt)

This work presents an OpenCL implementation of AutoDock, and a corresponding performance evaluation on two different platforms based on multi-core CPU and GPU accelerators. It shows that OpenCL allows highly efficient docking simulations, achieving speedups of ∼4x and ∼56x over the original serial AutoDock version, as well as energy efficiency gains of ∼2x and ∼6x. respectively. To the best of our knowledge, this work is the first one also considering the energy efficiency of molecular docking programs.

Assessing the feasibility of OpenCL CPU implementations for agent-based simulations – Nuno Fachada and Agostinho Rosa (Instituto Superior Técnico, Portugal)

In this paper we evaluate the feasibility of using CPU-oriented OpenCL for high-performance simulations of agent-based models. We compare a CPU-oriented OpenCL implementation of a reference ABM against a parallel Java version of the same model. We show that there are considerable gains in using CPU-based OpenCL for developing and implementing ABMs, with speedups up to 10x over the parallel Java version on a 10-core hyper-threaded CPU.

Enabling FPGAs as a True Device in the OpenCL Standard – Vincent Mirian and Paul Chow (University Of Toronto)

As FPGA capacities continue to increase, the ability to partition and partially reconfigure the FPGA will become even more desirable. The fundamental issue is how FPGAs are currently viewed as devices in the OpenCL model. In this paper, we propose a small change to the OpenCL definition of a device that unlocks the full potential of FPGAs to the programmer.

Applying Models of Computation to OpenCL Pipes for FPGA Computing – Nachiket Kapre and Hiren Patel (University of Waterloo)

We propose imposing a communication discipline inspired from models of computation (e.g.Ptolemy) such as SDF (synchronous dataflow), bulk synchronous (BSP), or Discrete Event (DE). These models offer a restricted subset of communication patterns that enable implementation tradeoffs and deliver performance and resource guarantees. This is useful for OpenCL developers operating within the constraints of the FPGA device. We hope to facilitate a preliminary analysis and evaluation of supporting these patterns in OpenCL and quantifying associated FPGA implementation costs.

Accelerating Applications at Cloud Scale using FPGAs – Sarah Siripoke, Fernando Martinez Vallina and Spenser Gilliland (Xilinx)

The acceptance and success of cloud computing has given application developers access to computing and new customers at a scale never seen below. The inherent ability of an FPGA to reconfigure and be workload optimized is a great advantage given the fast-moving needs of cloud computing applications. In this talk we will discuss how users can develop, accelerate and deploy accelerated applications in the cloud at scale. You will learn how to get started on a turn-key OpenCL development environment in the cloud using Xilinx FPGAs.

Creating High Performance Applications with Intel’s FPGA OpenCL SDK – Andrew Ling, Utku Aydonat, Davor Capalija, Shane O’Connell and Gordon Chiu (Intel)

After decades of research, High-Level Synthesis has finally caught on as a mainstream design technique for FPGAs. However, achieving performance results that are comparable to designing at a hardware description level still remains a challenge. In this talk, we illustrate how we achieve world class performance results on HPC applications by using OpenCL. Specifically, we show how we achieve 1Tflop of performance on a matrix multiply and over 1.3Tflops on a CNN application, run on Intel’s 20nm Arria 10 FPGA device. Finally, we will describe spatial coding techniques that lead to efficient structures, such as systolic-arrays, to ensure that the FPGA runs efficiently.

Symphony – Task Scheduling and Memory Management in Heterogeneous Computing – Amit Jindal and Wenjia Ruan (Qualcomm Technologies)

Task scheduling and memory management are challenges that make Heterogeneous Computing difficult for the masses. There are several programming models and tools that exist targeting partitioning of workload and accessibility of data between CPU and GPU. We have developed and deployed Symphony SDK – a framework that makes workload partitioning, scheduling and memory management ‘simple’ for developers. In this talk, we will introduce Symphony architecture, elaborate how existing OpenCL kernels can be reused with heterogeneous task synchronization, task scheduling, and memory management capabilities of Symphony. We will also share real-world cases where Symphony has provided 2x-6x performance speed-ups.

CUDA-on-CL: A compiler and runtime for running modern CUDA c++11 applications on OpenCL 1.2 devices – Hugh Perkins (ASAPP)

Cuda-on-cl addresses the problem of creating and maintaining OpenCL forks by leaving the reference implementation entirely in NVIDIA CUDA, and writing both a compiler and a runtime component, so that any CUDA c++11 application can in theory be compiled and run directly on any OpenCL 1.2 device. We use Tensorflow framework as a case-study, and demonstrate the ability to run Tensorflow and Eigen kernels directly, with no modification to the original CUDA source-code. Performance studies are also undertaken, and show that the cuda-on-cl program runs at about 25% of the original CUDA-compiled version.

OpenCL in Scientific High Performance Computing—The Good, the Bad, and the Ugly – Matthias Noack (Zuse Institute Berlin)

We present experiences with utilising OpenCL alongside C ++ , MPI, and CMake in two real-world scientific codes. Our targets are a Cray XC40 supercomputer with multi- and many-core (Xeon Phi) CPUs, as well as multiple smaller systems with Nvidia and AMD GPUs. We shed light on practical issues arising in such a scenario, like the interaction between OpenCL and MPI, discuss solutions, and point out current limitations of OpenCL in the domain of scientific HPC from an application developer’s and user’s point of view.

Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices – Andrew Richards, Mehdi Goli and Luke Iwanski (Codeplay)

Codeplay has been working with Google to add SYCL back-end support in TensorFlow, one of the most popular machine learning frameworks, enabling developers to use OpenCL devices with their machine learning applications. SYCL provides an abstraction layer that simplifies parallel development, giving developers access to the computing power of OpenCL devices and reducing the amount of code required. Andrew Richards will talk about how machine learning applications can harness the power of OpenCL using open standards and how, by using SYCL, TensorFlow can be extended to include customized operations running on OpenCL devices.

Analyzing and improving performance portability of OpenCL applications via auto-tuning – James Price and Simon McIntosh-Smith (University of Bristol)

In this talk, we present an approach for analyzing performance portability that exploits that black-box nature of automatic performance tuning techniques. We demonstrate this approach across a diverse range of GPU and CPU architectures for two simple OpenCL applications. We then discuss the potential for auto-tuning to aid the generation of performance portable OpenCL kernels by incorporating multi-objective optimization techniques into the tuning process.

Wavefront Parallel Processing on GPUs with an Application to Video Encoding Algorithms – Biju George and Ben Ashbaugh (Intel)

In this presentation we focus on the application of the wavefront pattern to design efficient GPGPU implementations of video encoding algorithms using OpenCL kernels. We present our experiences in implementing and evaluating four solutions of WPP for inter and intra estimation for AVC on GPUs. We explain the reasoning behind each solution and present the results of our analysis.

Challenges and Opportunities in Native GPU Debugging with OpenCL – Uri Levy (Intel)

In this technical session we’ll present the open architectural design of the debugger and how it fits into the OpenCL JIT compilation flow and the underlying compute technology of the HW with focus on Intel processor graphics. We’ll demonstrate a show case on how to natively work with the debugger to solve functional bugs, as-well-as low-level debugging techniques on SIMD thread level which help to solve complex issues such as misaligned or out of range accesses to local\global memory, stack overflows, Illegal instructions, etc. Finally, we’ll cover the challenges in debugging

Modeling Explicit SIMD Programming with Subgroup Functions – Biju George and Ben Ashbaugh (Intel)

In this presentation, based on our experience in developing publicly released vendor extensions based on subgroups, we explain the advantages of the “explicit SIMD” programming paradigm using OpenCL subgroup and how the subgroups framework can be leveraged to: (1) Model features for performance in OpenCL that are commonly available in programming languages or interfaces based on an “explicit SIMD” programming paradigm such as the AVX intrinsics supported in GCC; and to (2) Model features to expose functionality available in GPU accelerator units that are more conveniently and efficiently exposed using a block API.

StreamComputing is 7 years!

As of 1 April we are 7 years old. Because of all the jokes on that day, this post is a bit later.

Let me take you through our journey how we grew up from a 1-person company to what we’re now. With pride I can say that (with ups and downs) StreamComputing (now rebranded to StreamHPC) has become a brand that equals to (extremely) fast software, HPC, GPUs and OpenCL.

7 years of changes

Different services

After 7 years it’s also time for changes. Initially we solely worked on OpenCL related services, mostly GPUs. And this is what we’re currently doing:

  • HPC GPU computing: OpenCL, CUDA, ROCm.
  • Embedded GPU computing: OpenCL, CUDA, RenderScript, Metal.
  • Networked FPGA programming: OpenCL.
  • GPU-drivers testing and optimisation.
  • Software architecture optimisations.

While you see OpenCL a lot, our expertise in vendor-specific CUDA (NVidia), ROCm (AMD), RenderScript (Google) and Metal (Apple) cannot be ignored. Hence the “Performance Engineers” and not “GPU consultants” or “OpenCL programmers”.

From Fixers to Builders and getting new competition

Another change is that we have been going from fixing code afterwards to building software.

This has been a slow process and had to do with the confidence in performance engineering as an expert profession instead of a trick. We’re seeing new companies coming into the market and providing GPU-computing next to their usual services. This is a sign of the market growing up.

We’re confident in growing further in our market, as we have the expertise to design fast software while the newcomers have gained expertise to write code that runs on the GPU with only little speedup.

Community: OpenCL:PRO to OpenCL.org

There have been more times when we wanted to support the community more. The first try was OpenCL:PRO and did not live long, as it was actually unclear to us what “the community” wanted.

In the end it was not that hard. Everybody who starts with OpenCL has the same problems:

  • Lack of convenience code, resulting in many, many wrappers and libraries that are incompatible.
  • Lack of practice projects.
  • Lack of overview on what’s available.

With OpenCL.org we aim to solve these problems together with the community. All is shared on Github and anybody can join to complete the information we’ve shared. While our homepage had around 40 pages on these subjects, it was only our personal view on the subjects or had outdated info.

So we’re going to donate most of the OpenCL-related technical pages we’ve written over the years to the community.

There is much more to share – watch our blog, the OpenCLorg twitter and newsletter!

Different Logo

For who remembered: in 2010 the logo looked quite different. We still use the blocks in the background (like on our Twitter account), but since 2014 the colours and font are quite different. This change has been going along with the company growing up. The old logo is careful, while the new one is bold – now we’re more confident about our expertise and value.

Over the past 3 years the new logo has stayed the same and has fully become our identity.

Same kind of customers

It has been quite a journey! We could not have done it without all the customers we served over those 7 years.

Thank you!

Looking for the company’s GPU-pioneers

Getting from the technical advantages to the business advantages we have extensive experience.

Several projects were introduced to us via the company’s GPU-pioneer. These collaborations were very successful and pleasant to do, due to the internal support within the company. This is a reason we would like to do more of these – this text is dedicated to the GPU-pioneers out there.

Seeing the potential of GPUs is not easy, even when you’ve carefully read the 13 types of algorithms that OpenCL can speed up. So it’s even harder to convince your boss that GPUs are the way to go.

Here is where we come in. This is what you need to do. Continue reading “Looking for the company’s GPU-pioneers”

GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017

What brings 2017 in technology? Gartner gives their vision with the start of each year to give insight in which technologies to invest in. When looking through them, the most important enabling technologies are the GPU and Internet-of-Things (IoT) – see the image below. Whereas the last 4 are IoT based, the first 4 would not have been possible without GPUs.

The middle two are more mature technologies, as they’re based on technology progress of many years – it happens to be that the GPU has played a big role to get here. And ofcourse not only GPUs and IoT are the reason these 10 are on this year’s list.

Continue reading “GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017”

When to use Artificial Intelligence and when to use Algorithms?

The main strength of Artificial Intelligence is it’s easy to understand by anybody. This results in new applications in all industries at a rapid pace. Are there new possibilities generated or have the possibilities always been possible? The answer is both.

In case of totally unknown input, AI is better capable of adapting. A clearly defined algorithm has much more unpredictable behavior in such cases.

If the input is known well, the tables are turned. AI could give unpredictable results and with that introducing hard-to-solve bugs. Algorithms do exactly what it is designed for, also often at a much higher performance.

Making the wrong choice here results often in a much more expensive solution. Continue reading “When to use Artificial Intelligence and when to use Algorithms?”

Double the performance on AMD Catalyst by tweaking subgroup operations

AMD’s hardware was only used for less than half in case of scan operations in standard OpenCL 2.0.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are reduce and scan operations. Those patterns have been used in many OpenCL software and can now be implemented in a more straightforward way. The promise to the developers was that the vendors now can provide better performance using none or very little local memory. However, the promised performance wasn’t there from the beginning.

Recently, at StreamHPC we worked on improving performance of certain OpenCL kernels running specifically on AMD GPUs where we needed OpenGL-interop and thus chose Catalyst-drivers. It turned out that work-group and sub-group functions did not give the expected performance on both Windows and Linux. Continue reading “Double the performance on AMD Catalyst by tweaking subgroup operations”

Win an OpenCL mug!

The first batch is in and you can win one from the second batch!

We’re sending a mug to a random person who subscribes to out newsletter before the end of 17 April 2017 (Central European Time). Yes, that’s a Monday.

Two winners

We’ll pick two winners: one from academia and one from industry. If you select “other” as your background, then share which category you fall in the last field.

Did you already subscribe and also want to win? I am not forgetting you – more details are in a newsletter next quarter.

More winners, by referring to a friend

If you refer a colleague, a friend or even a stranger to subscribe, you can both win a mug. Just be sure he/she remembers to mention you to me when I ask. Before you ask: the maximum referral-length is 5 (so referral of referral of referral of referral, etc) plus the one who started it.

UPDATE: If you win a mug and were not referred by somebody, you can pick a co-winner yourself. Joy should be shared.

Newsletter
Sign up for our Newsletter
* = required field

You can also use this link http://eepurl.com/bgmWaP.

Customer: “So you also do full projects?”

One of these moments when you find out that the company is not seen as I want it to be seen. Compared to generic software engineering companies, we have the advantage of creating software that is capable of processing more data.

For years we have promoted this unique advantage, for which we use OpenCL, CUDA, HIP and several other languages. Always having full projects in mind, but unfortunately not clearly communicating this.

During discussions with several existing and new customers, it became suddenly clear that we are seen as a company that fixes code, not one that builds the full code.

It became most clear that when was suggested to let us collaborate with another party, where our role would be to make sure they would not make mistakes regarding performance and code-quality.

  • Customer: You can work with a team we hired before.
  • Us: We also do full projects.
  • Customer: Really?

This would mean we would be the seniors in the group, but not own the project – a suboptimal situation, as important design decisions could be ignored. Continue reading “Customer: “So you also do full projects?””

Should SPIRV be supported in CUDA?

Would you like to run CUDA-kernels on the OpenCL framework? Or Python or Rust? SPIRV is the answer! Where source-to-source translations had several limitations, SPIRV 1.1 even supports higher level languages like C++.

SPIRV is the strength of OpenCL and it will only get bigger.

Currently Intel drivers best supports SPIRV, making it the first target for the new SPIRV-frontends. It is unknown which vendor will be next – probably one which (almost) has OpenCL 2.0 drivers already, such as AMD, ARM, Qualcomm or even NVidia.

How interesting is SPIRV really?

So if SPIRV is really that important a reason to choose for OpenCL, I thought of framing it towards CUDA-devs on twitter in a special way:

Should CUDA 9 support SPIRV?

2 people voted no, 29 people voted yes.

So that’s a big yes for SPIRV-support on CUDA. And why not? Don’t we want to program in our own language and not be forced to use C or C++? SPIRV makes it possible to quickly add support in any language out there without official support of the vendor. Let wrappers handle the differences per vendor and let SPIRV be the shared language for GPU-kernels. Where do we need OpenCL or CUDA for, if the real work is defined by SPIRV?

What do you think? Leave your comment below how you see the future of GPGPU with SPIRV in town. Is 2017 the year of SPIRV?

And if you worked on a SPIRV-frontend, get in touch to continue your project on https://github.com/spirv. Yes, it’s empty right now, but you don’t know what’s hidden.

NVIDIA beta-support for OpenCL 2.0 works on Linux too

In the release notes for 378.66 graphics drivers for Windows (February 2017), NVIDIA officially spoke about supporting OpenCL 2.0 for the first time. Unfortunately, this is partial support only and, as NVIDIA said, these new [OpenCL 2.0] features are available for evaluation purposes only.

We did our own tests on a GTX 1080 on Windows and could confirm that for Windows the green team is halfway there. NVIDIA still has to implement pipes, enable non-uniform work-group sizes (this happens when in ND-range global_work_size is not divisible by the local_work_size), and fix a few bugs in device side enqueue.

Today we decided to test out NVIDIA latest driver (378.13) for 64-bit Linux and check its support for OpenCL 2.0.

NVIDIA, OpenCL 2.0 and Linux

Just like on Windows, our GTX 1080 reports that it is an OpenCL 1.2 devices. It is understandable since support for OpenCL 2.0 is only in beta stage. In the following table you’ll find an overview of the 2.0 functions supported by this Linux driver.

OpenCL 2.0 featureSupportedNotes
SVMYesOnly coarse-grained SVM is supported. Fine-grained SVM (optional feature) is not.
Device side enqueuePartially. Surprisingly, it
works better than
on Windows
Almost OpenCL programs with device side queue we have tested work.

Some advanced examples with multi-level device side kernel enqueuing
and/or CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP fail. When using device
side queue, it's only possible to use 1D nd-range with uniform work
groups (or without specifying local size). 2D and 3D nd-ranges
don't work.
Work-group functionsYes
PipesNoPipe functions are defined in libOpenCL.so in 378.13 drivers,
but using them cause run-time errors.
Generic address spaceYes
Non-uniform work-groupsNo
C11 AtomicsPartiallyUsing atomic_flag_* functions cause an CL_BUILD_ERROR error.
Subgroups extensionNo

The host-side functions clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties() are also present and working.

As you can see, the support for OpenCL 2.0 on Linux is almost exactly the same as on Windows. But in contrast with the Windows-drivers, we were able to successfully compile and run several more kernels that use device side queue. It may indicate that this feature is being actively developed and maybe in future drivers it will work much better – for both Linux and Windows.

What you can do to make it better

As NVIDIA only adds new functionality to OpenCL driver when requested, it is very important that they receive these requests. So when you or your employer is a paying customer, do keep requesting the features you need. Know that NVIDIA knows that lacking required functionality will be bad for their sales.

NVIDIA enables OpenCL 2.0 beta-support

In the release notes for NVIDIA 378.66 graphics drivers for Windows NVIDIA mentions support for OpenCL 2.0. This has been the first time in 3 years since OpenCL 2.0 has been launched, that they publicly speak about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any reference in release notes and were therefore officially unofficial.

You should know that only on 3 April 2015 NVIDIA finally started supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures. OpenCL 2.0 was already there for one and a half years (November 2013), now more than three years ago.

Does it mean that you will be soon able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

Continue reading “NVIDIA enables OpenCL 2.0 beta-support”

The 8 reasons why our customers had their code written or accelerated by us

Making software better and faster.

In the past six years we have helped out various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised in three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these. Continue reading “The 8 reasons why our customers had their code written or accelerated by us”

Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Below explains both programmes in detail. Continue reading “Master+PhD students, applications for two PRACE summer activities open now”

How many threads can run on a GPU?

5x5x3x3x3
Blocks of Threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: at a minimum around 4 billion can be scheduled, 10’s of thousands can run simultaneously.

If you are used to work with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient. There are a few reasons why a GPU has been designed to handle so many threads. Read further…

NOTE: The below description is a (very) simplified model with the purpose to explain the basics. It is far from complete, as it would take a full book-chapter to explain it all. Continue reading “How many threads can run on a GPU?”

Funded PhD internships at StreamHPC

We have several wishes for 2017 and two of them are to make code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available and two are at StreamHPC.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.

 

The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2016. When you apply for one of the internships, you must update your profile page including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (€5000 for 3 months) + travel expenses (maximum €500). The main goal is to gain experience. Even if you don’t get a job after the internship, you tap into our network.

Continue reading “Funded PhD internships at StreamHPC”

IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks-off with a full-day Advanced Hands-On OpenCL tutorial which is followed by two-days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. Deadline is beginning of February, so better submit the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.

We have been awarded the Khronos project to upgrade the OpenCL test suite to 2.2!

Some weeks ago we started with implementing the Compiler Test Suite for OpenCL 2.2. The biggest improvement of OpenCL 2.2 is C++ kernels, which originally was planned for 2.1. SPIRV 1.1 is another big improvement.

We are very happy to have a part in making OpenCL better! We find OpenCL C++ kernels very important, even if it has its limitations. Thanks to SPIRV 1.1 it gets easier to have more (unofficial) kernel languages next to C and C++, and to get SYCL. Also upgrading from 2.0 to 2.2 is rather easy thanks to the open source libclcxx.

Personally I found this project to also be very important for our internal knowledge building, as almost every function would be touched and discussed.

OpenCL 2.2 CTS RFQ has been awarded to StreamHPC

Khronos issued a Request For Quote (RFQ) back in September 2016 to enhance and expand the existing OpenCL 2.1 conformance tests to create an OpenCL 2.2 test suite to be used to define conformance for OpenCL 2.2 implementations. The contract has been awarded to StreamHPC. StreamHPC is a software consultancy company specialized in performance tuned software development for CPU, GPU and FPGA. A large part of their clients hires them for their OpenCL expertise.

Already improvements have been added, bugs splatted and documentation improved. We hope to continue this the coming months!

We’ll be ready in March. Hopefully the first implementations are ready by then, as there is a test suite ready to iron out any bug discovered. Which three OpenCL drivers do you think will be first to have OpenCL 2.2? Intel, AMD, NVidia, ARM, Imagination, Qualcomm, TI, Intel FPGA (Altera), Xilinx, Portable OpenCL or another?

AMD gets into Machine Intelligence with “MI” range of hardware and software

Always good to have a share out of that curve.

In June we wrote on “AMD is back!“, where this is one of the blog posts with more details in a specific direction. This post is about AMD specifically targeting machine learning with the MI ( = Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

Recently released have been the “Radeon Instinct” series, which purely focus on compute. How the new naming of AMD is organised will be discussed in a separate blog post. Continue reading “AMD gets into Machine Intelligence with “MI” range of hardware and software”