The Exascale rat-race vs getting-things-done with HPC

IDC Forecasts 7 Percent Annual Growth for Global HPC Market – HPCwire

When the new supercomputer “Cartesius” of the Netherlands was presented to the public a few months ago, the buzz was not around FLOPS, but around users. SARA CEO Dr. Ir. Anwar Osseyran kept focusing on this aspect. The design of the machine was not driven by getting into the TOP500, but by improving the performance of the current users’ software. This was applauded by various HPC experts, including StreamHPC. We need to get things done, not win a virtual race over some number.

In the description of the supercomputer, the TOP500 position was only mentioned at the bottom of the page:

Cartesius entered the Top500 in November 2013 at position 184. This Top500 entry only involves the thin nodes resulting in a Linpack performance (Rmax) of 222.7 Tflop/s. Please note that Cartesius is designed to be a well balanced system instead of being a Top500 killer. This is to ensure maximum usability for the Dutch scientific community.

What would happen if you go for a TOP500 supercomputer? You might get a high energy bill and an overpriced, inefficient machine. In the first months you will not have full usage of it, and you won’t be able to easily turn off unused parts, hence the waste of electricity. The end result: it becomes more attractive to run unoptimised code on the cluster than to take time to optimise it.

The inefficiency is due to the fact that some software is data-transfer-limited and other software is compute-limited. No need to explain that if you design for the TOP500 instead of for the software, you end up buying extra hardware to get all kinds of algorithms performing. Cartesius therefore has “fat nodes” and “thin nodes” to get the best bang per buck.

There is also a plan for expanding the machine over the years (on-demand growth), such that the users remain happy over time instead of getting one adrenaline shot at delivery.

The rat-race

The HPC TOP500 is run by the company behind the ISC events. They care about their list being used, not whether there is Exascale now or later. Two companies have a particular interest in Exascale: Intel and IBM. It hardly matters anymore how it began. What is interesting is that Intel has bought QLogic’s InfiniBand business and is collecting companies that could make it the one-stop shop for an HPC cluster. IBM has always been strong in supercomputers with their BlueGene HPC line. Intel has a very nice infographic on Intel+Exascale, which shows how serious they are.

But then the big question comes: did all this pushing speed up the road to Exascale? Well, no… just the normal peaks and lows around the theoretical logarithmic trend line:

Top500 exponential growth (source: CNET)

What I find interesting in this graph is that the #500 line is diverging from the #1 line. With GPGPU it was quite easy to enter the TOP500 three years ago.

Did the profits rise? Yes. While PC sales went down, HPC revenues grew:

Revenues in the high-performance computing (HPC) server space jumped 7.7 percent last year to $11.1 billion surpassing the $10.3 billion in revenues generated in 2011, according to numbers released by IDC March 21. This came despite a 6.8 percent drop in shipments, an indication of the growing average selling prices in the space, the analysts said. (eWeek.)

So, mainly the willingness to buy HPC has increased. And you cannot stay behind when the rest of the world is focusing on Exascale, can you?


Keep your feet on the ground and focus on what matters: papers and answers to hard questions.

Did you solve a compute problem and get published using a sub-top-250 supercomputer? Share it in the comments!

Basic concepts: malloc in the kernel

Pointers and allocated memory space with a hint to Oktoberfest.

During the last training I got the question of how to do a malloc in the kernel. It was one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or global) memory from within the kernel. Luckily it is possible, but it is somewhat hidden in another function that is called from the host.

clSetKernelArg to the rescue

The way to do it is from the host, using one of the kernel arguments.

cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void *arg_value)

This function allocates the memory on the device for you. Just as with a normal malloc, it doesn’t clear the memory for you.

To make sure the host cannot access it (and you don’t accidentally pin/write/read it when using host-generation scripts), you can use the flag CL_MEM_HOST_NO_ACCESS. All the flags have been explained in a previous article about this same function, setting flags for creating kernel arguments.
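As a minimal sketch (context, kernel and size are assumptions for the example; error handling omitted), creating and passing such a device-only buffer could look like this:

cl_int err;
// Device-only buffer: kernels can read and write it, the host cannot touch it.
cl_mem d_buf = clCreateBuffer(context,
                              CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                              size, NULL, &err);
// Pass it to the kernel as argument 0.
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buf);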

The advantage of only allowing malloc to be done from the host, before the kernel is launched, is that the memory-planning can be done more efficiently.

Local memories

When you need local space, you can specify that at the kernel side. For example:

__kernel void foo(__local int* bar) { ... }

This mallocs an area in each local memory, with the size specified by arg_size.
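On the host you then pass only the size, with NULL as the value. A minimal sketch for the kernel above (the argument index and element count are example values):

// Reserve 1024 ints of __local memory for argument 0 (“bar” above).
// Passing NULL as arg_value tells OpenCL to allocate arg_size bytes of local memory.
cl_int err = clSetKernelArg(kernel, 0, 1024 * sizeof(cl_int), NULL);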

Basic Concepts

This short article is part of the Basic Concepts series. It contains several subjects I did not see explained well enough in books or the reference manual. If you see a subject that you would like to have covered in this series, just contact us.

Partner up with StreamHPC for Horizon 2020!

For those working in a research department at a company or university within the EU, Horizon 2020 might sound familiar.


For us this is an important programme and a source of possibilities for collaboration in the coming years. Our expertise in enabling ultra-fast computations, combined with your expertise, can make Europe more competitive. We are interested in applied GPGPU, in the commercialisation of tools, and in co-developing new software with SMEs and universities based in the EU, Switzerland or Israel.

Fields Europe wants to focus on:

  • Micro- and nano-electronics; photonics
  • Nanotechnologies
  • Advanced materials
  • Biotechnology
  • Advanced manufacturing and processing

The development of these technologies requires multi-disciplinary knowledge and a capital-intensive approach.

Each of these industries has opportunities for using GPUs and accelerators.

Contact us today for more information. We’ll have a lot to share!

Events are organised throughout Europe to inform universities and companies about the programme. Are you a Dutch university or company? Check this site of the Dutch government.

StreamHPC launches monthly trainings in Europe

Every second Monday of the month StreamHPC offers an OpenCL training in Mathematics or Media-operations. The target is OpenCL 1.2 (or 1.1 when NVIDIA is discussed). OpenCL 2.0 trainings will start in Q2/Q3, or earlier on request. All trainings will be given by experienced OpenCL developers/trainers.

Trainings

Trainings take 3½ days, from the basics on the first morning to special requests on the fourth morning, either on your own laptop or logged in to a compute-server.

The Media-operations module is based on Heterogeneous Computing with OpenCL, second edition by Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa.

It covers convolution, video-processing, histogram and mixed particle simulation. Extra subjects are OpenCL-OpenGL interop and code-optimisation.

A good fit if you work with images, sound and video.


The Mathematics module is based on OpenCL in Action by Matthew Scarpino.

It covers reduction, sorting, matrix-operations and signal processing.

If you work on graphs, matrices and data-manipulation, this is for you!



OpenCL at SC13

Unluckily I am not at SC13, so I’ll enjoy it from a distance. Luckily I don’t miss one of the most beautiful 15 km runs in the Netherlands. When there is more news, I’ll add it to this post – below is mostly taken from the Khronos website and the SC13 website.

OpenCL Booth

Meet members of the OpenCL workgroup in Booth #4137 to get hot news from the OpenCL experts and an OpenCL reference card. Also learn about next year’s plans for IWOCL (International Workshop on OpenCL). Be sure not to miss the BOF on OpenCL 2.0 (in bold in the schedule).

Schedule

Sunday, 17 November

  • 8:30–17:00, Tutorial: Structured Parallel Programming with Patterns – Michael McCool, James Reinders, Arch Robison, Michael Hebenstreit (room 302)
  • 8:30–17:00, Tutorial: OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools – Luiz DeRose, Alistair Hart, Heidi Poxon, James Beyer (room 401)

Monday, 18 November

  • 8:30–17:00, Tutorial: OpenCL: A Hands-On Introduction – Tim Mattson, Alice Koniges, Simon McIntosh-Smith (room 403)

Tuesday, 19 November

  • 11:00 & 15:00, Short presentation: Introduction to OpenCL on FPGAs – AcceleWare (Altera booth)
  • 17:15–19:00, ACM Student Research Competition Poster Reception: local_malloc: malloc for OpenCL __local memory [Poster] – John Kloosterman (Mile High Pre-Function)

Wednesday, 20 November

  • 11:00 & 15:00, Short presentation: Introduction to OpenCL on FPGAs – AcceleWare (Altera booth)
  • 16:00, Case study: Accelerating Full Waveform Inversion via OpenCL on AMD GPUs – AcceleWare (AMD booth)
  • 17:30–19:00, BOF: OpenCL: Version 2.0 and Beyond – Tim Mattson, Ben Bergen, Simon McIntosh-Smith (room 405/406/407)

Thursday, 21 November

  • 10:30–11:00, Paper: General Transformations for GPU Execution of Tree Traversals – Michael Goldfarb, Youngjoon Jo, Milind Kulkarni (room 205/207)
  • 11:00–11:30, Paper: A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening – Alberto Magni, Christophe Dubach, Michael F.P. O’Boyle (room 205/207)
  • 11:30–12:00, Exhibitor Forum: OpenCL 2.0: Unlocking the Power of Your Heterogeneous Platform – Tim Mattson (room 501/502)
  • 11:30–12:00, Paper: Semi-Automatic Restructuring of Offloadable Tasks for Many-Core Accelerators – Nishkam Ravi, Yi Yang, Tao Bao, Srimat Chakradhar (room 205/207)
  • 11:30–12:00, Short presentation: Introduction to OpenCL on FPGAs – AcceleWare (Altera booth)
  • 16:00–16:30, Paper: Accelerating Sparse Matrix-Vector Multiplication on GPUs Using Bit-Representation-Optimized Schemes – Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, Weng-Fai Wong (room 401/402/403)

There are other interesting SC13-events around OpenCL, so be sure to check the schedule carefully.


Khronos Members Exhibiting at SC13

Complete floor plan is available here.

The links below have been updated from the company homepages to special SC13 landing pages.

  • Altera Corporation – Booth 4332.
  • AMD – Booth 1113. See this list for a schedule.
  • ARM – Booth 3141.
    • OpenCL on MALI demos at their booth
    • AccelerEyes (#310) showcases ArrayFire (with OpenCL-on-ARM backend) running on Mali T604.
    • Many more – just look for it.
  • Khronos OpenCL – Booth 4137.
  • IBM – Booth 126, 2713. OpenCL on IBM PowerLinux 7R2 and IBM Flex System. Also collaboration with Altera.
  • Intel – Booth 2501, 2701. OpenCL on their CPUs and GPUs.
  • NEC Corporation – Booth 3109. Vector supercomputer – Khronos probably hints at OpenCL running on this machine – see for yourself.
  • NVIDIA – Booth 613. Have fun hearing them say: “Do you use CUDA or are you locked-in to OpenCL?” and variations on this.
  • Texas Instruments – Booth 3725. OpenCL on DSP demo.
  • Xilinx – To schedule a private appointment (for an OpenCL demo) visit Xilinx at the Convey booth (#3547) or the Alpha Data booth (#4237).

The floorplan can be downloaded here or here (mirrored on 15-Nov).

Seen great OpenCL demos or news at SC13?

Share this info and photos in the comments, for others to pick up.

CUDA 6 Unified Memory explained

A) Unified Memory Access (UMA). B) NVIDIA’s Unified Virtual Addressing (UVA), now rebranded as “Unified Memory”.

AMD, ARM vendors and Intel have been busy unifying CPU and GPU memories for years. It is not easy to design a model where two (or more) processors can access memory without deadlocking each other.

NVIDIA just announced CUDA 6, which to my surprise includes “Unified Memory”. Am I missing something completely, or did they just pass their competitors, as the name implies one memory? The answer is in their definition:

Unified Memory — Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

The official definition is:

Unified Memory Access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

See the difference?

The image at the right explains it differently. A) is how UMA is officially defined, and B) is how NVIDIA has redefined it.

So NVIDIA’s Unified Memory solution is engineered by marketeers, not by hardware engineers. On Twitter I seem not to be the only one who felt the need to explain that it is different from the terminology the other hardware designers have been using.

So if it is not unified memory, what is it?

It is intelligent synchronisation between CPU and GPU-memory. The real question is what the difference is between Unified Virtual Addressing (UVA, introduced in CUDA 4) and this new thing.


UVA defines a single address space, where CUDA takes care of the synchronisation when the addresses are physically not in the same memory space. The developer has to give ownership to either the CPU or the GPU, so CUDA knows when to sync memories. It does need cudaDeviceSynchronize() to trigger synchronisation (see image).


From AnandTech, which wrote about Unified (virtual) Memory:

This in turn is intended to make CUDA programming more accessible to wider audiences that may not have been interested in doing their own memory management, or even just freeing up existing CUDA developers from having to do it in the future, speeding up code development.

So it’s there to attract new developers, and then later take care of them being bad programmers? I cannot agree, even if it makes GPU programming popular – I don’t bike on highways.

From Phoronix, which discussed the changes of NVIDIA Linux driver 331.17:

The new NVIDIA Unified Kernel Memory module is a new kernel module for a Unified Memory feature to be exposed by an upcoming release of NVIDIA’s CUDA. The new module is nvidia-uvm.ko and will allow for a unified memory space between the GPU and system RAM.

So it is UVM 2.0, but without any API changes. That’s clear then. It simply matters a lot whether it’s real or virtual, and I really don’t understand why NVIDIA chose to obfuscate these matters.

In OpenCL this has to be done explicitly, by mapping and unmapping pinned memory, but it is very comparable to what UVM does. I do think UVM is a cleaner API.
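For comparison, a minimal OpenCL sketch of that map/unmap pattern (context, queue and size are assumed; error handling omitted):

cl_int err;
// Buffer backed by pinned host memory.
cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
// Map: the host takes ownership and gets a pointer it may read and write.
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);
// ... fill p on the host ...
// Unmap: ownership returns to the device, so kernels see the data.
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);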

Let me know what you think. If you have additional information, I’m happy to add this.

AMD updates the FirePro S10000 to 12GB and passive cooling

Let the competition on large-memory GPUs begin!

Some algorithms and continuous batch processes will enjoy the extra memory. For example, when inverting a large matrix or running huge simulations, you need as much memory as possible. Another use is avoiding memory-bank conflicts by duplicating data-objects (only possible when the data stays in memory long enough to pay for the time it costs to duplicate it).

Another reason for larger memories is double precision computations (this card has a total of 1.48 TFLOPS), which double the memory requirements. With accelerators getting a better fit for HPC (true support for the IEEE-754 double precision storage format, ECC memory), memory size becomes one of the limits that needs to be solved.

The other choice is swapping on GPUs, or using multi-core CPUs. Swapping is not an option, as it nullifies all the speed-up. A server with 4 x 16-core CPUs is as expensive as one accelerator, but uses more energy.

AMD seems to have identified this as an important HPC market and therefore just announced the new S10000 with 12GB of memory, to be mailed to AMD partners in January and on the market in April. Is AMD finally taking the professional HPC market seriously? They now do have the first 12GB GPU-accelerator built for servers.

Old vs New

Still a few question marks, unfortunately.


Functionality | FirePro S10000 6GB | FirePro S10000 12GB
GPU-processor count | 2 | 2
Architecture | Graphics Core Next | Graphics Core Next
Memory per GPU-processor | 3 GB GDDR5 ECC | 6 GB GDDR5 ECC
Memory bandwidth per GPU-processor | 240 GB/s | 240 GB/s
Performance per GPU-processor (single precision) | 2.95 TFLOPS | 2.95 TFLOPS
Performance per GPU-processor (double precision) | 0.74 TFLOPS | 0.74 TFLOPS
Max power usage, whole dual-GPU card | 325 Watt | 325 Watt (?)
Greenness, whole dual-GPU card (SP) | 20.35 GFLOPS/Watt | 18.15 GFLOPS/Watt
Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16
Price, whole dual-GPU card | $3500 | ?
Price per GFLOPS (SP) | $0.60 | ?
Price per GFLOPS (DP) | $2.43 | ?
Cooling | Active (!) | Passive

The biggest differences are the doubling of memory and the passive cooling.

Competitors

Its biggest competitor is the Quadro K6000, which I haven’t discussed at all. That card throws out 5.2 TFLOPS using one GPU, which can access all 12GB of memory via a 384-bit bus at 288 GB/s (when all cores are used). It is actively cooled, so it’s not really fit for servers (like the 6GB version of the S10000). The S10000 has a higher total bandwidth, but each of its GPUs can only access its own half of the 12GB at full speed. So the K6000 has the advantage here.

Intel is planning 12GB and 16GB Xeon Phis. I’m curious about more benchmarks of the new cards, as the 5110P does not show very good results (benchmark 1, benchmark 2). It compares more to a high-end Xeon CPU than to a GPU. I am more enthusiastic about the OpenCL performance on their CPUs.

What’s next on this path?

A few questions I asked myself and tried to find answers on.

Extendable memory, like we have for CPUs? Probably not, as GDDR5 is not designed to be upgradable.

Unified memory for multi-GPUs? This would solve the disadvantage of multi-die GPU cards, as 2, 4 or more GPUs could share the same memory. A reason to watch the progress of HSA hUMA, which now specifies unified memory access between GPU and CPU.

24GB of memory or more? I’ve included the graph below to give an idea of the costs of GDDR memory, so it’s an option. These prices are of course excluding supplementary parts and the R&D costs of making more memory accessible to the GPU cores.

GPU-parts pricing table – Q3 2011

At least we are going to get this question answered now: is the market that needs this amount of memory large enough, and thus worth serving?

Is there more need for a wider memory bus? Remember that GDDR6 is promised for 2014.

What do you think of a 12GB GPU? Do you think this is the path that distinguishes professional GPUs from desktop-GPUs?

CUDA’s multiple targets, the OpenCL version

I’d like to share two images.

The following image is being shared for quite some time, to show the technical capabilities of CUDA.

(image: one CUDA source feeding multiple hardware targets)

I replaced “CUDA source” by “OpenCL source” and worked from there. Result:

(image: the same diagram, now with OpenCL source feeding the targets)

I know it is not optimised for a certain architecture, but neither is the CUDA-source for the two extra targets.

Coming up soon is an article on where the real stuff is currently happening: the higher-level languages being built on top of OpenCL.

Altera published their OpenCL-on-FPGA optimization guide


Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you have knowledge of the technology. Neither does it provide any information on the basics of Altera’s Stratix V or other FPGAs.

It is the first public optimisation document, so feedback sent directly to them is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.

Subjects

The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.

Download

You can download the document here.

If you have any question on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcomed to contact us.

ARM forums to find useful information for OpenCL development

OpenCL on ARM is hot, but it is just getting started. Currently it takes some time to find the needed information about the processors concerning OpenCL support.

For OpenCL discussions the best place is the Khronos OpenCL board. So where can you go when you want to ask questions specifically about ARM-based GPUs like MALI, PowerVR, Adreno and Vivante?

ARM’s new community site for all

ARM just launched the Connected Community (ARM CC). It is the place to connect to when you have general information needs about ARM IP, such as the ARM MALI, Cortex A9 and Cortex A15.


And here is how ARM themselves explains this initiative on one slide:


Be sure to connect to StreamHPC. We hope this will indeed become the central place for the whole ecosystem, including Imagination, Qualcomm and Vivante.

ARM MALI


The MALI Developer Center has its forums on ARM Connected Community.

Imagination PowerVR

The graphics section of their developer forums seems to be the best place.


(Not @ ARM CC)

Qualcomm Adreno

Qualcomm has dev forums too, with a section called Mobile Gaming & Graphics Optimization (Adreno™).


(Not @ ARM CC)

Vivante

Vivante does not have a forum, but Freescale does. The i.MX forums seem to be the best place to ask your questions.


(@ ARM CC)

Others

Where do you find good sources for finding and sharing interesting information on mobile GPUs? Share them with the others via the comments – the chance that your question gets answered increases when more people visit the forums.

Guest-blog: Accelerating sequential machine vision algorithms with OpenMP and OpenCL

Guest-blogger Jaap van de Loosdrecht wants to share his thesis with you. He leads the Centre of Expertise in Computer Vision at NHL University of Applied Sciences, owns his own company, and still managed to study and write an MSc thesis. The thesis is interesting because it extensively compares OpenCL with OpenMP, especially in chapters 7 and 8.

For those who are interested, my thesis “Acceleration sequential machine vision algorithms using commodity parallel hardware” is available at www.vdlmv.nl/thesis.

Keywords: Computer Vision, Image processing, Parallel programming, Multi-core CPU, GPU, C++, OpenMP, OpenCL.

Many other related research projects have considered using one domain-specific algorithm to compare the best sequential implementation with the best parallel implementation on a specific hardware platform. This work was distinctive because it investigated how to speed up a whole library by parallelizing the algorithms in an economical way and executing them on multiple platforms. This work has:

  • Examined, compared and evaluated 22 programming languages and environments for parallel computing on multi-core CPUs and GPUs.
  • Chosen to use OpenMP as the standard for multi-core CPU programming and OpenCL for GPU programming.
  • Re-implemented a number of standard and well-known algorithms in Computer Vision using both standards.
  • Tested the performance of the implemented parallel algorithms and compared the performance to the sequential implementations of the commercially available software package VisionLab.
  • Evaluated the test results with a view to assessing:
    • Appropriateness of multi-core CPU and GPU architectures in Computer Vision.
    • Benefits and costs of parallel approaches to implementation of Computer Vision algorithms.

Using OpenMP it was demonstrated that many algorithms of a library could be parallelized in an economical way and that adequate speedups were achieved on two multi-core CPU platforms. With a considerable amount of extra effort, OpenCL was used to achieve much higher speedups for specific algorithms on dedicated GPUs.

At the end of the project, the choice of standards was re-evaluated including newly emerged ones. Recommendations are given for using standards in the future, and for future research and development.

Algorithmic improvements are suggested for Convolution and Connect Component Labelling.

Your feedback and/or questions are welcome.

If you put comments here, I’ll make sure Jaap van de Loosdrecht gets to see them and answers your questions on the subjects discussed in his thesis.

All the members of the OpenCL working group 2013

The list below shows the members of the OpenCL workgroup as of November 2013.

(image: the OpenCL working group members as of November 2013)

We can expect small changes each year, but this is close to the actual state. I need the rest of Q4 to finalise all the info – any help is appreciated.

A similar list was compiled in 2010, and you can see several differences. If a company has an SDK available, there is a link. That is a whole difference with the last list – this one is much more concrete.

Reducing downtime with OpenCL… Ever thought of that?

Something that creates extra value for OpenCL is the flexibility with which it runs on an important variety of hardware. A famous strategy is running the code on CPUs to find data-races and debug the code more easily. Another is to develop on GPUs and port to FPGAs to reduce the development cycles.

But there’s one, quite important, often forgotten: replacement of faulty hardware. You can blame the supplier, or even Murphy if you want, but what is almost certain is that there’s a high chance of facing downtime precisely when the hardware cannot be replaced right away.

Failing to plan is planning to fail

To limit downtime, there are a few options:

  • Have a good SLA in place for 24/7 hardware-replacement.
  • Have spare-hardware in stock.
  • Have over-capacity on your compute-servers.

But the problem is that all three are expensive in some form if you’re not flexible enough. If you use professional accelerators like the Intel Xeon Phi, NVIDIA Tesla or AMD FirePro, you risk unexpected stock shortages at your supplier.

With OpenCL the hardware can be replaced by any accelerator, whereas with vendor-specific solutions this is not possible.

Flexibility by OpenCL

I’d like to share with you one example how to introduce flexibility in your hardware-management, but there are various others which are more tailored to your requirements.

To detect faulty hardware, think of a server with three GPUs and let selected jobs be run by all three – any hardware problem will be detected and pinpointed. Administering which hardware has done which job completes the mechanism. Exactly this can be used to replace faulty hardware with any accelerator: let the replacement accelerator run the same jobs as the other two as an acceptance test, as sketched below.
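A minimal sketch of that idea in C (run_job and report_suspect are hypothetical helpers, to be backed by your own job system; at most 16 devices assumed):

#include <stdbool.h>

typedef struct device device_t;               // opaque device handle (hypothetical)
typedef struct { double checksum; } result_t; // e.g. a checksum of the job output

result_t run_job(device_t *dev, int job_id);  // hypothetical helper
void report_suspect(device_t *dev);           // hypothetical helper

static bool same_result(result_t a, result_t b) { return a.checksum == b.checksum; }

void acceptance_test(device_t *devices[], int num_devices, int job_id) {
    result_t results[16];
    // Run the same job on every accelerator...
    for (int d = 0; d < num_devices; ++d)
        results[d] = run_job(devices[d], job_id);
    // ...and flag any device whose result deviates from the first one.
    for (int d = 1; d < num_devices; ++d)
        if (!same_result(results[0], results[d]))
            report_suspect(devices[d]);
}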

If you need your software to be optimised for several accelerators, you’re in the right place. We can help you with both machine and hand optimizations. That’s a plan that cannot fail!

Products using OpenCL on ARM MALI are coming

The past year you might not have heard much from OpenCL-on-ARM, besides the Arndale developer-board. But you have heard just a small portion of what has been going on.

Yesterday the (Linux) OpenCL drivers for the Chromebook (which contains an ARM MALI T604) have been released, and several companies will launch products using OpenCL.

Below are a few interviews with companies who have built such products. This will give an idea of what is possible on those low-power devices. To first get an idea of what this MALI T604 GPU can do when it comes to OpenCL, here is a video from the 2013 edition of the LEAP conference we co-organised.

Understand that the whole board takes less than ~11.6 Watts – that is including the CPU, GPU, memory, interconnects, networking, SD-card, power-adapter, etc. Only a small portion of that is the GPU. I don’t know the exact specs, as this developer-board was not targeted towards energy-optimisation goals. I do know this is less than the 225 Watts of a discrete GPU alone.

Interviews with ARM partners

Basic Concepts: Writing OpenCL code for single and double precision

What’s precise enough?

Support for the double precision floating-point type double in OpenCL kernels requires an extension. AMD provides cl_khr_fp64 for newer high-end hardware, but also a non-fully-compliant cl_amd_fp64 extension for other hardware. NVIDIA and Intel support cl_khr_fp64, so no exceptions need to be made for those drivers.

The code you see below these lines is based on a page you can find on Bealto and it was written by Eric Bainville. I added extra typedefs, removed a constant and added DOUBLE_SUPPORT_AVAILABLE for easier fallback.

#if CONFIG_USE_DOUBLE

#if defined(cl_khr_fp64)  // Khronos extension available?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#elif defined(cl_amd_fp64)  // AMD extension available?
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#endif

#endif // CONFIG_USE_DOUBLE

#if defined(DOUBLE_SUPPORT_AVAILABLE)

// double
typedef double real_t;
typedef double2 real2_t;
typedef double3 real3_t;
typedef double4 real4_t;
typedef double8 real8_t;
typedef double16 real16_t;
#define PI 3.14159265358979323846

#else

// float
typedef float real_t;
typedef float2 real2_t;
typedef float3 real3_t;
typedef float4 real4_t;
typedef float8 real8_t;
typedef float16 real16_t;
#define PI 3.14159265359f

#endif

A macro is defined by the OpenCL C compiler for each available extension, which is cl_khr_fp64 in this example. This macro can be tested to enable the extension with #pragma OPENCL EXTENSION cl_khr_fp64 : enable.

Now, you need to use the defined constant(s) and the real_t, real2_t, … types instead of float or double. The definition of CONFIG_USE_DOUBLE is passed as a compilation option to clBuildProgram to make the switch between double and single precision. If there is no double support, it falls back to single precision.
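On the host side that switch is a single build option. A minimal sketch (program, device and use_double are assumed variables):

// Compile the same kernel source for double or single precision.
const char *options = use_double ? "-D CONFIG_USE_DOUBLE=1" : "";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);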


Basic Concepts: out of resources with clEnqueueReadBuffer

“Oops! The best way to learn, when you love trial-on-error”™

In the series “Basic Concepts” various basics of GPGPU and OpenCL are discussed. This time we go into a typical one: when an error does not imply the actual problem. It is therefore good to have an overview of all errors with their descriptions.

When you get an out-of-resources error, or when you get a crash when using clEnqueueReadBuffer, you are sort of left in the dark. What does it mean? And how can you solve it?

Typical: one driver crashes/segfaults and another one gives this error.

Officially the error is defined as:

CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

Which means that there can be more reasons than the device literally being out of resources. A better name would have been CL_RESOURCE_ALLOCATION_ERROR. It can be thrown by various functions, but we focus on this one function. It cannot be thrown by clEnqueueWriteBuffer, as that depends on the limits of the host.

Finding out the cause

The oldest trick of them all: try to use the CPU and check what the error is then. CPUs are great for detecting data-races (code that is correct on the CPU but not on the GPU), and CPUs are a bit more stable when you have buggy code, plus they have more RAM. Be sure to install both Intel’s and AMD’s drivers.

Calling clFinish after each line helps you pinpoint the actual line where it happens, or at least get an error instead of a crash.
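A minimal sketch of that habit (the CHECK macro is hypothetical; it needs <stdio.h> and <stdlib.h>, and queue, kernel and gws are assumed):

#define CHECK(err) do { \
        if ((err) != CL_SUCCESS) { \
            fprintf(stderr, "OpenCL error %d at %s:%d\n", (int)(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

CHECK(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL));
CHECK(clFinish(queue));  // forces the error to surface at this line, not at a later call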

Then you have the following options:

  1. 9 out of 10 times you have a pointer problem at the host or are writing out of bounds. So you try to write to an illegal memory location, or try to cram a 35×35 float array into 10×10×10 float space (buffer overflow). Double-check the host memory sizes and whether the host pointers are correct.
  2. You read out of bounds on the device. Double-check the used memory-sizes.
  3. You might have hit a limit of the driver, such as the 5-second timeout when the NVidia card is also being used as a display. Rule out that you have used up all memory by using both smaller and larger(!) objects. Also note down memory-object sizes over time, and be sure you clean up unused objects. Fragmentation of device memory can also be the reason it eventually goes wrong.

The last one I have not encountered myself, but found on the NVidia forums. I recently had this error myself (type 1). Because I had introduced clear naming in the code I was working on – the standard ‘h_‘ and ‘d_‘ prefixes for all variables – I immediately found the cause.

Hope it has helped you understand the resource allocation error. If you found other reasons, please share via the comments and I’ll add it. If you have requests what to discuss in this series, let me know via Twitter or the comments.

Help write the book “Numerical Computations with GPUs”

There is an interesting book coming up: “Numerical Computations with GPUs” – a book explaining various numerical algorithms with code in CUDA or OpenCL.

edit: At the moment there are 21 articles to be included in the book.

edit 2: book should be out in July

edit 3: Order via Springer International or Amazon US.
TOC:

  • Accelerating Numerical Dense Linear Algebra Calculations with GPUs.
  • A Guide to Implement Tridiagonal Solvers on GPUs.
  • Batch Matrix Exponentiation.
  • Efficient Batch LU and QR Decomposition on GPU.
  • A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems.
  • Sparse Matrix-Vector Product.
  • Solving Ordinary Differential Equations on GPUs.
  • GPU-based integration of large numbers of independent ODE systems.
  • Finite and spectral element methods on unstructured grids for flow and wave propagation problems.
  • A GPU implementation for solving the Convection Diffusion equation using the Local Modified SOR method.
  • Pseudorandom numbers generation for Monte Carlo simulations on GPUs: OpenCL approach.
  • Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA.
  • GPU-Accelerated computation routines for quantum trajectories method.
  • Monte Carlo Simulation of Dynamic Systems on GPUs.
  • Fast Fourier Transform (FFT) on GPUs.
  • A Highly Efficient FFT Using Shared-Memory Multiplexing.
  • Increasing parallelism and reducing thread contentions in mapping localized N-body simulations to GPUs.



Mobile Processor OpenCL drivers (Q3 2013) + rating

For your convenience: an overview of all ARM GPUs and their driver availability. Please let me know if something is missing.

I’ve added a rating, to friendly push the vendors to get to at least a 7. Vendors can contact me if they think the rating does not reflect reality.

ZiiLabs

SDK-page@StreamHPC

Drivers can be delivered by Creative, when you pledge to order ZMS-40 processors. Mail us for a contact at Creative. Minimum order size is unknown.

This device can therefore only be used for custom devices.

Rating: 4/10

Vivante

SDK-page@StreamHPC

They are found on public devices. Android drivers that work on Freescale processors are openly available and can be found here.

Rating: 8/10

Even though the processors are not that powerful, Vivante/Freescale offers the best support.

Qualcomm

SDK-page@StreamHPC

Drivers are not shipped on devices, according to various sources. Android drivers are in the SDK though, which can be found here.

Rating: 7/10

Rating will go up, when drivers are publicly shipped on phones/tablets.

ARM MALI

Samsung SDK-page@StreamHPC

There are lots of problems around the drivers for the Exynos, which only seem to work on the Arndale board when the LCD is also ordered. Android drivers can be downloaded here.

Rating: 5/10

All is in execution – half-baked drivers don’t do it. It is unclear whom to blame, but it certainly has had influence on the creation of a new version of the Exynos 5, the Octa.

Imagination Technologies

SDK-page@StreamHPC

TI only delivers drivers under NDA. Samsung has one board coming up with OpenCL 1.1 EP drivers.

Rating: 5/10

The rating will go up when drivers from TI become available without obstacles, or when Samsung delivers what they failed to do with the previous Exynos 5.

Exciting times coming up

Mostly because of a power-struggle between Google and the GPU vendors, there is some hesitation to ship OpenCL drivers on phones and tablets. Unfortunately Google’s answer to OpenCL, RenderScript Compute, does not provide what developers want. Google’s official answer is that it does not want fragmentation, nor code that is optimised for a certain GPU. The interpreted answer is that Google wants vendor lock-in and therefore blocks the standard. Whatever the reason is, OpenCL is used as a sword to show teeth over who has a say about the future of Android – only the advertisement company Google, or also the group of processor makers and various phone/tablet vendors?

In H2 2014 NVIDIA will ship CUDA drivers with their Tegra 5 GPUs, making the soap opera complete.

There are rumours that Apple will intervene and make OpenCL available on iOS. This would explain why so much effort is put into showing OpenCL results by Imagination and Qualcomm.

And always keep a close watch on POCL, the vendor-independent OpenCL implementation.


Need a programmer for any of the above devices? Hire us!


Cancelled: StreamHPC at Mosaic3DX in Cambridge, UK

Update: we are very sorry to say that, due to a deadline in a project, we were forced to cancel Vincent’s talk.

StreamHPC will be at Mosaic3DX in Cambridge, UK, on 30+31 October. The brand new conference managed to get big names on board, and I’m happy to be amongst them. Mosaic3DX describes itself as:

an international event comprising a conference, an exhibition, and opportunities for networking. Our intended audience are users as well as developers of Imaging, Visualisation, and 3D Digital Graphics systems. This includes researchers in Science and Engineering subjects, Digital Artists, as well as Software Developers in different industries.


“That is not what programmers want”

“I think you should be more explicit here in step two” (original print)

This post is part of the series Programming Theories, in which we discuss new and old ways of programming.

When discussing the design of programming languages or the extension of existing ones, the question “What concepts can simplify the tasks of the programmer?” always triggers lots of interesting debates. After that, when an effective solution is found, inventors are cheered and a new language is born. Up till this point all seems OK, but the problem comes with the intervention of the status quo: C, C++, Java, C#, PHP, Visual Basic. Those languages want the new feature implemented in the way their programmers expect it. But this would be like trying to implement the advantages of a motorcycle into a car without paying attention to the adjustments needed by the design of the car.

I’m in favor of learning concepts instead of doing new things the old way… but only when the former has proven to be better than the latter. The lean acceptance of e.g. functional languages tells a lot about how it goes in reality (with great exceptions like LINQ). That brings a lot of trouble when moving to multi-core. So, how do we get existing languages to change instead of just evolve?

High Level Languages for Multi-Core

Let’s start with a quote from Edsger Dijkstra:

Projects promoting programming in “natural language” are intrinsically doomed to fail.

In other words: a language can be too high-level. A programmer needs the language to be able to effectively micro-manage what is being done. We speak of “concerns” for a reason. Still, the urge to create the highest-level programming language is strong.

Don’t get me wrong. A high-level language can be very powerful once its concepts are defined in both directions. One direction concerns the developer: does the programmer understand the concept and the contract of the command or programming style being offered? The other concerns the machine: can it be effectively programmed to run the command, or could a new machine be made to do just that? This two-sided contract is one of the reasons why natural languages are not fit for programming.

And we have also found out that binary programming is not fit for humans.

The cartoon refers to this gap between what programmers want and what computers want.


AMD OpenCL Programming Guide August 2013 is out!

AMD has just released an update to their OpenCL programming guide.

Download the guide (PDF) August version

Download the guide (PDF) November version

Download TOC (PDF)

For more optimisation guides, see the tutorials page of the knowledge base.

Table of Contents

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Software Overview
1.1.1 Synchronization

1.2 Hardware Overview for Southern Islands Devices

1.3 Hardware Overview for Evergreen and Northern Islands Devices

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL