Aparapi: OpenCL in Java

Edit: Aparapi has since been open-sourced and many of the issues below have already been fixed or improved upon.

If you have an AMD GPU/APU, you should try Aparapi. This software lets you write OpenCL code in Java at a fairly high level. The idea, roughly, is that it processes the Java bytecode, searches for parallelisable loops and generates optimised OpenCL kernels from them. Just download Aparapi and try the two examples. As the current version is still in alpha, it is not flawless yet. What I found important while working with Aparapi is learning to keep it simple – much like on the road, you gain the most speed on straight stretches and turns slow you down.

The Aparapi team tries to avoid explicit definition of local memory, but it is still possible using the @Local annotation. Such decisions show that the team wants Aparapi to stay high-level. It also integrates well with JavaCL and JOCL, so you can mix it with kernels you have already created. You can also check out a video introducing Aparapi (it is video 15, in case #-linking doesn't work).
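As a quick illustration, here is a minimal sketch of a kernel that stages data in local memory via @Local. The class and field names are mine, and it assumes the open-source com.aparapi package (older releases use com.amd.aparapi):

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

// Sketch: stage data in OpenCL local memory with the @Local annotation.
public class LocalCopyKernel extends Kernel {
    private static final int GROUP_SIZE = 64;

    @Local private final float[] tile = new float[GROUP_SIZE]; // shared within one work-group
    private final float[] input;
    private final float[] output;

    public LocalCopyKernel(float[] input, float[] output) {
        this.input = input;
        this.output = output;
    }

    @Override
    public void run() {
        int gid = getGlobalId();
        int lid = getLocalId();
        tile[lid] = input[gid];        // each work-item loads one element into local memory
        localBarrier();                // wait until the whole work-group has written its element
        output[gid] = tile[GROUP_SIZE - 1 - lid]; // read a neighbour's element from fast local memory
    }

    public static void main(String[] args) {
        float[] in = new float[1024];
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) in[i] = i;

        LocalCopyKernel kernel = new LocalCopyKernel(in, out);
        // An explicit work-group size is needed so the @Local buffer matches the group size.
        kernel.execute(Range.create(in.length, GROUP_SIZE));
        kernel.dispose();
    }
}
```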

Time to create your own project. As not all errors are documented or solved in the upcoming version, below you will find a list of common errors and how to solve them easily.

Continue reading “Aparapi: OpenCL in Java”

Aparapi and GPU-code in Java

Aparapi is an open-source framework for writing OpenCL code in Java. It translates Java bytecode into OpenCL for AMD GPUs and for all CPUs, resulting in much faster code. Furthermore, Aparapi is also a good fit for existing code (*). And there's more: since late 2011 stable versions have been released and new features have been introduced.

(*) You can read more about the alpha version of Aparapi in this blog post.

AMD and Oracle have agreed to collaborate on implementing support for GPU-programming in Java. This means that it’s very likely that this upcoming implementation by Oracle will resemble Aparapi, which in turn also means that it would be safe to invest in this technology.

Below is an example of a simple vector addition. As you can see, the code is very clean.
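The listing is a minimal sketch assuming the open-source com.aparapi package (older releases use com.amd.aparapi); sizes and names are illustrative:

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

public class VectorAdd {
    public static void main(String[] args) {
        final int size = 4096;
        final float[] a = new float[size];
        final float[] b = new float[size];
        final float[] sum = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = i;
            b[i] = size - i;
        }

        // The run() body is translated from bytecode into an OpenCL kernel;
        // getGlobalId() is the index of the current work-item.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId();
                sum[gid] = a[gid] + b[gid];
            }
        };

        // Runs on the GPU if available, otherwise falls back to a Java thread pool.
        kernel.execute(Range.create(size));
        kernel.dispose();
    }
}
```

Everything outside run() is plain Java; only the kernel body has to obey Aparapi's restrictions (essentially primitives and arrays of primitives).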

StreamHPC helps you find bottlenecks in your Java code and increase performance by applying several types of code optimisation and implementing heavy computations in Aparapi. We have a track record of bringing down the running time of batch processes by a factor of 2 to even 300. In case more performance is needed, Aparapi code can be converted to pure OpenCL; we use the libraries JOCL and JavaCL for this purpose.

If you want faster Java code and/or training in Aparapi, contact us!


OpenCL Videos of AMD’s AFDS 2012

AFDS was full of talks on OpenCL. Did you miss them, just like me? Then you will be happy to hear that many of the videos have been put on YouTube!

Enjoy watching! As each video is around 40 minutes, it is best to set aside a full day to watch them all. The first part is on OpenCL itself, the second on tools, the third on OpenCL use cases, and the fourth on other subjects.

Continue reading “OpenCL Videos of AMD’s AFDS 2012”

Thalesians talk – OpenCL in financial computations

At the end of October I gave a talk for the Thalesians, a group that organises different kinds of talks for people working in, or interested in, the financial market. If you live in London, I certainly recommend visiting one of their talks. From a personal perspective I had a difficult task: how do you make a very diverse audience happy? The talks I gave in the past were for a more homogeneous and known audience, and this time I did not know at all what level of OpenCL programming the attendees had. I chose to give an overview and reserve time for questions.

After starting with some honest remarks about my understanding of the British accent, and about how I would kill my own business by being this honest with them, I spoke about five subjects. Some of them you might have read about here, but not all. You can download the sheets [PDF] via this link: Vincent.Hindriksen.20101027-Thalesians. The text below is meant to clarify the sheets, but it is certainly not the complete talk. So if you have the feeling that I skipped a lot of text, you are right.

Continue reading “Thalesians talk – OpenCL in financial computations”

Separation of Compute and Transfer from the rest of the code.

What if a tree's roots, trunk and crown were all mixed up? Would it still have an advantage over other plants?

At the beginning of 2012 I spoke with Patrick Viry, former CEO of the now-defunct Ateji. We shared ideas on GPGPU, OpenCL and programming in general. While talking about the strengths of his product, he made a remark I found important and interesting: separation of transfer. This triggered me to think further – those were the times when you could not simply read up on modern computing, but had to define it yourself.

Separation of focus areas is known to increase effectiveness, but is said to be for experts only. I completely disagree – the big languages just don't have good support for defining this separation of concerns.

For example, the concept of loops is well-known to all programmers, but OpenCL and CUDA have broken with it. Instead of using huge loops, those languages describe what has to be done at one location in the data and which data is to be processed. From what I see, this new type of loop is being abandoned in higher-level languages, even though it is a good design pattern.
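To make that concrete in Java terms – a minimal sketch using Aparapi as a stand-in for OpenCL/CUDA, with illustrative names – here is the same computation written first as a classic loop and then as a per-work-item kernel:

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

public class LoopVersusKernel {
    public static void main(String[] args) {
        final int n = 1024;
        final float[] a = new float[n];
        final float[] b = new float[n];
        final float[] out = new float[n];

        // Classic loop: the programmer spells out the iteration over the whole data set.
        for (int i = 0; i < n; i++) {
            out[i] = a[i] * a[i] + b[i];
        }

        // Kernel style: only the work for a single location in the data is described;
        // the runtime decides how the index space is mapped onto hardware threads.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int i = getGlobalId();            // which element am I responsible for?
                out[i] = a[i] * a[i] + b[i];
            }
        };
        kernel.execute(Range.create(n));
        kernel.dispose();
    }
}
```

The kernel form only states what happens at one index; how (and in what order) the whole range is traversed is left to the runtime and the hardware.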

I would like to discuss separation of compute and transfer from the rest of the code, to show that this will improve the quality of code. Continue reading “Separation of Compute and Transfer from the rest of the code.”

Software Development

Have you developed software that gives the answers you need but takes too long? Or maybe you need to process large data-sets on an hourly basis, while the batch takes 2 hours?

What do you do when faster hardware becomes too costly to maintain? You can buy specialised hardware, but that increases costs and dependence on external knowledge. Or you can choose to simply wait for the results to come in, but that only works when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting time? By using OpenCL devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor of 2 to 20. Why? Because these devices can do much more in parallel, and OpenCL makes it possible to use that (unused) potential. A few years ago this was not possible in the way it is done now; that's probably the main reason you haven't heard of it.

Solutions

Everything we offer comes in three solutions: finding what is already available, making a parallel version of the code, and hand-tuning the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • “Good enough” for many use-cases
  • Faster code, the easy way.
  • Gives high performance for generic problems.

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Better caching can give more boost than using faster hardware
  • Software running in parallel is a first step to GPU-computing
  • Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • The highest performance is guaranteed
  • Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are so many possibilities to speed up code, but only one is the best fit. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

  • Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
  • Code-review of CPU-code (Java, C, C++ and more).
  • Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

  • Find parallelisable computations
  • Assess how fit they are to run on GPUs
  • Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Architecture check-up
  • Data-transport measurements
  • Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

N-Queens project from over 10 years ago

Why you should just delve into porting difficult puzzles to the GPU, to learn GPGPU-languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you have not picked one yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it's unfinished. As it can be scary to share such an unfinished project, I'll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to further optimizations. And of course 90% of the potential optimizations don't work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never gets boring. The personal project where I had this overwhelmed-plus-underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. AMD's tools were barely available back then, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the numbers of solutions to the N-Queens problem for board sizes 1 through 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
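For reference, these counts come from a plain backtracking search. A minimal, unoptimised, single-threaded sketch in Java (not the GPU code this article is about) could look like this:

```java
// Counts N-Queens solutions by backtracking: place one queen per row,
// and only descend into placements that do not attack earlier queens.
public class NQueensCount {

    static long count(int n) {
        return place(n, 0, new int[n]);
    }

    // cols[r] holds the column of the queen already placed in row r.
    static long place(int n, int row, int[] cols) {
        if (row == n) return 1;                  // all queens placed: one solution
        long solutions = 0;
        for (int col = 0; col < n; col++) {
            if (isFree(row, col, cols)) {
                cols[row] = col;
                solutions += place(n, row + 1, cols);
            }
        }
        return solutions;
    }

    static boolean isFree(int row, int col, int[] cols) {
        for (int r = 0; r < row; r++) {
            if (cols[r] == col) return false;                      // same column
            if (Math.abs(cols[r] - col) == row - r) return false;  // same diagonal
        }
        return true;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + ": " + count(n));
        }
    }
}
```

The GPU approaches discussed in this article are essentially ways of cutting this search tree into many independent sub-trees and spreading them over work-items.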

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while NVIDIA hardware took half that. I knew it did not use the full potential of the GPU, because background bitcoin-mining only dropped to 55% and not to 0%. 🙂 I just had to find those missed optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you can see which questions I asked myself to get to the solution. I hope it also gives some insight into how I work, and into the hard part of the job: most of the energy goes into preparations that yield no direct results.

Continue reading “N-Queens project from over 10 years ago”

OpenCL Wrappers

Mostly they provide simplified host-code for setting up and running kernels, with more convenient error-checking, but some come with quite advanced additions: these are the wrappers for OpenCL 1.x. As OpenCL is an open standard, these projects are an important part of the evolution of OpenCL.

You won't find solutions here that provide a new programming paradigm or that work with pragmas to generate optimised OpenCL. This list is a work in progress.

C++

Goopax: Goopax is an object-oriented GPGPU programming environment, allowing high performance GPGPU applications to be written directly in C++ in a way that is easy, reliable, and safe.

OCL-Library: Simplified OpenCL in C++. Documentation. By Prof. Tim Warburton and David Medina.

Openclam: possibility to write kernels inside C++ code. Project is not active.

EPGPU: provides expressions in C++. Paper. By Dr. Lawlor.

VexCL: a vector expression template library. Documentation.

Qt-C++. The advantages of Qt in the OpenCL programmer’s hands.

Boost.Compute. An extensive C++ library using Boost. Documentation.

ArrayFire. A wrapper and a library in one.

OpenCLHelper. Easy to run kernels using OpenCL.

SkelCL. A more advanced C++ wrapper, providing various skeleton functions.

HPL, or the Heterogeneous Programming Library, where special Arrays are used to easily communicate between CPU and GPU.

ViennaCL, a wrapper around an OpenCL-library focused on linear algebra.

C, Objective C

C Framework for OpenCL. Rapid development of OpenCL programs in C/C++.

Simple OpenCL. Much simpler host-code.

COPRTHR: STDCL: Simplified programming interface for OpenCL designed to support the most typical use-cases in a style inspired by familiar and traditional UNIX APIs for C programming.

Grand Central Dispatch: integration into Apple’s environment. Documentation [PDF].

GoCL: For combination with Gnome GLib/GObject.

Computing-Language-Utility: C/C++ wrapper by Intel. Documentation included, slides of presentation here.

Delphi/Pascal

Delphi-OpenCL: Delphi/Pascal-bindings for OpenCL. Seems not active.

OpenCLforDelphi: OpenCL 1.2 for Delphi.

Fortran

Howto-article: It describes how to link with c-files. A must-read for Fortran-devs who want to integrate OpenCL-kernels.

FortranCL: OpenCL interface for Fortran 90. Seems to be the only mature wrapper around. FAQ.

Wim’s OpenCL Integration: contains a very simple f95 file ‘oclWrapper.f95’.

Go

GOCL: Go OpenCL bindings

Go-OpenCL: Go OpenCL bindings

Haskell

HopenCL: Haskell-bindings for OpenCL. Paper.

Java

JavaCL: Java bindings for OpenCL.

ClojureCL: OpenCL 2.0 wrapper for Clojure.

ScalaCL: Much more advanced integration than is possible with JavaCL.

JoCL by JogAmp: Java bindings for OpenCL. Good integration with the sister-projects JoGL and JoAL.

JoCL.org: Java bindings for OpenCL.

The Lightweight Java Game Library (LWJGL): Support for OpenCL 1.0-1.2 plus extensions and OpenGL interop.

Aparapi, a very high level language for enabling OpenCL in Java.

JavaScript

Standardised WebCL-support is coming via the Khronos WebCL project.

Nokia WebCL. Javascript bindings for OpenCL, which works in Firefox.

Samsung WebCL. Javascript bindings for OpenCL, which works in Safari and Chromium on OSX.

Intel Rivertrail. Built on top of WebCL.

Julia

JuliaGPU. OpenCL 1.2 bindings for Julia.

Lisp

cl-opencl-3b: Lisp-bindings for OpenCL. Not active.

.NET: C#, F#, Visual Basic

OpenCL.NET: .NET bindings for OpenCL 1.1.

Cloo: .NET bindings for OpenCL 1.1. Used in OpenTK, which has good integration with OpenGL, OpenGL|ES and OpenAL.

ManoCL: Not active project for .NET bindings.

FSCL.Compiler: FSharp OpenCL Compiler

Perl

Perl-OpenCL: Perl bindings for OpenCL.

Python

General article

PyOpenCL: Python bindings for OpenCL with convenience wrappers. Documentation.

Cython: C-extension for Python. More info on the extension.

PyCL: not active.

PythonCL: not active.

Clyther: Python bindings for OpenCL. No code yet, but check out the predecessor.

Ruby

Ruby-OpenCL: Ruby bindings for OpenCL. Not active.

Barracuda: seems to be not active.

Rust

Rust-OpenCL: Rust-bindings for OpenCL. Blog-article by the author.

Math-software

Mathematica

OpenCLLink. OpenCL-bindings in Mathematica 9.

Matlab

There is native support in Matlab.

OpenCL-toolbox. Alternative bindings. Not active. Works with Octave.

R

R-OpenCL. Interface allowing R to use OpenCL.


Suggestions?

If you know of similar projects that are not mentioned here, let us know! Even if they are not active, as that is important information too.

Code Review

Code reviews are one of the fastest ways to get the dev-team back on track and add performance to the code. We offer two types of code review, both safely under an NDA. This way you stay in control of the development while bringing expert knowledge in.

A quick scan gives you an overview of the main ways to speed up the code and how it can be done.
This quick scan can be delivered in one week, if necessary, to give you the direction you may require in times of pressure.

Also, an extensive code review can provide all the necessary information for a redesigned architecture.

GPU-code (OpenCL, CUDA, Aparapi, and more)

Writing GPU-code and performant host-code can be tricky. The best way to learn CUDA or OpenCL is by doing. Nevertheless, you may sometimes need feedback to be sure you're doing the right thing. We can check your code and give you a report with hands-on tricks to make it optimal.

CPU-code (Java, C, C++ and more)

Much CPU code, whether written in Java, C, C++ or C#, is written with functionality in mind, but not performance. Adding performance (cache optimisation, memory-usage reduction, parallelisation of computations, adding OpenMP threads, etc.) is quite doable, but only when you know how. We can help you increase the performance of your software through feedback and clear steps.

Let us help you!

If you are interested in this service, request more information today and we will get back to you as soon as possible. Of course, you can also contact us via phone (+31 6 45400456), or e-mail (info@streamhpc.com).

OpenCL Developer support by NVIDIA, AMD and Intel

There was some guy at Microsoft who understood IT very well while being a businessman: “Developers, developers, developers, developers!”. You saw it again in the mobile market and now with OpenCL. Normally I watch his yearly speech to see which product they have brought to their own ecosphere, but the developers-speech is one to watch over and over because he is so right about this! (I don’t recommend the house-remixes, because those stick in your head for weeks.)

Since OpenCL needs to be optimised for each platform, it is important for these companies that developers start developing for their platform first. StreamComputing is developing a few different Eclipse plug-ins for OpenCL development, so we were curious what was already out there. Why not share all our findings with you? I will keep this article updated – note that this article does not cover which features are supported by each SDK.

Continue reading “OpenCL Developer support by NVIDIA, AMD and Intel”

Targeting various architectures in OpenCL and CUDA

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How do you make software that performs on several platforms?”

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in a lot of Java code you'll find hard-coded path-delimiters in file-names, which then work on one OS only. Portability is a problem that exists in various aspects of programming. Let's look at some of the main goals software can have, and which portability problems each of them has.

  • Functionality. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
  • User-interface. This is what one sees and which is not too abstract to talk about. For example, porting software to a touch-device requires a lot of rethinking of interaction-principles.
  • API and library usage. To lower development-time, existing and known APIs and libraries are used. This can work out three ways: separation of concerns, less development-time and dependency. The first two being good architectural choices, the latter being a potential hazard. Changing the underlying APIs is not easy.
  • Data-types. Handling video is different from handling video-formats. If the files can be handled in the intermediate form used by the software, then adding new file-types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only do corporate operating systems tend to think of their own platform only; competing standards do so too. It compares a lot to what is described under APIs.
  • Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will be the main point of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clearer that NVIDIA has this problem in their GPGPU-language too, whereas this point was previously used to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features that are only available on the latest Kepler GPUs.

Continue reading “Targeting various architectures in OpenCL and CUDA”

When Big Data needs OpenCL

Big Data in the previous century was the archive full of ring-binders and folders, which grew each year at the same pace. Now the definition is that it should grow each year as much as in all previous years combined.

A few months ago SunGard named 10 Big Data trends transforming financial services. I have used their list as a base for my own focus: increased computation demands, and not specific to this one market. This resulted in 7 general trends where Big Data meets/needs OpenCL.

Since the start of StreamHPC we have sought customers who could not compute through all their data in time. Back then Big Data was still a buzzword catching on, but it best describes this core business.

Continue reading “When Big Data needs OpenCL”

AMD GPUs & CPUs

[infobox type=”information”]

Need a programmer for OpenCL on AMD FirePro, Radeon or APU? Hire us!

[/infobox]

AMD has support for all their recent GPUs and CPUs, and has good performance on products starting from 2010/2011:

[list1]

[/list1]
AMD does not provide a standard SDK bundle containing both hardware and software, as their hardware is available in many computer shops.

SDK

The OpenCL SDK (software) needs to be downloaded in several steps:

[list1]

[/list1]

CodeXL replaces the following software in the AMD APP software family:

[list1]

[/list1]
These are still available for download.

Training

There is (free) training material available:

[list1]

[/list1]

Other AMD software for OpenCL

The APP Math Libraries contain FFT and BLAS functions optimised for AMD GPUs.

OpenCL-in-Java can be done using Aparapi.