Is the CPU slowly turning into a GPU?

It’s all in the plan?

Years ago I was surprised to learn that CPUs can also be programmed with OpenCL – I had chosen the language solely for the thrill of programming GPUs. It felt odd at first, but now I cannot imagine a world where OpenCL does not run on CPUs.

But why is it important? Who cares about the 4 cores of a modern CPU? Let me first go into why CPUs mostly stayed at two cores for so long, starting about 15 years ago. Simply put, it was very hard to program multi-threaded software that made use of all cores. Software like games did, as it needed all the available resources, but even the computations in MS Excel are mostly single-threaded as of now. Multi-threading was maybe used most for having a non-blocking user interface. Even though OpenMP was standardised 15 years ago, it took many years before the multi-threaded paradigm was used for performance. If you want to read more on this, search the web for “the CPU frequency wall”.

More interesting is what is happening with CPUs now. Both Intel and AMD are releasing CPUs with lots of cores. Intel recently released an 18-core processor (the Xeon E5-2699 v3) and AMD has been offering 16-core CPUs for a longer time (the Opteron 6300 series). Both have SSE and AVX, which means extra parallelism. If you don’t know what this is precisely about, read my 2011 article on how OpenCL uses SSE and AVX on the CPU.

AVX3.2

Intel now steps forward with AVX3.2 on their Skylake CPUs. AVX 3.1 is in the Xeon Phi “Knights Landing” – see this rumoured roadmap.

It is 512 bits wide, which means a single instruction can process 16 floats or 8 doubles. With 16 cores, this would mean 128 double operations – or 256 float operations – per clock-tick. Like a GPU.

The disadvantage is similar to the VLIW architecture we had in the pre-GCN generations of AMD GPUs: one needs to fill the vector instructions to get the speed-up. Also the relatively slow DDR3 memory is an issue, but lots of progress is being made there with DDR4 and stacked memory.
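To make that concrete, here is a minimal OpenCL kernel sketch (my own illustrative example, not from a specific project) that uses float16, so the work maps naturally onto 512-bit vector units:

__kernel void axpy16(__global const float16 *x,
                     __global float16 *y,
                     const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i]; // 16 single-precision multiply-adds per vector operation
}

If your data doesn’t come naturally in chunks of 16, the lanes stay partly empty and the theoretical speed-up evaporates – exactly the problem we had with VLIW.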


So is the CPU turning into a GPU?

I’d say yes.

With AVX3.2 the CPU gets all the characteristics of a GPU, except the graphics pipeline. That means that the CPU-part of the CPU-GPU is acting more like a GPU. The funny part is that with the GPU’s scalar-architecture and more complex schedulers, the GPU is slowly turning into a CPU.

In this 2012-article I discussed the marriage between the CPU and GPU. This merger will continue in many ways – a frontier where the HSA-foundation is doing great work now. So from that perspective, the CPU is transforming into a CPU-GPU; and we’ll keep calling it a CPU.

This all strengthens my belief in the future of OpenCL, as that language is prepared for both task-parallel and data-parallel programs – for both CPUs and GPUs, to say it in current terminology.

Using OpenCL 1.2 with 1.1 devices

1.1 and 1.2 nice together

More and more code uses OpenCL 1.2, which is a problem when you want to target OpenCL 1.1 devices, or even 1.0 devices. Most of those computers have OpenCL 1.1 libraries installed. So what to do?

When you want to write code that runs on Nvidia, AMD and Intel alike, you have two options:

  • Use the OpenCL 1.2 library.
  • Go back to OpenCL 1.1 (from 2010).

Below I’ll explain both options. Understand that the difference between 1.1 and 1.2 is not that big, and that this article is mostly here to get you started preparing for 2.0.

Use the OpenCL 1.2 library

Not many people seem to know that you can use an OpenCL 1.2 library with 1.1 devices. If you have an opencl.dll or libOpenCL.so of version 1.2, you simply avoid calling the 1.2-only functions on 1.1 devices. The advantage is that you can still use the 1.2 host-functionality with real 1.2 devices – otherwise that functionality was not really usable, except for the kernel-functions. To get version 1.2 of opencl.dll or libOpenCL.so, install the drivers of AMD or Intel, or just copy it from a friend.
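As a minimal sketch of how that looks in practice (the helper name is mine; error handling omitted): compile and link against the 1.2 library, but check each device’s version at runtime before calling 1.2-only functions.

#include <stdio.h>
#include <CL/cl.h>

// Returns 1 when the device reports OpenCL 1.2 or newer.
int device_supports_12(cl_device_id dev)
{
    char version[128]; // e.g. "OpenCL 1.1 AMD-APP (937.2)"
    int major = 0, minor = 0;
    clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof(version), version, NULL);
    sscanf(version, "OpenCL %d.%d", &major, &minor);
    return (major > 1) || (major == 1 && minor >= 2);
}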

You can go one step further: we have a wrapper that translates most 1.2 functions to 1.1, to port code to 1.1 more easily. This is a preparation for OpenCL 2.0, to stay partly backwards compatible – versioning is getting more important with each new version of OpenCL.
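To give an impression of what such a wrapper does, here is a simplified sketch of one function (2D images only; the function name is ours, not part of any official header). It maps the 1.2-style clCreateImage onto the 1.1 entry point clCreateImage2D:

#include <CL/cl.h>

cl_mem wrap_clCreateImage(cl_context ctx, cl_mem_flags flags,
                          const cl_image_format *format,
                          const cl_image_desc *desc,
                          void *host_ptr, cl_int *errcode_ret)
{
    // Only 2D images in this sketch; a full wrapper handles the other types too.
    if (desc->image_type == CL_MEM_OBJECT_IMAGE2D)
        return clCreateImage2D(ctx, flags, format,
                               desc->image_width, desc->image_height,
                               desc->image_row_pitch, host_ptr, errcode_ret);
    if (errcode_ret) *errcode_ret = CL_INVALID_VALUE;
    return NULL;
}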

When distributing your code, you need to distribute the library too. The problem is that this is not really allowed: the library should be installed by the user – and that user installs OpenCL 1.1 when they have NVidia hardware. Khronos distributes the code of the library, but it doesn’t compile easily under Windows. Please read the license.txt carefully.

Go for OpenCL 1.1 completely

The advantage is that you avoid all kinds of problems, by giving up some progress in the standard. As OpenCL 1.0 libraries are not really out there anymore, you probably won’t need to distribute the library yourself. And it’s just what Nvidia wanted: keeping the competition back. Congrats, Nvidia!

A good reason to go for 1.1 completely would be to avoid the hassle with deprecated functions.

Problems arise when using deprecated functions, or when using libraries that call deprecated functions. The resulting warnings can usually be worked around, but not avoided while coding. For instance the cl.hpp for OpenCL 1.1 (found here) gives loads of warnings when you have the 1.2 library. Luckily this can easily be solved by adding some pragmas (works with GCC 4.6 and higher).

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#include <CL/cl.hpp> /* or paste the contents of cl.hpp here */
#pragma GCC diagnostic pop

The source of this trick has some more remarks when using GCC 4.5.

A word towards Linux distributions

OSX has OpenCL 1.2 by default for the host. Linux distributions make the Nvidia drivers depend on OpenCL 1.1, which is not needed – as explained above. They can easily move to a dependency on the OpenCL 1.2 library. The code provided by Khronos could be used.

AMD Hawaii power-management fix on Linux

The new Hawaii-based GPUs from AMD (Radeon R9 2xx, FirePro W9100 and FirePro S9150) have a lot of improvements, one being the new OverDrive 6 (AMD’s version of NVIDIA GPU Boost). The problem is that it’s not supported yet in the Linux drivers, so performance will be too low – this will probably be solved in the next driver version. Luckily there is od6config, made by Jeremi M Gosney.

Follow the steps below to get the GPU running at normal speed.

  1. Download the zip or tar.gz from http://epixoip.github.io/od6config/ and unpack.
  2. Go to the directory where you unpacked the archive.
  3. Run:
    make
  4. Run:
    sudo make install
  5. Check whether the power management needs fixing:
    od6config --get clocks,temp,fan
  6. If the values are too low, run:
    od6config --autofix --set power=10
  7. Check if it worked:
    od6config --get clocks,temp,fan

Only OverDrive6 devices are set; devices using OverDrive5 will be ignored.

A PowerTune value of 10 was what we found convenient, but you might find better values for your case. There are several more options, which are described on the homepage of od6config. Note that you need to run “od6config --autofix --set power=10” on each reboot.

Remember it’s third party software, so no guarantees to you and no “you killed my GPU” to us.

Call for Papers, Presentations, Workshops and Posters for IWOCL in Stanford

The IWOCL 2015 call for OpenCL Papers is now open and is looking for submissions from industry and academia relating to the use of OpenCL. Submissions may refer to completed projects or those currently in progress, and are invited in the form of:

  • Research Papers
  • Technical Presentations
  • Workshops and Tutorials
  • Posters

Examples of sessions from 2014 can be found here.

Deadlines at a Glance

Call for submissions OPENS: Wednesday 19th November, 2014
Call for submissions CLOSES: Saturday 14th February, 2015 (23:59 AOE)
Notifications: Within 4 weeks of the final closing date

Selection Criteria

The IWOCL Technical Committee will select submissions based on the following criteria:

  • Concept of the submission and its relevance and timeliness
  • Technical Depth
  • Clarity of the submission; clearly conveying what your presentation will cover
  • Research findings and results of your work
  • Your credentials and expertise in the subject matter

Unpublished Technical Papers

We solicit the submission of unpublished technical papers detailing original research related to OpenCL. All topics related to OpenCL are of interest, including OpenCL applications from any domain (e.g., scientific computing, video games, computer graphics, multimedia, information retrieval, optimization, text processing, data mining, finance, signal and image processing and numerical solvers), OpenCL performance analysis and modeling, OpenCL performance and correctness tools and proposed OpenCL extensions. IWOCL will publish formal proceedings of the accepted papers in The ACM International Conference Series. Please Submit an Abstract which should be between 1 and 4 pages long.

Technical Presentations

We solicit the submission of technical presentations detailing the innovative use of OpenCL. All topics related to OpenCL are of interest, including but not limited to applications, software tools, programming methods, extensions, performance analysis and verification. Please Submit an Abstract which should not exceed 4 pages. The accepted presentations will be published in the online workshop proceedings.

Workshops & Tutorials

IWOCL includes a day of tutorials that provide OpenCL users an opportunity to spend more time exploring a specific OpenCL topic. Tutorial submissions should assume working knowledge of OpenCL by the attendees and can for example cover OpenCL itself, any of the related APIs such as SPIR and SYCL, the use of OpenCL libraries or parallel computing techniques in general using OpenCL. Please Submit an Abstract which should not exceed 4 pages. Please include the preferred length of the tutorial or workshop (e.g. 2, 3 or 4 hours).

Posters

To encourage discussion of the latest developments in the OpenCL community, there will be a poster session running in parallel to the main sessions and open during the breaks and lunch sessions. The abstracts of the accepted posters will be published in the form of short communications in the workshop proceedings, provided that at least one of the authors has registered for the workshop. Please Submit an Abstract which should not exceed 2 pages.

Submit your abstract today

Go to Easychair, log in or register, and click on “New Submission”. Deadline is 14 February.

We sponsor HiPEAC again this year

HiPEAC is an academically oriented, 3-day, international conference around HPC, compilers and processors. Last year it was in Vienna; this year it is in Amsterdam – where StreamHPC is also based. That was an extra reason to go for silver sponsorship, besides the fact that I find this conference very important.

Compilers have the job of doing magic. Last year I got nice feedback on my request to tell developers where in the code the compiler struggles – effectively slapping the developer instead of trying to solve it with more magic. I also learned a lot about compilers in general, listened to GPGPU talks, discussed HPC, and most of all: met a lot of very interesting people.

Why should you come too? I’ll give you five reasons:

  • Learn about compilers and GPU-techniques, in depth.
  • Have great discussions about the latest and greatest research, before it’s news.
  • Meet great people who create the compilers you use (or the reverse).
  • Visit Amsterdam, Netherlands – I can be your guide. Flights are cheap.
  • Only spend €400 for the full 3 day programme and a unique dinner with 500 people – compare that to SC14 and GTC!

If you are seeking a job in HPC, compilers or GPGPU, you should really come over. We’re there, and several other sponsors are looking for new employees too.

See the tracks at HiPEAC, which has a lot more GPU-oriented talks than last year. I selected a few from the list in bold.

Monday

  • Opening address
  • William J. Dally, Challenges for Future Computing Systems
  • Euro-TM: Final Workshop of the Euro-TM COST Action
  • Session 1. Processor Core Design
  • CS²: Cryptography and Security in Computing Systems
  • IMPACT: Polyhedral Compilation Techniques
  • MCS: Integration of mixed-criticality subsystems on multi-core and manycore processors
  • EEHCO: Energy Efficiency with Heterogeneous Computing
  • INA-OCMC: Interconnection Network Architecture: On-Chip, Multi-Chip
  • WAPCO: Approximate Computing
  • SoftErr: Mitigation of soft errors: from adding selective redundancy to changing the abstraction stack
  • Session 2. Data Parallelism, GPUs
  • James Larus, It’s the End of the World as We Know It (And I Feel Fine)
  • ENTRE: EXCESS & NANOSTREAMS
  • SiPhotonics: Exploiting Silicon Photonics for energy-efficient high-performance computing
  • HetComp: Heterogeneous Computing: Models, Methods, Tools, and Applications
  • Session 3. Caching
  • Session 4. I/O, SSDs, Flash Memory
  • Student poster session / Welcome reception

Tuesday

Don’t forget to meet us at the industrial poster-sessions.

  • Rudy Lauwereins, New memory technologies and their impact on computer architectures
  • Thank you HiPEAC
  • Session 5. Emerging Memory Technologies
  • EMC²: Mixed Criticality Applications and Implementation Approaches
  • ADEPT: Energy Efficiency in High-Performance and Embedded Computing
  • MULTIPROG: Programmability Issues for Heterogeneous Multicores
  • WRC: Reconfigurable Computing
  • TISU: Transfer to Industry and Start-ups
  • HiStencils: High-Performance Stencil Computations
  • MILS: Architecture and Assurance for Secure Systems
  • Programmability: Programming Models for Large Scale Heterogeneous Systems
  • Industrial Poster Session
  • INNO2015: Innovation actions in Advanced Computing CFP
  • Session 6. Energy, Power, Performance
  • DCE: Dynamic Compilation Everywhere
  • EUROSERVER: Green Computing Node for European Micro-servers
  • PolyComp: Polyhedral Compilation without Polyhedra
  • HiPPES4CogApp: High-Performance Predictable Embedded Systems for Cognitive Applications
  • Industrial Session
  • Session 7. Memory Optimization
  • Session 8. Speculation and Transactional Execution
  • Canal tour / Museum visit / Banquet

Wednesday

  • Burton J. Smith, Resource Management in PACORA
  • HiPEAC 2016
  • Session 9. Resource Management and Interconnects
  • PARMA-DITAM: Parallel Programming and Run-Time Management Techniques for Many-core Architectures + Design Tools and Architectures for Multi Core Embedded Computing Platforms
  • ADAPT: Adaptive Self-tuning Computing System
  • PEGPUM: Power-Efficient GPU and Many-core Computing
  • HiRES: High-performance and Real-time Embedded Systems
  • RAPIDO: Rapid Simulation and Performance Evaluation: Methods and Tools
  • MemTDAC: Memristor Technology, Design, Automation and Computing
  • DataFlow, Computing in Space: DataFlow SuperComputing
  • IDEA: Investigating Data Flow modeling for Embedded computing Architectures
  • TACLe: Timing Analysis on Code-Level
  • EU Projects Poster Session
  • Session 10. Compilers
  • HIP3ES: High Performance Energy Efficient Embedded Systems
  • HPES: High Performance Embedded Systems
  • Session 11. Concurrency
  • Session 12. Methods (Simulation and Modeling)

Hopefully till then!

How to introduce HPC in your enterprise

Spare time in IT – © jaymz.eu

For the past ten years we have been happy to get back home from the office: our home computer is simply faster, has more software and more memory, and does not take over 10 minutes to boot. Office computers can be that slow, because 90% of the work is typing documents anyway. Meanwhile the office servers are mostly used for the intranet and backups only. It’s the way of life, and it seems we have to accept it.

But what if you have a daily batch that takes an hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to serve your colleagues faster? Then Office-HPC can be the answer: the type of High Performance Computing that is affordable and within reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take one or two weeks only, as it’s more about understanding the basics.

Understand what it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. This is very important and should be done before anything else. Even if you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using Nvidia hardware. Delay that decision until later, so you don’t end up with the wrong solution.

How do you get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return on investment is very good, even when only 10 people are using the software in the company. If the waiting period per person is reduced by 20 minutes per day, it’s easy to see that it pays back quickly: that’s 80 hours per person per year. With 10 people, that is already €20K per year. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could earn back the investment within a year, easily.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or XeonPhi and then ask me to speed up their software. I often get the question “how do I speed up this algorithm on this device?”, while the question should be “how do I speed up this algorithm?”. It takes some time to find out which device fits the algorithm best.

There is too much to discuss in this phase, so I’ll keep it to a short Q&A. Please ask us for advice, as this phase is very important! We prefer helping people for free over reading about failed “HPC in the office” projects (which give others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, forever? Then use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example, when developing with several partners, you cannot stick to CUDA and should use OpenCL – else you force others to make certain investments. But if you have some domain-specific compute-engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and make your decision based on good research – not based on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP). They’re famous for low-latency computations. If you already have OpenCL software, it ports quite easily to the FPGA – therefore I like the combination of AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilinx recently also started to support OpenCL on their devices. They have the same reason as Altera: to shorten the development time for FPGA code.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when branch prediction can be used heavily. And by using OpenCL you can get more performance than with OpenMP, plus you can port between devices much more easily.

Q: I heard I should not use gaming GPUs. Why not?

Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming GPUs suffice, you’ll need the support before you get to a cluster – free support is mostly community-based and only answers the problems everybody has. Also, libraries are often better tuned for professional cards. See it like this: gaming GPUs come with free games, professional compute GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

  • Intel: Go for the XeonPhi’s that end with an “A” (= actively cooled).
  • NVIDIA: For the newly announced K80 there will not be an actively cooled version – so take the actively cooled K40.
  • AMD: Instead of the S9150, get a W9100.
  • Altera: Low-power, so you can use the same device. Do ask your supplier specifically whether this applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to eventually see performance in a cluster, it’s better to have at least two accelerators in your computer. This is a big investment, but also a good one: it’s the first step towards getting HPC in your office, so better do it well. Make sure your CPU has at least as much memory as your accelerators combined, if you want to use all of the GPUs’ memory. The S9150 has 16GB of memory, so you need 32GB to support two cards.

If you make use of an external software development company, you still need a good machine to test the software and to understand the code that will be rolled out in your company. Control and understanding of the code is very important when working with consultants!

In case you did not get through phase 1 completely, it’s better to test with one accelerator first. Unless you need something like OpenGL/OpenCL interaction, make sure you use a separate GPU for the video output, as desktop usage can influence the GPU’s performance.

Program your software using MPI to connect the two accelerators, and be in full control of what is blocking, so you are prepared for the cluster.
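A minimal sketch of that setup (illustrative only – real code needs error handling, and it assumes at least one GPU is found): each MPI rank claims its own accelerator, so the same binary later scales from your desktop to a cluster.

#include <mpi.h>
#include <CL/cl.h>

int main(int argc, char **argv)
{
    int rank;
    cl_platform_id platform;
    cl_device_id gpus[2];
    cl_uint found;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, gpus, &found);
    cl_device_id mine = gpus[rank % found]; // one accelerator per rank

    // ... create a context and queue on 'mine'; exchange data via
    // MPI_Isend/MPI_Irecv so you decide explicitly what blocks ...

    MPI_Finalize();
    return 0;
}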

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say you have chosen to offer your compute solution via an Excel plugin, which communicates with the software via an API. Add new users one at a time – and make sure (parts of) the results are tested! From here on it’s software development as we know it, and the most unexpected bugs come out of the test group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you probably will get good results anyway. The moment you have set up the MPI environment on multiple desktops, you have just set up your minimal test street. This is very important for later, as many enterprises lack a test street – then it’s better to have it partially shared with your development environment. I’m pretty sure I’ll get comments on this, but I would really like more companies to do larger-scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

If your algorithm is not CPU-bound, then it’s best to have as many GPUs per CPU as possible; else keep it to one or two. We can already give you advice on this in phase 1, so you know what to prepare for. Then comes the most important step: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node with 8 GPUs to support even thousands of users.

Say the algorithm is not CPU-bound; then it’s best to put as many GPUs per node as possible. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe3 x16 buses. There are more options available, but those systems don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes, you should look at completely different server cases, as the cooling requirements are different. For Xeon-only nodes, you can find solutions with 4 CPU sockets.

If you are allowed to transport company-data outside the local network and can handle the data-transports over the internet, then a cloud-based service might also be a choice. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!

AMD now leads the Green500

With SC14 behind us, there are a few things I’d like to share with you. I’d like to start with the biggest win for OpenCL: AMD leading in the most power-efficient GPU cluster.

A few months ago I wrote a theoretical article on how to build the cheapest and greenest supercomputer that could enter the Top500 and Green500. There I showed that AMD would theoretically win on both GFLOPS/costs and GFLOPS/Watt. Last week I learned that a large cluster is actually being built in Germany, and it now leads the Green500 (GFLOPS/Watt). It is powered by Intel Ivy Bridge CPUs, an FDR InfiniBand network, and accelerated by air-cooled(!) AMD FirePro S9150 GPUs, as can be seen in the Green500 report of November. The score: 5.27 GFLOPS per Watt, mostly thanks to AMD’s surprise act: extremely efficient SGEMM and DGEMM.


The first NVIDIA Tesla-based system on the list is at #3, with 4.45 GFLOPS per Watt for a liquid-cooled system. If the AMD FirePro S9150 were oil- or water-cooled, the system could go over 6 GFLOPS per Watt. I’m expecting such a system on the Green500 of June. The PEZY-SC (#2 on the list) is a very interesting, unexpected newcomer to the field – I’ll share more with you later, as I heard it supports OpenCL.

The price metric

The cluster at the GSI Helmholtz Centre has around 1.65 double-precision PetaFLOPS (theoretical). Let’s do the same calculation as with the 150 TFLOPS system, using the latest prices and only taking the accelerator part.

640 x AMD FirePro S9150.

  • 2.53 TFLOPS * 640 = 1.62 PFLOPS (I rounded down to 2.0 TFLOPS in the other article)
  • US$ 3300 each. Total price: $2.112M. Price per PFLOPS: $1.304M
  • 235 Watt * 640 = 150 kWatt (excluding network, CPU, etc.)

640 x NVIDIA Tesla K40

  • 1.42 TFLOPS * 640 = 0.91 PFLOPS
  • US$ 3160 each (the price dropped a lot due to the introduction of the K80!). Total price: $2.022M. Price per PFLOPS: $2.225M
  • 235 Watt * 640 = 150 kWatt

640 x Intel XeonPhi 7120P

  • 1.21 TFLOPS * 640 = 0.77 PFLOPS
  • US$ 3450 each. Total price: $2.208M. Price per PFLOPS: $2.853M
  • 300 Watt * 640 = 192 kWatt

So it’s pretty clear why GSI chose AMD: $0.92M or $1.55M less cost per PFLOPS for the same performance. Also note that more FLOPS per accelerator is important, to lower overhead.

What to expect from June’s Green500

Next year Nvidia probably comes with Maxwell, which will probably do very well in the Green500. Intel has their new XeonPhi, but it’s a very new architecture and no samples have arrived yet – I would be surprised if it wins, as they have over-promised for too long now. Besides bringing surprises, Intel’s other strengths are its vast collaborations and strong fanbase – over the past years I have heard the most ridiculous justifications for choosing such an underperforming accelerator over FirePro or Tesla, so Intel is certainly aiming for a rampage (based on hope). AMD did not disclose any information on a successor of the S9150 (something like an S9200 or S9250).

Then there are the dual GPUs, which have no advantages except lower energy usage. The K80 just arrived, but the numbers don’t add up yet – we’ll have to see when the samples arrive. AMD did not say anything about the next version of the S10000, but it probably arrives next year – no ETA. Intel has not done dual-chip cards until now. Such systems can be built more compactly, as 4 GPUs per system is becoming a standard.

Another important change will be CPUs with embedded GPUs being used in clusters, where now mostly Intel Xeons rule the world. Intel’s Iris Pro line and AMD’s new Carrizo APU could certainly get more popular, as more complex code can be accelerated very well by such processors. We’ll also see more 64-bit ARM processors – hopefully with GPUs. I’ll handle this subject in a separate article, as OpenCL could be a big enabler for easy offloading.

Based on the information currently available: Nvidia aims for Maxwell-based Teslas, AMD for the S9150 and its dual-GPU variant, and Intel for none (aiming for November 2015). It’ll be exciting to see HPC reach 6+ GFLOPS/Watt as a standard – I find that more important than building the biggest cluster.

OpenCL will help select hardware from that year’s winner, not being locked in to that year’s loser. Meanwhile at StreamHPC we will keep building OpenCL-based software, to help our customers pick that winner.

What does Khronos have to offer besides OpenCL and OpenGL?

The OpenCL standard is from the not-for-profit industry consortium Khronos Group. But they do a lot more, like the famous graphics standard OpenGL. The focus of the group has always been on multimedia and on getting the fastest results out of the hardware.

Now that open source and open standards are getting more important, collaborations like the Khronos Group get more attention. At StreamHPC we are very happy with this trend, as the business models focus more on collaboration and getting things done than on making sure the customer can never leave.

Below is an overview of the most important APIs that Khronos has to offer.

OpenCL related

  • OpenCL: compute
  • WebCL: web compute
  • SPIR/SPIR-V: intermediate language for compute-kernels, like those of OpenCL and OpenGL’s GLSL
  • SYCL: high-level language for OpenCL

OpenGL related

  • Vulkan: state-less graphics
  • OpenGL: graphics
  • OpenGL ES: embedded graphics
  • WebGL: web graphics
  • glTF: runtime asset format for WebGL, OpenGL ES, and OpenGL
  • OpenGL SC: Graphics for Safety Critical operations
  • EGL: interface between rendering APIs such as OpenGL ES and the underlying native platform window system, such as X.

Streaming input and output

  • OpenMAX: interface for multimedia codecs, platforms and hardware
  • StreamInput: interface for sensors
  • OpenVX: OpenCV-alternative, built for performance.
  • OpenKCam: interface for cameras and sensors

Others

One video called “OpenRoad” to show them all:

Want to learn more? Feel free to ask in the comments, or check out https://www.khronos.org/

Starting with GROMACS and OpenCL

Now that GROMACS has been ported to OpenCL, we would like you to help us make it better. Why? It is very important that we get more projects ported to OpenCL, to get more critical mass. If we only used our spare resources, we could port one project per year. So the deal is that we do the heavy lifting, and with your help we get the last issues covered. Understand that we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The below steps will take no more than 30 minutes.

Getting the sources

All sources are available on Github (our working branch, based on GROMACS 5.0). If you want to help, check the sources out via git (on the command-line, or via Visual Studio (git support is included in 2013 and available as a plugin for 2010 and 2012), Eclipse or your preferred IDE), or simply download the zip-file. Note there is also a wiki, where most of this text came from – especially check the “known limitations”. To check out via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working build environment (GCC, Visual Studio) and an OpenCL SDK installed. You also need FFTW. The GROMACS build can compile it for you, but it is also in the Linux repositories, or can be downloaded here for Windows. Below is for Linux, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other build options. You don’t need them, but they give an idea of what is possible:

  • -DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
  • -DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
  • -DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
  • -DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
  • -DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
  • -DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
  • -DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
  • -DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
  • -DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
  • -DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
  • -DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then run make, and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X threads. Below is with 4 threads (4-core CPU):

make -j 4

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the Gromacs website. In all cases, it is very important you turn on GMX_GPU and GMX_USE_OPENCL. Also the wiki of the Gromacs OpenCL project has lots of extra information. Be sure to check them, if you want to do more than just the below benchmarks.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

Gromacs needs to know where to find the OpenCL kernels, on both Linux and Windows. Under Linux, type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. For Windows, define the GMX_OCL_FILE_PATH environment variable and set its value to /path_to_gromacs/src/.

Important: if you plan to make changes to the kernels, you need to disable the caching in order to be sure you will be using the modified kernels: set GMX_OCL_NOGENCACHE and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE
export CUDA_CACHE_DISABLE

Simple benchmark, CPU-limited (d.poly-ch2)

Then download the archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks and unpack it in the build/bin folder. If you have installed GROMACS machine-wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/:

cd d.poly-ch2
../gmx grompp
../gmx mdrun

Now you just ran Gromacs and got results like:

Writing final coordinates.

           Core t (s)   Wall t (s)      (%)
 Time:        602.616      326.506    184.6
             (ns/day)   (hour/ns)
Performance:    1.323      18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run on the first GPU only, add “-gpu_id 0” as a parameter of mdrun. This is handy if you want to benchmark a specific GPU.

What’s next to do?

If you have your own experiments, of course test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that GROMACS was optimised for NVidia hardware, and we needed to undo a lot of hardware-specific optimisations to get good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any questions? Just ask them in the comments below, and we’ll help you on your way.


Why this new AMD FirePro Cluster is important for OpenCL

FirePro S9150 cluster

Then it hit the doormat:

“AMD is proud to collaborate with ASUS, the Frankfurt Institute for Advanced Studies (FIAS) and GSI to support such important physics and computer science research,” said David Cummings, senior director and general manager, professional graphics, AMD. “This installation reaffirms AMD’s leading role in HPC with the implementation of the AMD FirePro S9150 server GPUs in this three petaFLOPS supercomputer cluster. AMD and ASUS are enabling OpenCL applications for critical science research usage for this cluster. We’re committed to building our HPC leadership position in the industry as a foremost provider of computing applications, tools and technologies.”

You can read more here, and the official news here.

Why is this important?

Is it that there are more FLOPS for the same price, as AMD hardware is cheaper? Nice, but secondary.

That it runs OpenCL? We like that, but from a broader perspective this is not the most important.

It is important because it creates more diversity in the world of HPC. Currently there are a few XeonPhi clusters and only one big AMD FirePro S10000 cluster; the rest is NVidia Tesla or CPU-only. With more AMD clusters, the HPC market is democratised. That means that more software will be written in vendor-neutral languages like OpenCL (with high-level software/libraries on top), and prices of HPC accelerators will not be kept high.

How to further democratise the HPC world?

We started with porting GROMACS to OpenCL, and we will continue to port large projects to OpenCL. This software will simply run on XeonPhi, Tesla and FirePro with just a little porting time, reducing costs in many ways. We cannot do it alone, but together we can. Start by telling us which software needs to be ported from OpenMP to OpenCL or OpenMP 4, or from CUDA to OpenCL. And if you are porting open source software to OpenCL, drop us a line for free advice and help with testing the software.

And the best you can do to break the monopoly of CUDA, is to simply buy AMD or Intel hardware. The price difference is enough to buy lots of extra FLOPS and to pay for a complete porting project to OpenCL of a large application.


OpenCL at SC14

During SC14 (the SuperComputing Conference 2014), OpenCL is again all over New Orleans. Just like last year, I’ve composed an overview based on info from the Khronos website and the SC2014 website.

I’m finally attending SC14 myself, and will give two talks. On Tuesday I’ll be part of a 90-minute Khronos session, where I’ll talk a bit about GROMACS and selecting the right accelerator for your software. On Wednesday I’ll share our experiences from our port of GROMACS to OpenCL. If you meet me, I can hand you a leaflet with the decision chart that helps select the best device for the job.


OpenCL integer rounding in C

Square pant rounding can simply be implemented with “return (NAN);”.

Getting about the same code in C and OpenCL has lots of advantages, when maximum optimisation and vectors are not needed. One thing I bumped into myself is that rounding in C++ is different, so I decided to implement the OpenCL functions for rounding in C.

The OpenCL-page for rounding describes many, many functions with this line:

destType convert_destType<_sat><_roundingMode>(sourceType)

So for each sourceType-destType combination there is a set of functions: 4 rounding modes and an optional saturation. In Ruby it would be easy to generate each of these functions, but in C it takes a lot more time.

The 4 rounding modes are:

  • _rte – round to nearest even
  • _rtz – round towards zero
  • _rtp – round toward positive infinity
  • _rtn – round toward negative infinity

The below pieces of code should also explain what the functions actually do.

Round to nearest even

This means that numbers get rounded to the closest integer. In the case of 3.5 and 4.5, both round to the even number 4. Thanks to Dithermaster for pointing out my wrong assumption and clarifying how it should work.

#include <math.h>

inline int convert_int_rte (float number) {
   // rintf() rounds to nearest even under the default rounding
   // mode (FE_TONEAREST), which is exactly what _rte requires
   return ((int)rintf(number));
}

I’m sure there is a more optimal implementation. You can fix that in Github (see below).

Round to zero

This means rounding towards zero: positive numbers are rounded down, negative numbers are rounded up. 1.6 becomes 1, and -1.6 becomes -1.

inline int convert_int_rtz (float number) {
   return ((int)(number));
}

Effectively, this just removes everything after the decimal point.

Round to positive infinity

1.4 becomes 2, and -1.6 becomes -1.

inline int convert_int_rtp (float number) {
   return ((int)ceil(number));
}

Round to negative infinity

1.6 becomes 1, and -1.4 becomes -2.

inline int convert_int_rtn (float number) {
   return ((int)floor(number));
}

Saturation

Saturation is another word for “avoiding NaN and overflow”. It makes sure that the result stays between INT_MIN and INT_MAX, and that NaN returns 0. If not used, the outcome of the function can be anything (-2147483648 in the case of convert_int_rtz(NAN) on my computer). Saturation is more expensive, so it’s optional.

#include <limits.h>
#include <math.h>

inline float saturate_int(float number) {
  if (isnan(number)) return 0.0f; // NaN becomes 0
  if (number > (float)INT_MAX) return ((float)INT_MAX);
  if (number < (float)INT_MIN) return ((float)INT_MIN);
  return (number);
}

Effectively the other functions become like:

inline int convert_int_sat_rtz (float number) {
   return ((int)(saturate_int(number)));
}
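A quick test driver (assuming the functions above are in the same file) shows the differences between the modes:

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%d\n", convert_int_rte(3.5f));    // 4: nearest even
    printf("%d\n", convert_int_rte(4.5f));    // 4: nearest even
    printf("%d\n", convert_int_rtz(-1.6f));   // -1: towards zero
    printf("%d\n", convert_int_rtp(-1.6f));   // -1: towards +infinity
    printf("%d\n", convert_int_rtn(-1.4f));   // -2: towards -infinity
    printf("%d\n", convert_int_sat_rtz(NAN)); // 0: saturated
    return 0;
}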

Doubles, longs and getting started

Yes, you need to make functions for all of these. But you can of course also check out the project on Github (BSD licence, rudimentary first implementation).

You’re free to make a double-version of it.

Mega-kernel versus Micro-kernels in LuxRender (repost)

LuxRenderer demo-rendering

Below is a (slightly edited) repost of a blog post by David, the author of SmallLuxGPU.

I find micro-kernels an important subject, since they have clear advantages. OpenCL 2.0 offers more possibilities to create smaller kernels. Making smaller and more focused functions is also considered good software engineering, known as “Separation of Concerns”.



For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:

When programming for GPUs, simply porting a large CPU program into an equally large GPU kernel is generally not a good approach. Due to the SIMT execution model on GPUs, divergence in control flow carries substantial performance penalties, as does high register usage that lessens the latency-hiding capability that is essential for the high-latency, high-bandwidth memory system of a GPU. In this paper, we implement a path tracer on a GPU using a wavefront formulation, avoiding these pitfalls that can be especially prominent when using materials that are expensive to evaluate. We compare our performance against the traditional megakernel approach, and demonstrate that the wavefront formulation is much better suited for real-world use cases where multiple complex materials are present in the scene.

The OpenCL kernels in “SmallLuxGPU” (the raytracer originally written by David) have followed the micro-kernel approach from the very beginning. However, with the merge with LuxRender and the introduction of LuxRender materials, textures, light sources, etc., one of the kernels grew to the point of being a “Mega-kernel”.

The major problem with the “Mega-kernel”, aside from the inability of the AMD OpenCL compiler to compile it, is the huge register usage and the very low GPU utilization. Why this happens is well explained in the paper.
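To give an idea of the difference (my own illustrative sketch, not actual LuxRender code): a mega-kernel traces and shades each sample in one huge function, while micro-kernels split the work into small stages that communicate through a global buffer of path states.

// One entry in a global buffer of path states, shared by all stages.
typedef struct { float4 origin, dir; int alive; } PathState;

__kernel void stage_trace(__global PathState *paths)
{
    int i = get_global_id(0);
    // intersect paths[i] with the scene; needs few registers
}

__kernel void stage_shade(__global PathState *paths)
{
    int i = get_global_id(0);
    // evaluate only the material code for paths[i]
}

Each small kernel compiles to far fewer registers, so more wavefronts fit on a compute unit – which is exactly the occupancy increase shown below.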

PATHOCL Micro-kernels edition, the results

The number of kernels increases from 2 to 10, the register usage decreases from 196 (!!!) to 3-84, and the GPU utilization rises from a miserable 10% to a healthier 30%-100%.

Occupancy increases from 10% to 30% or more

The performance increase is huge on some platforms (Linux + FirePro W8100): 3.6 times.

Speed increases from 0.84M to 3.07M samples/sec

A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.

It solves the problems with the AMD compiler

Micro-kernels not only improve the performance but also address the major issues with the AMD OpenCL compiler. For the very first time since the release of the first AMD OpenCL SDK beta, I’m not aware of any scene not running on AMD GPUs. This is SATtva’s Mic scene running on GPUs for the first time:

Scene builds correctly on AMD hardware for the first time

Try it out yourself

This feature will be extended to BIASPATHOCL and will be available in LuxRender v1.5.

A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.

To run with micro-kernels, use “path.microkernels.enable=1”.

We ported GROMACS from CUDA to OpenCL

GROMACS does soft matter simulations on molecular scale. Let it fly.

GROMACS is an important molecular simulation kit, which can do all kinds of “soft matter” simulations like nanotubes, polymer chemistry, zeolites, adsorption studies, proteins, etc. It is used by researchers worldwide and is one of the bigger bio-informatics packages around.

GPUs can be used to speed up the computations, but the big problem was that only NVIDIA GPUs could be used, as the GPU code was written in CUDA. To make it possible to use other accelerators, we ported it to OpenCL. It took a small team several months to get to the alpha release, and now I’m happy to present it to you.

Those who know us from consultancy (and training) only might have noticed already: this is our first product!

We promised to keep it under the same open source license and that effectively means we are giving it away for free. Below I’ll explain how to obtain the sources and how to build it, but first I’d like to explain why we did it pro bono.

Why we did it

Indeed, we did not get any money (income or funds) for this. There have been several reasons, of which the below four are the most important.

  • The first reason is that we want to show what we can do. Each project we did was under NDA, so we could not demo anything we made for a customer. We chose a CUDA package to port to OpenCL, as we noticed a trend of CUDA software being ported to OpenCL (e.g. Adobe software).
  • The second reason is that bio-informatics is an interesting industry, where we would like to do more work.
  • The third reason is that we can find new employees. Joining the project is a way to get noticed, and could end in a job offer. The GROMACS project is big and needs unique background knowledge, so it can easily overwhelm people. This makes it perfect software for testing who is smart enough to handle such complexity.
  • Fourth is gaining experience with handling open source projects and distributed teams.

Therefore I think it’s a very good investment, while giving something (back) to the community.

Presentation of lessons learned during SC14

We just jumped in and went for it. We learned a lot, because it did not go as we expected. We would like to share all this experience at SuperComputing 2014.

During SC14 I will give a presentation on the OpenCL port of GROMACS and the lessons learned. As AMD was quite happy with this port, they provided me a place to talk about the project:

“Porting GROMACS to OpenCL. Lessons learned”
SC14, New Orleans, AMD’s mini-theatre.
19 November, 15:00 (3:00 pm), 25 minutes

The SC14 demo will be available at the AMD booth the whole week, so drop by if you’re curious and want to see it live with explanation.

If you’d like to talk in person, please send an email to make an appointment for SC14.

Getting the sources and build

It still has rough edges, so a better description would be “we are currently porting GROMACS to OpenCL”, but we’re very close.

As it is work in progress, no binaries are available. So besides knowledge of C, C++ and CMake, you also need to know how to work with git. It builds on both Windows and Linux; NVIDIA and AMD GPUs are the target platforms for the current phase.

The project is waiting for you on https://github.com/StreamHPC/gromacs.

The wiki has lots of information, from how to build, supported devices to the project planning. Please RTFM, before starting! If something is missing on the wiki, please let us know by simply reporting a new issue.

Help us with the GROMACS OpenCL port

We would like to invite you to join, so we can make the port better than the original. There are several reasons to join:

  1. Improve your OpenCL skills. What really applies to the project is this quote:

    Tell me and I forget.
    Teach me and I remember.
    Involve me and I learn.

  2. Make the OpenCL ecosphere better. Every product that has OpenCL support, gives choice to the user what GPU to use (NVIDIA, AMD or Intel)
  3. Make GROMACS better. It is already a large community and OpenCL-knowledge is needed now.
  4. Get hired by StreamHPC. You’ll be working with us directly, so you’ll get to know our team.

What can you do? There is much you can do. Once you have managed to build and run it, look at the bug reports. The first focus is to get the failing kernels working – top priority for finalising phase 1. After that, the real fun begins in phase 2: adding features and optimising for speed on specific devices. Since the AMD FirePro is much better at double precision than the Nvidia Tesla, it would be interesting to add support for double precision. Also, certain parts of the code are still done on the CPU and have real potential to be ported to the GPU.

If things are not clear and obstruct you from starting, don’t get stressed – just send an email with any question you have. We’re awaiting your merge request or issue report!

Special thanks

This project wasn’t possible without the help of many people. I’d like to thank them now.

  • The GROMACS team in Sweden, from the KTH Royal Institute of Technology.
    • Szilárd Páll. A highly skilled GPU engineer and PhD student, who pro-actively keeps helping us.
    • Mark Abraham. The GROMACS development manager, always quickly answering our various questions and helping us where he could.
    • Berk Hess. Who helped answering the harder questions and feeding the discussions.
  • Anca Hamuraru, the team lead. Works at StreamHPC since June, and helped structure the project with much enthusiasm.
  • Dimitrios Karkoulis. Has been volunteering on the project since the start in his free time. So special thanks to Dimitrios!
  • Teemu Virolainen. Works at StreamHPC since October and has shown to be an expert on low-level optimisations.
  • Our contacts at AMD, for helping us tackle several obstacles. Special thanks go to Benjamin Coquelle, who checked out the project to reproduce problems.
  • Michael Papili, for helping us with designing a demo for SC14.
  • Octavian Fulger from Romanian gaming-site wasd.ro, for providing us with hardware for evaluation.

Without these people, the OpenCL port would never have gotten here. Thank you.

OpenCL tutorial videos from Mac Research

A while ago macresearch.com ceased to exist, as David Gohara pulled the plug. Luckily the sources of a very nice tutorial were not lost, and David gave us permission to share his material.

Even if you don’t have a Mac, these almost 5-year-old materials are very helpful for understanding the basics (and more) of OpenCL.

We also have the sources (chapter 4, chapter 6) and the collection of corresponding PDFs for you. All material is copyright David Gohara. If you like his style, also check out his podcasts.

Introduction to OpenCL

OpenCL fundamentals

Building an OpenCL Project

Memory layout and Access

Questions and Answers

Shared Memory Kernel Optimisation

Did you like it? Do you have improvements on the code? Want us to share more material? Let us know in the comments, or contact us directly.

Want to learn more? Look in our knowledge base, or follow one of our trainings.


We’re looking for an intern to do the cool stuff: benchmarking and Linux wizarding

So, don’t let us retype your documents and blog posts, as that would make us your intern.

We have some embedded devices here which badly need attention. Some have gotten private time on the bench, but we haven’t shared anything on the blog with our readers yet. We simply need some extra hands to do this. Because it’s actually cool to do, but admittedly a bit boring when done for several devices in a row, it is the perfect job for an intern. Besides the benchmarking, we have some other Linux-related projects for you. You’ll get the average payment for an internship in the Netherlands (in Dutch: “stagevergoeding”), lunch, a desk and a bunch of devices (aka toys-for-techies).

Like more companies in the Netherlands, we don’t care about where you were born, but about who you are as a person. We expect that you…

  • know everything about Linux administration, from servers to embedded devices.
  • know how to setup a benchmark.
  • document all what you do, not only the results.
  • speak and write Dutch and English.
  • have great humor! (Even if you’re the only one who laughs at your jokes).
  • study in the EU, or can arrange the paperwork to get to the EU yourself.
  • have a place to live/crash in or near Amsterdam, or don’t mind the daily travel. You cannot sleep in the office.

Together with your educational institute we’ll discuss the exact learning goals of the internship, and make a plan for a period of 3 to 6 months.

If you are interested, send a mail to jobs@streamhpc.com. If you know somebody who would be interested, please tell that person that we’re waiting for him/her! Also tips&tricks on finding the right person are very welcome.

A short story: OpenCL at LaSEEB (Lisboa, Portugal)

The research lab LaSEEB (Lisboa, Portugal) is active in the areas of Biomedical Engineering, Computational Intelligence and Evolutionary Systems. They create software using OpenCL and CUDA to speed up their research and simulations.

They were one of the first groups to try out OpenCL, even before StreamHPC existed. To simplify the research at the lab, Nuno Fachada created cf4ocl – a C Framework for OpenCL. During an e-mail correspondence with Nuno, I asked him to tell me something about how OpenCL is used within their lab. He was happy to share a short story.

We started working with OpenCL in early 2009. We had been working with CUDA for a few months, but because OpenCL was to be adopted by all platforms, we switched right away. At the time we used it with NVidia GPUs and PS3s, but had many problems because of the verbosity of its C API. There were few wrappers, and those that existed weren’t exactly stable or properly documented. Adding to this, the first OpenCL implementations had some bugs, which didn’t help. Some people in the lab gave up on using OpenCL because of these initial problems.

Everyone who started early on recognises the problems described above. Now fast-forward to today.

Nowadays it’s much easier to use, due to the stability of the implementations and the several wrappers for different programming languages. For C, however, I think cf4ocl is the most complete and well documented. The C people here in the lab are getting into OpenCL again because of it (the Python guys were already into it again thanks to the excellent PyOpenCL library). Nowadays we’re using OpenCL on AMD and NVidia GPUs and multicore CPUs, including some Xeons.

This is what I hear more often lately: the return of OpenCL to research labs and companies after several years of CUDA. It’s a combination of preparing for the long term, growing interest in exploring other and/or cheaper(!) accelerators than NVIDIA’s GPUs, and OpenCL being ready for prime time.

Do you have your story of how OpenCL is used within your lab or company? Just contact us.

Are you interested in other wrappers for OpenCL? See this list in our knowledge base.

Why use OpenCL on FPGAs?

Altera has just released the free ebook FPGAs for Dummies. One part of the book is devoted to OpenCL, so we’ll quote some extracts from one of the chapters here. The rest of the book is worth a read, so if you want to check out the full text, just fill in the form on Altera’s webpage.

At StreamHPC we’re interested in OpenCL on FPGAs for one reason: many companies run their software on GPUs when they should be using FPGAs instead, while at the same time others stick to FPGAs and ignore GPUs completely. The main reason, we think, is that converting CUDA to VHDL, or Verilog to CPU intrinsics, is simply too painful. Another reason is the amount of investment already put into a certain technology. We believe OpenCL can solve both of these issues. OpenCL is much more portable and can be converted to a new architecture in a relatively short time (if the developer is familiar with the project, the hardware and OpenCL). We are highly familiar with the latter two, which means we’re used to getting new projects up and running.

Since both Altera and Xilinx have invested in OpenCL, code for the two FPGA families has become more portable. Altera has a public SDK (and they’re proudly loud about it), while Xilinx offers one in their latest tools (although they’re unfortunately much more silent about it).

Now, let’s go back to the quotes from the book that we wanted to share with you.

Andrew Moore describes OpenCL effectively in just a few sentences:

The need for heterogeneous computing is leading to new programming languages to exploit the new hardware. One example is OpenCL, first developed by Apple, Inc. OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, and other types of processors. OpenCL includes a language for developing kernels (functions that execute on hardware devices) as well as application programming interfaces (APIs) that define and control the various platforms. OpenCL allows for parallel computing using task-based and data-based parallelism.

The author also shares some interesting insights around the reasons why OpenCL should be used on FPGA:

FPGAs are inherently parallel, so they’re a perfect fit with OpenCL’s parallel computing capabilities. FPGAs give you an alternative to the typical data or task parallelism by offering a pipeline parallelism where tasks can be spawned in a push-pull configuration with each task using different data from the previous task with or without host interaction. OpenCL allows you to develop your code in the familiar C programming language but using the additional capabilities provided by OpenCL. These kernels can be sent to the FPGAs without your having to learn the low-level HDL coding practices of FPGA designers. Generally, there are several benefits for software developers and system designers to use OpenCL to develop code for FPGAs:

  • Simplicity and ease of development: Most software developers are familiar with the C programming language, but not low-level HDL languages. OpenCL keeps you at a higher level of programming, making your system open to more software developers.
  • Code profiling: Using OpenCL, you can profile your code and determine the performance-sensitive pieces that could be hardware accelerated as kernels in an FPGA.
  • Performance: Performance per watt is the ultimate goal of system design. Using an FPGA, you’re balancing high performance in an energy-efficient solution.
  • Efficiency: The FPGA has a fine-grain parallelism architecture, and by using OpenCL you can generate only the logic you need to deliver one fifth of the power of the hardware alternatives.
  • Heterogeneous systems: With OpenCL, you can develop kernels that target FPGAs, CPUs, GPUs, and DSPs seamlessly to give you a truly heterogeneous system design.
  • Code reuse: The holy grail of software development is achieving code reuse. Code reuse is often an elusive goal for software developers and system designers. OpenCL kernels allow for portable code that you can target for different families and generations of FPGAs from one project to the next, extending the life of your code.

Today, OpenCL is developed and maintained by the technology consortium Khronos Group. Most FPGA manufacturers provide Software Development Kits (SDKs) for OpenCL development on FPGAs.
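To make the pipeline parallelism from the quote above more concrete, below is a minimal sketch using Altera's channels extension (assuming the cl_altera_channels extension is available on your board; the kernel names and the doubling operation are made up for illustration). Two kernels are connected by an on-chip FIFO, so data streams from one to the other without host interaction:

#pragma OPENCL EXTENSION cl_altera_channels : enable

// An on-chip FIFO connecting the two kernels.
channel float pipe0;

// Producer: streams data from global memory into the pipeline.
kernel void producer(global const float* src, const int n) {
    for (int i = 0; i < n; i++) {
        write_channel_altera(pipe0, src[i]);
    }
}

// Consumer: processes each value as it arrives and writes the result back.
kernel void consumer(global float* dst, const int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = 2.0f * read_channel_altera(pipe0);
    }
}

Both kernels run concurrently on the FPGA fabric, which is exactly the push-pull configuration the book describes.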

You can continue here if you want to read the rest of this ebook. And of course, whenever you want to learn some more, feel free to write to us, or follow the conversation on Twitter via our special account: @OpenCLonFPGAs.

OpenCL support levels

The table below shows the current state of OpenCL, SPIR and HSA/unified memory support for each vendor.

Vendor                        | OpenCL               | SPIR        | Unified memory architecture
AMD - CPU + GPU (APU)         | 1.2 FP, 2.0 in beta  | 2.0 in beta | HSA 1.0 in beta
AMD - discrete GPU            | 1.2 FP               | ?           | N/A
NVIDIA - CPU + GPU (Tegra)    | N/A                  | Unavailable | CUDA 6 Unified Memory on latest Tegra
NVIDIA - GPU                  | 1.1 FP               | Unavailable | N/A
Intel - CPU                   | 1.2 FP               | 1.2         | -
Intel - integrated GPU        | 1.2 (?)              | ?           | unknown
Intel - Accelerator (XeonPhi) | 1.2 FP               | ?           | N/A
Altera                        | 1.0 FP               | ?           | N/A
ARM - CPU                     | 1.1 FP               | ?           | -
ARM - integrated GPU          | 1.1 FP               | ?           | HSA member
Qualcomm - GPU                | 1.1 EP, 1.2 in beta  | ?           | HSA member
Imagination - GPU             | 1.1 EP               | ?           | HSA member
Vivante/FreeScale - GPU       | 1.1 FP               | ?           | HSA member

EP = Embedded Profile, FP = Full Profile.

OpenCL support on recent Android smartphones

There is more than one way (image by Pank Seelen)

The embedded world is extremely flexible, because it is full of open standards. We therefore expect that the big processor vendors will push harder than Google can push back. OpenCL support is very important for GPGPU libraries like ArrayFire, VexCL and ViennaCL – with it, these can be ported to Android in less time.

Apple has now introduced Metal on iOS, increasing the fragmentation even more. StreamHPC and friends are working hard on getting one language available on all platforms, so we can focus on bringing solutions to you. Understand that if OpenCL gets popular on Android, the chance increases that it will also get accepted on other mobile platforms like iOS and Windows Mobile/Phone.

On the other hand, OpenCL is being blocked wherever possible, as GPGPU enables unique apps. A RenderScript-only or Metal-only app is good for the sales of one type of smartphone – good for them, bad for developers who want to target the whole market.

Getting the current status

To get more insight into the current situation, Pavan Yalamanchili of ArrayFire has created a spreadsheet (click here to edit it yourself). It is publicly editable, so anybody can help complete it. Be clear about the version of Android you are running, as for instance in 4.4.4 Google has possibly thrown up some blocks. If you found drivers but did not get OpenCL running, please put that in the notes. You can easily find out if your smartphone supports OpenCL using this OpenCL-Info app. Thanks in advance for helping out!
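If you prefer to check programmatically, below is a minimal sketch in C that probes for an OpenCL driver at runtime. The library is loaded with dlopen(), since linking directly against a possibly absent libOpenCL.so would make the app fail to start; the library path may differ per device.

#include <stdio.h>
#include <dlfcn.h>

/* Matches cl_int clGetPlatformIDs(cl_uint, cl_platform_id*, cl_uint*) */
typedef int (*clGetPlatformIDs_fn)(unsigned int, void*, unsigned int*);

int main(void) {
    void* lib = dlopen("libOpenCL.so", RTLD_LAZY); /* path may differ per device */
    if (!lib) {
        printf("No OpenCL library found\n");
        return 1;
    }
    clGetPlatformIDs_fn getPlatformIDs =
        (clGetPlatformIDs_fn)dlsym(lib, "clGetPlatformIDs");
    unsigned int num_platforms = 0;
    if (getPlatformIDs && getPlatformIDs(0, 0, &num_platforms) == 0) /* 0 == CL_SUCCESS */
        printf("OpenCL platforms found: %u\n", num_platforms);
    dlclose(lib);
    return 0;
}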

Why not just RenderScript?

We think that RenderScript could be built on top of OpenCL. This would allow new programming languages, and would find the optimal programming solution faster than just trusting Google’s engineers – solving this problem is not about being smart, but about being open to more routes.

The same goes for Metal, which even tries to replace both OpenCL and OpenGL. Again, it is a higher-level language that could be expressed in OpenCL and OpenGL.

Let’s see if Apple and Google serve their dedicated developers, or if we-the-developers must serve them. Let’s hope for the best.

Using async_work_group_copy() on 2D data

When copying data from global to local memory, you often see code like below (1D data):
if (get_local_id(0) == 0) {          // only the first work-item of the group copies
    for (int i = 0; i < N; i++) {
        data_local[i] = data_global[offset + i];
    }
}
barrier(CLK_LOCAL_MEM_FENCE);        // all work-items wait for the copy to finish

This can be replaced with an asynchronous copy using the function async_work_group_copy(), which results in more manageable and cleaner code. The function behaves like an asynchronous version of the memcpy() you know from C.

event_t async_work_group_copy(__local gentype *dst,
                              const __global gentype *src,
                              size_t num_gentypes, // number of elements, not bytes
                              event_t event);

event_t async_work_group_copy(__global gentype *dst,
                              const __local gentype *src,
                              size_t num_gentypes,
                              event_t event);
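Using the first of these signatures, the 1D loop from above could be replaced like this (a minimal sketch; N and offset are assumed to be compile-time defines, and the kernel name is mine):

kernel void copy_1d(global const float* data_global, local float* data_local) {
    // All work-items in the group participate; no if-statement needed.
    event_t e = async_work_group_copy(data_local,           // destination in local memory
                                      &data_global[offset], // source in global memory
                                      N,                    // number of elements, not bytes
                                      0);                   // 0 = create a new event
    // ... independent work can be done here, hiding the copy latency ...
    wait_group_events(1, &e); // all work-items wait until the copy has finished
}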

The Khronos registry describes async_work_group_copy() as providing asynchronous copies between global and local memory, alongside a prefetch from global memory. This makes it much easier to hide the latency of the data transfer. In the example below, you effectively get the time to run do_other_stuff() for free – this results in faster code.
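That prefetch() function takes no event and is merely a hint to the cache hierarchy. A minimal sketch of how it could be used (the kernel and its row-sum workload are made up for illustration):

kernel void row_sum(global const float* src, global float* dst, const int n) {
    const int gid = get_global_id(0);
    // Hint that the next n elements will be read soon; prefetch() does not
    // block and needs no wait_group_events().
    prefetch(&src[gid * n], n);
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += src[gid * n + i];
    }
    dst[gid] = acc;
}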

As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.

The code is executed at workgroup level, so there is no need to write code that makes sure it’s only executed by one work-item.

kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    // 'offset' is assumed to be a compile-time constant, e.g. set via build options.
    event_t event = 0; // 0 = let the first copy create a new event
    const int dataInLocalWidth = offset*2 + get_local_size(0);

    for (int i = 0; i < (offset*2 + get_local_size(1)); i++) {
        event = async_work_group_copy(
            &dataInLocal[i*dataInLocalWidth],
            &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                + (get_group_id(0)*get_local_size(0)) - offset],
            dataInLocalWidth,
            event); // reusing the event chains the copies together
    }
    do_other_stuff();             // code that you can execute for free
    wait_group_events(1, &event); // waits until all copies have finished
    use_data(dataInLocal);
}

On the host (C++), the most important part:
cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY|CL_MEM_HOST_WRITE_ONLY,
          sizeof(float) * gsize_x * gsize_y);
cl::LocalSpaceArg cl_dataInLocal = cl::Local(sizeof(float) * (lsize_x+2*offset)
          * (lsize_y+2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0, sizeof(float) * gsize_x * gsize_y, dataIn);
cl::make_kernel<cl::Buffer, cl::LocalSpaceArg> kernel_using_local(
          cl::Kernel(*program, "using_local", &error));
cl::EnqueueArgs eargs(queue, cl::NullRange, cl::NDRange(gsize_x, gsize_y),
          cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);

This should work. Some prefer to declare the local memory inside the kernel, but I prefer not to hard-code its size at (JIT-)compile time.
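For completeness: since the kernel above relies on offset being a compile-time constant, one way to supply it is via the build options (a sketch using the OpenCL C API; the value 4 is just an example):

// Make 'offset' available to the kernel source as if it were a #define.
clBuildProgram(program, 0, NULL, "-D offset=4", NULL, NULL);

The C++ wrapper exposes the same through cl::Program::build().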

This code might not be optimal if you use special tricks for handling the outer border. If you see any improvement, please share it via the comments.