Market Positioning of Graphics and Compute solutions

positioningWhen compute became possible on GPUs, it was first presented as an extra feature and did not change much to the positioning of the products by AMD/ATI and Nvidia. NVidia started with positioning server-compute (described as “the GPU without a monitor-connector”), where AMD and Intel followed. When the expensive Geforce GTX Titan and Titan Z got introduced it became clear that NVidia still thinks about positioning: Titan is the bridge between Geforce and Tesla, a Tesla with video-out.

Why is positioning important? It is the difference between “I’d like to buy a compute-card for my desktop, so I can develop algorithms that run as well on the compute-server” and “I’d like to buy a graphics card for doing computations and later run that on a passively cooled graphics card”. The second version might get a “you don’t want to do that”, as graphics terminology is used to refer to compute-goals.

Let’s get to the overview.

AMD NVIDIA Intel ARM
Desktop User * A-series APU Iris / Iris Pro
Laptop User * A-series APU Iris / Iris Pro
Mobile User Tegra Iris Mali T720 / T4xx
Desktop Gamer Radeon GeForce
Laptop Gamer Radeon M GeForce M
Mobile High-end Tegra K (?) Iris Pro Mali T760 / T6xx
Desktop Graphics FirePro W Quadro
Laptop Graphics FirePro M Quadro M
Desktop (DP) Compute FirePro W Titan (hdmi) / Tesla (no video-out) XeonPhi
Laptop (DP) Compute FirePro M Quadro M XeonPhi
Server (DP) Compute FirePro S Tesla XeonPhi (active cooling!)
Cloud Sky Grid

* = For people who say “I think my computer doesn’t have a GPU”.

My thoughts are that Titan are to promote compute at the desktop, while also Tesla is promoted for that. AMD has the FirePro W for that, for both Graphics professionals and Compute professionals, to serve all customers. Intel uses XeonPhi for anything compute and it’s is all actively cooled.

The table has some empty spots: Nvidia doesn’t have IGP, AMD doesn’t have mobile graphics and Intel doesn’t have a clear message at all (J, N, X, P, K mixed for all types of markets). Mobile GPUs from ARM, Imagination, Qualcomm and others have a clear message to differentiate between high-end and low-end mobile GPUs, whereas NVidia and Intel don’t.

Positioning of the Titan Z

Even though I think that Nvidia made a right move with positioning a GPU for the serious Compute Hobbyist, they are very unclear with their proposition. AMD is very clear: “Want professional graphics and compute (and play games after work)? Get FirePro W for workstations”, whereas Nvidia says “Want compute? Get a Titan if you want video-output, or Tesla if you don’t”.

See this Geforce-page, where they position it as a gamers-card that competes with the Google Brain Supercomputer and a MAC Pro. In other places (especially benchmarks) it is stressed that it is not meant for gamers, but for compute enthusiasts (who can afford it). See for example this review on Hardware.info:

That said, we wouldn’t recommend this product to gamers anyway: two Nvidia GeForce GTX 780 Ti or AMD Radeon R9 290X cards offer roughly similar performance for only a fraction of the money. Only two Titan-Zs in SLI offer significantly higher performance, but the required investment is incredibly high, to the point where we wouldn’t even consider these cards for our Ultimate PC Advice.

As a result, Nvidia stresses that these cards are primarily intended for GPGPU applications in workstations. However, when looking at these benchmarks, we again fail to see a convincing image that justifies the price of these cards.

So NVIDIA’s naming convention is unclear. If TITAN is for the serious and professional compute developer, why use the brand “Geforce”? A Quadro Titan would have made much more sense. Or even “Tesla Workstation”, so developers could get a guarantee that the code would run on the server too.

Differentiating from low-end compute

Radeon and Geforce GPUs are used for low-cost compute-cluster. Both AMD and NVidia prefer to sell their professional cards for that market and have difficulties to make a clear understanding that game-cards are not designed for compute-only solutions. The one thing they did the past years is to reserve good double precision computations for their professional cards only. An existing difference was the driver quality between Quadro/FirePro (industry quality) and GeForce/Radeon. I think both companies have to rethink the differentiated driver-strategy, as compute has changed the demands in the market.

I expect more differences between the support-software for different types of users. When would I pay for professional cards?

  1. Double Precision GFLOPS
  2. Hardware differences (ECC, NVIDIA GPUDirect or AMD SDI-link/DirectGMA, faster buses, etc)
  3. Faster support
  4. (Free) Developer Tools
  5. System Configuration Software (click-click and compute works)
  6. Ease of porting algorithms to servers/clusters (up-scaling with less bugs)
  7. Ease of porting algorithms to game-cards (simulation-mode for several game-cards)

So the list starts with hardware specific demands, then focuses to developer support. Let me know in the comments, why you would (not) pay for professional cards.

Evolving from gamer-compute to server-compute

GPU-developers are not born, but made (trained or self-educated). Most times they start with OpenCL (or CUDA) on their own PC or laptop.

With Nvidia it would be hobby-compute on Geforce, then serious stuff on Titan, then Tesla or Grid. AMD has a comparable growth-path: hobby-compute on Radeon, then upgrade to FirePro W and then to FirePro S or Sky. Intel it is Iris or XeonPhi directly, as their positioning is not clear at all if it comes to accelerators.

Conclusion

Positioning of the graphics cards and compute cards are finally getting finalised at the high-level, but will certainly change a few more times in the year(s) to come. Think of the growing market for home-video editors in 2015, who will probably need a compute-card for video-compression. Nvidia will come with another solution than AMD or Intel, as it has no desktop-CPU.

Do you think it will be possible to have an AMD APU with NVIDIA accelerator? Do people need to buy a accelerator-box in 2015 that can be attached to their laptop or tablet via network or USB, to do the rendering and other compute-intensive work (a “private compute cloud”)? Or will there always be a market for discrete GPUs? Time will tell.

Thanks for reading. I hope the table makes clear how things are now as of 2014. Suggestions are welcome.

Valgrind suppression file for AMD64 on Linux

valgrind_amdValgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly code, Fortran, Ada, etc. I use it to check out if the provided code is ok, before I start porting it to GPU-code. It finds one of those devils in the details. But also for finding my own bugs when writing OpenCL-code, it has given me good feedback. Unfortunately it does not work well with optimised libraries, such as the OpenCL-driver from AMD.

You’ll get problems like below, which clutters the output.

==21436== Conditional jump or move depends on uninitialised value(s)
==21436== at 0x6993DF2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x6C00F92: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x6BF76E5: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x6C048EA: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x6BED941: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x69550D3: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x69A6AA2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x69A6AEE: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x69A9D07: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x68C5A53: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x68C8D41: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436== by 0x68C8FB5: ??? (in /usr/lib/fglrx/libamdocl64.so

How to fix this cluttering? Continue reading “Valgrind suppression file for AMD64 on Linux”

Building a 150 TFLOPS cluster with Accelerators in 2014

top500You can’t ignore accelerators when designing a new cluster for HPC anymore. Back in 2010 I suggested to use GPUs to enter the Top 500 with a budget of only €38k. It takes ten times more now, as almost everybody started to use accelerators. To get into the November top 500 would roughly take a cluster of 150 TFLOPS.

I’d like to give you a list of what you can expect for 2014, and to help you design your HPC cluster with recent hardware. The focus should be on OpenCL-capable hardware, as open standards can prepare you better for upgrades in the future. So, this is also a guess on what we can see in the November Top 500, based on current information.

There are currently professional solutions from NVIDIA, AMD, Intel and Altera. I’ve searched the web and asked around for what would be the upcoming offers. You will find the results bellow. But information should continue to flow; please add your remarks in the comments, so we get the best information through collaboration.

Comparison: mentioning the Double Precision GFLOPS of the accelerators only. The theoretical GFLOPS can not be reached in real-world benchmarks. Therefore, DGEMM is used as an indication of the maximum realistic GFLOPS. The efficiencies of other benchmarks (like Linpack) are all lower.

NVIDIA Tesla

NVIDIA Tesla is the current market leader with Tesla K20 and K20X. By the end of 2013 they announced K40 (GK110b-architecture), which is 10% to 20% faster than the K20X (see table). This is 10% faster in max GFLOPS, but also 10% due to architecture-improvements. It’s not a huge difference, but the new Maxwell-architecture is more promising. The problem is that high-end Maxwell is not expected for this year. There are several rumours around what’s going on, but the official one is that there are problems with 20nm. I’ve had this confirmed by different sources, but will, of course, keep you up-to-date on Twitter.

I could not find good enough information on The K40x. It has been also very quiet around the current architectures on their yearly GDC conference. My expectations are that they will want to kick in hard with Maxwell in 2015. For 2014 they’ll focus on keeping their current customers happy in a different way. For now, let’s assume the K40X is 10% faster.

K20-K40So, for this year it will be K40. Here’s an overview:

  • Peak 1.43 DP TFLOPS theoretical
  • Peak 1.33 DP TFLOPS DGEMM (93% efficiency)
  • 5.65 GFLOPS/Watt DGEMM
  • Needs 122 GPUs to get 150 TFLOPS DGEMM
  • Lowest streetprice is $4800. $585,600 for 122 GPUs.

AMD FirePro

Just like the Tesla K40 and the Intel Xeon Phi, AMD offers accelerators with a lot of memory. The S10000 and S9000 are their current server-offers, but are still based on their older architectures. Their latest architecture is only available for gamers (i.e. R9 290X) and workstations (i.e. W9100). Now, with the recent announcement of the W9100, we have an indication of what this server-accelerator would cost, and look like. I expect this card to launch soon. I even expected it to be launched before the W9100.

What is interesting about the W9100 is the high memory transfer rate and the large memory. Assuming they need to pack the S9150 in 225 Watt and don’t change the design much to launch soon, they need to under-clock it like 22%. I think they can use 235 Watts (like the K40). Nevertheless, I want to be realistic.

FirePro W9100 FirePro W9000 FirePro S9150
Shader count 2816 2048 2816
Mem size 16 GByte 6 GByte 16 GByte
mem-type GDDR5 GDDR5 GDDR5
Interface 512 Bit 384 Bit 512 Bit
Transferrate 320 GByte/s 264 GByte/s 320 GByte/s
TDP 275 Watt 274 Watt 225 Watt (-22%)
Connectors 6 × MiniDP, 3D-Stereo, Frame-/ Genlock 6 × MiniDP, 3D-Stereo, Frame-/ Genlock ?
Multimonitor yes (6) yes (6) Don’t care
SP/DP (TFlops) 5.24 / 2.62 3.99 / 1.0 4.1 / 2.0 (-22%)
ECC yes yes yes
OpenCL 2.0 yes no yes
Price $3999 USD $2999 USD ?

So, what about the new FirePro S9000 with latest GCN, the S9150? An overview:

  • Peak 2.0 DP TFLOPS theoretical
  • Peak 1.6 DP TFLOPS DGEMM (at 80% efficiency, to be safe)
  • 7.1 GFLOPS/Watt DGEMM
  • Needs 94 GPUs to get 150 TFLOPS DGEMM
  • No prices available yet – AMD mostly prices lower than NVIDIA. $371,907 for 93 GPUs, when priced at $3999.

Update: DGEMM of 90% is reached. Then we get 1.8 DP TFLOPS DGEMM and 8.3 GFLOPS/Watt DGEMM. As a result, you need 84 GPUs only to get to the 150 TFLOPS.

Intel Xeon Phi

Intel currently offers 3110, 5110 and 7110 Xeon Phi’s. In the past months they added the 3120, 5120 and 7120. The 7120 uses 300 Watt, which needs special casing to cool this passively cooled card. I don’t quite understand this. I could compare it better to the W9100 and a heavily overclocked K40, or use lower numbers like I did above with the FirePro. But, as you can see, it doesn’t even compare with 300 Watts.

The OpenCL-drivers have been improved this year, which is more promising news. The guess here is wether they will launch a new 7130, or a 7200 or none at all. All the news and rumours speak of 2015 and 2016, for a more integrated memory and a socket-version(!) of the XeonPhi.

For this year the Xeon Phi 7120 would be their top-offer. It compares well with AMD’s W9100 if it comes to memory: 16GB GDDR5 and 352 GB/s.

  • Peak 1.21 DP TFLOPS theoretical
  • Peak 1.07 DP TFLOPS DGEMM (at 80% efficiency)
  • 3.56 GFLOPS/Watt DGEMM
  • Needs 140 Phi’s to get 150 TFLOPS DGEMM
  • Costs $4129 officially, $578,060 for 140.

Altera FPGAs

With OpenCL it finally got possible to run SIMD-focused software on FPGAs. OpenCL 2.0 also has some improvements for FPGAs, making it interesting for mature software that needs low-latency or less power-usage. In other words: software that has been designed on GPUs and measurements show that lower latency would out-compete others on the market who use GPUs, or that the electricity-bill makes the CFO sad. Understand that FPGAs do compete with the above three, but have their own performance hot spots and therefore it’s hard to compare.

I don’t expect the big entry in this year’s Top 500, but I’m watching FPGA progresses closely. Xilinx is also entering this market, but I don’t get much response (if any) to the emails I send to them. For next year’s article I hope to include FPGAs as a true competitor. If you need low-power or low-latency, then you’d better take your time to research FPGA potential for your business this year.

Conclusion

Open standards

For those who don’t know, I tend to prefer open standards. The main reason is that switching hardware is easier, it gives you space to experiment. AMD, Intel and Altera support OpenCL 1.2 and will start later this year with 2.0, whereas NVIDIA lags over 2 years and only supports OpenCL 1.1. The results are now very visible: due to problems with Maxwell, you’ll need to postpone your plans to 2015 if you code in CUDA. There is one way to pressure them, though: port your code to OpenCL, buy Intel or AMD hardware, and then let NVidia know you want this flexibility.

Green 500

You might have noticed the big differences between the GFLOPS/Watt. Where this is important is in the Green 500, the list of energy efficient supercomputers. The goal of today’s supercomputers is that they are mentioned in the top 10 of both lists. If you build an efficient cluster (say 2 CPUs + 4 GPUs), you can get to 70-80% of max DGEMM performance. Below is a list for 75%:

  • AMD FirePro – 7.10 GFLOPS/Watt DGEMM -> 5.33 GFLOPS/Watt @ 75%
  • NVIDIA Tesla – 5.65 GFLOPS/Watt DGEMM -> 4.24 GFLOPS/Watt @ 75%
  • Intel XeonPhi – 3.56 GFLOPS/Watt DGEMM ->2.67 GFLOPS/Watt @ 75%

Currently this list is lead by a cluster with K20X GPUs, steaming out 4.50 GFLOPS/Watt, which has even 86% of max DGEMM.

In other words: if the FirePro gets out in time, then the green 500 could be full of FirePro GPUs.

Update November 2014: here is the Green top 5.

green5
Green500 with AMD FirePro S9150 at spot #1

The winner

Since there are only three offers, they are all winners. What matters is the order.

  1. AMD FirePro – 16GB with its fast memory, is the clear winner in DGEMM performance. The negative side: CUDA-software needs to be ported to OpenCL (we can do that for you).
  2. NVIDIA Tesla – Second to everything from FirePro (bandwidth, memory size, GFLOPS, price). The negative side: its OpenCL-support is outdated.
  3. Intel XeonPhi – Same as FirePro when it comes to memory. Nevertheless, it’s 60% slower in DGEMM and 50% less efficient. The negative side: 300 Watt for a server.

I am happy to see AMD as a clear winner after years of NVIDIA leading the pack. As AMD is the most prominent supporter of OpenCL, this could seriously democratise HPC in times to come.

[bordered_box border_color=” background_color=’#C1DAD6′]

Need to port CUDA to extremely fast OpenCL? Hire us!

If you order a cluster from AMD instead of NVIDIA, you effectively get our services for free.

[/bordered_box]

Intel promotes OpenCL as THE heterogeneous compute solution

opencl-intel-videoAt Intel they have CPUs (Xeon, Ivy Bridge), GPUs (Isis) and Accelerators (Xeon Phi). OpenCL enables each processor to be used to the fullest and they now promote it as such. Watch the below video and see their view on why OpenCL makes a difference for Intel’s customers.

This is important, because till recently Intel was more pushing OpenMP and their proprietary solutions. I think it has something to do with the specialised processors that can be programmed with OpenCL, such as DSPs and FPGAs. Intel has always made generic processors that solve problems best for most. Customers of OpenCL happen to be the ones that could not be served with generic processors and preferred FPGAs and DSPs, before they tried GPUs. By showing that Intel can do OpenCL, they show they are a trustworthy partner to handle the problems in a few years, when the current problems can be handled by Intel processors.

Of course the Xeon Phi is also a good reason. The latest drivers have shown a huge improvement in performance, and that has increased Intel’s confidence in OpenCL for sure.

At StreamHPC we are very happy that Intel now openly promotes OpenCL and invests in it – this will increase trust in the programming language.

A small side-note. The differences between the Windows-drivers and Linux-drivers are somewhat vague: under Linux, the CPU is visible, but not supported officially. This makes development of multi-processor software not as straightforward as discussed in the video. Probably this will be more extensive in the future, as Intel only officially supports OpenCL on a processor when it’s very stable.

Big announcements: SYCL 1.2, WebCL 1.0 and OpenCL 2.0

opencl20Khronos just announced three OpenCL based releases:

  • SYCL 1.2 Provisional Spec – Abstraction Layer for Leveraging C++ and OpenCL
  • WebCL 1.0 Final Spec – JavaScript bindings to OpenCL
  • OpenCL 2.0 Adopters Program – Conformance for OpenCL 2.0 implementations

Below I’ve quoted the summaries. For each of these I’ve prepared articles, but due to lack of time haven’t been able to finish and publish them. So for now some remarks after the summaries.

Khronos Releases SYCL 1.2 Provisional Specification

Programming abstraction layer to enable applications and high-level frameworks to leverage C++ and OpenCL for heterogeneous parallel acceleration

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the release of SYCL™ 1.2 as a provisional specification to enable community feedback. SYCL is a royalty-free, cross-platform abstraction layer that enables the development of applications and frameworks that build on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease-of-use and flexibility of C++. For example, SYCL can provide single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration – and then enable re-use of those templates throughout the source code of an application to operate on different types of data.

https://www.khronos.org/news/press/khronos-releases-sycl-1.2-provisional-specification

Higher level languages are very important, as OpenCL is simply too low-level. SYCL is another effort to help researching & improving this area, as we haven’t found the holy grail. Languages like C++AMP and RenderScript claim they can replace OpenCL, but we all know that some implementations of those languages have been done on top of OpenCL.

Khronos Releases WebCL 1.0 Specification

JavaScript bindings to OpenCL brings heterogeneous parallel computing to Web browsers

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the ratification and public release of the WebCL™ 1.0 specification. Developed in close cooperation with the Web community, WebCL extends the capabilities of HTML5 browsers by enabling developers to offload computationally intensive processing to available computational resources such as multicore CPUs and GPUs. WebCL defines JavaScript bindings to OpenCL™ APIs that enable Web applications to compile OpenCL C kernels and manage their parallel execution. Like WebGL™, WebCL is expected to enable a rich ecosystem of JavaScript middleware that provides access to accelerated functionality to a wide diversity of Web developers.

https://www.khronos.org/news/press/khronos-releases-webcl-1.0-specification

WebCL gets more and more attention, even before it was even official. It would be interesting to see the same growth to higher level language as we have with OpenCL now. for this reason we started the Learning WebCL website, to help you learn WebCL in the future.

Khronos Launches OpenCL 2.0 Adopters Program

Conformance tests now available to certify OpenCL 2.0 implementations

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the availability of the official conformance test suite for the OpenCL 2.0 specification, making it possible for implementers to certify that their implementations are officially conformant thorough the Khronos OpenCL Adopters Program. Khronos has also released a set of header files for OpenCL 2.0 and an updated specification with a number of clarifications and corrections to the specification first released in November 2013.

https://www.khronos.org/news/press/khronos-launches-opencl-2.0-adopters-program

Finally the headers are open. Stay tuned for an extensive OpenCL 1.2 vs OpenCL 2.0 comparison, which I have prepared but were unable to finish without the header files.

I hope you are as happy with these announcements as I am. This tells me that OpenCL is ready for real business.

Khronos Invites Press & Game Developers to Sessions @ GDC San Francisco

Khronos-meetup-march-SanFrancisco

Khronos just sent out the below message to Press and Game Developers. To my understanding, there are many game devs under the readers of this blog, so I’d like you to share the message with you.

JOIN KHRONOS GROUP AT GDC 2014 SAN FRANCISCO
Press Conference, Technology Sessions and Refreshment OasisWe invite you to attend one or more of the Khronos sessions taking place in the Khronos meeting room just off the Moscone show floor. For detailed information on each session, and to register please visit: https://www.khronos.org/news/events/march-meetup-2014.
PRESS CONFERENCE

  • WHEN: Wednesday March 19 at 10:00 AM (Reception 9:30 AM)
  • WHERE: Room 262, West Mezzanine Level, (behind Official Press Room)
  • GUESTS: Members of the Press and Industry by Invitation*
  • RSVP: Jon Hirshon, Horizon PR jh@horizonpr.com

Members of the press are invited to attend the Khronos Press Conference, held jointly again this year with consortium PCGA (PC Gaming Alliance). Khronos will issue significant news on OpenGL ES, WebCL, OpenCL, and several more Khronos technologies, and PCGA will issue news about 2013 Gaming Market numbers. Updates will be delivered by Khronos and PCGA Executives, with insights made by David Cole of DFC and Jon Peddie of Jon Peddie Research.

DEVELOPER SESSIONS

All GDC attendees** are invited to the Khronos Developer Sessions where experts from the Khronos Working Groups will deliver in-depth updates on the latest developments in graphics and media processing. These sessions are packed with information and provide a great opportunity to:

  • Hear about the latest updates from the gurus that invented these technologies
  • See leading-edge demos & applications
  • Put your questions to members of the Khronos working groups
  • Meet with other community members

SESSION SCHEDULE

Wednesday March 19

  • 3:00 – 4:00 : OpenCL & SPIR
  • 4:00 – 5:00 : OpenVX, Camera and StreamInput
  • 5:00 – 6:00 : OpenGL ES
  • 6:00 – 7:00 : OpenGL

Thursday March 20

  • 3:00 – 3:50 : WebCL
  • 4:00 – 4:50 : Collada and glTF
  • 5:00 – 7:00 : WebGL

SESSION REGISTRATION
For information and to register, visit:https://www.khronos.org/news/events/march-meetup-2014

REFRESHMENT OASIS

We thought “Refreshment Oasis” sounded like a nice way to say “sit down and have a cup of coffee while we keep working!” Khronos is happy to offer a hospitality suite conveniently located next to our primary meeting room (and the official GDC Press room) to showcase Khronos Member technology demos and offer a place for GDC guests, Khronos Members and Marketing staff to meet. You are welcome to just drop by for a chat, or please email Michelle@GoldStandardGroup.org to arrange a meeting with any Work Group Chairs, Khronos Execs or Marketing Team.

We look forward to seeing you at the show!

*Admittance to the Press Conference is open to all GDC registered Press, and to members of industry on a “Seating Available” basis. Space is limited so reserve your seat today.

** Admittance to the KHRONOS sessions is FREE but: (1) all attendees must have a GDC Exhibitor or Conference Pass to gain entry to the Khronos meeting room area (GDC tickets details http://www.gdconf.com) and (2) all attendees MUST REGISTER for the individual Khronos API sessions. We expect demand to be high and space is limited.

With open standards becoming more important in the very diverse computer-game industry, Khronos is also growing. If you are in this industry and want to know (or influence) the landscape for the coming years, you should attend.

Commodity and Open Standards – why OpenCL matters

V-UVThis article actually discusses the question: is GPGPU a solution for the masses, or is it for niche-products? For the latter open standards matter a lot less, as you will read.

If you watch the below video on sale&marketing by Victor Antonio, then you get what is so difficult about open standards: It pushes all companies using the standard into a focus on becoming the best. Indeed, survival of the fittest may be the base of (true) capitalism and giving the best products. Problem is that competition on price is not safe for the future of the company.

The key is specialisation, or creating unique value. The below video discusses this. The difference between “a feature” and “unique value” is a discussion on its own, you really should have with your team on your own products. Continue reading “Commodity and Open Standards – why OpenCL matters”

OpenCL hardware test centre @ StreamHPC

S10000
AMD FirePro S10000

Wanting to test your OpenCL-software on specific hardware? From now on, that is possible by logging in on StreamHPCs test-servers.

Update: new prices. Got feedback it should compete with Amazon EC2.

During the beta-period (February to April) we have available:

[list1]

  • Dual FirePro S10000 (total ~11.8 TFLOPS single precision, ~2.9 TFLOPS double precision).
  • Several embedded GPU-boards attached.

[/list1]

By request more servers will be added, but for now we start lean.

You’ll get:

[list1]

  • 64 bit Linux server.
  • A secure SSH access with chrooted harddisk space.
  • Full NDA in case StreamHPC staff needs to assist.
  • Both shared and dedicated usage possible.
  • Assistance via mail and skype.

[/list1]

The costs for shared hours are €1,- per hour, for private hours €10,- €5,- per hour. Discounts are available for academics and existing customers (also from past trainings) – just contact us. Goal of the shared hours is to setup the software, do basic tests, but not to do intensive benchmarks.

We hope you can now make a better decision what hardware to choose, or to have your paper’s conclusions based on more recent hardware.

Want more info or have special requests? Call +31854865760 (office) or +31645400456 (cell), or use the contact form.

video: OpenCL on Android

Michael-Leahy-talk-videoMichael Leahy spoke on AnDevCon’13 about OpenCL on Android. Enjoy the overview!

Subjects (globally):

  • What is OpenCL
  • 13 dwarfs
  • RenderScript
  • Demo

Mr.Leahy is quite critical about Google’s recent decisions to try to block OpenCL in favour of their own proprietary RenderScript Compute (now mostly referred to as just “RenderScript” as they failed on pushing twin “RenderScript Graphics”, now replaced with OpenGL).

Around March ’13 I submitted a proposal to speak about OpenCL on Android at AnDevCon in November shortly after the “hidden” OpenCL driver was found on the N4 / N10. This was the first time I covered this material, so I didn’t have a complete idea on how long it would take, but the AnDevCon limit was ~70 mins. This talk was supposed to be 50 minutes, but I spoke for 80 minutes. Since this was the last presentation of the conference and those in attendance were interested enough in the material I was lucky to captivate the audience that long!

I was a little concerned about taking a critical opinion toward Google given how many folks think they can create nothing but gold. Afterward I recall some folks from the audience mentioning I bashed Google a bit, but this really is justified in the case of suppression of OpenCL, a widely supported open standard, on Android. In particular last week I eventually got into a little discussion on G+ with Stephen Hines of the Renderscript team who is behind most of the FUD being publicly spread by Google regarding OpenCL. One can see that this misinformation continues to be spread toward the end of this recent G+ post where he commented and then failed to follow up after I posted my perspective: https://plus.google.com/+MichaelLeahy/posts/2p9msM8qzJm

And that’s how I got in contact with Micheal: we both are irritated by Google’s actions against our favourite open standards. Microsoft has long learned that you should not block, only favour. But Google lacks the experience and believes they’re above the rules of survival.

Apparently he can dish out FUD, but can’t be bothered to answer challenges to the misinformation presented. Mr. Hines is also the one behind shutting down commentary on the Android issue tracker regarding the larger developer communities ability to express their interest in OpenCL on Android.

Regarding a correction. At the time of the presentation given the information at the time I mentioned that Renderscript is using OpenCL for GPU compute aspects. This was true for the Nexus 4 and 10 for Android 4.2 and likely 4.3; in particular the Nexus 10 using the Mali GPU from Arm. The N4 & N10 were initially using OpenCL for GPU compute aspects for Renderscript. Since then Google has been getting various GPU manufacturers to make a Renderscript driver that doesn’t utilize OpenCL for GPU compute aspects.

I hope you like the video and also understand why it remains important we keep the discussion on Google + OpenCL active. We must remain focused on the long-term and not simply accept on what others decide for us.

Heterogeneous Systems Architecture (HSA) – the TL;DR

HSASolutionStack
Legacy-apps run on HSA-hardware, but less optimal.

The main problem of discrete GPUs is that memory needs to be transferred from CPU-memory to GPU-memory. Luckily we have SoCs (GPU and CPU in one die), but still you need to do in-memory transfers as the two processors cannot access memory outside their own dedicated memory-regions. This is due the general architecture of computers, which did not take accelerators into account. Von Neumann, thanks!

HSA tries to solve this, by redefining the computer-architecture as we know it. AMD founded the HSA-foundation to share the research with other designers of SoCs, as this big change simply cannot be a one-company effort. Starting with 7 founders, it has now been extended to a long list of members.

Here I try to give an overview of what HSA is, not getting into much detail. It’s a TL;DR.

What is Heterogeneous Systems Architecture (HSA)?

It consists mainly of three parts:

  • new memory-architecture: hUMA,
  • new task-queueing: hQ, and
  • an intermediate language: HSAIL.
hsa-overview
HSA enables tasks being sent to CPU, GPU or DSP without bugging the CPU.

The basic idea is to give GPUs and DSPs about the same rights as a CPU in a computer, to enable true heterogeneous computing.

hUMA (Heterogeneous Uniform Memory Access)

HSA changes the way memory is handled by eliminating a hierarchy in processing-units. In a hUMA architecture, the CPU and the GPU (inside the APU) have full access to the entire system memory. This makes it a shared memory system as we know it from multi-core and multi-CPU systems.

HSA-shared-mem-supersimplified
This is the super-simplified version of hUMA: a shared memory system with CPU, GPU and DSP having equal rights to the shared memory.

hQ (Heterogeneous Queuing)

HSA gives more rights to GPUs and DSPs, leveraging work from the CPU. Compared to the Von Neumann architecture, the CPU is not the Central Processing Unit anymore – each processor can be in control and create tasks for itself and the other processors.

heterogeneous-queing
HSA-processors have control over their own and other application task queues.

HSAIL (HSA Intermediate Language)

HSAIL is a sort of virtual target for HSA-hardware. Hardware-vendors focus on getting HSAIL compiled to their processor instruction sets, and developers of high-level languages target HSAIL in their compilers. This is a proven concept of evolving complex hardware-software projects.

It is pretty close to OpenCL SPIR, which has comparable goals. Don’t see them as competitors, but two projects which both need different freedoms and will work along.

What is in it for OpenCL?

OpenCL 2.0 has support for Shared Virtual Memory, Generic Address Space and Recursive Functions. All supported by HSA-hardware.

OpenCL-code can be compiled to SPIR, which compiles to HSAIL, which compiles to HSA-hardware. When the time comes that HSAIL starts supporting legacy hardware, SPIR can be skipped.

HSA is going to be supported in OpenCL 1.2 via new flags – watch this thread.

Final words

Two companies not there: Intel and Nvidia. Why? Because they want to do it themselves. The good news is that HSA is large enough to define the new architecture, making sure we get a standard. The bad news is that the two outsiders will come up with an exception for whatever reason, which gives a need for exceptions in compilers.

You can read more on the website of the HSA-foundation or ask me in the comments below.

PRACE Spring School 2014

prace-spring-school-2014On 15 – 17 April 2014 a 3-day workshop around HPC is organised. It is free, and focuses on bringing industry and academy together.

Research Institute for Symbolic Computation (RISC) / Johannes Kepler University Linz Kirchenplatz 5b (Castle of Hagenberg) 4232 Hagenberg Austria

The PRACE Spring School 2014 will take place on 15 – 17 April 2014 at the Castle of Hagenberg in Austria. The PRACE Seasonal School event is hosted and organised jointly by the Research Institute for Symbolic Computation / Johannes Kepler University Linz (Austria), IT4Innovations / VSB-Technical University of Ostrava (Czech Republic) and PRACE.

The 3-day program includes:

  • A 1-day HPC usage for Industry track bringing together researchers and attendees from industry and academia to discuss the variety of applications of HPC in Europe.
  • Two 2-day tracks on software engineering practices for parallel & emerging computing architectures and deep insight into solving multiphysical problems with Elmer on large-scale HPC resources with lecturers from industry and PRACE members.

The PRACE Spring School 2014 programme offers a unique opportunity to bring users, developers and industry together to learn more about efficient software development for HPC research infrastructures. The program is free of charge (not including travel and accommodations).

Applications are open to researchers, academics and industrial researchers residing in PRACE member countries, and European Union Member States and Associated Countries. All lectures and training sessions will be in English.

Applications are open to researchers, academics and industrial researchers residing in PRACE member countries, and European Union Member States and Associated Countries. All lectures and training sessions will be in English. Please visit http://prace-ri.eu/PRACE-Spring-School-2014/ for more details and registration.

At StreamHPC we support such initiatives.

First Khronos Chapter meeting in Amsterdam: WebGL/OpenGL

highres_225764012.jpeg

Thursday 13 February 2014 the first Khronos meetup in will take place. We expect a small group, so the location will be cozy and there will be enough time to talk with a beer. First round is on me, admission is free.

Goal is to learn about open media-standards from Khronos and others. So when OpenCV is discussed, we’ll also talk OpenVX. The target group is programmers and Indy developers who are interested in creating multi-OS and multi-device software.

Program

I am very thrilled to tell that Ton Roosendaal of the Blender Foundation will talk about the releationship between his Blender and Khronos OpenGL.

Second Maarten and Jurjen of ThreeDee Media will talk about WebGL, from a technical and a market view. Is WebGL ready for prime-time?

Then you can show your stuff. For that I’ll bring a good laptop with Windows 8.1 and Ubuntu 13.10 64.

Prepare for Meetup today!

See the Meetup-page for more information. See you there!

 

 

 

OpenCL alternatives for CUDA Linear Algebra Libraries

While CUDA has had the advantage of having many more libraries, this is no longer its main advantage if it comes to linear algebra. If one thing changed over the past year, then it is linalg library-support for OpenCL. The choices have been increased at a continuous rate, as you can see the below list.

A general remark when using these libraries. When using them you need to handle your data-transfers and correct data-format, with great care. If you don’t think it through, you won’t get the promised speed-up. If not mentioned, then free.

Subject CUDA OpenCL
FFT
cuff_ampchart

The NVIDIA CUDA Fast Fourier Transform library (cuFFT) provides a simple interface for computing FFTs up to 10x faster. By using hundreds of processor cores inside NVIDIA GPUs, cuFFT delivers the…

clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.
Linear Algebra
MAGMA-Logo

MAGMA is a collection of next generation, GPU accelerated ,linear algebra libraries. Designed for heterogeneous GPU-based architectures. It supports interfaces to current LAPACK and BLAS standards.

clMAGMA is an OpenCL port of MAGMA for AMD GPUs. The clMAGMA library dependancies, in particular optimized GPU OpenCL BLAS and CPU optimized BLAS and LAPACK for AMD hardware, can be found in the AMD Accelerated Parallel Processing Math Libraries (APPML).
Sparse Linear Algebra
cusp_logo

CUSP is an open source C++ library of generic parallel algorithms for sparse linear algebra and graph computations on CUDA architecture GPUs. CUSP provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.

clBLAS implements the complete set of BLAS level 1, 2 & 3 routines. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP. In addition to core functionality and many other features including BLAS level 1-3 support and iterative solvers, the latest release ViennaCL 1.5.0 provides many new convenience functions and support for integer vectors and matrices.VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported.
Random number generation
cuRandImage

The NVIDIA CUDA Random Number Generation library (cuRAND) delivers high performance GPU-accelerated random number generation (RNG). The cuRAND library delivers high quality random numbers 8x…

The Random123 library is a collection of counter-based random number generators (CBRNGs) for CPUs (C and C++) and GPUs (CUDA and OpenCL). They are intended for use in statistical applications and Monte Carlo simulation and have passed all of the rigorous SmallCrush, Crush and BigCrush tests in the extensive TestU01 suite of statistical tests for random number generators. They are not suitable for use in cryptography or security even though they are constructed using principles drawn from cryptography.
KeyVisual_Primary_verysm

The CUDA Math library is an industry proven, highly accurate collection of standard mathematical functions. Available to any CUDA C or CUDA C++ application simply by adding “#include math.h” in…

Looking into the details of what the CUDA math lib exactly is.
AI
GPU_AI_games

A technology preview with CUDA accelerated game tree search of both the pruning and backtracking styles. Games available: 3D Tic-Tac-Toe, Connect-4, Reversi, Sudoku and Go.

There are many tactics to speed up such algorithms. This CUDA-library can therefore only be used for limited cases, but nevertheless it is a very interesting research-area. Ask us for an OpenCL based backtracking and pruning tree searching, tailored for your problem.
Dense Linear Algebra
CULAtoolslogo2
Provides accelerated implementations of the LAPACK and BLAS libraries for dense linear algebra. Contains routines for systems solvers, singular value decompositions, and eigenproblems. Also provides various solvers.
Free (with limitations) and commercial.
See ViennaCL, VexCL and clBLAS above. Kudos to the CULA-team, as they were one of the first with a full GPU-accelerated linear algebra product.
Fortran
RogueWave-IMSL-Box2
The IMSL Fortran Numerical Library is a comprehensive set of mathematical and statistical functions available that offloads CPU work to NVIDIA GPU hardware where the cuBLAS library is utilized.
Free (with limitations) and commercial.
OpenCL-FORTRAN is not available yet. Contact us, if you have interest and wish to work with a pre-release once available.
Subject
arrayfire_logo340

Comprehensive GPU function library, including functions for math, signal processing, image processing, statistics, and more. Interfaces for C, C++, Fortran, and Python. Integrates with any CUDA-program.

Free (with limitations) and commercial.

ArrayFire 2.0 is also available for OpenCL. Note that currently fewer functions are supported in the OpenCL-version than are supported in CUDA-ArrayFire, so please check the OpenCL documentation for supported feature list.Free (with limitations) and commercial.
Subject
nppeye

The NVIDIA Performance Primitives library (NPP) is a collection of over 1900 image processing primitives and nearly 600 signal processing primitives that deliver 5x to 10x faster performance than…

Kudos for NVIDIA for bringing it all at one place. OpenCL-devs have to do some googling for specific algorithms.

So the gap between CUDA and OpenCL is certainly closing. CUDA provides a lot more convenience, so OpenCL-devs still have to keep reading blogs like this one to find what’s out there.

As usual, if you have additions to this list (free and commercial), please let me know in the comments below or by mail. I also have a few more additions to this list myself – depending on your feedback, I might represent the data differently.

European HPC Magazines

If one thing can be said about Europe, is that it is quite diverse. Each country solves or fails to solve its own problems individually, while European goals are not always well-cared for. Nevertheless, you can notice things changing. One of the areas where things have changed, is that of HPC. HPH has always been a well-interconnected research in Europe (with its centre in CERN), but there is a catch-up going on for the European commercial market. The whole of Europe has new goals set for better collaboration between companies and research institutes with programs like Horizon 2020. This means that it becomes necessary to improve interconnections among much larger groups.

In most magazines HPC is a section of a broader scope. This is also very important as this introduces HPC to more people. Now, I’d like to concentrate on the focus magazines. There are mainly two magazines available: Primeur Magazine and HPC Today.

Primeur Magazine

logo-weeklyDe Netherlands based magazine Primeur-magazine has been around for years, with HPC-news from Europe, video-channel, knowledge-base, calendar and more. Issues of past weeks can be read online (for free), but news can also be delivered via a weekly e-mail (paid service, prices range from €125 to €4000 per company/institute, depending on size).

They focus on being a news-channel for what is going on in in the HPC-world, both in the EU and the US. Don’t forget to follow them on Twitter.

HPC Today

Update: the magazine changed its name from HPC Magazine to HPC Today.

With several editions (Americas, Europe and France), websites and TV channels, the France based HPC Today brings an actionable coverage of the HPC and Big Data news, technologies, uses and research. Subscriptions are free, as the magazine is paid-for by advertisements. They balance their articles by targeting both people who deeply understand malloc() and people who want to know what is going on. Their readers are developers and researchers from both academic and private sectors.

With the change to HPC Today the content has slightly changed according the requests from the readers: less science, more HPC news. For the rest it’s about the same.

HPC-today
To get an idea of how they’re doing, check the partners of HPC Magazine: Teratec, ISC events and SC conference.

Other European HPC sources

Not all information around the web is nicely bundled in a PDF. Find a small list below to help you start.

InSiDE

The German National Supercomputing Centers HLRS, LRZ, NIC publish the online magazine InSiDE (Innovatives Supercomputing in Deutschland) twice a year. The articles are available in html and PDF. It gives a good overview of what is going on in Germany and Europe. There are no ways to subscribe via e-mail, so it would be better to put it in your calendar.

e-IRG

The e-Infrastructure initiative‘s main goal is to support the creation of a political, technological and administrative framework for an easy and cost-effective shared use of distributed electronic resources across Europe.

e-IRG is not a magazine, but it is a good start to find information about HPC in Europe. Their knowledge-base is very useful when trying to get an overview what is there in Europe: Projects, Country-statistics, Computer centers and more. They closely collaborate with Primeur-magazine, so can you see some overlap in the information.

PRACE Digest

PRACE (Partnership for Advanced Computing in Europe) is to enable high impact scientific discovery, as well as engineering research and development across all disciplines to enhance European competitiveness for the benefit of society. PRACE seeks to achieve this mission by offering world class computing and data management resources and services through a peer review process.

The PRACE digest appears as twice a year as a PDF.

More?

Did we miss an important news-source or magazine? Let us know in the comments below!

VectorFabrics: 2014 will be parallel

Vector_Fabrics_company_logo_large_square
There is a F and V.

Toolmaker VectorFabrics sent 9 predictions for this year in their newsletter. I’d like to share it with you.

Nine predictions for 2014 that prove the programming landscape is changing

It is not hard to predict that this year will see a lot of activity around multicores and manycores. 2014 will be the year that software has to catch up with highly concurrent hardware. So we expect to see some major changes in how people view multicore programming:

  1. Neither Intel, AMD nor Qualcomm releases any new single core processors in 2014. Therefore, it is less and less acceptable to release pure sequentially operating applications.
  2. Intel releases a 15-core Xeon. On a typical 4-socket motherboard your OS sees 120 available cores. OpenMP is the preferred programming paradigm on such a platform for data-intensive shared-memory calculations. You have to deal with performance bottlenecks, including Amdahl’s law, cache performances and memory bandwidth issues.
  3. Next to ARM big.LITTLE systems, 2014 sees the first true octacore cell phones and tablets. At that point, it becomes painfully clear that applications need changes to benefit from so many cores. Both true octacore and big.LITTLE processors see very little adoption in mobile devices as long as software that can benefit is missing.
  4. At least one major mobile phone vendor loses market share because their hardware may be good, but the software (especially the web browser) cannot utilize the hardware to the max.
  5. Both the XBox One and PlayStation 4 feature AMD Jaguar octacore processors with GP-GPUs. Very few games will using all the compute power, and customers will wonder why to upgrade as the performance difference to their existing console is not so different.
  6. Two great open standards, OpenACC and OpenMP see a nice boost in adoption thanks to upcoming support in the latest open source compilers. Clang 3.5 features OpenMP 4.0 support. In addition GCC 4.9 also receives OpenACC support.
  7. In the mobile space, offloading to GP-GPUs is hot as new architectures from Qualcomm Adreno 420, Nvidia Tegra K1 and Imagination PowerVR Series 6 will each allow offloading. Managing and programming the offloading will remain a problem: OpenCL? OpenACC? CUDA? RenderScript?
  8. Major desktop applications jump on the compute-offloading bandwagon to win performance, using either OpenCL or OpenACC
  9. Programmability of the next-gen Intel Xeon Phi (Knights Landing) will raise some eyebrows. This 2015 chip will have 72 Atom Silvermont cores with local memory, cache and up to 384 GB shared memory. This is outside the comfort zone of most programmers.

You guessed right: their tool has to do with parallel programming.

What are your predictions for 2014?

ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15

mont-blancAccording to the latest newsletter of the Mont-Blanc Project, it was explained that the GPU on a Samsung Exynos 5 is much faster and greener than its CPU: 3.5 times faster with half the energy. They built a supercomputer using 810 Exynos SoCs, that can deliver a 26 TFLOPS of peak performance. With the upcoming mobile GPUs becoming exponentially faster, they have all the expertise to build an even faster and greener ARM-supercomputer after this.

The Mont-Blanc compute cards deliver considerably higher performance; at 50% lower energy consumption, compared with previous ARM-based developer platforms.

The Mont-Blanc prototype is based on the Samsung Exynos 5 Dual SoC, which integrates a dual-core ARM Cortex-A15 and an on-chip ARM Mali-T604 GPU, and has been featured and market proven in advanced mobile devices. The dual-core ARM Cortex-A15 delivers twice the performance of the quad-core ARM Cortex-A9, used in the previous generation of ARM-based prototype, whilst consuming 20% less energy for the same workload. Furthermore, the on-chip ARM Mali-T604 GPU provides 3.5 times higher performance than the dual-core Cortex-A15, whilst consuming half the energy for the same workload.

Each Mont-Blanc compute card integrates one Samsung Exynos 5 Dual SoC, 4 GB of DDR3-1600 DRAM, a microSD slot for local storage and a 1 GbE NIC, all in an 85x56mm card (3.3×2.2 inches). A single Mont-Blanc blade integrates fifteen Mont-Blanc compute cards and a 1 GbE crossbar switch, which is connected to the rest of the system via two 10 GbE links. Nine Mont-Blanc blades fit into a standard BullX 9-blade INCA chassis. A complete Mont-Blanc rack hosts up to six such chassis, providing a total of 1620 ARM Cortex-A15 cores and 810 on-chip ARM Mali-T604 GPU accelerators, delivering 26 TFLOPS of peak performance.

“We are only scratching the surface of the Mont-Blanc potential”, says Alex Ramirez, coordinator of the Mont-Blanc project. “There is still room for improvement in our OpenCL algorithms, and for optimizations, such as executing on both the CPU and GPU simultaneously, or overlapping MPI communication with computation.”

Continue reading “ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15”

Professional and Consumer Media Software using OpenCL

OpenCL_Logo

More and more professional media software now has support for OpenCL. It starts to be a race where you cannot stay behind. If the competitor runs more than twice as fast on the same hardware, then you just can’t say “Sorry, you should buy NVIDIA hardware”. I expected this to happen, but could not tell in what industry they would run fastest. Seems it is fluid dynamics, video-editors and photo-editors.

AMD and Intel mostly have been selected as collaboration partners. Apple has been a main drive, especially with the introduction of their new MAC Pro with two high-end AMD FirePro GPUs.

Sony Catalyst Family

Sony released three new software packages to support video professionals in pre- and post-production.

Sony-catalyst

This new family of products, Catalyst Browse (media management), Catalyst Prepare (video preproduction assistant) and Catalyst Edit (4K and Sony RAW video editing) has OpenCL support from the start.

Colorfront Express Dailies and On-Set Dailies

This software is an on-set dailies processing system (playback and sync, QC, colour grading, audio and metadata management).

The 2014 versions have OpenCL support in their transcoder plugin, Transkoder.

CGE05_a_OnsetDailies

RED REDCINE-X PRO

redcine-x

REDCINE-X is a coloring toolset, integrated timeline, and post effects collection in a professional, flexible environment for your 4K or 5K .R3D files. RED has added support for OpenCL in build 22.

The Foundry Nuke Blink framework

As presented on GPUconf, The Foundry has opened their framework for running OpenCL kernels. It creates OpenCL-kernels (optimised for AMD or NVIDIA) from C++ Blink kernels.

nukestudio

NUKE studio is a node-based VFX, editorial and finishing studio. As with most products on this page, look for the “reel” to get a nice demo of its capabilities.

Magix Hybrid Video Engine

Video Deluxe and Movie Edit both have OpenCL support since 2012, thanks to the new shared video engine.

Adobe CS6 creative suite

Adobe has entered the OpenCL market publicly with . With Premiere Pro (video editing) and Photoshop (photo-editing) two main products with advanced GPU-acceleration via OpenCL.

Video on GPU-effects on Premiere Pro CS5.

FAQ on GPU-acceleration on Photoshop CS6.

Sony Vegas Pro

Vegas Pro is a video editing software package for non-linear editing systems, and has OpenCL support since version 10d. Also in the consumer version (Sony Movie Studio) there is OpenCL-support.

Sony-Vegas-Pro-13

RealFlow Hybrido2 engine

RealFlow is fluid dynamics software and its new engine Hybrido2 has support for OpenCL since this year. And you just have to love their commercial videos.

Autodesk Maya

Maya is a toolsets to help create and maintain the modern, open pipelines you need to address today’s challenging 3D animation, visual effects, game development and post-production projects. Since the 2013 version it is accelerated for physics simulations via Bullet and OpenCL.

ArcSoft SimHD and Sim3D engine

ArcSoft media-engines SimHD and Sim3D have OpenCL support since several years and are used in several of their  products.

simHD

BlackMagic Design

BMD has two suites which use OpenCL, Resolve and Fusion. DaVinci was acquired in 2009 and EyeOn in 2014.

(DaVinci) Resolve

Resolve has real-time colour correction thanks to OpenCL.

(Eyeon) Fusion

EyeonFusionScreenshotSmall

Fusion is an image compositing software program created by eyeon Software Inc. It is typically used to create visual effects and digital compositing for film, HD and commercials.

It uses OpenCL since version 6.

Roxio Creator Suite

Roxio uses OpenCL for accelerated rendering in their suite. They were one of the first to implement OpenCL – I think already in 2010, before OpenCL was even cool.

boxshot-creator

Unluckily they don’t have much information – just a mention that they have support.

Apple Final Cut Pro and iMovie

Apple has support in Final Cut Pro X, Motion 5 and Compressor 4.

finalcutprox_magnetic

Also iMovie works a lot faster when you have an OpenCL capable MAC.

Blender Cycles & Bullet

You cannot find any demonstration of new video hardware without Big Bucks Bunny, the short CG movie created with Blender.

It uses OpenCL in two parts: physics simulations (Bullet) and compositor (Cycles).

Side Effects Houdini

Houdini is a procedural node based 3D animation and visual effects tools for film, broadcast, entertainment and visualisation production.

zMatte_4bDigitalFilmTools

There is support for OpenCL in zMatte, Composite Suite Pro and Film Stocks since Q4 2013.

zMatte is a keyer for blue and green screen composites. Composite Suite Pro is a collection of visual effects plug-ins. Film Stocks simulates color and black and white still photographic film stocks, motion picture films stocks and historical photographic processes.

OTOY OctaneRender 3

OctaneRender is a GPU-based, real-time 3D, unbiased rendering application. In March 2015 OTOY announced OctaneRender 3, which has full OpenCL support:

OpenCL support: OctaneRender 3 will support the broadest range of processors possible using OpenCL to run on Intel CPUs with support for out-of-core geometry, OpenCL FPGAs and ASICs, and AMD GPUs.

Below is a reel of OcateRender 2 with CUDA. According to OTOY the performance on AMD and NVidia is comparable.

SAM Alchemist XF

SAM-alchemist-XF

Alchemist XF supports format and framerate conversion from SD up to 4K for a wide variety of file formats at high speed.

More?

There is a lot more OpenCL-powered software coming up rapidly (we hear things). But we also missed (or accidentally forgot) software. Please help making this list complete and send us an email.

OpenCL SPIR by example

SPIR2OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL-code, comparable to LLVM IL and HSAIL. It is a search for what would be a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IL is too general, but SPIR is a subset of it. I’ll discuss HSAIL, on where it differs from SPIR – I thought SPIR was a better way to start introducing these. In my next article I’d like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding what it all means and where we’re going to, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors for the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications and to no surprise Intel has had the first SPIR-compiler. I am happy to see Nvidia is in the committee too, and hope they don’t just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect stuff from them.

 

For now, just see what SPIR is – as it can help us understand how the compiler work and write better OpenCL code. I used Intel’s offline OpenCL compiler for compiling the below kernel to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).

[raw]

__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}

[/raw]

There are two variations for generating SPIR-code: binary SPIR, LLVM-SPIR (both in 32 and 64 bit versions). As you might understand, the binary form is not really readable, but SPIR described in the LLVM IL language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).

SC14 Workshop Proposals due 7 February 2014

Just to let you know that there should be even more OpenCL and related technologies on SC14

sc14emailheader

Are you interested in hosting a workshop at SC14?

Please mark your calendars as SC will be accepting proposals from 1 January – 7 February for independently planned full-, half-, or multi-day workshops.

Workshops complement the overall SC technical program. The goal is to expand the knowledge base of practitioners and researchers in a particular subject area providing a focused, in-depth venue for presentations, discussion and interaction. workshop proposals will be peer-reviewed academically with a focus on submissions that wil inspire deep and interactive dialogue in topics of interest to the HPC community.

For more information, please consult: http://sc14.supercomputing.org/program/workshops

Important Submission Info

Web Submissions Open: 1 January 2014
Submission Deadline: 7 February 2014
Notification of acceptance: 28 March 2014

Submit proposals via: https://submissions.supercomputing.org/
Questions: workshops@info.supercomputing.org

We’re thinking of proposing one too. Want to collaborate? Mail us! Don’t forget, to go to HiPEAC (20 January) and IWOCL (12 May) to meet us!

FortranCL working example

f90
The ’96 book is still available here, and has some good explanations of numerical mathematics. Oh, the good old times..

Last week I needed to get Fortran working with OpenCL. As the example-page is not up-to-date and not much documentation is on the interwebs outside the official page, this was not as straight-forward as I hoped. The test-suite and this article provided code I could actually use. First I wanted to have things in a module, second I needed to control which device I wanted to use, third I needed function-names that could be used in a larger project. The result is below, and hopefully usable for the Fortran folks around who want to add some OpenCL-kernels to their existing code.

It uses the two-step initialisation we know from C, for safe memory allocation. It is based on the utils.f90 from the test-suite.

The only good way to translate is the Rose-compiler – which is a pain to install. I tried various f2c-scripts (from the 90’s, but they all failed. I must say that continuous switching between Fortran-mode and C-mode was the hardest part of the porting.

If you have tips&tricks to use OpenCL from Fortran, let everybody know in the comments. Also let me know if the code doesn’t work for you, or you have improvements (like better error-handling).

The rest of utils.f90 (which I renamed to clutils.f90 for better integration) is mostly the same – only this subroutine needed changes:

(...)

subroutine cl_initialize(platform_id, device_id, device, context, command_queue)
!use ISO_C_BINDING
type(cl_device_id),     intent(out)     :: device
type(cl_context),       intent(out)     :: context
type(cl_command_queue), intent(out)     :: command_queue
integer                                 :: platform_id
integer                                 :: device_id

integer :: platform_count, device_count, ierr
character(len = 100) :: info
type(cl_platform_id) :: platform
type(cl_platform_id), allocatable, target :: platform_ids(:)
type(cl_device_id), allocatable, target :: device_ids(:)

! get the platform ID
call clGetPlatformIDs(platform_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL platform.')
allocate(platform_ids(platform_count))
call clGetPlatformIDs(platform_ids, platform_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL platform.')

if (platform_id .gt. platform_count .or. platform_id .lt. 1) platform_id = 0
platform = platform_ids(platform_id)

! get the device ID
call clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, device_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL device.')
allocate(device_ids(device_count))
call clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, device_ids, device_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL device.')

if (device_id .gt. device_count .or. device_id .lt. 1) device_id = 1
device = device_ids(device_id)

! get the device name and print it
call clGetDeviceInfo(device, CL_DEVICE_NAME, info, ierr)
print*, "CL device: ", info

! create the context and the command queue
context = clCreateContext(platform, device, ierr)
command_queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, ierr)

end subroutine cl_initialize

(...)

Continue reading “FortranCL working example”

The OpenCL event of the year: IWOCL 2014 – Bristol, UK, 12 & 13 May

iwoclKhronos has supported and organised for the second time the International Workgroup on OpenCL (IWOCL, pronounced as “eye-wok-ul”). Last year the event took place at Georgia Tech, Atlanta, Georgia, in the United States. This year the event will be held in Europe: Bristol University, Bristol, England, UK.

IWOCL 2013 Presentations

Last year there was a varying programme:

  • Porting a Commercial Application to OpenCL: A Case Study
  • Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor
  • Parallelization of the Shortest Path Graph Kernel on the GPU
  • OpenCL-based Approach to Heterogeneous Parallel TSP Optimization
  • clMAGMA: High Performance Dense Linear Algebra with OpenCL
  • Multi-Architecture ISA-Level Simulation of OpenCL
  • Optimizing OpenCL Applications on the Intel Xeon Phi

You can see and download these presentations here. This year the organisation tries to offer a equally exciting programme.

Workshop means it’s an active event

It’s all about sharing, but not just by letting you sit and listen. Below you’ll find some of the options.

Present your work

Did you use OpenCL in your software or research? You are very welcome to present your experience and results. IWOCL is the premier forum for the presentation and discussion of new designs, trends, algorithms, programming models, software, tools and ideas for OpenCL.

Abstract Submission Deadline: Friday 31 January, 2014

It can be in the form of:

  • Research paper
  • Technical presentation
  • Workshops and Tutorial
  • Poster

(StreamHPC’s Vincent Hindriksen is on the Conference Sessions Committee)

Communicate with the workgroup

20-P1020816
Khronos booth at SC13 – some you will see again at IWOCL

The OpenCL workgroup likes to communicate with OpenCL’s users. IWOCL provides a formal channel for community feedback to the Khronos Group’s OpenCL workgroup. This is one of the best moments to be heard, discuss a hack/bug or share a great idea that should be in the next version of OpenCL.

Meet OpenCL developers and enthusiasts

During the breaks, social events and during presentations, you can discuss all your ideas and thoughts on on-topic and off-topic subjects, or you can also join existing talks.

If you are new into compute acceleration, you’ll find many people who are willing to explain what it does and add their personal view.

Test-drive software

We will bring some hardware, on which you can test your kernels. (We’ll put more info about this later!)

Sponsor and present your product

There will be booths available for the sponsors, where you can show your product to the public.

Stay up to date on the event

We will try keep you up-to-date as much as possible, but IWOCL has some channels to keep you informed:

We’ll put on a link when tickets are ready to be sold.

Let others know you plan to be on the event by saying hi in the comments.

Hope to see you there!