Differences from OpenCL 1.1 to 1.2

This article is of interest to you if you don’t want to read the whole new specification [PDF] for OpenCL 1.2. As not everything is clear yet while there are no drivers out, this article will be edited over the coming time – feedback is very welcome.

After many meetings of the many members of the OpenCL task force, many ideas sprout, and every 17 or 18 months a new version of OpenCL comes out to give these ideas form. Some ideas are totally new, some were already brought out in another product by a member, and some do not appear because other members voted against them. The last category is very interesting, and hopefully we’ll soon see a lot of forum discussion on what is missing now and should be in the next version.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to go for OpenCL 2.0. I will discuss these phases in a follow-up, including what you as a user, programmer or customer can expect and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA product. In another article I will let you know everything there is to know. For now I concentrate on the actual differences in this version software-wise, and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.

New Kernel-functions

The most rudimentary debug tool, printf, used to require a vendor-specific extension, but now you can flood the standard output without it. If you haven’t tried printf yet: take a global size of 1000, let the CPU print “ping\n” and the kernels “pong\n” – then you know exactly why you need to be careful with this function.

The function popcount returns the number of ones in a variable. So if x is 5 (binary 101), then popcount(x) is 2. A nice explanation of fast popcount on SSE is here. It counts bits regardless of what they represent, so it also counts the sign bit.
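To make the sign-bit remark concrete, here is a minimal host-side sketch in plain C of what popcount computes for a 32-bit value. It is not the OpenCL built-in itself; the function names popcount32_ref and popcount_int_ref are made up for this illustration.

```c
#include <stdint.h>

/* Reference bit count for a 32-bit pattern (hypothetical helper name).
   The argument is treated as raw bits, so no bit has a special meaning. */
int popcount32_ref(uint32_t bits)
{
    int count = 0;
    while (bits != 0) {
        bits &= bits - 1;  /* clears the lowest set bit */
        ++count;
    }
    return count;
}

/* Signed variant: the two's-complement pattern is reinterpreted as unsigned,
   which is why the sign bit is counted like any other bit. */
int popcount_int_ref(int32_t x)
{
    return popcount32_ref((uint32_t)x);
}
```

With this sketch, popcount_int_ref(5) gives 2 and popcount_int_ref(-1) gives 32, because all bits including the sign bit are set.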

Replaced functions

The OpenCL group prefers to change the name of functions when the parameter-list changes. Below are the “new” functions I encountered.

clEnqueueMarker, clEnqueueBarrier and clEnqueueWaitForEvents have been replaced by clEnqueueMarkerWithWaitList and clEnqueueBarrierWithWaitList. The barrier and marker functionality is still the same, but if a non-NULL wait list is given, the command completes once all events in that list have completed. Before, this was tricky to program. A new option is that you can fire an event when all previous events have occurred.

clCreateImage2D and clCreateImage3D have been merged into clCreateImage. clCreateFromGLTexture2D and clCreateFromGLTexture3D have been merged into clCreateFromGLTexture. As the functions were comparable and the parameter texture_target handles the differences, not much changed. What is new (and a major reason for merging these functions) is the addition of 1D images and support for image arrays (see below for an explanation of how they work). 1D images were introduced to be compliant with OpenGL 1D images.

The mem-flags CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY and CL_MEM_HOST_NO_ACCESS have been added to describe how the host can access the object on the device, where 1.1 only described how the device could access the object and whether the memory was allocated on the device or on the host.

clUnloadCompiler and clGetExtensionFunctionAddress have been changed to clUnloadPlatformCompiler and clGetExtensionFunctionAddressForPlatform, and the platform must now be specified. This seems logical, as clUnloadCompiler probably removed the compilers of all platforms, and the function address seems to have been unspecified when several platforms were loaded. Neither function is used much, though.

DirectX

Besides the fancy 1D images, support for DirectX 9 and 11 textures has also been added. DX9 is an interesting choice, but this way such software can be given a longer life by adding OpenCL to speed it up. I still disagree with it having official KHR support, as it only works with Microsoft code – under Linux (and all its derivatives like Android) and OS X it is not supported.

The new functions clCreateFromDX9MediaSurfaceKHR, clEnqueueAcquireDX9MediaSurfacesKHR and clEnqueueReleaseDX9MediaSurfacesKHR are comparable to clCreateFromD3D10Texture2DKHR, clEnqueueAcquireD3D10ObjectsKHR and clEnqueueReleaseD3D10ObjectsKHR. clCreateFromD3D11BufferKHR, clCreateFromD3D11Texture2DKHR, clCreateFromD3D11Texture3DKHR, clEnqueueAcquireD3D11ObjectsKHR and clEnqueueReleaseD3D11ObjectsKHR are like their D3D10-counterparts.

Sharing like cl_khr_d3d10_sharing for DX9 and 11 is enabled with cl_khr_dx9_media_sharing and cl_khr_d3d11_sharing. The counterparts of clGetDeviceIDsFromD3D10KHR are clGetDeviceIDsFromD3D11KHR and clGetDeviceIDsFromDX9MediaAdapterKHR.

Multi-user and Multi-device

As OpenCL devices get more powerful, it becomes more likely that a device is better off being shared. It is also getting more common to have multiple GPUs in a system, and/or various capable devices now that CPUs get better support.

clEnqueueMigrateMemObjects helps with multiple devices by moving memory objects from one device to another; previously this had to be done by copying via the host.

clCreateSubDevices partitions a device into sub-devices. It can be partitioned into equal parts, into specified sizes, or depending on specific hardware. The last option can split the device based on e.g. the cache hierarchy, so that the compute units within a sub-device share a cache at the given level. The functions clRetainDevice and clReleaseDevice handle the lifetime of sub-devices. Previously this was available under the extension device_fission.
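For the "equal parts" option, the arithmetic can be sketched in plain C. This assumes the behaviour described in the 1.2 specification for equal partitioning: each sub-device gets the requested number of compute units, and compute units that do not fit into a full sub-device are left unused. Both function names are hypothetical and no OpenCL call is made here.

```c
/* Number of sub-devices an equal partition would yield when each sub-device
   gets units_per_sub compute units (hypothetical helper).
   Returns 0 when the request cannot be satisfied. */
unsigned equal_partition_count(unsigned compute_units, unsigned units_per_sub)
{
    if (units_per_sub == 0 || units_per_sub > compute_units)
        return 0;
    return compute_units / units_per_sub;
}

/* Compute units that end up in no sub-device for the same request. */
unsigned equal_partition_unused(unsigned compute_units, unsigned units_per_sub)
{
    if (units_per_sub == 0 || units_per_sub > compute_units)
        return compute_units;
    return compute_units % units_per_sub;
}
```

For a device with 6 compute units and a request for 4 per sub-device, this gives one sub-device and two idle compute units, which is worth checking before choosing a partition size.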

Initialisation of data

clEnqueueFillBuffer and clEnqueueFillImage help with initialising data by filling it with a pattern or a colour. Previously this was best done on the host, or with a kernel specially written for it, or it was just ignored. Now our lives have been improved.
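What "filling with a pattern" means for a buffer can be shown with a host-side reference in plain C, roughly the loop one had to write before. This is a sketch of the semantics, not the driver's implementation: fill_buffer_ref is a made-up name, and the check that offset and size are multiples of the pattern size mirrors the requirement that clEnqueueFillBuffer places on its arguments.

```c
#include <stddef.h>
#include <string.h>

/* Repeats `pattern` (pattern_size bytes) over the byte range
   [offset, offset + size) of `buffer`. Hypothetical reference helper.
   Returns 0 on success, -1 when the arguments are invalid. */
int fill_buffer_ref(void *buffer, const void *pattern, size_t pattern_size,
                    size_t offset, size_t size)
{
    if (buffer == NULL || pattern == NULL || pattern_size == 0)
        return -1;
    /* Offset and size must both be whole multiples of the pattern size. */
    if (offset % pattern_size != 0 || size % pattern_size != 0)
        return -1;

    unsigned char *dst = (unsigned char *)buffer + offset;
    for (size_t done = 0; done < size; done += pattern_size)
        memcpy(dst + done, pattern, pattern_size);
    return 0;
}
```

Filling elements 2 to 5 of a float array with 1.0f then becomes a single call with a 4-byte pattern, an offset of 2 * sizeof(float) and a size of 4 * sizeof(float).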

Building

It seems that more effort is being put into making sure the kernels are better protected. The work of clBuildProgram can now be split between clCompileProgram and clLinkProgram. If I understand correctly, it is comparable to how clCreateProgramWithBinary works, as this takes compiled binaries.

clGetProgramInfo and clGetProgramBuildInfo have been extended to give information on how the program has been built. The new function clGetKernelArgInfo returns the requested information on the arguments of a kernel. This is useful when the kernels are built separately from the host program, as is the case when binaries are used.

Image arrays

An array of 1D or 2D images can be written with write_image{f|i|ui|h}. The image index is given by the y (1D) or z (2D) coordinate. With read_image{f|i|ui|h} you need to specify the coordinates plus the image index: an int2 for arrays of 1D images and an int4 for arrays of 2D images.

get_image_array_size returns the number of images in an array. It is the responsibility of the software to keep things in order, as it does not give an array of image-numbers.
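On the host side, the data behind an image array is simply the images stored one after another, so the image index acts as an extra coordinate. The sketch below computes the byte offset of a pixel in such host data, assuming the row pitch and slice pitch (bytes per row and bytes per image) as they appear in the image descriptor. The helper names are hypothetical.

```c
#include <stddef.h>

/* Byte offset of pixel (x, y) in image number `layer` of a 2D image array
   kept in host memory. Pitches are in bytes. Hypothetical helper. */
size_t image2d_array_offset(size_t x, size_t y, size_t layer,
                            size_t element_size,
                            size_t row_pitch, size_t slice_pitch)
{
    return layer * slice_pitch + y * row_pitch + x * element_size;
}

/* Same idea for a 1D image array: every image is a single row,
   so the slice pitch is the size of one row. */
size_t image1d_array_offset(size_t x, size_t layer,
                            size_t element_size, size_t slice_pitch)
{
    return layer * slice_pitch + x * element_size;
}
```

For a tightly packed array of 4x3 RGBA8 images (4 bytes per pixel, row pitch 16, slice pitch 48), pixel (1, 2) of image 3 sits at byte offset 180.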

Other

Macros CL_VERSION_1_2 and __OPENCL_C_VERSION__ have been added. The first one gives 120 just like CL_VERSION_1_1 gives 110, the last one gives 100, 110 or 120.

Double-precision is now an optional core feature instead of an extension. Meaning, you just need to check if the device supports it, but you don’t need to pragma it in.

CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE has been deprecated. It gives the smallest alignment in bytes which can be used for any data type. It is quite comparable to CL_DEVICE_MEM_BASE_ADDR_ALIGN. This could help select the best device for an alignment-optimised kernel, but is rarely used.

A new flag CL_MAP_WRITE_INVALIDATE_REGION has been added to cl_map_flags. This is comparable to CL_MAP_WRITE, but it tells the runtime that the host will overwrite the mapped region, so there is no guarantee that the mapped pointer shows the current contents of that region.

Storage class specifiers extern and static are now supported. A storage class settles the scope of the variable (C definition here). I need to get deeper into this, as I would think extern is __global and static is __local – I’ll keep you posted once this is clearer.

Video

Tim Mattson of Intel explains some of the highlights of OpenCL 1.2 in this 12-minute video.


  • ZHAO Peng

    Nice review! I am looking forward to your post about the future version of OpenCL!
    I hope the implementation of OpenCL 1.2 would be released as soon as possible.

  • Horst Hrubesch

    clEnqueueFillImage makes me most happy. Good stuff!

  • Michael Zucchi

    Thanks for the summary, but why would static=local, extern=global?

    It’s required for the new linking stuff for the same reason it’s required in normal c environments.
