1: Device Access FlagsWhen defining a kernel, you define each memory-object as read-only, write-only or read-and-write. The memory-objects you create must be of the same. Just like with the kernels, you can pick one:
- <empty>. Same as CL_MEM_READ_WRITE.
- CL_MEM_READ_ONLY: Kernel can only read from the memory object. Write from the memory object is undefined.
- CL_MEM_WRITE_ONLY: Kernel can write to memory object. Read from the memory object is undefined.
- CL_MEM_READ_WRITE: Kernel can read and write to the memory object.
2: Host Access FlagsIn case of "CL_MEM_READ_WRITE" it is completely unpredictable what will be done ate the host. Describing how the host will use the memory-object would give compiler-hints for possible optimisations, especially when working with PCIe. Flags to specify this were introduced in OpenCL 1.2. You can pick one:
- <empty>. The host will read and write to this object.
- CL_MEM_HOST_WRITE_ONLY. The host will only write to the memory object.
- CL_MEM_HOST_READ_ONLY. The host will only read the memory object.
- CL_MEM_HOST_NO_ACCESS. The memory-object is used as buffer-object between two or more kernels (should be specified as CL_MEM_READ_WRITE).
Please let me know, if you have encountered cases where one of these two combinations resulted in faster transfers.CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_READ_ONLY CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_WRITE_ONLY
3: Host Pointer FlagsWhere it gets interesting (and messy) is when is defined how host and device memory interact. This we can describe with host-pointer flags. First let have a look on the normal way to send over a buffer of floats to the device:
cl_mem cl_input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL); int err = clEnqueueWriteBuffer(commands, cl_input, CL_TRUE, 0, sizeof(float) * dataCount, p_input, 0, NULL, NULL); err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_input);Here p_input is a pointer to an array of floats, cl_input an OpenCL memory object. Let us see the definitions of the flags, which influences how data is transferred between host and device.
- <empty>. Normal behaviour, with need for explicit writing and reading buffers, as in the above example.
- CL_MEM_ALLOC_HOST_PTR. Allocate memory at device, accessible from host. Used for shared memory devices (only ARM MALI has support, afaik).
- CL_MEM_COPY_HOST_PTR. Allocate memory at device and initialise with data at host_ptr.
- CL_MEM_USE_HOST_PTR. Allocate memory at device and pin it to host_ptr.
Let OpenCL handle the copyThe flag CL_MEM_COPY_HOST_PTR results in only one big difference with the default. It does the allocation of device-memory and copying to the device in one step. The code looks like:
cl_mem cl_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * count, p_input, NULL); err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_input);It is a matter of taste, what you use (and I personally don't use this version). It has no advantages for faster transfer, known to me. This flag is not meant for getting data back. There are cases known that buffer-object 'p_input' in this example cannot be used by clEnqueueWriteBuffer or clEnqueueReadBuffer; in that case CL_MEM_ALLOC_HOST_PTR needs also be specified too. Be careful with this, as this is a contradictive contract.
Pinned memoryThe flag CL_MEM_USE_HOST_PTR enables the so called "pinned memory". Pinned memory makes it possible to use DMA-transfers over PCIe, which is much faster. The idea of pinned memory is that there is an exact copy of an object at both the device and the host. Exclusive access rights is given to or the host or the device. This is the above example with pinned memory:
cl_mem cl_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(float) * count, p_input, NULL); void* p_map_input = clEnqueueMapBuffer(queue, cl_input, CL_TRUE, CL_MAP_READ, 0, sizeof(float) * count, 0, NULL, NULL, &err); // here we can write data to the buffer from the host clEnqueueUnmapMemObject(queue, cl_input, p_map_input, 0, NULL, NULL); // here all changes have been sent back to the device err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_input); clEnqueueNDRangeKernel(queue, dimension, NULL, global_size, local_size, 0, NULL, NULL); err = clReleaseMemObject(input);With clEnqueueMapBuffer the access rights are given to the host. With clEnqueueUnmapMemObject the exclusive access rights to the memory-objects are given to the device. For reading from the device, most noticeable difference is CL_MAP_WRITE. For read&write, use "CL_MAP_READ | CL_MAP_WRITE". This flag triggers memory-transfers:
- <empty>. undefined
- CL_MAP_WRITE. Will not do any transfers during mapping, will copy data back to the host during unmapping.
- CL_MAP_READ. Will copy data to the device during mapping, will not do any transfers during unmapping.
- CL_MAP_READ | CL_MAP_WRITE. Will do transfers during both mapping and unmapping.