Blog

OpenCL SPIR by example

  |   General   |   1 Comment
SPIR2OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL-code, comparable to LLVM IL and HSAIL. It is a search for what would be a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IL is too general, but SPIR is a subset of it. I'll discuss HSAIL, on where it differs from SPIR - I thought SPIR was a better way to start introducing these. In my next article I'd like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding what it all means and where we're going to, and why.
 
Contributors for the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications and to no surprise Intel has had the first SPIR-compiler. I am happy to see Nvidia is in the committee too, and hope they don't just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect stuff from them.
 
For now, just see what SPIR is - as it can help us understand how the compiler work and write better OpenCL code. I used Intel's offline OpenCL compiler for compiling the below kernel to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).
__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}
There are two variations for generating SPIR-code: binary SPIR, LLVM-SPIR (both in 32 and 64 bit versions). As you might understand, the binary form is not really readable, but SPIR described in the LLVM IL language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).

The LLVM-SPIR output + comments

The LLVM SPIR-version is below with explanations added, using references to SPIR 1.2 specifications [pdf].
To get a good overview of the full output, you can download 'sum.ll' here. Now let's get to it!
; ModuleID = '/tmp/vincent/spirBinaryTempFile.tmp'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"
The variable "target datalayout" can define SPIR or SPIR64. In this case SPIR is defined, as you can see in chapter 3.2 of the specs. Defining "#pragma OPENCL EXTENSION cl_khr_fp64 : enable" and *using* 64 bits long or double, does not put this to SPIR64, but removes some trunc-operations (see below). Compiling with "-llvm-spir64=sum.ll" gets it into SPIR64-mode. Then "e-p:32:32:32" will be replaced with "e-p:64:64:64".
 
The variable "target triple" defines SPIR or SPIR64. The two "unknown"s are there  to be compatible with CLANG - as SPIR is an intermediate language, you cannot define architectures.
define spir_kernel void @sum(i32 %size, float addrspace(1)* %vec1, float addrspace(1)* %vec2) nounwind {
The definition of 'sum'. 'int' translates to 'i32', 'const' is ignored (I'll get back to this later), __global translates to addrspace(1). There are 4 address spaces defined:
  • 0 – private
  • 1 – global
  • 2 – constant
  • 3 – local
As you see, addrspace is can be undefined (for size) - see chapter 2.2 "Address space qualifiers" for more information. If there would be GPUs with extended memory (like we have on a CPU motherboard) and we would add an OpenCL extension for "extern memory", then we'd add a addrspace(4) here - just as an example.
 
The "nounwind" means that the function never returns with an unwind or exceptional control flow. As you probably might have noticed several times with driver crashes: there is no support for exception handling at this level.
  %1 = alloca i32, align 4
  %2 = alloca float addrspace(1)*, align 4
  %3 = alloca float addrspace(1)*, align 4
Allocating memory for 'size' and for pointers to 'vec1' and 'vec2' (see below for actual value setting). The "align 4" means the variable should be align per 4 bytes - try vectors and arrays and see what happens here, for example "float4" gives an "align 16".
  %ii = alloca i32, align 4
Line 1, first step: allocation of the pointer.
  store i32 %size, i32* %1, align 4
Storing value 'size' to %1. You can learn from this that if you set a const variable as a parameter, this implies a load from global memory to a local variable.
  store float addrspace(1)* %vec1, float addrspace(1)** %2, align 4
  store float addrspace(1)* %vec2, float addrspace(1)** %3, align 4
Setting the pointers %2 and %3 to point to the position 0 of vec1 and vec2.
  %4 = call spir_func i64 @_Z13get_global_idj(i32 0) nounwind readnone
  %5 = trunc i64 %4 to i32
  store i32 %5, i32* %ii, align 4
Line 1, execution. See below for the declaration of "_Z13get_global_idj". The function returns a 64 bit variable, and is truncated to 32 bit. I suspect SPIR must truncate all 64-bit variables to 32 bit to be valid, but could not find
  %6 = load i32* %ii, align 4
  %7 = load i32* %1, align 4
  %8 = icmp slt i32 %6, %7
Stack-operations: pushing 'ii' (%ii) and 'size' (%1), executing 'icmp'. According to the LLVM-manual, the 'icmp' instruction returns a boolean value or a vector of boolean values based on comparison of its two integer, integer vector, pointer, or pointer vector operands. The operator 'slt' stands for "(signed) less than".
  br i1 %8, label %9, label %19
The ‘br il‘ instruction is used to cause control flow to transfer to a different basic block in the current function, based on a condition.
 
So this lines reads: "if (ii < size) then goto label %9 else goto label %19". Yeah, I admit I learned peek, poke, push and pop on an Apple II using AppleSoft Basic.
; < label>:9                                       ; preds = %0
  %10 = load i32* %ii, align 4
  %11 = load float addrspace(1)** %2, align 4
  %12 = getelementptr inbounds float addrspace(1)* %11, i32 %10
  %13 = load float addrspace(1)* %12, align 4
(The space added to the label-tag is because WordPress tries to help out by adding a closing tag). These four lines load the value of vec1[ii] into %13. %10 and %11 are to push the values %ii and %2 (pointer to the start of vec1) onto the stack. The function 'getelementptr inbounds' performs an address calculation (no memory access), and can return a 'poison value' (for simplicity: undefined) when out of range. So if all went ok, then %12 is a pointer to vec1[ii], and the memory access is done with the load-function for %13.
  %14 = load i32* %ii, align 4
  %15 = load float addrspace(1)** %3, align 4
  %16 = getelementptr inbounds float addrspace(1)* %15, i32 %14
  %17 = load float addrspace(1)* %16, align 4
As above, to load vec2[ii] into %17.
  %18 = fadd float %17, %13
  store float %18, float addrspace(1)* %16, align 4
Here the actual computation at last: 'fadd'. When the '+' is replaced by a '*', then this will be 'fmul'. See page 20 of the SPIR-specs for more functions. Finally the result is stored into %16, which is the pointer to vec2[ii].
  br label %19
Goto label %19. Looks a bit redundant in this case. I tried: if there was an "else" used in the kernel, an extra label was created for the second branch - this line would then step over the second branch, which is now omitted.
; <label>:19                                      ; preds = %9, %0
  ret void
</label>
End the function, return nothing.
}

declare spir_func i64 @_Z13get_global_idj(i32) nounwind readnone
The kernel was "defined" - this is declared. This is a promise that the function will be implemented. I could not find much information how functions like get_global_id are implemented in SPIR, strangely enough. If you know more information, please tell me via the comments.

!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
These metadata variables are defined in chapters 2.8, 2.9, 2.11, 2.12 and 2.13. For example "!opencl.kernels" points to the metadata entries (defined below) which are kernels - in this case we only have kernel !0. The variables !6 to !8 are defined further below.

!0 = metadata !{void (i32, float addrspace(1)*, float addrspace(1)*)* @sum, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 0, i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int", metadata !"float*", metadata !"float*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"const", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"size", metadata !"vec1", metadata !"vec2"}
This is described in chapter 2.4 of the specs "Kernel Arg Info" (page 13), and gives extra information about the parameters of the kernel. This is exactly as is I defined the kernel - no questions asked or optimisations done.
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}
This is needed for !opencl.spir.version, !opencl.ocl.version and !opencl.used.extensions. So we have SPIR version 1.0, OpenCL version 0.0 (?) and no extensions.

Final words

For developers it is important to see what actually happens with their code, for confrontational reasons to learn about assumptions. Ofcourse the first step would be reading OpenCL-kernels generated by pragmas, if you did not already and use pragma'ed code.
 
For code-distribution it is also very important: it lays the foundations for signed kernels and secure kernel-loading, so software holding IP can more safely be distributed.
 
You can  read more at Codedivine's blog. Rahul Garg explains pretty well what SPIR is, and is not. If you found this post interesting, also check this blog-post at Stackdenis on how to compile OpenCL with LLVM Clang, and compare the result with what iocl64 generated. You find there are quite some differences between Clang-llvm, iocl64-llvm and llvm-spir32, but that is out of scope for this blog-post.

Liked this article? We are the best known experts in OpenCL, CUDA and performance optimization. We can help you to maximize the performance of your software.

  • Jan Brothánek

    Hi, the declaration you are thinking about

    declare spir_func i64 @_Z13get_global_idj(i32) nounwind readnone
    means that the spir compiler declares that there is such a function which will be defined later. In this case the get_global_id() is defined by each opencl vendor (manufacturer), not by the spir compiler. The function call and it’s actual implementation is linked when compiled to the final device code.