Black-Scholes mixing on SandyBridge, Radeon and GeForce

Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel describes a kernel in their OpenCL optimisation document (page 28 and further) with three random factors as input: S, K and T, and two configuration constants R and V. NVidia's implementation is easy to compare to Intel's, while AMD chose to write down the algorithm quite differently.
So we have three different but comparable kernels in total. What will happen if we run these kernels, each optimised for a specific type of hardware, on the following devices?

  • Intel(R) Core(TM) i7-2600 CPU @3.4GHz, Mem @1333MHz
  • GeForce GTX 560 @810MHz, Mem @1000MHz
  • Radeon HD 6870 @930MHz, Mem @1030MHz

Three different architectures and three different drivers. To complete the comparison, I also check whether there is a difference between using Intel's and AMD's driver on the CPU. The following drivers were used:

  • Intel OpenCL SDK 1.5
  • AMD APP 2.5, driver version 8.902-111012a-127183C-ATI, Catalyst 11.10 (and Catalyst 11.12)
  • NVidia CUDA 4.0 SDK, 280.13 drivers

The times are an average of 5 tests, except where there was too much fluctuation. It is best that you look at the code of the SDKs too, so you understand what is actually run.
These results will be used in future articles, so here you only get the results of this first test, without extensive discussion.
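
The SDKs each have their own timing code; as a rough sketch of how such a measurement can be done around a kernel launch (the function name run_once is my own, not from any of the SDKs):

#include <CL/cl.h>
#include <sys/time.h>

// Wall-clock time of one kernel launch; clFinish() makes the call blocking.
double run_once(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
	struct timeval t0, t1;
	clFinish(queue); // drain any pending work first
	gettimeofday(&t0, NULL);
	clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
	                       &global_size, NULL, 0, NULL, NULL);
	clFinish(queue); // wait until the kernel has actually finished
	gettimeofday(&t1, NULL);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}

// Averaging, as in the text:
// double sum = 0.0;
// for (int i = 0; i < 5; ++i) sum += run_once(queue, kernel, global_size);
// printf("%f s\n", sum / 5.0);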

Intel’s example

Since Intel provides the smallest kernel, it gives you an idea of how Black-Scholes works. As you can see, it is very nicely parallel: each work-item only references its own global id, tid.

__kernel __attribute__((vec_type_hint(float4)))
void BlackScholes(
	__global float4 *callResult,
	__global float4 *putResult,
	const __global float4* S,  // current price of the underlying
	const __global float4* K,  // strike price
	const __global float4* T,  // time to expiration in years
	float r,                   // risk-free interest rate
	float v)                   // volatility
{
	// Each work-item prices four options at once (float4).
	size_t tid = get_global_id(0);
	float4 d1 = (log(S[tid] / K[tid]) + (r + v * v / 2) * T[tid]) / (v * sqrt(T[tid]));
	float4 d2 = d1 - v * sqrt(T[tid]);
	// CND4 is a float4 cumulative normal distribution (see the sketch below).
	callResult[tid] = S[tid] * CND4(d1) - K[tid] * exp(T[tid] * -r) * CND4(d2);
	putResult[tid] = K[tid] * exp(T[tid] * -r) * CND4(-d2) - S[tid] * CND4(-d1);
}
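
The kernel calls CND4, a float4 version of the cumulative normal distribution, which Intel's listing leaves out. As a rough idea, here is a minimal sketch based on the Abramowitz and Stegun polynomial approximation that NVidia's scalar CND uses; the name CND4 comes from the kernel above, but the select()-based removal of the branch is my own choice.

// Sketch of a float4 cumulative normal distribution. Assumption: the same
// polynomial approximation as NVidia's scalar CND, vectorised with select().
float4 CND4(float4 d)
{
	const float A1 = 0.31938153f;
	const float A2 = -0.356563782f;
	const float A3 = 1.781477937f;
	const float A4 = -1.821255978f;
	const float A5 = 1.330274429f;
	const float RSQRT2PI = 0.39894228f; // 1 / sqrt(2 * pi)

	float4 k = 1.0f / (1.0f + 0.2316419f * fabs(d));
	float4 cnd = RSQRT2PI * exp(-0.5f * d * d) *
	             (k * (A1 + k * (A2 + k * (A3 + k * (A4 + k * A5)))));

	// For d > 0 the approximation gives the upper tail, so mirror it;
	// select() does this per component without branching.
	return select(cnd, 1.0f - cnd, isgreater(d, (float4)0.0f));
}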

It is used for approximating the best price of an option in the financial world. Read this post on the matter, which explains in easy language what it does. It is also interesting to know that the model was developed for computers slower than a Casio calculator.
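
For reference, written out as formulas, the kernel computes the closed-form Black-Scholes prices of a European call C and put P, with N the cumulative normal distribution (CND4 in the code):

d_1 = \frac{\ln(S/K) + (r + v^2/2)\,T}{v\sqrt{T}}, \qquad d_2 = d_1 - v\sqrt{T}

C = S\,N(d_1) - K\,e^{-rT}\,N(d_2), \qquad P = K\,e^{-rT}\,N(-d_2) - S\,N(-d_1)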

Benchmarks

I used the original code of the SDKs and edited the selection methods for platform and device type.
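
The exact selection code differs per SDK; a minimal sketch of what such a change boils down to could look like this (the helper name pick_device is my own):

#include <CL/cl.h>   // OpenCL host API
#include <string.h>  // strstr

// Pick a device from the first platform whose name contains platform_name.
cl_device_id pick_device(const char *platform_name, cl_device_type type)
{
	cl_platform_id platforms[8];
	cl_uint num_platforms = 0;
	clGetPlatformIDs(8, platforms, &num_platforms);

	for (cl_uint i = 0; i < num_platforms; ++i) {
		char name[256];
		clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
		                  sizeof(name), name, NULL);
		cl_device_id device;
		if (strstr(name, platform_name) &&
		    clGetDeviceIDs(platforms[i], type, 1, &device, NULL) == CL_SUCCESS)
			return device;
	}
	return NULL; // requested platform/device combination not found
}

// e.g. pick_device("Intel", CL_DEVICE_TYPE_CPU)
//      pick_device("NVIDIA", CL_DEVICE_TYPE_GPU)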

NVidia’s code

NVidia GPU: 0.00082 s
AMD GPU: 0.00159 s (0.00120 to 0.00202 on Catalyst 11.12)
AMD CPU: 0.07400 s (0.07452 s on Catalyst 11.12)
Intel CPU: 0.01525 s

Very clear results: this code is pretty well optimised for NVIDIA GPUs. Note that on the CPU, Intel's driver gives a much faster result than AMD's.

Intel’s code

As Intel only published the kernel and not a complete program, I hacked it into NVidia's version. I made a float4 version of NVidia's CND (along the lines of the CND4 sketch shown earlier) and altered Intel's code a little so it gives back both calls and puts. Unfortunately, according to the program it did not give correct results on AMD and Intel, so that is more work for later. For the code I used, see the listing above; add "unsigned int optN" to the argument list to be complete, and the callMask trick from NVidia's original code can be reused in the float4 version of CND.

"__attribute__((vec_type_hint(float4)))" did not make the Intel CPU any faster, even though their article claims it does. This compiler hint also gave an error on NVIDIA and AMD.

NVidia GPU: 0.00019 s
AMD GPU: 0.00035 s (0.00025 to 0.00090 on Catalyst 11.12)
AMD CPU: 0.00529 s (0.00469 on Catalyst 11.12)
Intel CPU: 0.00159 s

Knowing that NVidia did give correct results with this version, we can assume that float4 vectorisation is a good thing.

AMD’s code

As AMD’s code is totally different (both kernel and host), you cannot compare these results with the other two.

NVidia GPU: 0.514 s
AMD GPU: 0.154 s (0.126 s on Catalyst 11.12, not much fluctuation)
AMD CPU: 0.105 s (0.102 s on Catalyst 11.12)
Intel CPU: 0.161 s

AMD had the strangest results. The CPU via AMD's driver is about five times as fast as NVidia's GPU, and even faster than AMD's own GPU. They successfully made Radeons beat GeForces on this one.

A last word

Next time, let's try an algorithm that needs more memory optimisations. As stated above, I am going to use all of these results in later articles.
