OpenCL Optimization

Sometimes the best optimization strategies are the simplest ones. Over-engineering your GPU code can be detrimental to raw performance.

OpenCL is one of the tools that allow software developers to unlock the full computational power of graphics cards.
It is designed from the ground up to exploit the parallel architecture of GPUs, a feature that is paramount to making neural networks converge within reasonable timeframes.
However, extracting the best performance out of OpenCL code can be difficult.
In this article we give some basic tips that are often overlooked when writing parallel code.

Benchmarking a simple "fill" function

One of the very first basic tasks that needs to be performed with OpenCL is loading data into the GPU.
Our task here will be to fill a table of 10,000 rows and 10,000 columns with the exact same value.
In non-parallel, sequential code, this would be realized with a loop such as:

void fill_with_cpu_naive(float* table, const float value) {
    const cl_uint nb_rows = 10000;
    const cl_uint nb_cols = 10000;
    for (cl_uint row = 0; row < nb_rows; ++row) {
        for (cl_uint col = 0; col < nb_cols; ++col) {
            const cl_ulong position = (cl_ulong)row * nb_cols + col;
            table[position] = value;
        }
    }
}

For this benchmark, we call this function 10 times in a loop on a Core i5-4250U CPU running under Windows 10.
In order to keep this test meaningful for the purpose of comparing sequential code versus parallel code, we disable C++ compiler optimizations.
The total execution time is 4.404 seconds.
Now let's look at what the GPU is able to do.

2D kernels vs. 1D kernels

Our table is a 2-dimensional array of 10,000 rows and 10,000 columns, so the natural inclination is to write a 2D kernel that will scan all the table entries across rows and columns.
As we will show in a minute, this is a typical mistake.

__kernel void fill_with_gpu_naive(__global float* table, const float value) {
    const uint row = get_global_id(0);
    const uint col = get_global_id(1);
    const uint nb_cols = get_global_size(1);
    const ulong position = row * (ulong)nb_cols + col;
    table[position] = value;
}

In order to execute this kernel, we need to call the clEnqueueNDRangeKernel() function with a work_dim parameter of 2 (i.e. 2 dimensions in our table), and a global_work_size parameter of {10000, 10000} (i.e. 10,000 rows and 10,000 columns).
We run this code 10 times in a loop on an Intel HD Graphics 5000 GPU, and the execution time goes down to 2.018 seconds. Our GPU is about twice as fast as our CPU.
However, looking closer at this code, the following thoughts arise:

  • Each work-item needs to calculate its absolute position index in the table from the current row and column. While the operation row * nb_cols + col is simple, it would be great if we could avoid it altogether.
  • More importantly, we have no control over the way memory is accessed. If OpenCL reads memory by rows first, we will be fetching data from contiguous memory zones. But if OpenCL reads memory by columns first, we will be making big jumps across memory zones. This can have a substantial impact on performance, because in the former case contiguous memory zones can be served from the cache, which is not possible in the latter case.

It turns out that these two issues can be solved by thinking about our table as one big flat 1D vector. Instead of walking across rows and columns along 2 dimensions, we can simply walk across indexes in 1 single dimension.

__kernel void fill_with_gpu_unrolled(__global float* table, const float value) {
    table[get_global_id(0)] = value;
}

We now need to execute the kernel by calling clEnqueueNDRangeKernel() with a work_dim parameter of 1 (only 1 dimension), and a global_work_size parameter of 10000 * 10000 (100 million entries in total).
Not only is the code much simpler, it is also much faster: 0.353 seconds. Our GPU is now 12.5 times faster than our CPU!
For those interested in the details of memory alignment and data caching, there is an excellent document by the Intel folks which is available here.

Work-group sizes

In the OpenCL model, each parallel task (a work-item) is executed independently.
On top of this simple concept, OpenCL adds a layer of abstraction called work-groups.
The rationale for work-groups is that work-items sometimes need to share data with each other; work-groups are the entities within which that sharing happens.
As far as we are concerned, our code is so simple that there is no such data to share. In that case, OpenCL's recommendation is not to play with work-groups and to let the runtime manage this parameter.

However, we are stubborn, and we want to know whether we can do better.
We now modify the local_work_size parameter of the clEnqueueNDRangeKernel() function: instead of passing nullptr (which lets OpenCL determine the work-group size), we set it manually to various values.
And we manage to extract a bit of additional performance! With work-group sizes of 64, 128, and 256 items, the execution time is further reduced, down to around 0.320-0.330 seconds.
But the interesting bits are elsewhere:

  • With a work-group size higher than 512, the program crashes. Another GPU may have another limit, but this is the maximum our GPU can handle.
  • With a work-group size of exactly 512, the program also crashes: 100,000,000 items is not divisible by 512.
  • With a work-group size of 100, 200, 400, or 500, performance is worse than our base case (0.400-0.500 seconds).
  • With a work-group size of less than 16, the execution time deteriorates significantly (1.5-3.0 seconds).

In fact, the choice of a good work-group size is not only hardware-dependent, but also problem-dependent.
It needs to be high enough (but not too high), a divisor of the total number of items, a power of two, and ideally a multiple of 16.
In other words, the risk of choosing a bad value for this parameter outweighs the performance benefits.
In summary, we should have stuck to the OpenCL recommendation: don't play with work-group sizes.


Vectorization

Now here comes the mother of all optimization techniques, aka vectorization.
What does this mean?
Well, it is a rather simple method: instead of processing the values of our table one by one, we will process them in small blocks of 4 items.
But why do we choose to process by blocks of 4 items rather than, say, 2, or 8, or anything else?
The answer lies in the GPU hardware itself. If you look into the detailed specifications of our GPU, you will find out that the hardware is made of SIMD units that are able to process data in blocks of 4 floating-point numbers.
So the rationale here is simple: if we process the values one by one, we are only using a quarter of the hardware's true capacity.
So let's modify our code again:

__kernel void fill_with_gpu_vectorized(__global float4* table, const float4 vector) {
    table[get_global_id(0)] = vector;
}

Again we modify the call to clEnqueueNDRangeKernel(), this time with a global_work_size divided by 4 (25 million blocks of 4 items cover our 100 million entries). We also need to modify the way we pass the vector parameter, setting the right arg_size and arg_value in the call to clSetKernelArg(), so that our input is now a block of 4 copies of the same value.

After running our test, the execution time is now down to 0.283 seconds, i.e. the GPU is 15.6 times faster than the CPU.
But in all fairness, this is a bit disappointing: we were expecting a massive 4x speedup, and we aren't even remotely getting that kind of improvement.
Why is that?
Well, we've been trying to outsmart OpenCL, haven't we?
In fact, the OpenCL compiler we are using is pretty smart, and it had already vectorized our code.
The Intel engineers who wrote the compiler thought about vectorization before we did. Of course, they know their hardware, so they knew it would be beneficial to use the full capacity of their SIMD units.
Once again, there are fascinating materials available here and here where Intel engineers explain how they vectorize your code before you even ask for it.

Although we managed to obtain very good throughput by manually vectorizing our code, there is one caveat that needs to be mentioned here.
In our case, our table has 100,000,000 items. This number is divisible by 4, so it was easy to rewrite our code.
This spared us from having to treat the edge cases where 1, 2, or 3 numbers are left unattended at the end of the process. In fact, the correct general-purpose code would be something like:

__kernel void fill_vectorized_full(__global float4* table, const float4 value, const uint count) {
    table[get_global_id(0)] = value;
    if (get_global_id(0) == get_global_size(0) - 1) {
        __global float* stub_table = (__global float*)(table + get_global_size(0));
        const float scalar_value = value.x;
        for (uint stub_col = 0; stub_col < count % 4; ++stub_col) {
            stub_table[stub_col] = scalar_value;
        }
    }
}

Arguably, the code is now much more complex. While still manageable for such a simple test case, this approach becomes error-prone when implementing bigger kernels.
Overall, it's up to you to decide. In my opinion, compiler-made vectorization gives you most of the performance without having to think about all the corner cases.

Being fair to the CPU

If you recall, at the start of this article we said that we would disable the C++ compiler optimizations.
We did this because we wanted to measure the improvement that parallelization brings over traditional non-parallel code.
But it turns out that modern CPUs can also benefit from parallelization: first, because they usually have several cores; second, because they also have vectorization capabilities, such as the AVX2 extensions. In fact, the SIMD units of a CPU are often much wider than their GPU counterparts, meaning that they can process blocks of 8 or even 16 numbers in one shot.
So now it looks unfair. If we allow the GPU compiler to vectorize our code, we should allow the CPU compiler to do the same.
We rewrite our CPU code so as to unroll our 2D loop, and we turn on full C++ compiler optimizations:

void fill_with_cpu_vectorized(float* table, const float value) {
    for (cl_ulong position = 0; position < 100000000; ++position) {
        table[position] = value;
    }
}

Now it looks much better. The execution time is down to 0.521 seconds.

Putting everything together

Method                    Execution time   Improvement over sequential code
CPU 2D loop               4.404 s          --
GPU 2D                    2.018 s          2.2x
CPU 1D + vectorize        0.521 s          8.4x
GPU 1D                    0.353 s          12.5x
GPU 1D + hand-tuned WGS   0.318 s          13.8x
GPU 1D + vectorize        0.283 s          15.6x

Don't try to over-engineer your code

The results above boil down to a simple guide for people writing parallel code on GPUs using OpenCL:

  • Avoid 2D kernels if you can. 1D kernels are much simpler to write and much faster to execute. They can exploit the natural alignment of data in the cache, and they can be vectorized automatically by the compiler.
  • Don't touch the work-group size. This is a sensitive parameter, and it is better to let the OpenCL runtime choose the right value for you.
  • Don't expect much from hand-made vectorization. Compilers are written by talented software engineers who have already taken care of this for you. As a bonus, skipping it will spare you some hard-to-debug errors.

In summary: don't over-engineer.