Mathieu GAILLARD

Optimizing matrix multiplication: cache + OpenMP

When writing a matrix multiplication function, which is a memory-bound algorithm, taking care of cache behavior can provide massive speedups. I will show that by just adding a line and swapping two loops, we can achieve a 300x acceleration.

Matrix multiplication

Matrix multiplication is a basic tool of linear algebra. If A and B are matrices, then each coefficient of the matrix C=AB is the dot product of a row of A with a column of B. The naive matrix multiplication algorithm has a computational complexity of O(n^3). More details on Wikipedia: Matrix multiplication.
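
In symbols, if A is an m×n matrix and B is an n×p matrix, then for all valid i and j:

$$c_{ij} = \sum_{k=1}^{n} a_{ik} \, b_{kj}$$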

Computer

I ran this benchmark on my laptop, which is equipped with:

  • CPU: Intel i7-10875H (8 cores, HT, 16MB cache)
  • RAM: 2*16 GB DDR4-2933

Data structure

The matrix is stored as a flat 1D dynamically allocated array. Access is row-major.

template <typename T>
struct AbstractMatrix
{
    typedef T Type;

    std::size_t rows;
    std::size_t cols;
    T* elements;

    T& at(std::size_t i, std::size_t j)
    {
        // Row major
        return elements[j + i * cols];
    }

    const T& at(std::size_t i, std::size_t j) const
    {
        // Row major
        return elements[j + i * cols];
    }

    void allocate(std::size_t n_rows, std::size_t n_cols)
    {
        rows = n_rows;
        cols = n_cols;
        elements = new T[rows*cols];
    }

    void free()
    {
        rows = 0;
        cols = 0;
        delete[] elements;
    }
};

using Matrix = AbstractMatrix<float>;
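
As a quick illustration, here is how the structure is used (a minimal sketch; note that allocate and free must be called manually, because the struct does not manage its own memory):

Matrix m;
m.allocate(2, 3);  // 2 rows, 3 columns

m.at(0, 0) = 1.f;  // stored at elements[0]
m.at(1, 2) = 6.f;  // stored at elements[2 + 1 * 3] = elements[5]

m.free();          // the caller is responsible for releasing the memory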

Four different implementations

I show four different implementations of matrix multiplication. All of them are very similar, but we will see that small changes in the code can have major consequences for performance.

1) Naive algorithm

If we were to naively implement matrix multiplication, this would be the result:

void MatMul(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);

    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t j = 0; j < c.cols; j++)
        {
            for (std::size_t k = 0; k < a.cols; k++)
            {
                c.at(i, j) += a.at(i, k) * b.at(k, j);
            }
        }
    }
}

2) Naive algorithm + OpenMP

By adding some OpenMP directives to the code in 1) and accumulating the dot product in a local variable, we get this. The collapse(2) clause distributes the iterations of the i and j loops over all threads, and the simd reduction lets the compiler vectorize the inner loop:

void MatMulOpenMP(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);

    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t j = 0; j < c.cols; j++)
        {
            float result = 0.f;

            #pragma omp simd reduction(+:result)
            for (std::size_t k = 0; k < a.cols; k++)
            {
                result += a.at(i, k) * b.at(k, j);
            }
            
            c.at(i, j) = result;
        }
    }
}
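
Since the measured speedup depends on how many threads OpenMP actually uses, it can be worth checking this at runtime. Here is a minimal sketch using the standard OpenMP runtime API (this snippet is illustrative and not part of the benchmark code):

#include <omp.h>
#include <iostream>

int main()
{
    // Maximum number of threads a parallel region may use
    std::cout << "Max threads: " << omp_get_max_threads() << std::endl;

    #pragma omp parallel
    {
        // Report the actual team size once, from a single thread
        #pragma omp single
        std::cout << "Team size: " << omp_get_num_threads() << std::endl;
    }

    return 0;
}

On the CPU above (8 cores with hyper-threading), this should report 16 threads by default.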

3) Cache optimized algorithm

This code is the same as in 1) except that the last two for loops are swapped. This matters because of the row-major memory layout: with j as the innermost loop, the accesses to b.at(k, j) and c.at(i, j) walk through consecutive elements of a row, so every cache line fetched from memory is fully used. In the naive order, the innermost k loop jumps a full row of b at every iteration, which for large matrices causes a cache miss on almost every access.

void MatMulCacheOptimized(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);
    
    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t k = 0; k < a.cols; k++)
        {
            for (std::size_t j = 0; j < c.cols; j++)
            {
                c.at(i, j) += a.at(i, k) * b.at(k, j);
            }
        }
    }
}

4) Cache optimized algorithm + OpenMP

By adding just one OpenMP directive without changing the code in 3), we get this:

void MatMulCacheOptimizedOpenMP(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);
    
    #pragma omp parallel for
    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t k = 0; k < a.cols; k++)
        {
            for (std::size_t j = 0; j < c.cols; j++)
            {
                // No need for atomic add, because 
                // each row is processed by a single thread
                c.at(i, j) += a.at(i, k) * b.at(k, j);
            }
        }
    }
}

Results

For a matrix smaller than the cache (N=32)

With a 32*32 matrix of floats, each matrix takes 4 KB in RAM. With 3 matrices (A, B, C) of that size, we are sure that they can all fit in the cache at the same time. We run each version 100,000 times and measure the average.
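
For reference, the timing methodology looks roughly like this (a minimal sketch using std::chrono; the full code at the end wraps the same idea in a template helper, with the repetition count adjusted per benchmark):

#include <chrono>

// Hypothetical helper: times `runs` calls of `function`
// and returns the average duration in milliseconds.
template <typename F>
double averageTimeMs(F&& function, int runs)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; i++)
    {
        function();
    }
    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count() / runs;
}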

Here is the result of the benchmark:

MatMul time: 0.0190369 ms
MatMulOpenMP time: 0.00466552 ms (4.1x speedup)
MatMulCacheOptimized time: 0.00584262 ms (3.3x speedup)
MatMulCacheOptimizedOpenMP time: 0.00303766 ms (6.3x speedup)

With small matrices everything already fits in the cache, so the cache-optimized version still gives a nice speedup, but it is not as impressive as with bigger matrices.

For a matrix bigger than the cache (N=2048)

With a 2048*2048 matrix of floats, each matrix takes 16.8 MB in RAM. With 3 matrices (A, B, C) of that size, we are sure that they cannot all fit in the cache at the same time. We run each version 3 times and measure the average.

Here is the result of the benchmark:

MatMul time: 57445.2 ms
MatMulOpenMP time: 6889.26 ms (8.33x speedup)
MatMulCacheOptimized time: 1720.7 ms (33.4x speedup)
MatMulCacheOptimizedOpenMP time: 191.71 ms (300x speedup)
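
To put the fastest time in perspective, the classic algorithm performs about $2N^3$ floating-point operations, so:

$$\frac{2 \times 2048^3~\text{FLOP}}{0.19171~\text{s}} \approx 9 \times 10^{10}~\text{FLOP/s} \approx 90~\text{GFLOP/s}$$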

I think the results speak for themselves. Simply making sure we use a contiguous, cache-friendly memory access pattern already brings a 33.4x speedup on a single core! Together with OpenMP, we can achieve an impressive 300x speedup!

Conclusion

Of course, it would be possible to further improve the performance, but that is not the point of this article. I simply wanted to show how much of a contrast there is between the naive and the cache optimized implementation, which are essentially the same except a tiny yet critical difference.

As implemented here, matrix multiplication is memory bound, so throwing more cores at the problem does not always help. We saw that in this particular case, cache optimization brings a lot more performance than OpenMP.

That is usually not the case, though: most of the time, OpenMP is king!

The entire code

main.cpp

#include <iostream>
#include <cstddef>
#include <cmath>
#include <algorithm>
#include <chrono>

template <typename T>
struct AbstractMatrix
{
    typedef T Type;

    std::size_t rows;
    std::size_t cols;
    T* elements;

    T& at(std::size_t i, std::size_t j)
    {
        // Row major
        return elements[j + i * cols];
    }

    const T& at(std::size_t i, std::size_t j) const
    {
        // Row major
        return elements[j + i * cols];
    }

    void allocate(std::size_t n_rows, std::size_t n_cols)
    {
        rows = n_rows;
        cols = n_cols;
        elements = new T[rows*cols];
    }

    void free()
    {
        rows = 0;
        cols = 0;
        delete[] elements;
    }
};

using Matrix = AbstractMatrix<float>;

bool areMatricesEqual(const Matrix& a, const Matrix& b)
{
    for (std::size_t i = 0; i < a.rows; i++)
    {
        for (std::size_t j = 0; j < a.cols; j++)
        {
            if (a.at(i, j) != b.at(i, j))
            {
                return false;
            }
        }
    }

    return true;
}

void MatMul(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);

    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t j = 0; j < c.cols; j++)
        {
            for (std::size_t k = 0; k < a.cols; k++)
            {
                c.at(i, j) += a.at(i, k) * b.at(k, j);
            }
        }
    }
}

void MatMulOpenMP(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);

    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t j = 0; j < c.cols; j++)
        {
            float result = 0.f;

            #pragma omp simd reduction(+:result)
            for (std::size_t k = 0; k < a.cols; k++)
            {
                result += a.at(i, k) * b.at(k, j);
            }
            
            c.at(i, j) = result;
        }
    }
}

void MatMulCacheOptimized(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);
    
    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t k = 0; k < a.cols; k++)
        {
            for (std::size_t j = 0; j < c.cols; j++)
            {
                c.at(i, j) += a.at(i, k) * b.at(k, j);
            }
        }
    }
}

void MatMulCacheOptimizedOpenMP(Matrix& c, const Matrix& a, const Matrix& b)
{
    std::fill_n(c.elements, c.rows * c.cols, 0.f);
    
    #pragma omp parallel for
    for (std::size_t i = 0; i < c.rows; i++)
    {
        for (std::size_t k = 0; k < a.cols; k++)
        {
            for (std::size_t j = 0; j < c.cols; j++)
            {
                // No need for atomic add, because each row is processed by a single thread
                c.at(i, j) += a.at(i, k) * b.at(k, j);
            }
        }
    }
}

// Times N calls of Function and returns the average duration in milliseconds.
// N = 3 matches the large-matrix benchmark; increase it for small matrices.
template <void Function(Matrix&, const Matrix&, const Matrix&)>
double measureTime(Matrix& c, const Matrix& a, const Matrix& b)
{
    const int N = 3;

    auto start_time = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++) {
        Function(c, a, b);
    }
    auto end_time = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli> (end_time - start_time).count() / N;
}

int main()
{
    const std::size_t N = 2048;

    std::cout << "Matrix size: " << float(N * N * sizeof(Matrix::Type)) / 1e6 << " MB" << std::endl;

    Matrix a, b, c, r;

    a.allocate(N, N);
    b.allocate(N, N);
    c.allocate(N, N);
    r.allocate(N, N);

    std::cout << "Init matrices" << std::endl;
    for (std::size_t i = 0; i < N*N; i++)
    {
        a.elements[i] = i + 1;
        b.elements[i] = 2*i + 1;
    }

    std::cout << "MatMul time: " << std::flush;
    const auto matMulTime = measureTime<MatMul>(r, a, b);
    std::cout << matMulTime << " ms" << std::endl;

    std::cout << "MatMulOpenMP time: " << std::flush;
    const auto matMulOpenMPTime = measureTime<MatMulOpenMP>(c, a, b);
    if (areMatricesEqual(c, r))
    {
        std::cout << matMulOpenMPTime << " ms" << std::endl;
    }

    std::cout << "MatMulCacheOptimized time: " << std::flush;
    const auto matMulCacheOptimizedTime = measureTime<MatMulCacheOptimized>(c, a, b);
    if (areMatricesEqual(c, r))
    {
        std::cout << matMulCacheOptimizedTime << " ms" << std::endl;
    }

    std::cout << "MatMulCacheOptimizedOpenMP time: " << std::flush;
    const auto matMulCacheOptimizedOpenMPTime = measureTime<MatMulCacheOptimizedOpenMP>(c, a, b);
    if (areMatricesEqual(c, r))
    {
        std::cout << matMulCacheOptimizedOpenMPTime << " ms" << std::endl;
    }
    
    a.free();
    b.free();
    c.free();
    r.free();

    return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.5)
project(Matrix)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra")

if(WITH_NATIVE_ARCH)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
endif()

find_package(OpenMP)
if (OPENMP_FOUND)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS} -ftree-vectorize -fopt-info-vec-optimized")
endif()

set(SOURCE_FILES main.cpp)
add_executable(Matrix ${SOURCE_FILES})
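
To build and run (assuming a recent CMake and a GCC-compatible compiler; building in Release mode matters, as an unoptimized build would skew the benchmark):

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DWITH_NATIVE_ARCH=ON
cmake --build build
./build/Matrix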