Mathieu GAILLARD

Matrix-vector product: OpenCV vs GLM vs Eigen

For a project involving 3D triangulation of points and optimization, I needed to find the fastest way to compute the projection of a 3D point onto a 2D view. This operation boils down to a 4x4 matrix-vector multiplication, which can be accelerated with SIMD instructions. Initially, to save time, I reused an implementation I had written for a university class. However, it turned out to be far from optimal in terms of performance. Here I present the benchmark I wrote in C++ to compare OpenCV, GLM, and Eigen.

Method

I used a C++ micro-benchmarking library named nanobench and implemented a 4-by-4 matrix-vector multiplication using either OpenCV, GLM, or Eigen. I first implemented a very simple sum of 16 numbers to warm up the benchmark and to get a reference time for the matrix-vector product. To make the comparison more informative, I tested both float and double precision. I also tried a 3-by-4 matrix-vector multiplication to check whether it is faster than the 4-by-4 version.

Results

I ran the benchmark on two computers: my Windows laptop and a Linux server from the university. Both are equipped with Intel CPUs that support the AVX2 instruction set.

Dell Precision T7910 with two Intel Xeon E5-2680v3
Ubuntu 20.04 LTS with g++ version 9.3.0

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                6.44 |      155,197,812.41 |    0.4% |      0.03 | `Sum of 16 values (float)`
|                4.02 |      248,858,761.35 |    0.1% |      0.02 | `OpenMP: Sum of 16 values (float)`
|                6.44 |      155,161,499.82 |    0.4% |      0.02 | `Sum of 16 values (double)`
|               18.90 |       52,903,340.71 |    0.1% |      0.07 | `OpenMP: Sum of 16 values (double)`
|              796.67 |        1,255,220.56 |    1.2% |      2.96 | `OpenCV: Matrix4x4 times Vector4 (float)`
|              798.32 |        1,252,634.11 |    1.1% |      2.89 | `OpenCV: Matrix4x4 times Vector4 (double)`
|                1.75 |      571,241,065.35 |    1.7% |      0.02 | `GLM: Matrix4x4 times Vector4 (float)`
|                1.80 |      554,194,843.32 |    0.4% |      0.02 | `GLM: Matrix4x4 times Vector4 (double)`
|                3.97 |      252,131,302.91 |    0.3% |      0.02 | `GLM: Matrix3x4 times Vector4 (float)`
|                3.57 |      280,377,487.72 |    0.2% |      0.02 | `GLM: Matrix3x4 times Vector4 (double)`
|                2.11 |      473,589,916.42 |    0.4% |      0.02 | `Eigen: Matrix4x4 times Vector4 (float)`
|                2.11 |      473,834,046.86 |    0.3% |      0.02 | `Eigen: Matrix4x4 times Vector4 (double)`
|                3.96 |      252,306,466.48 |    0.3% |      0.02 | `Eigen: Matrix3x4 times Vector4 (float)`
|                3.17 |      315,584,466.29 |    0.0% |      0.02 | `Eigen: Matrix3x4 times Vector4 (double)`

Dell XPS 17 9700 with an Intel Core i7-10875H
Windows 10 with Visual Studio 2019 version 16.11.8

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                4.37 |      228,687,312.69 |    0.4% |      0.02 | `Sum of 16 values (float)`
|                1.68 |      595,975,422.43 |    0.2% |      0.01 | `OpenMP: Sum of 16 values (float)`
|                4.45 |      224,631,924.20 |    3.3% |      0.02 | `Sum of 16 values (double)`
|                2.51 |      398,327,569.17 |    0.2% |      0.01 | `OpenMP: Sum of 16 values (double)`
|              839.09 |        1,191,761.48 |    0.7% |      3.10 | `OpenCV: Matrix4x4 times Vector4 (float)`
|              827.68 |        1,208,194.69 |    0.6% |      3.01 | `OpenCV: Matrix4x4 times Vector4 (double)`
|                3.89 |      256,997,129.61 |    4.1% |      0.01 | `GLM: Matrix4x4 times Vector4 (float)`
|                4.66 |      214,595,266.20 |    3.1% |      0.02 | `GLM: Matrix4x4 times Vector4 (double)`
|                2.99 |      334,772,137.44 |    0.0% |      0.01 | `GLM: Matrix3x4 times Vector4 (float)`
|                2.99 |      334,801,605.83 |    0.3% |      0.01 | `GLM: Matrix3x4 times Vector4 (double)`
|                1.46 |      684,929,393.87 |    0.1% |      0.01 | `Eigen: Matrix4x4 times Vector4 (float)`
|                1.42 |      702,323,457.34 |    0.5% |      0.01 | `Eigen: Matrix4x4 times Vector4 (double)`
|                3.71 |      269,729,973.67 |    0.0% |      0.01 | `Eigen: Matrix3x4 times Vector4 (float)`
|                2.84 |      352,666,208.95 |    0.2% |      0.01 | `Eigen: Matrix3x4 times Vector4 (double)`

Conclusion

As the results show, OpenCV is terribly slow and should not be used for intensive 3D mathematics. This is likely because the cv::Mat matrix class has a dynamic size determined at run time, rather than a fixed size defined at compile time as in GLM and Eigen. It is understandable, however, as OpenCV is designed to process high-resolution images.

Depending on the compiler and the operating system, either GLM or Eigen performs best. Interestingly, double precision is not necessarily slower than float in this benchmark. Another interesting finding is that the 4-by-4 product can actually be faster than the 3-by-4 one, even though it performs more arithmetic. This is probably because the compiler cannot make the best use of SIMD instructions in the 3-by-4 case.

As future work, it would be interesting to investigate whether the use of SIMD instructions makes the computation memory bound: better throughput in the CPU's execution units could move the bottleneck to the memory controller. It would also be nice to run the benchmark on a processor with the AVX-512 instruction set to see whether there is a significant gain over AVX2.

Code

Here is the CMake file that automatically downloads the dependencies and compiles them together with the benchmark.

cmake_minimum_required(VERSION 3.16)

project(BenchmarkProject VERSION 1.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include(FetchContent)

# Activate OpenMP
find_package(OpenMP REQUIRED)

# Retrieve and activate the nanobench library
FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.3.6
    GIT_SHALLOW TRUE)
FetchContent_MakeAvailable(nanobench)

# Include OpenCV 4.5.5
option(WITH_OPENMP "Build OpenCV with OpenMP" ON)
FetchContent_Declare(opencv
    GIT_REPOSITORY https://github.com/opencv/opencv.git
    GIT_TAG        4.5.5
    GIT_SHALLOW    TRUE
)
FetchContent_MakeAvailable(opencv)
# Manual include dirs for OpenCV (could not find better than this hack)
set(OpenCV_INCLUDE_DIRS "")
list(APPEND OpenCV_INCLUDE_DIRS ${OPENCV_CONFIG_FILE_INCLUDE_DIR})
list(APPEND OpenCV_INCLUDE_DIRS ${OPENCV_MODULE_opencv_core_LOCATION}/include)
set(OpenCV_LIBS "")
list(APPEND OpenCV_LIBS opencv_core)

# Retrieve and activate glm
option(BUILD_SHARED_LIBS "Build shared library" OFF)
option(BUILD_STATIC_LIBS "Build static library" OFF)
option(GLM_TEST_ENABLE "Build unit tests" OFF)
FetchContent_Declare(
    glm
    GIT_REPOSITORY https://github.com/g-truc/glm.git
    GIT_TAG 0.9.9.8
    GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(glm)

# Retrieve and activate Eigen3
FetchContent_Declare(
    eigen
    GIT_REPOSITORY https://gitlab.com/libeigen/eigen.git
    GIT_TAG 3.4.0
    GIT_SHALLOW TRUE
)
FetchContent_GetProperties(eigen)
if(NOT eigen_POPULATED)
    FetchContent_Populate(eigen)
endif()
set(EIGEN_INCLUDE_DIRS ${eigen_SOURCE_DIR})


add_executable(Benchmark main.cpp)

if (MSVC)
    # warning level 4 and all warnings as errors
    target_compile_options(Benchmark PRIVATE /W4)
    target_compile_options(Benchmark PRIVATE -openmp:experimental)
    target_compile_options(Benchmark PRIVATE /arch:AVX2)
else()
    # lots of warnings and all warnings as errors
    target_compile_options(Benchmark PRIVATE -Wall -Wextra -pedantic)
    target_compile_options(Benchmark PRIVATE -march=native)
endif()

target_include_directories(Benchmark
    PRIVATE
    ${OpenCV_INCLUDE_DIRS}
    $<BUILD_INTERFACE:${EIGEN_INCLUDE_DIRS}>
    $<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>
)

target_link_libraries(Benchmark
    PRIVATE
    nanobench
    ${OpenCV_LIBS}
    glm::glm
    OpenMP::OpenMP_CXX
)

The source code of the benchmark:

#include <array>
#include <atomic>
#include <iostream>

#include <omp.h>

#include <opencv2/core.hpp>

#include <glm/glm.hpp>

#include <Eigen/Dense>

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

constexpr int WARMUP_ITERATIONS = 1000;
constexpr int MIN_EPOCHS = 300000;

template <typename T, uint64_t N>
T array_sum(const std::array<T, N>& arr)
{
    T sum = 0;

    for (unsigned int i = 0; i < arr.size(); i++)
    {
        sum += arr[i];
    }

    return sum;
}

template <typename T, uint64_t N>
T array_sum_simd(const std::array<T, N>& arr)
{
    T sum = 0;

    #pragma omp simd reduction(+:sum)
    for (unsigned int i = 0; i < arr.size(); i++)
    {
        sum += arr[i];
    }

    return sum;
}

void benchmark_array_sum()
{
    constexpr unsigned int N = 16;

    std::array<float, N> arr_float{};
    std::array<double, N> arr_double{};

    for (unsigned int i = 0; i < N; i++)
    {
        arr_float[i] = 1.f;
        arr_double[i] = 1.0;
    }

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Sum of 16 values (float)", [&] {
        auto sum = array_sum(arr_float);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("OpenMP: Sum of 16 values (float)", [&] {
        auto sum = array_sum_simd(arr_float);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Sum of 16 values (double)", [&] {
        auto sum = array_sum(arr_double);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("OpenMP: Sum of 16 values (double)", [&] {
        auto sum = array_sum_simd(arr_double);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });
}

void benchmark_opencv_float()
{
    const cv::Mat1f mat4 = (cv::Mat1f(4, 4) <<
        1.f, 0.f, 0.f, 0.f,
        0.f, 1.f, 0.f, 0.f,
        0.f, 0.f, 1.f, 0.f,
        0.f, 0.f, 0.f, 1.f
        );
    const cv::Vec4f vec4(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("OpenCV: Matrix4x4 times Vector4 (float)", [&] {
        cv::Mat1f result = mat4 * vec4;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_opencv_double()
{
    const cv::Mat1d mat4 = (cv::Mat1d(4, 4) << 
        1.0, 0.0, 0.0, 0.0,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 1.0
        );
    const cv::Vec4d vec4(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("OpenCV: Matrix4x4 times Vector4 (double)", [&] {
        cv::Mat1d result = mat4 * vec4;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_glm()
{
    const glm::mat4 mat4f(1.f);
    const glm::vec4 vec4f(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix4x4 times Vector4 (float)", [&] {
        glm::vec4 result = mat4f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const glm::dmat4 mat4d(1.0);
    const glm::dvec4 vec4d(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix4x4 times Vector4 (double)", [&] {
        glm::dvec4 result = mat4d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const glm::mat4x3 mat34f(1.f);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix3x4 times Vector4 (float)", [&] {
        glm::vec3 result = mat34f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const glm::dmat4x3 mat34d(1.0);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix3x4 times Vector4 (double)", [&] {
        glm::dvec3 result = mat34d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_eigen()
{
    const Eigen::Matrix4f mat4f = Eigen::Matrix4f::Random(4, 4);
    const Eigen::Vector4f vec4f(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Eigen: Matrix4x4 times Vector4 (float)", [&] {
        Eigen::Vector4f result = mat4f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const Eigen::Matrix4d mat4d = Eigen::Matrix4d::Random(4, 4);
    const Eigen::Vector4d vec4d(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Eigen: Matrix4x4 times Vector4 (double)", [&] {
        Eigen::Vector4d result = mat4d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const Eigen::Matrix<float, 3, 4> mat34f = Eigen::Matrix<float, 3, 4>::Random(3, 4);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Eigen: Matrix3x4 times Vector4 (float)", [&] {
        Eigen::Vector3f result = mat34f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
    
    const Eigen::Matrix<double, 3, 4> mat34d = Eigen::Matrix<double, 3, 4>::Random(3, 4);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Eigen: Matrix3x4 times Vector4 (double)", [&] {
        Eigen::Vector3d result = mat34d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

int main()
{
    benchmark_array_sum();

    benchmark_opencv_float();
    benchmark_opencv_double();

    benchmark_glm();

    benchmark_eigen();
    
    return 0;
}