Mathieu GAILLARD

Matrix-vector product: OpenCV vs GLM vs Eigen

For a project involving 3D point triangulation and optimization, I needed to find the fastest way to compute the projection of a 3D point onto a 2D view. In homogeneous coordinates, this operation boils down to a 4-by-4 matrix-vector multiplication, which can be accelerated with SIMD instructions. Initially, to save time, I reused an implementation I had written for a university class, but it turned out to be far from optimal in terms of performance. Here I present the benchmark I wrote in C++ to compare OpenCV, GLM and Eigen.
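
To make the operation concrete, here is a minimal sketch of such a projection in homogeneous coordinates, written with glm (the function and variable names are illustrative, not taken from the project):

#include <glm/glm.hpp>

// Project a 3D point onto a 2D view: a single 4x4 matrix-vector product
// in homogeneous coordinates, followed by the perspective divide.
glm::vec2 project_point(const glm::mat4& view_projection, const glm::vec3& point)
{
    // Promote the 3D point to homogeneous coordinates (w = 1)
    const glm::vec4 clip = view_projection * glm::vec4(point, 1.f);

    // Perspective divide yields the 2D coordinates
    return glm::vec2(clip.x, clip.y) / clip.w;
}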

Key takeaways

Here are the important highlights of the experiment. See the conclusion for more details.

  • 4-by-4 matrix-vector multiplication is faster than 3-by-4 matrix-vector multiplication
  • double precision is not significantly slower than single precision
  • Results depend on the compiler and the library; in our case, gcc performs better than MSVC
  • Never use OpenCV’s cv::Mat for 3D geometry computations: it is extremely slow!
  • The AVX-512 instruction set is not useful for 4-by-4 matrix-vector multiplications

Method

I used a C++ micro-benchmarking library named nanobench and implemented a 4-by-4 matrix-vector multiplication using either OpenCV, GLM, or Eigen. To make the comparison more informative, I tested both float and double precision. I also tried a 3-by-4 matrix-vector multiplication to check whether it is faster than the 4-by-4 version. Each case follows the measurement pattern sketched below; the full source code is at the end of the post.
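
Here is one case, condensed from the full source; the constants match those used in the benchmark:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <glm/glm.hpp>

int main()
{
    const glm::mat4 mat4f(1.f);
    const glm::vec4 vec4f(1.f, 1.f, 1.f, 1.f);

    // Warm up, then run many iterations per epoch so that a
    // nanosecond-scale operation can be timed reliably
    ankerl::nanobench::Bench().warmup(1000).minEpochIterations(30000000)
        .run("GLM: Matrix4x4 times Vector4 (float)", [&] {
            glm::vec4 result = mat4f * vec4f;

            // Prevent the compiler from optimizing the computation away
            ankerl::nanobench::doNotOptimizeAway(result);
        });

    return 0;
}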

Results

I ran the benchmark on two computers: my Windows laptop and a Windows workstation from the university. Both have an Intel CPU with the AVX2 instruction set; the workstation additionally supports AVX-512. In the tables below (raw nanobench output), ns/op is the average time per operation, op/s the number of operations per second, err% the relative measurement error, and total the overall measurement time in seconds.

Dell Precision T5820 with an Intel Xeon W-2145
Windows 10 with Visual Studio 2022 version 17.3.6

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              814.35 |        1,227,973.74 |    0.3% |      2.98 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
|              822.10 |        1,216,398.71 |    0.7% |      3.00 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
|                7.51 |      133,084,031.35 |    0.6% |      2.47 | `OpenCV: Matx44f times Vector4 (float)`
|                8.67 |      115,304,500.68 |    0.1% |      2.85 | `OpenCV: Matx44d times Vector4 (double)`
|                4.08 |      245,041,633.15 |    0.4% |      1.49 | `GLM: Matrix4x4 times Vector4 (float)`
|                5.69 |      175,662,182.15 |    0.5% |      2.07 | `GLM: Matrix4x4 times Vector4 (double)`
|                3.05 |      327,899,204.92 |    0.4% |      1.11 | `GLM: Matrix3x4 times Vector4 (float)`
|                3.08 |      324,549,169.09 |    1.3% |      1.12 | `GLM: Matrix3x4 times Vector4 (double)`
|                1.61 |      622,895,521.72 |    1.8% |      0.58 | `Eigen: Matrix4x4 times Vector4 (float)`
|                1.68 |      594,626,598.16 |    0.3% |      0.62 | `Eigen: Matrix4x4 times Vector4 (double)`
|                3.98 |      251,082,244.98 |    0.4% |      1.45 | `Eigen: Matrix3x4 times Vector4 (float)`
|                2.89 |      346,103,412.46 |    0.3% |      1.05 | `Eigen: Matrix3x4 times Vector4 (double)`

Dell Precision T5820 with an Intel Xeon W-2145
Windows 10 WSL with g++ version 11.2.0

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              574.54 |        1,740,536.07 |    0.5% |      2.09 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
|              574.61 |        1,740,316.20 |    0.5% |      1.89 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
|                3.78 |      264,604,361.40 |    1.0% |      1.37 | `OpenCV: Matx44f times Vector4 (float)`
|                3.75 |      266,524,926.63 |    0.5% |      1.37 | `OpenCV: Matx44d times Vector4 (double)`
|                0.98 |    1,015,373,609.76 |    0.4% |      0.36 | `GLM: Matrix4x4 times Vector4 (float)`
|                1.13 |      887,509,294.54 |    1.1% |      0.41 | `GLM: Matrix4x4 times Vector4 (double)`
|                1.79 |      559,536,142.55 |    1.2% |      0.65 | `GLM: Matrix3x4 times Vector4 (float)`
|                1.87 |      533,794,861.38 |    0.3% |      0.69 | `GLM: Matrix3x4 times Vector4 (double)`
|                1.39 |      721,918,390.71 |    0.5% |      0.51 | `Eigen: Matrix4x4 times Vector4 (float)`
|                1.55 |      645,222,153.03 |    0.7% |      0.56 | `Eigen: Matrix4x4 times Vector4 (double)`
|                1.79 |      558,964,815.22 |    1.3% |      0.65 | `Eigen: Matrix3x4 times Vector4 (float)`
|                2.23 |      448,510,911.71 |    0.5% |      0.82 | `Eigen: Matrix3x4 times Vector4 (double)`

Dell XPS 17 9700 with an Intel Core i7-10875H
Windows 10 with Visual Studio 2022 version 17.3.6

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              842.56 |        1,186,864.16 |    0.7% |      3.10 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
|              823.40 |        1,214,480.65 |    0.8% |      3.00 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
|                7.14 |      139,958,475.16 |    0.4% |      2.35 | `OpenCV: Matx44f times Vector4 (float)`
|                7.90 |      126,551,768.52 |    0.4% |      2.59 | `OpenCV: Matx44d times Vector4 (double)`
|                4.04 |      247,224,394.58 |    0.6% |      1.47 | `GLM: Matrix4x4 times Vector4 (float)`
|                5.10 |      195,890,937.54 |    0.1% |      1.86 | `GLM: Matrix4x4 times Vector4 (double)`
|                3.10 |      322,251,119.32 |    1.1% |      1.13 | `GLM: Matrix3x4 times Vector4 (float)`
|                3.13 |      319,531,239.59 |    0.5% |      1.14 | `GLM: Matrix3x4 times Vector4 (double)`
|                1.54 |      648,734,166.30 |    1.4% |      0.56 | `Eigen: Matrix4x4 times Vector4 (float)`
|                1.53 |      651,892,822.93 |    0.5% |      0.56 | `Eigen: Matrix4x4 times Vector4 (double)`
|                3.93 |      254,772,280.05 |    0.7% |      1.42 | `Eigen: Matrix3x4 times Vector4 (float)`
|                2.70 |      370,644,876.89 |    2.0% |      0.99 | `Eigen: Matrix3x4 times Vector4 (double)`

Dell XPS 17 9700 with an Intel Core i7-10875H
Windows 10 WSL with g++ version 11.2.0

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              574.13 |        1,741,751.58 |    1.3% |      2.08 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
|              561.67 |        1,780,402.66 |    1.3% |      2.05 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
|                3.66 |      273,285,071.60 |    0.4% |      1.20 | `OpenCV: Matx44f times Vector4 (float)`
|                3.69 |      270,684,689.01 |    0.7% |      1.21 | `OpenCV: Matx44d times Vector4 (double)`
|                0.96 |    1,041,278,471.09 |    1.4% |      0.35 | `GLM: Matrix4x4 times Vector4 (float)`
|                1.03 |      972,124,429.18 |    1.3% |      0.37 | `GLM: Matrix4x4 times Vector4 (double)`
|                1.75 |      570,792,360.05 |    1.2% |      0.64 | `GLM: Matrix3x4 times Vector4 (float)`
|                1.86 |      537,884,955.04 |    1.5% |      0.68 | `GLM: Matrix3x4 times Vector4 (double)`
|                1.36 |      735,029,481.30 |    0.9% |      0.49 | `Eigen: Matrix4x4 times Vector4 (float)`
|                1.38 |      723,380,789.97 |    1.6% |      0.50 | `Eigen: Matrix4x4 times Vector4 (double)`
|                1.76 |      568,813,038.55 |    2.1% |      0.63 | `Eigen: Matrix3x4 times Vector4 (float)`
|                1.92 |      520,583,076.87 |    0.4% |      0.70 | `Eigen: Matrix3x4 times Vector4 (double)`

Discussion

To better understand the results, let’s look at the generated assembly. For this, I am using the excellent Compiler Explorer.

4-by-4 versus 3-by-4 multiplication

First, why is the 4-by-4 matrix-vector multiplication faster than the 3-by-4 one, even though it actually involves more computation? The assembly generated for glm answers the question:

void glm_mul4(glm::dvec4& C, const glm::dmat4& A, const glm::dvec4& B)
{
    C = A * B;
}
// Note: glm::dmat4x3 has 4 columns and 3 rows, i.e. a 3-by-4 matrix in math notation
void glm_mul3x4(glm::dvec3& C, const glm::dmat4x3& A, const glm::dvec4& B)
{
    C = A * B;
}
glm_mul4(glm::vec<4, double, (glm::qualifier)0>&, glm::mat<4, 4, double, (glm::qualifier)0> const&, glm::vec<4, double, (glm::qualifier)0> const&):
        vmovupd ymm0, YMMWORD PTR [rdx]
        vxorps  xmm1, xmm1, xmm1
        vpermpd ymm1, ymm0, 85
        vmulpd  ymm1, ymm1, YMMWORD PTR [rsi+32]
        vbroadcastsd    ymm2, xmm0
        vfmadd132pd     ymm2, ymm1, YMMWORD PTR [rsi]
        vxorps  xmm1, xmm1, xmm1
        vpermpd ymm1, ymm0, 170
        vpermpd ymm0, ymm0, 255
        vmulpd  ymm0, ymm0, YMMWORD PTR [rsi+96]
        vfmadd231pd     ymm0, ymm1, YMMWORD PTR [rsi+64]
        vaddpd  ymm0, ymm0, ymm2
        vmovupd YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
glm_mul3x4(glm::vec<3, double, (glm::qualifier)0>&, glm::mat<4, 3, double, (glm::qualifier)0> const&, glm::vec<4, double, (glm::qualifier)0> const&):
        vmovsd  xmm1, QWORD PTR [rdx+16]
        vmovupd xmm0, XMMWORD PTR [rdx]
        vmovddup        xmm4, xmm1
        vmulsd  xmm1, xmm1, QWORD PTR [rsi+80]
        vpermilpd       xmm3, xmm0, 0
        vpermilpd       xmm2, xmm0, 3
        vfmadd231sd     xmm1, xmm0, QWORD PTR [rsi+16]
        vunpckhpd       xmm0, xmm0, xmm0
        vfmadd132sd     xmm0, xmm1, QWORD PTR [rsi+48]
        vmulpd  xmm1, xmm4, XMMWORD PTR [rsi+64]
        vfmadd231pd     xmm1, xmm3, XMMWORD PTR [rsi]
        vfmadd231pd     xmm1, xmm2, XMMWORD PTR [rsi+32]
        vmovsd  QWORD PTR [rdi+16], xmm0
        vmovupd XMMWORD PTR [rdi], xmm1
        ret

We can see that the function glm_mul3x4 uses SSE instructions (xmm registers) whereas the function glm_mul4 uses AVX2 instructions (ymm registers): a column of four doubles exactly fills a 256-bit ymm register, while a column of three doubles does not. Although both implementations contain roughly the same number of instructions, the AVX2 version executes slightly fewer arithmetic instructions (multiplications and fused multiply-adds) than the SSE version, which likely explains the slight difference in performance between the two approaches.
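
To make this concrete, the 4-by-4 double-precision product can be written by hand with AVX2/FMA intrinsics following the same broadcast-and-FMA scheme the compilers generate; this is an illustrative sketch assuming a column-major matrix, not the actual code of any of the libraries:

#include <immintrin.h>

// y = A * x with A a column-major 4x4 matrix of doubles.
// Each column of A exactly fills one 256-bit ymm register (4 doubles),
// so the whole product is one multiply plus three fused multiply-adds.
// Requires AVX2/FMA (e.g. compile with -mavx2 -mfma, or /arch:AVX2).
void mat4_mul_vec4_avx2(double y[4], const double A[16], const double x[4])
{
    __m256d acc = _mm256_mul_pd(_mm256_broadcast_sd(&x[0]), _mm256_loadu_pd(&A[0]));
    acc = _mm256_fmadd_pd(_mm256_broadcast_sd(&x[1]), _mm256_loadu_pd(&A[4]), acc);
    acc = _mm256_fmadd_pd(_mm256_broadcast_sd(&x[2]), _mm256_loadu_pd(&A[8]), acc);
    acc = _mm256_fmadd_pd(_mm256_broadcast_sd(&x[3]), _mm256_loadu_pd(&A[12]), acc);
    _mm256_storeu_pd(y, acc);
}

With a 3-row matrix, each column only holds three doubles, so the compiler falls back on 128-bit xmm operations plus scalar fix-ups, as in the glm_mul3x4 listing above.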

Eigen versus glm

In the gcc results, glm appears to be slightly faster than Eigen (under MSVC, Eigen wins). Here is a comparison of the two implementations:

void eigen_mul4(Vector4d &C, const Matrix4d& A, const Vector4d& B)
{
    C = A * B;
}

void glm_mul4(glm::dvec4& C, const glm::dmat4& A, const glm::dvec4& B)
{
    C = A * B;
}
eigen_mul4(Eigen::Matrix<double, 4, 1, 0, 4, 1>&, Eigen::Matrix<double, 4, 4, 0, 4, 4> const&, Eigen::Matrix<double, 4, 1, 0, 4, 1> const&):
        vbroadcastsd    ymm0, QWORD PTR [rdx]
        vbroadcastsd    ymm1, QWORD PTR [rdx+8]
        vmulpd  ymm0, ymm0, YMMWORD PTR [rsi]
        vbroadcastsd    ymm2, QWORD PTR [rdx+16]
        vbroadcastsd    ymm3, QWORD PTR [rdx+24]
        vfmadd231pd     ymm0, ymm1, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm0, ymm2, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm0, ymm3, YMMWORD PTR [rsi+96]
        vmovapd YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
glm_mul4(glm::vec<4, double, (glm::qualifier)0>&, glm::mat<4, 4, double, (glm::qualifier)0> const&, glm::vec<4, double, (glm::qualifier)0> const&):
        vmovupd ymm0, YMMWORD PTR [rdx]
        vxorps  xmm1, xmm1, xmm1
        vpermpd ymm1, ymm0, 85
        vmulpd  ymm1, ymm1, YMMWORD PTR [rsi+32]
        vbroadcastsd    ymm2, xmm0
        vfmadd132pd     ymm2, ymm1, YMMWORD PTR [rsi]
        vxorps  xmm1, xmm1, xmm1
        vpermpd ymm1, ymm0, 170
        vpermpd ymm0, ymm0, 255
        vmulpd  ymm0, ymm0, YMMWORD PTR [rsi+96]
        vfmadd231pd     ymm0, ymm1, YMMWORD PTR [rsi+64]
        vaddpd  ymm0, ymm0, ymm2
        vmovupd YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret

The two implementations compile to different assembly: Eigen’s version broadcasts each component of the vector straight from memory and chains three fused multiply-adds, whereas glm’s version loads the whole vector once and then shuffles it with vpermpd, paying for extra shuffles and a final vaddpd. Counterintuitively, the glm version still measured faster under gcc; I did not dig deep enough to understand how these differences translate into the measured gap.

Conclusion

As shown in the results, OpenCV’s cv::Mat matrix is terribly slow and should not be used for intensive 3D mathematics. The alternative is cv::Matx44d, but it is still quite slow compared to glm or Eigen. This is likely because cv::Mat is dynamically sized and heap-allocated rather than having a fixed size defined at compile time, as in glm and Eigen. It is understandable, though, as OpenCV is designed to process high-resolution images.

Depending on the compiler and the operating system, either glm or Eigen performs best. Interestingly, double precision is not necessarily significantly slower than single precision in this benchmark. Single precision usually wins in memory-bound scenarios, but that is not the case here since the matrices fit in the CPU cache. Another interesting finding is that using 4-by-4 matrices instead of 3-by-4 can actually be faster even though it involves more computation, because with a 3-by-4 matrix the compiler cannot make full use of the AVX2 instructions.

As seen in the compiler output, matrix-vector multiplication does not benefit from AVX-512. As future work, it would be interesting to run the same benchmark with matrix-matrix multiplications and see whether AVX-512 provides a significant boost compared to AVX2. It would also be interesting to investigate whether the use of SIMD instructions makes the computation memory-bound, and how much performance is lost because of it.

Code

Here is the CMake file that automatically downloads the dependencies and compiles them together with the benchmark.

cmake_minimum_required(VERSION 3.16)

project(BenchmarkProject VERSION 1.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include(FetchContent)

# Activate OpenMP
find_package(OpenMP REQUIRED)

# Retrieve and activate the nanobench library
FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.3.6
    GIT_SHALLOW TRUE)
FetchContent_MakeAvailable(nanobench)

# Include OpenCV 4.5.5
option(WITH_OPENMP "Build OpenCV with OpenMP support" ON)
FetchContent_Declare(opencv
    GIT_REPOSITORY https://github.com/opencv/opencv.git
    GIT_TAG        4.5.5
    GIT_SHALLOW    TRUE
)
FetchContent_MakeAvailable(opencv)
# Manual include dirs for OpenCV (could not find better than this hack)
set(OpenCV_INCLUDE_DIRS "")
list(APPEND OpenCV_INCLUDE_DIRS ${OPENCV_CONFIG_FILE_INCLUDE_DIR})
list(APPEND OpenCV_INCLUDE_DIRS ${OPENCV_MODULE_opencv_core_LOCATION}/include)
set(OpenCV_LIBS "")
list(APPEND OpenCV_LIBS opencv_core)

# Retrieve and activate glm
option(BUILD_SHARED_LIBS "Build shared library" OFF)
option(BUILD_STATIC_LIBS "Build static library" OFF)
option(GLM_TEST_ENABLE "Build unit tests" OFF)
FetchContent_Declare(
    glm
    GIT_REPOSITORY https://github.com/g-truc/glm.git
    GIT_TAG 0.9.9.8
    GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(glm)

# Retrieve and activate Eigen3
FetchContent_Declare(
    eigen
    GIT_REPOSITORY https://gitlab.com/libeigen/eigen.git
    GIT_TAG 3.4.0
    GIT_SHALLOW TRUE
)
FetchContent_GetProperties(eigen)
if(NOT eigen_POPULATED)
    FetchContent_Populate(eigen)
endif()
set(EIGEN_INCLUDE_DIRS ${eigen_SOURCE_DIR})


add_executable(Benchmark main.cpp)

if (MSVC)
    # Warning level 4
    target_compile_options(Benchmark PRIVATE /W4)
    # OpenMP SIMD support and AVX2 code generation
    target_compile_options(Benchmark PRIVATE /openmp:experimental)
    target_compile_options(Benchmark PRIVATE /arch:AVX2)
else()
    # Lots of warnings
    target_compile_options(Benchmark PRIVATE -Wall -Wextra -pedantic)
    # Generate code for the instruction sets available on the build machine
    target_compile_options(Benchmark PRIVATE -march=native)
endif()

target_include_directories(Benchmark
    PRIVATE
    ${OpenCV_INCLUDE_DIRS}
    $<BUILD_INTERFACE:${EIGEN_INCLUDE_DIRS}>
    $<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>
)

target_link_libraries(Benchmark
    PRIVATE
    nanobench
    ${OpenCV_LIBS}
    glm::glm
    OpenMP::OpenMP_CXX
)
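
With this CMakeLists.txt and a main.cpp in the same directory, a typical out-of-source build looks like the following; make sure to benchmark an optimized (Release) build, and note that exact paths vary with the generator:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/Benchmark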

The source code of the benchmark:

#include <array>
#include <atomic>
#include <iostream>

#include <omp.h>

#include <opencv2/core.hpp>

#include <glm/glm.hpp>

#include <Eigen/Dense>

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

// Number of warm-up iterations run before each measurement
constexpr int WARMUP_ITERATIONS = 1000;
// Minimum number of iterations per measurement epoch
constexpr int MIN_EPOCHS = 30000000;

template <typename T, uint64_t N>
T array_sum(const std::array<T, N>& arr)
{
    T sum = 0;

    for (unsigned int i = 0; i < arr.size(); i++)
    {
        sum += arr[i];
    }

    return sum;
}

template <typename T, uint64_t N>
T array_sum_simd(const std::array<T, N>& arr)
{
    T sum = 0;

    #pragma omp simd reduction(+:sum)
    for (unsigned int i = 0; i < arr.size(); i++)
    {
        sum += arr[i];
    }

    return sum;
}

void benchmark_array_sum()
{
    constexpr unsigned int N = 16;

    std::array<float, N> arr_float{};
    std::array<double, N> arr_double{};

    for (unsigned int i = 0; i < N; i++)
    {
        arr_float[i] = 1.f;
        arr_double[i] = 1.0;
    }

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("Sum of 16 values (float)", [&] {
        auto sum = array_sum(arr_float);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("OpenMP: Sum of 16 values (float)", [&] {
        auto sum = array_sum_simd(arr_float);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("Sum of 16 values (double)", [&] {
        auto sum = array_sum(arr_double);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("OpenMP: Sum of 16 values (double)", [&] {
        auto sum = array_sum_simd(arr_double);

        ankerl::nanobench::doNotOptimizeAway(sum);
    });
}

void benchmark_opencv_float()
{
    const cv::Mat1f mat4 = (cv::Mat1f(4, 4) <<
        1.f, 0.f, 0.f, 0.f,
        0.f, 1.f, 0.f, 0.f,
        0.f, 0.f, 1.f, 0.f,
        0.f, 0.f, 0.f, 1.f
        );
    const cv::Vec4f vec4(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS / 100).minEpochIterations(MIN_EPOCHS / 100)
	.run("OpenCV: Mat1d(4x4) times Vector4 (float)", [&] {
        cv::Mat1f result = mat4 * vec4;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_opencv_double()
{
    const cv::Mat1d mat4 = (cv::Mat1d(4, 4) << 
        1.0, 0.0, 0.0, 0.0,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 1.0
        );
    const cv::Vec4d vec4(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS / 100).minEpochIterations(MIN_EPOCHS / 100)
	.run("OpenCV: Mat1d(4x4) times Vector4 (double)", [&] {
        cv::Mat1d result = mat4 * vec4;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_opencv_fixed_float()
{
    const cv::Matx44f mat4(
        1.f, 0.f, 0.f, 0.f,
        0.f, 1.f, 0.f, 0.f,
        0.f, 0.f, 1.f, 0.f,
        0.f, 0.f, 0.f, 1.f
    );
    const cv::Vec4f vec4(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("OpenCV: Matx44f times Vector4 (float)", [&] {
        cv::Vec4f result = mat4 * vec4;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_opencv_fixed_double()
{
    const cv::Matx44d mat4(
        1.0, 0.0, 0.0, 0.0,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 1.0
        );
    const cv::Vec4d vec4(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("OpenCV: Matx44d times Vector4 (double)", [&] {
        cv::Vec4d result = mat4 * vec4;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_glm()
{
    const glm::mat4 mat4f(1.f);
    const glm::vec4 vec4f(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("GLM: Matrix4x4 times Vector4 (float)", [&] {
        glm::vec4 result = mat4f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const glm::dmat4 mat4d(1.0);
    const glm::dvec4 vec4d(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix4x4 times Vector4 (double)", [&] {
        glm::dvec4 result = mat4d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    // glm::mat4x3 has 4 columns and 3 rows, i.e. a 3-by-4 matrix in math notation
    const glm::mat4x3 mat34f(1.f);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix3x4 times Vector4 (float)", [&] {
        glm::vec3 result = mat34f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const glm::dmat4x3 mat34d(1.0);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("GLM: Matrix3x4 times Vector4 (double)", [&] {
        glm::dvec3 result = mat34d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

void benchmark_eigen()
{
    const Eigen::Matrix4f mat4f = Eigen::Matrix4f::Random(4, 4);
    const Eigen::Vector4f vec4f(1.f, 1.f, 1.f, 1.f);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("Eigen: Matrix4x4 times Vector4 (float)", [&] {
        Eigen::Vector4f result = mat4f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const Eigen::Matrix4d mat4d = Eigen::Matrix4d::Random(4, 4);
    const Eigen::Vector4d vec4d(1.0, 1.0, 1.0, 1.0);

    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
        .run("Eigen: Matrix4x4 times Vector4 (double)", [&] {
        Eigen::Vector4d result = mat4d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });

    const Eigen::Matrix<float, 3, 4> mat34f = Eigen::Matrix<float, 3, 4>::Random(3, 4);
    ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("Eigen: Matrix3x4 times Vector4 (float)", [&] {
        Eigen::Vector3f result = mat34f * vec4f;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
    
    const Eigen::Matrix<double, 3, 4> mat34d = Eigen::Matrix<double, 3, 4>::Random(3, 4);
	ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
	.run("Eigen: Matrix3x4 times Vector4 (double)", [&] {
        Eigen::Vector3d result = mat34d * vec4d;

        ankerl::nanobench::doNotOptimizeAway(result);
    });
}

int main()
{
    // benchmark_array_sum();

    benchmark_opencv_float();
    benchmark_opencv_double();

    benchmark_opencv_fixed_float();
    benchmark_opencv_fixed_double();

    benchmark_glm();

    benchmark_eigen();
    
    return 0;
}