Multithreading in C++ is one of those topics that every developer eventually runs into, whether they’re working in finance, gaming, or scientific computing. The language gives you raw primitives, but it also integrates with a whole ecosystem of libraries that scale from a few threads on your laptop to thousands of cores in a data center.
Choosing the right tool matters: which libraries fit your quantitative finance use case?
1. Standard C++ Threads (Low-Level Control)
Since C++11, <thread>, <mutex>, and <future> are part of the standard library. You manage threads directly, which keeps code portable and dependency-free.
Example: Parallel computation of moving averages in a trading engine
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex io_mutex; // serializes printing so output from the two threads does not interleave

void moving_average(const std::vector<double>& data, int start, int end) {
    for (int i = start; i < end; i++) {
        if (i >= 2) {
            double avg = (data[i] + data[i-1] + data[i-2]) / 3.0;
            std::lock_guard<std::mutex> lock(io_mutex);
            std::cout << "Index " << i << " avg = " << avg << "\n";
        }
    }
}

int main() {
    std::vector<double> prices = {100,101,102,103,104,105,106,107};
    std::thread t1(moving_average, std::ref(prices), 0, 4);
    std::thread t2(moving_average, std::ref(prices), 4, static_cast<int>(prices.size()));
    t1.join();
    t2.join();
}
2. Intel oneTBB (Task-Based Parallelism)
oneTBB (oneAPI Threading Building Blocks) provides parallel loops, pipelines, and task graphs, making it well suited to HPC workloads and financial risk simulations.
Example: Monte Carlo option pricing
#include <tbb/parallel_for.h>
#include <cmath>
#include <random>
#include <vector>

int main() {
    const int N = 1'000'000;
    std::vector<double> results(N);
    tbb::parallel_for(0, N, [&](int i) {
        // A single generator shared across workers would be a data race;
        // give each worker thread its own thread-local generator instead.
        thread_local std::mt19937 gen(42);
        thread_local std::normal_distribution<> dist(0, 1);
        double z = dist(gen);
        results[i] = std::exp(-0.5 * z * z); // toy payoff
    });
}
3. OpenMP (Loop Parallelism for HPC)
OpenMP is widely used in scientific computing. You add pragmas, and the compiler generates parallel code.
#include <omp.h>
#include <vector>

int main() {
    const int N = 500;
    std::vector<std::vector<double>> A(N, std::vector<double>(N, 1));
    std::vector<std::vector<double>> B(N, std::vector<double>(N, 2));
    std::vector<std::vector<double>> C(N, std::vector<double>(N, 0));

    // Parallelize the outer loop: each thread computes a block of rows of C.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}
4. Boost.Asio (Async Networking and Thread Pools)
Boost.Asio is ideal for low-latency servers, networking, and I/O-heavy workloads (e.g. trading gateways).
#include <boost/asio.hpp>
#include <functional>
#include <memory>
#include <string>

using boost::asio::ip::tcp;

int main() {
    boost::asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 12345));
    std::function<void()> do_accept = [&]() {
        auto socket = std::make_shared<tcp::socket>(io);
        acceptor.async_accept(*socket, [&, socket](boost::system::error_code ec) {
            if (!ec) {
                // The buffer must outlive the async read, so hold it in a
                // shared_ptr captured by the completion handler.
                auto buf = std::make_shared<std::string>();
                boost::asio::async_read_until(*socket, boost::asio::dynamic_buffer(*buf), '\n',
                    [socket, buf](boost::system::error_code, std::size_t) {
                        // length 5 excludes the string literal's null terminator
                        boost::asio::write(*socket, boost::asio::buffer("pong\n", 5));
                    });
            }
            do_accept(); // keep accepting new connections
        });
    };
    do_accept();
    io.run();
}
5. Parallel STL (<execution>)
C++17 added execution policies for standard algorithms. This makes parallelism easy.
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<int> trades = {5,1,9,3,2,8};
    // std::execution::par lets the implementation sort on multiple threads.
    std::sort(std::execution::par, trades.begin(), trades.end());
}
6. Conclusion
Multithreading in C++ offers many models, each suited to different workloads. Use std::thread for low-level control of system tasks. Adopt oneTBB or OpenMP for data-parallel HPC simulations. Leverage Boost.Asio for async networking and trading engines. Rely on CUDA or SYCL for GPU acceleration in Monte Carlo or ML workloads. Enable the Parallel STL (<execution>) for easy speed-ups in modern code. Try actor frameworks such as CAF or HPX for distributed, message-driven systems.
Compiler flags also make a big difference in multithreaded performance. Always build with -O3 -march=native (or /O2 in MSVC). Use -fopenmp or link against TBB's scalable allocators when relevant. Prevent false sharing with alignas(64) and prefer thread_local scratchpads. Mark non-aliasing pointers with __restrict__ to help vectorization. Consider specialized allocators (jemalloc, TBB) in multithreaded apps. Profile with -fsanitize=thread to catch race conditions early.
The key: match the concurrency model and compiler setup to your workload for maximum speed.