C++ for Performance: 5 Ideas to Speed Up Your Quantitative Code

by Clement Daubrenet

In quantitative finance, milliseconds can mean millions. Whether you’re pricing exotic derivatives, processing high-frequency trades, or running Monte Carlo simulations, performance is non-negotiable. C++ remains the go-to language for building ultra-fast systems thanks to its low-level control and fine-tuned memory management. Here are five ideas to optimize your C++ for performance.

1. Prefer Stack Allocation Over Heap

Heap allocations (new/delete) are costly due to the overhead of dynamic memory management and potential fragmentation. Stack allocation, on the other hand, is faster, safer, and automatically cleaned up when the scope ends:

Stack vs. Heap Memory Table

Parameter | Stack | Heap
Structure | Linear (LIFO: last in, first out) | No fixed structure; hierarchical access possible
Basic allocation | Contiguous and sequential | May be contiguous, but not guaranteed (depends on allocator and fragmentation)
Allocation & deallocation | Automatic, by the compiler (on function entry/exit) | Manual (new/delete, malloc/free) or via smart pointers
Cost | Very low overhead | Higher overhead from allocator logic and possible fragmentation
Size limit | Fixed per thread (set by the OS, typically a few MB) | Limited by total available system memory
Access time | Very fast (predictable layout, cache-friendly) | Slower (more indirection, potential page faults)
Flexibility | Fixed size, defined at compile time | Dynamically resizable
Typical size | Small | Can grow large (useful for big data structures)
Resize | Not resizable after allocation | Resizable (e.g., realloc, std::vector::resize())

So if you want speed, prefer stack allocation:

// Slower: heap allocation
MyMatrix* mat = new MyMatrix(1000, 1000); 
process(*mat);
delete mat;

// Faster: stack allocation
MyMatrix mat(1000, 1000); 
process(mat);

Use smart pointers or containers only when dynamic allocation is necessary; favor std::array for fixed-size data, and std::vector with reserved capacity when the size is only known at runtime.
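
As a minimal sketch of these three cases (the Quote struct and the sizes here are illustrative, not from a real codebase):

#include <array>
#include <memory>
#include <vector>

struct Quote { double bid, ask; };

void example() {
    // Size known at compile time: stays on the stack, zero allocations
    std::array<Quote, 64> book{};

    // Size known only at runtime: one upfront heap allocation
    std::vector<Quote> ticks;
    ticks.reserve(10'000);

    // Dynamic allocation genuinely needed: smart pointer, no manual delete
    auto history = std::make_unique<std::vector<Quote>>();
}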

2. Avoid Virtual Functions in Hot Paths

A virtual function lets C++ decide at runtime which version of a function to call, based on the actual object type, not the pointer or reference type.

Virtual functions use vtables for dynamic dispatch, introducing a level of indirection that prevents inlining and hurts CPU branch prediction.

A vtable (short for virtual table) is a mechanism used by C++ to implement runtime polymorphism — specifically, virtual function calls.

When a class has at least one virtual function, the compiler generates:

  • A vtable: a table of function pointers for that class.
  • A vptr (virtual table pointer): a hidden pointer added to each object instance, pointing to the appropriate vtable.
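
A quick way to see the hidden vptr is to compare object sizes. The exact numbers are platform-dependent; on a typical 64-bit system the virtual version carries an extra 8-byte pointer:

#include <iostream>

struct Plain { double x; };                                       // no virtual functions: no vptr
struct WithVtable { double x; virtual ~WithVtable() = default; }; // compiler adds a vptr

int main() {
    std::cout << sizeof(Plain) << "\n";       // typically 8
    std::cout << sizeof(WithVtable) << "\n";  // typically 16: 8 for x + 8 for the hidden vptr
}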

In tight loops or latency-critical sections, replacing virtual calls with compile-time alternatives like templates (static dispatch) can significantly improve performance.

// Slower: virtual dispatch
#include <vector>

struct Instrument {
    virtual double price() const = 0;
};

double sumPrices(const std::vector<Instrument*>& instruments) {
    double total = 0;
    for (const auto* instr : instruments) {
        total += instr->price(); // Virtual call
    }
    return total;
}

Using templates is significantly faster:

#include <iostream>
#include <vector>

struct Bond {
    double price() const { return 100.0; }
};

template<typename T>
double sumPrices(const std::vector<T>& instruments) {
    double total = 0;
    for (const auto& instr : instruments) {
        total += instr.price();  // Resolved at compile time, can be inlined
    }
    return total;
}
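
For example, a hypothetical driver (the portfolio size is arbitrary):

int main() {
    std::vector<Bond> bonds(1'000'000);     // one million identical bonds
    std::cout << sumPrices(bonds) << "\n";  // Bond::price() is inlined into the loop
    return 0;
}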

3. Use reserve() for Vectors

When using std::vector, every push_back beyond the current capacity forces the vector to allocate a new memory block, copy or move the existing elements, and deallocate the old block, which is expensive. In performance-critical paths like simulations or data loading, this overhead adds up quickly.

If you know (or can estimate) the number of elements in advance, call vector.reserve(n) to allocate memory once upfront. This avoids repeated reallocations and boosts speed significantly.

std::vector<double> prices;

// Inefficient: multiple reallocations as vector grows
for (int i = 0; i < 1'000'000; ++i) {
    prices.push_back(i * 0.01);
}

// Better: allocate memory once
std::vector<double> fast_prices;
fast_prices.reserve(1'000'000);  // Preallocate
for (int i = 0; i < 1'000'000; ++i) {
    fast_prices.push_back(i * 0.01);
}

Why Not Always Use the Stack?

The stack is fast because:

  • Allocation/deallocation is automatic.
  • It’s contiguous and cache-friendly.
  • No fragmentation or dynamic bookkeeping.

But it comes with strict limitations: the stack size is limited (typically a few megabytes per thread), and allocation sizes must be known at compile time; runtime-sized stack arrays require C99-style VLAs, which are only a non-standard compiler extension in C++.
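
A minimal sketch of the failure mode, assuming a typical default stack of a few megabytes (the sizes are illustrative):

#include <vector>

void stackOverflowRisk() {
    double samples[10'000'000];  // ~80 MB on the stack: likely a crash
    samples[0] = 1.0;            // touching it triggers the overflow
}

void heapIsFine() {
    std::vector<double> samples(10'000'000);  // ~80 MB on the heap: fine
    samples[0] = 1.0;
}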

4. Leverage Compiler Optimizations

Modern C++ compilers are incredibly powerful, but you have to ask for the performance. By default (-O0), they prioritize fast compilation and debuggability over speed. Turning on aggressive optimization flags like -O2, -O3, -march=native, and -flto enables advanced techniques like loop unrolling, inlining, vectorization, and dead code elimination.

These flags can deliver huge speedups, especially for compute-heavy quant workloads like Monte Carlo simulations, matrix operations, or pricing curves.

# Basic optimization
g++ -O2 mycode.cpp -o myapp

# Aggressive + hardware-specific + link-time optimization
g++ -O3 -march=native -flto mycode.cpp -o myapp

🧠 Key Flags:

  • -O2: General optimizations (a safe default).
  • -O3: Adds aggressive loop optimizations and inlining.
  • -march=native: Tailors code to your CPU (enables AVX, SSE, etc.); the resulting binary may not run on older machines.
  • -flto: Link-time optimization; lets the compiler optimize across translation units.

⚠️ Use profiling tools like perf, gprof, or valgrind to validate the gains: -O3 can make things faster, but also larger and harder to debug.
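
For example, perf on Linux gives a quick before/after comparison (gprof and valgrind offer similar workflows):

# Coarse counters: cycles, instructions per cycle, branch misses
perf stat ./myapp

# Per-function hotspot breakdown
perf record ./myapp
perf report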

5. Minimize Lock Contention

In multi-threaded quant systems, excessive use of std::mutex can serialize threads, causing performance bottlenecks. Lock contention happens when multiple threads fight to acquire the same lock, leading to context switches and degraded latency.

To reduce contention:

  • Keep critical sections short.
  • Use std::atomic for simple shared data.
  • Prefer lock-free structures or per-thread buffers where possible (see the sketch after the atomic example below).

Example: Avoiding mutex with std::atomic

#include <atomic>

std::atomic<int> counter{0};

// Thread-safe increment without a lock
void safeIncrement() {
    counter.fetch_add(1, std::memory_order_relaxed);
}
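
And a minimal sketch of the per-thread-buffer idea (the names here are illustrative): each thread accumulates into a private vector and takes the lock only once, to merge its results at the end.

#include <mutex>
#include <vector>

std::mutex merge_mutex;
std::vector<double> global_results;

void worker(int n) {
    std::vector<double> local;      // private buffer: no contention
    local.reserve(n);
    for (int i = 0; i < n; ++i) {
        local.push_back(i * 0.01);  // stand-in for real per-item work
    }

    // One short critical section per thread, not one per item
    std::lock_guard<std::mutex> lock(merge_mutex);
    global_results.insert(global_results.end(), local.begin(), local.end());
}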