In modern credit trading, understanding why PnL moves is just as important as tracking that it moved.
When billions of data points flow through trading systems daily, intuition and heuristics fall short.
This article presents a scalable method for attributing PnL changes to specific credit risk drivers.
We model feature deltas — changes in exposure, spreads, and counterparty metrics — across time.
We train a LightGBM model to learn the nonlinear relationships between feature deltas and ΔPnL.
Then, we apply TreeSHAP to extract the top features responsible for each credit-related PnL movement.
The result is a transparent, auditable, per-entity view of what caused the last PnL move.
This is not prediction but attribution: a diagnostic engine for credit risk.
1. Introduction: The Need for Attribution, Not Prediction
Credit-related PnL refers to profit and loss driven by credit risk exposure — such as changes in counterparty credit spreads, rating downgrades, or shifts in default probabilities.
It captures the financial impact of events like widening CDS, growing exposures, or collateral changes, rather than market or operational effects.
For traders, a sudden loss tied to a counterparty matters less than knowing which risk factor caused it.
For quants, understanding nonlinear relationships between inputs and PnL is key to model trust and validation.
And for risk teams, attribution supports regulatory reporting, stress testing, and early warnings.
But with data at the scale of billions of deltas, traditional methods fail to offer clear answers.
We propose a hybrid pipeline: use Python to train a LightGBM model on feature deltas and PnL deltas.
Then apply TreeSHAP to explain which deltas caused most of the PnL movement.
For real-time inference, we port the model and SHAP logic to a high-performance C++ backend.
The result: an explainable, scalable, and production-ready system for credit PnL attribution at scale.
2. Data Design: Tracking the Right Deltas
Every 15 minutes, a new snapshot arrives — a high-dimensional vector (10,000+ features) representing the full state of the trading system: exposure, spreads, counterparty credit ratings, margin calls, instrument properties, and more.
This raw data isn’t yet meaningful. To extract signal, we compute deltas: changes in features between consecutive time windows. These represent what moved.
Alongside the feature deltas, we compute the PnL delta, giving us a supervised signal to explain: what caused the profit or loss change since the last window?
We filter and engineer meaningful features — credit spreads, counterparty ratings, net exposure, sector-level aggregates — to reduce noise and emphasize credit drivers.
Using deltas instead of raw levels is key. Most credit features are auto-correlated or slow-moving — a high exposure value tells us less than a sudden increase in exposure.
The delta of a credit spread or notional tells us more about risk events, decisions, or systemic shifts than the absolute value alone.
By modeling ΔPnL ~ Δfeatures, we move from static snapshots to causal attribution.
📄 Mocked Extract of Delta Dataset
| timestamp | entity | Δ_exposure | Δ_cds_bnp | Δ_rating_score | Δ_sector_risk | ΔPnL |
|---|---|---|---|---|---|---|
| 2025-06-23 10:00:00 | trader_01 | +2.3M | +0.018 | -1 | +0.07 | -4.21M |
| 2025-06-23 10:15:00 | trader_01 | -1.1M | -0.005 | 0 | -0.03 | +2.73M |
| 2025-06-23 10:30:00 | trader_01 | +0.4M | +0.002 | 0 | 0 | -1.00M |
- Δ_exposure: notional delta to counterparties
- Δ_cds_bnp: change in the BNP CDS spread (in decimal)
- Δ_rating_score: discrete rating movement (e.g., -1 = downgrade)
- Δ_sector_risk: change in the weighted-average sector risk score
- ΔPnL: total change in credit-related PnL
In practice we scale this toy extract up dramatically, to roughly 10,000 feature columns and on the order of a million rows.
We generate the feature deltas in “delta_features.csv” and the matching PnL deltas in “delta_pnl.csv”.
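A hypothetical generator for these two files, scaled down to toy sizes: the feature names, coefficients, and the linear-plus-interaction target are ours, chosen only so the model has real (and partly nonlinear) signal to recover.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, n_cols = 1_000, 50  # toy sizes; the article's setup is ~10k x ~1M

# Random feature deltas
X = pd.DataFrame(
    rng.normal(size=(n_rows, n_cols)),
    columns=[f"f_{i}" for i in range(n_cols)],
)

# A few "true" credit drivers, one interaction term, and noise
y = (
    3.0 * X["f_0"]
    - 2.0 * X["f_1"]
    + 1.5 * X["f_2"] * (X["f_3"] > 0)
    + rng.normal(scale=0.1, size=n_rows)
)

X.to_csv("delta_features.csv", index=False)
pd.Series(y, name="delta_pnl").to_csv("delta_pnl.csv", index=False)
```

Any synthetic target works here, but mixing linear and interaction effects makes the later TreeSHAP output more interesting than a purely additive toy.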
3. Modeling: Training a Gradient Boosted Tree on Deltas
To prepare for TreeSHAP attribution in this credit delta → PnL delta setup, you’ll want to use a LightGBM regression model that emphasizes explainability, stability, and performance.
✅ Goal:
- Train a LightGBM model (Δfeatures → ΔPnL)
- Export the model (model.txt)
- Optionally export sample rows for testing C++ inference
🧠 File: train_model.py

```python
import lightgbm as lgb
import pandas as pd

# Load delta feature and delta PnL data
X = pd.read_csv("delta_features.csv")
y = pd.read_csv("delta_pnl.csv").squeeze("columns")  # labels as a Series

# Train model
params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.03,
    "num_leaves": 64,
    "feature_fraction": 0.6,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "lambda_l1": 0.1,
    "lambda_l2": 1.0,
    "min_data_in_leaf": 50,
    "verbose": -1,
}
# Note: lgb.train takes params first, then the Dataset
model = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)

# Save model
model.save_model("model.txt")

# Export a row (e.g., for test inference)
row_index = 12345
X.iloc[[row_index]].to_csv("sample_row.csv", index=False)
y.iloc[[row_index]].to_csv("sample_label.csv", index=False)
```
Then we can run the inference and attribution:
```python
import pandas as pd
import shap
import lightgbm as lgb

# Load model and data
model = lgb.Booster(model_file="model.txt")  # Trained LightGBM model
X = pd.read_csv("sample_row.csv")            # Latest 15-min feature deltas
y_pnl = pd.read_csv("sample_label.csv")      # Actual PnL deltas (optional)

# Run SHAP
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# For a specific row (e.g., the last 15-min observation)
row = 0
shap_row = shap_values[row]

# Create attribution table
explained = pd.DataFrame({
    "feature": X.columns,
    "shap_value": shap_row,
    "feature_value": X.iloc[row],
}).sort_values(by="shap_value", key=abs, ascending=False)

# Add cumulative contribution
explained["cum_impact"] = explained["shap_value"].cumsum()
total_impact = abs(explained["shap_value"]).sum()
explained["cum_percent"] = abs(explained["shap_value"]).cumsum() / total_impact

# Output: top features responsible for 90% of the PnL move
top_contributors = explained[explained["cum_percent"] <= 0.9]
print(top_contributors)
```
4. Results
📦 Output Example
```
          feature  shap_value  feature_value  cum_impact  cum_percent
23   spread_delta        0.68           12.0        0.68         0.42
14       curve_1y        0.51            3.2        1.19         0.72
9   exposure_bond        0.33           54.1        1.52         0.92
```
This shows:
- The features most responsible for the ΔPnL for that 15-min slice
- Their raw SHAP impact
- How much of the total impact they cumulatively explain