5.5. Model Monitoring

Understanding how your models perform in a live environment is the most direct way to learn how to improve them.

Even models that have been through a rigorous development process and peer review can perform worse than expected, or deteriorate faster than expected.

A key reason in insurance pricing is the feedback loop: changing a model changes the price, which immediately changes the mix of business being written. If you don’t monitor this carefully, you can end up with unexpected shifts in portfolio risk, profitability, or competitiveness.

Learning from where previous models underperformed means you can set better expectations for future models, or proactively mitigate the underlying drivers of deterioration.


Collect inference data

Even though modelling pipelines should be fully reproducible, simulating past predictions on historical data is cumbersome. It requires rebuilding the exact training environment and maintaining additional processes.

Instead, capture model predictions directly at the point of inference. This can be done either:
- From the model endpoint (if you deploy via API), or
- From the rating algorithm (if embedded in a pricing engine).

Appending predictions to the rating response is usually easier - each prediction is tied to a specific quote, and multiple models (e.g. frequency, severity, conversion) can be logged in a single step.

Captured inference data should include at minimum:
- Model version / alias used
- Input features
- Predicted values
- Quote identifier (quote ID, timestamp, etc.)
- Response (did the customer buy, and at what premium?)

This allows you to reconstruct exactly what the model “saw” at the time of pricing.
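As a rough illustration, here is a minimal sketch of an inference-logging helper in Python. The field names (`quote_id`, `model_alias`, and so on) and the JSON-lines file store are assumptions for the example; in practice the record would go to whatever logging or warehouse layer your rating service already uses.

```python
import datetime as dt
import json


def log_inference(store_path, quote_id, model_alias, features, prediction, response=None):
    """Append one prediction record to a JSON-lines log (illustrative storage choice)."""
    record = {
        "quote_id": quote_id,                                   # ties the prediction to a quote
        "timestamp": dt.datetime.now(dt.timezone.utc).isoformat(),
        "model_alias": model_alias,                             # model version/alias actually used
        "features": features,                                   # exact inputs the model saw
        "prediction": prediction,
        "response": response,                                   # filled in later: sold flag, premium
    }
    with open(store_path, "a") as f:
        f.write(json.dumps(record) + "\n")


# Example: log a frequency-model prediction at quote time
log_inference(
    "inference_log.jsonl",
    quote_id="Q-000123",
    model_alias="frequency@champion",
    features={"driver_age": 34, "vehicle_group": 17, "postcode_area": "SW1"},
    prediction=0.082,
)
```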


Monitor quoted vs. written business

It’s not enough to monitor just model predictions. In pricing, you care about both the quotes offered and the business actually written:

- Quoted data shows the distribution of risks the model was applied to (competitiveness, demand curve, exposure mix).
- Written data shows the subset of risks customers actually accepted (revealing profitability and the risk of adverse selection).

Monitoring both lets you detect issues such as:
- The model working well technically but being uncompetitive in key segments.
- Shifts in the mix of written risks that make the portfolio riskier than expected.
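For illustration, the quoted and written mix can be compared directly from the inference log with a few lines of pandas. This is only a sketch: the `segment` and `sold` columns are hypothetical names for however your data labels customer segments and sales.

```python
import pandas as pd

# Quote-level log with a sale flag (column names are illustrative)
quotes = pd.DataFrame({
    "segment": ["young", "young", "mid", "mid", "senior", "senior", "young", "mid"],
    "sold":    [1,        1,       0,     1,     0,        0,        1,       0],
})

mix = pd.DataFrame({
    "quoted_share": quotes["segment"].value_counts(normalize=True),
    "written_share": quotes.loc[quotes["sold"] == 1, "segment"].value_counts(normalize=True),
}).fillna(0.0)
mix["conversion"] = quotes.groupby("segment")["sold"].mean()

# A segment converting far above average may be under-priced (adverse selection);
# one that barely converts may be technically sound but uncompetitive.
print(mix)
```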


Data Drift

Data drift occurs when the distribution of input features changes over time compared to training.

Examples in pricing:
- A new competitor entering the market with very different prices.
- A shift in customer demographics (e.g. more young drivers buying online).
- Economic changes (e.g. inflation impacting claim amounts).

How to detect:
- Monitor statistical measures like population stability index (PSI) or KL-divergence between training and live data.
- Track feature summary stats (mean, std, quantiles) over time.
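A minimal PSI implementation in NumPy, as a sketch of the kind of check that could run per feature; the bin count and the 0.1 / 0.25 rule-of-thumb thresholds in the comment are common conventions, not hard rules.

```python
import numpy as np


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference (training) sample and live data.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Quantile bin edges taken from the reference distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])      # keep out-of-range values in the end bins

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid log(0) and division by zero in empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Example: driver age at training time vs. the latest month of quotes
rng = np.random.default_rng(0)
train_age = rng.normal(45, 12, 50_000)
live_age = rng.normal(41, 12, 10_000)   # younger mix, e.g. more young drivers buying online
print(f"PSI: {psi(train_age, live_age):.3f}")
```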


Prediction Drift

Even if inputs haven’t drifted, the distribution of predictions might.

Examples:
- Frequency model now outputs higher claim frequency across the portfolio.
- Conversion model predicts much lower take-up rates after a pricing change.

How to detect:
- Compare live prediction distributions vs. training/validation.
- Use calibration plots to check predicted vs. observed outcomes (when outcomes are available).
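As a sketch of the second check, a decile-style calibration table in pandas; it assumes the inference log has already been joined to outcomes, and the column handling is illustrative.

```python
import numpy as np
import pandas as pd


def calibration_table(predicted, observed, n_bins=10):
    """Average predicted vs. observed outcome per prediction decile."""
    df = pd.DataFrame({"predicted": predicted, "observed": observed})
    df["bin"] = pd.qcut(df["predicted"], q=n_bins, duplicates="drop")
    return (
        df.groupby("bin", observed=True)
          .agg(mean_predicted=("predicted", "mean"),
               mean_observed=("observed", "mean"),
               n=("observed", "size"))
          .reset_index()
    )


# Example: conversion model -- predicted take-up probability vs. actual sales
rng = np.random.default_rng(1)
pred = rng.uniform(0.05, 0.6, 20_000)
sold = rng.binomial(1, pred * 0.9)   # live take-up running ~10% below prediction
print(calibration_table(pred, sold))
```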


Feature Attribution Drift

Modern models (e.g. CatBoost, LightGBM) support SHAP values and similar attribution methods that show which features drive each prediction.

Drift in feature attributions can highlight why predictions are changing:
- Example: if “postcode” was the main driver historically, but suddenly “vehicle type” dominates, it may signal data capture issues or market shifts.

Tracking attribution drift is a way to catch silent failures where predictions look stable but the model is “using” data differently.
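One possible sketch of this, using LightGBM's built-in `pred_contrib` output (SHAP-style contributions) to compare each feature's share of total attribution between a reference period and a live period. The toy data and the `attribution_drift` helper are illustrative, not a prescribed implementation.

```python
import lightgbm as lgb
import numpy as np
import pandas as pd


def mean_abs_attribution(model, X):
    """Mean absolute SHAP-style contribution per feature."""
    contrib = model.predict(X, pred_contrib=True)[:, :-1]   # last column is the bias term
    return pd.Series(np.abs(contrib).mean(axis=0), index=X.columns)


def attribution_drift(model, X_ref, X_live):
    """Each feature's share of total attribution, reference vs. live period."""
    ref = mean_abs_attribution(model, X_ref)
    live = mean_abs_attribution(model, X_live)
    out = pd.DataFrame({"ref_share": ref / ref.sum(), "live_share": live / live.sum()})
    out["shift"] = out["live_share"] - out["ref_share"]
    return out.sort_values("shift", key=abs, ascending=False)


# Toy example: train on a reference period, then score a live period where
# one feature's distribution has moved.
rng = np.random.default_rng(2)
X_ref = pd.DataFrame({"driver_age": rng.normal(45, 12, 5_000),
                      "vehicle_group": rng.integers(1, 50, 5_000)})
y = 0.05 * X_ref["driver_age"] + 0.02 * X_ref["vehicle_group"] + rng.normal(0, 1, 5_000)
model = lgb.LGBMRegressor(n_estimators=50).fit(X_ref, y)

X_live = X_ref.copy()
X_live["vehicle_group"] = rng.integers(30, 50, 5_000)   # mix shifts towards high groups
print(attribution_drift(model, X_ref, X_live))
```

A large positive or negative shift for a feature (e.g. vehicle type suddenly overtaking postcode) is exactly the kind of silent change worth investigating.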


Model Performance

Ultimately, monitoring should measure whether the model still achieves its business objective:

- Underwriting models (frequency/severity): Compare predicted vs. observed loss ratios, deviance, or Gini across time.
- Conversion models: Compare predicted vs. actual quote-to-sale rates.
- Combined pricing models: Track technical price adequacy, expected vs. actual profitability, and competitive position.

Important: there’s often a long lag between writing business and its claims experience fully developing. Leading indicators like the mix of business written or early claim frequencies can provide faster feedback before full performance emerges.
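The sketch below shows two of these checks in Python: a simple Gini on written business and a monthly actual-vs-expected loss ratio. The column names and figures are made up for illustration; real monitoring would read from the inference log joined to claims data.

```python
import numpy as np
import pandas as pd


def gini(observed, predicted):
    """Gini coefficient of how well `predicted` ranks the observed losses."""
    observed = np.asarray(observed, dtype=float)
    order = np.argsort(np.asarray(predicted))                        # least risky first
    lorenz = np.insert(np.cumsum(observed[order]) / observed.sum(), 0, 0.0)
    area = np.sum((lorenz[:-1] + lorenz[1:]) / 2) / len(observed)    # area under the Lorenz curve
    return 1 - 2 * area


# Ranking check on written business (toy data)
rng = np.random.default_rng(3)
pred_cost = rng.gamma(2.0, 50.0, 10_000)                  # model's predicted cost per policy
observed_loss = rng.poisson(pred_cost / 50.0) * 500.0     # noisy observed losses
print(f"Gini: {gini(observed_loss, pred_cost):.3f}")

# Monthly actual vs. expected losses (a leading indicator while claims develop)
written = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "expected_loss": [120.0, 200.0, 150.0, 90.0],
    "incurred_loss": [110.0, 260.0, 140.0, 95.0],
})
monthly = written.groupby("month")[["incurred_loss", "expected_loss"]].sum()
monthly["actual_vs_expected"] = monthly["incurred_loss"] / monthly["expected_loss"]
print(monthly)   # a ratio drifting above 1.0 is an early warning
```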


Practical Implementation Tips

- Dashboards: Build automated dashboards (e.g. in PowerBI, Superset, or Databricks SQL) to track drift and performance KPIs.
- Alerting: Set thresholds for drift/performance and trigger alerts if exceeded.
- Champion/Challenger: Monitor both live “Champion” and experimental “Challenger” models in parallel.
- Governance: Store monitoring outputs in the same registry/logging system as models for full auditability.
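As an example of the alerting idea, a minimal threshold check that could run after each monitoring batch; the metric names, thresholds, and notify hook are all placeholders, not a prescribed setup.

```python
# Illustrative thresholds keyed by monitoring metric name
THRESHOLDS = {
    "psi_driver_age": 0.25,        # major population shift
    "prediction_psi": 0.25,
    "actual_vs_expected": 1.10,    # incurred losses >10% above plan
}


def check_and_alert(metrics: dict, notify=print):
    """Compare the latest metric values against thresholds and raise alerts for breaches."""
    breaches = {name: value for name, value in metrics.items()
                if name in THRESHOLDS and value > THRESHOLDS[name]}
    for name, value in breaches.items():
        notify(f"ALERT: {name}={value:.3f} exceeds threshold {THRESHOLDS[name]:.2f}")
    return breaches


# Example: one breach (driver-age PSI), one metric within tolerance
check_and_alert({"psi_driver_age": 0.31, "actual_vs_expected": 1.04})
```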