5.0. Modelling Data
Modelling data should be built from the same pipelines across models and between development and live environments.
This ensures consistency across projects, prevents small differences from creeping in, and reduces the risk of a model that works in development but breaks in production.
Data should ideally be productionised so that it updates automatically, making modelling-ready datasets always available to analysts without repeated manual effort.
Data Preparation Consistency
- Single Source of Truth – transformations (e.g. driver age, NCD years, claims history) should be defined once and reused everywhere.
- Reusable Feature Engineering – functions for commonly used variables (e.g. youngest driver age, prior claims frequency, vehicle groupings) can be packaged into libraries; see the sketch after this list.
- Versioning – datasets used for model training should be versioned to ensure reproducibility of results.
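A minimal sketch of the reusable-library idea, assuming pandas and illustrative column names (`quote_id`, `date_of_birth`, `quote_date`); the real schema will differ:

```python
import pandas as pd

def driver_age(df: pd.DataFrame, dob_col: str = "date_of_birth",
               as_at_col: str = "quote_date") -> pd.Series:
    """Driver age in whole years as at the quote date.

    Defined once in a shared library and imported by every pipeline,
    so training and live scoring cannot disagree on the definition.
    """
    # Approximate whole years; a production version would handle leap years.
    return ((df[as_at_col] - df[dob_col]).dt.days // 365).astype("int64")

def youngest_driver_age(drivers: pd.DataFrame) -> pd.Series:
    """Youngest driver's age per quote, reusing the single age definition."""
    return driver_age(drivers).groupby(drivers["quote_id"]).min()
```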
Handling Categorical Variables
- One-Hot Encoding (OHE) – expands categorical fields into multiple binary columns. Works well for low-cardinality variables (e.g. fuel type); see the encoding sketch after this list.
- Index Encoding – assigns each level an integer. Workable for tree-based models, but the arbitrary ordering can mislead linear models and hurts interpretability.
- Libraries with Native Support – e.g. CatBoost, LightGBM, and XGBoost can handle categorical variables directly if given the column list.
- High-Cardinality Fields – features like postcodes or makes/models of vehicles may require grouping, target encoding, or embeddings to avoid exploding dimensionality.
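As a sketch of the three encoding routes side by side (the toy DataFrame is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "fuel_type": ["petrol", "diesel", "electric", "petrol"],
    "make": ["Ford", "BMW", "Ford", "Kia"],
})

# One-hot encoding: fine for low-cardinality fields such as fuel type.
ohe = pd.get_dummies(df, columns=["fuel_type"])

# Index encoding: each level becomes an arbitrary integer, so this
# mainly suits tree-based models.
df["make_idx"] = df["make"].astype("category").cat.codes

# Native support: LightGBM, CatBoost, and XGBoost can consume categorical
# columns directly, e.g. lgb.Dataset(X, label=y, categorical_feature=["make"])
# or the pandas "category" dtype.
```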
Incremental Loading
Refreshing datasets end-to-end can be slow and inefficient. Instead:
- Use incremental loading, where only new or updated records (e.g. new quotes, policies, claims) are processed.
- Store checkpoints to track the last successful load (see the sketch after this list).
- This keeps modelling datasets aligned with production data without overloading source systems.
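A minimal checkpoint sketch, assuming hypothetical `extract_since` / `append_to_store` hooks onto the source system and modelling store, and an assumed `updated_at` column:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/quotes_load.json")

def incremental_load(extract_since, append_to_store):
    """Process only records created or updated since the last successful load."""
    # Fall back to a full load if no checkpoint exists yet.
    last_ts = (json.loads(CHECKPOINT.read_text())["last_loaded"]
               if CHECKPOINT.exists() else None)

    new_records = extract_since(last_ts)  # only new/updated quotes, policies, claims
    if len(new_records) == 0:
        return

    append_to_store(new_records)

    # Advance the checkpoint only after the load succeeds, so a failed
    # run is simply retried from the previous watermark.
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(
        {"last_loaded": str(new_records["updated_at"].max())}
    ))
```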
Data Quality and Validation
Before modelling, datasets should go through automated checks (a minimal sketch follows the list):
- Completeness – are all required fields populated?
- Consistency – do transformations produce the same outputs each time?
- Distributional Shifts – monitor whether new data is materially different from training data.
- Leakage Checks – ensure no information from the future (e.g. claim outcomes) leaks into training features.
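A sketch of what such checks might look like; the column names (`feature_as_at`, `quote_date`) are assumptions for illustration:

```python
import pandas as pd

def validate(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Run basic automated checks; returns failure messages (empty = pass)."""
    failures = []

    # Completeness: required fields must be fully populated.
    for col in required:
        n_missing = int(df[col].isna().sum())
        if n_missing:
            failures.append(f"{col}: {n_missing} missing values")

    # Leakage: every feature value must have been knowable at quote time.
    # `feature_as_at` / `quote_date` are illustrative column names.
    if {"feature_as_at", "quote_date"} <= set(df.columns):
        n_future = int((df["feature_as_at"] > df["quote_date"]).sum())
        if n_future:
            failures.append(f"{n_future} rows use post-quote information")

    return failures
```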
Train/Validation/Test Splits
- Temporal Splits – in insurance, splitting by time (e.g. training on Jan–Jun, testing on Jul–Aug) often makes more sense than random splits, since distributions drift over time; see the split sketch after this list.
- Cross-Validation – helps test model robustness, especially when datasets are not huge.
- Out-of-Time Samples – ensure models generalise to genuinely unseen data.
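A sketch of a temporal split on an assumed `quote_date` column:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str,
                   train_end: str, test_end: str):
    """Split by time rather than at random, since distributions drift."""
    train_cutoff = pd.Timestamp(train_end)
    test_cutoff = pd.Timestamp(test_end)
    train = df[df[date_col] <= train_cutoff]
    test = df[(df[date_col] > train_cutoff) & (df[date_col] <= test_cutoff)]
    return train, test

# e.g. train on Jan-Jun 2024, hold out Jul-Aug as an out-of-time sample:
# train, test = temporal_split(quotes, "quote_date", "2024-06-30", "2024-08-31")
```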
Governance and Traceability
- Auditability – modelling datasets should be logged with metadata (date built, source tables, transformation logic); see the manifest sketch after this list.
- Reproducibility – anyone should be able to rebuild the exact dataset used for a model.
- Data Lineage – clear trace from raw data through transformations to final modelling dataset.
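One lightweight way to capture all three, sketched below; the manifest fields are illustrative rather than a fixed schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(df, source_tables: list[str], transform_version: str) -> dict:
    """Metadata needed to audit and exactly rebuild a modelling dataset."""
    # A content hash lets anyone verify they have rebuilt the identical data.
    # (Hashing a CSV dump is simple but slow for very large frames.)
    content_hash = hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()
    return {
        "built_at": datetime.now(timezone.utc).isoformat(),
        "source_tables": source_tables,          # lineage: where the raw data came from
        "transform_version": transform_version,  # e.g. a git tag of the feature library
        "n_rows": len(df),
        "sha256": content_hash,
    }

# json.dumps(build_manifest(train_df, ["quotes", "claims"], "v1.4.2"), indent=2)
```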
Extensions for Pricing Teams
- Scenario/Impact Datasets – in addition to training sets, create datasets that allow impact testing of rating changes.
- Feature Stores – centralise common derived features (e.g. driver count, youngest driver age) for consistency across models.
- Monitoring Feeds – production data should be structured so it can feed monitoring dashboards (e.g. drift detection, stability metrics); a PSI sketch follows.
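For the monitoring feeds, a common stability metric is the Population Stability Index. A sketch follows; the 0.1/0.25 thresholds are a widely used rule of thumb, not a standard:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a baseline (e.g. training data)
    and a recent production sample. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 material shift.
    """
    # Bin edges from the baseline so both samples are bucketed identically
    # (assumes a continuous feature, so quantile edges are distinct).
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```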