5.1. Modelling Data
Modelling data should be built with the same pipelines across models and live environments.
This ensures consistency across projects, prevents small differences from creeping in, and reduces the risk of a model that works in development but breaks in production.
Data should ideally be productionised so that it updates automatically, making modelling-ready datasets always available to analysts without repeated manual effort.
Data Preparation Consistency
- Single Source of Truth – transformations (e.g. driver age, NCD years, claims history) should be defined once and reused everywhere.
- Versioning – datasets used for model training should be versioned to ensure reproducibility of results.
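A single source of truth is easiest to enforce when each transformation lives in one shared function that both training pipelines and the live scoring service import. A minimal sketch (the function names and the NCD cap of 9 are illustrative assumptions, not a fixed standard):

```python
from datetime import date

# Hypothetical shared feature library: each transformation is defined once
# and reused by training pipelines and the live environment alike.

def driver_age(date_of_birth: date, quote_date: date) -> int:
    """Whole years between date of birth and quote date."""
    years = quote_date.year - date_of_birth.year
    # Subtract one if the birthday has not yet occurred this year.
    if (quote_date.month, quote_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

def ncd_years(claim_free_years: int, cap: int = 9) -> int:
    """No-claims discount years, capped (the cap of 9 is an assumption)."""
    return max(0, min(claim_free_years, cap))
```

Because every project calls the same functions, a change to a definition propagates everywhere at once rather than drifting between copies.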
Handling Categorical Variables
- One Hot Encoding (OHE) – expands categorical fields into multiple binary columns. Works well for low-cardinality variables (e.g. fuel type).
- Index Encoding – assigns each level an integer. Required by some model libraries, but less interpretable and can imply a spurious ordering between levels.
- High-Cardinality Fields – features like postcodes or makes/models of vehicles may require grouping, target encoding, or embeddings to avoid exploding dimensionality.
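The three strategies above can be sketched in a few lines each (plain-Python illustrations; in practice libraries such as pandas or scikit-learn provide these, and the smoothing weight below is an assumed value):

```python
from collections import defaultdict

def one_hot(values, levels):
    """Expand a categorical column into one binary column per level."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

def index_encode(values, levels):
    """Map each level to an integer index."""
    lookup = {lvl: i for i, lvl in enumerate(levels)}
    return [lookup[v] for v in values]

def target_encode(values, targets, prior, weight=10):
    """Smoothed target encoding for high-cardinality fields: blend each
    level's mean target with a global prior so rare levels are shrunk
    towards it (weight=10 is an illustrative choice)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return {v: (sums[v] + prior * weight) / (counts[v] + weight) for v in counts}
```

One-hot suits a field like fuel type with a handful of levels; for thousands of postcodes, the smoothed target encoding keeps dimensionality flat at one column.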
Incremental Loading
Rather than running ad hoc scripts to create batches of data, consider maintaining a single modelling table that updates incrementally; training extracts can then simply be filtered by date range.
- Use incremental loading, where only new or updated records (e.g. new quotes, policies, claims) are processed
- Store checkpoints to track the last successful load
- Useful for keeping modelling datasets aligned with production data without overloading systems
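The checkpoint pattern above can be sketched as follows (an in-memory illustration assuming each record carries an `updated_at` timestamp; a real pipeline would persist the checkpoint and write to a database):

```python
def incremental_load(source_rows, table, last_loaded):
    """Append only records newer than the checkpoint, then return the
    new checkpoint (the latest timestamp successfully loaded)."""
    new_rows = [r for r in source_rows if r["updated_at"] > last_loaded]
    table.extend(new_rows)
    # If nothing new arrived, the checkpoint is unchanged.
    return max((r["updated_at"] for r in new_rows), default=last_loaded)
```

Re-running the load with the same checkpoint is harmless: already-loaded records are filtered out, so the table stays aligned with production without reprocessing history.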
Data Quality and Validation
Before modelling, datasets should go through automated checks:
- Completeness – are all required fields populated?
- Consistency – do transformations produce the same outputs each time?
- Distributional Shifts – monitor whether new data is materially different from training data
- Leakage Checks – ensure no information from the future (e.g. claim outcomes) leaks into training features
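Two of these checks are simple enough to sketch directly, a completeness check and a crude drift check (the 20% threshold is an assumption; in practice PSI or KS tests are common for distributional shift):

```python
def completeness(rows, required):
    """Return indices of rows with any missing required field."""
    return [i for i, r in enumerate(rows)
            if any(r.get(f) is None for f in required)]

def mean_shift(train_values, new_values, threshold=0.2):
    """Flag a material shift when the mean of new data moves more than
    `threshold` relative to the training mean (illustrative heuristic)."""
    old = sum(train_values) / len(train_values)
    new = sum(new_values) / len(new_values)
    return abs(new - old) / abs(old) > threshold
```

Checks like these belong in the pipeline itself, so a failing dataset is rejected before any model is trained on it.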
Train/Validation/Test Splits
- Reproducible Assignment – assign each record to a Train, Validation, or Test group deterministically, so the same split can be recreated exactly on any rebuild of the dataset
- Temporal Splits – splitting by time (e.g. training on Jan–Jun, testing on Jul–Aug) often makes more sense than random splits, since distributions drift over time
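One reproducible way to assign groups (a sketch, assuming each record has a stable identifier such as a policy or quote number, and illustrative 70/15/15 proportions) is to hash the ID into a bucket:

```python
import hashlib

def assign_split(record_id: str, train=0.7, validation=0.15) -> str:
    """Deterministic split from a stable ID: the same record always lands
    in the same group, on any machine, with no random seed to manage."""
    bucket = int(hashlib.sha256(record_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"
```

For temporal splits the same idea applies with a date filter instead of a hash; the key property in both cases is that the assignment is a pure function of the data, not of run order.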
Governance and Traceability
- Auditability – modelling datasets should be logged with metadata (date built, source tables, transformation logic).
- Reproducibility – anyone should be able to rebuild the exact dataset used for a model
- Data Lineage – clear trace from raw data through transformations to final modelling dataset
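A lightweight way to make datasets auditable and verifiably reproducible (a sketch; the field names are assumptions, not a fixed schema) is to log a metadata record with a content hash, so a rebuilt dataset can be checked byte-for-byte against the original:

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_metadata(rows, source_tables, transform_version):
    """Build an audit record for a modelling dataset. The content hash
    depends only on the rows, so two identical rebuilds match even if
    they were built at different times."""
    payload = json.dumps(rows, sort_keys=True, default=str).encode()
    return {
        "built_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "content_hash": hashlib.sha256(payload).hexdigest(),
        "source_tables": source_tables,
        "transform_version": transform_version,
    }
```

Storing this record alongside the dataset gives the lineage trail in miniature: when it was built, from which sources, with which transformation version, and a fingerprint to prove a rebuild is exact.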