5.1. Modelling Data
Modelling data should be built with the same pipelines across models and live environments.
This ensures consistency across projects, prevents small differences from creeping in, and reduces the risk of a model that works in development but breaks in production.
Data should ideally be productionised so that it updates automatically, making modelling-ready datasets always available to analysts without repeated manual effort.
Data Preparation Consistency
- Single Source of Truth – transformations (e.g. driver age, NCD years, claims history) should be defined once and reused everywhere.
- Versioning – datasets used for model training should be versioned to ensure reproducibility of results.
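A single source of truth is easiest to enforce when each transformation lives in one shared function that both training pipelines and the live scoring service import. A minimal sketch (the function names and the NCD cap of 9 are illustrative assumptions, not a fixed standard):

```python
from datetime import date

# Hypothetical shared feature library: each transformation is defined once
# and reused by training pipelines and the live environment alike.

def driver_age(date_of_birth: date, quote_date: date) -> int:
    """Whole years between date of birth and quote date."""
    years = quote_date.year - date_of_birth.year
    # Subtract one if the birthday has not yet occurred this year.
    if (quote_date.month, quote_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

def ncd_years(claim_free_years: int, cap: int = 9) -> int:
    """No-claims discount years, capped (the cap of 9 is an assumption)."""
    return max(0, min(claim_free_years, cap))
```

Because every project calls the same functions, a change to a definition propagates everywhere at once rather than drifting between copies.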
Handling Categorical Variables
- One Hot Encoding (OHE) – expands categorical fields into multiple binary columns. Works well for low-cardinality variables (e.g. fuel type).
- Index Encoding – assigns each level an integer. Required by some model libraries, but less interpretable and can imply a spurious ordering between levels.
- High-Cardinality Fields – features like postcodes or makes/models of vehicles may require grouping, target encoding, or embeddings to avoid exploding dimensionality.
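The three strategies above can be sketched in a few lines each (plain-Python illustrations; in practice libraries such as pandas or scikit-learn provide these, and the smoothing weight below is an assumed value):

```python
from collections import defaultdict

def one_hot(values, levels):
    """Expand a categorical column into one binary column per level."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

def index_encode(values, levels):
    """Map each level to an integer index."""
    lookup = {lvl: i for i, lvl in enumerate(levels)}
    return [lookup[v] for v in values]

def target_encode(values, targets, prior, weight=10):
    """Smoothed target encoding for high-cardinality fields: blend each
    level's mean target with a global prior so rare levels are shrunk
    towards it (weight=10 is an illustrative choice)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return {v: (sums[v] + prior * weight) / (counts[v] + weight) for v in counts}
```

One-hot suits a field like fuel type with a handful of levels; for thousands of postcodes, the smoothed target encoding keeps dimensionality flat at one column.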
Incremental Loading
Rather than running ad hoc scripts to create batches of data, consider maintaining a single modelling table that updates incrementally; training extracts can then simply be filtered by date range.
- Use incremental loading, where only new or updated records (e.g. new quotes, policies, claims) are processed
- Store checkpoints to track the last successful load
- Useful for keeping modelling datasets aligned with production data without overloading systems
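The checkpoint pattern above can be sketched as follows (an in-memory illustration assuming each record carries an `updated_at` timestamp; a real pipeline would persist the checkpoint and write to a database):

```python
def incremental_load(source_rows, table, last_loaded):
    """Append only records newer than the checkpoint, then return the
    new checkpoint (the latest timestamp successfully loaded)."""
    new_rows = [r for r in source_rows if r["updated_at"] > last_loaded]
    table.extend(new_rows)
    # If nothing new arrived, the checkpoint is unchanged.
    return max((r["updated_at"] for r in new_rows), default=last_loaded)
```

Re-running the load with the same checkpoint is harmless: already-loaded records are filtered out, so the table stays aligned with production without reprocessing history.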
Data Quality and Validation
Before modelling, datasets should go through automated checks:
- Completeness – are all required fields populated?
- Consistency – do transformations produce the same outputs each time?
- Distributional Shifts – monitor whether new data is materially different from training data
- Leakage Checks – ensure no information from the future (e.g. claim outcomes) leaks into training features
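Two of these checks are simple enough to sketch directly, a completeness check and a crude drift check (the 20% threshold is an assumption; in practice PSI or KS tests are common for distributional shift):

```python
def completeness(rows, required):
    """Return indices of rows with any missing required field."""
    return [i for i, r in enumerate(rows)
            if any(r.get(f) is None for f in required)]

def mean_shift(train_values, new_values, threshold=0.2):
    """Flag a material shift when the mean of new data moves more than
    `threshold` relative to the training mean (illustrative heuristic)."""
    old = sum(train_values) / len(train_values)
    new = sum(new_values) / len(new_values)
    return abs(new - old) / abs(old) > threshold
```

Checks like these belong in the pipeline itself, so a failing dataset is rejected before any model is trained on it.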
Train/Validation/Test Splits
- Reproducible Assignment – assign each record to a Train, Validation, or Test group deterministically, so the same split can be recreated exactly on any rebuild of the dataset
- Temporal Splits – splitting by time (e.g. training on Jan–Jun, testing on Jul–Aug) often makes more sense than random splits, since distributions drift over time
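One reproducible way to assign groups (a sketch, assuming each record has a stable identifier such as a policy or quote number, and illustrative 70/15/15 proportions) is to hash the ID into a bucket:

```python
import hashlib

def assign_split(record_id: str, train=0.7, validation=0.15) -> str:
    """Deterministic split from a stable ID: the same record always lands
    in the same group, on any machine, with no random seed to manage."""
    bucket = int(hashlib.sha256(record_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"
```

For temporal splits the same idea applies with a date filter instead of a hash; the key property in both cases is that the assignment is a pure function of the data, not of run order.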
Governance and Traceability
- Auditability – modelling datasets should be logged with metadata (date built, source tables, transformation logic).
- Reproducibility – anyone should be able to rebuild the exact dataset used for a model
- Data Lineage – clear trace from raw data through transformations to final modelling dataset
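A lightweight way to make datasets auditable and verifiably reproducible (a sketch; the field names are assumptions, not a fixed schema) is to log a metadata record with a content hash, so a rebuilt dataset can be checked byte-for-byte against the original:

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_metadata(rows, source_tables, transform_version):
    """Build an audit record for a modelling dataset. The content hash
    depends only on the rows, so two identical rebuilds match even if
    they were built at different times."""
    payload = json.dumps(rows, sort_keys=True, default=str).encode()
    return {
        "built_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "content_hash": hashlib.sha256(payload).hexdigest(),
        "source_tables": source_tables,
        "transform_version": transform_version,
    }
```

Storing this record alongside the dataset gives the lineage trail in miniature: when it was built, from which sources, with which transformation version, and a fingerprint to prove a rebuild is exact.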