5.0. Modelling Data
Modelling data should be built from the same pipelines across models and between development and live environments.
This ensures consistency across projects, prevents small differences from creeping in, and reduces the risk of a model that works in development but breaks in production.
Data should ideally be productionised so that it updates automatically, making modelling-ready datasets always available to analysts without repeated manual effort.
Data Preparation Consistency
- Single Source of Truth – transformations (e.g. driver age, NCD years, claims history) should be defined once and reused everywhere.
- Reusable Feature Engineering – functions for commonly used variables (e.g. youngest driver age, prior claims frequency, vehicle groupings) can be packaged into libraries; see the sketch after this list.
- Versioning – datasets used for model training should be versioned to ensure reproducibility of results.
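A minimal sketch of the reusable-library idea, assuming pandas and illustrative column names (`quote_id`, `date_of_birth`, `quote_date`); the real schema will differ:

```python
import pandas as pd

def driver_age(df: pd.DataFrame, dob_col: str = "date_of_birth",
               as_at_col: str = "quote_date") -> pd.Series:
    """Driver age in whole years as at the quote date.

    Defined once in a shared library and imported by every pipeline,
    so training and live scoring cannot disagree on the definition.
    """
    # Approximate whole years; a production version would handle leap years.
    return ((df[as_at_col] - df[dob_col]).dt.days // 365).astype("int64")

def youngest_driver_age(drivers: pd.DataFrame) -> pd.Series:
    """Youngest driver's age per quote, reusing the single age definition."""
    return driver_age(drivers).groupby(drivers["quote_id"]).min()
```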
Handling Categorical Variables
- One-Hot Encoding (OHE) – expands categorical fields into multiple binary columns. Works well for low-cardinality variables (e.g. fuel type); see the encoding sketch after this list.
- Index Encoding – assigns each level an integer. Workable for tree-based models, but the arbitrary ordering can mislead linear models and hurts interpretability.
- Libraries with Native Support – e.g. CatBoost, LightGBM, and XGBoost can handle categorical variables directly if given the column list.
- High-Cardinality Fields – features like postcodes or makes/models of vehicles may require grouping, target encoding, or embeddings to avoid exploding dimensionality.
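As a sketch of the three encoding routes side by side (the toy DataFrame is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "fuel_type": ["petrol", "diesel", "electric", "petrol"],
    "make": ["Ford", "BMW", "Ford", "Kia"],
})

# One-hot encoding: fine for low-cardinality fields such as fuel type.
ohe = pd.get_dummies(df, columns=["fuel_type"])

# Index encoding: each level becomes an arbitrary integer, so this
# mainly suits tree-based models.
df["make_idx"] = df["make"].astype("category").cat.codes

# Native support: LightGBM, CatBoost, and XGBoost can consume categorical
# columns directly, e.g. lgb.Dataset(X, label=y, categorical_feature=["make"])
# or the pandas "category" dtype.
```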
Incremental Loading
Refreshing datasets end-to-end can be slow and inefficient. Instead:
- Use incremental loading, where only new or updated records (e.g. new quotes, policies, claims) are processed.
- Store checkpoints to track the last successful load (see the sketch after this list).
- This keeps modelling datasets aligned with production data without overloading source systems.
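A minimal checkpoint sketch, assuming hypothetical `extract_since` / `append_to_store` hooks onto the source system and modelling store, and an assumed `updated_at` column:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/quotes_load.json")

def incremental_load(extract_since, append_to_store):
    """Process only records created or updated since the last successful load."""
    # Fall back to a full load if no checkpoint exists yet.
    last_ts = (json.loads(CHECKPOINT.read_text())["last_loaded"]
               if CHECKPOINT.exists() else None)

    new_records = extract_since(last_ts)  # only new/updated quotes, policies, claims
    if len(new_records) == 0:
        return

    append_to_store(new_records)

    # Advance the checkpoint only after the load succeeds, so a failed
    # run is simply retried from the previous watermark.
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(
        {"last_loaded": str(new_records["updated_at"].max())}
    ))
```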
Data Quality and Validation
Before modelling, datasets should go through automated checks (a minimal sketch follows the list):
- Completeness – are all required fields populated?
- Consistency – do transformations produce the same outputs each time?
- Distributional Shifts – monitor whether new data is materially different from training data.
- Leakage Checks – ensure no information from the future (e.g. claim outcomes) leaks into training features.
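A sketch of what such checks might look like; the column names (`feature_as_at`, `quote_date`) are assumptions for illustration:

```python
import pandas as pd

def validate(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Run basic automated checks; returns failure messages (empty = pass)."""
    failures = []

    # Completeness: required fields must be fully populated.
    for col in required:
        n_missing = int(df[col].isna().sum())
        if n_missing:
            failures.append(f"{col}: {n_missing} missing values")

    # Leakage: every feature value must have been knowable at quote time.
    # `feature_as_at` / `quote_date` are illustrative column names.
    if {"feature_as_at", "quote_date"} <= set(df.columns):
        n_future = int((df["feature_as_at"] > df["quote_date"]).sum())
        if n_future:
            failures.append(f"{n_future} rows use post-quote information")

    return failures
```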
Train/Validation/Test Splits
- Temporal Splits – in insurance, splitting by time (e.g. training on Jan–Jun, testing on Jul–Aug) often makes more sense than random splits, since distributions drift over time; see the split sketch after this list.
- Cross-Validation – helps test model robustness, especially when datasets are not huge.
- Out-of-Time Samples – ensure models generalise to genuinely unseen data.
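A sketch of a temporal split on an assumed `quote_date` column:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str,
                   train_end: str, test_end: str):
    """Split by time rather than at random, since distributions drift."""
    train_cutoff = pd.Timestamp(train_end)
    test_cutoff = pd.Timestamp(test_end)
    train = df[df[date_col] <= train_cutoff]
    test = df[(df[date_col] > train_cutoff) & (df[date_col] <= test_cutoff)]
    return train, test

# e.g. train on Jan-Jun 2024, hold out Jul-Aug as an out-of-time sample:
# train, test = temporal_split(quotes, "quote_date", "2024-06-30", "2024-08-31")
```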
Governance and Traceability
- Auditability – modelling datasets should be logged with metadata (date built, source tables, transformation logic); see the manifest sketch after this list.
- Reproducibility – anyone should be able to rebuild the exact dataset used for a model.
- Data Lineage – clear trace from raw data through transformations to final modelling dataset.
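One lightweight way to capture all three, sketched below; the manifest fields are illustrative rather than a fixed schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(df, source_tables: list[str], transform_version: str) -> dict:
    """Metadata needed to audit and exactly rebuild a modelling dataset."""
    # A content hash lets anyone verify they have rebuilt the identical data.
    # (Hashing a CSV dump is simple but slow for very large frames.)
    content_hash = hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()
    return {
        "built_at": datetime.now(timezone.utc).isoformat(),
        "source_tables": source_tables,          # lineage: where the raw data came from
        "transform_version": transform_version,  # e.g. a git tag of the feature library
        "n_rows": len(df),
        "sha256": content_hash,
    }

# json.dumps(build_manifest(train_df, ["quotes", "claims"], "v1.4.2"), indent=2)
```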
Extensions for Pricing Teams
- Scenario/Impact Datasets – in addition to training sets, create datasets that allow impact testing of rating changes.
- Feature Stores – centralise common derived features (e.g. driver count, youngest driver age) for consistency across models.
- Monitoring Feeds – production data should be structured so it can feed monitoring dashboards (e.g. drift detection, stability metrics); a PSI sketch follows.
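For the monitoring feeds, a common stability metric is the Population Stability Index. A sketch follows; the 0.1/0.25 thresholds are a widely used rule of thumb, not a standard:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a baseline (e.g. training data)
    and a recent production sample. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 material shift.
    """
    # Bin edges from the baseline so both samples are bucketed identically
    # (assumes a continuous feature, so quantile edges are distinct).
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```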