5.2. Training & Validation

The goal of modelling pipelines is to make training repeatable, configurable, and transparent, so that multiple analysts can run models consistently without rewriting code each time.


Abstract Training Pipelines & Config Files

Most training and validation logic can be written once and reused.
Instead of hardcoding settings into scripts, store them in configuration files.

Configuration might include:
- Target (response variable)
- Exposure/offset
- Weighting
- Feature list
- Loss function
- Hyperparameters

This approach ensures:
- Consistency across models
- Flexibility to test variations quickly
- Clear documentation of what was run

{
    "target": "ClaimCount",
    "exposure": "Exposure",
    "features": [
        "VehPower",
        "VehAge",
        "DrivAge",
        "BonusMalus",
        "VehBrand",
        "VehGas",
        "Area",
        "Density",
        "Region"
    ],
    "split": {
        "field": "Group",
        "assignment": {
            "1": "Train",
            "2": "Train",
            "3": "Train",
            "4": "Test",
            "5": "Holdout"
        }
    },
    "gbm_params": {
        "learning_params": {
            "loss_function": "Poisson",
            "learning_rate": 0.1,
            "depth": 3,
            "l2_leaf_reg": 2,
            "random_strength": 2,
            "bagging_temperature": 1,
            "verbose": 0
        },
        "num_rounds": 10000,
        "early_stopping_rounds": 10
    }
}

This can be read in as follows:

import json

with open('./config/frequency_config.json', 'r') as f:
    config = json.load(f)
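
Because everything downstream relies on these keys, a lightweight check that the config actually contains them can catch typos before any training starts. A minimal sketch, where the set of required keys is illustrative:

REQUIRED_KEYS = {'target', 'exposure', 'features', 'split', 'gbm_params'}  # illustrative list of expected keys

missing = REQUIRED_KEYS - config.keys()
if missing:
    raise KeyError(f'Config is missing required keys: {sorted(missing)}')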

Dynamic Feature Lists

Avoid hardcoding long lists of feature names anywhere other than the config file.

Feature lists can be worked with dynamically to add and remove features, and to identify categorical and continuous features.

from typing import List

import polars as pl


def define_categorical_columns(data: pl.DataFrame, features: List[str]) -> List[str]:
    """Return the requested features that are string-typed in the data, i.e. categorical."""
    string_columns = [name for name, dtype in zip(data.columns, data.dtypes) if dtype == pl.Utf8]
    categorical_features = [field for field in string_columns if field in features]

    return categorical_features


def define_continuous_columns(features: List[str], categorical_features: List[str]) -> List[str]:
    """Return the requested features that are not categorical, i.e. continuous."""
    continuous_features = [feature for feature in features if feature not in categorical_features]

    return continuous_features

features = config.get('features')
categorical_features = define_categorical_columns(frequency, features)
continuous_features = define_continuous_columns(features, categorical_features)
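
Because the feature list is just a Python list, quick experiments such as dropping or adding a feature can be expressed as small list operations rather than edits to the pipeline. A minimal sketch, where the dropped column and the added 'VehAgeBand' name are purely illustrative:

# drop a feature for a quick experiment without editing the config on disk
experiment_features = [f for f in features if f != 'Density']

# add an engineered feature created upstream (illustrative name)
experiment_features = experiment_features + ['VehAgeBand']

experiment_categorical = define_categorical_columns(frequency, experiment_features)
experiment_continuous = define_continuous_columns(experiment_features, experiment_categorical)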

Training Splits

Applying the Train, Test and Holdout groups as defined in the config file.

def assign_split(data, split):
    """Assigns a split to the data based on the provided split dictionary."""
    field = split.get('field')
    assignment = split.get('assignment')

    # JSON keys are strings, so the split field is expected to be string-typed;
    # any values not present in the mapping are set to None
    data = (
        data
        .with_columns(pl.col(field).replace_strict(assignment, default=None).alias(field))
    )

    return data

split = config.get('split')
data = assign_split(data, split)
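
A quick sanity check that the mapping behaved as expected, and that no group values fell through to the default of None, can be printed straight after the assignment. A small sketch:

# count rows per split; a None group indicates values that were not in the config mapping
print(
    data
    .group_by('Group')
    .agg(pl.len().alias('rows'))
    .sort('Group')
)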

Creating the Train, Test, and Holdout datasets.

In this example the Pool class is used, which is a data format specific to CatBoost. Rather than including the step to create Pools inside the create_modelling_data() function, it is kept separate so the same outputs can be reused with other modelling techniques.

from catboost import Pool

def create_modelling_data(data, features, group_field, group, target, exposure = None):
    """Filter to one split group and return the filtered data, features, target, and (optionally) log exposure."""

    filtered_data = (
        data
        .filter(pl.col(group_field) == group)
    )

    train_X = (
        filtered_data
        .select(features)
        .to_pandas()
    )

    train_y = (
        filtered_data
        .select(target)
        .to_numpy()
        .ravel()
    )

    if exposure is not None:
        # log(exposure) is returned separately so it can be passed to CatBoost as a baseline (offset)
        train_exposure = (
            filtered_data
            .with_columns(log_exposure=pl.col(exposure).log())
            .select('log_exposure')
            .to_numpy()
            .ravel()
        )

        return filtered_data, train_X, train_y, train_exposure

    return filtered_data, train_X, train_y

target = config.get('target')
exposure = config.get('exposure')

train, X_train, y_train, log_exposure_train = create_modelling_data(data, features, 'Group', 'Train', target, exposure)
test, X_test, y_test, log_exposure_test = create_modelling_data(data, features, 'Group', 'Test', target, exposure)
holdout, X_holdout, y_holdout, log_exposure_holdout = create_modelling_data(data, features, 'Group', 'Holdout', target, exposure)

train_pool = Pool(X_train, label=y_train, cat_features=categorical_features, baseline=log_exposure_train)
test_pool = Pool(X_test, label=y_test, cat_features=categorical_features, baseline=log_exposure_test)
holdout_pool = Pool(X_holdout, label=y_holdout, cat_features=categorical_features, baseline=log_exposure_holdout)
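
Keeping the Pool step separate pays off when benchmarking against a different technique: the same outputs from create_modelling_data() can feed, for example, a scikit-learn Poisson GLM. A rough sketch on the continuous features only (the categorical features would need encoding first), with exposure handled as a weight on the claim rate:

from sklearn.linear_model import PoissonRegressor

exposure_train = train[exposure].to_numpy()

# model the claim rate (counts / exposure) with exposure as the sample weight
glm = PoissonRegressor(alpha=1e-4)
glm.fit(X_train[continuous_features], y_train / exposure_train, sample_weight=exposure_train)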

Gradient Boosting Machines (GBMs)

GBMs are generally among the best performing models for tabular data, are straightforward to set up, and are well supported by the wider tooling ecosystem. The most popular libraries include:

  • XGBoost – Originally the most popular; requires numeric input, with categorical features one-hot encoded.
  • LightGBM – Requires numeric input; categorical features can be integer-indexed rather than one-hot encoded.
  • CatBoost – Strong performance without tuning; handles string categories natively.

Because CatBoost doesn't require preprocessing of categorical features and performs well out of the box without hyperparameter tuning, the modelling pipelines are simpler, which often makes it the preferred option.

The code below reads the training parameters from the config dictionary and trains a CatBoost model.

from catboost import CatBoostRegressor

params = config.get('gbm_params').get('learning_params')
num_round = config.get('gbm_params').get('num_rounds')
early_stopping_rounds = config.get('gbm_params').get('early_stopping_rounds')

FrequencyModel = CatBoostRegressor(**params, iterations=num_round)
FrequencyModel.fit(train_pool, eval_set=[test_pool], early_stopping_rounds=early_stopping_rounds)
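
Before any plotting, it is worth checking the fitted model against the holdout data. A minimal sketch, assuming the raw scores (which include the log-exposure baseline from the Pool) are exponentiated to give expected claim counts:

import numpy as np
from sklearn.metrics import mean_poisson_deviance

# raw formula values are on the log scale for a Poisson loss; the Pool baseline (log exposure) is assumed included
pred_holdout = np.exp(FrequencyModel.predict(holdout_pool, prediction_type='RawFormulaVal'))

print('Actual claims:         ', y_holdout.sum())
print('Predicted claims:      ', round(pred_holdout.sum(), 1))
print('Mean Poisson deviance: ', mean_poisson_deviance(y_holdout, pred_holdout))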

Plots & Validation

Validation is not just about metrics - it’s about understanding the model's behaviour.

Unlike GLMs, where each feature is inspected and fitted manually, GBMs fit every feature automatically, so reviewing the fitted effect of each feature is important.

Your repository should include modules that create these plots automatically for every model run.

Common validation plots:
- Feature Importance / SHAP – explainability, showing which factors matter most.
- Partial Dependence / SHAP dependence plots – how factors influence predictions.
- Calibration plots – compare predicted vs. actual outcomes (critical for pricing).
- Residual plots – highlight systematic biases.
- Lift/Gain charts – for classification tasks like conversion.

These can be generated as a PDF report that can be reviewed before pushing a model to production.
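
As a sketch of what such a module might contain, the example below builds an actual-vs-expected calibration plot by decile of predicted frequency (reusing the holdout predictions from the earlier sketch) and saves it to a PDF; the function and file names are illustrative:

import matplotlib.pyplot as plt
import polars as pl

def plot_calibration_by_decile(actual, predicted, pdf_path='frequency_calibration.pdf'):
    """Plot mean actual vs mean predicted claim counts by decile of the predictions."""
    summary = (
        pl.DataFrame({'actual': actual, 'predicted': predicted})
        .with_columns(decile=(pl.col('predicted').rank('ordinal') * 10 / pl.len()).ceil().cast(pl.Int32))
        .group_by('decile')
        .agg(pl.col('actual').mean(), pl.col('predicted').mean())
        .sort('decile')
    )

    fig, ax = plt.subplots()
    ax.plot(summary['decile'].to_numpy(), summary['actual'].to_numpy(), marker='o', label='Actual')
    ax.plot(summary['decile'].to_numpy(), summary['predicted'].to_numpy(), marker='o', label='Predicted')
    ax.set_xlabel('Decile of predicted frequency')
    ax.set_ylabel('Mean claim count')
    ax.set_title('Holdout calibration')
    ax.legend()
    fig.savefig(pdf_path)
    plt.close(fig)

plot_calibration_by_decile(y_holdout, pred_holdout)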