Skip to content

3.3. Dataframes

Dataframes are tabular data of rows and columns, which makes up the majority of data types used within pricing, and so the majority of your code will be using dataframe syntax.


Polars

Polars is a relatively new dataFrame library that is quickly becoming popular, designed for working with large datasets efficiently.

  • Performance - Built in Rust, Polars can handle millions of rows quickly.
  • Memory Efficiency - Uses less RAM than Pandas for the same data, and can use datasets larger than memory.
  • Lazy Evaluation - Can defer calculations until needed, optimising execution.
  • Clear Syntax - Easy to read, even for beginners.

For insurance pricing teams, this means you can analyse and process millions of quotes on a local machine.


Importing to a DataFrame

Querying database Note: credentials should not be stored in code, instead these can be set as environment variables.

import pyodbc

def SQLConnection():

    SERVER = SERVER
    DATABASE = DATABASE

    ConnectionString= """
        Driver={{ODBC Driver 18 for SQL Server}};
        Server=tcp:{SERVER}.database.windows.net,1433;
        Database={DATABASE};
        Encrypt=yes;
        TrustServerCertificate=no;
        Connection Timeout=30"""

    conn = pyodbc.connect(ConnectionString) 

    return conn

query = 'SELECT * FROM [{table}]'
quote_data = pl.read_database(query, conn)

Reading in CSV

policies = pl.read_csv('./raw-data/policies.csv')

Reading in Parquet

frequency = pl.read_parquet("data/frequency_set.parquet")

Manipulation

Creating Columns

df = df.with_columns(
    DifferenceToMarket = pl.col('CustomerPremium') - pl.col('MarketPricePrediction'),
    RatioToMarket = pl.col('CustomerPremium') / pl.col('MarketPricePrediction')
)
If else

df = df.with_columns(
    Premium = pl.when(pl.col('StrategyGroup') == 'Optimisation')
                .then(pl.col('OptimisedPremium'))
                .otherwise(pl.col('Premium'))
)

Clips

minimum_premium = configuration.get('minimum_premium')

df = df.with_columns(
    Premium = pl.col('Premium').clip(lower_bound = minimum_premium)
)

Aggregations

claim_counts = (
    claims_amounts
    .with_columns(ClaimCount = pl.lit(1))
    .group_by('IDpol')
    .agg(pl.sum('ClaimCount').alias('ClaimCount'))
)

Joining

frequency_set = (
    policies
    .join(exposure, on = 'IDpol', how='inner')
    .join(claim_counts, on = 'IDpol', how='left')
    .with_columns(ClaimCount = pl.col('ClaimCount').fill_null(0))
)

Lazy Evaluation

There are two modes to run polars, eager and lazy. Eager is when the code is executed as soon as it is run, which is useful if working iteratively, and assessing the output of each stage which is generally the case when developing code.

Lazy execution means the code isn't executed until all the code is run end-to-end. This has two main benefits, that polars can apply optimisation to the query plan, and the code can be run in batches allowing data that is larger than memory to be processed.

To use the lazy evaluation, instead of 'read_', use 'scan_' when importing a file, or use .lazy() to convert a polars dataframe to a lazyframe.

frequency = pl.scan_parquet("data/frequency_set.parquet")

lf = df.lazy()

Optimisations include:

Optimisation Explanation
Predicate pushdown Applies filters as early as possible/ at scan level.
Projection pushdown Select only the columns that are needed at the scan level.
Slice pushdown Only load the required slice from the scan level.
Common subplan elimination Cache subtrees/file scans that are used by multiple subtrees in the query plan.
Simplify expressions Various optimisations, such as constant folding and replacing expensive operations with faster alternatives.
Join ordering Estimates the branches of joins that should be executed first in order to reduce memory pressure.
Type coercion Coerce types such that operations succeed and run on minimal required memory.
Cardinality estimation Estimates cardinality in order to determine optimal group by strategy.