3.3. Dataframes
Dataframes hold tabular data in rows and columns. Most of the data used within pricing is tabular, so the majority of your code will use dataframe syntax.
Polars
Polars is a relatively new DataFrame library, designed to work with large datasets efficiently, that is quickly growing in popularity.
- Performance - Built in Rust, Polars can handle millions of rows quickly.
- Memory Efficiency - Uses less RAM than Pandas for the same data, and can use datasets larger than memory.
- Lazy Evaluation - Can defer calculations until needed, optimising execution.
- Clear Syntax - Easy to read, even for beginners.
For insurance pricing teams, this means you can analyse and process millions of quotes on a local machine.
Importing to a DataFrame
Querying a database. Note: credentials should not be stored in code; instead, set them as environment variables.
import os
import pyodbc
import polars as pl

def SQLConnection():
    # Example environment variable names - use whatever your team has configured
    server = os.environ['SQL_SERVER']
    database = os.environ['SQL_DATABASE']
    connection_string = f"""
        Driver={{ODBC Driver 18 for SQL Server}};
        Server=tcp:{server}.database.windows.net,1433;
        Database={database};
        Encrypt=yes;
        TrustServerCertificate=no;
        Connection Timeout=30"""
    return pyodbc.connect(connection_string)

conn = SQLConnection()
table = 'quote_data'  # hypothetical table name - replace with the table you need
query = f'SELECT * FROM [{table}]'
quote_data = pl.read_database(query, conn)
Reading in CSV
policies = pl.read_csv('./raw-data/policies.csv')
Reading in Parquet
frequency = pl.read_parquet("data/frequency_set.parquet")
Manipulation
Creating Columns
df = df.with_columns(
DifferenceToMarket = pl.col('CustomerPremium') - pl.col('MarketPricePrediction'),
RatioToMarket = pl.col('CustomerPremium') / pl.col('MarketPricePrediction')
)
df = df.with_columns(
Premium = pl.when(pl.col('StrategyGroup') == 'Optimisation')
.then(pl.col('OptimisedPremium'))
.otherwise(pl.col('Premium'))
)
Clips
minimum_premium = configuration.get('minimum_premium')
df = df.with_columns(
Premium = pl.col('Premium').clip(lower_bound = minimum_premium)
)
Aggregations
claim_counts = (
    claims_amounts
    .group_by('IDpol')
    .agg(ClaimCount = pl.len())
)
Joining
frequency_set = (
policies
.join(exposure, on = 'IDpol', how='inner')
.join(claim_counts, on = 'IDpol', how='left')
.with_columns(ClaimCount = pl.col('ClaimCount').fill_null(0))
)
Lazy Evaluation
There are two modes in which Polars can run: eager and lazy. In eager mode, each expression is executed as soon as it is run. This is useful when working iteratively and inspecting the output of each stage, which is generally the case when developing code.
In lazy mode, nothing is executed until the whole query is run end-to-end. This has two main benefits: Polars can optimise the query plan, and the data can be processed in batches, allowing datasets larger than memory to be handled.
To use lazy evaluation, use 'scan_' instead of 'read_' when importing a file, or call .lazy() to convert a Polars DataFrame into a LazyFrame.
frequency = pl.scan_parquet("data/frequency_set.parquet")
lf = df.lazy()
Optimisations include:

| Optimisation | Explanation |
| --- | --- |
| Predicate pushdown | Applies filters as early as possible, at scan level. |
| Projection pushdown | Selects only the columns that are needed at scan level. |
| Slice pushdown | Loads only the required slice at scan level. |
| Common subplan elimination | Caches subtrees/file scans that are used by multiple subtrees in the query plan. |
| Simplify expressions | Various optimisations, such as constant folding and replacing expensive operations with faster alternatives. |
| Join ordering | Estimates which branches of joins should be executed first in order to reduce memory pressure. |
| Type coercion | Coerces types so that operations succeed and run with the minimal required memory. |
| Cardinality estimation | Estimates cardinality in order to determine the optimal group by strategy. |