3.3. Dataframes
Dataframes are tabular data of rows and columns, which makes up the majority of data types used within pricing, and so the majority of your code will be using dataframe syntax.
Polars
Polars is a relatively new dataFrame library that is quickly becoming popular, designed for working with large datasets efficiently.
- Performance - Built in Rust, Polars can handle millions of rows quickly.
- Memory Efficiency - Uses less RAM than Pandas for the same data.
- Lazy Evaluation - Can defer calculations until needed, optimising execution.
- Clear Syntax - Easy to read, even for beginners.
For insurance pricing teams, this means you can analyse and process millions of quotes on a local machine.
Importing to a DataFrame
Querying database Note: credentials should not be stored in code, instead these can be set as environment variables.
import pyodbc
def SQLConnection():
SERVER = SERVER
DATABASE = DATABASE
ConnectionString= """
Driver={{ODBC Driver 18 for SQL Server}};
Server=tcp:{SERVER}.database.windows.net,1433;
Database={DATABASE};
Encrypt=yes;
TrustServerCertificate=no;
Connection Timeout=30"""
conn = pyodbc.connect(ConnectionString)
return conn
query = 'SELECT * FROM [{table}]'
quote_data = pl.read_database(query, conn)
Reading in CSV
policies = pl.read_csv('./raw-data/policies.csv')
Reading in Parquet
frequency = pl.read_parquet("data/frequency_set.parquet")
Manipulation
Creating Columns
df = df.with_columns(
DifferenceToMarket = pl.col('CustomerPremium') - pl.col('MarketPricePrediction'),
RatioToMarket = pl.col('CustomerPremium') / pl.col('MarketPricePrediction')
)
df = df.with_columns(
Premium = pl.when(pl.col('StrategyGroup') == 'Optimisation')
.then(pl.col('OptimisedPremium'))
.otherwise(pl.col('Premium'))
)
Clips
minimum_premium = configuration.get('minimum_premium')
df = df.with_columns(
Premium = pl.col('Premium').clip(lower_bound = minimum_premium)
)
Aggregations
claim_counts = (
claims_amounts
.with_columns(ClaimCount = pl.lit(1))
.group_by('IDpol')
.agg(pl.sum('ClaimCount').alias('ClaimCount'))
)
Joining
frequency_set = (
policies
.join(exposure, on = 'IDpol', how='inner')
.join(claim_counts, on = 'IDpol', how='left')
.with_columns(ClaimCount = pl.col('ClaimCount').fill_null(0))
)