Car Insurance Risk Analysis

A data-driven risk segmentation and premium optimization system for car insurance, using statistical analysis and predictive modeling to identify high-risk policyholders.

Python, Pandas, Scikit-learn, Statsmodels, Seaborn, Matplotlib

Problem

Insurance companies need to accurately price policies based on risk. Underpricing high-risk customers leads to losses, while overpricing low-risk customers drives them away. The goal is to build a model that segments customers by risk level and supports actuarially sound premium setting.

Dataset

Historical insurance policy data covering vehicle attributes, driver demographics, claim history, geographic data, and policy details. Extensive exploratory data analysis (EDA) was performed to understand the distributions of claim frequency and claim severity.
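
To make the two quantities concrete, here is a minimal sketch of how claim frequency and claim severity can be computed from policy records. The records, the field names, and the use of exposure in policy-years are all illustrative assumptions, not the project's actual schema or data.

```python
# Hypothetical policy records; field names and values are illustrative only.
policies = [
    {"policy_id": 1, "exposure_years": 1.0, "claim_count": 0, "claim_amount": 0.0},
    {"policy_id": 2, "exposure_years": 0.5, "claim_count": 1, "claim_amount": 1200.0},
    {"policy_id": 3, "exposure_years": 1.0, "claim_count": 2, "claim_amount": 5300.0},
    {"policy_id": 4, "exposure_years": 1.0, "claim_count": 0, "claim_amount": 0.0},
]


def summarize_claims(records):
    """Return claim frequency, average severity, and pure premium."""
    exposure = sum(r["exposure_years"] for r in records)
    claims = sum(r["claim_count"] for r in records)
    paid = sum(r["claim_amount"] for r in records)
    # Frequency: claims per policy-year of exposure.
    frequency = claims / exposure if exposure else 0.0
    # Severity: average amount paid per claim.
    severity = paid / claims if claims else 0.0
    # Pure premium: expected loss per policy-year (frequency times severity).
    return {
        "frequency": frequency,
        "severity": severity,
        "pure_premium": frequency * severity,
    }


summary = summarize_claims(policies)
print(summary)
```

Frequency and severity are kept separate because they usually have very different distributions, which is what motivates modeling them in two stages.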

Architecture

Two-stage modeling approach: a classification model predicts claim likelihood, and a regression model estimates claim severity for predicted claimants. Features were engineered from vehicle age, driver experience, and geographic risk zones.
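
The two-stage idea can be sketched with scikit-learn as follows. Everything here is an assumption made for illustration: the data is synthetic, the three feature names are placeholders, and the Tweedie power of 1.5 is arbitrary. Multiplying claim probability by predicted severity is one common way to combine the two stages, not necessarily how the project does it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, TweedieRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three engineered features per policy.
rng = np.random.default_rng(0)
n = 2000
vehicle_age = rng.uniform(0, 15, n)
driver_experience = rng.uniform(0, 40, n)
zone_risk = rng.integers(0, 3, n)  # 0 = low, 1 = medium, 2 = high risk zone
X = np.column_stack([vehicle_age, driver_experience, zone_risk])

# Simulated outcomes: whether a claim occurred, and its size if it did.
logit = -2.0 + 0.05 * vehicle_age - 0.03 * driver_experience + 0.6 * zone_risk
has_claim = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
claim_amount = rng.gamma(shape=2.0, scale=500.0 * (1.0 + 0.3 * zone_risk))

# Stage 1: classifier for claim likelihood, fitted on all policies.
frequency_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
frequency_model.fit(X, has_claim)

# Stage 2: severity regression, fitted only on policies that had a claim.
severity_model = make_pipeline(
    StandardScaler(),
    TweedieRegressor(power=1.5, link="log", alpha=0.0, max_iter=1000),
)
severity_model.fit(X[has_claim], claim_amount[has_claim])

# Combine the stages: expected loss = P(claim) * E[amount | claim].
claim_probability = frequency_model.predict_proba(X)[:, 1]
expected_severity = severity_model.predict(X)
expected_loss = claim_probability * expected_severity
print(expected_loss[:5])
```

Fitting the severity model only on rows with a claim keeps the many zero-claim policies from dragging its estimates toward zero.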

Training

Logistic regression and gradient boosting were compared for claim prediction. A GLM (Generalized Linear Model) with a Tweedie distribution was used for severity modeling, consistent with actuarial best practices. Cross-validation was used throughout.
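
A cross-validated comparison of the two claim classifiers might look like the sketch below. The data comes from scikit-learn's `make_classification`. The 15% positive rate, the ROC AUC metric, the five folds, and the hyperparameters are all assumptions chosen for illustration, since the source does not state them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for claim / no-claim labels.
X, y = make_classification(
    n_samples=1500,
    n_features=8,
    n_informative=4,
    weights=[0.85, 0.15],
    random_state=0,
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the claim rate roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, score in mean_auc.items():
    print(f"{name}: mean ROC AUC = {score:.3f}")
```

Claims are usually rare, so a ranking metric such as ROC AUC tends to be more informative than accuracy when comparing the two models.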

Results

Risk segmentation identified three distinct customer tiers. The combined model improved loss ratio estimation by 18% over the baseline flat-rate approach. Geographic risk heatmaps revealed high-claim zones for targeted underwriting.
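
As a rough illustration of tiering and loss ratios, the sketch below splits policies into three equal-sized tiers by predicted expected loss and computes the loss ratio (claims paid divided by premium earned) per tier under a flat premium. The source does not say how the tiers were derived, so the tercile rule, the helper names, and all numbers are hypothetical.

```python
def assign_tiers(expected_losses, labels=("low", "medium", "high")):
    """Rank policies by expected loss and split them into equal-sized tiers."""
    n = len(expected_losses)
    order = sorted(range(n), key=lambda i: expected_losses[i])
    tiers = [None] * n
    for rank, idx in enumerate(order):
        tiers[idx] = labels[rank * len(labels) // n]
    return tiers


def loss_ratio_by_tier(tiers, claims_paid, premiums):
    """Loss ratio per tier: total claims paid / total premium earned."""
    totals = {}
    for tier, paid, premium in zip(tiers, claims_paid, premiums):
        bucket = totals.setdefault(tier, [0.0, 0.0])
        bucket[0] += paid
        bucket[1] += premium
    return {tier: paid / premium for tier, (paid, premium) in totals.items()}


# Hypothetical model outputs and outcomes for six policies.
expected_losses = [120, 950, 300, 80, 1500, 410]
claims_paid = [0, 1200, 0, 0, 2600, 700]
flat_premiums = [600] * 6  # baseline: every policy pays the same premium

tiers = assign_tiers(expected_losses)
ratios = loss_ratio_by_tier(tiers, claims_paid, flat_premiums)
print(tiers)
print(ratios)
```

In this toy data the flat premium gives a loss ratio well above 1 in the high tier and 0 in the low tier, which is the mispricing pattern described in the Problem section.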
