Fraud Detection System

An end-to-end machine learning pipeline for detecting fraudulent transactions using ensemble methods and advanced feature engineering on imbalanced financial datasets.

Python, Scikit-learn, XGBoost, Pandas, SMOTE, Matplotlib

Problem

Financial fraud causes billions in losses annually. Traditional rule-based systems struggle with evolving fraud patterns and generate high false-positive rates that frustrate legitimate customers. The challenge is building a model that catches fraud with high recall while keeping false positives low on a severely imbalanced dataset.

Dataset

The dataset is transactional, with features including transaction amount, time, merchant category, and anonymized behavioral signals. It exhibits severe class imbalance (~0.17% fraud rate), which calls for careful resampling.
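To see why a fraud rate of roughly 0.17% makes plain accuracy uninformative, the following sketch works through the arithmetic. The sample size of 100,000 transactions is a hypothetical figure chosen for illustration, not the size of the actual dataset.

```python
# Back-of-the-envelope arithmetic for a ~0.17% fraud rate.
FRAUD_RATE = 0.0017        # taken from the dataset description
N_TRANSACTIONS = 100_000   # hypothetical sample size, for illustration only

# Expected number of fraudulent rows in a sample of this size.
expected_fraud = FRAUD_RATE * N_TRANSACTIONS

# A model that labels every transaction "legitimate" is correct on all non-fraud rows.
always_legit_accuracy = 1 - FRAUD_RATE

# Legitimate transactions per fraudulent one; a common starting point for class weights.
legit_per_fraud = (1 - FRAUD_RATE) / FRAUD_RATE

print(f"expected fraud cases: {expected_fraud:.0f}")
print(f"accuracy of always predicting legitimate: {always_legit_accuracy:.2%}")
print(f"legitimate transactions per fraud case: {legit_per_fraud:.0f}")
```

A model could reach 99.83% accuracy while catching no fraud at all, which is why recall and precision on the fraud class are the more meaningful metrics here.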

Architecture

The model is a stacked ensemble: XGBoost and Random Forest classifiers serve as base learners, and a logistic regression meta-learner combines their predictions. SMOTE oversampling is applied to the training set to address class imbalance. Feature engineering extracts velocity features, time-based patterns, and per-user statistical aggregates.
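A minimal sketch of this kind of architecture is shown below. It is not the project's code, and it makes several assumptions: scikit-learn's GradientBoostingClassifier stands in for XGBoost so the example depends only on NumPy and scikit-learn, SMOTE is hand-rolled in a few lines rather than taken from imbalanced-learn, and the data is synthetic. The helper name smote_oversample and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_oversample(X, y, k=5, random_state=0):
    """Minimal SMOTE: create synthetic minority rows by interpolating between
    a minority point and one of its k nearest minority neighbours, until the
    two classes are the same size. Assumes label 1 is the minority class."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 is each point itself, so drop it.
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out


# Synthetic imbalanced data standing in for the engineered transaction features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Oversample the training split only; the test split keeps its real class ratio.
X_res, y_res = smote_oversample(X_train, y_train)

stack = StackingClassifier(
    estimators=[
        # Stand-in for XGBoost (assumption made to avoid an extra dependency).
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base-model probabilities
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_res, y_res)
fraud_scores = stack.predict_proba(X_test)[:, 1]
```

One caveat of this simplified version: because resampling happens before the stacker's internal cross-validation, synthetic points can appear on both sides of those inner folds. Resampling inside each fold, for example with an imbalanced-learn pipeline, would avoid that.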

Training

Models are trained with stratified k-fold cross-validation, which preserves the class distribution across folds. Hyperparameters are tuned via Bayesian optimization. After training, the decision threshold is tuned to optimize the precision-recall trade-off for the fraud class.
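The stratified folds and the threshold step can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: the data is synthetic, a class-weighted logistic regression stands in for the real model, the helper pick_threshold and the 90% recall floor are hypothetical, and Bayesian hyperparameter search is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data standing in for the real training set.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)

# Stratified folds keep roughly the same fraud rate in every validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_fraud_rates = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]

# Out-of-fold scores: every row is scored by a model that never saw it.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
oof_scores = cross_val_predict(model, X, y, cv=cv,
                               method="predict_proba")[:, 1]


def pick_threshold(y_true, scores, min_recall=0.90):
    """Return the threshold with the highest precision among those whose
    recall is at least min_recall, plus the precision and recall there."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    eligible = recall >= min_recall
    best = int(np.argmax(np.where(eligible, precision, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])


threshold, precision_at_t, recall_at_t = pick_threshold(y, oof_scores)
print(f"threshold={threshold:.3f} "
      f"precision={precision_at_t:.3f} recall={recall_at_t:.3f}")
```

Choosing the threshold on out-of-fold scores rather than on training-set scores keeps the estimate of the precision-recall trade-off from being overly optimistic.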

Results

The stacked ensemble achieved 94% recall on the fraud class at 87% precision, significantly outperforming the baseline logistic regression. It also reduced false negatives by 31% compared to a single XGBoost model.
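To put the reported figures in concrete terms, the sketch below derives a few quantities from them. Only the 94% recall, 87% precision, and 31% reduction come from the results; the count of 1,000 actual fraud cases is hypothetical, and the implied XGBoost figure assumes both models were evaluated on the same test set and that the 31% is a relative reduction in missed fraud cases.

```python
# Quantities derived from the reported metrics (illustrative arithmetic only).
PRECISION = 0.87      # reported
RECALL = 0.94         # reported
FN_REDUCTION = 0.31   # reported, read here as a relative reduction vs. single XGBoost

# Harmonic mean of precision and recall.
f1 = 2 * PRECISION * RECALL / (PRECISION + RECALL)

ACTUAL_FRAUD = 1_000  # hypothetical number of true fraud cases
true_positives = RECALL * ACTUAL_FRAUD
false_negatives = ACTUAL_FRAUD - true_positives
# precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
false_positives = true_positives * (1 - PRECISION) / PRECISION

# If the ensemble misses 31% fewer cases than XGBoost on the same test set,
# XGBoost's misses are the ensemble's misses divided by (1 - 0.31).
implied_xgb_false_negatives = false_negatives / (1 - FN_REDUCTION)

print(f"F1 score: {f1:.3f}")
print(f"per {ACTUAL_FRAUD} fraud cases: caught {true_positives:.0f}, "
      f"missed {false_negatives:.0f}, false alarms {false_positives:.0f}")
print(f"implied XGBoost misses: {implied_xgb_false_negatives:.0f}")
```

Under these assumptions the F1 score is about 0.90, and for every 1,000 fraud cases the ensemble would miss about 60 where a single XGBoost model would miss about 87.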

GitHub Repository

View on GitHub