
# Financial Fraud Detection — UTEC Diploma Final Project

## Context
This project is the final deliverable for the Applied Data Analytics & AI Diploma at UTEC (Universidad Tecnológica del Uruguay). The diploma covered the full data science stack — from statistical foundations to production ML pipelines — and culminated in an applied project on a real-world dataset.
## The problem
Financial fraud detection is one of the hardest ML classification problems: datasets are severely imbalanced (fraudulent transactions are typically 0.1–1% of the total), the cost of false negatives (missed fraud) vastly outweighs false positives, and the decision boundary is often non-linear.
The goal: build a pipeline that maximizes fraud recall while keeping precision acceptable — a classic precision-recall trade-off problem.
## Dataset
A real-world transaction dataset with hundreds of thousands of records and significant class imbalance. Features include transaction amount, merchant type, time-based signals, and behavioral patterns.
## Approach

### 1. Exploratory Data Analysis (EDA)
Used Matplotlib and Seaborn to understand the distribution of transactions, the class imbalance ratio, correlations between features, and temporal patterns in fraudulent activity.
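The first numbers EDA should establish are the imbalance ratio and any temporal pattern in fraud. A minimal sketch of that step, using a synthetic stand-in for the transaction table (the real column names and rates may differ; the project's plots used Matplotlib/Seaborn, omitted here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic stand-in for the transaction table.
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.2, size=n),
    "hour": rng.integers(0, 24, size=n),
    "is_fraud": rng.random(n) < 0.005,  # ~0.5% positives
})

# Class imbalance ratio: the headline number for this dataset.
fraud_rate = df["is_fraud"].mean()
imbalance_ratio = (1 - fraud_rate) / fraud_rate
print(f"fraud rate: {fraud_rate:.4%}, imbalance ~1:{imbalance_ratio:.0f}")

# Fraud rate by hour of day surfaces temporal patterns.
hourly = df.groupby("hour")["is_fraud"].mean()
print("peak-fraud hour:", hourly.idxmax())
```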
### 2. Feature Engineering & Preprocessing
- Created time-based features (hour of day, day of week, transaction frequency)
- Applied log-transformation on amount features to reduce skewness
- Encoded categorical variables (merchant category, card type)
- Used SMOTE oversampling and class weighting to address imbalance
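The preprocessing steps above can be sketched as follows. The project used SMOTE (from imbalanced-learn); to keep this sketch dependency-free it shows only the class-weighting half via scikit-learn's `class_weight="balanced"`, and all column names and distributions are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Illustrative raw transaction table.
df = pd.DataFrame({
    "timestamp": pd.to_datetime("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 90 * 24 * 3600, n), unit="s"),
    "amount": rng.lognormal(3.0, 1.5, n),
    "merchant_category": rng.choice(["grocery", "travel", "online"], n),
    "is_fraud": rng.random(n) < 0.01,
})

# Time-based features.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek

# Log-transform to reduce the right skew of amounts.
df["log_amount"] = np.log1p(df["amount"])

# One-hot encode the categorical column.
X = pd.get_dummies(
    df[["log_amount", "hour", "day_of_week", "merchant_category"]],
    columns=["merchant_category"],
)
y = df["is_fraud"].astype(int)

# Class weighting: 'balanced' reweights samples inversely to class frequency.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```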
### 3. Model Comparison
Trained and evaluated three model families:
| Model | Notes |
| --- | --- |
| Logistic Regression | Baseline — interpretable, fast |
| XGBoost | Gradient boosting with tree pruning |
| CatBoost | Handles categoricals natively; strong out-of-the-box performance |
All models were evaluated with cross-validation on precision, recall, F1, AUC-ROC, and AUC-PR (area under the precision-recall curve), the latter being more informative than AUC-ROC on imbalanced data.
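A minimal sketch of that evaluation loop, with Logistic Regression standing in for all three model families (XGBoost and CatBoost would drop in the same way; `average_precision` is scikit-learn's name for AUC-PR) on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced synthetic data standing in for the transaction features (~1% positives).
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)

scoring = ["precision", "recall", "f1", "roc_auc", "average_precision"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X, y, cv=cv, scoring=scoring,
)
for m in scoring:
    vals = scores[f"test_{m}"]
    print(f"{m}: {vals.mean():.3f} ± {vals.std():.3f}")
```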
### 4. Threshold Tuning
The default 0.5 classification threshold is suboptimal for fraud. Applied threshold optimization to shift the decision boundary toward higher recall, accepting a controlled increase in false positives.
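One common way to do this, sketched below under the assumption that a 90% recall target was chosen (the project's actual target is not stated): sweep the precision-recall curve and pick the threshold with the best precision among those meeting the recall floor.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.98, 0.02], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)

# precision/recall have len(thresholds) + 1 entries; drop the final
# (precision=1, recall=0) point so indices line up with thresholds.
target_recall = 0.90
ok = recall[:-1] >= target_recall
best_idx = np.argmax(precision[:-1] * ok)  # best precision among qualifying thresholds
chosen = thresholds[best_idx]
print(f"threshold={chosen:.3f}, "
      f"precision={precision[best_idx]:.3f}, recall={recall[best_idx]:.3f}")
```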
## Results
CatBoost achieved the best AUC-PR on the test set, with recall significantly above the Logistic Regression baseline. The final model correctly identifies the vast majority of fraudulent transactions while maintaining a manageable false positive rate suitable for a first-tier triage system.
## What I learned
- How to design an end-to-end ML pipeline for imbalanced binary classification
- The critical difference between AUC-ROC and AUC-PR on skewed datasets
- Practical feature engineering for behavioral/transactional data
- How to use scikit-learn pipelines for reproducible preprocessing + model training
- The business-driven reasoning behind precision-recall trade-offs in fraud contexts
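The AUC-ROC vs. AUC-PR point is easy to demonstrate on synthetic data: a mediocre scorer on a 1%-positive dataset can post a ROC-AUC that looks excellent while its AUC-PR stays low, because ROC-AUC is insensitive to the flood of false positives among the majority class. The numbers below are illustrative only.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(7)
n_neg, n_pos = 99_000, 1_000  # 1% positive rate

# Scores from a mediocre detector: positives only modestly higher on average.
neg_scores = rng.normal(0.0, 1.0, n_neg)
pos_scores = rng.normal(2.0, 1.0, n_pos)

y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
s = np.concatenate([neg_scores, pos_scores])

roc = roc_auc_score(y, s)          # looks impressive
ap = average_precision_score(y, s)  # far more sobering
print(f"ROC-AUC: {roc:.3f}  AUC-PR: {ap:.3f}")
```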