
# Financial Fraud Detection — UTEC Diploma Final Project

## Context
This project is the final deliverable for the Applied Data Analytics & AI Diploma at UTEC (Universidad Tecnológica del Uruguay). The diploma covered the full data science stack — from statistical foundations to production ML pipelines — and culminated in an applied project on a real-world dataset.
## The problem
Financial fraud detection is one of the hardest ML classification problems: datasets are severely imbalanced (fraudulent transactions are typically 0.1–1% of the total), the cost of false negatives (missed fraud) vastly outweighs false positives, and the decision boundary is often non-linear.
The goal: build a pipeline that maximizes fraud recall while keeping precision acceptable — a classic precision-recall trade-off problem.
## Dataset
A real-world transaction dataset with hundreds of thousands of records and significant class imbalance. Features include transaction amount, merchant type, time-based signals, and behavioral patterns.
## Approach

### 1. Exploratory Data Analysis (EDA)
Used Matplotlib and Seaborn to understand the distribution of transactions, the class imbalance ratio, correlations between features, and temporal patterns in fraudulent activity.
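The first numbers EDA should establish are the imbalance ratio and any temporal pattern in fraud. A minimal sketch of that step, using a synthetic stand-in for the transaction table (the real column names and rates may differ; the project's plots used Matplotlib/Seaborn, omitted here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic stand-in for the transaction table.
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.2, size=n),
    "hour": rng.integers(0, 24, size=n),
    "is_fraud": rng.random(n) < 0.005,  # ~0.5% positives
})

# Class imbalance ratio: the headline number for this dataset.
fraud_rate = df["is_fraud"].mean()
imbalance_ratio = (1 - fraud_rate) / fraud_rate
print(f"fraud rate: {fraud_rate:.4%}, imbalance ~1:{imbalance_ratio:.0f}")

# Fraud rate by hour of day surfaces temporal patterns.
hourly = df.groupby("hour")["is_fraud"].mean()
print("peak-fraud hour:", hourly.idxmax())
```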
### 2. Feature Engineering & Preprocessing
- Created time-based features (hour of day, day of week, transaction frequency)
- Applied log-transformation on amount features to reduce skewness
- Encoded categorical variables (merchant category, card type)
- Used SMOTE oversampling and class weighting to address imbalance
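The preprocessing steps above can be sketched as follows. The project used SMOTE (from imbalanced-learn); to keep this sketch dependency-free it shows only the class-weighting half via scikit-learn's `class_weight="balanced"`, and all column names and distributions are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Illustrative raw transaction table.
df = pd.DataFrame({
    "timestamp": pd.to_datetime("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 90 * 24 * 3600, n), unit="s"),
    "amount": rng.lognormal(3.0, 1.5, n),
    "merchant_category": rng.choice(["grocery", "travel", "online"], n),
    "is_fraud": rng.random(n) < 0.01,
})

# Time-based features.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek

# Log-transform to reduce the right skew of amounts.
df["log_amount"] = np.log1p(df["amount"])

# One-hot encode the categorical column.
X = pd.get_dummies(
    df[["log_amount", "hour", "day_of_week", "merchant_category"]],
    columns=["merchant_category"],
)
y = df["is_fraud"].astype(int)

# Class weighting: 'balanced' reweights samples inversely to class frequency.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```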
### 3. Model Comparison
Trained and evaluated three model families:
| Model | Notes |
| --- | --- |
| Logistic Regression | Baseline — interpretable, fast |
| XGBoost | Gradient boosting with tree pruning |
| CatBoost | Handles categoricals natively; strong out-of-the-box performance |
All models were evaluated with cross-validation on precision, recall, F1, AUC-ROC, and AUC-PR (area under the precision-recall curve), the latter being more informative than AUC-ROC on imbalanced data.
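A minimal sketch of that evaluation loop, with Logistic Regression standing in for all three model families (XGBoost and CatBoost would drop in the same way; `average_precision` is scikit-learn's name for AUC-PR) on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced synthetic data standing in for the transaction features (~1% positives).
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)

scoring = ["precision", "recall", "f1", "roc_auc", "average_precision"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X, y, cv=cv, scoring=scoring,
)
for m in scoring:
    vals = scores[f"test_{m}"]
    print(f"{m}: {vals.mean():.3f} ± {vals.std():.3f}")
```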
### 4. Threshold Tuning
The default 0.5 classification threshold is suboptimal for fraud. Applied threshold optimization to shift the decision boundary toward higher recall, accepting a controlled increase in false positives.
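One common way to do this, sketched below under the assumption that a 90% recall target was chosen (the project's actual target is not stated): sweep the precision-recall curve and pick the threshold with the best precision among those meeting the recall floor.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.98, 0.02], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)

# precision/recall have len(thresholds) + 1 entries; drop the final
# (precision=1, recall=0) point so indices line up with thresholds.
target_recall = 0.90
ok = recall[:-1] >= target_recall
best_idx = np.argmax(precision[:-1] * ok)  # best precision among qualifying thresholds
chosen = thresholds[best_idx]
print(f"threshold={chosen:.3f}, "
      f"precision={precision[best_idx]:.3f}, recall={recall[best_idx]:.3f}")
```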
## Results
CatBoost achieved the best AUC-PR on the test set, with recall significantly above the Logistic Regression baseline. The final model correctly identifies the vast majority of fraudulent transactions while maintaining a manageable false positive rate suitable for a first-tier triage system.
## What I learned
- How to design an end-to-end ML pipeline for imbalanced binary classification
- The critical difference between AUC-ROC and AUC-PR on skewed datasets
- Practical feature engineering for behavioral/transactional data
- How to use scikit-learn pipelines for reproducible preprocessing + model training
- The business-driven reasoning behind precision-recall trade-offs in fraud contexts
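The AUC-ROC vs. AUC-PR point is easy to demonstrate on synthetic data: a mediocre scorer on a 1%-positive dataset can post a ROC-AUC that looks excellent while its AUC-PR stays low, because ROC-AUC is insensitive to the flood of false positives among the majority class. The numbers below are illustrative only.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(7)
n_neg, n_pos = 99_000, 1_000  # 1% positive rate

# Scores from a mediocre detector: positives only modestly higher on average.
neg_scores = rng.normal(0.0, 1.0, n_neg)
pos_scores = rng.normal(2.0, 1.0, n_pos)

y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
s = np.concatenate([neg_scores, pos_scores])

roc = roc_auc_score(y, s)          # looks impressive
ap = average_precision_score(y, s)  # far more sobering
print(f"ROC-AUC: {roc:.3f}  AUC-PR: {ap:.3f}")
```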