Donna Glassbrenner Portfolio

About Me

Statistician & Data Scientist with 25+ Years Solving High-Stakes Analytical Challenges

PhD mathematician with deep understanding of machine learning mathematics—enabling custom statistical solutions that consistently outperform standard ML approaches. Recent hybrid fraud detection system demonstrates rigorous production thinking: 48% cost reduction by optimizing for business value rather than standard metrics. Foundation projects show statistical depth through custom methods achieving 30% improvements beyond typical approaches.

What I bring:

Deep statistical foundation enabling custom analytical methods beyond typical data science
Production-oriented thinking optimizing for business outcomes
Full analytical lifecycle: from stakeholder requirements through deployment
Clear communication translating complexity for diverse decision-makers

Domain experience:

24 years vehicle safety analysis and risk assessment (NHTSA)
3 years national economic indicators (U.S. Census Bureau)
Current fraud detection research portfolio (8+ open-source projects)
Mathematical consulting for AI research and automotive industry

Specialized in anomaly detection, risk quantification, and impact measurement. Recognized with 21 federal awards for analytical innovation and cross-functional collaboration.

Technical tools: Python (pandas, numpy, scikit-learn, matplotlib), SQL, SAS, Tableau, Git/GitHub. Deployment experience: dbt, Databricks, Snowflake, Streamlit, Hugging Face.

Explore my projects below to see real-world examples of my analytical approach and impact.

Fraud Detection Projects

Self-Initiated R&D via Analysis Insights, LLC (6 months intensive focus): These projects showcase rigorous production thinking and deep statistical expertise that elevates ML results beyond what standard approaches achieve. My current flagship project demonstrates unusually complete business orientation beyond typical portfolio projects, while foundation projects explore how rigorous statistical methods consistently improve ML accuracy, reveal hidden patterns, and drive superior outcomes.

Hybrid Fraud Detection: Production-Ready ML + Rules System CURRENT PROJECT

Most portfolio projects optimize for F1-score. This system optimizes for total business cost, reflecting a production and business orientation beyond typical portfolio work.

Hybrid Architecture: Combined rule-based logic (impossible travel, burst detection, velocity thresholds) with Random Forest ML (18 features) for explainability + nuance. Rules provide instant explanations for blocked transactions; ML captures subtle patterns rules miss.
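A minimal sketch of how a rules-first, model-second decision could be wired together. The field names, rule limits, and model threshold below are hypothetical placeholders, not the values used in the project, and the model score is assumed to come from a classifier trained elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Transaction:
    amount: float
    txns_last_10_min: int         # burst indicator (hypothetical feature)
    km_from_last_txn: float       # distance from the previous transaction
    hours_since_last_txn: float   # time since the previous transaction


def rule_check(txn: Transaction,
               max_speed_kmh: float = 900.0,
               burst_limit: int = 5) -> Optional[str]:
    """Return a human-readable reason if a rule fires, otherwise None."""
    # Impossible travel: implied speed between consecutive transactions.
    # Skipped when no time has elapsed, to avoid dividing by zero.
    if txn.hours_since_last_txn > 0:
        speed = txn.km_from_last_txn / txn.hours_since_last_txn
        if speed > max_speed_kmh:
            return f"impossible travel ({speed:.0f} km/h)"
    # Burst detection: too many transactions in a short window.
    if txn.txns_last_10_min > burst_limit:
        return f"burst of {txn.txns_last_10_min} transactions in 10 minutes"
    return None


def hybrid_decision(txn: Transaction, ml_score: float,
                    ml_threshold: float = 0.5) -> Tuple[str, str]:
    """Rules decide first and supply the explanation; the model covers the rest."""
    reason = rule_check(txn)
    if reason is not None:
        return "block", reason
    if ml_score >= ml_threshold:
        return "review", f"model score {ml_score:.2f} >= {ml_threshold:.2f}"
    return "approve", "no rule fired and model score below threshold"
```

For example, `hybrid_decision(Transaction(120.0, 1, 4000.0, 2.0), ml_score=0.1)` is blocked by the travel rule regardless of the low model score, and the returned reason can be shown to an analyst as is.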

Rigorous Production Methodology: Grid search over 512 threshold combinations using proper validation methodology. Cost function incorporates realistic payment industry economics (EMV liability, dispute modeling, customer churn)—business-driven optimization that yielded $40K additional savings (20% improvement) vs. arbitrary thresholds. Demonstrates focus on business value rather than standard metrics.
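As an illustration of cost-driven threshold selection, the sketch below searches a grid of rule and model thresholds and keeps the combination with the lowest total cost. Everything in it is an assumption made for the example: the toy validation rows, the flat false-positive cost, and the choice of three thresholds with eight candidate values each (one possible way to arrive at 512 combinations; the project's actual grid is not described here).

```python
from itertools import product

# Hypothetical validation rows:
# (amount, txns_last_hour, txns_last_10_min, ml_score, is_fraud)
VALIDATION = [
    (25.0, 1, 1, 0.05, False),
    (900.0, 2, 1, 0.92, True),
    (3.0, 12, 9, 0.40, True),
    (60.0, 7, 2, 0.35, False),
    (450.0, 3, 1, 0.55, True),
    (15.0, 4, 3, 0.20, False),
]

FALSE_POSITIVE_COST = 40.0  # assumed flat cost of flagging a legitimate customer


def flagged(row, velocity_limit, burst_limit, ml_threshold):
    """A transaction is flagged if any rule fires or the model score is high."""
    _, per_hour, per_10_min, score, _ = row
    return (per_hour > velocity_limit
            or per_10_min > burst_limit
            or score >= ml_threshold)


def total_cost(rows, velocity_limit, burst_limit, ml_threshold):
    """Sum missed-fraud losses and false-positive costs over the rows."""
    cost = 0.0
    for row in rows:
        amount, is_fraud = row[0], row[4]
        hit = flagged(row, velocity_limit, burst_limit, ml_threshold)
        if is_fraud and not hit:
            cost += amount                # missed fraud: lose the amount
        elif hit and not is_fraud:
            cost += FALSE_POSITIVE_COST   # legitimate customer inconvenienced
    return cost


def grid_search(rows):
    """Evaluate every threshold combination and return the cheapest one."""
    velocity_grid = range(3, 11)                 # 8 candidate values
    burst_grid = range(2, 10)                    # 8 candidate values
    ml_grid = [k / 10 for k in range(2, 10)]     # 8 candidate values, 0.2 to 0.9
    grid = list(product(velocity_grid, burst_grid, ml_grid))
    best = min(grid, key=lambda combo: total_cost(rows, *combo))
    return best, total_cost(rows, *best), len(grid)
```

In practice the thresholds would be chosen on a validation split and the resulting cost reported on a separate held-out split, so that the reported savings are not inflated by tuning on the same data.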

Realistic Issuer Economics: Modeled EMV liability shift (merchant pays for 85% card-present fraud post-2015), 3D Secure adoption (15% U.S. merchants vs. 80% Europe), dispute probability patterns (30% for <$10, 95% for >$500), customer churn (2% after false positive × $2,000 LTV), interchange revenue loss (2%).
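The figures above can be turned into per-transaction cost terms. The sketch below is one plausible way to combine them, not the project's actual cost function: the constants come from the paragraph above, while the functional forms and the linear ramp for dispute probability between $10 and $500 are assumptions made for illustration.

```python
CHURN_PROB_AFTER_FALSE_POSITIVE = 0.02   # from the text
CUSTOMER_LTV = 2000.0                    # from the text
INTERCHANGE_RATE = 0.02                  # from the text
MERCHANT_SHARE_CARD_PRESENT = 0.85       # from the text (post-2015 EMV shift)


def dispute_probability(amount: float) -> float:
    """30% below $10 and 95% above $500 (from the text).

    The linear ramp between those two points is an assumption.
    """
    if amount < 10:
        return 0.30
    if amount > 500:
        return 0.95
    return 0.30 + (0.95 - 0.30) * (amount - 10) / (500 - 10)


def false_positive_cost(amount: float) -> float:
    """Expected churn loss plus interchange lost on the declined purchase."""
    churn_loss = CHURN_PROB_AFTER_FALSE_POSITIVE * CUSTOMER_LTV
    return churn_loss + INTERCHANGE_RATE * amount


def missed_fraud_cost_card_present(amount: float) -> float:
    """Issuer's expected loss on an undetected card-present fraud (assumed form):
    the amount, times the chance it is disputed, times the issuer's share."""
    issuer_share = 1.0 - MERCHANT_SHARE_CARD_PRESENT
    return amount * dispute_probability(amount) * issuer_share
```

Under these assumptions a false positive on a $100 purchase costs about $42 ($40 of expected churn plus $2 of interchange), which is why blocking too aggressively can cost more than the fraud it prevents.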

8 Fraud Typologies: Analyzed detection rates across card testing (98%), stolen card CNP (95%), account takeover (92%), friendly fraud (90%), synthetic identity (85%), refund fraud (88%), application fraud (87%), lost/stolen card (95%). Detailed pattern specifications for each type.

Results: 48% cost reduction vs. rules-only ($314K → $162K) and 10% vs. ML-only ($180K → $162K), on simulated transaction data (47,109 transactions, 1.63% fraud rate, 500 cardholders, 6 months).

🧠Random Forest 🧠cost-based optimization ∑grid search 🔍business-driven ML 🐍Python 🐍scikit-learn

Foundation Work: How Deep Statistics Elevates ML

Supporting research demonstrating the rare combination of mathematical rigor + modern ML that drives superior results. These projects show how statistical expertise reveals insights ML alone misses, achieves accuracy improvements beyond standard approaches, and enables analytical solutions requiring mathematical depth most data scientists lack.

Statistical Enhancements to Machine Learning

Demonstrating how deep statistical expertise reveals insights and improves accuracy beyond what ML alone achieves: rigorous mathematical methods that consistently elevate ML results.

Squeezing More Info Out of Fraud Data with Statistics
Your machine learning models are predicting fraud really well. But are they giving you the full picture? Are they giving you the critical clues you need to understand today's fraud threat? In this project, I give an example where they don't and show how to use statistical anomaly detection to fill in the gaps.
∑anomaly detection 🧠machine learning 🐍Python 🔍enhancing ML with statistics
How Bad Could This Emergent Fraud Be?
You have uncovered a new way that criminals are committing fraud, with 3 unrelated cases out of 1,000 transactions. How pervasive could this new fraud type be? Can answers from generative AI or a Stats 101 webpage be trusted? In this project, I show how "AI/Stats 101" can lead you astray, underestimating the fraud rate upper confidence bound by as much as 30%, and how to get the right answers with rigorous statistical methods—demonstrating mathematical depth that most data scientists lack.
∑statistical distributions ∑quantifying uncertainty 🐍Python 🔍rigorous stats beyond standard approaches
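For the 3-in-1,000 example above, the gap between a textbook normal approximation and an exact binomial bound can be sketched as follows. This assumes a one-sided 95% upper bound and compares the Wald (normal approximation) bound with the Clopper-Pearson exact bound. The project's own comparison may use different methods or confidence levels, so this is not a reproduction of its 30% figure.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def exact_upper_bound(k: int, n: int, conf: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound: the p at which P(X <= k) = 1 - conf."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(100):                      # bisection; the CDF decreases in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > 1 - conf:
            lo = mid                          # p still too small
        else:
            hi = mid
    return (lo + hi) / 2


def wald_upper_bound(k: int, n: int, conf: float = 0.95) -> float:
    """Textbook normal-approximation upper bound."""
    p_hat = k / n
    z = NormalDist().inv_cdf(conf)
    return p_hat + z * sqrt(p_hat * (1 - p_hat) / n)


exact = exact_upper_bound(3, 1000)
wald = wald_upper_bound(3, 1000)
print(f"exact: {exact:.5f}  wald: {wald:.5f}  shortfall: {1 - wald / exact:.1%}")
```

With so few events the binomial distribution is strongly skewed, so the symmetric normal approximation places the upper bound noticeably too low.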
Medical Upcoding Analysis (Synthetic Data)
A self-funded employer's Third Party Administrator flags dermatology upcoding. Providers claim "no difference" (p=0.0725 via claim-level t-test). This project uses synthetic data to demonstrate how proper provider-level clustering analysis exposes upcoding patterns the t-test misses entirely. Complete reproducible analysis with Jupyter notebook, polished HTML report, and LinkedIn-ready visualization.
∑hierarchical clustering ∑unit of analysis ∑t-test critique 🔍healthcare analytics 🔍fraud detection 🐍Python 📊Jupyter 📄HTML reports

Business Optimization

Analyzing trade-offs between fraud capture, false positives, and investigation costs to optimize business outcomes.

Business Optimization Analysis Series
Three related projects analyzing critical business trade-offs: (1) Fraud-FPR trade-off analysis comparing 8 ML models to identify the optimal balance of 95%+ fraud capture with lowest false positive rate; (2) Investigative staffing optimization determining optimal resource allocation given $75 false positive vs. $1,500 missed fraud costs; (3) Cost-sensitive model comparison evaluating cost-sensitive XGBoost against imbalance techniques using custom CardPrecision@k/CardRecall@k metrics.
🧠cost-sensitive learning 🧠model comparison 🧠Precision@k and Recall@k 🔍achieving business objectives 🐍Python
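The $75 vs. $1,500 trade-off in project (2) implies a break-even fraud probability. The sketch below works it out under simplifying assumptions that are mine rather than the project's: the model's scores are calibrated probabilities, investigating a legitimate transaction costs $75, investigating a fraudulent one fully prevents the loss, and skipping a fraudulent one costs $1,500.

```python
def break_even_probability(false_positive_cost: float,
                           missed_fraud_cost: float) -> float:
    """Fraud probability at which investigating and not investigating cost the same.

    Investigate when p * missed_fraud_cost > (1 - p) * false_positive_cost,
    which rearranges to p > fp / (fp + fn).
    """
    return false_positive_cost / (false_positive_cost + missed_fraud_cost)


def should_investigate(p_fraud: float,
                       false_positive_cost: float = 75.0,
                       missed_fraud_cost: float = 1500.0) -> bool:
    """Compare the expected cost of skipping with the expected cost of investigating."""
    expected_cost_skip = p_fraud * missed_fraud_cost
    expected_cost_investigate = (1 - p_fraud) * false_positive_cost
    return expected_cost_skip > expected_cost_investigate


threshold = break_even_probability(75.0, 1500.0)
print(f"investigate when P(fraud) > {threshold:.4f}")
```

With these costs the break-even point is 75 / 1,575, or about 4.8%, far below the 50% cutoff that a default classifier threshold would use.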

ML Model Benchmarking & Deployment

Systematic model comparison and hands-on deployment demonstrations using modern data platforms.

Fraud Detection Machine Learning Blog
Benchmarked 8 ML algorithms (SVM, XGBoost, logistic regression, random forest, neural networks, k-nearest neighbors, and 2 decision trees) using custom CardPrecision@30 and CardRecall@30 metrics designed for fraud detection's unique challenges.
🧠random forests 🧠XGBoost 🧠support vector machines 🧠nearest neighbors 🧠neural networks 🧠imbalanced learning 🐍Python
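The custom metrics are not defined on this page. As a rough sketch, assuming CardPrecision@k and CardRecall@k follow the usual card-level top-k definitions (rank cards by their highest transaction score within a period, then check the k highest-ranked cards), they could be computed like this. The function names and input layout are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def top_k_cards(scored_txns: Iterable[Tuple[str, float]], k: int) -> List[str]:
    """Rank cards by their highest transaction score and keep the top k."""
    best_score: Dict[str, float] = {}
    for card_id, score in scored_txns:
        best_score[card_id] = max(score, best_score.get(card_id, float("-inf")))
    ranked = sorted(best_score, key=best_score.get, reverse=True)
    return ranked[:k]


def card_precision_at_k(scored_txns: Iterable[Tuple[str, float]],
                        compromised_cards: Set[str], k: int) -> float:
    """Share of the k alerts that point at a truly compromised card.

    Divides by k, so unused alert slots count against precision.
    """
    hits = sum(card in compromised_cards for card in top_k_cards(scored_txns, k))
    return hits / k


def card_recall_at_k(scored_txns: Iterable[Tuple[str, float]],
                     compromised_cards: Set[str], k: int) -> float:
    """Share of compromised cards that appear among the k alerts."""
    if not compromised_cards:
        return 0.0
    hits = sum(card in compromised_cards for card in top_k_cards(scored_txns, k))
    return hits / len(compromised_cards)
```

Scoring at the card level with a fixed k reflects how investigators work: a team can only review a limited number of cards per period, and several alerts on the same card count as one case.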
Databricks Fraud Detection Dashboard
An interactive dashboard built on Databricks for monitoring fraud analytics, featuring suspicious transaction identification and model performance tracking.
🚀dashboard 🚀Databricks 🧠XGBoost 🐍Python 🗄️SQL
Fraud Detection API on Hugging Face Spaces
A Streamlit API application deployed on Hugging Face Spaces allowing users to input transaction features and generate fraud predictions using a deployed XGBoost model.
🚀API 🚀Hugging Face 🧠XGBoost 🐍Python
Snowflake and dbt Demonstration
This demonstration builds a dbt pipeline in Snowflake to ingest, validate, and document the IEEE-CIS Fraud Detection dataset, with schema management, automated data quality checks, and reproducible process documentation.
🚀Snowflake 🚀dbt 🧠XGBoost 🐍Python 🗄️SQL

Technical Expositions

Deep dives into the mathematical foundations and domain-specific considerations of fraud detection ML.

The Math Behind Fraud Detection with Logistic Regression
A write-up on the math behind fraud detection, illustrated with logistic regression. Why look at the math? Because you need to understand the math to adapt models to accommodate particularities in the data and address specific business objectives.
∑optimization 🧠logistic regression 🧠tuning hyperparameters 🧠regularization 🔍enhancing ML with math
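One example of adapting the math to a business objective is weighting the loss so that missed fraud costs more than a false alarm. The sketch below is a plain-Python illustration under that assumption, with an L2 penalty and gradient descent. The weight, penalty, and learning rate are arbitrary illustrative values, and this is not taken from the write-up itself.

```python
from math import exp, log


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def weighted_loss_and_grad(w, X, y, fraud_weight=10.0, l2=0.1):
    """Class-weighted log loss with an L2 penalty, and its gradient in w.

    X is a list of feature rows (include a constant 1.0 for the intercept),
    y is a list of 0/1 labels. Fraud rows (y = 1) are weighted by fraud_weight.
    """
    n = len(X)
    loss = 0.5 * l2 * sum(wj * wj for wj in w)
    grad = [l2 * wj for wj in w]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        p = min(max(p, 1e-12), 1 - 1e-12)        # guard against log(0)
        c = fraud_weight if yi == 1 else 1.0
        loss -= c * (yi * log(p) + (1 - yi) * log(1 - p)) / n
        for j, xj in enumerate(xi):
            grad[j] += c * (p - yi) * xj / n     # derivative of the weighted term
    return loss, grad


def fit(X, y, steps=500, lr=0.1, **kwargs):
    """Plain gradient descent from a zero start."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        _, grad = weighted_loss_and_grad(w, X, y, **kwargs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w
```

Because the weight enters the gradient as a simple multiplier on each fraud row, raising it pulls the fitted boundary toward catching more fraud at the price of more false positives, which is the lever a cost-based objective needs.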
What's the Same and What's Different in Fraud Detection
A write-up on how applying data science to fraud detection is similar to, and different from, data science applied to other domains. Understanding these similarities and differences is key to successfully adapting data science techniques to fraud detection.
🧠imbalanced & cost-sensitive learning 🧠class imbalance 🧠prequential validation 🧠Precision@k 🔍understanding ML with math

Customized Data Analyses

📋 Coming soon: Detailed write-ups for these projects. To illustrate the analytical challenges I have solved, I use fully simulated data and altered contexts so as not to reveal any non-public information. These examples showcase my problem-solving approach and custom analytical solutions.

This section spotlights challenging, custom data analysis problems I have solved in settings where standard approaches often fall short.

When I was learning and teaching math and statistics, I might have wondered how often the problems I would later encounter in the "real world" would be solved by simple cookie-cutter applications of the formulas and techniques I was learning or teaching. It turns out, not very often.

Most of the time, the data being used or the question being asked deviated from standard protocols in some way (e.g. involving a ratio, rare events, or reporting lag). Or the client knew what s/he wanted in general-but-somewhat-ambiguous terms that didn't quite translate into math. Or the technique involved an approximation and it wasn't quite clear if the approximation would be good enough for the client's requirements.

I don't know what "most" analysts do in these situations. Some (or many?) might lack the in-depth math understanding needed to address such issues head-on and instead default to applying cookie-cutter techniques that might not give accurate answers. They might or might not be able to explain the limitations of their simplified analysis to their client. The client walks away with what they think are solid conclusions, but they aren't.

I approach these situations differently. I enjoy the challenge of formalizing ambiguous problems, identifying and addressing deviations from standard protocols, and determining whether approximations are good enough. I have the math and statistics background needed to tackle these issues head-on, and I can explain the limitations of various approaches to my clients so they can make informed decisions.

This section highlights some of the types of customized data analysis problems I have solved. I use made-up numbers and have hidden contextual details so as not to reveal non-public information.

To showcase the value of my custom analyses, I sometimes include what a cookie-cutter approach would have produced, such as one you might have gotten from generative AI, a Stats 101 website, or a lesser-equipped analyst.

What Can I Conclude from This Data?
You conducted a test and you have results. What can you conclude from them? Are the two groups you tested different? Did this countermeasure work? What if you got more data?
- How I improved an estimate of a rate by 10 percentage points
∑hypothesis tests (A/B tests) ∑confidence bounds ∑quantifying uncertainty ∑binomial distributions
How Should I Do This? Design a Plan for Me. (Coming soon)
You know what you want but not how to do it. Maybe you want to estimate the rate of occurrence of a rare event. Maybe you want to know if a countermeasure worked. Or maybe something else. Doing it right can mean the difference between making valid conclusions or not, or getting more precise answers at less cost.
∑experimental design ∑complex sample designs ∑optimal sample sizes ∑stratified PPS ∑hypothesis tests (A/B tests) ∑statistical power ∑cost optimization
How Good Is This Plan? (Coming soon)
You know what you want, and you have a plan. How robust is it? Or you have candidate plans. Which plan should you choose?
∑Monte Carlo simulation ∑what-if scenarios
How Do I Deal with This Aspect? (Coming soon)
Data is messy. Missing values, imperfect tests, reporting lags. What is an analyst to do?
∑reporting lag ∑imputation 🧠logistic regression 🧠false positive rate, false neg rate ∑conditional probability
Big Questions (Coming soon)
You've got the small picture nailed. What's the big picture? You want to aggregate small-scale impacts (like individual test results) into big picture impacts (nationwide implications). Maybe you have results from various studies to incorporate. Or you want to know the large-scale implications of what-if scenarios.
∑quantifying uncertainty 🧠modeling

Vehicle Safety Work

My vehicle safety work includes extensive collaboration with engineers and behavioral scientists to design studies, analyze data, and evaluate safety interventions. While some analyses were unpublished or advisory, the published projects below demonstrate impactful modeling and statistical innovation used to improve vehicle safety and policy. For illustrative examples of unpublished analytic work using simulated data and made-up contexts—including fraud-themed examples—see the Customized Data Analyses section and the Fraud Detection Projects section.

Report to Congress, Vehicle Safety Recall Completion Rates 2021
This report analyzes trends in vehicle recall completion rates and identifies risk factors associated with low compliance. (Recall completion rates indicate the share of recalled vehicles that have been repaired or otherwise remedied.) I conducted the modeling and drafted sections IIIc, IIId, V, and VI. In addition to the Williams-adjusted fixed-effects logistic regression described in the report, I also built decision trees and generalized linear models, using LASSO, stepwise selection, and multi-fold cross-validation. My champion model was implemented by NHTSA to identify low-performing recalls for follow-up.
∑LASSO ∑stepwise selection ∑generalized linear models 🧠decision trees 🧠k-fold cross validation 🗄️SAS 🗄️SQL
An Analysis of Recent Improvements to Vehicle Safety
A study of improvements to vehicle safety, using negative binomial, log-linear, logistic, generalized logistic, and cumulative logistic models. My study showed that improvements collectively prevented over 700,000 crashes in a single year, as well as preventing or mitigating over one million injuries.
🧠negative binomial models 🧠log-linear models 🧠generalized logistic models 🗄️SAS 🗄️SQL
Designing Samples to Satisfy Many Variance Constraints, 2001 FCSM
This paper presents and proves an algorithm that finds optimal sample sizes meeting nested univariate constraints on the coefficients of variation of a Horvitz-Thompson estimator under stratified simple random sampling.
∑convex optimization ∑precision requirements 🔍improving statistics with math
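To make the constraint in the sample design paper concrete, the sketch below computes the coefficient of variation of an estimated total under stratified simple random sampling without replacement, using the standard textbook variance formula. It only evaluates a candidate allocation against a target. It is not the paper's optimization algorithm, and the stratum figures in the example are invented.

```python
from math import sqrt


def cv_of_total(N, S, Ybar, n):
    """CV of the Horvitz-Thompson total under stratified SRS without replacement.

    N: stratum population sizes, S: stratum standard deviations,
    Ybar: stratum means, n: stratum sample sizes.
    Var = sum over strata of N_h^2 * (1 - n_h / N_h) * S_h^2 / n_h.
    """
    variance = sum(Nh**2 * (1 - nh / Nh) * Sh**2 / nh
                   for Nh, Sh, nh in zip(N, S, n))
    total = sum(Nh * yb for Nh, yb in zip(N, Ybar))
    return sqrt(variance) / total


def meets_constraint(N, S, Ybar, n, max_cv):
    """Check a candidate allocation against a single CV requirement."""
    return cv_of_total(N, S, Ybar, n) <= max_cv


# Invented two-stratum example
N, S, Ybar = [1000, 500], [10.0, 20.0], [50.0, 100.0]
print(cv_of_total(N, S, Ybar, n=[100, 50]))
```

A design problem with many such constraints asks for the cheapest allocation `n` for which every required CV is met at once.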
Estimating the Lives Saved by Safety Belts and Air Bags, 2003 ESV
This paper, which was presented at the 2003 International Technical Conference on the Enhanced Safety of Vehicles (ESV), describes changes to the calculations of the lives saved by safety belts and air bags. It also discusses alternative methods for attributing a life saved to the safety belt or the air bag, for occupants protected by both devices.
NHTSA's Review of the National Automotive Sampling System, Report to Congress
I conducted the analysis in Chapter 8 of this report, which calculates the recommended numbers of investigations, crash reports, and data collection sites to use for NHTSA's two premier crash databases (now called the Crash Report Sampling System and Crash Investigation Sampling System). This chapter, which I drafted, also presents the analyses that could be conducted and the conclusions that could be reached with the recommended sample sizes.
∑sample design ∑determining sample sizes ∑precision requirements
The Relationship between Occupant Compartment Deformation and Occupant Injury
I cowrote this report with a NHTSA engineer. It analyzes the relationship between occupant compartment deformation and injury to the occupant.

Donna Glassbrenner, Ph.D.