Statistician & Data Scientist with 25+ Years Solving High-Stakes Analytical Challenges
PhD mathematician with deep understanding of machine learning mathematics—enabling custom statistical
solutions that consistently outperform standard ML approaches. Recent hybrid fraud detection system
demonstrates rigorous production thinking: 48% cost reduction by optimizing for business value rather
than standard metrics. Foundation projects show statistical depth through custom methods achieving
30% improvements beyond typical approaches.
What I bring:
Deep statistical foundation enabling custom analytical methods beyond typical data science
Production-oriented thinking optimizing for business outcomes
Full analytical lifecycle: from stakeholder requirements through deployment
Clear communication translating complexity for diverse decision-makers
Domain experience:
24 years vehicle safety analysis and risk assessment (NHTSA)
3 years national economic indicators (U.S. Census Bureau)
Current fraud detection research portfolio (8+ open-source projects)
Mathematical consulting for AI research and automotive industry
Specialized in anomaly detection, risk quantification, and impact measurement.
Recognized with 21 federal awards for analytical innovation and cross-functional collaboration.
Explore my projects below to see real-world examples of my analytical approach and impact.
Fraud Detection Projects
Self-Initiated R&D via Analysis Insights, LLC (6 months intensive focus):
These projects pair production-oriented thinking with the statistical depth needed to push ML results
beyond what standard approaches achieve. The flagship project is organized around business outcomes from
end to end, while the foundation projects show how rigorous statistical methods improve ML accuracy and reveal hidden patterns.
While most portfolio projects optimize for F1-score, this system optimizes for total business cost,
a production orientation that goes beyond typical portfolio work.
Hybrid Architecture: Combines rule-based logic (impossible travel, burst detection, velocity thresholds)
with a Random Forest model (18 features) to get both explainability and nuance. Rules provide instant explanations for blocked transactions;
the ML model captures subtle patterns the rules miss.
Rigorous Production Methodology: Grid search over 512 threshold combinations using proper
validation methodology. Cost function incorporates realistic payment industry economics (EMV liability, dispute modeling,
customer churn)—business-driven optimization that yielded $40K additional savings (20% improvement)
vs. arbitrary thresholds. Demonstrates focus on business value rather than standard metrics.
Realistic Issuer Economics: Modeled the EMV liability shift (merchants pay for 85% of card-present fraud post-2015),
3D Secure adoption (15% of U.S. merchants vs. 80% in Europe), dispute probability patterns (30% for transactions under $10, 95% for those over $500),
customer churn (2% churn after a false positive × $2,000 lifetime value), and interchange revenue loss (2%).
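To make the cost-driven tuning concrete, here is a minimal sketch of how such a business-cost objective and threshold grid search can fit together. It is not the project's code: the function names, the grid, and the simplified dispute and liability treatment are assumptions made for illustration; only the dollar figures and rates echo those listed above.

```python
import itertools
import numpy as np

# Illustrative economics, echoing the figures above (hypothetical code, not the project's).
CHURN_RATE   = 0.02    # probability a customer churns after a false decline
CUSTOMER_LTV = 2_000   # lifetime value lost when that happens
INTERCHANGE  = 0.02    # interchange revenue lost on a falsely declined sale

def dispute_probability(amounts):
    """Rough dispute-probability curve: ~30% under $10, ~95% over $500."""
    return np.interp(amounts, [10.0, 500.0], [0.30, 0.95])

def total_business_cost(y_true, flagged, amounts):
    """Expected dollar cost of a set of block decisions (y_true: 1 = fraud, flagged: blocked)."""
    missed = (y_true == 1) & ~flagged
    false_pos = (y_true == 0) & flagged
    cost_missed_fraud = np.sum(amounts[missed] * dispute_probability(amounts[missed]))
    cost_false_declines = (np.sum(amounts[false_pos] * INTERCHANGE)
                           + false_pos.sum() * CHURN_RATE * CUSTOMER_LTV)
    return cost_missed_fraud + cost_false_declines

def tune_thresholds(y_true, rule_score, ml_score, amounts):
    """Grid search over rule/ML threshold pairs on validation data, keeping the cheapest pair.
    (The actual system searched 512 combinations across several thresholds.)"""
    grid = np.linspace(0.05, 0.95, 16)
    best_cost, best_pair = np.inf, None
    for t_rule, t_ml in itertools.product(grid, grid):
        flagged = (rule_score >= t_rule) | (ml_score >= t_ml)
        cost = total_business_cost(y_true, flagged, amounts)
        if cost < best_cost:
            best_cost, best_pair = cost, (t_rule, t_ml)
    return best_pair, best_cost
```

The quantity being minimized is dollars rather than F1-score, so the chosen thresholds inherit the payment-industry economics directly.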
Supporting research that combines mathematical rigor with modern ML. These projects show how statistical
expertise reveals insights ML alone misses, delivers accuracy improvements beyond standard approaches,
and enables analytical solutions that require mathematical depth most data scientists lack.
Statistical Enhancements to Machine Learning
Demonstrating how deep statistical expertise reveals insights and improves accuracy beyond what ML alone achieves.
Squeezing More Info Out of Fraud Data with Statistics
Your machine learning models are predicting fraud really well. But are they giving you the full picture?
Are they giving you the critical clues you need to understand today's fraud threat? In this project,
I give an example where they don't and show how to use statistical anomaly detection to fill in the gaps.
∑anomaly detection🧠machine learning🐍Python🔍enhancing ML with statistics
How Bad Could This Emergent Fraud Be?
You have uncovered a new way that criminals are committing fraud, with 3 unrelated cases out of 1,000 transactions.
How pervasive could this new fraud type be? Can answers from generative AI or a Stats 101 webpage be trusted?
In this project, I show how "AI/Stats 101" can lead you astray, underestimating the fraud rate upper confidence bound by as much as 30%,
and how to get the right answers with rigorous statistical methods—demonstrating mathematical depth that most data scientists lack.
∑statistical distributions∑quantifying uncertainty🐍Python🔍rigorous stats beyond standard approaches
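As a rough illustration of the gap this project explores, here is a minimal sketch (not the project's code) comparing the textbook Wald upper bound with the exact Clopper-Pearson bound for 3 cases in 1,000 transactions:

```python
import numpy as np
from scipy import stats

k, n = 3, 1_000          # observed fraud cases, transactions examined
p_hat = k / n
alpha = 0.05             # 95% two-sided confidence level

# "Stats 101" Wald upper bound: p_hat + z * sqrt(p_hat * (1 - p_hat) / n)
z = stats.norm.ppf(1 - alpha / 2)
wald_upper = p_hat + z * np.sqrt(p_hat * (1 - p_hat) / n)

# Exact Clopper-Pearson upper bound via the beta distribution
exact_upper = stats.beta.ppf(1 - alpha / 2, k + 1, n - k)

print(f"Wald upper bound:            {wald_upper:.4%}")
print(f"Clopper-Pearson upper bound: {exact_upper:.4%}")
# With this few events, the Wald bound sits well below the exact bound,
# understating how pervasive the new fraud type could plausibly be.
```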
Medical Upcoding Analysis (Synthetic Data)
A self-funded employer's Third Party Administrator flags dermatology upcoding. Providers claim "no difference" (p=0.0725 via claim-level t-test).
This project uses synthetic data to demonstrate how proper provider-level clustering analysis exposes upcoding patterns the t-test misses entirely.
Complete reproducible analysis with Jupyter notebook, polished HTML report, and LinkedIn-ready visualization.
∑hierarchical clustering∑unit of analysis∑t-test critique🔍healthcare analytics🔍fraud detection🐍Python📊Jupyter📄HTML reports
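To illustrate the unit-of-analysis point, here is a hypothetical sketch rather than the project notebook; the column names and the robust z-score view are assumptions, but the contrast between pooling claims and analyzing per-provider rates is the one the project draws.

```python
import pandas as pd
from scipy import stats

# Assumed layout: one row per claim with columns provider_id and is_level5
# (1 if the highest-intensity code was billed).

def claim_level_t_test(claims: pd.DataFrame, flagged_ids: set):
    """The providers' argument: pool all claims and t-test flagged vs. peer claims."""
    flagged = claims["provider_id"].isin(flagged_ids)
    return stats.ttest_ind(claims.loc[flagged, "is_level5"],
                           claims.loc[~flagged, "is_level5"],
                           equal_var=False)

def provider_level_view(claims: pd.DataFrame) -> pd.DataFrame:
    """Shift the unit of analysis to the provider: each provider contributes one rate,
    so a handful of extreme billers stand out even when pooled claim means look similar."""
    rates = claims.groupby("provider_id")["is_level5"].agg(rate="mean", n_claims="size")
    center = rates["rate"].median()
    mad = (rates["rate"] - center).abs().median()
    rates["robust_z"] = (rates["rate"] - center) / (1.4826 * mad)
    return rates.sort_values("robust_z", ascending=False)
```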
Business Optimization
Analyzing trade-offs between fraud capture, false positives, and investigation costs to optimize business outcomes.
Business Optimization Analysis Series
Three related projects analyzing critical business trade-offs:
(1) Fraud-FPR trade-off analysis comparing 8 ML models to identify the optimal balance of 95%+ fraud capture with the lowest false positive rate;
(2) Investigative staffing optimization determining optimal resource allocation given $75 false positive vs. $1,500 missed fraud costs (see the sketch below);
(3) Cost-sensitive model comparison evaluating cost-sensitive XGBoost against imbalance techniques using custom CardPrecision@k/CardRecall@k metrics.
🧠cost-sensitive learning🧠model comparison🧠Precision@k and Recall@k🔍achieving business objectives🐍Python
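A minimal sketch of the staffing trade-off behind project (2), assuming the $75 and $1,500 unit costs quoted above; the function and variable names are illustrative, not taken from the actual analysis.

```python
import numpy as np

COST_FALSE_POSITIVE = 75      # cost to investigate a legitimate transaction
COST_MISSED_FRAUD = 1_500     # average loss when a fraudulent transaction slips through

def expected_cost(y_true, y_score, threshold):
    """Total cost of alerting on every transaction scored at or above the threshold."""
    alerts = y_score >= threshold
    false_positives = np.sum(alerts & (y_true == 0))
    missed_frauds = np.sum(~alerts & (y_true == 1))
    return false_positives * COST_FALSE_POSITIVE + missed_frauds * COST_MISSED_FRAUD

def best_threshold(y_true, y_score):
    """Scan candidate thresholds on validation data and keep the cheapest one;
    the implied alert volume then sets the investigative staffing level."""
    candidates = np.unique(y_score)
    costs = [expected_cost(y_true, y_score, t) for t in candidates]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]
```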
ML Model Benchmarking & Deployment
Systematic model comparison and hands-on deployment demonstrations using modern data platforms.
Fraud Detection Machine Learning Blog
Benchmarked 8 ML algorithms (SVM, XGBoost, logistic regression, random forest, neural networks, k-nearest neighbors, and 2 decision trees) using custom CardPrecision@30 and CardRecall@30 metrics designed for fraud detection's unique challenges.
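For readers unfamiliar with these metrics, here is one way a card-level Precision@k and Recall@k can be computed, sketched purely as illustration; the blog defines the exact versions used in the benchmark, and the column names here are assumptions.

```python
import pandas as pd

# Assumed layout: one row per transaction with columns day, card_id,
# y_true (1 = fraud) and y_score (model risk score).

def _daily_card_table(day_df: pd.DataFrame) -> pd.DataFrame:
    """Collapse a day's transactions to one row per card: riskiest score, any fraud."""
    return day_df.groupby("card_id").agg(score=("y_score", "max"),
                                         had_fraud=("y_true", "max"))

def card_precision_at_k(df: pd.DataFrame, k: int = 30) -> float:
    """Share of the k highest-risk cards each day that really had fraud, averaged over days."""
    daily = [_daily_card_table(d).nlargest(k, "score")["had_fraud"].mean()
             for _, d in df.groupby("day")]
    return float(pd.Series(daily).mean())

def card_recall_at_k(df: pd.DataFrame, k: int = 30) -> float:
    """Share of each day's compromised cards that appear among the k highest-risk cards."""
    daily = []
    for _, d in df.groupby("day"):
        cards = _daily_card_table(d)
        total_fraud_cards = cards["had_fraud"].sum()
        if total_fraud_cards > 0:
            caught = cards.nlargest(k, "score")["had_fraud"].sum()
            daily.append(caught / total_fraud_cards)
    return float(pd.Series(daily).mean())
```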
Databricks Fraud Detection Dashboard
An interactive dashboard built on Databricks for monitoring fraud analytics, featuring suspicious transaction
identification and model performance tracking.
🚀dashboard🚀Databricks🧠XGBoost🐍Python🗄️SQL
Fraud Detection API on Hugging Face Spaces
A Streamlit application deployed on Hugging Face Spaces that lets users enter transaction features and generate
fraud predictions from a deployed XGBoost model.
🚀API🚀Hugging Face🧠XGBoost🐍Python
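A minimal sketch of what a Streamlit scoring app of this kind can look like; the feature names and the model file are placeholders, not the deployed Space.

```python
import joblib
import pandas as pd
import streamlit as st

# Placeholder model file; the deployed Space loads its own trained XGBoost classifier.
model = joblib.load("xgboost_fraud_model.joblib")

st.title("Fraud Prediction Demo")
amount = st.number_input("Transaction amount", min_value=0.0, value=50.0)
hour = st.slider("Hour of day", 0, 23, 12)
tx_last_24h = st.number_input("Transactions on this card in the last 24h", min_value=0, value=3)

if st.button("Score transaction"):
    features = pd.DataFrame([{"amount": amount, "hour": hour, "tx_last_24h": tx_last_24h}])
    fraud_probability = model.predict_proba(features)[0, 1]
    st.metric("Estimated fraud probability", f"{fraud_probability:.1%}")
```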
Snowflake and dbt Demonstration
This demonstration builds a dbt pipeline in Snowflake to ingest, validate, and document the IEEE-CIS Fraud Detection dataset,
with schema management, automated data quality checks, and reproducible process documentation.
🚀Snowflake🚀dbt🧠XGBoost🐍Python🗄️SQL
Technical Expositions
Deep dives into the mathematical foundations and domain-specific considerations of fraud detection ML.
The Math Behind Fraud Detection with Logistic Regression
A write-up on the math behind fraud detection, illustrated with logistic regression. Why look at the math?
Because you need to understand the math to adapt models to accommodate particularities in the data and address specific business objectives.
∑optimization🧠logistic regression🧠tuning hyperparameters🧠regularization🔍enhancing ML with math
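For orientation, the regularized objective such a model minimizes can be written (in standard notation, not necessarily the write-up's) as

$$\min_{\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i \log \sigma(x_i^{\top}\beta) + (1-y_i)\log\big(1-\sigma(x_i^{\top}\beta)\big)\Big] + \lambda\, R(\beta), \qquad \sigma(z)=\frac{1}{1+e^{-z}},$$

where $R(\beta)$ is an L1 or L2 penalty and $\lambda$ is the regularization strength tuned as a hyperparameter; the write-up discusses how these pieces become the levers for adapting the model to the data and to business objectives.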
What's the Same and What's Different in Fraud Detection
A write-up on how applying data science to fraud detection is similar to, and different from, data science applied to other domains.
Understanding these similarities and differences is key to successfully adapting data science techniques to fraud detection.
🧠imbalanced & cost-sensitive learning🧠class imbalance🧠prequential validation🧠Precision@k🔍understanding ML with math
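One of those differences, prequential validation, can be sketched roughly as follows (hypothetical code, not the write-up's): train on a block of past days, skip a delay period while fraud labels mature, then evaluate on a later day, rolling the window forward.

```python
import pandas as pd

def prequential_splits(df: pd.DataFrame, train_days: int = 7,
                       delay_days: int = 7, n_folds: int = 4):
    """Yield (train, test) splits that respect time order: a block of training days,
    a delay gap for late-arriving fraud labels, then a single test day."""
    days = sorted(df["day"].unique())
    for fold in range(n_folds):
        test_idx = fold + train_days + delay_days
        if test_idx >= len(days):
            break
        train_block = days[fold: fold + train_days]
        yield df[df["day"].isin(train_block)], df[df["day"] == days[test_idx]]
```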
Customized Data Analyses
📋 Coming soon: Detailed write-ups for these projects.
To illustrate the analytical challenges I have solved, I use fully simulated data and
altered contexts so as not to reveal any non-public information.
These examples showcase my problem-solving approach and custom analytical solutions.
This section spotlights challenging, custom data analysis problems I have solved in settings
where standard approaches often fall short.
When I was learning and teaching math and statistics, I might have wondered how often the problems I would
later encounter in the "real world" would be solved by simple cookie-cutter applications of the formulas and
techniques I was learning or teaching. It turns out, not very often.
Most of the time, the data being used or the question being asked deviated from standard protocols in
some way (e.g. involving a ratio, rare events, or reporting lag). Or the client knew what s/he wanted in
general-but-somewhat-ambiguous terms that didn't quite translate into math. Or the technique involved an
approximation and it wasn't quite clear if the approximation would be good enough for the client's
requirements.
I don't know what "most" analysts do in these situations. Some (or many?) might lack the in-depth math
understanding needed to address such issues head-on and instead default to applying cookie-cutter techniques that
might not give accurate answers. They might or might not be able to explain the limitations of their
simplified analysis to their client. The client walks away with what they think are solid conclusions,
but they aren't.
I approach these situations differently. I enjoy the challenge of formalizing ambiguous problems,
identifying and addressing deviations from standard protocols, and determining whether approximations
are good enough. I have the math and statistics background needed to tackle these issues head-on,
and I can explain the limitations of various approaches to my clients so they can make informed decisions.
The examples below illustrate the kinds of customized data analysis problems I have solved.
To showcase the value of these custom analyses, I sometimes also show the conclusions a cookie-cutter approach,
such as one you might get from generative AI, a Stats 101 website, or a lesser-equipped analyst, would have produced.
What Can I Conclude from This Data?
You conducted a test and you have results. What can you conclude from your results? Are these two
groups you tested different? Did this countermeasure work? What if you got more data?
How Should I Do This? Design a Plan for Me. (Coming soon)
You know what you want but not how to do it. Maybe you want to estimate the rate of occurrence
of a rare event. Maybe you want to know if a countermeasure worked. Or maybe something else.
Doing it right can mean the difference between reaching valid conclusions and not,
and it can yield more precise answers at lower cost.
How Good Is This Plan? (Coming soon)
You know what you want, and you have a plan. How robust is it? Or you have candidate plans.
Which plan should you choose?
∑reporting lag∑imputation🧠logistic regression🧠false positive rate, false negative rate∑conditional probability
Big Questions (Coming soon)
You've got the small picture nailed. What's the big picture? You want to aggregate small-scale impacts
(like individual test results) into big picture impacts (nationwide implications). Maybe you have results
from various studies to incorporate. Or you want to know the large-scale implications of what-if scenarios.
∑quantifying uncertainty🧠modeling
Vehicle Safety Work
My vehicle safety work includes extensive collaboration with engineers and behavioral scientists to
design studies, analyze data, and evaluate safety interventions. While some analyses were unpublished
or advisory, the published projects below demonstrate impactful modeling and statistical innovation
used to improve vehicle safety and policy. For illustrative examples of unpublished analytic work
using simulated data and made-up contexts—including fraud-themed examples—see the
Customized Data Analyses section and the
Fraud Detection Projects section.
Report to Congress, Vehicle Safety Recall Completion Rates 2021
This report analyzes trends in vehicle recall completion rates and identifies risk factors associated with low compliance.
(Recall completion rates indicate the share of recalled vehicles that have been repaired or otherwise remedied.)
I conducted the modeling and drafted sections IIIc, IIId, V, and VI. In addition to the Williams-adjusted
fixed-effects logistic regression described in the report, I also built decision trees and generalized linear models,
using LASSO, stepwise selection, and k-fold cross-validation. My champion model was implemented by NHTSA to
identify low-performing recalls for follow-up.
∑LASSO∑stepwise selection∑generalized linear models🧠decision trees🧠k-fold cross validation🗄️SAS🗄️SQL
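The report's modeling was done in SAS; purely as an illustration of the workflow, an analogous LASSO-penalized logistic regression selected by k-fold cross-validation looks like this in Python (names and settings are placeholders):

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_lasso_logistic(X, y, n_folds: int = 5):
    """L1-penalized (LASSO) logistic regression; the penalty strength is chosen by
    k-fold cross-validation, and coefficients shrunk to zero drop out of the model."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(Cs=20, cv=n_folds, penalty="l1",
                             solver="saga", max_iter=5_000, scoring="neg_log_loss"),
    )
    return model.fit(X, y)
```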
An Analysis of Recent Improvements to Vehicle Safety
A study of improvements to vehicle safety, using negative binomial, log-linear, logistic, generalized logistic, and cumulative logistic models.
The study showed that these improvements collectively prevented over 700,000 crashes in a single year and prevented or mitigated over one million injuries.
Designing Samples to Satisfy Many Variance Constraints, 2001 FCSM
This paper presents, with proof, an algorithm that finds optimal sample sizes meeting nested univariate constraints on the coefficients of variation of a Horvitz-Thompson estimator under stratified simple random sampling.
∑convex optimization∑precision requirements🔍improving statistics with math
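In standard survey-sampling notation (not necessarily the paper's), the quantities being constrained are

$$\operatorname{Var}\!\big(\widehat{Y}_{\mathrm{HT}}\big)=\sum_{h=1}^{H} N_h^{2}\left(1-\frac{n_h}{N_h}\right)\frac{S_h^{2}}{n_h}, \qquad \mathrm{CV}\big(\widehat{Y}_{\mathrm{HT}}\big)=\frac{\sqrt{\operatorname{Var}\big(\widehat{Y}_{\mathrm{HT}}\big)}}{Y}\le c,$$

with the algorithm choosing the stratum sample sizes $n_1,\dots,n_H$ that meet a nested family of such coefficient-of-variation constraints optimally.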
Estimating the Lives Saved by Safety Belts and Air Bags, 2003 ESV
This paper, presented at the 2003 International Technical Conference on the Enhanced Safety of Vehicles (ESV), describes changes to the calculations of the lives saved by safety belts and air bags.
It also discusses alternative methods for attributing a life saved to the safety belt or the air bag, for occupants protected by both devices.
NHTSA's Review of the National Automotive Sampling System, Report to Congress
I conducted the analysis in Chapter 8 of this report, which calculates the recommended numbers of investigations, crash reports, and data collection sites to use for NHTSA's two premier crash databases (now called the Crash Report Sampling System and
Crash Investigation Sampling System). This chapter, which I drafted, also presents the analyses that could be conducted and the conclusions that could be reached with the recommended sample sizes.
With expertise in anomaly detection, risk assessment, and impact measurement, I bring deep statistical
rigor and production-oriented thinking to complex analytical challenges. Currently available for
consulting through Analysis Insights, LLC and open to full-time opportunities.