Validation of CA-125 Kinetic Models for Early Ovarian Cancer Detection Using Machine Learning
Principal Investigator
Name
Fahad Kiani
Institution
CrisPRO.ai
Position Title
Principal Investigator
Email
fahad@crispro.ai
About this CDAS Project
Study
PLCO
Project ID
PLCO-2010
Initial CDAS Request Approval
Feb 23, 2026
Title
Validation of CA-125 Kinetic Models for Early Ovarian Cancer Detection Using Machine Learning
Summary
Background:
Ovarian cancer remains the deadliest gynecologic malignancy, with 5-year survival rates below 30% for advanced-stage disease. Early detection is critical, but current screening approaches using CA-125 alone have limited sensitivity and specificity. The CA-125 ELIMination rate constant (KELIM) has emerged as a validated biomarker of tumor chemosensitivity in treatment settings, demonstrating that longitudinal CA-125 kinetics provide more information than single-timepoint measurements.
While KELIM has been extensively validated for predicting treatment response and survival in diagnosed patients, the kinetics of CA-125 rise before diagnosis, in screening populations, have not been systematically characterized using modern machine learning approaches. Understanding pre-diagnostic CA-125 trajectories could enable earlier detection and risk stratification, potentially improving outcomes through earlier intervention.
Rationale:
Recent studies demonstrate that CA-125 velocity in screening cohorts differs dramatically between incident cancer cases (19.75 U/mL/month) and controls (0.035 U/mL/month), a roughly 560-fold difference. However, optimal modeling approaches for extracting predictive signal from sparse, irregularly sampled longitudinal CA-125 data remain unclear. Machine learning methods may capture non-linear kinetic patterns that traditional linear models miss.
Objectives:
This project will develop and validate machine learning models for early ovarian cancer detection using longitudinal CA-125 kinetics from the PLCO Ovarian Screening Trial. We will:
1. Characterize CA-125 kinetic features (velocity, acceleration, inflection points) in incident cancers versus controls using the ovar_screen dataset (~151,000 screening records from ~78,000 women, 72 incident cancers).
2. Develop predictive models comparing baseline CA-125 alone versus kinetic features (velocity, KELIM-like scores, trajectory shape) versus combined approaches.
3. Assess model performance (AUROC, sensitivity, specificity) at different lead times before diagnosis (6 months, 12 months, 18+ months pre-diagnosis).
4. Validate findings using cross-validation and identify optimal kinetic windows for maximum predictive value.
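As an illustration of the AUROC metric named in objective 3, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen case receives a higher model score than a randomly chosen control, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the PLCO data; a production analysis would more likely rely on an established statistics library.

```python
from typing import Sequence


def auroc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case score and a control score count as half a win.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control score")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical scores for illustration only.
example_cases = [0.9, 0.8, 0.4]
example_controls = [0.5, 0.3, 0.2, 0.1]
example_auc = auroc(example_cases, example_controls)  # 11 of 12 pairs ranked correctly
```

The pairwise loop is quadratic in the number of scores, which stays manageable for a cohort with tens of cases and tens of thousands of controls.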
Methods:
We will extract serial CA-125 measurements (ca125_level, ca125_days) for all PLCO participants with at least 2 measurements. For each patient, we will compute longitudinal features including velocity (change in CA-125 per unit time), acceleration (change in velocity), and KELIM-like elimination constants. We will handle irregular sampling using interpolation and mixed-effects models.
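A minimal sketch of the velocity and acceleration features described above, assuming finite differences between consecutive measurements. The function name, the days-to-months conversion factor, and the choice to assign each velocity to the midpoint of its interval are illustrative assumptions and not the project's stated method.

```python
from typing import List, Sequence, Tuple

DAYS_PER_MONTH = 30.4375  # assumption: average calendar month length


def kinetic_features(
    days: Sequence[float], levels: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (velocities, accelerations) for one participant.

    days   -- measurement times in days; spacing may be irregular
    levels -- CA-125 values in U/mL, same length as days
    Velocities are in U/mL/month, accelerations in U/mL/month^2.
    """
    if len(days) != len(levels):
        raise ValueError("days and levels must have the same length")
    # Sort by time so unordered records are handled consistently.
    pairs = sorted(zip(days, levels))
    t = [d / DAYS_PER_MONTH for d, _ in pairs]
    y = [v for _, v in pairs]

    velocities: List[float] = []
    midpoints: List[float] = []
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        if dt <= 0:
            continue  # skip duplicate timestamps
        velocities.append((y[i] - y[i - 1]) / dt)
        midpoints.append((t[i] + t[i - 1]) / 2)

    # Acceleration: change in velocity between consecutive interval midpoints.
    accelerations = [
        (velocities[i] - velocities[i - 1]) / (midpoints[i] - midpoints[i - 1])
        for i in range(1, len(velocities))
    ]
    return velocities, accelerations


# Hypothetical annual measurements for illustration only.
vel, acc = kinetic_features([0, 365.25, 730.5], [10.0, 12.0, 20.0])
```

With two measurements a participant yields one velocity and no acceleration; at least three measurements are needed for an acceleration estimate.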
Machine learning approaches will include logistic regression (interpretable baseline), random forests (capture non-linear patterns), and gradient boosting (state-of-the-art performance). We will use time-stratified cross-validation to prevent data leakage and assess performance at specific lead times before diagnosis.
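One way to evaluate a model at a given lead time without leaking post-cutoff information is to discard every measurement taken after the cutoff before computing features. The sketch below illustrates that idea; the function name and the `dx_day` argument (day of diagnosis on the same time axis as the measurements) are hypothetical, and the proposal does not specify how controls would be handled.

```python
from typing import List, Optional, Sequence, Tuple


def truncate_at_lead_time(
    days: Sequence[float],
    levels: Sequence[float],
    dx_day: Optional[float],
    lead_days: float,
) -> Tuple[List[float], List[float]]:
    """Keep only measurements taken at least lead_days before diagnosis.

    dx_day is None for controls, whose series is returned unchanged
    (assumption for this sketch).
    """
    if len(days) != len(levels):
        raise ValueError("days and levels must have the same length")
    if dx_day is None:
        return list(days), list(levels)
    cutoff = dx_day - lead_days
    kept = [(d, v) for d, v in zip(days, levels) if d <= cutoff]
    return [d for d, _ in kept], [v for _, v in kept]


# Hypothetical case diagnosed on day 1200, evaluated at a 365-day lead time.
kept_days, kept_levels = truncate_at_lead_time(
    [0, 365, 730, 1095], [10.0, 11.0, 15.0, 40.0], dx_day=1200, lead_days=365
)
```

Leaving controls untruncated gives them longer series than cases on average, so an actual analysis would probably need a matched reference date for each control.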
Significance:
This validation will establish whether CA-125 kinetics can improve early ovarian cancer detection beyond current screening approaches. Results will inform the design of next-generation screening algorithms combining CA-125 kinetics with imaging, genetics, and other biomarkers. The validated kinetic features and modeling approaches will be made publicly available to accelerate translation to clinical practice.
All analyses will be computational. No patient-level data will be published. Results will be reported in aggregate form and shared via peer-reviewed publications and open-source code repositories.
Aims
**Aim 1: Characterize pre-diagnostic CA-125 kinetics in incident ovarian cancers versus controls**
• Extract serial CA-125 measurements from the PLCO ovar_screen dataset for all participants with ≥2 measurements
• Compute kinetic features for each patient:
- Velocity: CA-125 change per month (U/mL/month)
- Acceleration: Change in velocity over time
- KELIM-like elimination constant: Modeled rate of CA-125 change
- Trajectory shape: Monotonic rise, plateau-then-rise, or fluctuating
• Compare kinetic distributions between incident cancers (n=72) and controls (n~78,000)
• Assess lead time sensitivity: How far before diagnosis do kinetic signatures emerge?
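The proposal does not define the "KELIM-like" constant precisely. One plausible reading, used here purely as an assumption, is the least-squares slope of ln(CA-125) against time, which is positive when the marker is rising and corresponds to a doubling time of ln(2) divided by the slope. The function names below are hypothetical.

```python
import math
from typing import Sequence


def log_linear_rate(days: Sequence[float], levels: Sequence[float]) -> float:
    """Least-squares slope of ln(CA-125) versus time, in 1/day."""
    if len(days) != len(levels) or len(days) < 2:
        raise ValueError("need at least two paired measurements")
    if any(v <= 0 for v in levels):
        raise ValueError("CA-125 levels must be positive to take logarithms")
    logs = [math.log(v) for v in levels]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in days)
    if sxx == 0:
        raise ValueError("all measurements share the same time point")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    return sxy / sxx


def doubling_time_days(rate: float) -> float:
    """Doubling time implied by a positive exponential rate; inf otherwise."""
    return math.log(2) / rate if rate > 0 else math.inf


# Hypothetical series that doubles every 100 days, for illustration only.
rate = log_linear_rate([0, 100, 200], [10.0, 20.0, 40.0])
t_double = doubling_time_days(rate)
```

Fitting on the log scale makes the estimate insensitive to the participant's absolute CA-125 level, which may help separate a rising trajectory from a stable but high baseline.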
**Aim 2: Develop machine learning models for cancer detection using CA-125 kinetics**
• Baseline model: Logistic regression using CA-125 level at most recent screening (current standard)
• Kinetics-only model: Logistic regression using velocity and acceleration (no baseline level)
• Combined model: Integrate baseline CA-125 level + kinetic features
• Advanced models: Random forest and gradient boosting to capture non-linear kinetic patterns
• Handle irregular sampling: Mixed-effects models and cubic spline interpolation
• Feature selection: Identify most predictive kinetic windows (e.g., 100 days pre-diagnosis vs 365+ days)
**Aim 3: Validate model performance and assess clinical utility**
• Primary outcome: AUROC for cancer detection at different lead times (6, 12, 18+ months pre-diagnosis)
• Secondary outcomes:
- Sensitivity and specificity at clinically relevant thresholds (>90% specificity for screening)
- Positive predictive value in screening population (prevalence ~0.09%)
- Comparison to published CA-125 screening performance metrics
• Validation strategy: Time-stratified 5-fold cross-validation to prevent data leakage
• Subgroup analyses: Stratify by age, histology, stage at diagnosis
• Lead time analysis: Quantify how much earlier kinetic models detect cancer versus baseline CA-125 threshold
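To show why positive predictive value is a demanding metric at the prevalence quoted above, the sketch below applies Bayes' rule to a sensitivity, specificity, and prevalence. The 80% sensitivity is an assumed value chosen only for illustration and is not a result of this project; the function name is hypothetical.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """PPV = P(disease | positive test) via Bayes' rule."""
    for name, value in (
        ("sensitivity", sensitivity),
        ("specificity", specificity),
        ("prevalence", prevalence),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no positive tests expected; PPV is undefined")
    return true_pos / (true_pos + false_pos)


# Prevalence ~0.09% from the text; sensitivity of 0.80 is an assumption.
ppv_at_90 = positive_predictive_value(0.80, 0.90, 0.0009)
ppv_at_995 = positive_predictive_value(0.80, 0.995, 0.0009)
```

Under these assumed inputs, 90% specificity gives a PPV of about 0.7%, and raising specificity to 99.5% gives about 12.6%, which illustrates how strongly the false-positive rate dominates at low prevalence.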
**Aim 4: Develop open-source tools for CA-125 kinetic analysis**
• Create validated software pipeline for computing CA-125 kinetics from sparse longitudinal data
• Document optimal feature engineering approaches for irregular sampling
• Publish trained model coefficients and decision thresholds
• Release code via GitHub for research community use
• Enable future integration with multi-modal screening approaches (imaging, genetics, proteomics)
**Expected Timeline:** Data download (1 day), analysis (3-4 weeks), manuscript preparation (6-8 weeks)
Collaborators
Fahad Kiani, CrisPRO.ai
Rahima Nayeem, CrisPRO.ai