Effective AI for Small Data Challenges: Insights from the PLCO Dataset

Principal Investigator

Name
Vincent Oria

Degrees
PhD

Institution
New Jersey Institute of Technology

Position Title
Professor

Email
oria@njit.edu

About this CDAS Project

Study
PLCO

Project ID
PLCOI-1984

Initial CDAS Request Approval
Dec 5, 2025

Title
Effective AI for Small Data Challenges: Insights from the PLCO Dataset

Summary
While the field of AI is often driven by "Big Data," many crucial real-world problems inherently involve "Small Data" because of constraints on data collection. This research addresses that challenge by establishing a foundational framework for AI in data-scarce environments. We will use the comprehensive PLCO dataset as a controlled source to simulate small-data conditions. The dataset covers approximately 155,000 participants who were followed over many years. Its size allows us to simulate a wide range of "small-data" scenarios by systematically sampling and reducing the full dataset while leaving the original data intact. Such controlled reduction is impossible with a dataset that is small from the outset.
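The controlled reduction described above could be sketched as follows. This is only an illustration under stated assumptions, not the project's actual sampling procedure: the function name `nested_stratified_subsets`, the chosen fractions, and the toy labels are all hypothetical. The sketch draws class-stratified subsets that are nested (each smaller subset is contained in the larger ones) and returns index lists only, so the source data are never modified.

```python
import random
from collections import defaultdict

def nested_stratified_subsets(labels, fractions, seed=0):
    """Return {fraction: sorted row indices}; smaller subsets are nested in larger ones.

    Only index lists are produced, so the full dataset is left untouched.
    """
    rng = random.Random(seed)
    # Group row indices by class label, then shuffle each group exactly once.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)

    subsets = {}
    for frac in fractions:
        chosen = []
        for indices in by_class.values():
            # Keep at least one row per class so no class disappears entirely.
            k = max(1, round(frac * len(indices)))
            # Taking a prefix of one fixed shuffle is what makes the subsets nested.
            chosen.extend(indices[:k])
        subsets[frac] = sorted(chosen)
    return subsets

# Hypothetical toy labels: 90 negatives and 10 positives (not PLCO data).
toy_labels = [0] * 90 + [1] * 10
subsets = nested_stratified_subsets(toy_labels, fractions=[1.0, 0.5, 0.1])
for frac, idx in subsets.items():
    positives = sum(toy_labels[i] for i in idx)
    print(f"fraction={frac}: {len(idx)} rows, {positives} positives")
```

Nesting the subsets means that any change in model performance between two sizes reflects the removed rows rather than a completely different random draw.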
We plan to first use the rich tabular data (in CSV format) from the PLCO dataset. We will perform feature engineering to select the most relevant clinical, demographic, and lifestyle variables. Using these engineered features, we will train and evaluate traditional machine learning algorithms on progressively smaller subsets of the data to understand their limitations and identify the most promising approaches.
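Evaluating a model on progressively smaller training subsets amounts to tracing a learning curve. The sketch below is a minimal illustration under stated assumptions: the data are synthetic (two Gaussian classes, not PLCO variables), a nearest-centroid classifier stands in for whichever traditional algorithms the project ultimately uses, and all function names and sizes are hypothetical.

```python
import random

def make_toy_data(n, rng):
    """Hypothetical two-feature, two-class synthetic data (not PLCO data)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label  # class 0 around (0, 0), class 1 around (2, 2)
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def fit_centroids(train):
    """Nearest-centroid 'training': the per-class mean of each feature."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, test):
    """Fraction of held-out rows classified correctly."""
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)

def learning_curve(train, test, sizes, repeats=20, seed=0):
    """Mean test accuracy per training-set size, averaged over random subsets."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        scores = []
        for _ in range(repeats):
            subset = rng.sample(train, size)  # sampling without replacement
            scores.append(accuracy(fit_centroids(subset), test))
        curve[size] = sum(scores) / len(scores)
    return curve

rng = random.Random(42)
train, test = make_toy_data(2000, rng), make_toy_data(1000, rng)
curve = learning_curve(train, test, sizes=[1000, 100, 10])
for size, score in curve.items():
    print(f"n={size:5d}  mean accuracy={score:.3f}")
```

The test set is held fixed across all training sizes so that scores are comparable, and each size is averaged over several random subsets because a single small subset gives a noisy estimate.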
Following the work with tabular data, we will extend our research to the imaging data available in the PLCO dataset, specifically the chest X-rays and pathology images. This will allow us to investigate the distinct challenges of training diagnostic models on limited visual data and to explore methods that combine tabular and image features for more robust prediction. The multi-modal nature of the PLCO dataset suits this two-pronged approach: it lets us test our methods on both structured and unstructured data, going well beyond simplified toy problems.
The PLCO dataset is not simply a source of data, but a meticulously curated scientific laboratory that provides the perfect environment to address the critical, yet underexplored, challenges of small-data AI in clinical medicine. Its size allows for controlled experimentation, and its richness and multi-modal nature ensure that our findings are both methodologically sound and clinically meaningful.

Aims

The primary objective of this project is to investigate the challenges of working with “small data” in real-world clinical settings. We will use the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) dataset as a large-scale source from which to derive small datasets.
The core of this research involves performing extensive feature engineering, training traditional machine learning algorithms on these small subsets, and developing an efficient model capable of accurately diagnosing disease despite limited data.

Collaborators

Vincent Oria New Jersey Institute of Technology
Arashdeep Kaur New Jersey Institute of Technology
Canan Eren New Jersey Institute of Technology