Automated Data Harmonization
Principal Investigator
Name
Mohammed Eslami
Degrees
Ph.D.
Institution
Netrias, LLC
Position Title
Chief Scientist
Email
meslami@netrias.com
About this CDAS Project
Study
IDATA
Project ID
IDATA-92
Initial CDAS Request Approval
Oct 22, 2025
Title
Automated Data Harmonization
Summary
The IDATA study includes a range of questionnaires and activity logs in which participants enter free-text responses. These include dietary entries not found in food lists (ms_responseos_asa24), free-text activity descriptions (activity1-3, champs_other_activity_a1-2), and reasons for device removal (device_off_reason_1-17, pal_coded_activity1-4, pal_ag1_activity_code_1-4). Because they are open text, these inputs vary through typos, abbreviations, and inconsistent phrasing. Our objective is to explore the feasibility of using language models to automatically map such entries onto structured, standardized ontologies (e.g., food or physical activity categories), improving data quality and downstream usability.
Aims
Aim 1: Conduct an exploratory analysis of participant-entered responses in IDATA (e.g., foods, activities, reasons for device removal) to characterize common variation in spelling and phrasing, as well as semantic ambiguity.
Aim 2: Develop data augmentation tools that simulate these variations using terms drawn from existing dietary and activity ontologies (e.g., USDA FoodData Central, Compendium of Physical Activities).
Aim 3: Train and evaluate a harmonization model capable of mapping free-text inputs to standardized categories (e.g., “walk dog” → “walking the dog”), supporting semi-automated cleaning of diet and activity data.
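As a rough illustration of Aims 2 and 3, the Python sketch below simulates typo-like variants of ontology terms and maps free text back to the closest standardized term. It is a minimal sketch under stated assumptions, not the project's method: standard-library string similarity (difflib) stands in for the language model the project proposes, and the vocabulary, function names, and cutoff value are hypothetical rather than drawn from IDATA or any ontology.

```python
import difflib
import random

# Hypothetical target vocabulary (assumption). A real run would draw terms
# from an ontology such as the Compendium of Physical Activities.
CANONICAL_TERMS = ["walking the dog", "running", "bicycling", "swimming laps"]


def augment(term, rng):
    """Return a noisy variant of `term` using one random character-level edit.

    The edit drops, swaps, or duplicates a character. It can occasionally
    leave the term unchanged (e.g., when two identical letters are swapped).
    """
    chars = list(term)
    i = rng.randrange(len(chars))
    op = rng.choice(["drop", "swap", "duplicate"])
    if op == "drop" and len(chars) > 1:
        del chars[i]
    elif op == "swap" and i + 1 < len(chars):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        chars.insert(i, chars[i])
    return "".join(chars)


def harmonize(text, vocabulary, cutoff=0.6):
    """Map free text to the most similar vocabulary term, or None if none is close.

    String similarity is a stand-in here for a trained harmonization model.
    """
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def recovery_rate(vocabulary, n_variants, seed=0):
    """Fraction of simulated variants that map back to their source term."""
    rng = random.Random(seed)
    hits = 0
    total = 0
    for term in vocabulary:
        for _ in range(n_variants):
            hits += harmonize(augment(term, rng), vocabulary) == term
            total += 1
    return hits / total


if __name__ == "__main__":
    print(harmonize("walk dog", CANONICAL_TERMS))
    print(recovery_rate(CANONICAL_TERMS, n_variants=20))
```

Returning None when no term clears the cutoff leaves unmatched entries for manual review, which fits the semi-automated cleaning described in Aim 3.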
Collaborators
NA