
Automated Data Harmonization

Principal Investigator

Name
Mohammed Eslami

Degrees
Ph.D.

Institution
Netrias, LLC

Position Title
Chief Scientist

Email
meslami@netrias.com

About this CDAS Project

Study
IDATA

Project ID
IDATA-92

Initial CDAS Request Approval
Oct 22, 2025

Title
Automated Data Harmonization

Summary
The IDATA study includes a range of questionnaires and activity logs in which participants enter free-text responses. These include dietary entries not found in food lists (ms_responseos_asa24), free-text activity descriptions (activity1-3, champs_other_activity_a1-2), and reasons for device removal (device_off_reason_1-17, pal_coded_activity1-4, pal_ag1_activity_code_1-4). Because they are typed freely, these inputs vary through typos, abbreviations, and inconsistent phrasing. Our objective is to explore the feasibility of using language models to automatically harmonize such entries to terms in structured, standardized ontologies (e.g., food or physical activity categories), improving data quality and downstream usability.

Aims

Aim 1: Conduct an exploratory analysis of participant-entered responses in IDATA (e.g., foods, activities, device off reasons) to characterize common variations in spelling, phrasing, or semantic ambiguity.

Aim 2: Develop data augmentation tools that simulate these variations using terms drawn from existing dietary and activity ontologies (e.g., USDA FoodData Central, Compendium of Physical Activities).

Aim 3: Train and evaluate a harmonization model capable of mapping free-text inputs to standardized categories (e.g., “walk dog” → “walking the dog”), supporting semi-automated cleaning of diet and activity data.
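
As a rough illustration of Aims 2 and 3, the sketch below generates noisy variants of ontology terms and maps free-text entries back to a canonical term. It is not the project's method: the aims call for a language model, whereas this uses simple string similarity from Python's standard `difflib` module as a stand-in baseline. The term list, the function names `augment` and `harmonize`, and the 0.6 similarity cutoff are all assumptions made for illustration.

```python
import difflib
import random

# Hypothetical canonical terms; a real run would draw these from an ontology
# such as the Compendium of Physical Activities.
CANONICAL_TERMS = ["walking the dog", "running", "bicycling", "swimming laps"]


def augment(term, rng):
    """Return a noisy variant of a non-empty term using one character edit.

    The edit is a deletion, an adjacent swap, or a duplication, chosen at
    random. This only simulates typos, not abbreviations or rephrasing.
    """
    chars = list(term)
    i = rng.randrange(len(chars))
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete" and len(chars) > 1:
        del chars[i]
    elif op == "swap" and i + 1 < len(chars):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        # Fallback when the chosen edit is not applicable at position i.
        chars.insert(i, chars[i])
    return "".join(chars)


def harmonize(text, terms=CANONICAL_TERMS, cutoff=0.6):
    """Map a free-text entry to the closest canonical term, or None.

    Similarity is difflib's character-level ratio; the cutoff is an
    assumed value and would need tuning against labeled data.
    """
    normalized = " ".join(text.lower().split())
    matches = difflib.get_close_matches(normalized, terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the noisy variants are repeatable
    for term in CANONICAL_TERMS:
        noisy = augment(term, rng)
        print(f"{noisy!r} -> {harmonize(noisy)!r}")
    print(harmonize("walk dog"))
```

Entries that fall below the cutoff return `None`, which fits the semi-automated workflow described in Aim 3: unmatched responses can be routed to manual review rather than forced into a category.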

Collaborators

NA