Automated Data Harmonization

Principal Investigator

Name
Mohammed Eslami

Degrees
Ph.D.

Institution
Netrias, LLC

Position Title
Chief Scientist

Email
meslami@netrias.com

About this CDAS Project

Study
PLCO

Project ID
PLCO-1789

Initial CDAS Request Approval
Jan 13, 2025

Title
Automated Data Harmonization

Summary
Some PLCO questionnaires let respondents enter free-text answers to certain questions. The MUQ is one of the few in which the majority of responses are entered manually. The data consist primarily of drug names. We want to see whether we can develop a language model to automatically harmonize those entries.

Aims

Aim 1: Conduct an exploratory analysis of the responses provided for the medication list to identify variations (e.g., typos, specific acronyms, etc.) often used for drug names.

Aim 2: Develop a set of data generators to produce identified variations using the therapeutic_agent entity in the NCI Thesaurus.

Aim 3: Train and evaluate a harmonization model with the data generators to semi-automatically standardize responses for medication names.
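
As an illustration of the kind of data generator described in Aim 2, the following is a minimal sketch in Python. It is not the project's implementation. The function name `make_variants`, the three edit operations (drop, swap, double a character), and the example drug names are assumptions chosen for illustration; in practice the canonical names would come from the NCI Thesaurus, and the variation types would come from the exploratory analysis in Aim 1.

```python
import random


def make_variants(name, n=5, seed=0):
    """Return up to n distinct synthetic misspellings of a canonical name.

    Hypothetical helper: applies one random character-level edit per variant.
    """
    if len(name) < 2:
        return []
    rng = random.Random(seed)  # seeded so the output is reproducible
    ops = ("drop", "swap", "double")
    variants = set()
    attempts = 0
    # Bound the number of attempts so short names cannot loop forever.
    while len(variants) < n and attempts < 50 * n:
        attempts += 1
        i = rng.randrange(len(name) - 1)
        op = rng.choice(ops)
        if op == "drop":      # omit one character
            v = name[:i] + name[i + 1:]
        elif op == "swap":    # transpose two adjacent characters
            v = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        else:                 # type one character twice
            v = name[:i] + name[i] * 2 + name[i + 1:]
        if v != name:
            variants.add(v)
    return sorted(variants)


# Example canonical names (assumed for illustration only).
canonical = ["ibuprofen", "metformin"]

# (noisy entry, canonical name) pairs of the kind a harmonization
# model in Aim 3 could be trained and evaluated on.
pairs = [(v, c) for c in canonical for v in make_variants(c)]
```

Each pair maps a noisy entry back to its canonical name, so the same generator can supply both training examples and a held-out evaluation set by varying the seed.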

Collaborators

NA