Automated Data Harmonization

Principal Investigator

Name
Mohammed Eslami

Degrees
Ph.D.

Institution
Netrias, LLC

Position Title
Chief Scientist

Email
meslami@netrias.com

About this CDAS Project

Study
NLST

Project ID
NLST-1439

Initial CDAS Request Approval
Aug 25, 2025

Title
Automated Data Harmonization

Summary
The National Lung Screening Trial (NLST) includes spiral CT and chest X-ray screening forms with radiologist-entered free-text fields such as “Other (SPECIFY)” and “Comments.” These narrative entries often contain multiple clinical observations phrased inconsistently and without standard terminology. Our objective is to evaluate the feasibility of using language models to extract and harmonize clinical concepts from these free-text verbatims. This would enable semi-automated integration of unstructured radiology data into structured formats, improving data quality and supporting downstream analyses.

Aims

Aim 1: Conduct an exploratory analysis of radiologist-entered free-text responses to characterize linguistic variation, concept co-occurrence patterns, and sources of ambiguity in radiologic documentation.

Aim 2: Develop data augmentation tools to simulate variation in radiologic phrasing, drawing on standard terminology from sources such as RadLex and SNOMED CT to generate training examples that reflect real-world input diversity.
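As a rough illustration of the kind of augmentation Aim 2 describes, the sketch below substitutes phrase variants into a seed sentence. The variant table and the function names (`SYNONYMS`, `augment`, `generate_variants`) are hypothetical and hand-written for this example; they are not drawn from RadLex or SNOMED CT, and the project's actual tools may work quite differently.

```python
import random

# Hypothetical variant table, written by hand for illustration only.
# A real table would be derived from a terminology source.
SYNONYMS = {
    "scarring": ["scarring", "scar", "fibrotic change"],
    "left lower lobe": ["left lower lobe", "LLL", "lt lower lobe"],
}


def augment(text, rng):
    """Replace each known phrase with a randomly chosen variant."""
    out = text
    for canonical, variants in SYNONYMS.items():
        if canonical in out:
            out = out.replace(canonical, rng.choice(variants))
    return out


def generate_variants(text, n, seed=0):
    """Draw n augmented samples and return the distinct ones, sorted."""
    rng = random.Random(seed)
    return sorted({augment(text, rng) for _ in range(n)})


if __name__ == "__main__":
    for variant in generate_variants("linear scarring in the left lower lobe", 20):
        print(variant)
```

Seeding the random generator keeps the generated set reproducible, which matters if the variants are later used as training examples.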

Aim 3: Train and evaluate a sequence tagging model to extract and categorize clinical concepts from multi-concept free-text responses. For example, given the radiologist-entered text “linear scarring in the left lower lobe with adjacent pleural thickening,” the model would extract and categorize [“linear scarring” → Finding], [“left lower lobe” → Location], and [“pleural thickening” → Finding]. Tagging provides the boundaries and types of clinical entities needed for downstream harmonization.
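To make the tagging output concrete, the following sketch converts the example above into token-level labels. It assumes a BIO labeling scheme (B- marks the first token of an entity, I- a continuation, O everything else), which is a common convention for sequence tagging but is not stated in the project description. The function `bio_tags` is a hypothetical helper that only prepares labels; it is not the tagging model itself.

```python
def bio_tags(text, entities):
    """Label whitespace tokens with BIO tags for the given (phrase, label) pairs.

    Assumes each phrase appears as a contiguous run of tokens; only the first
    occurrence that is not already tagged gets labeled.
    """
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for phrase, label in entities:
        words = phrase.split()
        width = len(words)
        for i in range(len(tokens) - width + 1):
            span_free = all(t == "O" for t in tags[i:i + width])
            if tokens[i:i + width] == words and span_free:
                tags[i] = "B-" + label
                for j in range(i + 1, i + width):
                    tags[j] = "I-" + label
                break
    return list(zip(tokens, tags))


if __name__ == "__main__":
    text = "linear scarring in the left lower lobe with adjacent pleural thickening"
    entities = [
        ("linear scarring", "Finding"),
        ("left lower lobe", "Location"),
        ("pleural thickening", "Finding"),
    ]
    for token, tag in bio_tags(text, entities):
        print(f"{token}\t{tag}")
```

A trained tagger would predict these labels from the tokens alone; contiguous B-/I- runs then give the entity boundaries and types passed on to harmonization.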

Aim 4: Harmonize extracted concepts to standardized terms within their respective categories. Where feasible, we will identify initial target standards (e.g., from RadLex, SNOMED CT, LOINC, ICD-10-CM) and use harmonization models to align extracted concepts to those standards, expanding coverage over time as needed.
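A minimal sketch of category-scoped harmonization follows. It uses plain string similarity from Python's standard `difflib` module as a stand-in for the harmonization models mentioned above, and the target vocabulary `TARGET_TERMS` is a hypothetical placeholder rather than an excerpt from RadLex, SNOMED CT, LOINC, or ICD-10-CM.

```python
import difflib

# Hypothetical target vocabulary, keyed by category. Real targets would be
# drawn from the selected standards.
TARGET_TERMS = {
    "Finding": ["linear scarring", "pleural thickening", "nodule"],
    "Location": ["left lower lobe", "right upper lobe"],
}


def harmonize(concept, category, cutoff=0.8):
    """Return the closest target term in the category, or None if none is close.

    String similarity is a simple stand-in here; it will not resolve
    abbreviations or true synonyms.
    """
    candidates = TARGET_TERMS.get(category, [])
    matches = difflib.get_close_matches(
        concept.lower(), candidates, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    print(harmonize("pleural thickenning", "Finding"))
    print(harmonize("Left Lower Lobe", "Location"))
    print(harmonize("cardiomegaly", "Finding"))
```

Restricting candidates to the extracted concept's category keeps a Location from being matched to a Finding term, and returning None below the cutoff leaves unmatched concepts available for manual review.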

Collaborators