Automated Data Harmonization
Aim 1: Conduct an exploratory analysis of radiologist-entered free-text responses to characterize linguistic variation, concept co-occurrence patterns, and sources of ambiguity in radiologic documentation.
Aim 2: Develop data augmentation tools to simulate variation in radiologic phrasing, drawing on standard terminology from sources such as RadLex and SNOMED CT to generate training examples that reflect real-world input diversity.
Aim 3: Train and evaluate a sequence tagging model to extract and categorize clinical concepts from multi-concept free-text responses. For example, given the radiologist-entered text “linear scarring in the left lower lobe with adjacent pleural thickening,” the model would extract and categorize [“linear scarring” → Finding], [“left lower lobe” → Location], and [“pleural thickening” → Finding]. Tagging provides the boundaries and types of clinical entities needed for downstream harmonization.
Aim 4: Harmonize extracted concepts to standardized terms within their respective categories. Where feasible, we will select initial target standards (e.g., RadLex, SNOMED CT, LOINC, ICD-10-CM) and use harmonization models to map extracted concepts to terms in those standards, expanding coverage over time as needed.
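To make the tagging output described in Aim 3 concrete, the sketch below encodes the example sentence with BIO (begin/inside/outside) labels and then decodes those labels back into typed entities. It is a minimal illustration only: the BIO scheme, whitespace tokenization, and the helper names spans_to_bio and bio_to_entities are assumptions made for this example, not a committed design, and the labels are written by hand rather than predicted by a model.

```python
from typing import List, Tuple

# Hypothetical helpers for illustration; names and the BIO scheme are assumptions.

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Recover (entity text, category) pairs from a BIO-tagged token sequence."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


text = "linear scarring in the left lower lobe with adjacent pleural thickening"
tokens = text.split()  # simple whitespace tokenization, assumed for this sketch

# Hand-annotated spans matching the example in Aim 3 (token indices, end exclusive).
spans = [(0, 2, "Finding"), (4, 7, "Location"), (9, 11, "Finding")]

tags = spans_to_bio(tokens, spans)
entities = bio_to_entities(tokens, tags)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
print(entities)
```

In this encoding, a trained tagger would predict one label per token, and the decoded (text, category) pairs would be the inputs passed to the harmonization step in Aim 4.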