Principal Investigator
Name
Jonathan Pearson
Degrees
Ph.D.
Institution
NHS Commissioning Board (NHS England)
Position Title
Lead Data Scientist
About this CDAS Project
Study
PLCO
Project ID
PLCOI-1077
Initial CDAS Request Approval
Oct 18, 2022
Title
Synthetic generation and tagging of CXR and pathology images
Summary
Note: This is the same project as NLST-961. After discussion, the team recommended that we submit an additional request for the chest X-rays and pathology images from PLCO, as these have accompanying pathology reports.

Multi-modal healthcare data offers a growing opportunity to generate new insights and to enable automated classification and clinical support systems. The NHS has a significant interest in understanding these opportunities. A key blocker to innovation is the lack of available multi-modal data sources (e.g. images with linked text). This project would use the NLST spiral CT and pathology image data, with associated notes/annotations, to investigate whether coupled synthetic images and notes can be generated alongside artificial patient records for different image types.

Our previous similar work has focussed on using pre-trained embedding models for chest X-rays (https://github.com/nhsx/txt-ray-align) and variational auto-encoders (https://github.com/nhsx/SynthVAE); see the reports within the repos. This investigation would not seek to produce a final dataset, but rather to explore the methodology and the level of fidelity and quality that synthetically generated data could achieve with current methods. The methodology and report would be made openly available through a GitHub repository at the end of the project, but no training or synthetic data would be published.
Aims

- Quantification of the accuracy obtained when using pre-trained models to represent the text and image embeddings for different healthcare images.
- Demonstration of image-from-text and text-from-image generation for established multi-modal datasets.
- Investigation into the best algorithm for generating synthetic healthcare images in terms of fidelity, privacy, fairness, and explainability.
- Possible extension into a visual question answering (VQA) search algorithm.
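
As an illustration of the first aim, one common way to quantify how well paired text and image embeddings agree is retrieval recall@k: for each image embedding, check whether its paired text is among the k most similar text embeddings. The project summary does not name a metric, so the choice of recall@k and cosine similarity, the function names, and the toy vectors below are all assumptions made for this sketch, not the project's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of images whose paired text (same index) ranks in the top k.

    Assumes image_embs[i] and text_embs[i] come from the same record.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine(img, txt) for txt in text_embs]
        # Text indices ordered from most to least similar to this image.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


# Hypothetical toy embeddings; real ones would come from pre-trained encoders.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.0, 0.3, 0.8]]

r1 = recall_at_k(image_embs, text_embs, k=1)
r2 = recall_at_k(image_embs, text_embs, k=2)
```

In this toy case only the first image retrieves its own text at rank 1, while all three pairs are recovered within the top 2, which shows how the metric separates exact alignment from near misses.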

Collaborators

NHS England Transformation Directorate