Synthetic generation and tagging of CXR and pathology images
There is a growing opportunity to analyse multi-modal healthcare data to generate new insights and to enable automated classification and clinical support systems. The NHS has a significant interest in understanding these innovative opportunities. A key blocker to innovation is the scarcity of multi-modal data sources (e.g. images paired with connected text). This project would use the National Lung Screening Trial (NLST) spiral CT and pathology image data, together with the associated notes/annotations, to investigate whether coupled synthetic images and notes can be generated, alongside artificial patient records, for different image types.
Our previous work in this area has focussed on pre-trained embedding models for chest X-rays (https://github.com/nhsx/txt-ray-align) and variational auto-encoders (https://github.com/nhsx/SynthVAE); see the reports within those repositories. This investigation would not seek to produce a final dataset, but rather to explore the methodology and the level of fidelity and quality that synthetically generated data can achieve with current methods. The methodology and report would be made open through a GitHub repository at the project end, but no training data or synthetic data would be published.
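For flavour, the sketch below shows the core of the variational auto-encoder family of models used in SynthVAE: an encoder mapping an input to a latent Gaussian, the reparameterisation trick, and a decoder from which synthetic samples can be drawn from the prior. It is a minimal illustration only; the layer sizes and the flattened greyscale-image input are assumptions for this sketch, not the SynthVAE implementation itself.

```python
# Minimal VAE sketch (PyTorch). Assumes inputs are flattened 64x64 images
# with pixel values scaled to [0, 1]; sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterisation trick: sample z ~ N(mu, sigma^2) differentiably
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the unit-Gaussian prior
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

# Generating synthetic images after training: decode draws from the prior
model = VAE()
synthetic = model.decoder(torch.randn(16, 32))
```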
- Quantification of the accuracy obtained when pre-trained models are used to produce text and image embeddings for different types of healthcare image (see the retrieval sketch after this list)
- Demonstration of image-from-text and text-from-image generation for established multi-modal datasets
- Investigation into the most suitable algorithm for generating synthetic healthcare images in terms of fidelity, privacy, fairness, and explainability
- Possible extension into a visual question answering (VQA) search algorithm
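As an illustration of the first output above, one common way to quantify how well pre-trained embeddings represent paired images and reports is image-to-text retrieval accuracy (recall@1) over cosine similarity, in the spirit of txt-ray-align. The sketch below is a hedged example: the random arrays stand in for real embeddings, which in practice would come from pre-trained image and text encoders applied to paired data.

```python
# Sketch of recall@1 for image-to-text retrieval. Rows of the two arrays
# are assumed to be paired (image i belongs with report i).
import numpy as np

def recall_at_1(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Fraction of images whose nearest text embedding (by cosine
    similarity) is their own paired report."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T  # pairwise cosine similarities
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(img))))

# Placeholder usage with random embeddings (real usage would pass the
# outputs of pre-trained encoders); chance level here is ~1/100.
rng = np.random.default_rng(0)
print(recall_at_1(rng.normal(size=(100, 512)), rng.normal(size=(100, 512))))
```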
NHS England Transformation Directorate