Principal Investigator
Name
Frank Meng
Degrees
PhD
Institution
UCLA
Position Title
Assistant Professor
About this CDAS Project
Study
NLST
Project ID
NLST-130
Initial CDAS Request Approval
Mar 16, 2015
Title
Designing and Implementing a Database Infrastructure for Large-Scale Biomedical Data
Summary
As the volume of biomedical data continues to increase, innovative techniques and designs will be needed to efficiently store and retrieve this information in support of large-scale clinical and research applications. Though database systems capable of handling large-scale data have been in use for several years, biomedical big data presents several unique challenges to data repositories. The goal of this project is to address biomedical data storage issues along two main axes, each of which may carry conflicting requirements: 1) efficient storage and retrieval of multi-modal clinical data including textual documents, medical images, and high-dimensional omics data (e.g., genomic sequences); and 2) integrated support for both ad-hoc and predefined queries across these various data formats.

Different types and formats of data impose specific indexing and retrieval needs that must be addressed by the underlying database infrastructure. Textual documents are usually indexed by words or phrases to facilitate keyword-based searches. Medical images can be accessed based on various parameters such as color, texture, surfaces, and regions, which are represented by histograms, spectra, level sets, and spatial regions of interest. High-dimensional biological data such as genomic sequences require a highly flexible schema, usually cannot be normalized easily, and contain multiple non-trivial data types. Finally, existing structured clinical data are typically captured within the constraints of a predefined relational model.
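
As an illustration of the word-based indexing described above for textual documents, the following is a minimal sketch of an inverted index with keyword search. It is not the project's implementation; the function names (build_inverted_index, search) and the sample report text are hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start from the first word's postings and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical report snippets, invented for illustration only.
reports = {
    "r1": "Solid nodule in the right upper lobe.",
    "r2": "No nodule identified; mild emphysema.",
    "r3": "Ground-glass opacity in the left lower lobe.",
}
index = build_inverted_index(reports)
print(search(index, "nodule"))
print(search(index, "lobe nodule"))
```

The index is built once and each query then touches only the postings for its own words, which is why this structure suits keyword search over large document collections. Image and genomic data would need different index structures and are not covered by this sketch.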

In addition, user query requirements will differ depending on the intended applications. Many biomedical research workflows leverage the efficient execution of a small set of commonly occurring queries that operate on large volumes of data, such as searching for specific genes within a large genomic data set. Clinical researchers, on the other hand, may need to gather specific patient cohorts using parameters that vary from one project to the next. NoSQL solutions are optimized to handle predefined queries of large volumes of schema-less data that are distributed across multiple nodes. However, these architectures are less efficient when handling ad-hoc queries that cannot be easily anticipated. Relational databases, in contrast, are designed for the ad-hoc querying environment and provide a clear interface for users to submit data access requests (e.g., SQL). The proposed database infrastructure will need to provide both capabilities through a hybrid approach, where data at lower levels are stored within distributed NoSQL data stores for efficient retrieval, and extract, transform, load (ETL) processes can be established to harmonize data with traditional relational models for servicing ad-hoc queries.
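
The hybrid approach can be sketched in miniature as follows. This is a simplified illustration under stated assumptions, not the proposed infrastructure: a list of Python dictionaries stands in for a distributed NoSQL document store, an in-memory SQLite database stands in for the relational layer, and the record fields, table names, and the etl_to_relational helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema-less records, as a document store might return them.
# Fields vary from record to record, which a fixed relational schema
# cannot absorb without a transformation step.
raw_records = [
    {"patient_id": "p1", "age": 61,
     "findings": [{"type": "nodule", "size_mm": 6}]},
    {"patient_id": "p2", "age": 67, "findings": []},
    {"patient_id": "p3", "age": 58,
     "findings": [{"type": "nodule", "size_mm": 11}, {"type": "emphysema"}]},
]


def etl_to_relational(records, conn):
    """Extract nested records, flatten them, and load two relational tables."""
    conn.execute(
        "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, age INTEGER)")
    conn.execute(
        "CREATE TABLE finding ("
        "patient_id TEXT REFERENCES patient(patient_id), "
        "type TEXT, size_mm REAL)")
    for rec in records:
        conn.execute("INSERT INTO patient VALUES (?, ?)",
                     (rec["patient_id"], rec.get("age")))
        for finding in rec.get("findings", []):
            # Missing attributes become NULL in the relational table.
            conn.execute("INSERT INTO finding VALUES (?, ?, ?)",
                         (rec["patient_id"], finding.get("type"),
                          finding.get("size_mm")))
    conn.commit()


conn = sqlite3.connect(":memory:")
etl_to_relational(raw_records, conn)

# An ad-hoc cohort query that was not anticipated when the data was stored.
cohort = [row[0] for row in conn.execute(
    "SELECT DISTINCT p.patient_id "
    "FROM patient p JOIN finding f ON f.patient_id = p.patient_id "
    "WHERE p.age >= 55 AND f.type = 'nodule' AND f.size_mm >= 8 "
    "ORDER BY p.patient_id")]
print(cohort)
```

Once the data is in relational form, a researcher can change the cohort criteria by editing the WHERE clause alone, while the original nested records stay in the document store for predefined, high-volume access.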
Aims

1. To design and implement a database infrastructure for large-scale, multi-modal biomedical data that enables fast and efficient storage and retrieval of structured, textual, imaging, and genomic data.

2. To develop techniques for integrating relational database models with NoSQL-based systems for servicing both ad-hoc and predefined querying mechanisms.

Collaborators

Alex Bui, UCLA
William Hsu, UCLA
Corey Arnold, UCLA
Denise Aberle, UCLA