NLST-130: Designing and Implementing a Database Infrastructure for Large-Scale Biomedical Data

Principal Investigator

Name

Frank Meng

Degrees

PhD

Institution

UCLA

Position Title

Assistant Professor

Email

fmeng@mii.ucla.edu

About this CDAS Project

Study

NLST

Project ID

NLST-130

Initial CDAS Request Approval

Mar 16, 2015

Title

Designing and Implementing a Database Infrastructure for Large-Scale Biomedical Data

Summary

As the volume of biomedical data continues to grow, new techniques and designs will be needed to store and retrieve this information efficiently in support of large-scale clinical and research applications. Although database systems capable of handling large-scale data have been in use for several years, biomedical big data presents several unique challenges to data repositories. The goal of this project is to address biomedical data storage issues along two main axes, each of which may involve conflicting requirements: 1) efficient storage and retrieval of multi-modal clinical data, including textual documents, medical images, and high-dimensional omics data (e.g., genomic sequences); and 2) integrated support for both ad-hoc and predefined queries across these various data formats.

Different types and formats of data impose specific indexing and retrieval needs that the underlying database infrastructure must address. Textual documents are usually indexed by words or phrases to facilitate keyword-based searches. Medical images can be accessed based on various parameters such as color, texture, surfaces, and regions that are represented by histograms, spectra, level sets, and spatial regions of interest. High-dimensional biological data such as genomic sequences require a highly flexible schema, usually cannot be easily normalized, and contain multiple non-trivial data types. Finally, existing structured clinical data is typically captured within the constraints of a predefined relational model.
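As an illustration of the word-based indexing of textual documents described above, the following is a minimal sketch of an inverted index with keyword search. It is not the project's implementation: the function names (`tokenize`, `build_inverted_index`, `keyword_search`) and the three short report texts are hypothetical and exist only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def keyword_search(index, query):
    """Return IDs of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start from the first word's postings and intersect with the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical report snippets, invented for illustration only.
reports = {
    "r1": "Solid nodule in the right upper lobe.",
    "r2": "No nodule identified; mild emphysema.",
    "r3": "Ground-glass opacity in the left lower lobe.",
}
index = build_inverted_index(reports)
print(keyword_search(index, "lobe nodule"))
```

The index is built once and each lookup then touches only the postings for the query words, which is why keyword search is the access pattern this kind of index serves well. Image and genomic data would need different index structures, as the paragraph above notes.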

In addition, user query requirements will differ depending on the intended applications. Many biomedical research workflows rely on the efficient execution of a small set of commonly occurring queries that operate on large volumes of data, such as searching for specific genes within a large genomic data set. Clinical researchers, on the other hand, may need to gather specific patient cohorts using parameters that vary from one project to the next. NoSQL solutions are optimized to handle predefined queries over large volumes of schema-less data distributed across multiple nodes. However, these architectures are less efficient at handling ad-hoc queries that cannot be easily anticipated. Relational databases, in contrast, are designed for the ad-hoc querying environment and provide a clear interface for users to submit data access requests (e.g., SQL). The proposed database infrastructure will need to provide both capabilities through a hybrid approach in which lower-level data are stored in distributed NoSQL data stores for efficient retrieval, and extract, transform, load (ETL) processes harmonize those data with traditional relational models to service ad-hoc queries.
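The hybrid approach described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the proposed infrastructure: a plain Python dictionary stands in for a distributed NoSQL document store, an in-memory SQLite database stands in for the relational layer, and the record fields, table names, and helper functions (`extract`, `transform`, `load`) are all hypothetical.

```python
import sqlite3

# Hypothetical schema-less records; fields vary from one document to the next.
documents = [
    {"patient_id": "p1", "age": 61, "smoking": {"pack_years": 40},
     "findings": ["nodule"]},
    {"patient_id": "p2", "age": 58, "smoking": {"pack_years": 35}},
    {"patient_id": "p3", "age": 70, "findings": ["nodule", "emphysema"]},
]

# Stand-in for a key-value/document store: predefined lookups by key are cheap.
store = {doc["patient_id"]: doc for doc in documents}


def extract(store):
    """Extract: iterate over every document in the store."""
    yield from store.values()


def transform(doc):
    """Transform: flatten one nested document into relational rows."""
    patient_row = (
        doc["patient_id"],
        doc.get("age"),
        doc.get("smoking", {}).get("pack_years"),  # None when absent
    )
    finding_rows = [(doc["patient_id"], f) for f in doc.get("findings", [])]
    return patient_row, finding_rows


def load(conn, store):
    """Load: create the relational tables and insert the flattened rows."""
    conn.execute(
        "CREATE TABLE patient "
        "(patient_id TEXT PRIMARY KEY, age INTEGER, pack_years REAL)"
    )
    conn.execute("CREATE TABLE finding (patient_id TEXT, finding TEXT)")
    for doc in extract(store):
        patient_row, finding_rows = transform(doc)
        conn.execute("INSERT INTO patient VALUES (?, ?, ?)", patient_row)
        conn.executemany("INSERT INTO finding VALUES (?, ?)", finding_rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, store)

# Ad-hoc cohort query whose parameters were not known when the data was stored.
cohort = [
    row[0]
    for row in conn.execute(
        "SELECT p.patient_id FROM patient p "
        "JOIN finding f ON f.patient_id = p.patient_id "
        "WHERE p.age >= ? AND f.finding = ? ORDER BY p.patient_id",
        (60, "nodule"),
    )
]
print(cohort)
```

The key-based lookup `store["p2"]` represents the predefined access path, while the SQL join represents a cohort query that varies from project to project. Missing fields become NULL values in the relational tables, which is one of the harmonization decisions an ETL process would have to make explicitly.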

Aims

1. To design and implement a database infrastructure for large-scale, multi-modal biomedical data that enables fast and efficient storage and retrieval of structured, textual, imaging, and genomic data.

2. To develop techniques for integrating relational database models with NoSQL-based systems for servicing both ad-hoc and predefined querying mechanisms.

Collaborators

Alex Bui, UCLA
William Hsu, UCLA
Corey Arnold, UCLA
Denise Aberle, UCLA