Principal Investigator
Name
Jennifer Frediani
Degrees
PhD, RD
Institution
Emory University
Position Title
Assistant Professor
Email
About this CDAS Project
Study
IDATA
Project ID
IDATA-56
Initial CDAS Request Approval
Aug 22, 2022
Title
Mitigating measurement error and missing data impact on analysis of food frequency questionnaire data using machine learning methods
Summary
Self-reported dietary data are important to collect, are valuable in nutrition research, and inform public health policy.1 In the last decade there have been 27,682 publications using food frequency questionnaire (FFQ) data, 5,690 of them in the last year, showing that the FFQ is still widely used even though it is known not to be the most accurate method of capturing self-reported dietary data. It has been established that multiple online 24-hour dietary recalls (24HRs) are a more accurate and simpler choice for large, longitudinal cohort studies because they combine a lower participant and researcher burden with higher attenuation factors.2–5 Despite this, FFQs are still being used and have been the method of choice for many classic datasets such as Atherosclerosis Risk in Communities (ARIC)6–8 and Reasons for Geographic and Racial Differences in Stroke (REGARDS).9–11 Both cohorts have new publications using omics technologies with banked samples.12–14 Measurement error is inherent in all data, especially self-reported dietary data. Two main methods are used to compensate for measurement error in FFQs. The gold standard is to conduct a doubly labeled water experiment on a subset of the population, but this is often deemed too expensive. The second is to adjust for energy intake, which is problematic when FFQs are not quantitative. However, comparing energy intake to basal metabolic rate, or collecting 24HRs from a subset of the population to calculate energy intake, is widely used. Despite these methods, FFQs are still extensively criticized. New bioinformatic techniques can provide a solution to these criticisms.

Most studies describe the differences in magnitude of measurement error between methods to determine the best way of assessing diet.2,20,22,23 What is needed is a method of correcting measurement error in data already collected. We propose new statistical machine learning methods for ordinal matrix recovery (OMR) to impute missing data and remove measurement errors and biases caused by under/over-reporting in FFQs. Because the FFQ data columns are highly correlated, the data matrix has a low-rank structure. This low-rank structure will be used to formulate a novel optimization problem for ordinal data using OMR. Additionally, we develop calibration techniques that take advantage of the association between food consumption data and common lab results, a more precise measure, to impute the true value of under/over-reported data. This method has been tested in an internal observational cohort of 750 participants followed for up to 10 years. The advantage of this new method is that it does not use 24HRs; it relies on the derived correlations and the underlying ground truth in the FFQ dataset, so no unnecessary variability is introduced into the data. The variables are not transformed, and no other distribution is introduced through an assumption of normality. Rather, the measurement error in the FFQ data is treated as an aggregate of the measurement errors in multiple covariates in the dataset, and the method adjusts for that error all at once.
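To illustrate why a low-rank structure helps with missing FFQ responses, the following is a minimal sketch of generic low-rank matrix completion by iterative truncated SVD, followed by rounding back to the ordinal response scale. It is not the proposed OMR method, which is not specified here. The function name `soft_impute_ordinal`, the rank, the iteration count, the 1 to 9 response scale, and the toy `responses` matrix are all assumptions made for illustration.

```python
import numpy as np


def soft_impute_ordinal(X, rank=2, n_iter=200, levels=(1, 9)):
    """Fill NaN cells of X using a rank-limited SVD approximation.

    Hypothetical helper for illustration only; a generic stand-in,
    not the ordinal matrix recovery method described in the proposal.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Initialise missing cells with the mean of each column's observed answers.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(observed, X, col_means)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed answers fixed; update only the missing cells.
        filled = np.where(observed, X, low_rank)
    # Map the continuous estimates back onto the ordinal scale.
    lo, hi = levels
    return np.clip(np.rint(filled), lo, hi).astype(int)


# Toy data (assumed): rows are participants, columns are FFQ items
# answered on a 1-9 frequency scale, with two missing responses.
responses = np.array([
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 2, np.nan, 4],
    [2, 4, 6, np.nan],
    [1, 2, 3, 4],
], dtype=float)

completed = soft_impute_ordinal(responses, rank=1, levels=(1, 9))
print(completed)
```

The sketch only fills missing cells and leaves observed answers untouched. Correcting under/over-reporting in observed answers, as the proposal describes, would require an additional error model and calibration against lab results, which this example does not attempt.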
Aims

Aim 1: To build and validate statistical machine learning techniques for modeling and correcting measurement error and correctly imputing missing responses in ordinal food frequency questionnaire data.

Aim 2: To study the impact of measurement errors and the proposed remedial preprocessing solutions.

2a: To examine the impact of biological sex on dietary pattern scores from the FFQ dataset both before and after the proposed preprocessing.

2b: To examine the impact of age on dietary pattern scores from the FFQ dataset. We will split the dataset at the median age and apply the preprocessing procedures to each group separately.

Aim 3: To develop statistical packages for use in the field to accurately correct for measurement error in FFQ datasets.

Collaborators

Kamran Paynabar, Terry Hartman