Mitigating measurement error and missing data impact on analysis of food frequency questionnaire data using machine learning methods
Most studies describe differences in the magnitude of measurement error between dietary assessment methods in order to identify the best way of assessing diet [2,20,22,23]. What is needed is a method for correcting measurement error in data that have already been collected. We propose new statistical machine learning methods for ordinal matrix recovery (OMR) to impute missing data and to remove measurement errors and biases caused by under- and over-reporting in food frequency questionnaires (FFQs). Because the columns of FFQ data are highly correlated, the data matrix has a low-rank structure, which we will use to formulate a novel optimization problem for ordinal data using OMR. Additionally, we develop calibration techniques that take advantage of the association between food consumption data and common laboratory results, which are a more precise measure, to impute the true values of under- or over-reported data. This method has been tested in an internal observational cohort of 750 participants followed for up to 10 years. The advantages of the new method are that it does not require 24-hour recalls (24HRs) and that it relies on the derived correlations and the underlying ground truth in the FFQ dataset, so no unnecessary variability is introduced into the data. The variables are not transformed, and no other distribution is introduced through an assumption of normality. Rather, the measurement error in the FFQ data is treated as an aggregate of the measurement errors in multiple covariates in the dataset, and the method adjusts for this error all at once.
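To make the low-rank idea concrete, the following is a minimal sketch of imputing missing ordinal responses by iterative truncated SVD. It is a generic matrix-completion heuristic swapped in for illustration, not the proposed OMR formulation, and it does not include the lab-based calibration step. The function name `impute_low_rank`, the chosen rank, and the 1 to 9 response scale are all assumptions made for the example.

```python
import numpy as np


def impute_low_rank(X, rank=2, levels=(1, 9), n_iter=200, tol=1e-6):
    """Fill NaN entries of an ordinal matrix using iterative truncated SVD.

    Hypothetical helper for illustration only: observed entries are kept,
    missing entries are replaced by a rank-`rank` approximation and then
    rounded back onto the assumed ordinal scale `levels` (min, max).
    Assumes every column has at least one observed entry.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initialise missing cells with the mean of the observed column values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        # Best rank-`rank` approximation of the current filled matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]

        # Overwrite only the missing cells; observed responses stay fixed.
        new_filled = np.where(missing, low_rank, X)
        change = np.linalg.norm(new_filled - filled)
        change /= max(np.linalg.norm(filled), 1e-12)
        filled = new_filled
        if change < tol:
            break

    # Map the continuous estimates back onto the ordinal response levels.
    return np.clip(np.rint(filled), levels[0], levels[1]).astype(int)


# Synthetic FFQ-like data: 50 respondents, 8 correlated food items (assumed).
rng = np.random.default_rng(0)
scores = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 8))
ffq = np.clip(np.rint(5 + 1.5 * scores), 1, 9)
ffq_missing = ffq.copy()
ffq_missing[rng.random(ffq.shape) < 0.1] = np.nan  # about 10% missing

ffq_imputed = impute_low_rank(ffq_missing, rank=2, levels=(1, 9))
```

Rounding after a continuous fit is the simplest way to return to the ordinal scale; a method designed for ordinal data would instead model the response categories directly in the optimization problem.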
Aim 1: To build and validate statistical machine learning techniques for modeling and correcting measurement error and for imputing missing responses in ordinal food frequency questionnaire data.
Aim 2: To study the impact of measurement errors and the proposed remedial preprocessing solutions.
2a: To examine the impact of biological sex on dietary pattern scores from the FFQ dataset both before and after the proposed preprocessing.
2b: To examine the impact of age on dietary pattern scores from the FFQ dataset. We will split the dataset at the median age and apply the preprocessing procedures to each group separately.
Aim 3: To develop statistical packages for use in the field to accurately correct for measurement error in FFQ datasets.
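The age stratification described in Aim 2b can be sketched as follows. This is an illustrative outline only: the helper names `split_at_median_age` and `preprocess_by_age_group` are hypothetical, the `preprocess` argument stands in for whatever correction procedure is applied, and assigning participants whose age equals the median to the younger group is an assumption the aims do not specify.

```python
import numpy as np


def split_at_median_age(ages):
    """Return the median age and row indices of the two age groups.

    Assumption: ages equal to the median are placed in the younger group.
    """
    ages = np.asarray(ages, dtype=float)
    median_age = np.median(ages)
    younger = np.flatnonzero(ages <= median_age)
    older = np.flatnonzero(ages > median_age)
    return median_age, younger, older


def preprocess_by_age_group(ffq, ages, preprocess):
    """Apply `preprocess` to each age group separately and reassemble.

    `preprocess` is a placeholder callable that takes a 2-D array of FFQ
    responses and returns an array of the same shape.
    """
    ffq = np.asarray(ffq, dtype=float)
    median_age, younger, older = split_at_median_age(ages)
    result = np.empty_like(ffq)
    for idx in (younger, older):
        if idx.size:  # skip an empty group, e.g. when all ages are equal
            result[idx] = preprocess(ffq[idx])
    return result, median_age


# Toy example with a stand-in preprocessing step (column centering).
toy_ages = [30, 40, 50, 60]
toy_ffq = [[1, 2], [3, 4], [5, 6], [7, 8]]
centered, toy_median = preprocess_by_age_group(
    toy_ffq, toy_ages, lambda group: group - group.mean(axis=0)
)
```

Processing the groups independently means any structure learned by the preprocessing step is estimated within each age stratum rather than pooled across the full cohort.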
Kamran Paynabar, Terry Hartman