New Statistical Methods for the Analysis of Longitudinal Compositional Data
Principal Investigator
Name
Erika Shults
Degrees
MS
Institution
Southern Methodist University
Position Title
Student
Email
About this CDAS Project
Study
IDATA
Project ID
IDATA-88
Initial CDAS Request Approval
Jan 24, 2025
Title
New Statistical Methods for the Analysis of Longitudinal Compositional Data
Summary
Compositional data are a special type of non-negative, multivariate data in which the components represent parts of a whole. The data are expressed as proportions, percentages, or fractions that sum to a constant, typically 1 or 100%. Because of this constant-sum constraint, the relevant information is carried in the relative, rather than the absolute, values of the components. Common examples include the relative abundances of different species in an ecosystem, the mineral makeup of a rock, and the proportions of nutrients in food.
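The constant-sum constraint can be illustrated with a minimal Python sketch. The function name `closure` and the activity minutes below are hypothetical choices for illustration and are not taken from the IDATA study. The second row is the first row doubled, which shows that only the relative values survive the rescaling.

```python
import numpy as np

def closure(parts):
    """Rescale non-negative parts so that each row sums to 1."""
    parts = np.asarray(parts, dtype=float)
    if np.any(parts < 0):
        raise ValueError("compositional parts must be non-negative")
    totals = parts.sum(axis=-1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("each composition needs a positive total")
    return parts / totals

# Hypothetical minutes spent in sleep, sedentary, and active behavior.
# The second row is the first row doubled (same relative amounts).
minutes = np.array([[480.0, 720.0, 240.0],
                    [960.0, 1440.0, 480.0]])
composition = closure(minutes)
print(composition)
```

Both rows map to the same composition, so any analysis of the closed data depends on ratios between parts and not on the original totals.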
Analyzing compositional data requires specialized methods that account for its inherent constraints and avoid spurious correlations. Compositional data analysis (CODA) was first introduced by Dr. John Aitchison in 1982 and has grown steadily in popularity and scope, with applications in a wide array of fields, such as geology, chemistry, sociology, and marketing. Aitchison advocated for and popularized the use of transformations when conducting CODA, so most research has followed this approach. However, the transformational approach has limitations, including difficulty in interpretation and non-uniqueness of values.
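The summary does not name a specific transformation. As one commonly used example, the centered log-ratio (clr) transformation subtracts the mean log-part from each log-part. The sketch below is an illustration under that assumption; the function name `clr` and the example composition are hypothetical.

```python
import numpy as np

def clr(composition):
    """Centered log-ratio: log of each part minus the mean log-part."""
    composition = np.asarray(composition, dtype=float)
    if np.any(composition <= 0):
        # The logarithm is undefined at zero, so zero parts need separate handling.
        raise ValueError("clr is undefined for zero or negative parts")
    log_parts = np.log(composition)
    return log_parts - log_parts.mean(axis=-1, keepdims=True)

# Hypothetical daily composition: sleep, sedentary, active.
day = np.array([1 / 3, 1 / 2, 1 / 6])
clr_day = clr(day)
print(clr_day)
```

The transformed values sum to zero and are expressed on a log-ratio scale rather than as shares of the day, which gives a sense of why interpretation can be harder after transformation.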
Recent work led by Dr. Monnie McGee at SMU has challenged the use of transformations and instead proposed the nested Dirichlet distribution (NDD) as a natural model for compositional data. The NDD generalizes the Dirichlet distribution (DD), which Aitchison mentioned in his seminal work but largely discounted because it places unrealistic restrictions on the data. Nevertheless, the DD has been used previously for compositional data analysis and provides key benefits, such as easy interpretation and uniqueness of values, as well as a probabilistic framework for handling uncertainty. Dr. McGee’s team recently developed tests based on the NDD, which relaxes the restrictions of the DD, to compare mean vectors of components in two-sample and multiple-sample contexts. A natural continuation of this work is to extend this group comparison to the longitudinal setting, which is the primary aim of this project.
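To make the interpretability and the restrictions of the ordinary Dirichlet distribution concrete, the following Python sketch computes its mean vector and covariance matrix from the standard closed-form expressions. It covers the DD only and is not the NDD tests developed by Dr. McGee’s team; the function name `dirichlet_mean_cov` and the parameter values are hypothetical. One well-known restriction visible here is that every pair of distinct components has negative covariance.

```python
import numpy as np

def dirichlet_mean_cov(alpha):
    """Mean vector and covariance matrix of a Dirichlet(alpha) distribution."""
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha <= 0):
        raise ValueError("all concentration parameters must be positive")
    total = alpha.sum()
    mean = alpha / total
    # Cov(X_i, X_j) = (delta_ij * mean_i - mean_i * mean_j) / (total + 1)
    cov = (np.diag(mean) - np.outer(mean, mean)) / (total + 1.0)
    return mean, cov

# Hypothetical concentration parameters for sleep, sedentary, active.
alpha = np.array([8.0, 12.0, 4.0])
mean, cov = dirichlet_mean_cov(alpha)

# Simulated compositions; each draw is non-negative and sums to 1.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=5000)

print(mean)
print(cov)
print(samples.mean(axis=0))
```

The mean is simply the normalized parameter vector, which is the kind of direct interpretation the summary refers to, while the forced negative off-diagonal covariances illustrate why a more flexible model can be attractive.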
One of the most popular use cases for CODA has been the analysis of 24-hour activity data. In fact, according to Google Scholar, more than 50 compositional analyses using such data have been published in the past year alone. However, none uses the Dirichlet distribution, and only one uses longitudinal data (with transformations). To the best of our knowledge, our proposed work using the NDD for a longitudinal analysis of daily activity data will be the first of its kind. Access to the IDATA data, in particular the longitudinal ACT24 data, will allow us to fulfill the aims below.
Aims
1. Methodology Development: Develop a longitudinal approach for analyzing compositional data based on the nested Dirichlet distribution. All work will be conducted using the longitudinal ACT24 data.
2. Dissemination of information: Because 24-hour activity data are so widely used, this analysis will be highly relevant to researchers across many fields, including statistics, epidemiology, and aging. Both the newly developed statistical methodologies and the outcomes of the analysis will be disseminated freely.
Collaborators
Monnie McGee, Ph.D.