Study
PLCO
Project ID
PLCO-1349
Initial CDAS Request Approval
Oct 5, 2023
Title
Investigation into the Effect of Dimensionality Reduction Techniques on Machine Learning Algorithm.
Summary
The rapid proliferation of data across many fields has created a pressing need for effective strategies for handling high-dimensional datasets. Such datasets contain rich information, but the performance of machine learning algorithms can be greatly hindered by the curse of dimensionality, in which high-dimensional data brings increased computational complexity, overfitting, and decreased interpretability. Traditional machine learning methods may struggle to process such data and extract meaningful patterns from it. Dimensionality reduction techniques offer a promising solution by transforming data into a lower-dimensional representation while preserving essential information.
This project sits at the intersection of machine learning and dimensionality reduction and aims to bridge the gap between high-dimensional data and algorithm performance. It proposes to implement and apply a variety of dimensionality reduction techniques, such as t-SNE, PCA, and LDA, to high-dimensional datasets, reducing their dimensionality while retaining relevant information. Selected machine learning algorithms will then be applied both to the original datasets and to the reduced datasets. The bulk of the project lies in comparing performance across the different dimensionality reduction techniques and machine learning algorithms: the performance of each algorithm on a reduced dataset will be compared with its performance on the original dataset.
The study is intended to inform industry practice in applying machine learning algorithms by helping practitioners decide which dimensionality reduction technique best fits a given dataset and by helping to improve the performance of the machine learning algorithm applied to it.
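The comparison described in this summary (train the same algorithm on the original features and on a reduced representation, then compare) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than PLCO data, PCA is computed directly with NumPy's SVD, the classifier is a small k-nearest-neighbors routine, and the helper names (`make_data`, `pca_fit`, `pca_transform`, `knn_predict`, `evaluate`) are hypothetical. It is not the project's actual implementation.

```python
import time
import numpy as np


def make_data(n_per_class=100, n_features=50, seed=0):
    """Synthetic two-class data; only the first two features carry signal."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    X1 = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    X1[:, :2] += 4.0  # shift class 1 along two informative features
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    idx = rng.permutation(len(y))  # shuffle before splitting
    return X[idx], y[idx]


def pca_fit(X_train, n_components):
    """Fit PCA on the training data only: return the mean and top components."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project data onto the principal components found by pca_fit."""
    return (X - mean) @ components.T


def knn_predict(X_train, y_train, X_test, k=5):
    """Binary k-nearest-neighbors by majority vote (k odd, labels 0/1)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)


def evaluate(X_train, y_train, X_test, y_test):
    """Return (accuracy, prediction time in seconds) for the kNN classifier."""
    start = time.perf_counter()
    pred = knn_predict(X_train, y_train, X_test)
    elapsed = time.perf_counter() - start
    return float((pred == y_test).mean()), elapsed


X, y = make_data()
split = int(0.7 * len(y))  # 70/30 train/test split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Baseline: classifier on the original 50-dimensional features.
acc_full, t_full = evaluate(X_train, y_train, X_test, y_test)

# Reduced: same classifier on a 2-dimensional PCA projection.
mean, comps = pca_fit(X_train, n_components=2)
Z_train = pca_transform(X_train, mean, comps)
Z_test = pca_transform(X_test, mean, comps)
acc_pca, t_pca = evaluate(Z_train, y_train, Z_test, y_test)

print(f"original: accuracy={acc_full:.3f}, time={t_full:.4f}s")
print(f"PCA (2D): accuracy={acc_pca:.3f}, time={t_pca:.4f}s")
```

The projection is fitted on the training split only and then applied to the test split, so no information from the test data leaks into the reduced representation. The same pattern would extend to other reduction techniques and classifiers by swapping the fit/transform and predict functions.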
Aims
The primary aim of this research is to investigate how various dimensionality reduction techniques impact the performance of machine learning algorithms. To achieve this overarching goal, several specific objectives have been outlined:
Comparative Analysis: Conduct a comparative analysis of different dimensionality reduction techniques, including Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Linear Discriminant Analysis (LDA), and autoencoders. The aim is to assess their effectiveness in reducing data dimensionality while retaining relevant information.
Algorithmic Performance Evaluation: Evaluate the performance of a range of machine learning algorithms, encompassing classification, regression, and clustering tasks. These algorithms may include Support Vector Machines (SVM), Random Forests, k-Nearest Neighbors, and deep learning architectures. Assess the algorithms on both the original high-dimensional datasets and datasets reduced using dimensionality reduction techniques.
Performance Metrics Analysis: Quantify and analyze various performance metrics, including accuracy, precision, recall, F1-score, and computational time, for each machine learning algorithm on both the original and reduced-dimensionality datasets. This comprehensive analysis will facilitate an in-depth comparison of algorithmic performance.
Scenario Identification: Identify instances where specific dimensionality reduction techniques yield substantial improvements in algorithm performance. By examining the relationship between data characteristics and technique effectiveness, this objective aims to provide insights into the ideal contexts for applying each technique.
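As a concrete reading of the performance metrics named in the aims above, the following sketch computes accuracy, precision, recall, and F1-score for a binary classification task and times a prediction call. The function names (`binary_metrics`, `timed`) and the toy label lists are hypothetical and chosen only for illustration; the project's own evaluation code may differ.

```python
import time


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics, seconds = timed(binary_metrics, y_true, y_pred)
print(metrics)
```

Recording these values for each combination of algorithm and dataset (original or reduced) would give the table on which the comparison rests.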
Collaborators