A ‘Cluster-Classify’ Approach to Solving Multi-Dimensional Classification Problems
Author: Aryan Agarwal
Vasant Valley School, New Delhi, India
Abstract
Multi-Dimensional Classification (MDC) problems arise in many sectors, most notably in clinical settings. Classical machine learning models tend to overfit such datasets, leading to low accuracy on previously unseen data. This paper proposes a two-step machine learning framework to solve MDC problems and reduce overfitting. The framework first clusters the instance features of a labelled dataset and then validates the results using classical classifiers. Clustering reduces dimensional complexity while maintaining correlations between features and labels. Classification makes the model fit for prediction, opening it to real-world applications. As a case study, the framework is implemented on two datasets of different dimensions: the wine quality classification dataset, with eleven features, and the stroke classification dataset, with ten features. Promising results are recorded for both. Using the elbow method, the ideal k is found to be three for both datasets. K-Means and K-Medians are used for clustering, while Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), K Nearest Neighbours (KNN), and Naïve Bayes are used for classification. Five-fold cross-validation is used to reduce bias while measuring model performance, and the results are compared to direct classification without clustering. On the wine dataset, K-Means clustering improves accuracy by 3.89%, while on the stroke dataset it improves accuracy by 8.81%. The results suggest that Cluster-Classify is an appropriate candidate for classification where there is a need to reduce dimensions and understand inter-feature relationships. The framework may thus be used to increase model accuracy by reducing overfitting in MDC problems.
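The two-step idea can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say how cluster output feeds the classifier, so the sketch assumes one possible reading, in which each instance is re-expressed by its distances to k = 3 K-Means centroids and then classified with KNN under five-fold cross-validation. The toy data and the helper names (`kmeans`, `to_cluster_space`, `knn_predict`, `cv_accuracy`, `make_toy_data`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter
from math import dist


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def to_cluster_space(p, centroids):
    """Assumed reduction step: describe a point by its distance to each centroid."""
    return tuple(dist(p, c) for c in centroids)


def knn_predict(train_X, train_y, x, n_neighbors=5):
    """Majority vote among the n_neighbors closest training points."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k_clusters=None, folds=5):
    """Mean accuracy over folds; clusters are fitted on the training fold only."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        test_X = [X[i] for i in sorted(test_idx)]
        test_y = [y[i] for i in sorted(test_idx)]
        if k_clusters is not None:
            centroids = kmeans(train_X, k_clusters)
            train_X = [to_cluster_space(p, centroids) for p in train_X]
            test_X = [to_cluster_space(p, centroids) for p in test_X]
        hits = sum(knn_predict(train_X, train_y, x) == t
                   for x, t in zip(test_X, test_y))
        scores.append(hits / len(test_y))
    return sum(scores) / folds


def make_toy_data(n=150, dims=11, seed=1):
    """Synthetic stand-in data: three Gaussian classes with eleven features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 3
        X.append(tuple(rng.gauss(label * 2.0, 1.0) for _ in range(dims)))
        y.append(label)
    return X, y


X, y = make_toy_data()
baseline = cv_accuracy(X, y)                 # direct classification
clustered = cv_accuracy(X, y, k_clusters=3)  # cluster, then classify
print(f"direct: {baseline:.3f}  cluster-classify: {clustered:.3f}")
```

Fitting the centroids on the training fold only keeps information from the held-out fold out of the clustering step. The numbers printed for this synthetic data say nothing about the wine or stroke results reported above.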