A combination of clustering-based under-sampling with ensemble methods for solving imbalanced class problem in intelligent systems

Palmieri F.
2021-01-01

Abstract

Many real-world datasets have an imbalanced class distribution, in which the larger (majority) class contains far more samples than the smaller (minority) class. To address this problem, various undersampling and oversampling techniques have been proposed; they balance the dataset by reducing the number of majority-class samples or increasing the number of minority-class samples, respectively. Ensemble classifiers combine multiple learning algorithms to improve classification accuracy. Results indicate that combining undersampling or oversampling with ensemble classifiers can yield better-performing models. This study proposes a novel clustering-based undersampling method for creating a balanced dataset. The method uses the k-means algorithm to cluster the data, the Mahalanobis distance to measure how far each sample lies from the centroid of its cluster, and a selection method that preserves the pattern of the data distribution within each cluster. In experiments on 44 benchmark datasets from the KEEL repository, the proposed approach outperformed seven state-of-the-art approaches.
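The three steps named in the abstract (k-means clustering, Mahalanobis distance to the centroid, distribution-preserving selection) can be illustrated with a minimal sketch. The abstract does not specify the selection rule, the number of clusters, or the per-cluster quotas, so everything below is an assumption made for illustration and not the authors' procedure: the function names `kmeans`, `mahalanobis_to_centroid`, and `cluster_undersample` are hypothetical, the clustering is a plain Lloyd's k-means written in NumPy, quotas are taken proportional to cluster size, and samples are picked at evenly spaced ranks of the distance ordering so that both near and far samples survive.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()

    def assign(c):
        # Nearest centroid by Euclidean distance.
        return np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign(centroids), centroids


def mahalanobis_to_centroid(members, centroid):
    """Mahalanobis distance of each row of `members` to `centroid`."""
    if len(members) < 2:
        return np.zeros(len(members))  # covariance is undefined for one sample
    cov = np.atleast_2d(np.cov(members, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = members - centroid
    sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(sq, 0.0))


def cluster_undersample(X_major, n_keep, k=3, seed=0):
    """Return indices of roughly `n_keep` majority-class samples to retain."""
    labels, centroids = kmeans(X_major, k, seed=seed)
    kept = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        # ASSUMPTION: quota proportional to cluster size.
        quota = min(len(idx), max(1, round(n_keep * len(idx) / len(X_major))))
        d = mahalanobis_to_centroid(X_major[idx], centroids[j])
        order = idx[np.argsort(d)]
        # ASSUMPTION: evenly spaced ranks keep near and far samples alike.
        picks = np.unique(np.linspace(0, len(order) - 1, quota).round().astype(int))
        kept.extend(order[picks])
    return np.array(sorted(kept))


# Synthetic majority class made of two blobs (illustrative data only).
rng = np.random.default_rng(42)
X_major = np.vstack([
    rng.normal([0, 0], 1.0, size=(120, 2)),
    rng.normal([6, 6], 0.5, size=(80, 2)),
])
kept = cluster_undersample(X_major, n_keep=20, k=2)
X_major_reduced = X_major[kept]
```

In this sketch `n_keep` would typically be set to the minority-class size, and the reduced majority set would then be concatenated with the minority samples before training an ensemble classifier. Because of per-cluster rounding, the number of retained samples can differ from `n_keep` by a few.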

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4806742