Diabetes diagnosis based on hard and soft voting classifiers combining statistical learning models

Gustavo Peixoto de Oliveira
Anderson Fonseca
Paulo Canas Rodrigues
https://orcid.org/0000-0002-1248-9910

Abstract

Diabetes mellitus is one of the deadliest incurable diseases worldwide, and the number of cases continues to rise. Early identification of the disease helps to fight it; however, blood tests can be considered invasive, which discourages people from taking them. In this vein, this work aims to build a model that serves as an alternative to traditional exams for identifying the disease. Statistical learning algorithms such as logistic regression, K-nearest neighbors, decision trees, random forest, and support vector machines were used for diabetes classification. These models were considered both separately and combined via hard and soft voting classifiers. The methods were applied to a widely known dataset of 768 individuals and nine variables, compared using several accuracy metrics based on the confusion matrix, and used to estimate the probability of diabetes for a given profile.
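
As an illustration of the voting schemes mentioned in the abstract, the following is a minimal sketch in Python with scikit-learn, not the authors' implementation. Hard voting assigns the class predicted by the majority of the base models, whereas soft voting averages the class probabilities predicted by the base models and assigns the class with the highest average. The data are a synthetic stand-in with 768 rows, eight predictors, and one binary outcome. The hyperparameters, the 75/25 train/test split, and the helper names base_models and summarize are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 768 rows, 8 predictors, binary outcome.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def base_models():
    """Return the five base classifiers; hyperparameters are illustrative."""
    return [
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        # probability=True lets the SVM contribute probabilities to soft voting.
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ]


def summarize(cm):
    """Derive accuracy, sensitivity, and specificity from a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hard voting: majority of predicted labels. Soft voting: averaged probabilities.
hard_vote = VotingClassifier(estimators=base_models(), voting="hard")
soft_vote = VotingClassifier(estimators=base_models(), voting="soft")
hard_vote.fit(X_train, y_train)
soft_vote.fit(X_train, y_train)

cm_hard = confusion_matrix(y_test, hard_vote.predict(X_test))
cm_soft = confusion_matrix(y_test, soft_vote.predict(X_test))
metrics_hard = summarize(cm_hard)
metrics_soft = summarize(cm_soft)

# Estimated probability of the positive class for one profile (soft voting only).
profile_proba = soft_vote.predict_proba(X_test[:1])[0, 1]

print("hard voting:", metrics_hard)
print("soft voting:", metrics_soft)
print("P(positive | first test profile):", round(float(profile_proba), 3))
```

Only the soft voting classifier yields a probability for a given profile, because hard voting combines class labels rather than probabilities.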

Article Details

How to Cite
Oliveira, G. P. de, Fonseca, A., & Rodrigues, P. C. (2022). Diabetes diagnosis based on hard and soft voting classifiers combining statistical learning models. Brazilian Journal of Biometrics, 40(4), 415–427. https://doi.org/10.28951/bjb.v40i4.605
