Diabetes diagnosis based on hard and soft voting classifiers combining statistical learning models
Abstract
Diabetes mellitus is one of the deadliest incurable diseases worldwide, and the number of cases continues to rise. Early identification of the disease helps fight it; however, blood tests can be considered invasive, which discourages people from taking them. In this vein, this work aims to build a model that serves as an alternative to traditional exams for identifying the disease. Statistical learning algorithms such as logistic regression, K-nearest neighbors, decision trees, random forest, and support vector machines were used for diabetes classification. These models were considered separately and combined via hard and soft voting classifiers. The methods were applied to a widely known dataset of 768 individuals and nine variables, compared using several accuracy metrics based on the confusion matrix, and used to estimate the probability of diabetes for a given profile.
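To make the difference between the two combination rules concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each of the five models has already been fitted and returns a predicted probability of diabetes for one profile; the function names, the 0.5 threshold, and the probability values are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import mean


def hard_vote(probabilities, threshold=0.5):
    """Majority vote over each model's thresholded label (1 = diabetes).

    With an odd number of models and two classes, no tie can occur.
    """
    labels = [int(p >= threshold) for p in probabilities.values()]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities, threshold=0.5):
    """Average the predicted probabilities, then threshold the average.

    Returns the predicted label and the averaged probability.
    """
    avg = mean(probabilities.values())
    return int(avg >= threshold), avg


# Hypothetical outputs of five fitted models for a single profile
# (illustrative numbers, not results from the paper).
example = {
    "logistic_regression": 0.45,
    "knn": 0.40,
    "decision_tree": 0.95,
    "random_forest": 0.48,
    "svm": 0.90,
}

hard_label = hard_vote(example)
soft_label, soft_probability = soft_vote(example)
print("hard voting label:", hard_label)
print("soft voting label:", soft_label)
print("soft voting probability:", round(soft_probability, 3))
```

In this made-up case the two rules disagree: three of the five models fall below the threshold, so hard voting predicts no diabetes, while the two confident models pull the averaged probability above 0.5, so soft voting predicts diabetes. The averaged probability from soft voting is also one natural way to report a probability of diabetes for a given profile.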
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.