Bayesian binary regression using power and power reverse link functions: an application to premature birth data

This study aims to determine factors associated (and quantify) with prematurity through binary regression models, considering power and reverse power link functions, with asymmetric characteristics. As criteria for the model selection, the Bayesian Deviance Information Criterion (DIC), predictive evaluation, and residual analysis. All models analyzed presented similar predictive capacity, however, the model with a reverse power logit link function, with asymmetry parameter  =0.336 was chosen, since it presented the lowest value of DIC=3,203, residues that indicated a good fitted of the model. There was an association of prematurity with the following variables: maternal - age over 35 years (OR=1.485), with a partner (OR=0.731), and primiparous (OR=1.307); of pregnancy and childbirth - multiple pregnancy (OR=36.360), cesarean childbirth (OR=1.337) and number of prenatal consultations less than seven (OR=3.305); and newborns of white race/skin (OR=0.731) and presence of congenital malformation (OR=2.663). Considering the proposed criteria, an asymmetric link function (reverse power logit) was the most parsimonious for the model. From this, there were high chances of factors associated with the occurrence of prematurity, indicating the need for actions to minimize them.

The variables selected to verify a possible association with prematurity were those obtained before the event occurred. Thus, those related to the mother (age, partner, education, and parity) were obtained; to pregnancy and childbirth (type of pregnancy, type of childbirth, prenatal consultations, and place of childbirth) and to the newborn (sex, race/skin color, and congenital malformation). Each variable was categorized into reference (lower levels) and risk, according to the criteria presented in the literature (Oliveira, 2015) (Table 1).

Link functions
In the literature, the logit, probit, and cauchit link functions, all with symmetrical characteristics, are obtained from the Logistic, Normal, and Cauchy probability distributions, respectively; and the cloglog, and loglog link functions, of asymmetrical characteristic, obtained from the Gumbel distributions of Minimum and Maximum Value, respectively, are said to be basic (Albert and Chib, 1993;Abanto-Valle et al., 2015). In addition to these usual link functions, it is intended to use here the class of link functions known as power and reverse power, as they are presented as an alternative for a better fit of the model .
To obtain such link functions, one must know the probability distributions from which they come. A univariate random variable T has power probability distribution, T ~ P(, 2 ,), or reverse power, T  Silva et al., 2020), that is, the Logistics, Normal (Gaussian), Cauchy, Minimum and Maximum Value Gumbel, were also considered as base distributions for the power and reverse power distributions ( Table 2).
Considering the variables of interest, binary regression models were fitted, with the link functions shown in Table 2, under the Bayesian approach, in order to choose the model with the best fit (Paulino et al., 2003;Rossi, 2011). Considering that for the basics link functions, the parameters to be estimated are the regression coefficients, β=(β0, β1, ..., βk), on the other hand, for the power and reverse power link functions, in addition to the previous parameters, it is also necessary to estimate , the parameter of asymmetry, which can be obtained from the transformation =ln(λ) ↔ λ= exp(), which facilitates
For all models, as initial values for β, were used frequentist (classical) estimates with logit link function and, to , a random generated value from a uniform distribution, with parameters a=-2 and b=2. In addition, 1,100,000 values were generated considering a sample discard period of 100,000 and jump of size 100 to reduce the autocorrelation generated in the MCMC process (Markov Chain Monte Carlo), forming final samples of 10,000 generated values. The convergence of the chains was verified by the Geweke (1992) and Heidelberger and Welch (1983) criteria through the coda package (Plummer et al., 2006) in R software. The parameter estimates, obtained from posterior distributions, were the mean, the standard deviation and the median (used when there is an asymmetry in the posterior distribution). Furthermore, the "significance" of these parameters was verified if their respective HPD (Highest Posterior Density) intervals, with 95% credibility, did not contain a zero value.

2.2.3
Model selection, diagnostics, and predictive evaluation Model selection, diagnosis, and predictive evaluation Some criteria were considered to carry out the selection final model fitted. As an information criterion, the Bayesian Deviance Information Criterion (DIC) (Spiegelhalter et al., 2002) was used. Models with lower DIC values are indicated as more parsimonious.
In order to evaluate the quality of fit of the model to the data, it was performed the diagnosis of the residuals in addition to the verification of their predictive capacity (Giolo, 2017). Regarding the residual, it was chosen to use the normalized quantile (Dunn and Smyth, 1996) and check its suitability from the index graphs and the binomial simulated envelope (Atkinson, 1981).
For the evaluation of predictive capacity, the study population was divided into sets: training (70%) and test (30%), in order not to obtain an overfitting for the prediction model (Kuhn and Kjell, 2013). Predictive measures of sensitivity (s), specificity (e), accuracy (ACC) and the area under the ROC curve (AUC -area under ROC curve (Receiver Operating Characteristic)) -measure summarizing the overall accuracy of the predictive model (Fawcett, 2006;Borges, 2016), were calculated.
After the diagnostic stage, the best model was selected and with that, the OR's were calculated for the associations found. Such a measure can be used in studies with a cross-sectional design, provided that its interpretation is carried out in terms of odds, and not of probability associated with risk (Castro-Costa and Ferri, 2008).

Results and Discussion
From the exploratory analysis, it can be verified that for the Age variable of the mother, the proportion of premature births is higher for women aged 35 years or over, when compared to the reference category and for women aged less than 20 years is less than the reference category. In the variables with only two categories: primiparous mothers, who had cesarean childbirth and with an insufficient number of prenatal consultations, in addition to newborns with congenital malformations, also present risk categories with a higher proportion of premature births, which descriptively indicates a possible association with the outcome. A variable that presents a large proportion of premature births in its risk range is formed by mothers who had multiple pregnancies. It is possible to verify a proportion of premature births three times higher than non-premature infants, indicating descriptively that the association of this variable with the outcome will be considered strong.
For the Partner and Race/Skin Color variables, the risk categories have a lower proportion of premature compared to the reference one. This result indicates descriptively that the association of these variables with the prematurity outcome has an inverse effect on the risk class. That is, individuals without partners have 73.1% (OR=0.731) of the risk of those with partners, and non-white individuals have 75% of the risk of whites (OR=0.75).
Regarding the variables: Education, Type of childbirth and sex, the proportion of premature and non-premature in the reference and risk categories present the same (or very close) pattern of premature and non-premature in the population, which indicates descriptively that there may be no association of these variables with the outcome.
The variable Age (related to the mother) was considered as a dummy variable, such that the age between 20 and 35 years was the reference class in both cases. Thus, it was defined as Age1 for less than 20 years being the intermediary risk category, and Age2 for over or equal to 35 years being the elevate risk category. The explanatory variables that were used in the proposed models were Age2, Partner, Parity, Type of pregnancy, Type of childbirth, Prenatal consults, Race/Skin Color and Congenital malformation, according to Oliveira (2015).
All the fitted models passed the Geweke, and Heidelberger and Welch tests, thus verifying the convergence in the posterior chains. It was found that from the DIC (Table 3), the models with RPL and RPC link functions are also parsimonious, besides presenting the lowest values (3,203 and 3,199, respectively) when compared to the others.
From the results of the first selection stage, it was used the criteria for evaluating the quality of the fit of the models, with link functions that stood out by the DIC (RPL and RPC), and also, for the model with logit link function, commonly used to determine associations of factors to prematurity (Cascaes et al., 2008;Guimarães et al., 2017).
When evaluating the accuracy, sensitivity, specificity and AUC measures (Table 4), it is found that all models have very close values, indicating that the predictive analysis criterion is not useful for choosing a model for the preterm birth data of the present study. Taking into account all the proposed criteria to verify the fitted models (Table 3), the RPL model was chosen. The estimates of the parameters of all the models analyzed so far (L, RPL and RPC), are presented in Table 5. Note that for the chosen link function (RPL), all parameters were significant, since 0  HPD intervals. It was decided to use the median as a point estimate for the parameters, as there is asymmetry in posterior distributions.
The diagnosis of the fit of models with L, RPL and RPC link functions, through the graphic analysis of the randomized quantile residuals, shows that the RPL model did not present excessive observations highlighted from the others, in addition to not presenting points (residuals) outside the credibility band simulated (Figure 1), indicating that the model is adequate. For the other models such behaviors were not verified.  Considering the RPL model, it can be said that the odds of prematurity was higher among those born to mothers over 35 years of age (OR=1.485) and primiparous (OR=1.307); with multiple pregnancy (OR=36.360), cesarean childbirth (OR=1.337), with insufficient number of prenatal consultations (OR=3.305) and newborns with congenital malformations (OR=2.663). In addition, the odds for prematurity was lower among those born to a mother with a partner (OR=0.731) and those with the white race/skin color self-referred by the mother (OR=0.750) ( Table 5).
The present study showed that, when considering criteria for the selection of a model, it was found that the one, with asymmetric characteristic (RPL -has the estimate of the asymmetry parameter related to the data), was more suitable for the analysis of factors associated with the outcome, which is unbalanced, although the parameter estimate is not significant, as it includes the unit value in its high density range (λ=0.336; HPD: (0.135;2.707)) and marginally significant for the RPC model (λ=0.708; HPD: (0.313;1.025)). Also, when compared with the results obtained by the model with logit link function, which is the commonly used for this type of analysis (Cascaes et al., 2008;Guimarães et al., 2017), greater association measures are verified for the model chosen through the criteria of selection, that is, the asymmetric reverse power link function better captures the associations between factors and outcome.
It is also important to highlight, as the use of models with power and reverse power reverse link functions, are present in the literature for the analysis of different areas. In the application (Chen et al., 1999) in which a sample of 4,000 car policyholders was used, with unbalanced amounts of successes and failures (34% and 66%, respectively), concluded that the model with the RPC link function would be the most parsimonious under the Bayesian perspective, to predict customers who will purchase a full insurance coverage plan for their automobile, based on sex, driving area, vehicle use, marital status, age and seniority in the company.
From an application (Lemonte and Bazán, 2018) in order to present models with power and reverse power link functions obtained from the Logistics, Normal (Gaussian), Cauchy, Laplace and t-Student distributions, as an alternative for modeling binary responses, used data related to coca leaf cultivation in Peru. From a sample of 1,947 coca producers, an exploratory study on the factors associated (demographic characteristics and aspects of social programs implemented in recent years in the country) to their behavior in the decision to continue or eradicate coca cultivation, was carried out. Using the frequentist (classical) approach, they opted for the model with t-Student reverse power link function, therefore, based on the criteria: the highest value of the likelihood function evaluated in the estimates, information criteria (AIC, BIC and HQIC), diagnostic analyzes (local influence and residues) and because it has a lower standard deviation for the estimates of the parameters, the model with this link function was chosen.
With the same distributions mentioned above, an application (Huayanay et al., 2019) was presented to verify the performance of the models. From chemical attributes determined by wine connoisseurs, the objective was to verify which of these, the quality of white wine from the Minho region, in Portugal, was associated between May 2004 and February 2007. In a total of 4,898 wine samples, it was found that 21% were considered of good quality, showing the imbalance of the response variable. From certain Information Criteria, predictive analysis and the randomized quantile residual, the best fit was obtained by the model with t-Student power link function with 0.6 degrees of freedom.
In general, a model with a link function that has the estimate of the asymmetry parameter related to the unbalanced response, presented a better fit in different fields of study, as seen above, and in data related to prematurity. Also, the Bayesian methodology used to obtain parameter estimates was a practical option for implementing the models, either using the MCMC iterative process via the Gibbs Sampling (Geman and Geman, 1984) algorithm as here, or using other algorithms (Bazán et al., 2017;Huayanay et al., 2019;Silva et al., 2020), in addition to highlighting the asymmetry that the posterior distribution of the parameters presents for the link function chosen.
In the study (Silva et al., 2020) from a Bayesian perspective, the authors present modeling results in simulations of unbalanced dichotomous data, considering different sample sizes and conclude that, regardless of the sample size, such unbalance affects the estimation of logistic regression parameters in relation to the bias, square error and the standard deviation of the estimates. They also discuss adjustment in two distinct applications (one for linguistic data and one for educational data) considering other link functions in a regression model and conclude that power and reverse power are more suitable for the modeling.
It is common knowledge in the health area that the prevention of premature birth always needs to be improved, as complications from this type of birth are the main cause of death in children under five years of age (Who, 2021), and also a causative factor for several sequels to newborns in their childhood and even into adulthood (Oliveira, 2015;Crump et al., 2011). In this sense, statistics makes its contribution, presenting more sophisticated methods, providing professionals in the area and competent bodies, a basis to better lead them to a decision-making to minimize the occurrence of this type of birth.

Conclusions
In the proposed criteria, an asymmetric link function (reverse power logit) was the most parsimonious for the modeling. From this, there were high odds of factors associated with the occurrence of prematurity, indicating the need for actions to minimize them.
The use of power and reverse power link functions for binary regression models applied to unbalanced data can promote significant improvements in parameter inferences.