Counting models for overdispersed data: A review with application to tuberculosis data

The present work reviews distributions for counting data: Poisson; Negative Binomial; COM-Poisson and Generalized Poisson, and their regression models. Aspects such as parameter estimation and model choice criteria are presented. And as an application example, we use the regression models of these distributions to explain the relationship between tuberculosis notifications with the HDI Human Development Index of the 102 cities in the state of Alagoas. The existing relationship between notifications of tuberculosis with HDI is significant and overdispersion at the level α = 5% of probability, and the COM-Poisson distribution regression model was the best fit data, according to the Akaike AIC and Bayesian BIC information criteria.


Introduction
Counting data is common in areas of agriculture (Carvalho et al., 2018;Hall, 2000); education Desjardins (2016); engineering Anastasopoulos & Mannering (2009); industry Lambert (1992); psychology Atkins & Gallop (2007); public health Khan et al. (2011), among others.The linear regression model of the Poisson distribution is one of the most used models in the areas of science, since most of the phenomena meet the postulates of the Poisson distribution Shmueli et al. (2005).However, in cases where the data are overdispersed, that is, when the data variance is greater than the mean, the Poisson model is not indicated because it accommodates only equidispersion, when the variance is equal to the mean.In this case, the linear regression models of the Negative Binomial, COM-Poisson and Generalized Poisson distributions are the most indicated because they have a dispersion parameter that can accommodate overdispersion.Data overdispersion can be caused by too many zeros; outliers; the way the sampling process was conducted; the use of inappropriate link function in the model; non-linear effects considered as linear in the systematic component of the model; omitted covariates in the model, among others (Ridout & Besbeas, 2004;Hilbe, 2011).In any case, overdispersion cannot be neglected, otherwise the model estimates will be biased Ridout & Besbeas (2004).
Underdispersion may also occur in the data, a less common situation when the data variance is less than the mean.Works involving underdispersion data are found in (Shmueli et al., 2005; Sellers & Morris, 2017; Barlow & Proschan, 1996;citeridout2004;Hayati et al., 2018).For these types of data, the COM-Poisson and Generalized Poisson distribution models, as they have a more flexible dispersion parameter to accommodate underdispersion, are more suitable than the Negative Binomial distribution Santana (2019).Works involving these types of models are still scarce in some literature, such as in the agricultural sciences.For example, in a search on the SciELO base page in SciELO (2022a), of the 115,681 published articles, only 59 articles were found involving the distribution Poisson, 15 papers involving the Negative Binomial distribution and no papers involving the COM-Poisson and Generalized Poisson distributions.This problem, due to the lack of scientific articles with statistical rigor in the agricultural sciences, is reported in Carvalho et al. (2019) in a bibliographic review survey.
This problem also spreads to other areas, such as health case.In a search on the SciELO database in SciELO (2022b), of the 506,199 published articles, no article was found involving the COM-Poisson and Generalized Poisson distributions.This demonstrates the need to publish scientific papers involving more sophisticated distributions in related areas of statistics.
In view of the above, this work makes a bibliographic review of Poisson distributions; Negative Binomial; COM-Poisson and Generalized Poisson and their regression models, with the aim of providing information to professionals in sciences related to statistics who are not familiar with regression models and the criteria for choosing these distributions.As an application example, we used health data from the profile of municipalities in 2020 on notifications of tuberculosis and the Human Development Index of the 102 cities in the state of Alagoas, available at (de Saúde do Estado de Alagoas (2017); BRASIL (2021); de Saúde Perfil dos Municípios Alagoanos (2022)).
This article is organized as follows: in Section 2, we review regression models for Poisson distributions; Negative Binomial; COM-Poisson and Generalized Poisson and selection criteria.In Section 3, we present an application with these models to explain the relationship between tuberculosis notifications with the Human Development Index for all cities in the state of Alagoas.Finally, we present the conclusion in Section 4.

Poisson distribution model (Po)
We start counting models with the Poisson distribution model.Consider Y i as a random variable with Poisson distribution and we write (Y i ∼ Po(λ i )), if Y i has the following probability mass function (fmp) where λ i > 0 corresponds to the average number of occurrences of a given event, y i is the realization of the random variable Y i , with mean E(Y i ) = λ i and variance Var(Y i ) = λ i .In the regression structure of the Poisson distribution, the parameter λ i is related to the covariable x i through the logarithmic link function (Paula, 2004) that is, where x i = (1, x 1 , . . ., x p ) is a vector of explanatory variables and β = (β 0 , β 1 , ..., β p ) ⊤ is a vector of unknown model parameters, both of dimension (p+1).Then, for a random sample (y 1 , x 1 ), . . ., (y n , x n ), with n independent observations, the log-likelihood function of the Poisson model is given by For details on the log-likelihood function, and the various link functions of the Poisson distribution, see (Nelder & Wedderburn, 1972;Cameron & Trivedi, 2013).

Negative Binomial Distribution Model (NB)
Let Y i be a random variable representing the number of trials before the rth success in n Bernoulli trials.We say that the random variable Y i has a Negative Binomial distribution with parameters r and p and we write ( where r>0 is a fixed integer; p is the probability of success on the rth success and y i is a realization of the random variable (Bartko, 1960).Specifically, the Negative Binomial distribution is equivalent to the Geometric distribution with parameter p, that is, (Y i ∼ Geom(p)) when r = 1 (Hilbe, 2011).
There are several ways to represent the Negative Binomial distribution, depending on the parameterization to be used, one of them can be obtained through the parameterization p = µ i /(ϕ + µ i ) and ϕ = r used in Nelder & Wedderburn, 1972, and by replacing it in Equation ( 4) we obtain Where that is, the NB distribution can be used to model the overdispersion in the data and ϕ -1 corresponds to the dispersion parameter, for details see (Hinde & Demétrio, 1998;Park & Lord, 2009;Nelder & Wedderburn, 1972;Hilbe, 2011).Specifically, the NB distribution is equivalent to the Poisson distribution when ϕ → ∞ (Cameron & Trivedi, 2013).
In the regression structure of the NB distribution with regression on the mean, the parameters µ i and ϕ are related to the covariate x i and to the parameter τ , through the logarithmic binding function (Hilbe 2011), that is: where β = (β 0 , β 1 , ..., β p ) ⊤ and x i = (1, x 1 , . . ., x p ), are the vectors of parameters and explanatory variables, both of dimension p + 1.Thus, for a random sample with n independent observations, the log-likelihood function of the NB distribution model is given by For details on the log-likelihood function, and the various link functions of the Negative Binomial distribution, see (Paula, 2004

COM-Poisson distribution model (COMP)
The COM-Poisson distribution was proposed in the 1960s by Richard W. Conway and William L. Maxwell in (Conway & Maxwell, 1962) to model service fees, and then forgotten.Four decades later, COM-Poisson is revived with the paper by Shmueli et.al. in (Shmueli et al., 2005).Since then, numerous works involving COM-Poisson extensions have been developed (Santana, 2019).Thus, we say that the random variable Y i has a COM-Poisson distribution and we write Y i ∼ COMP(λ i , ν), if it has the following fmp where λ i > 0 is a center parameter that is approximately the mean when ν ≈ 1 (Barriga & Louzada, 2014); ν ≥ 0 is the dispersion parameter of the COM-Poisson distribution with under-equioverdispersion when (ν > 1, ν = 1 and ν < 1), and S(λ i , ν) is the COMP normalization constant (Shmueli et al., 2005), which is given by The COM-Poisson distribution has three distributions as particular cases, that is, it is equivalent to Poisson when ν = 1, the geometric when (λ i < 1 and ν = 0) , and the Bernoulli with probability of success λ i /(1 + λ i ) when ν → ∞, (Boatwright et al., 2006).The COM-Poisson distribution does not have a closed expression for its moments concerning parameters λ and ν, Guikema & Goffelt (2008).However, an approximation for its mean E(Y i ) and its variance Var(Y i ), is presented in Shmueli et al.
(2005), according to Equations10 and 11: Var another structure for the COM-Poisson mean and variance is presented in Guikema & Goffelt (2008) with the parameterization of COM-Poisson for the GLM structure.For details, see (Lord et al., 2008;Huang, 2017;Ribeiro Junior, 2019).
In the regression structure of the COM-Poisson distribution, with regression on the mean, the parameters λ i and ν are related to the covariate x i and to the parameter ξ, through the logarithmic binding function Sellers & Shmueli (2010), that is: where β = (β 0 , β 1 , ..., β p ) ⊤ and x i are vectors of parameters and covariates of dimensions p+1.For a random sample with n independent observations, the log-likelihood function of the COM-Poisson model is given by For details on the log-likelihood function and the linkage function of the COM-Poisson distribution, see (

Generalized Poisson distribution model (GP)
Let Y i be a random variable with Generalized Poisson distribution (GP) in Consul (1989), and denote where λ i > 0 and max(-1λ i /4) < φ < 1 are the center and dispersion parameters of the GP distribution, with sub-equi-overdispersion when (φ < 0, φ = 0 and φ > 0), specifically the GP distribution is equivalent to the Poisson distribution when φ = 0.And y i is a realization of the random variable Y i with mean E(Y i ) = λ i /(1φ) and variance Var(Y i ) = λ i /(1φ) 3 .For details, see Ridout et al. (2001).
In the regression structure of the GP distribution, with regression on the mean, the parameters λ i and φ are related to the covariate x i and the parameter ζ , through the logarithmic binding function (Consul, 1989), that is: where β = (β 1 , ..., β p ) ⊤ and x i = (1, x 1 , . . ., x p ), are the vectors of parameters and explanatory variables, both of dimension p + 1. ζ is one more parameter of the model to be estimated.For a random sample (y 1 , x 1 ), . . ., (y n , x n ), with n independent observations the log-likelihood function of the GP distribution model is given by For details on the log-likelihood function, and the various link functions of the Generalized Poisson distribution, see (Yee, 2017;Consul, 1989).

Estimation
To obtain the estimates of the models' parameters, we maximized the log-likelihood functions of the Equations (3; 7; 13 and 16), concerning the parameters of proposed models.Due to the complexity of the log-likelihood functions of the Equations (7, 13 and 16) estimators are obtained via numerical methods, such as the Quasi-Newton used in BFGS method, implemented in the optim function of software R Team et al. (2013).In implementing the codes for the NB, COMP and GP models, we multiplied the log-likelihood functions by -1, since optim minimizes the objective function, and as initial estimates, we used the Poisson model estimates for the vector of parameters ˛and (τ = 1, ξ = 1 and ζ = 1) for the dispersion parameters of the NB, COMP and GP models.To obtain the observed information matrix, we use the option hessian = TRUE of the optim function.
A similar alternative to find the parameter estimates of the proposed models without implementing codes is to use the glm function for the Poisson model, and the MASS packages for the Negative Binomial; COMPoissonReg for COM-Poisson and VGAM for Generalized Poisson.For details, see (Ripley et al., 2013;Yee, 2017;Sellers et al., 2019).

Confidence Interval
Let β be the maximum likelihood estimator of the generic vector β of parameters for the regression models, with θ = β; θ = (β, ϕ); θ = (β, ν) and θ = (β, φ) for Poisson models; Negative Binomial; COM-Poisson and Generalized Poisson.The Asymptotic Confidence Interval (ICa) with confidence level (1α)100% for the parameter vector θ, is given by where [J( θ)] -1 is the inverse matrix of the observed information, z α/2 is the α/2th quantile of the standard normal distribution (Santana et al., 2022).The ICa in Equation 17is recommended for large sample sizes, that is, when the distribution of θ approaches the normal distribution.For small or moderate sample sizes, the ICa may not provide accurate results.The Bootstrap Confidence Interval (ICb) can solve statistical inference problems related to sample size.There are several types of bootstrap confidence intervals, however, in this work we restrict them only to the non-parametric method, for details see (Efron, 1979).
Consider y = {y 1 . . ., y n } a random sample of the variable Y with cumulative distribution F. And let θ = h(F), a vector of unknown parameters of F and θ = s(y), your estimator.Then by randomly selecting B independent samples with replacement, y * = {y * 1 . . ., y * n } of y, we get { θ * } B b=1 estimates of θ.Consequently, the distribution θ is obtained by the empirical distribution of θ * .And then, the standard error ep( θ), of θ can be estimated by the standard deviation of { hatθ * } B b=1 , (Efron & Tibshirani, 1994, p. 45;Kehler, 2018, p. 22) this is, So, if the B bootstrap samples of { θ * } B b=1 have distribution close to normal the ICb (Liu & Tian, 2015, p. 206; Alves, 2013, p. 43), with confidence level (1α)100% for θ is given by

Choosing the Best Model
To choose the best model that fits the data, we used the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), which are given by where ℓ( θ; y) is the maximized log-likelihood function of the model; k is the number of parameters of the proposed model and n the number of observations in the sample.For details, see (Akaike, 1974;Schwarz, 1978).The best model will be the one that presents the lowest value for the AIC and BIC criteria (Paula, 2004).

Application
The variables under study correspond to confirmed cases of tuberculosis (all forms) for the year 2020 and the Human Development Index (HDI) for the 102 municipalities in the state of Alagoas.Data were collected by the authors in the statistical yearbook of the state of Alagoas for the year 2020 and on the Datasus page available at (BRASIL, 2021; de Saúde do Estado de Alagoas, 2017; Tabnet 2022).Figure 1 presents the map of the state of Alagoas with notifications of tuberculosis for its 102 municipalities.In Figure 1, we can observe that the data are scattered and asymmetrical, for example, the class >50 corresponds to the 584 cases of tuberculosis for the capital Maceió, with an estimated population of 1,031,597 inhabitants and HDI of 0.721 (IBGE 2022).

Results and discussion
In order to better in undertand the cases of tuberculosis, we performed a descriptive analysis, which presented a mean of 10.83; variance 3340.38 and asymmetry coefficient 9.70 which corroborates the visual analysis of Figure 1, and the same can be observed in Figure 2.
It can be seen in the column chart in Figure 2 that the data clearly do not have normality because they are counting data, and due to the large dispersion in the data and the asymmetry, probably the Negative Binomial, COM-Poisson and Generalized Poisson will fit the data better than Poisson, as they have a dispersion parameter.
Thus, we fit the regression models of Poisson distributions; Negative Binomial; COM-Poisson and Generalized Poisson to explain the relationship between tuberculosis notifications and the HDI.In Table 1, we present the parameters; their maximum likelihood estimators (MLE); their asymptotic confidence intervals ICa(95%) and bootstrap ICb(95%) both with 95% confidence, for the confidence interval ICb(95%) 5,000 replicates were performed bootstrap, in addition to the (AIC) and (BIC) criteria for four models considered in the study.
We observe in Table 1 that the parameter β 1 that corresponds to the regression coefficient was significant at the level α = 5% for all models, since their respective intervals of confidence ICa(95%) and ICb(95%), did not include the zero term, that is, the relationship of tuberculosis notifications with the HDI is significant α = 5%, and was identified by the four models.Likewise, the models of the NP, COMP and GP distributions were also able to capture the overdispersion caused by the high tuberculosis notifications as shown in Figure 2, since the estimates of their respective dispersion parameters, presented the values: 1/ φ = 1/1.016̸ = 0, ν = 0.046 < 1 and φ = 0.847 > 0. This overdispersion is also significant at the α = 5% level, since their respective confidence intervals ICa(95%) and ICb(95%), did not include the zero term for ϕ and φ and term one for ν.The same did not occur for the Poisson model because it does not have a dispersion parameter, part of the overdispersion went to its mean, corroborating the eigenvalue of β1 = 26.490.The GP distribution presented the value for β1 intermediate to that of the NB and COMP distributions, in hypothesis this occurred because the GP distribution is limited by the NB distribution.
The COMP distribution regression model was the one that showed the lowest value for the estimates of the AIC and BIC criteria, followed by the NB distribution model.This fact occurred because the NB and COMP distributions have the geometric distribution as a particular case, which corroborates the values of the parameters φ = 1.016 ≈ 1 for the NB model and ν = 0.046 ≈ 0 and λi < 1 for the COMP model.In hypothesis, the Geometric distribution regression model would fit the data better because it has a smaller number of parameters than the NB and COMP models.However, when considering the AIC and BIC criteria, we chose the COMP distribution regression model to represent the relationship between tuberculosis notifications and the HDI of each municipality in the state of Alagoas.
In Figure 3, we present the graph of the observed values versus the values predicted by the COM-Poisson distribution regression model, in order to verify the prediction quality of tuberculosis cases in the state of Alagoas.
The model managed to predict the data well according to the R 2 = 0.98 of the observed values versus those predicted in Figure 3.The outlier point corresponds to the capital Maceió, whose prediction by the model was #580 against the #584 notifications registered in that year.Thus, the COM-Poisson distribution regression model emerges as a great analysis option for health data with

Conclusion
In the 102 cities in the state of Alagoas, there is a relationship between tuberculosis notifications and the HDI human development index, which is overdispersed and significant at the probability level α = 5%, and can be explained by the COM-Poisson distribution regression model.

Figure 1 .
Figure 1.Map of the state of Alagoas with tuberculosis notifications for the year 2020.

Figure 2 .
Figure 2. Column chart of tuberculosis notifications in the state of Alagoas for the year 2020.