Sequential Bayesian approach for genetic diversity analysis of the piracanjuba fish ( Brycon orbignyanus )

In the sequential Bayesian approach, the sample size is not fixed before the experiment; it is determined based on the observations made. The procedure concludes when there is enough information to estimate the parameters, according to a stopping criterion. A parameter of interest in population genetics is the proportion obtained from the allele frequency at one or more loci to verify Hardy-Weinberg equilibrium (HWE). The objective of this study was to assess the occurrence of HWE in a population of piracanjuba fish (Brycon orbignyanus) by estimating the allele proportion and expected genotype proportions using a sequential Bayesian approach. Additionally, a comparison was made with frequentist and Bayesian approaches. Initially, genotypic profiles were analyzed at a microsatellite DNA locus, Bh6, in 49 fish, to determine the frequency of observed alleles and genotypes at the UFSJ Genetic Resources Laboratory. Seven allele classes were observed; thus, under the assumption of sampling independence, the likelihood is multinomial. The estimation of allele and genotype proportions was then carried out using frequentist, Bayesian, and sequential Bayesian approaches. A uniform prior and a cost of 10 –3 were considered. The estimates from the three approaches were compared, and it was concluded that the sequential approach proved effective, utilizing only 55.1% of the available data, thereby reducing the sample size and optimizing the procedure. Using a chi-square test at a 5% probability level, it was concluded that the studied sample is in Hardy-Weinberg equilibrium (p-value: 0.9800245).


Introduction
The sequential sampling is characterized by using samples of variable size; therefore, the sample size is not fixed before the experiment but is determined based on the observations made.The incorporation of Bayesian techniques into sequential sampling allows the use of a priori information that can optimize the sampling plan and improve parameter estimation (Wald, 1947).
Thus, in the Bayesian sequential approach, the procedure is stopped when sufficient information is obtained to estimate the desired parameters, according to a stopping criterion that compares the immediate and expected risks at each sample element.The decision to stop sampling is made when the immediate risk is lower than the expected risk, and the parameters of interest are estimated (Berger, 1985).
This approach can be applied in various areas with the aim of reducing the required sample size for making decisions about parameters.Therefore, it is particularly useful in processes where sampling is destructive and incurs high financial and/or execution time costs (Schilling & Neubauer, 2017).
In the context of population genetics, the parameters of interest are the proportions obtained from the frequencies of alleles and genotypes at one or more loci, to check for Hardy-Weinberg equilibrium (HWE).According to Hartl & Clark (2010), a population is in HWE when the genotype proportions are distributed as expected based on the occurrence of random mating, a phenomenon known as panmixia, for a given distribution of allele proportions (genetic diversity).In general, this equilibrium occurs when allele and genotype proportions remain constant.
However, the process of genetic population description is conducted meticulously, requiring the visual analysis of one element at a time at each DNA locus.This procedure, besides consuming considerable time, demands significant resources, making it an exhaustive task.Therefore, the application of the Bayesian sequential approach is relevant in the field of population genetics, as it may result in a reduction of the required sample size to verify Hardy-Weinberg equilibrium, optimizing the procedure.
In a two-allele gene system, A and a, the proportion of these alleles follows a binomial distribution.Meanwhile, the proportion of genotypes in a population can be described by a multinomial distribution with three categories: homozygote AA, heterozygote Aa, and homozygote aa.However, alleles generally have more than two categories, exhibiting genetic diversity; thus, the allele proportion is also determined by a multinomial distribution (Rehman et al., 2020).
Thus, the objective of this study was to verify the occurrence of Hardy-Weinberg equilibrium in a population of piracanjuba fish (Brycon orbignyanus), an endangered species, by estimating allele proportions and the expected proportions of genotypes.This was done using a Bayesian sequential approach for the multinomial distribution and comparing it with frequentist and Bayesian approaches.

Multinomial distribution
The multinomial distribution is a discrete probability distribution and a generalization of the binomial distribution for polytomous response variables, used to estimate the probability of an element belonging to more than two categories.
According to Casella & Berger (2002), the multinomial distribution is defined by assuming an experiment whose result is one of the events E 1 , E 2 , ..., E k , with probabilities P[E i ] = p i , where k is the number of classes of the multinomial distribution.For i = 1, 2, ..., k, 0 ≤ p i ≤ 1, and k i=1 p i = 1.Let X i be a random variable that counts the number of occurrences of E i in n independent repetitions of this experiment.Then, the random vector (X 1 , X 2 , ..., X k ) has a distribution called multinomial, with parameters p 1 , p 2 , ..., p k-1 , and n, given by: where each X i is a positive integer, p 1 , p 2 , ..., p k are population proportions, and k i=1 x i = n.There are p 1 , p 2 , ..., p k-1 parameters because k i=1 p i = 1, thus p k = 1 -k-1 i=1 p i .

The Bayesian estimation of parameters for the multinomial distribution
The Dirichlet distribution is the multivariate generalization of the beta distribution, with a nonnegative real vector parameter a.It is a widely used multivariate discrete distribution in the Bayesian context as the conjugate prior distribution for the multinomial distribution (Paulino et al., 2018).
Let X = (X 1 , ..., X k ) T be a vector with k components; then, it follows a Dirichlet distribution of order k ≥ 2 with a parameter vector a = (a 1 , ..., a k ) T , i.e., (Paulino et al., 2018): Its probability density function is given by: where a 0 = k i=1 a i , Γ (t) = ∞ 0 x t-1 e -x dx is the gamma function, and k i=1 p i = 1.The marginal distribution is a beta distribution with parameters a i and (a 0a i ) for each i, from which: If the prior distribution is Dirichlet and the observed variable follows a multinomial distribution, then the posterior distribution will be another Dirichlet distribution with different parameters.Therefore, the posterior distribution follows a Dirichlet distribution with parameters: where X = (x 1 , ..., x k , a 1 , ..., a k ) T .Thus, the mean, variance, and covariance of the Dirichlet posterior distribution are given by:

The Bayesian sequential estimation of parameters for the multinomial distribution
According to Berger (1985), the main idea behind the Bayesian sequential estimation procedure is that when making each observation one at a time, one should compare the a posteriori Bayesian risk of making an immediate decision with the expected a posteriori Bayesian risk, which will be obtained if more observations are taken.
Moreover, the Bayesian sequential rule can also be known as Bayesian learning because the a posteriori distribution calculated at the current n will be used to update the a priori distribution yet to be used in the (n + 1)-th evaluation (Berger, 1985).
In this regard, to determine the stopping criterion, it is necessary to calculate these risks for the multinomial distribution.Lima (2022) established these risks for Dirichlet conjugate priors and detailed the derivations; for more details, refer to the mentioned work.
To obtain the stopping criterion, a quadratic loss function was considered for the parameter estimate p = (p 1 , p 2 , ..., p k ) T by p = (p 1 , p2 , ..., pk ) T , and it takes the general quadratic form (p -p) T K(p -p), where K is a positive definite symmetric I × I matrix of constant loss (Chen, 1988;Jones, 1976;Owen, 1970).
Thus, using a quadratic loss function, the Bayesian estimator p is the mean of the posteriori distribution of (x, n), i.e., the mean of the posteriori Dirichlet distribution, given by ( 4).
The immediate risk, or the stopping risk of making a decision, is given by: where Σ is the dispersion matrix of the posteriori Dirichlet distribution, having the posteriori variances on its main diagonal and the posteriori covariances of the Dirichlet distribution in the other components.This results in: Dynamic programming equations were used to find the expected risk, resulting in: where c is the cost of sampling one observation.Therefore, the stopping criterion boils down to comparing the values of immediate and expected risks for each observation, given by the expressions found in ( 8) and (9).When the immediate risk is less than the expected one, i.e., S(x, m) < B(x, m), the sampling stops, and the parameters of interest are estimated.Otherwise, if S(x, m) > B(x, m), the sampling continues, taking another observation until a decision can be made.

Hardy-Weinberg equilibrium
The Hardy-Weinberg equilibrium is one of the main topics studied in population genetics.In 1908, the English mathematician Godfrey Harold Hardy and the German physician Wilhelm Weinberg independently and almost simultaneously arrived at the Hardy-Weinberg Equilibrium Law.
To check the HWE, one must calculate the expected proportion of each genotype based on the allelic proportions p 1 and p 2 , representing the proportion of alleles A and a, respectively, in the population.Thus, the expected genotypic proportions are estimated from the Hardy-Weinberg equation, which is given by a quadratic expansion of allelic proportions (Hartl & Clark, 2010): where p 2 1 represents the proportion of homozygotes AA, 2p 1 p 2 the proportion of heterozygotes Aa, and p 2 2 the proportion of homozygotes aa.In the case of a single-gene system with k alleles, k ≥ 2, the probability distribution associated with genotypes is a multinomial distribution, and it will have classes because the number of categories/classes of the multinomial distribution of genotypes is given by the combination of the number of alleles taken 2 by 2, added to the number of alleles, considering that species generally have a diploid number of chromosomes.i.e., The Hardy-Weinberg equation to estimate the expected genotypic proportions when there is genetic diversity, i.e., when the population has more than two alleles (k ≥ 2), is given by generalizing the expansion (p 1 + p 2 + ... + p k ) 2 = 1 where p n , with n = 1, ..., k, is the proportion of alleles, resulting in: Therefore, p ij = 2p i p j , i, j = 1, ..., k, if i ̸ = j, and p ij = p 2 i , if i = j, where p ij are the proportions of genotypes A i A j , with j ≥ i.This relationship can be written in matrix form, where on the main diagonal, the expected proportions of homozygous genotypes are, and the rest are the expected proportions of heterozygous genotypes:

Application
All the theory of Hardy-Weinberg equilibrium was applied to a dataset of a microsatellite DNA locus, Bh6, from 49 fish of the species Brycon orbignyanus, commonly known as piracanjuba.These fish are from the Rio das Mortes basin, São João del-Rei -MG municipality, and are used as breeders in fish farming, being a threatened species.
Initially, the analysis of genotypic profiles of the microsatellite DNA locus Bh6 from the 49 fish was performed using polyacrylamide gel electrophoresis to determine the proportions of observed alleles and genotypes at the Laboratory of Genetic Resources of the Department of Animal Science at the Federal University of São João del-Rei, São João del-Rei, Minas Gerais (LARGE-DEZOO-UFSJ).
The allele and genotype proportions were estimated using three approaches: frequentist, Bayesian, and sequential Bayesian.The sample size for the frequentist and Bayesian approaches was 49 fish.
For the frequentist approach, the proportion of observed genotypes was calculated as follows: The allele proportion: where n ij is the observed count in each category i, j = 1, ..., k, j ≥ i, n is the total count, n i is the sum of counts in each category in row i, and n j is the sum of counts in each category in column j, in a contingency table with L rows and C columns.
And the expected genotype proportions calculated according to the Hardy-Weinberg equation, using the allele proportions: where p ij are the proportions of genotypes A i A j , and p i are the proportions of alleles, i, j = 1, ..., k with j ≥ i.
For the Bayesian and sequential Bayesian approaches, a Dirichlet conjugate prior was adopted with hyperparameters equal to one, thus having as a particular case of the Dirichlet, a uniform prior distribution, where all values are equally probable, being a non-informative priori.
In the Bayesian approach, allele proportion estimates were calculated by the mean of the Dirichlet posterior distribution given in (4).The expected genotype proportions were then calculated using (15).
For the sequential Bayesian approach, a cost of 10 -3 was considered, which is constant and additive in the loss function.This value is associated with the precision of the p values.According to Bach (2015), the cost value should have a similar order of magnitude as the loss function.This ensures that the risk function is not exclusively dominated by the cost.By considering the quadratic loss function, as the loss is the square of a difference between real and estimated proportion values, which are in the interval [0, 1], the results are always close to zero, and therefore, the cost should also be close to zero.Thus, the chosen cost value is in line with Bach (2015) and does not dominate the stopping criterion.
In the sequential Bayesian approach, the procedure stops when the immediate risk is lower than expected, given by the expressions ( 8) and ( 9), obtained through the quadratic loss function.Thus, the estimates of observed genotypes were calculated by the mean of the Dirichlet a posteriori distribution, given by (4).With the sample size at which the procedure stopped, allele proportions were calculated using (14), and from these, expected genotype proportions were calculated using (15).
The sampling order of the sequential procedure followed the order of the spreadsheet analyses.Thus, the first fish analyzed from the spreadsheet, the second, and so on.Just for clarification, as this aspect can affect early or late stopping in relation to the obtained result.
A chi-square test was performed to check whether the locus is in Hardy-Weinberg equilibrium or not.
The chi-square test is used to verify if the frequency of a certain event observed in a sample differs significantly from the expected frequency of that event.This quantification is done through the chi-square statistic, defined as (Bussab & Morettin, 2017): where observed and expected refer to the observed and expected frequencies in each genotypic class.
After calculating the value of χ 2 calculated (Equation 16), it is compared with the χ 2 tabulated value for the appropriate degrees of freedom (d.f.) and desired significance level (α).If χ 2 calculated ≥ χ 2 tabulated , H 0 is rejected; otherwise, H 0 is not rejected.The degrees of freedom (d.f.) are calculated as the number of data classes -number of estimated parameters -1 (Hartl & Clark, 2010).
Thus, it was tested whether the number of individuals in each genotypic class corresponds to the expected under the hypothesis that the population is in Hardy-Weinberg equilibrium at a 5% probability level.The hypotheses are: The population is not in HWE.
The estimates from the three approaches were compared by calculating the percentage error, the correlation between the estimates, and the confidence interval for the differences in estimated proportions between two approaches.
The Percentage Error (PE) is given by the expression: where PE is the percentage error in the estimation of the proportion parameter, p i1 is the proportion parameter estimated by approach 1, p i2 is the proportion parameter estimated by approach 2 to be compared.
The confidence interval for the difference in proportions is given by:

Results and Discussion
In the utilized dataset, seven allele classes were observed, denoted as k = 1, ..., 7. Thus, assuming sample independence, the likelihood follows a multinomial distribution.The multinomial distribution for genotypes has 28 categories, calculated by the expression 11.
The estimation of observed genotype proportions and subsequently the estimation of allele proportions and expected genotype proportions were carried out using the frequentist, Bayesian, and sequential Bayesian approaches.In the sequential Bayesian approach, the procedure was interrupted, and the estimate was obtained with a sample size of 27 fish.
In Figure 1 and Table 1, the results of the allele proportion estimates by the three approaches are presented: As the expected genotype proportions are calculated based on allele proportions, given by the Hardy-Weinberg equation (15), only the allele estimate comparisons were made.
To do this, the Percentage Error (%) between the approaches was calculated pairwise, and also the confidence interval of the differences in proportions, with a 95% confidence level.The results are shown in Table 2.
In the PE between estimates by the Bayesian and frequentist approaches, the first approach was considered Bayesian, and the second one was considered frequentist.In the PE between estimates by the sequential Bayesian and frequentist approaches, the first approach was considered sequential Bayesian, and the second one was considered frequentist.Finally, in the PE between estimates by the sequential Bayesian and Bayesian approaches, the first approach was sequential Bayesian, and the second one was Bayesian.It can be observed from Table 2 that the lowest percentage errors were between estimates from the frequentist and Bayesian approaches.However, when the parameter is larger, the sequential Bayesian approach is quite effective, as observed in the case of p 3 and p 5 .
Moreover, in Table 2, all confidence intervals for the differences in proportion estimates contain zero, meaning that the difference in the proportion may be zero, indicating non-significance.This highlights that the sequential approach was efficient in estimates and utilized only 55.1% of the available data, optimizing the procedure.
Another point to be highlighted is that, with the Bayesian sequential approach, the sample size was 27 fish, of which only the presence of 6 alleles was detected.This resulted in the estimate of p 4 being equal to zero.However, the non-estimation of allele 4 is not considered a problem in population genetics, as it is a rare allele, and therefore, its influence on Hardy-Weinberg equilibrium is low.This fact can be confirmed by the low occurrence of allele 4 in the total sample of 49 individuals, where it was observed at a low frequency (3%), being present in the last individuals evaluated in the total sample (43 and 49).
Since allele 4 is a rare allele, even if it is present in the population, its frequency will be low.Therefore, the frequency of its combinations, i.e., genotypic frequencies, will also be low.This results in a low impact on the calculation of Hardy-Weinberg equilibrium, which depends only on the adherence between the observed and expected.Thus, if the observed is rare, then the expected will also be rare, resulting in little influence on the equilibrium, since estimates of rare alleles, when present, will be close to zero.
It is also important to note that, in this study, 49 fish were observed for analysis by traditional (frequentist) methods initially, of which 7 alleles were identified.But by taking a larger number than 49 fish, it may be possible to identify more rare alleles.Thus, even using other approaches with a fixed sample size, it is not possible to identify all alleles present in the population.Alleles that are considered rare, with low frequency, will be less identified.Therefore, the number of alleles cannot be pre-fixed, as it always depends on the analyzed sample size.
Another point to emphasize is that with the sequential Bayesian approach, the sample size was 27 fish, and with these 27 fish, the presence of only 6 alleles was detected, resulting in the estimate of p 4 being zero.It should be noted that even using another approach, the same thing could happen, for example, when analyzing more fish, and the presence of more alleles is detected.Therefore, the number of alleles cannot be predetermined; this quantity will always depend on the analyzed sample size.
Additionally, the correlation between the estimates was calculated, resulting in 0.9991 between the frequentist and Bayesian approaches, 0.9826 between the frequentist and sequential Bayesian approaches, and 0.9852 between the Bayesian and sequential Bayesian approaches.All values are close to 1, indicating a strong correlation.
Histograms of the posterior distributions of alleles from the sequential Bayesian approach were constructed using the R software (R Core Team, 2023), presented in Figure 2. A chi-square test was conducted at a 5% probability level, with degrees of freedom = 28 -6 -1 = 21.It was concluded that the studied sample is in Hardy-Weinberg equilibrium, as the obtained pvalue was 0.9800245, which is greater than 0.05, leading to the non-rejection of the null hypothesis.
It can be observed that the use of Bayesian methods with Dirichlet conjugate priors to assess Hardy-Weinberg equilibrium has been yielding good results.It can be observed that there are works in the literature with the Bayesian approach, but with the sequential Bayesian approach in this area, it is a novelty.The results obtained with this approach were of great utility, as it optimized the procedure.

Conclusions
This study demonstrates that it was possible to apply the sequential Bayesian approach to estimate allele and genotype proportions, for verifying the occurrence of Hardy-Weinberg equilibrium, in a population of piracanjuba fish (Brycon orbignyanus).
There was efficiency in the sequential Bayesian approach, reducing the sample size and utilizing only 55.1% of the available data for analysis.Additionally, it is noteworthy that the use of this approach in this field is a novelty.Moreover, it can be applied in various other areas for different procedures of interest, aiming to reduce time and/or cost.

Figure 2 .
Figure 2. Histograms of the posterior distributions of alleles from the sequential Bayesian approach.
Reis et al. (2008) described a Bayesian method to study Hardy-Weinberg equilibrium through the inbreeding coefficient.In this work, the authors analyzed various models and concluded that the best model is the one using Dirichlet priors.Moreover, Reis et al. (2011) concluded that the Bayesian methodology proved to be efficient in studying the Hardy-Weinberg model, being evaluated and confirmed by simulation studies, presenting estimates very close to the real value.Furthermore, Cunha Filho et al. (2020) noticed that Bayesian analysis obtained results relatively closer to reality to verify the Hardy-Weinberg equilibrium hypothesis and has the advantage of being applicable to samples of any size.The Bayesian methods presented were efficient in testing Hardy-Weinberg equilibrium.Its application may serve as a subsidy for the researcher's decisionmaking to be as close as possible to reality.

Table 2 .
Percentage Errors (PE) % and Confidence Intervals (CI) for the differences in proportions among the three approaches