DISCRIMINATION OF VARIETAL GROUPS AND HYBRIDS OF COFFEA CANEPHORA SPECIES USING MULTIVARIATE ANALYSIS

ABSTRACT: Coffee growing is one of the most important agricultural activities in the world market. Coffea canephora is one of the most commercially relevant species and can be divided into the varietal groups Conilon and Robusta. These varietal groups have complementary traits of agronomic interest, so hybrids are obtained through crosses between them. Because the two varietal groups are difficult to tell apart in the field, correct discrimination of genotypes is essential for defining crosses in breeding programs. In this context, the present work aimed to apply discriminant analysis (DA) to define functions that differentiate between varietal groups and hybrids of C. canephora, and to identify the most relevant phenotypic traits in these functions. Different plant traits were measured on 165 genotypes derived from material of the Instituto Capixaba de Pesquisa, Assistência Técnica e Extensão Rural and the Centro Agronómico Tropical de Investigación y Enseñanza. The quadratic DA showed the best performance for genotype discrimination, with an average apparent error rate of 0.0364. Cercosporiosis incidence, rust incidence, vegetative vigor, plant height and diameter of the canopy projection were the most important traits for discriminating the varietal groups.


Introduction
characteristics for the differentiation of soybean cultivars, and observed an apparent error rate of 16.36%. More recently, Das et al. (2018) applied quadratic discriminant analysis to discriminate rice genotypes based on seventeen wavelengths, with an apparent error rate of 2%. Thus, discriminant analysis has been successfully applied to discriminate genotypes of different species. However, this methodology has not yet been applied to discriminate coffee genotypes.
This study aimed to apply discriminant analysis (DA) to define functions for the differentiation and classification of C. canephora varietal and hybrid groups and identify the most important phenotypic traits for the discrimination of these genotypes.

Materials and methods
The data used in this study were obtained from the Conilon and Robusta varietal groups and from hybrid families originating from crosses between these groups. Conilon genetic material was provided by the Instituto Capixaba de Pesquisa, Assistência Técnica e Extensão Rural (INCAPER) and Robusta material was obtained from the Centro Agronómico Tropical de Investigación y Enseñanza (CATIE). These genetic materials belong to the breeding program of the Empresa de Pesquisa Agropecuária de Minas Gerais (Epamig), in association with the Universidade Federal de Viçosa (UFV) and the Empresa Brasileira de Pesquisa Agropecuária - Café (Embrapa Café), located in the city of Oratórios/MG.
The genotypes were evaluated for seven phenotypic traits. The evaluations were performed at physiological maturity of the coffee fruits. Five categorical traits were evaluated: vegetative vigor (Vig), field evaluation of rust incidence (Rus), cercosporiosis incidence (Cer), fruit maturation time (Mat) and fruit size (FS). The continuous traits evaluated were the following: plant height (PH) and diameter of the canopy projection (DC). The procedures for measuring phenotypic traits were described by Alkimim et al. (2020).
The varietal groups Conilon and Robusta were composed of 45 and 37 genotypes, respectively, besides 83 interpopulation hybrids.
To discriminate Conilon, Robusta and hybrid genotypes based on phenotypic traits, the multivariate statistical technique known as discriminant analysis (DA) was used. DA is a technique used to classify and/or differentiate individuals from a sample or population (BARROSO et al., 2013). It is also used to classify a new observation into one of k different groups, based on functions of the observed variables that aim to minimize the probability of misclassification. Unlike other multivariate approaches, discriminant analysis requires the groups into which individuals are classified to be known in advance. Thus, the discrimination or classification rules are generated based on the characteristics of the groups (MINGOTI, 2005).
In this work, two types of discriminant functions were evaluated: Fisher's (1936) linear function and the quadratic function.
To classify genotypes into one of the three populations based on Fisher's linear discriminant function, each population $\pi_i$ is assumed to have a mean vector $\mu_i$ and a homogeneous variance-covariance matrix ($\Sigma_1 = \Sigma_2 = \Sigma_3 = \cdots = \Sigma$). According to Ferreira (2018), x should be classified into the population for which the value $D_i(x) = \mu_i' \Sigma^{-1} x - \frac{1}{2} \mu_i' \Sigma^{-1} \mu_i$ is the maximum over all possible values of i. Since three groups were considered in this study (Conilon, Robusta and hybrids), i = {1,2,3}.
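The linear rule can be illustrated with a minimal Python sketch. It assumes only two traits (plant height and canopy diameter) so that the covariance matrix can be inverted by hand. The group means and the pooled covariance matrix are hypothetical values chosen for illustration, not estimates from this study, and the function names are illustrative as well.

```python
# Hypothetical group means for (PH, DC) in metres; not values from the study.
MEANS = {
    "Conilon": [2.0, 1.8],
    "Robusta": [2.6, 2.2],
    "Hybrid": [2.3, 2.0],
}
# Hypothetical pooled covariance matrix, assumed common to all groups.
POOLED_COV = [[0.04, 0.01], [0.01, 0.03]]


def invert_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def linear_score(x, mu, cov_inv):
    """Linear score: mu' S^-1 x - 0.5 * mu' S^-1 mu."""
    # w = S^-1 mu; S is symmetric, so mu' S^-1 x equals w' x.
    w = [sum(cov_inv[r][c] * mu[c] for c in range(2)) for r in range(2)]
    return sum(w[k] * x[k] for k in range(2)) - 0.5 * sum(w[k] * mu[k] for k in range(2))


def classify_linear(x):
    """Assign x to the group with the highest linear score."""
    cov_inv = invert_2x2(POOLED_COV)
    scores = {g: linear_score(x, mu, cov_inv) for g, mu in MEANS.items()}
    return max(scores, key=scores.get)


print(classify_linear([2.55, 2.25]))  # Robusta
```

Because the covariance matrix is shared, the highest score corresponds to the group whose mean is closest to x in Mahalanobis distance.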
On the other hand, to classify genotypes into one of the three populations using quadratic discriminant functions, each population $\pi_i$ is assumed to have a mean vector $\mu_i$, a heterogeneous variance-covariance matrix $\Sigma_i$ ($\Sigma_1 \neq \Sigma_2 \neq \Sigma_3$) and an a priori probability $p_i$ of a genotype belonging to it. The quadratic discriminant functions were generated according to Varella (2004), where $D_i(x)$ is the classification score of the ith population, x is the vector of variables representing the traits involved in the analysis, $\Sigma_i$ is the variance-covariance matrix of population $\pi_i$ and, in this study, identical a priori probabilities $p_i = 1/3$ were assumed. From these functions, a genotype is classified into population $\pi_k$, k = {1,2,3}, if $D_k(x) = \max_i D_i(x)$. Thus, whichever discriminant function was used, the samples were classified as belonging to the population for which they obtained the highest classification score.
Additionally, a multivariate analysis of variance (MANOVA) was performed with the seven traits in the three groups. A univariate analysis of variance (ANOVA) was then carried out for each trait evaluated, in order to verify which variables were most important for the discrimination of the groups (Conilon, Robusta and hybrid), considering a significance level of 5%.
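For reference, a commonly used form of the quadratic discriminant score under multivariate normality is given below. That the functions of Varella (2004) take exactly this form is an assumption. Here $\mu_i$ and $\Sigma_i$ denote the mean vector and variance-covariance matrix of population i, and $p_i$ its a priori probability.

```latex
D_i(x) = \ln p_i - \frac{1}{2}\ln\lvert\Sigma_i\rvert
         - \frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}(x-\mu_i), \qquad i = 1, 2, 3
```

With equal priors $p_i = 1/3$, the term $\ln p_i$ is the same for every group and does not affect which score is the largest.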
A cross-validation method was used to avoid underestimating the classification errors. This method allows the evaluation of the generalization capacity of a predictive model in a data set (JAMES et al., 2013). The K-fold cross-validation method was used, with K equal to five. In each iteration of the discriminant analysis, 132 genotypes were used as the training population, while the remaining 33 genotypes formed the validation population. The discriminant functions were defined on the training population, and the genotypes of the validation population were then classified with these functions as if their classification were unknown. This procedure was repeated 5 times so that each of the five subsets was used as the validation population exactly once.
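The partition described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple random, unstratified split; the text does not state how the genotypes were assigned to folds, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k folds of (nearly) equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    fold_size = n // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Spread any leftover indices over the first folds (none when n = 165, k = 5).
    for j, extra in enumerate(indices[k * fold_size:]):
        folds[j].append(extra)
    return folds


folds = k_fold_indices(165, 5)
for held_out in folds:
    # Training set: every genotype that is not in the held-out fold.
    training = [i for fold in folds if fold is not held_out for i in fold]
    print(len(training), len(held_out))  # 132 33
```

Each genotype appears in exactly one validation fold, so summing the classifications over the five folds gives one prediction per genotype.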
The efficiency of the discriminant functions for classifying the varietal groups and hybrids was evaluated by the apparent error rate (APER), calculated as the number of incorrectly classified observations divided by the total number of classifications (correct and incorrect), and also by the proportions of incorrectly classified observations, which represent the estimated probability of misclassification.
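As a worked check of this definition, the short Python sketch below recomputes the apparent error rate and the per-group misclassification proportions from the group sizes and the misclassification counts reported in the Results for the seven-trait analysis. The function and variable names are illustrative.

```python
def apparent_error_rate(errors, totals):
    """Misclassified observations divided by all classified observations."""
    return sum(errors.values()) / sum(totals.values())


totals = {"Conilon": 45, "Robusta": 37, "Hybrid": 83}
linear_errors = {"Conilon": 4, "Robusta": 6, "Hybrid": 1}
quadratic_errors = {"Conilon": 2, "Robusta": 2, "Hybrid": 2}

print(round(apparent_error_rate(linear_errors, totals), 4))     # 0.0667
print(round(apparent_error_rate(quadratic_errors, totals), 4))  # 0.0364

# Per-group proportion of misclassified genotypes, in percent.
for group in totals:
    print(group, round(100 * linear_errors[group] / totals[group], 2))
```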
The statistical analyses were performed in the R software (R CORE TEAM, 2020). The "lda" and "qda" functions of the "MASS" package (version 7.3-51.4) were used to carry out the discriminant analysis.

Results and discussions
The discriminatory capacity of the linear and quadratic discriminant functions for the C. canephora varietal groups and hybrids is shown in Table 1. The correct and incorrect classifications of the genotypes are presented as the sum of the cross-validation results, considering the seven traits evaluated (vegetative vigor, rust incidence, cercosporiosis incidence, fruit maturation time, fruit size, plant height and canopy diameter).
Fisher's linear discriminant function is indicated for cases of non-normality and requires the covariance matrices to be homogeneous (FERREIRA, 2008). According to Box's M test (p-value < 0.01), the hypothesis that the covariance matrices are homogeneous was rejected, in which case the use of quadratic discriminant functions is recommended (MINGOTI, 2005). However, the quadratic function assumes normally distributed data, and the multivariate Shapiro-Wilk normality test (p-value < 0.01) revealed that the data do not follow a multivariate normal distribution. Since neither function had all of its assumptions met, both were considered in the analysis.
In Table 1, the main diagonal exhibits the correct classifications, while the other cells contain the incorrect classifications, which are read along the table rows. The linear function presented 11 incorrect classifications: four in the Conilon group, six in the Robusta group and one among the hybrids. When the quadratic function was applied (results in parentheses), there were six incorrect classifications: two in the Conilon group, two in the Robusta group and two among the hybrids (Table 1). The efficiency of discriminant functions is associated with the quantity and quality of the variables used in the discrimination (CRUZ et al., 2004). The apparent error rate (APER), used to evaluate these functions, was 0.0667 for the linear function and 0.0364 for the quadratic function (Table 1). Both APER values are relatively low, which indicates that the discriminant functions were efficient in distinguishing the varietal groups and hybrids. However, the APER of the linear function was almost twice that of the quadratic function. According to De Carvalho et al. (2018), violation of the normality assumption has little influence on the efficiency of the method when the quadratic function is used. On the other hand, the lack of homogeneity of the covariance matrices seems to have influenced the classification results of the linear function.
Table 2 exhibits the proportions of incorrect classification for the two discriminant functions, linear and quadratic. Approximately 8.89% of the Conilon genotypes evaluated were classified incorrectly by the linear discriminant function, and 4.44% by the quadratic function. For Robusta, 16.22% of the genotypes were incorrectly classified by the linear function, and 5.41% by the quadratic function. For hybrids, the proportion of incorrect classification was 1.20% for the linear function and 2.41% for the quadratic function. The discrimination of Conilon and Robusta was more challenging than the discrimination of hybrids. This result can be explained by the reproductive system of the species (cross-fertilization), its wide phenotypic variability and high heterozygosity, and possible natural crosses between the two varietal groups (REN et al., 2013; FERRÃO et al., 2015). Although it is not an easy task, the evaluation and characterization of genetic variability and the discrimination of genotypes are essential to manage the crop and develop breeding programs (ANAGBOGU et al., 2019).
Table 3 presents the coefficients of the linear discriminant functions, which allow the relative contribution of each trait to the separation of the groups to be measured. Based on these functions, the phenotypic traits cercosporiosis incidence (Cer) and rust incidence (Rus) presented the highest coefficients in absolute value. Another important variable for the differentiation of these genotypes is vegetative vigor (Vig). These traits provided the greatest contribution to the discrimination of the varietal and hybrid groups. Rus and Vig were previously mentioned as important traits for varietal group discrimination (FERRÃO et al., 2015). In general, Robusta coffees are resistant to rust and present higher vigor than Conilon coffees, which are susceptible to this disease. On the other hand, Conilon coffees present competitive advantages in Brazil, due to their tolerance to drought and greater adaptation to the conditions of the country (SOUZA et al., 2013).
Further analysis using MANOVA for the seven traits across the three groups (Conilon, Robusta and hybrids) indicated that the groups are significantly different (p-value < 0.01). To determine which traits are responsible for the observed difference, a univariate analysis of variance (ANOVA) was performed.
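The F statistic behind a one-way ANOVA of a single trait across the three groups can be sketched as follows. The scores below are hypothetical and serve only to show the arithmetic. The sketch stops at the F statistic, because the p-value compared with the 5% level requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical vigor scores for three groups; not data from the study.
conilon = [1, 2, 3]
robusta = [3, 4, 5]
hybrid = [2, 3, 4]
print(one_way_anova_f([conilon, hybrid, robusta]))  # 3.0
```

A larger F value indicates that the group means differ more relative to the variation within groups.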
ANOVA revealed significant differences between the groups for Vig, Rus, Cer, PH and DC, which means that these traits are the most important for discriminating the groups (Table 4). Since the phenotypic variables Mat and FS were not significant for the discrimination of the populations, the discriminant functions were re-estimated without these variables. Table 5 summarizes the classification of the varietal and hybrid groups in this new analysis; the main diagonal contains the correct classifications and the other cells the incorrect ones. In this case, the linear function resulted in 10 misclassifications: 5 in the Conilon group, 4 in the Robusta group and 1 among the hybrids (Table 5). When the quadratic function was used, 11 incorrect classifications occurred: 5 in the Conilon group, 4 in the Robusta group and 2 among the hybrids (Table 5). After Mat and FS were removed, the linear discriminant analysis performed similarly to the previous analysis, which indicates that these variables contribute little to the discrimination of these groups. This can be verified through the APER: for the linear discriminant function with all variables, the APER was 0.0667 (Table 1), and when the two variables were excluded it decreased to 0.0606 (Table 5). The quadratic function was less efficient in this scenario, with an APER of 0.0667 (11 of 165 genotypes; Table 5), which was greater than that observed in the previous analysis (APER = 0.0364). Removing these traits (Mat and FS) from the construction of the discriminant function can be advantageous, since it reduces labor in the field.
In practice, in the C. canephora breeding process, the correct classification of its botanical varieties is extremely important, since each variety requires specific crop management procedures (MARCOLAN et al., 2009). Inadequate cultural treatments due to classification errors can reduce crop productivity (PEREIRA et al., 2000). Another issue is the correct recommendation of cultivars for obtaining hybrids. Successful hybridization is directly linked to the correct classification of the parents.
Thus, discriminant analysis was successfully applied in this study to discriminate the varietal and hybrid groups of C. canephora. In other crops, this technique has already been applied efficiently. For example, Nogueira et al. (2008) used discriminant analysis to differentiate eleven soybean cultivars. In their study, four experiments were carried out at different times (two in the summer and two in the winter), based on 7 phenotypic characteristics. They obtained apparent error rates of 12.73% and 9.09% for the summer sowing seasons (December and February) and of 1.82% and 16.36% for the winter sowing seasons (May and June). Thus, using discriminant analysis, it was possible to distinguish soybean cultivars. Das et al. (2020) used quadratic discriminant analysis to assess the discriminative power of 17 wavelengths for 14 rice genotypes. The quadratic discriminant analysis was efficient for discriminating rice genotypes, with an error rate of 2%. Fisher's discriminant analysis was also used to classify forest communities in the Pampa biome (KILCA et al., 2015). In that study, eight structural variables were used to classify five types of forests. The discriminant analysis classified all samples correctly in their respective predicted groups, that is, the apparent error rate was 0%.
In coffee data, this technique had already been evaluated for classification purposes, but with different objectives, for which discriminant analysis also obtained satisfactory results. Campos et al. (2016) selected groups of seven quality characteristics of coffee seedlings, using six selection criteria. They used Fisher's discriminant analysis to transform the groups of characteristics into a new variable and then compared the results of the univariate analysis of the new variable with those of the multivariate analysis. As a result, the discriminant analysis proved to be a viable option for treatment discrimination.
Other multivariate techniques have already been successfully used in studies of C. canephora. Ivoglo et al. (2008) used cluster analysis to quantify the genetic divergence of populations of C. canephora. In their study, they demonstrated that, when there is a database with different characteristics of interest, multivariate analysis proves to be a powerful tool for integrating multiple information and assisting the breeding program in the choice of divergent parents that will be more likely to promote superior results. Da Fonseca et al. (2006) applied cluster analysis and canonical variables in studies with C. canephora clones, in order to separate and classify genotypes of varietal groups from two environmental regions of Espírito Santo, based on 7 phenotypic characteristics, which corroborates the vital role of a correct classification in breeding programs.

Conclusions
The discriminant analysis was efficient for distinguishing between varietal groups (Conilon and Robusta) and Coffea canephora hybrids. The variables vegetative vigor, evaluation of the incidence of coffee rust in the field, incidence of cercosporiosis, height of the plant and diameter of the canopy projection were the phenotypic characteristics that proved to be important in the discrimination of these groups. Therefore, with this phenotypic information for new individuals, we can obtain their classification into varietal and hybrid groups efficiently.