Probabilistic Modeling of the Coffee Market in Brazil

The coffee is one of the most valuable commodities in the world and in this scenario, Brazil presents itself as the largest producer and exporter in the world. However, the high fluctuations in prices promote insecurity in the agents of the sector. In this sense, the objective of this work is to propose the best probabilistic model for the monthly analysis of prices and to calculate probabilities of occurrences of average prices of coffee bags according to levels of practical interest. To this end, we use historical data provided by COOXUPÉ corresponding to the period from January 1981 to December 2022, arranged into monthly subseries. For supporting the results, goodness of fit test were performed. The results indicated that the Gamma and log-Normal distributions fit the coffee bag price data in all months. The log-Normal distribution outperformed in all months. The Gamma and log-Normal distributions fitted the monthly data of average coffee bags prices. The Log-Normal distribution is more suitable for the probabilistic study of the variable in all months. January, February, and March are the months with the highest probability of higher average values and are therefore the most recommended for coffee trading.


Introduction
Coffee is one of the most valuable commodities in the world, and approximately 70% of the global production is traded internationally (ICO 2020; Paiva et al. 2018).According to the International Coffee Organization (2021), this is a growing market, with global production volume having increased by over 60% since the 1990s and now holding an estimated value of over USD 200 billion annually.The coffee's notable participation in the global economy reflects the influence of the commodity on the economies of various emerging producing countries.In countries with minimal development, exports can contribute to foreign exchange, highlighting the impact on economic and political structure, as well as the risks of collapse if a permanent price decline occurs (Sá Barreto & Zugaib, 2016).
Brazil is the world's largest producer and exporter of coffee, and the coffee activity has a significant role in the country's economy and national agricultural production (Carrasco-Gutierrez, Almeida, 2013; CONAB, 2022; Mehta, Chavas, 2008).According to the MAPA (2022), coffee moved USD 6.4 billion in 2021, accounting for 8.28% of the country's exports.However, the high market volatility causes insecurity for coffee growers, as fluctuations impact the price of the final product and the economic returns of the activity (Arêdes et al., 2009;Sá Barreto & Zugaib, 2016).Being a commodity, coffee is one of the agricultural products that is primarily traded in its raw or semi-processed form.Globally, its price is standardized with quotations and negotiations conducted through commodity exchanges (Albieri & Terra, 2022).According to Sá Barreto & Zugaib (2016), price fluctuations are primarily linked to supply and demand movements, as experienced throughout history.An excess of supply leads to price decreases, while scarcity results in price increases.Nevertheless, there are other non-systemic and market-related factors that influence this determination and shape production cycles.The authors mention among market agents the actions of investment funds, exchange rates, market speculation, interest rates, economic crises, and the existence of possible others.For non-systemic agents, climate changes were highlighted, a factor closely related to production.
Furthermore, the coffee market is characterized by oligopsony, where the supply comes from many producers, but the demand is concentrated among a small number of buyers (Ghoshray & Mohan, 2021;Sá Barreto & Zugaib, 2016).This market concentration enhances the influence of coffee buyers from importing countries on prices, while diminishing such capability in producing countries (Ghoshray & Mohan, 2021).Another important factor for price dynamics is that the activity was impacted by the Covid-19 pandemic; the uncertainties of the period affected both supply and demand in the sector while causing disruptions throughout the production chain (ICO, 2020).Currently, coffee production is being influenced by the military conflict involving Russia and Ukraine.In 2020, these countries accounted for 3.85% of global coffee consumption.Russia supplies 20% of the ammonia transported by sea in the global market, and the tension caused an increase in oil prices (ICO, 2022).
Understanding the dynamics of the coffee market enables better risk management, allowing producers to make informed decisions in cultivation and position their product in the market with optimal pricing and profitability (Carrasco-Gutierrez & Almeida, 2013).This understanding can also support the formulation of political and economic strategies by governments of exporting and importing countries, cooperatives, futures market participants, roasters, and others (Carrasco-Gutierrez & Almeida, 2013; Ghoshray & Mohan, 2021; Sá Barreto & Zugaib, 2016).The application of statistical and econometric models in studying the dynamics of the sector is widely spread.The employed model will vary depending on the intended objective.For example, if the purpose is to understand the causal structure of the market, it is possible to estimate and test structural equations.However, if the goal is to test hypotheses among variable correlations, forecasting, or even formulating policies, it is possible to perform partial or complete sector modeling (Carrasco-Gutierrez & Almeida, 2013).
Given the current market scenario, its significance, and the disruptions caused by the Covid-19 pandemic and the conflicts between Ukraine and Russia, understanding the dynamics of coffee prices is of substantial importance.In the present study, this understanding will be achieved through probabilistic models, enabling the development of strategies by all market agents involved, within the context of price volatility and its related consequences.To achieve the proposed objective, the following steps were taken: i) Which probabilistic model is the most appropriate for modeling the average sales prices for bags of coffee in Brazil?; and ii) Which probability(s) of occurrence of a price of bags coffee be greater than a value of practical interest?What is the expected price(s) such that the probability of this being exceeded occurs with low probability?

Data description
The series of historical data obtained by COOXUPÉ for 41 years was analyzed, covering the period from January 1981 to December 2022, consisting of records of coffee bag prices at the national level was followed.The time series consist of monthly information about the average price of coffee bags traded at COOXUPÉ in dollars.In keeping with the proposed objectives, the data series was divided into monthly subseries, totaling twelve subseries.Each subseries was split into two parts, one called the training set and dedicated to fitting the probability distributions and the other series, called the test set, dedicated to measuring the goodness of fit.It is worth noting that COOXUPÉ's arabica coffee production is exported to 49 countries across five continents.In the coffee export ranking, the cooperative has been the top exporter among operating companies in Brazil for seven consecutive years.Exports account for more than 60% of COOXUPÉ's activities (COOXUPÉ, 2022).The decision to use dollar prices was based on the intention of mitigating the effects of changes in monetary policies that occurred during the analysis period.It is important to highlight that the data obtained represent the average price of the values passed on over the years to cooperators, therefore, it reflects the average amount delivered per bag of coffee to each cooperator, regardless of the variety of coffee produced.In this context, it is noted that these values only include the incomes received from coffee production by cooperators, which are rural producers.
To follow the data description, descriptive statistics were calculated for the data, encompassing measures such as maximum value, minimum value, mean, median, standard deviation, coefficient of variation, and skewness coefficient.

Probabilistic modeling of the price of coffee bags
Let the month m (m=January, …, December), and the respective training dataset, the probability distributions used in this study for each m will be presented below.Let X the random variable associated with the price of a coffee bags (in dollars), the Gamma distribution has a probability density function given by: defined in X ≥0,  is the scale parameter ( > 0) ,  is the shape parameter ( > 0), and the gamma function is defined by The probability density function of the Log-Normal distribution is given by: where  > 0, ln represents the natural logarithm,  (−∞ <  < ∞) is the mean of the natural logarithms of variable X, and  ( > 0) is the standard deviation of the natural logarithms of random variable X.The Normal distribution has a probability density function given by: where −∞ <  < ∞,  (−∞ <  < ∞) is the mean and  2 is the variance of the random variable X (Devore, 2018).The probabilistic models in equations 1 and 2 have asymmetric shape, and the one in equation 3 has symmetric shape.The parameters of these functions are initially unknown.To determine them, some estimation method is required.For this purpose, the maximum likelihood method was employed, which involves determining the values of the parameters , ,  and  that maximize the probability of the sample having occurred, under the condition of independence.This fact was verified by the Ljung Box and Mann-Kendal tests and more details are in section 2. 3. Further details about the maximum likelihood method can be found in Liska et al. (2020).
With each of the above models, we can proceed with the calculation of probabilities and/or quantiles of practical interest.The probability of a coffee sack's average price occurring up to a hypothetical value y is calculated using the cumulative distribution function (()), which means that in other words, the probability of an average price occurring up to a quantity y is given by the integral solution of any of the probability density functions, (; ) which  is a parameter or vector of parameters, presented earlier.Therefore, the probability of an average coffee sack price exceeding a value y is given by: The calculation of quantiles of practical interest is obtained by taking the inverse of the cumulative distribution function given the probability.Further details can be found in Casella and Berger (2001).

Hypothesis testing and goodness of fit
After the proposal of a probabilistic model and the estimation of its parameters, it is necessary to assess whether the proposed model fits the data.This is done through a goodness of fit test and it was done with the training set, according explained on the 2.1 section.The main idea of an adherence test is to compare the empirical probabilities with the theoretical probabilities predicted by the distribution function under test, checking whether the sample values can be found from a population with that distribution (Hartmann et al., 2011).Among them, there is the Kolmogorov-Smirnov (KS) test, which is based on the analysis of proximity or adjustment between the sampling distribution function and the population distribution function.The test statistic is given by, where ( () ) being the theoric cumulative distribution function of the densities (1), ( 2) e (3) with the respectively parameters estimates,  ̂( () ) is the empirical cumulative distribution function and n the length do training set.In this test, the null hypothesis claims whether the distribution function from which the sample comes from follows the distribution function that is assumed to be known, and the alternative hypothesis the opposite.Then, value (6) must be compared with the critical value (using tables), for the level of significance of the test.Depending on the result, the null hypothesis is rejected or not.Strictly speaking, the KS test assumes that the theoretical distribution function refers to a population and, consequently, its parameters.However, its use with parameter estimates has been widely used and, as reported by Bautista et al. (2004), the type 2 error rates can be inflated.To get around this fact, the Chi-Square also was applied and it compares the observed and expected cumulative frequencies using the test statistic: where k is the number of bins, calculated according Sturges rule, where  = 1 + 3.222 ln ,  ̂( () ) is the observed cumulative frequency of the i-th bin, and ( () ) is the expected frequency of the i-th bin according to the distribution being tested.The hypotheses of the chi-squared test are the same as those of the KS test.The Q statistic is evaluated using the chi-squared distribution with k-1 degrees of freedom, and if  >   2 ( − 1), where α is the significance level, the null hypothesis of the test is rejected.In both tests, the null hypothesis is also rejected if the p-value is lower than the adopted significance level.The Ljung Box (LB) independence test, whose statistics are compared with the α-th quantile of the chi-squared distribution with one degree of freedom.If the LB test statistic is higher than the aforementioned quantile, the null hypothesis of independence of the training set is rejected.Alternatively, if the p-value is smaller than the adopted significance level, the null hypothesis is rejected.The Mann-Kendall test was used to determine if the series has a statistically significant time trend (Sá et al., 2018).When very small values of p-value are found, it indicates evidence in favor of the alternative hypothesis, that is, there is some tendency to modify the behavior of the analyzed series.For all tests, a significance level  of 1% were used.
In order to evaluate the predictive capacity of the fitted models based on the test set, as described in section 2.1, the expected return prices were calculated.With the expressions in (4) and ( 5), the expected price can be obtained given a return time T. The return time T represents the inverse of the probability with which a given event E has occurred.Given the occurrence of an event E, the return time T is the average time required (in days, months or years) for this event to recur, at any given time.By definition, it follows that the return time T associated with event E is given by In the present work, event E is the price of a bag of coffee (in dollars) exceeding a certain value and the probability of exceeding E is obtained by 1 − ().So  = ), the return level can be calculated, i.e.   =  −1 (1 − ).In this way, with the estimates of the parameters of the probabilistic models, a prediction of the return price for the return time t is obtained, given by Given the above, the goodness of fit of the models was assessed in comparisons of the of empirical return leve (  ) with the expected return level (  ̂) and considering the return times of 2, 3, 4, 5, 6, 7, 8, 9 and 10 years.The mean absolute percentage error (MAPE) was used, given by where,   is the number of predictions made, in this case   =10.The model that presents the lowest MAPE is evidence in favor of it.Concluding the methodology, he statistical computational system R (R Core Team, 2021) was used for conducting the analyses with help of the packages MASS (Venables & Ripley, 2002), hydroGOF (Zambrano-Bigiarini, 2020) and trend (Pohlert, 2023).

Results and Discussion
The results shown in Table 1 present some descriptive statistics related to the data rearranged in monthly series, as mentioned in Section 2.1.It is possible to observe that the highest values recorded for the maximum of average prices per sack occurred in the months of April, May, and January, respectively, while the lowest values recorded for the minimums were in the months of October, June, and July.
The highest standard deviations of the commodity were observed in the months of January, February, and March, in that order.This measure of data dispersion is traditionally used in determining price volatility and considered a risk measure by some market speculators (Capitani & Mattos, 2017).
Among the monthly subseries, the months of February, January, and March, respectively, had the highest means, while the lowest means occurred for the months of October, September, and November.These values may be correlated with the phenological and physiological cycle of the coffee plant, combined with market reactions.According to Sá Barreto and Zugaib (2016), one of the main factors contributing to price instability is production and supply situation.Camargo and Camargo (2001) outlined the phenology of coffee plants (Coffea arabica L.) for Brazilian tropical conditions and noted in their work that during the flowering period, which occurs from September to December of the first year and in months preceding the rise in average prices, the incidence of high temperatures and an intense water deficit at its beginning leads to flower abortion.Subsequently to fertilization, there are the stages of small green berries and expansion.During these stages, a severe drought can hinder fruit growth and result in a low-quality harvest.In the grain formation stage, which occurs from January to March of the second year, corresponding to months with higher average prices, water stress leads to the production of poorly formed fruits and drying degree (poor development) of grains (Camargo & Camargo, 2001).Coffee harvesting takes place during the dry months, typically between March and April to September, a period when average prices tend to decline.
Beyond that, Silveira, Mattos, and Saes (2016) discuss the importance of crop reports and their impacts on the market and future price volatility.This is a significant factor considering that coffee trade mainly occurs in commodity exchanges and futures markets.Reports released by the International Coffee Organization, especially during flowering periods, have a significant impact on prices.Similar effects, albeit on a smaller scale, are observed in studies published by CONAB (National Supply Company) in September and December.
From Table 2, it can be observed that the Gamma and log-Normal distributions fitted the coffee bags price data for all months according to the Kolmogorov-Smirnov and  2 tests at a 1% significance level.Given the aforementioned and considering that both distributions are asymmetric, a similar result was found for the commodity price of fat ox in the state of São Paulo by Lucca Filho et al. (2022).In addition to the results of the adherence tests, it is possible to verify in the histograms in Figure 1 that the probability densities analyzed had a satisfactory fit to the data.Therefore, to choose the most appropriate probability distribution in a given month, the combination of results from Table 2 and Figure 1 must be considered.
When analyzing the calculated skewness coefficients for the monthly subseries, a positive skewness is observed in all months.Therefore, asymmetric distributions are the most suitable for monthly modeling.The predilection for a particular distribution in some months arises from a slightly better fit.Both goodness of fit tests have their limitations, and in some cases, the KS and  2 tests selected different distributions, making the MAPE a third criterion for choice.In this way, the log-Normal distribution performed better in all months.In Table 3, it is possible to observe the probabilities of commodity price occurrences and the 1 and 5% quantiles.In the month of January, based on the probabilities calculated using the log-normal distribution identified as the best fit for that month, the probability of prices exceeding 50, 100, 200, and 400 dollars are 93.47%,53.69%, 9.23%, and 0.30%, respectively.At February, January and March, respectively, showed the highest probabilities for the occurrence of values surpassing 50, 100 and/or 200 dollars.Among all months, January and February are also the months with the highest probability of prices exceeding 400 dollars.On the other hand, among all the months, October has the lowest probability of values exceeding 100, 200 and/or 400 dollars.This trend is like what occurred in the years 2018 and 2019, driven in 2018 by the completion of the harvest and the confirmation of Brazil's record production of 59.4 million bags, coupled with global production exceeding consumption (CONAB, 2018(CONAB, , 2019)).
The volatility among prices and monthly variations can be explained by the low shortterm elasticity of demand and supply in the coffee market (Sá Barreto & Zugaib, 2016;Silveira et al., 2016), in addition to other factors such as dependency on exchange rates (Carrasco-Gutierrez & Almeida, 2013), macroeconomic, financial, climatic, and meteorological variables (Cuaresma et al., 2018).Furthermore, the monthly series studied were shown to be trendless and independent using the Mann-Kendal and Ljung-Box tests, which makes the parameter estimates obtained using the maximum likelihood method plausible, in terms of bias.If any violation in these tests were found, the incorporation of this aspect in the analysis would be necessary, such as those carried out by Lucca Filho et al. (2022) with the incorporation of the trend in probabilistic modeling and Teixeira, Liska and Santos (2023) with the incorporation of temporal modeling of coffee bag prices.Regarding quantiles, it is expected that with a 5% probability there will be an average price higher than $233.61 in January.On the other hand, it is expected that with a 1% probability there will be an average price higher than $289.85 in October (Table 3).To achieve better returns, the sales of coffee bags should be concentrated in the months with a higher probability of higher average prices.The months of January, February, and March are the most recommended for commercialization.According to the data, provided by COOXUPÉ, the reality contradicts the indications, as the highest averages in terms of the number of bags sold occur from July to October.
Storing coffee beans and selling during the off-season is one of the possible strategies for coffee farmers and cooperatives to achieve higher profits and revenues.Administratively, product storage becomes viable when the difference between the expected price and the current price is greater than the sum of the interest rate and the storage cost.It becomes The training dataset covered 31 years (1981 to 2012), around 75% of the data, and the test dataset the last 10 years (2013 to 2022), in accordance with what is recommended by Menezes et al. (2017).

( 1 )Figure 1 .
Figure 1.Histograms of the monthly average prices of coffee bags (in USD) from January 1981 to December 2012, along with plots of the probability density functions of the fitted Gamma, Normal, and log-Normal distributions.

Table 1 .
Descriptive measures of the monthly average prices of coffee bags (in USD) traded at COOXUPÉ from January 1981 to December 2021 and p-values of the Mann-Kendal (MK) and Ljung-Box (LB) tests

Table 2 .
Parameter estimates for the probabilistic models, results of the Kolmogorov-Smirnov (KS) and Chi-Square ( 2 ) tests, and the Mean Absolute Percentage Error (MAPE) values for the monthly subseries of coffee bags average prices in dollars

Table 3 .
Probabilities (in %) of the occurrence of exceeding average prices of coffee bags (in USD) and quantiles of practical interest for all months of the year Probability distribution that show the best goodness of fit index in Table2