Classification of sugarcane areas in Landsat images using machine learning algorithms

Monitoring sugarcane areas through remote sensing is essential for the planning and management of the national sugarcane industry. The use of machine learning algorithms has provided many benefits to remote sensing. This article aims to compare the prediction quality of three important machine learning methods in identifying sugarcane areas using Landsat images: Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF). LR was applied in three versions: LR without penalization, LR with Ridge penalization (LR-R) and LR with Lasso penalization (LR-L). Data obtained in this study refer to a region of approximately 306,000 ha located in the state of São Paulo, Brazil, which was segmented into approximately 46,000 polygons (observations). Six spectral bands and vegetation indices observed along 17 months resulted in 102 covariates, which were reduced via Principal Component Analysis (PCA). In total, 19 principal components were chosen, accounting for 94.61% of the cumulative explained variance ratio, and were used in the machine learning methods to classify each polygon as sugarcane or other land covers. The method with the highest accuracy, considering a testing sample of 20% of the data, was RF (78.51%), followed by DT (72.30%), LR-L (69.64%), LR-R (69.64%), and LR (69.52%).


Introduction
Sugarcane is of great importance in Brazilian agribusiness, as Brazil is the world's largest sugar producer and one of the largest biofuel markets (Luciano et al., 2019). This crop is produced throughout the country, with greater concentration in the southeastern region and with the state of São Paulo accounting for 60% of national production (Silva et al., 2011).

Sugarcane data
The test area selected for this study has approximately 306,000 ha and is located in the mid-northern region of the state of São Paulo, Brazil. Figure 1 shows the location of the study region.
The study region was segmented based on satellite images into approximately 46,000 polygons. Aiming to calibrate the classification models, each polygon was classified as "sugarcane" or "other", according to reference maps from the Canasat project (Rudorff et al., 2010).
The dataset was obtained from spectral indices captured by the Landsat satellite, already processed by the method used in Luciano et al., 2018. Spectral indices are measurements of the interaction of light with matter which are used to identify different types of materials, such as water, vegetation, and soil, and also to measure characteristics, such as vegetation health, soil moisture, and soil nutrient concentration. In remote sensing, spectral indices are used to identify different land cover types, such as vegetation, water and soil, and to identify plantations such as sugarcane, wheat, rice, soybeans and corn, among others. To help discriminate between the two classes of the target variable (sugarcane or not), information on the annual growth cycle and on the previous crop was considered through the spectral indices shown in Table 1.

Machine learning methods
In this subsection, a brief description of the statistical and machine learning methods used in this paper will be presented.

Dimensionality reduction via PCA
Before running a machine learning algorithm, Principal Component Analysis (PCA, Johnson & Wichern, 2002) can be used to reduce the dimensionality of the data with the least possible loss of information. This is helpful, for instance, to speed up the execution of the machine learning algorithm. PCA is a statistical technique of multivariate analysis that transforms a set of correlated variables into a new set of variables called principal components, which have important properties: (1) each principal component is a linear combination of all original variables; (2) they are uncorrelated with each other; and (3) they are obtained with the purpose of retaining, in descending order, the maximum amount of information, in terms of the total variation contained in the data.
In order to calculate the principal components, one first calculates the variance-covariance matrix of the d considered variables. Commonly, the values of the variables are standardized to ensure that they are weighted equally in the analysis. When values are standardized, the correlation matrix is used instead of the variance-covariance matrix. Then, the eigenvalues and eigenvectors of this matrix are calculated. Eigenvalues are the variances of the principal components, while eigenvectors are vectors that point in the direction of the principal components. Eigenvectors, also referred to as loadings, correspond to the coefficients used in the linear combinations of the original variables to create each principal component (PC). The ith principal component is obtained by projecting the data onto the ith eigenvector (sorted in descending order by the corresponding eigenvalues). For example, to calculate the first principal component, one simply multiplies the standardized dataset by the first eigenvector.
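As an illustrative sketch (not the authors' code), the standardization, eigendecomposition and projection steps above can be reproduced with scikit-learn on synthetic data; the Kaiser criterion (eigenvalue greater than 1), used later in the paper to select components, is also shown. The data below are a hypothetical stand-in, not the sugarcane dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical stand-in for the polygon-by-covariate matrix (500 x 10 here).
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)  # induce some correlation

# Standardizing first makes PCA operate on the correlation matrix.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)  # projections onto the eigenvectors

# Kaiser criterion: keep components whose eigenvalue (variance) exceeds 1.
eigenvalues = pca.explained_variance_
n_keep = int(np.sum(eigenvalues > 1))
X_reduced = scores[:, :n_keep]
print(n_keep, X_reduced.shape, pca.explained_variance_ratio_[:n_keep].sum())
```

Here `scores[:, 0]` equals the standardized data multiplied by the first eigenvector (`pca.components_[0]`), matching the description above.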

Logistic regression
In the logistic regression model, the response variable Y follows a Bernoulli distribution, that is, the probability of success is given by P(Y = 1) = π and the probability of failure by P(Y = 0) = 1 − π. The probability mass function for the Bernoulli random variable Y with success probability π ∈ (0, 1) is given by P(Y = y) = π^y (1 − π)^(1−y), y = 0, 1. This model is widely used in classification problems in different areas of knowledge. In practical problems, associated with each response variable Y_i there is a vector of covariates x_i = (x_i1, x_i2, . . ., x_id)′, and the success probability π_i is modeled through the logit link, log(π_i / (1 − π_i)) = β_0 + β_1 x_i1 + . . . + β_d x_id, where β_0 is the intercept of the model and β_j is the parameter associated with the jth covariate, with j = 1, 2, . . ., d. To estimate the parameters of a logistic regression model, the maximum likelihood method can be used, which consists of obtaining the parameters that maximize the likelihood function.
It is also possible to use penalties such as Lasso or Ridge regression to estimate the logistic regression parameters, which are techniques used to prevent overfitting and improve model interpretability. Both achieve this result by adding a penalty term to the loss function, but they differ in the type of penalty used and in its effect on the coefficients. While Lasso uses the sum of the absolute values of the parameters, Ridge uses the sum of their squares.
Both regularization techniques have some specific characteristics. For example, Lasso shrinks coefficients toward zero and can even set some of them to exactly zero. This performs covariate selection by effectively removing irrelevant covariates from the model and can be especially useful for high-dimensional datasets with many correlated covariates. On the other hand, Ridge also shrinks coefficients toward zero, but less aggressively than Lasso, as the coefficients tend to remain non-zero. Thus, Ridge does not perform explicit covariate selection. For more details, see James et al., 2013.

Decision tree
The decision tree was proposed by Breiman et al., 1984 and can be applied to both classification (predicting categorical values) and regression problems (predicting continuous values). A decision tree is built by recursive partitioning of the covariate space. Each split point is called a node and each end result is called a leaf. Initially, the algorithm checks whether the condition at the first node is satisfied. If so, it goes to the left; otherwise, it continues to the right. This process continues until a single leaf is reached.
A decision tree structure for classification problems can be created in two main steps: (i) the creation of a complete and complex tree and (ii) the pruning of this tree, in order to avoid overfitting. Formally, the first step consists of creating a partition of the covariate space generated by the covariates x_i = (x_i1, x_i2, . . ., x_id)′, with i = 1, 2, . . ., n, into M distinct and disjoint regions denoted by R_1, R_2, . . ., R_M. The proportion of elements in the mth region belonging to the kth class, with k = 0, 1 for binary problems, is given by p̂_mk = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k), where N_m is the number of elements in the region R_m. A suggested criterion for finding the best partition at each stage of the process is the Gini index, given by G(R) = Σ_k p̂_Rk (1 − p̂_Rk), where R represents one of the regions induced by the tree. The objective is to minimize this index.
For the pruning stage, in general, the proportion of errors in the validation set is used as a risk estimate. More details about regression trees (predicting continuous values) and the pruning process can be found in Hastie et al., 2001.
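The two-step procedure above, growing a deep tree and then pruning it against a validation set, can be sketched with scikit-learn's cost-complexity pruning. This is one possible implementation, not necessarily the one used in the paper, and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; a validation set is held out for the pruning step.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Step (i): candidate subtrees come from cost-complexity pruning of a deep tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Step (ii): pick the pruning strength with the lowest validation error.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(pruned.get_n_leaves(), round(best_score, 3))
```

Larger `ccp_alpha` values prune more aggressively, so the selected tree is typically smaller than the fully grown one.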

Random forest
Although decision trees are methods of easy interpretation and simple understanding, they tend to have low predictive power when compared to other estimators. To overcome this limitation, another well-known method called Random Forest can be explored. This approach consists of obtaining B distinct trees and combining their results to improve the predictive power in relation to an individual tree. To create B distinct trees, B bootstrap samples from the original sample are used. Let Ĉ_b(x) be the class prediction of the bth random forest tree; then the prediction function is given by Ĉ_B(x) = mode{Ĉ_b(x), b = 1, 2, . . ., B}.
More details about random forests can be obtained in Hastie et al., 2001. In practice, the dataset is used to train a machine learning model/algorithm, which is then used to predict outputs that are unknown on new datasets. To check the predictive capability of the model for new data, it is common to divide the data into two parts, a larger one to train the method and a smaller one to validate it. This procedure is known as the Holdout method (Morettin & Singer, 2022; Izbicki & dos Santos, 2020; Lantz, 2019). This split provides an honest estimate of generalization performance: a model that merely memorizes the training data will perform poorly on the held-out portion.
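A sketch of the random forest prediction as the mode of the B tree predictions, together with a Holdout split, on synthetic data. Note that scikit-learn's `RandomForestClassifier` actually averages the trees' predicted probabilities, which coincides with majority voting when leaves are pure; this is an illustration, not the authors' code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
# Holdout: 80% of the data to train, 20% to test, as in the paper's split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# B = 100 trees, each grown on a bootstrap sample of the training data.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The ensemble prediction is the mode (majority vote) of the tree predictions.
per_tree = np.array([tree.predict(X_te) for tree in rf.estimators_])
majority = (per_tree.mean(axis=0) > 0.5).astype(int)
print(round(rf.score(X_te, y_te), 3))
```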

Confusion matrix and performance metrics
The confusion matrix is a table that presents the performance of a classification method (Morettin & Singer, 2022; Izbicki & dos Santos, 2020; Lantz, 2019). It is commonly used in supervised learning problems, where the output data are known. For a classification problem with only two possible classes, the confusion matrix has four main elements: true positive (TP), where the method correctly classifies an instance as positive; false positive (FP), where the method incorrectly classifies an instance as positive; true negative (TN), where the method correctly classifies an instance as negative; and false negative (FN), where the method incorrectly classifies an instance as negative. An example of a confusion matrix is shown in Table 2.
Based on the four elements of the confusion matrix, three performance metrics are used in this work: sensitivity, specificity and accuracy.These metrics are defined below.
• Sensitivity: measures how well the method can correctly identify instances of the positive class.
In other words, it is the proportion of positive results, given that the actual value is positive. It is calculated as: Sensitivity = TP / (TP + FN).
• Specificity: measures how well the method can correctly identify instances of the negative class. That is, it is the proportion of negative results, given that the actual value is negative. It is calculated as: Specificity = TN / (TN + FP).
• Accuracy: measures the proportion of correctly classified instances in relation to the total number of instances. It is calculated as: Accuracy = (TP + TN) / (TP + FP + TN + FN).
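The three metrics can be computed directly from the confusion matrix counts; a small worked example with hypothetical labels (1 = sugarcane, 0 = other):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = sugarcane, 0 = other).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

# For binary labels [0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(sensitivity, specificity, accuracy)  # prints: 0.8 0.8 0.8
```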

Results and Discussion
Spectral bands and vegetation indices observed over 17 months resulted in 102 covariates (inputs) to predict the target variable (classification as sugarcane or not). After performing PCA (using the correlation matrix of the data), principal components with eigenvalues greater than 1 were chosen to be applied in the machine learning methods to classify each polygon of the study area as sugarcane (1 - positive) or other land cover (0 - negative). The 19 principal components accounted for 94.61% of the cumulative explained variance ratio, as shown in Table 3.
The loadings of the first four principal components are shown in Figure 2. Component 1 can be defined as a contrast between vegetation indices (EVI and NDVI) and indices related to moisture and water detection (NDMI, NDWI, SWIR 1 and SWIR 2) through all months, from January/2015 to May/2016. A pattern in the loadings of component 2 was observed, with lower (negative) values from November/2015 to January/2016, which correspond to rainy months in the state of São Paulo.
A seasonal pattern in the loadings related to vegetation and moisture/water indices is suggested by the contrast along the months, especially in the mid-fall/winter seasons (from May/2015 to September/2015, in component 3) and between the first and second semesters of 2015, in component 4. Seasonal variation in the vegetation and moisture/water indices should be carefully taken into account, since the sugarcane areas under study presented diverse harvest management and cultural practices throughout the months.
The remaining components were used in the classification algorithms as previously described, but their loading values are not shown in Figure 2.
The 19 principal components were used to form the new database, which was split into training (80%) and testing (20%) subsets prior to applying the machine learning methods, including the calculation of the Lasso and Ridge penalty parameters of the logistic regression. The confusion matrices of the prediction methods applied to the testing subset are presented side by side in Table 4. The performance metrics (sensitivity, specificity and accuracy), with respective 95% confidence intervals, for the classification methods used are presented in Table 5.
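The workflow just described (standardization, PCA, holdout split, classifier) can be sketched end to end as a scikit-learn pipeline. The dataset, sample sizes and hyperparameters below are artificial stand-ins and mirror the paper only loosely:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Artificial stand-in for the 102-covariate polygon dataset.
X, y = make_classification(n_samples=1000, n_features=102, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize, project onto 19 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=19),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))
```

Wrapping the steps in a pipeline ensures that the standardization and PCA are fitted on the training subset only, avoiding information leakage into the test subset.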
The logistic regression, with or without Ridge and Lasso penalization, showed similar results for sensitivity (0.7462, 0.7537 and 0.7543, respectively), specificity (0.6058, 0.5961 and 0.5949, respectively) and accuracy (0.6952, 0.6964 and 0.6964, respectively) (Table 5). Consequently, penalization did not interfere with classification performance and thus can be considered unnecessary to classify sugarcane land cover under the conditions tested.
Although the decision tree's accuracy, 0.7230 [0.7137, 0.7321], was greater than that observed with logistic regression, the method's performance metrics were inferior to those obtained using Random Forest.
Random Forest presented greater sensitivity, 0.9051, compared to logistic regression and decision tree, and specificity, 0.5751, similar to the values shown by the other algorithms, which resulted in the greatest accuracy, 0.7851 [0.7766, 0.7935], observed among the classification methods used (Table 5). Therefore, Random Forest can be considered the best option to classify sugarcane land cover using PCA and vegetation indices under the conditions described in this study. Figure 3 contains the maps of the study region, comparing the polygon classification predictions obtained by the five methods and the true classification. The maps were created by applying the trained methods to the complete dataset (training and testing) so that the maps are fully rendered. Since accuracy on the training subset is higher, the maps may appear more accurate than the metrics reported in Table 5 suggest.
The maps of the study region show areas highlighted in dark colors (dark green and dark blue), representing polygons that the methods correctly predicted as sugarcane (true positive) and non-sugarcane (true negative), respectively. Areas in green tones indicate positive predictions, that is, areas classified as sugarcane. The three maps referring to LR (maps a, b and c) are similar and have a large number of polygons classified as false positive (light green) and false negative (light blue). In contrast, the map referring to DT (map d) has a large number of polygons classified as false positive (light green). Finally, the map referring to RF (map e) has few incorrectly classified polygons, resembling the actual map (map f).
The logistic regression model was used as a parametric approach for classification, with and without a penalty term (Lasso and Ridge). For a non-parametric approach, decision tree and random forest were used. The decision tree algorithm is simple and easy to interpret. Furthermore, by aggregating many decision trees, using methods such as random forest, the predictive performance of trees could be improved. As a final consideration, a structural comparison between the parametric and non-parametric approaches is not meaningful, because the methods make different assumptions about the data.

Conclusions
In this paper, the prediction quality of logistic regression, decision tree and random forest was compared in identifying sugarcane areas using Landsat images. The methods were calibrated on training data covering 17 consecutive months, starting in January 2015, in a region of 306,000 hectares located in the state of São Paulo, Brazil, which was divided into approximately 46,000 polygons. The method with the highest accuracy, considering a test sample of 20% of the data, was RF, with 78.51%, followed by DT with 72.30%, LR-L with 69.64%, LR-R with 69.64% and LR with 69.52%. However, it is important to note that the performance of machine learning algorithms may vary depending on the specific dataset and the parameters used.
According to the performance metrics used in the study, RF is the most recommended method among those investigated for sugarcane mapping, considering the spectral indices from January 2015 to May 2016 (17 months in total). RF provided a satisfactory result for the test set (20% of the total dataset). It is worth mentioning that the methodology presented here can be applied to map other crops using satellite images.
Future studies should be carried out in order to reduce the number of polygons incorrectly classified as sugarcane (i.e., false positives) by implementing a post-classification method that considers the classification of neighboring segments to confirm the prediction of each segment.

Figure 2. Loadings of the first four principal components.

Figure 3. Representation of polygons on the map of the study region with colors to present the prediction results of the methods considered in the study, as well as the actual map.

Table 1. Description of vegetation indices. NIR, R, B, and G are reflectances in the near-infrared, red, blue and green bands, respectively.

Table 2. The structure of a confusion matrix for binary problems.

Table 3. Principal component, explained variance ratio and cumulative explained variance ratio.

Table 4. Confusion matrices for the classification algorithms. LR: logistic regression; LR-R: logistic regression with Ridge; LR-L: logistic regression with Lasso; DT: decision tree; RF: random forest.