Abstract

Conventional soil maps are designed based on expert criteria, a characteristic that reduces their reproducibility and generates subjective uncertainty. Pedometric mapping uses mathematical and statistical principles, which makes it the opposite of conventional mapping. It was proposed to apply the pedometric mapping in San Mateo de Otao and find out its characteristics against the conventional one. Satellite and field data were used to extract covariables (soil-forming factors) and soil classes. The data were modeled with Naïve Bayes, global uncertainty was calculated by resubstitution, cross-validation and retention, and local uncertainty with the confusion and Shannon indices. A low uncertainty map was obtained with six identified soil classes, relief, and parent material having the most important covariates. We conclude that pedometric mapping has considerable advantages over conventional mapping and its application is possible under the context of soil survey in Peru.

1. Introduction

Conventional soil maps present the location and boundaries of soil groups (mapping units); they are consolidated as a basic input for the creation of new maps that diagram the properties of the soils used for various purposes. Its main use is for the development of land use plans, its evaluation, and prediction of the effects of these uses [1]. The main limitations they have are the lack of an uncertainty value, soil classes cannot be mapped and the relationship of the soil with its formation factors is understood subjectively. Even if they are well designed with geographic information systems, Brevik et al. [2] emphasized that this method preserves the limitations of paper maps, since the use of human criteria for the synthesis and interpretation of the model of soil formation factors leads to an accumulation of error that is impossible to measure and far from randomness. Conventional soil maps are considered obsolete and are opposed to pedometric soil maps [3] that apply mathematical, statistical, and computational methods to model, interpret, and map soils. Several countries use this method to produce soil classes maps on national scales, Germany [4], India [5], Iran [6], France [7], and the United States [8] are some examples. The common characteristic of all studies is a large number of predictors and samples used to generate and validate the models. The rugged topography of the Peruvian Andes limits the sampling process to a low density of soil samples, which limits the use of computationally sophisticated mapping methods. The recent increase in the number and types of remote sensing platforms is an opportunity to improve the number of predictors, staying as a challenge using machine learning models to increase the pedological understanding [9]. Few soil studies in Peruvian territory use statistical [10] or computational [11, 12] methods to map soils, commonly conventional mapping is chosen [13]. Bayesian methods can be used for small datasets; Steinbuch [14] showed that in small datasets, an appropriate prior selection improves overall accuracy compared to non-Bayesian methods. A common and simple machine learning method that incorporates the Bayesian approach is Naïve Bayes. Incorporating pedometric mapping into Peruvian soil surveys can provide new knowledge of the relationship of soil and their formation factors and the uncertainty value of the maps. It was proposed applying pedometric mapping in the district of San Mateo de Otao to map soil classes using Naïve Bayes model and understand the differences from conventional mapping.

2. Materials and Methods

2.1. Location

The study area is a small catchment (77.48 km2) located within the limits of San Mateo de Otao district of the Huarochiri Province, Lima, Peru. The main Holdridge life zones (Figure 1(a)) were tropical montane moist forest (TM-mf), tropical premontane perarid desert (TP-pd), tropical montane steppe (TM-s), tropical lower montane thorn steppe (TLM-ts), tropical lower montane desert scrub (TLM-ds), and tropical premontane desert scrub (TPM-ds). All life zones, except the TM-mf, are arid ecosystems (evapotranspiration > precipitation). The elevation varies from 1200 (TP-pd) to 4350 (TM-mf) meters above sea level; as a consequence, the air temperature is very variable. TP-ds and TPM-ds are the warmest areas (22–24°C), TLM-ds and TLM-ts are cool zones (15–17°C), and TM-s and TM-mf are the coldest (10–12°C) zones [15]. The relative area occupied by each life zone was 21.60 (TM-mf), 2.12 (TP-pd), 21.59 (TM-s), 33.69 (TLM-ts), 9.79 (TLM-ds), and 11.22 (TPM-ds) percent. Geology (Figure 1(b)) consists of alluvial deposits (Qp-al), porphyritic andesite (P-r), rhyolitic tuff (Nm-h), andesites (N-and, J-ar), tonalities, diorites, and granodiorites (Ki-bc/p-di, tn; Ki-bc/sr-tn, gd; PN-tn, gd) [16, 17].

2.2. Data Source

All the Landsat 8 images collection 1 and level 1 (courtesy of the United States Geological Service) of 2018 were downloaded. The dark object subtraction atmospheric correction was applied using the semiautomatic classification plugin [18] of the QGIS 3.10; clouds and shadows were removed using the quality assessment (QA) band. The images were averaged to later perform the index calculation. ASTER L1T satellite image from 07/12/2015, Jaxa/Meti digital elevation model, WorldClim 2.0 climate grids [19], and the integrated geology map [20] were used as base inputs for the generation of representative variables of the soil formation factors.

2.3. Soil Sampling and Analysis

The sampling design was by stratification of study area and free mapping (locations were selected by convenience within the strata). The limits of the strata were equivalent to the limits of the homogeneous land units. The geological units were grouped into extrusive, intrusive igneous rocks, and alluvial sediments. Life zones were grouped by climatic affinity in four keys: 1 (TP-pd), 2 (TLM-ds and TPM-ds), 4 (TLM-ts), 5 (TM-s), and 10 (TM-mf). The landscape was classified using Zinck methodology [21] in valleys (Va), extrusive mountain slopes (Le), and intrusive mountain slopes (Li). The homogeneous land units were obtained by the intersection of the landscape and grouped life zones (Figure 1(c)). A total of 64 samples (soil profiles) were selected by free mapping and were taken within each homogeneous land unit. Each sample (soil profile) was evaluated by describing the pedon’s internal and external characteristics. Then, the subsamples (horizons) were evaluated in the soil laboratories of the National Agrarian University La Molina. The physical-chemical analyses were carried out based on the Soil Survey Staff protocols ([22, 23]). Particle size distribution was carried with hydrometer method, pH, and electrical conductivity in suspension soil:water 1 : 1, total equivalent carbonates by acid digestion, soil organic carbon by Walkley and Black method, cation exchange capacity with ammonium acetate at pH 7.0, and exchangeable cations by atomic absorption spectrophotometry with a Perkin Elmer model AAnalyst 200 equipment. The soil classes were obtained with the soil taxonomy system [24] down to the subgroup level.

2.4. Predictor Variables

The digital elevation model was processed with the basic terrain analysis module of the SAGAGIS 6.4.0 to extract ten representative variables of the relief factor. The climate factor was extracted from WorldClim v 2.0. and the ET was calculated using the climate and weather-evapotranspiration module of SAGAGIS 6.4.0, Hargreaves method. The ASTER images were used to calculate the quartz, carbonate, and mafic index [25] and Landsat 8 for the Normalized Difference Vegetation Index and soil biological crust [26]. The complete description of the variables obtained is found in Table 1.

2.5. Sampling Quality

The objective of the sampling is to obtain a fraction of representative individuals of the population whose distribution function is close to the population distribution. This comparison was made graphically by observing the histograms of the population and the sample with the Scott method for the number of bins and using the Kolmogorov–Smirnov hypothesis test at a significance level of five percent (α = 0.05).

2.6. Modeling and Uncertainty

The Naïve Bayes method was used including the soil classes as the response variable (qualitative) and the predictor variables. The method requires conditional independence between the predictors [27], but all of them were used for our model. Before modeling, a Yeo-Johnson standardization and transformation were performed. The database presented unbalanced classes, a problem that was solved using an oversampling balancing in the response variable. We use a Laplace smoother of 100 and the kernel method to estimate the density function without adjusting the bandwidth (value of one).

General uncertainty measures were obtained with three procedures: (1) by resubstitution, generating a confusion matrix with the same data that the model was generated; (2) by retention, dividing the total data set into 70 percent to estimate the model and 30 percent for validation; and (3) 10-fold cross-validation with 100 repetitions. Local uncertainty was measured using the confusion and Shannon indices. The first was proposed by Burrough et al. [28] and measures the degree of confusion between the two soil taxa with the highest probability of occurrence. A class close to zero means that two soil taxa are similarly likely to occur and values close to one that only one soil taxa dominate. The Shannon Index is very common in ecology and is used to measure the diversity of species; we use the original formula proposed by Shannon [29] with base e for logarithm and the result units were bits. Values close to zero indicate the predominance of few soil taxa and high values the existence of several taxa at the same pixel. The feature importance based on the ROC curve was calculated using caret package, the result was scaled between 0 and 100, and radar plots were used to summarize the importance of the variables in the prediction of each soil class. All procedures previously described were carried out in R language [30] and caret package [31].

3. Results and Discussion

3.1. Soil Description

Six soil classes were determined at the subgroup level: twenty-three Lithic Torriorthents (LT), thirteen Typic Torriorthents (TT), eleven Typic Haplocambids (TH), five Sodic Haplocambids (SH), six Fluventic Haplocambids (FH), and six Lithic Haplocambids (LH). They are distributed in colluvial-alluvial parent materials (4.3% of study area), residual intrusive materials (35.4%), and residual extrusive materials (60.3%). The intrusive material was the product of the weathering of granodiorites and diorites while extrusive material was of andesite and volcanic tuff. The sediments are coarse-textured with alkali pH, a product of the interaction with the arid climate.

3.2. Soil Sampling Quality

The results of the Kolmogorov–Smirnov test (Table 2) indicate that the ICO, RSP, PPT, and RS differ from the population distribution. In the graphic analysis (Figure 2), no marked differences are observed, which indicates that the samples are representative of the population and will not have negative repercussions on the estimation of the proposed model. This way of evaluating the quality of the sampling based on the predictors was successfully tested by Carvalho [32].

3.3. Uncertainty Analysis

All validation results show that the model is good (Table 3); global accuracy values are greater than the noninformative ratio which confirms the statement. The result for resubstitution is the highest; unfortunately, the numerical value is not very useful because when using the same data in modeling and validation, it tends to be overestimated [33]. The retention values were 0.67 and 0.60 for global accuracy and Kappa Index, respectively, and are the lowest of the three methodologies. These have the disadvantage of sacrificing samples for validation in a context where samples less than 100 units are not convenient. Accuracy with cross-validation is lower than retention, but the confidence intervals overlap, showing that both results are not statistically different. Cross-validation is one of the most widely used methodologies in soil classes pedometric mapping [3436] which reinforces that cross-validation is the most appropriate method to estimate global uncertainty in the conditions of our study.

The results are very promising compared to similar studies. Silva et al. [37] obtained values of 0.49 and 0.33 for the uncertainty coefficients using the random forest in 74 samples. Taghizadeh et al. [38] modeled the subgroups of 194 samples with ensemble models and obtained 0.43 of global accuracy and Mirakzehi et al. [39] obtained 0.44 with 108 samples and random forest. Jeune et al. [40] obtained a Kappa Index of 0.45 with multiple linear regression and 0.42 with random forest. All cases used k-fold cross-validation to obtain the parameters, but the values are much lower than ours.

The local uncertainty (Figure 3) shows that in the lower and upper part of the microbasin (lower left and upper right), there is little diversity of soils (Shannon Index), but there is a cooccurrence of the two most probable taxa (Confusion Index). These concepts can be used to delimit cartographic consociations and associations. The mean values of indexes are 0.31 and 0.43; in general, we can say that the study area has low pedodiversity. Lamichhane et al. [41] obtained similar results for little soil diversity and a slight predominance of a single taxon, with the exception of mountainous areas. However, their study was carried out on a national scale (1 : 1000000) and as a result of a disaggregation process to estimate the distribution of taxa based on the cartographic units of the soil. Odgers et al. [42] mentioned values of 0.1 and 0.9 to decide if there is confusion or not, these are very conservative, but using them as a reference, it is possible to affirm that the blue areas have greater uncertainty. This coincides with the areas where the density of samples was lower due to the harsh accessibility conditions.

3.4. Soil Classes Prediction

The cross-validation model was used to predict the probabilities of each soil class. A threshold of 0.5 was used to decide the soil class to be represented per pixel (Figure 4). Fluventic Haplocambids cover 13.36 percent, Lithic and Typic Haplocambids occupy 11.28 percent each; 15.45 is occupied by the Typic Torriorthents, 10.79 by the Sodic Haplocambids, and 37.84 by the Lithics Torriorthents. Arid conditions, steep slopes, and absence of vegetation favor the regressive evolution of soils and result in the predominance of Torriorthents, especially Lithic Torriorthents. The map shows that Haplocambids have random distribution and Torriorthents are spatially aggregated. Despite the favorable climatic conditions for the development of Aridisols, the other factors locally conditioned their distribution.

Studies that seek to predict soil classes use an average of 20 covariates [43] and the most widely used are those related to the digital elevation model, climate grids, and vegetation [44]. Our study used 21 covariates where Topographic Wetness Index (TWI), derived from the digital elevation model, stands out in importance for the prediction of Lithic Haplocambids, Sodic Haplocambids, Typic Torriorthents, and Lithic Torriorthents (Figure 5). Fluventic and Typic Haplocambids are more influenced by Total Solar Radiation (RS), quartz (QI), and carbonate (CI) indices. The irregularity of the importance of predictors does not allow us to indicate which soil-forming factor is the most influential for the study area; however, in broad strokes, we can highlight the relief and parent material.

Conventional soil maps represent the spatial distribution of mapping units and abstractions of the soil boundaries determined by photointerpretation [1]. In contrast, pedometric maps show the distribution of taxonomic units or soil classes. Consequently, conventional maps significantly underestimate the area occupied by each soil class [45] having a direct effect on the interpretation of the stakeholders. Pedometric mapping also allows quantifying the uncertainty of the final maps [46], data that is difficult to obtain with conventional methods. Reproducibility is another factor outperformed by pedometric mapping [47], using the same data and computer codes, different users obtain the same result. Some strong disadvantages are that pedologists require additional skills such as computational and statistical knowledge [48]. In places where the training of pedologists does not include these skills, it leads to rejection in the adoption of these methods. The development of applied research on pedometric mapping facilitates the migration from conventional methods and shows the advantages of using the methodology in local conditions.

4. Conclusion

Pedometric mapping applying the Naïve Bayes model produced the soil map of San Mateo de Otao with low uncertainty (accuracy = 0.77) and quantified the area occupied by each soil class. The Lithic Torriorthents and Sodic Haplocambids were the largest and smallest subgroups in the district. An aggregate (Torriorthents) and random (Haplocambids) distribution was observed present in the maps, and relief and parent material were identified as the most important factors in predicting soil classes. Everything suggests a considerable advantage of pedometric mapping compared to conventional mapping. The application on San Mateo de Otao and under the conditions of the soil survey of Peru suggest that it can be used in the national territory. It is recommended to continue investigating the application of pedometric mapping at different cartographic scales and with the most modern machine learning models.

Data Availability

The soil data, covariates, and code used to support the findings of this study have been deposited in the FigShare repository (https://doi.org/10.6084/m9.figshare.c.5839364.v1).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the laboratory for analysis of soils, plants, water, and fertilizers (LASPAF) of the National Agrarian University La Molina (UNALM). The authors thank Vanessa Llamoja, Rosel Aguilar, Michael Sandoval, Jose Tarrillo, Sandro Martinez, and the San Mateo community for their support in the field and laboratory evaluations.