V.V. Dokuchaev Soil Science Institute

E-mail: info@esoil.ru
Tel/Fax: +7 (495) 951-50-37

The large scale digital mapping of soil organic carbon using machine learning algorithms

A. V. Chinilin1, I. Yu. Savin2

1RSAU-MTAA, 127550, Russian Federation, Moscow, Timiryazevskaya st., 49
2V.V. Dokuchaev Soil Science Institute, Russia, 119017, Moscow, Pyzhevskii per. 7-2

This paper presents the results of digital mapping of organic carbon content in the arable soil horizons of an area of the Central Russian Upland in Voronezh Oblast using machine learning methods, together with an accuracy assessment of the resulting models. The mapping was based on 22 soil sampling points, used for both training and validating the models, and on several sets of predictor variables. The predictor variables were a digital elevation model, its derivatives, and remote sensing data of different spatial resolutions. Spatial variability models of the property were built with several decision-tree-based methods: random forest, boosted regression trees, and Bayesian regression trees. Model accuracy was assessed by cross-validation, using the coefficient of determination, the mean absolute error, and the root mean square error as accuracy indices. The modelling results showed that the predictor set combining the digital elevation model, its derivatives, and Landsat 8 data gave more stable models: the coefficient of determination ranged from 0.6 to 0.7 and the cross-validated prediction error (RMSEcv) from 0.5791 to 0.6520, with the best model obtained by Bayesian regression trees. With the predictor set combining the digital elevation model, its derivatives, and Sentinel 2 data, the coefficient of determination ranged from 0.47 to 0.55 and the prediction error from 0.7031 to 0.7909. The most important predictor variables differed between the models built on different data sets.
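
The accuracy assessment described in the abstract (cross-validation scored by the coefficient of determination, mean absolute error, and root mean square error) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the single-predictor least-squares model `fit_linear` is only a stand-in for the tree-based methods, the choice of 5 folds is arbitrary, and the coefficient of determination is taken as 1 - SS_res/SS_tot, which may differ from the definition the authors used.

```python
import math
import random
from statistics import mean


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def mae(observed, predicted):
    """Mean absolute error."""
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_linear(xs, ys):
    """Hypothetical stand-in model: least squares on one predictor."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x


def cross_validate(xs, ys, k=5, seed=0):
    """k-fold cross-validation; returns out-of-fold predictions in input order."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    predictions = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit_linear([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            predictions[i] = model(xs[i])  # predict only points the model never saw
    return predictions


# Synthetic illustration only: 22 made-up points, not the study's data.
rng = random.Random(42)
elevation = [rng.uniform(100.0, 250.0) for _ in range(22)]
soc = [5.0 - 0.01 * e + rng.gauss(0.0, 0.3) for e in elevation]

cv_pred = cross_validate(elevation, soc, k=5)
print(f"R2_cv   = {r_squared(soc, cv_pred):.3f}")
print(f"MAE_cv  = {mae(soc, cv_pred):.3f}")
print(f"RMSE_cv = {rmse(soc, cv_pred):.3f}")
```

Scoring only out-of-fold predictions matters with a sample this small: each of the 22 points is predicted by a model that was not fitted on it, so the indices describe prediction error rather than goodness of fit.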

Key words: spatial prediction, digital elevation model, random forest, boosting

DOI: 10.19047/0136-1694-2018-91-46-62

Citation: Chinilin A.V., Savin I. Yu. The large scale digital mapping of soil organic carbon using machine learning algorithms, Dokuchaev Soil Bulletin, 2018, Vol. 91, pp. 46-62. doi: 10.19047/0136-1694-2018-91-46-62

