Chapter 2 Introduction

2.1 Modeling Approach

The modeling approach is machine learning for DSM (Hengl et al., 2017). Specifically, I used the Random Forest (RF) algorithm to model the relationship between topsoil (0-30 cm) SOC stock and a set of climatic covariates from the CHELSA global downscaled climate data set. The regression matrix used for model calibration is derived from a spatial overlay of the SOC observations on the covariates. The SOC observations were obtained from the Rapid Carbon Assessment Project (RCAP) using the soilDB package in R. Once the regression matrix is derived, model calibration, cross-validation and prediction are carried out in the usual way (e.g., Hengl et al., 2017). Specific details of these steps, such as model selection, choice of hyperparameters and computational aspects, are provided after the soil data and covariates have been introduced.
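The overlay step can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration; it is not the R workflow used in this study. The grid values, grid origin, cell size, observation coordinates and function names are all hypothetical. The sketch looks up the covariate cell under each observation point to build a small regression matrix, then assigns its rows to folds for cross-validation. Fitting the Random Forest itself is not shown.

```python
from typing import List, Optional, Tuple

# Hypothetical toy covariate raster: 3 rows x 4 columns.
# Values stand in for a single climatic covariate.
GRID: List[List[float]] = [
    [10.0, 10.5, 11.0, 11.5],
    [9.0, 9.5, 10.0, 10.5],
    [8.0, 8.5, 9.0, 9.5],
]
X_MIN, Y_MAX = -76.0, 42.0  # upper-left corner of the grid (assumed)
CELL = 1.0                  # cell size in degrees (assumed)


def extract(grid: List[List[float]], x: float, y: float) -> Optional[float]:
    """Return the value of the cell containing (x, y), or None if outside."""
    col = int((x - X_MIN) // CELL)
    row = int((Y_MAX - y) // CELL)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


def build_regression_matrix(
    observations: List[Tuple[float, float, float]],
    grid: List[List[float]],
) -> List[Tuple[float, float]]:
    """Overlay (x, y, soc) observations on the grid.

    Returns (soc, covariate) rows; points outside the grid are dropped.
    """
    rows = []
    for x, y, soc in observations:
        value = extract(grid, x, y)
        if value is not None:
            rows.append((soc, value))
    return rows


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Assign row indices 0..n-1 to k folds in round-robin order (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


# Hypothetical observations: (longitude, latitude, SOC stock).
OBS = [
    (-75.5, 41.5, 52.0),
    (-73.2, 40.1, 61.5),
    (-74.9, 39.3, 47.8),
    (-80.0, 41.0, 55.0),  # falls outside the grid and is dropped
]

matrix = build_regression_matrix(OBS, GRID)
folds = k_fold_indices(len(matrix), 3)
```

In practice the regression matrix has one column per covariate, and the folds would normally be drawn at random rather than in round-robin order.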

2.2 Study Area

The study area is the entire State of New Jersey, USA. The state covers approximately 8,729 square miles (22,610 km²), of which 14.9%, or 1,304 square miles (3,380 km²), is water and 85.1%, or 7,425 square miles (19,230 km²), is land.

2.3 Big Geo Data

Dealing with “big geo data” is challenging not only because it requires massive computational power and data storage capacity, but also because of the user time needed to compile and process the many types of remote sensing data. Preprocessing satellite data usually involves geometric and radiometric corrections and the application of filters to remove cloud, haze, shadow, and other artifacts, all of which take a considerable amount of time when producing large-scale maps. Fortunately, the CHELSA global downscaled climate data set, the source of the climatic covariates, provides many preprocessed data layers.

2.4 Calculations

Derivatives were calculated, and raster files were masked and cropped, on the layers of stacked or bricked raster files.
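The following minimal sketch shows what cropping, masking and deriving a layer mean when applied to every layer of a stack. It is written in Python with the standard library only, purely for illustration; it is not the raster tooling used in this study. The layer names (`tmax`, `tmin`), the cell values, the crop window and the mask are all hypothetical.

```python
from typing import Dict, List, Optional

Layer = List[List[Optional[float]]]  # None marks a NoData cell


def crop(layer: Layer, row0: int, row1: int, col0: int, col1: int) -> Layer:
    """Keep rows row0..row1-1 and columns col0..col1-1."""
    return [row[col0:col1] for row in layer[row0:row1]]


def mask(layer: Layer, keep: List[List[bool]]) -> Layer:
    """Set cells to None (NoData) wherever keep is False."""
    return [
        [v if k else None for v, k in zip(vrow, krow)]
        for vrow, krow in zip(layer, keep)
    ]


def difference(a: Layer, b: Layer) -> Layer:
    """Derived layer: cell-wise a - b, propagating NoData."""
    return [
        [None if x is None or y is None else x - y for x, y in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Hypothetical two-layer stack on a common 3 x 3 grid.
STACK: Dict[str, Layer] = {
    "tmax": [[20.0, 21.0, 22.0], [19.0, 20.0, 21.0], [18.0, 19.0, 20.0]],
    "tmin": [[5.0, 6.0, 7.0], [4.0, 5.0, 6.0], [3.0, 4.0, 5.0]],
}
# Hypothetical study-area mask for the cropped 2 x 2 window.
KEEP = [[True, False], [True, True]]

# Apply the same crop and mask to every layer in the stack.
cropped = {name: crop(layer, 0, 2, 1, 3) for name, layer in STACK.items()}
masked = {name: mask(layer, KEEP) for name, layer in cropped.items()}

# Derive a new layer from two masked layers.
trange = difference(masked["tmax"], masked["tmin"])
```

Because all layers in a stack share the same grid, one crop window and one mask can be applied to every layer, which keeps derived layers aligned with their inputs.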