-
Notifications
You must be signed in to change notification settings - Fork 6
/
OAF_maize_prod_preds.Rmd
505 lines (393 loc) · 25.5 KB
/
OAF_maize_prod_preds.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
---
title: Predicting spatial yield potentials from survey data
author: M.G. Walsh, J. Chamberlin, J. Silva and S. Aston
date: "`r format(Sys.time(), '%d, %B, %Y')`"
output:
html_document:
toc: true
toc_depth: 1
fig_caption: true
css: style.css
---
```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE)
```
# Introduction
This notebook develops an example of a spatial crop yield potential model. The approach consists of a combination of a standard [Production function (PF)](https://en.wikipedia.org/wiki/Production_function) that considers outputs relative to inputs and a novel **Site index (SI)**, which describes the quality of the production environment for the purpose of growing a particular crop. The combined model can be expressed as: **y ~ f (PF, SI)**, and the explicit multilevel model version of this can be used to gauge the potential productivity of croplands and to provide a comparative frame of reference for evaluating management options.
The SI is the spatial part, which we predict using remote sensing and GIS data with machine learning in [R](https://www.r-project.org/). We intend for the approach to be generalizable to other crop productivity surveys and experiments in which both input/output relationships as well as variability in site quality are deemed important for crop management as well as for program impact monitoring and evaluation.
The example maize yield data we shall use were collected over two cropping seasons (2016 & 2017) by [One Acre Fund (OAF)](https://oneacrefund.org/) in Western Kenya as part of their annual program monitoring activities. This is a fairly large dataset with 5,987 georeferenced plots distributed over the two cropping seasons and across 333 Level-3 administrative areas i.e., *Locations*. The data are split between plots belonging to farmers that were participating in the OAF maize program (*treated*) and those who were not participating (*controls*) at the time. Data on fertilizer input use on each plot (Calcium Ammonium Nitrate & Diammonium Phosphate) and estimated plot size are also included.
Other than GPS coordinates, the original OAF dataset does not contain any data related to site quality for maize production. We will be using a raster stack with 46 layers covering anthropic, climate, organismal/vegetation, relief/topographical, and geological/soil factors. Though these data are regularly updated, the emphasis is on variables that are expected to change relatively slowly e.g., land form and terrain, climate, soils, parent material, infrastructure, long-term vegetation and reflectance averages, among others. The main reason for this is that these *slow variables* can become key controlling factors for the faster variables such as crop yields, but over longer time scales. It is also opportune that most of the feature layers have been harmonized and are available for all of Africa.
The notebook is available and will be maintained on [MGWs Github](https://github.com/mgwalsh/TRM/blob/master/OAF_maize_prod_preds.Rmd). To actually run it to calculate and see additional outputs, you will need to load the packages indicated in the chunk directly below. The authors of these packages are gratefully acknowledged for all of their hard work.
```{r}
# package names
packages <- c("downloader", "rgdal", "sp", "raster", "quantreg", "arm", "leaflet", "htmlwidgets", "devtools", "caret", "mgcv", "MASS", "randomForest", "gbm", "nnet", "plyr", "doParallel", "dismo")
# install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# load packages
invisible(lapply(packages, library, character.only = TRUE))
```
# General data setup
This first section of the notebook sets up the survey data and links these to current remote sensing and GIS data for Kenya that are maintained by [AfSIS (2020)](https://osf.io/4jvnu). The following chunk fetches the data and links these to the remote sensing and GIS layers represented by `grids` variable stack.
```{r, warning=FALSE}
# Data setup --------------------------------------------------------------
# Create a data folder in your current working directory
dir.create("OAF_data", showWarnings=F)
setwd("./OAF_data")
dir.create("./Results")
# download OAF yield data
download("https://www.dropbox.com/s/lhniws8prfvjw9r/OAF_yield_data.csv.zi?raw=1", "OAF_yield_data.csv.zip", mode = "wb")
unzip("OAF_yield_data.csv.zip", overwrite = T)
yield <- read.table("OAF_yield_data.csv", header = T, sep = ",")
yield$trt <- ifelse(yield$trt == 1, "oaf", "control")
# yield <- yield[!duplicated(yield), ] ## removes duplicates if needed
# download GADM-L3 shapefile (@ http://www.gadm.org)
download("https://www.dropbox.com/s/otspr9b9jtuyneh/KEN_adm3.zip?raw=1", "KEN_adm3.zip", mode = "wb")
unzip("KEN_adm3.zip", overwrite = T)
shape <- shapefile("KEN_adm3.zip")
# download raster stack
download("https://osf.io/4jvnu?raw=1", "KE_250m_2020.zip", mode = "wb")
unzip("KE_250m_2020.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)
```
As the `yield` dataframe only covers portions of Western Kenya, we initially define the relevant [Region of Interest](https://en.wikipedia.org/wiki/Region_of_interest) based on the min/max survey coordinates and crop all of the `grids` raster stack to that bounding box. This is mostly to reduce computing times.
```{r}
# set Region of Interest grid extent
ext <- data.frame(lat = c(-1.2,-1.2,1.2,1.2), lon = c(33.9,35.5,33.9,35.5)) ## set ROI extent in degrees
names(ext) <- c("lat","lon")
coordinates(ext) <- ~ lon + lat
proj4string(ext) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
ext <- spTransform(ext, CRS("+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
bb <- extent(ext)
grids <- crop(grids, bb)
```
This next chunk extracts the GIS and grid data to the `yield` dataframe and also does a bit of outlier cleaning. It generates a new dataframe called `gsdat`, which will the main focus for fitting and validating the various machine learning models that are presented in subsequent sections.
```{r}
# attach GADM-L3 admin unit names from shape
coordinates(yield) <- ~lon+lat
projection(yield) <- projection(shape)
gadm <- yield %over% shape
yield <- as.data.frame(yield)
yield <- cbind(gadm[ ,c(5,7,9)], yield)
colnames(yield) <- c("district","division","location","id","lat","lon","year","trt","can","dap","fsize","yield")
# project survey coords to grid CRS
yield.proj <- as.data.frame(project(cbind(yield$lon, yield$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(yield.proj) <- c("x","y")
yield <- cbind(yield, yield.proj)
coordinates(yield) <- ~x+y
projection(yield) <- projection(yield)
# extract gridded variables at survey locations
yieldgrid <- extract(grids, yield)
gsdat <- as.data.frame(cbind(yield, yieldgrid))
# clean data
gsdat <- gsdat[complete.cases(gsdat[,c(1:3,13:44)]), ] ## removes incomplete cases
gsdat <- gsdat[which(gsdat$can < 100 & gsdat$dap < 100), ] ## removes outlier fertilizer treatments
gsdat <- gsdat[which(gsdat$fsize > 0), ] ## removes field size = 0
quant <- quantile(gsdat$yield, probs=c(0.025,0.975))
gsdat <- gsdat[which(gsdat$yield > quant[1]), ] ## removes low outlier yields (<2.5% quantile)
gsdat <- gsdat[which(gsdat$yield < quant[2]), ] ## removes high outlier yields (>97.5% quantile)
# write out clean dataframe
write.csv(gsdat, "./OAF_data/Results/OAF_gsdat.csv", row.names = F)
# yield survey location map widget
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
saveWidget(w, 'OAF_yield_survey.html', selfcontained = T) ## save widget
w ## plot widget
```
# Fit an aggregate production function
We use [Quantile regression]((https://cran.r-project.org/web/packages/quantreg/quantreg.pdf)) to fit a simple [Cobb-Douglas](https://en.wikipedia.org/wiki/Cobb%E2%80%93Douglas_production_function) type function to the yield survey data ... without any additional spatial features. Median regression (i.e. 50th quantile regression) is preferred to mean regression in this case, because it is quite robust to outlying observations. The residuals of this function are then evaluated and classified into 2 groups of plots: group A for which observed yields lie above the regression line and group B for which observed yields lie below it. This creates a new variable `silab` in the `gsdat` dataframe.
```{r}
# this is a basic production function using quantile regression
si.rq <- rq(log(yield)~year+factor(trt)+log(dap+1)*log(can+1), tau = 0.5, data = gsdat)
gsdat$silab <- as.factor(ifelse(exp(predict(si.rq, gsdat)) > gsdat$yield, "B", "A")) ## classify the residuals
summary(si.rq)
```
As might be expected (and as shown in the boxplot below), there are substantial differences in the observed maize yields between the two, what we shall refer to as **Site Index labels**, which are **not attributable** to differences between years, treatments (OAF *treated vs control*), nor the application of fertilizers. In the next sections we explore to what degree these differences may be attributable to the spatial characteristics of the production environment ... in other words the SI as indicated by `silab`.
```{r, echo=FALSE, fig.align = "center", fig.cap = "Differences in observed maize yields (t/ha) between Site Index labels."}
par(pty="s", mar=c(4,4,1,1))
boxplot(yield~silab, notch=T, ylab="Maize yield (t/ha)", xlab="SI label", cex.lab=1.4, gsdat)
```
# Machine-learning-based predictive mapping of SI
The following chunks generate `silab` predictions using different machine learning algorithms (MLAs) with varying remote sensing and GIS (covariate) features. The main idea is to train a number of potentially contrasting models with [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). At the end of the model training processes, the various models will be ensembled [(stacked)](https://datasciblog.github.io/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/) on an *independent* validation dataset. When consistently applied over time and space, this is a form of [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning), which should produce increasingly accurate predictions as new field and remote data or different MLAs are obtained and run.
Note that you are not limited to only the models that we use. The [`caret`](https://topepo.github.io/caret/index.html) package, which we rely on here, offers many compelling alternatives. We encourage you to explore and experiment with those.
## Model setups
The following chunk scrubs some of the objects in memory and creates a randomized partition between the training and validation dataframes.
```{r}
rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrubs extraneous objects in memory
# set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
# split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$silab, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]
# Site index calibration labels
labs <- c("silab")
lcal <- as.vector(t(gs_cal[labs]))
# raster calibration features
fcal <- gs_cal[,13:31,35:58]
```
Note that in running the models below everything is parallelized to facilitate efficient use of either local or cloud-based computing resources. Note that there are also other options available for this (e.g., [foreach](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html), among others.
## Spatial trend model ([mgcv](https://cran.r-project.org/web/packages/mgcv/mgcv.pdf))
This is a simple spatially smoothed *generalized additive model* applying the `gam` function on the SI label at different sampling locations in Western Kenya, based only on their georeference. It is similar to ordinary indicator kriging with cross-validation ... but it is simpler and much faster to compute in this context.
```{r, results='hide'}
# select central place covariates
gf_cpv <- gs_cal[,32:34]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
gm <- train(gf_cpv, lcal,
method = "gam",
preProc = c("center","scale"),
family = "binomial",
metric = "ROC",
trControl = tc)
# model outputs & predictions
summary(gm)
gm.pred <- predict(grids, gm, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)
```
## Central place model ([MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf))
Central places are influential variables in places where human impacts occur. They are correlated with both extraction and deposition of soil nutrients and toxic elements, soil erosion and deposition, acidification and many other soil disturbance processes. The model below focuses on central place indicators such as distances to roads and settlements, surface water sources, cell towers and electricity networks among others.
```{r, results='hide'}
# select central place covariates
gf_cpv <- gs_cal[,20:31]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
gl1 <- train(gf_cpv, lcal,
method = "glmStepAIC",
family = "binomial",
preProc = c("center","scale"),
trControl = tc,
metric ="ROC")
# model outputs & predictions
gl1.pred <- predict(grids, gl1, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gl1.rds", sep = "")
saveRDS(gl1, fname)
```
## GLM model with all the spatial features ([MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf))
This model is very similar to the Central place model above, but it contains all of the 46 spatial features and then backward selects from those to generate a prediction via a generalized linear model. Alternatively (or additionally) you could also try regularized regression with e.g., [glmnet](https://cran.r-project.org/web/packages/glmnet/index.html) here.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
gl2 <- train(fcal, lcal,
method = "glmStepAIC",
family = "binomial",
preProc = c("center","scale"),
trControl = tc,
metric ="ROC")
# model outputs & predictions
gl2.pred <- predict(grids, gl2, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gl2.rds", sep = "")
saveRDS(gl2, fname)
```
## Random forest ([randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf))
The below is a bagging chunk that uses [Breiman & Cutler's](https://link.springer.com/article/10.1023/A:1010933404324) algorithm with all of the feature data. A good, short article to look at for reference in context here is [Barnard et al.](https://www.researchgate.net/publication/331328203_Can't_see_the_random_forest_for_the_decision_trees_selecting_predictive_models_for_restoration_ecology).
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
tg <- expand.grid(mtry = seq(1,5, by=1)) ## model tuning steps
# model training
rf <- train(fcal, lcal,
preProc = c("center","scale"),
method = "rf",
ntree = 501,
metric = "ROC",
tuneGrid = tg,
trControl = tc)
# model outputs & predictions
rf.pred <- predict(grids, rf, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_rf.rds", sep = "")
saveRDS(rf, fname)
```
## Generalized boosting ([gbm](https://cran.r-project.org/web/packages/gbm/gbm.pdf))
This next chunk represents one of the *boosting* techniques that can be used for both regression or classification. It is similar to the `randomForest` above, but uses a boosting technique that emphasizes successful predictions rather than penalizing poor predictions via *bagging*. There is a wide-array of literature around the so-called *"greedy algorithms"*. Very good descriptions of these are provided in [Hastie et al, 2009](https://web.stanford.edu/~hastie/ElemStatLearn/).
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T, summaryFunction = twoClassSummary,
allowParallel = T)
## for initial <gbm> tuning guidelines see @ https://stats.stackexchange.com/questions/25748/what-are-some-useful-guidelines-for-gbm-parameters
tg <- expand.grid(interaction.depth = seq(2,5, by=1), shrinkage = 0.01, n.trees = seq(101,501, by=50),
n.minobsinnode = 50) ## model tuning steps
# model training
gb <- train(fcal, lcal,
method = "gbm",
preProc = c("center", "scale"),
trControl = tc,
tuneGrid = tg,
metric = "ROC")
# model outputs & predictions
gb.pred <- predict(grids, gb, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gb.rds", sep = "")
saveRDS(gb, fname)
```
## Neural network ([nnet](https://www.rdocumentation.org/packages/nnet/versions/7.3-15/topics/nnet))
With the last model we fit here is a simple feed-forward neural network i.e., a "single layer perceptron". This is a type of linear classifier, which combines a set of weights with the feature vector. Note that more complex multilayer network structures, such as [Deep-learning](https://en.wikipedia.org/wiki/Deep_learning) could be applied, but we leave those options for you to explore.
```{r, results='hide'}
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
tg <- expand.grid(size = seq(2,10, by=2), decay = c(0.001, 0.01, 0.1)) ## model tuning steps
# model training
nn <- train(fcal, lcal,
method = "nnet",
preProc = c("center","scale"),
tuneGrid = tg,
trControl = tc,
metric ="ROC")
# model outputs & predictions
nn.pred <- predict(grids, nn, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_nn.rds", sep = "")
saveRDS(nn, fname)
```
# SI ensemble predictions
The main point here is not to evaluate a *best individual model* but rather to evaluate the combination of the previously fitted models against a 20% [hold-out](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) validation dataset. This provides robust statistical estimates of how the different models should be weighted against one-another in an ensemble. It also prevents most [overfitting](https://en.wikipedia.org/wiki/Overfitting) problems.
## Stacking model setup on validation set
```{r}
# Model stacking setup ----------------------------------------------------
preds <- stack(gm.pred, gl1.pred, gl2.pred, rf.pred, gb.pred, nn.pred)
names(preds) <- c("gm","gl1","gl2","rf","gb","nn")
# plot(preds, axes = F)
# extract model predictions
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val)
gspred <- as.data.frame(cbind(gs_val, gspred))
# stacking model validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs]))
fval <- gspred[,60:65] ## subset validation features
```
## Model stacking
The following chunk fits the model ensemble with the `glmStepAIC` function from the `MASS` library. You could explore other options here, but we find that this provides a reasonable combination and weighting of the 6 models that were produced in the individual model training steps.
```{r, results='hide'}
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
si <- train(fval, lval,
method = "glmStepAIC",
family = "binomial",
metric = "ROC",
trControl = tc)
# model outputs & predictions
si.pred <- predict(preds, si, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_si.rds", sep = "")
saveRDS(si, fname)
```
```{r, echo=FALSE}
summary(si)
```
SI theoretically takes on values between 0 - 1, just like a probability. The chunk below generates the prediction map of the SI across the OAF Area of Interest in Western Kenya.
```{r}
# project si.pred to EPSG:3857
sill <- projectRaster(si.pred, crs="+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# set color pallette
pal <- colorBin("Greens", domain = 0:1)
# render map
w <- leaflet() %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addRasterImage(sill, colors = pal, opacity = 0.5) %>%
addLegend(pal = pal, values = values(sill), title = "Site index")
w ## plot widget
```
# Prediction checks
This next chunk provides some prediction checks on the ensemble model based on both the validation, as well as the complete dataframes. It also writes out all of the prediction grids to a geotif file, which you can import into a GIS of your choosing for display, query and additional analyses.
```{r}
# validation set receiver-operator characteristics ------------------------
cp_pre <- predict(si, fval, type="prob")
cp_val <- cbind(lval, cp_pre)
cpa <- subset(cp_val, cp_val=="A", select=c(A))
cpb <- subset(cp_val, cp_val=="B", select=c(A))
cp_eval <- evaluate(p=cpa[,1], a=cpb[,1]) ## calculate ROC's on test set (~0.71)
# plot(cp_eval, 'ROC') ## plot ROC curve
# generate feature mask ---------------------------------------------------
t <- threshold(cp_eval) ## calculate classification threshold based on validation set ROC
r <- matrix(c(0, t[,1], 0, t[,1], 1, 1), ncol=3, byrow = T) ## set threshold value <kappa>
mask <- reclassify(si.pred, r) ## reclassify stacked predictions
# write prediction grids to geotif ----------------------------------------
gspreds <- stack(preds, si.pred, mask)
names(gspreds) <- c("gm","gl1","gl2","rf","gb","nn","si","mk")
fname <- paste("./OAF_data/Results/","OAF_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
# Site Index prediction check ---------------------------------------------
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
gsout$mzone <- as.factor(ifelse(gsout$mk == 1, "A", "B"))
confusionMatrix(gsout$mzone, gsout$silab) ## overall prediction accuracy stats
```
# Multilevel model of maize yield potentials
We use a [Multilevel model](https://en.wikipedia.org/wiki/Multilevel_model) to combine the predicted SIs with the previously presented aggregate production function using the [`arm`](https://www.rdocumentation.org/packages/arm/versions/1.11-2) package. In this particular example we specify a random intercept model with in which the model intercepts i.e., the efficiency parameters of the combined model, are allowed to vary by `year` and administrative `location`.
```{r}
yld.lme <- lmer(log(yield)~factor(trt)*log(si/(1-si))+log(can+1)*log(dap+1)+(1|year)+(1|location), data = gsout)
gsout$yldf <- exp(fitted(yld.lme, gsout))
summary(yld.lme) ## mixed model yield estimate results
```
# Uncertainty estimates
There are many ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach using quantile regression with ([quantreg](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf)). The plot below shows the spread of the ROI-wide predictions (sensu, their 95% probable intervals).
```{r}
stQ <- rq(yield~yldf, tau=c(0.025,0.5,0.975), data=gsout) ## quantile regression fit
print(stQ)
```
```{r, fig.align = "center", fig.cap = "Quantile regression plot of modeled potential vs observed maize yields (t/ha) in Western Kenya. The blue lines are the 2.5% and 97.5% quantile regression estimates."}
par(pty="s", mar=c(4,4,1,1))
plot(yield~yldf, xlab="Yield potential (t/ha)", ylab="Measured yield (t/ha)", cex.lab=1.4,
xlim=c(0,8), ylim=c(0,8), gsout)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=0, to=8, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=0, to=8, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=0, to=8, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
```
We can now write out the complete output dataframe, including all of the predictions for reproducibility, reference and reuse.
```{r}
# Write output data frame -------------------------------------------------
write.csv(gsout, "./OAF_data/Results/OAF_gsout.csv", row.names = F)
```