Excuse my long question, but I really hope somebody can help me improve my code. Basically, this is what I would like to do: re-run the same model (for example, random forests) 10 times with different inputs. From each iteration I would like to extract several parameters from the model, and after all iterations compute their mean and standard deviation (for example mean AUC, mean bias). I could upload the input files, but my problem is connected to a step that doesn't directly rely on them, and I presume it can be solved with some coding. Here is an example:
I'm working with species distribution models, using data from a vignette accompanying the "dismo" package. All the code may be found here: https://rspatial.org/raster/sdm/6_sdm_methods.html#random-forest First I create a data frame of species occurrences (pb = 1) and pseudo-absences (pb = 0). These are accompanied by longitude and latitude coordinates in two columns; later, environmental variables are joined to each point. Everything works fine here, so I'm able to create a model. But I would like to make several models and average their results.
These are my initial steps:
require(raster)
#that is my file with occurrence points:
points_herb <- read.csv("herbarium.csv",header=TRUE)
points_herb <- points_herb[,2:3]
points_herb <- SpatialPointsDataFrame(coords = points_herb, data = points_herb, proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
> head(points_herb)
lon_x lat_y
1 19.62083 49.62917
2 19.64583 49.62917
3 20.23750 49.61250...
#Variables (I use variables from PCA ran on climate)
files <- list.files("D:/variables/",pattern='asc',full.names=TRUE)
predictors <- raster::stack(files)
> predictors
class : RasterStack
dimensions : 1026, 1401, 1437426, 2 (nrow, ncol, ncell, nlayers)
resolution : 0.008333333, 0.008333333 (x, y)
extent : 16.36667, 28.04167, 42.7, 51.25 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
names : PCA1, PCA2
#Assigning variables to points
presvals <- extract(predictors, points_herb)
#reading background points (about 20000):
points_back <- read.csv("back.csv",header=TRUE,dec = ".",sep = ",")
points_back <- points_back[,2:3]
points_back <- SpatialPointsDataFrame(coords = points_back, data = points_back, proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
#assigning variables to background/pseudo-absence points
absvals <- extract(predictors, points_back)
absvals <- unique(absvals)
#**this is important!** Sampling 1000 random points from my entire dataset containing ca. 20000
absvals_1 <- absvals[sample(nrow(absvals), 1000), ]
#making an input file for the modeling
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_1)))
sdmdata1 <- data.frame(cbind(pb, rbind(presvals, absvals_1)))
sdmdata1 <- na.omit(sdmdata1)
> head(sdmdata1)
pb PCA1 PCA2
1 1 9.985359 2.419048
2 1 8.711462 2.229476
...
I run the model:
#Random Forest
library(dismo)
library(randomForest)
#rf1- first random forest model
model_rf1 <- pb ~ PCA1 + PCA2
bc <- randomForest(model_rf1, data=sdmdata1)
#the model is predicted over a geographic space
bc_mod <- predict(predictors, bc, progress='')
#let's test it using CalibratR
require(CalibratR)
#extracting model probabilities to presence and absence points (those are actually from a separate dataset)
points_pres1 <- extract(bc_mod, points_pres1, cellnumbers=TRUE)
points_abs1 <- extract(bc_mod, points_abs1, cellnumbers=TRUE)
#prepare those data to test the model
testECE <- c(rep(1, nrow(points_pres1)), rep(0, nrow(points_abs1)))
testECE <- data.frame(cbind(testECE, rbind(points_pres1, points_abs1)))
testECE <- na.omit(testECE)
testECE <- subset(testECE, select = c(testECE, layer))
#make Expected Calibration Error
ECE <- getECE(testECE$testECE, testECE$layer, n_bins = 10)
#make Maximum Calibration Error
MCE <- getMCE(testECE$testECE, testECE$layer, n_bins = 10)
#some other test
require(Metrics)
#get RMSE values
RMSE <- rmse(testECE$testECE, testECE$layer)
random_forest_1 <- data.frame(mget(c('ECE', 'RMSE', 'MCE')))
rownames(random_forest_1) <- "random_forest1"
Then I would like to run the same model, but using different background points. So in that case I make another input file, with another 1000 random points from the entire dataset:
absvals_2 <- absvals[sample(nrow(absvals), 1000), ]
#the presence values stay the same; only the background sample changes
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_2)))
sdmdata2 <- data.frame(cbind(pb, rbind(presvals, absvals_2)))
sdmdata2 <- na.omit(sdmdata2)
model_rf2 <- pb ~ PCA1 + PCA2
bc <- randomForest(model_rf2, data=sdmdata2)
bc_mod <- predict(predictors, bc, progress='')
#again, let's test it using CalibratR
points_pres2 <- extract(bc_mod, points_pres2, cellnumbers=TRUE)
points_abs2 <- extract(bc_mod, points_abs2, cellnumbers=TRUE)
# everything just as above, the objects are overwritten
testECE <- c(rep(1, nrow(points_pres2)), rep(0, nrow(points_abs2)))
testECE <- data.frame(cbind(testECE, rbind(points_pres2, points_abs2)))
testECE <- na.omit(testECE)
testECE <- subset(testECE, select = c(testECE, layer))
ECE <- getECE(testECE$testECE, testECE$layer, n_bins = 10)
MCE <- getMCE(testECE$testECE, testECE$layer, n_bins = 10)
RMSE <- rmse(testECE$testECE, testECE$layer)
random_forest_2 <- data.frame(mget(c('ECE', 'RMSE', 'MCE')))
rownames(random_forest_2) <- "random_forest2"
#And finally let's make a mean from ECE, MCE, and RMSE
rf_results <- rbind(random_forest_1, random_forest_2)
rf_results_mean <- apply(rf_results, 2, mean)
#and standard deviation
rf_results_sd <- apply(rf_results, 2, sd)
result <- rbind(rf_results_mean, rf_results_sd)
In this example I made just 2 repetitions, but ideally I would like to make 10 or 100. How can I make this more elegant and automatic, rather than creating 100 objects manually?
Here's at least part of a solution for you, using purrr and dplyr and iterating over lists. This has the advantage of storing your samples and results in one data frame. In my example below I've used a randomly generated data frame and a very simple function to demonstrate; I'll point out at the end how you might fit this to your own data and processes. I haven't tried this on your code and data above, as it's quite long and complex and would take a while to get my head around your methods, but hopefully you'll be able to see how to work this into your own process.
In your data above, you'd want to use absvals[sample(nrow(absvals), 1000), ] to sample your data in step 3, and then put the parts after this into a function which returns a data frame with the columns you need. This is perhaps short of a full answer, but hopefully it offers some helpful steps towards iterating with purrr as a useful tool.

Edit: P.S. Please do comment with any questions or parts that aren't working, and I'll see if I can add any clarifications or notes above.
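Applied to your workflow, that function might look roughly like this. This is a sketch I have not run against your data: the names fit_and_test and n_back are mine, the body reuses the objects from your question (presvals, absvals, predictors, points_pres1, points_abs1), and I've simplified the extract calls by dropping cellnumbers=TRUE and working with plain value vectors.

```r
#sketch: one repetition of the question's workflow, returning one row of metrics
fit_and_test <- function(i, presvals, absvals, predictors,
                         points_pres, points_abs, n_back = 1000) {
  #draw a fresh background sample and assemble the model input
  absvals_i <- absvals[sample(nrow(absvals), n_back), ]
  pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_i)))
  sdmdata <- na.omit(data.frame(pb, rbind(presvals, absvals_i)))

  #fit the random forest and predict over geographic space
  bc <- randomForest::randomForest(pb ~ PCA1 + PCA2, data = sdmdata)
  bc_mod <- raster::predict(predictors, bc)

  #extract predicted probabilities at the test presence/absence points
  pres_vals <- raster::extract(bc_mod, points_pres)
  abs_vals  <- raster::extract(bc_mod, points_abs)
  obs  <- c(rep(1, length(pres_vals)), rep(0, length(abs_vals)))
  pred <- c(pres_vals, abs_vals)
  keep <- !is.na(pred)

  data.frame(iter = i,
             ECE  = CalibratR::getECE(obs[keep], pred[keep], n_bins = 10),
             MCE  = CalibratR::getMCE(obs[keep], pred[keep], n_bins = 10),
             RMSE = Metrics::rmse(obs[keep], pred[keep]))
}

#then, for example:
#results <- purrr::map_dfr(1:10, fit_and_test,
#                          presvals = presvals, absvals = absvals,
#                          predictors = predictors,
#                          points_pres = points_pres1, points_abs = points_abs1)
#rbind(mean = colMeans(results[-1]), sd = apply(results[-1], 2, sd))
```

Because everything that changes between repetitions happens inside the function, you only need to change the number passed to map_dfr to go from 10 to 100 repetitions.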