Excuse my long question, but I really hope somebody can help me improve my code. Basically, this is what I would like to do: rerun the same model (for example, random forests) 10 times with different inputs. From each iteration I would like to extract several parameters of the model, and after all iterations compute their mean and standard deviation (for example mean AUC, mean bias). I could upload the input files, but my problem concerns a step that doesn't directly rely on them, and I presume it can be solved with some coding. Here is an example:

I'm working with species distribution models using data from a vignette accompanying the "dismo" package. All the code may be found here: https://rspatial.org/raster/sdm/6_sdm_methods.html#random-forest First I create a data frame of species occurrences (pb=1) and pseudo-absences (pb=0). These are accompanied by longitude and latitude coordinates in two columns; environmental variables are later joined to each point. Everything works fine here, so I'm able to create a model. But I would like to build several models and average their results.

These are my initial steps:

require(raster)
#that is my file with occurrence points:
points_herb <- read.csv("herbarium.csv",header=TRUE)
points_herb <- points_herb[,2:3]
points_herb <- SpatialPointsDataFrame(coords = points_herb, data = points_herb, proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
> head(points_herb)
 lon_x    lat_y
1 19.62083 49.62917
2 19.64583 49.62917
3 20.23750 49.61250...

#Variables (I use variables from PCA ran on climate)
files <- list.files("D:/variables/",pattern='asc',full.names=TRUE)
predictors <- raster::stack(files)
> predictors
class      : RasterStack 
dimensions : 1026, 1401, 1437426, 2  (nrow, ncol, ncell, nlayers)
resolution : 0.008333333, 0.008333333  (x, y)
extent     : 16.36667, 28.04167, 42.7, 51.25  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
names      : PCA1, PCA2

#Assigning variables to points
presvals <- extract(predictors, points_herb)
#reading background points (about 20,000):
points_back <- read.csv("back.csv",header=TRUE,dec = ".",sep = ",")
points_back <- points_back[,2:3]
points_back <- SpatialPointsDataFrame(coords = points_back, data = points_back, proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

#assigning variables to background/pseudo-absence points
absvals <- extract(predictors, points_back)
absvals <- unique(absvals)

#**this is important!** Sampling 1000 random points from my entire dataset containing ca. 20000
absvals_1 <- absvals[sample(nrow(absvals), 1000), ]

#making an input file for the modeling
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_1)))
sdmdata1 <- data.frame(cbind(pb, rbind(presvals, absvals_1)))
sdmdata1 <- na.omit(sdmdata1)

> head(sdmdata1)
  pb   PCA1    PCA2 
1  1  9.985359 2.419048 
2  1  8.711462 2.229476 
...

I run the model:

#Random Forest
library(dismo)
library(randomForest)
#rf1- first random forest model
model_rf1 <- pb ~ PCA1 + PCA2
bc <- randomForest(model_rf1, data=sdmdata1)
#the model is predicted over a geographic space
bc_mod <- predict(predictors, bc, progress='')

#let's test it using CalibratR
require(CalibratR)
#extracting model probabilities to presence and absence points (those are actually from a separate dataset)
points_pres1 <- extract(bc_mod, points_pres1, cellnumbers=TRUE)
points_abs1 <- extract(bc_mod, points_abs1, cellnumbers=TRUE)
#prepare those data to test the model 
testECE <- c(rep(1, nrow(points_pres1)), rep(0, nrow(points_abs1)))
testECE <- data.frame(cbind(testECE, rbind(points_pres1, points_abs1)))
testECE <- na.omit(testECE)
testECE <- subset(testECE, select = c(testECE, layer))
#make Expected Calibration Error
ECE <- getECE(testECE$testECE, testECE$layer, n_bins = 10)
#make Maximum Calibration Error
MCE <- getMCE(testECE$testECE, testECE$layer, n_bins = 10)
#some other test
require(Metrics)
#get RMSE values
RMSE <- rmse(testECE$testECE, testECE$layer)

random_forest_1 <- data.frame(mget(c('ECE', 'RMSE', 'MCE')))
rownames(random_forest_1) <- "random_forest1"

Then I would like to run the same model, but using different background points. So I make another input file, with another 1000 random points sampled from the entire dataset:

absvals_2 <- absvals[sample(nrow(absvals), 1000), ]
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_2)))
sdmdata2 <- data.frame(cbind(pb, rbind(presvals, absvals_2)))
sdmdata2 <- na.omit(sdmdata2)

model_rf2 <- pb ~ PCA1 + PCA2
bc <- randomForest(model_rf2, data=sdmdata2)
bc_mod <- predict(predictors, bc, progress='')

#again, let's test it using CalibratR
points_pres2 <- extract(bc_mod, points_pres2, cellnumbers=TRUE)
points_abs2 <- extract(bc_mod, points_abs2, cellnumbers=TRUE)
# everything just as above, the objects are overwritten
testECE <- c(rep(1, nrow(points_pres2)), rep(0, nrow(points_abs2)))
testECE <- data.frame(cbind(testECE, rbind(points_pres2, points_abs2)))
testECE <- na.omit(testECE)
testECE <- subset(testECE, select = c(testECE, layer))
ECE <- getECE(testECE$testECE, testECE$layer, n_bins = 10)
MCE <- getMCE(testECE$testECE, testECE$layer, n_bins = 10)
RMSE <- rmse(testECE$testECE, testECE$layer)

random_forest_2 <- data.frame(mget(c('ECE', 'RMSE', 'MCE')))
rownames(random_forest_2) <- "random_forest2"

#And finally let's make a mean from ECE, MCE, and RMSE
rf_results <- rbind(random_forest_1, random_forest_2)
rf_results_mean <- apply(rf_results, 2, mean)
#and standard deviation
rf_results_sd <- apply(rf_results, 2, sd)

result <- rbind(rf_results_mean, rf_results_sd)
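
For reference, the column-wise aggregation at the end works like this on a toy metrics table (the `rf_toy` values below are made up purely for illustration):

```r
# toy stand-in for rf_results: one row of metrics per model run
rf_toy <- data.frame(ECE  = c(0.10, 0.20),
                     RMSE = c(0.40, 0.50),
                     MCE  = c(0.30, 0.10))

# apply(x, 2, f) runs f over the columns; the result is a named numeric vector
toy_mean <- apply(rf_toy, 2, mean)   # ECE 0.15, RMSE 0.45, MCE 0.20
toy_sd   <- apply(rf_toy, 2, sd)

rbind(toy_mean, toy_sd)
```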

In this example I made just 2 repetitions, but ideally I would like to make 10 or 100. How can I make this more elegant and automatic, rather than manually creating 100 objects?

Here's at least part of a solution for you using purrr and dplyr and iterating over lists. This would give the advantage of storing your samples and results in one dataframe.

In my example below I've used a randomly generated dataframe and a very simple function to demonstrate. I'll point out at the end how you might fit this to your own data and processes. I haven't tried this on your code and data above, as it's quite long and complex and would take a while to get my head around your methods. But hopefully you'll be able to see how to work this into your own process.

library(dplyr)
library(purrr)

# step 1: create a function that takes a dataframe and returns a dataframe
calculate_mean_sd <- function(df){
  tibble(
    mean_lat = mean(df$lat),
    sd_lat = sd(df$lat),
    mean_long = mean(df$long),
    sd_long = sd(df$long)
  )
}

# random dataframe with all values you'd want to use (i.e. your `absvals` above)
full_df <- tibble(
  id = 1:100000,
  lat = runif(100000, 0, 100),
  long = runif(100000, 0, 100)
)

# step 2: create an empty list with the number (100) of loops you want to do
df <- as.list(1:100) %>% 
  map(~ tibble(iteration = .x))  # makes the iteration number into a dataframe to use later

# step 3: for each of 100 rows take a sample of a specified size and add to list as a dataframe
samples <- df %>%
  map( ~ mutate(.x, sample = list(full_df[sample(nrow(full_df), 100),])))

# step 4: iterate over list and pass your dataframes to the function, add results to new column
results <- samples %>% 
  map_df( ~ mutate(.x, results = list(bind_cols(.x[1], calculate_mean_sd(.x$sample[[1]]))))) 

# final, optional step: output a dataframe with iteration labelled and results
results$results %>% 
  reduce(bind_rows)

In your data above, you'd want to use absvals[sample(nrow(absvals), 1000), ] to sample your data in step 3, and then put the parts after this into a function which returns a dataframe with the columns you need.
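
Applied to your data, the adaptation could look roughly like the sketch below. `run_iteration` is a hypothetical placeholder: in practice its body would hold your `randomForest` fit and the ECE/MCE/RMSE steps, but here it returns dummy metrics so the iteration pattern itself is runnable, and `absvals` is a fake matrix standing in for your real one:

```r
library(dplyr)
library(purrr)

# hypothetical stand-in for the full fit-and-test pipeline;
# replace the body with your randomForest fit plus getECE/getMCE/rmse
run_iteration <- function(sdmdata) {
  tibble(ECE = runif(1), MCE = runif(1), RMSE = runif(1))
}

# fake background-values matrix so the example is self-contained
absvals <- matrix(runif(20000 * 2), ncol = 2,
                  dimnames = list(NULL, c("PCA1", "PCA2")))

results <- map_df(1:10, function(i) {
  sampled <- absvals[sample(nrow(absvals), 1000), ]  # fresh background sample
  run_iteration(as.data.frame(sampled)) %>%
    mutate(iteration = i)
})

# mean and sd of each metric across the 10 runs
summary_stats <- results %>%
  summarise(across(c(ECE, MCE, RMSE), list(mean = mean, sd = sd)))
```

`map_df` row-binds the per-iteration tibbles for you, so there is no need to create `random_forest_1`, `random_forest_2`, … by hand.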

This perhaps falls short of a full answer, but hopefully it offers some helpful steps for iterating with purrr.

Edit: P.S. please do comment with any questions or parts that aren't working and I'll see if I can add any clarifications or notes above.