sf::st_distance in a for loop inside a foreach parallelisation

I have a data frame of buildings, each with an attached spatial (sf) point geometry. I also have a second data set of, e.g., forests, which is structured similarly but with polygon geometries.

Total_df:

Id  Variables  SFC Point object            Zip codes
1   10         POINT (543611.8 6389285)    2324
2   15         POINT (513611.8 6349285)    2324
3   12         POINT (533611.8 6359285)    2329

Total_df has about 2 million observations.

forest_distance:

Id Variables SFC Polygon object
1 10 POLYGON Z ((455302.7 6252026 9.09, 455292.6 6252034 9.09, 455274.8 6252036 9.9, 455246 6252113 14.25, 455286.1 6252124 14.15, 455293.5 6252126 14.13, 455317.8 6252068 14.13, 455331.5 6252073 14.13, 455345.5 6252044 14.78, 455302.7 6252026 9.09))

forest_distance is stored as a list (list.dfs in the code below), in which the original forest_distance has been divided into 10 equal parts.

I have figured out how to calculate the distances, and I have also split up Total_df so that the calculation runs on smaller subsets defined by zip code.

To speed up the calculation, I now want to parallelise it and also split forest_distance into smaller parts, so that each parallel session handles one part of the subdivided forest_distance.

Also, is it possible to print from the different sessions, to see the progress?

registerDoParallel(cores = 6)

# Use foreach to loop over list.dfs in parallel
foreach(d = 1:length(list.dfs), .packages = "sf", .combine = 'c') %dopar% {
  # Get the data frame at position 'd' in the list
  df <- list.dfs[[d]]
  
  # Open a list to store combined inner results 
  grand_list <- list()
  
  # Initialize an empty list to store the results of the inner loop
  inner_results <- list()
  
  # zip_code 
  zipcode <- sort(unique(Total_df$zipcode))
  

  # Use a regular for loop to iterate over zipcode
  for(i in zipcode) {
    cat(i, "\n")
    start_time <- Sys.time()
    
    # Subset the data
    subset_df <- Total_df[Total_df$zipcode == i, ]
    
    if(nrow(subset_df) > 0) {
      # Calculate distances
      distances <- sf::st_distance(subset_df, df)
      
      # Define the 'miin' function, or replace it with an appropriate function
      miin <- function(x) min(x, na.rm = TRUE)
      
      # Calculate minimum distances
      min_distances <- apply(distances, 1, miin)
      
      # Store minimum distances in a new column
      subset_df$min_distances <- min_distances
    }
    
    end_time <- Sys.time()
    print(paste("Time for municipality Forest", i, ": ", end_time - start_time))
    
    # Store the updated subset_df in the inner_results list
    inner_results[[i]] <- subset_df
  }
  
  # Combine the results of the inner loop using do.call
  grand_list[[d]] <- do.call(rbind, inner_results)
  
}

It ran for many hours and I had to stop it, but it had not saved any results along the way.

Answer by Sirius

This is untested, but a rewrite along these lines might work:


library(doParallel)  # also attaches foreach and parallel
registerDoParallel(cores = 6)

# Use foreach to loop over list.dfs in parallel
grand_list <- foreach(df = list.dfs, .packages = "sf") %dopar% {

  # Initialize an empty list to store the results of the inner loop
  inner_results <- list()

  # zip_code
  zipcode <- sort(unique(Total_df$zipcode))


  # Use a regular for loop to iterate over zipcode
  for(i in zipcode) {
    cat(i, "\n")
    start_time <- Sys.time()

    # Subset the data
    subset_df <- Total_df[Total_df$zipcode == i, ]

    if(nrow(subset_df) > 0) {
      # Calculate distances
      distances <- sf::st_distance(subset_df, df)

      # Define the 'miin' function, or replace it with an appropriate function
      miin <- function(x) min(x, na.rm = TRUE)

      # Calculate minimum distances
      min_distances <- apply(distances, 1, miin)

      # Store minimum distances in a new column
      subset_df$min_distances <- min_distances
    }

    end_time <- Sys.time()
    print(paste("Time for municipality Forest", i, ": ", end_time - start_time))

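    # Store the updated subset_df in the inner_results list, keyed by name:
    # a numeric zip code used as a position would pad the list with NULLs
    inner_results[[as.character(i)]] <- subset_df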
  }

  # Combine the results of the inner loop using do.call
  do.call(rbind, inner_results)

}

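(The cat() and print() calls may not show anything, though, since output from parallel worker processes is often not passed on to the main R console.)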
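One way to still see progress is to have every worker append a line to a log file instead of printing. The following is a minimal sketch in base R only; `log_progress` and the log file name are hypothetical names chosen for illustration, not part of foreach or doParallel. It is demonstrated in a plain loop here, and the same call would replace `cat(i, "\n")` inside the %dopar% body.

```r
# Hypothetical helper (not part of foreach or doParallel): append one
# timestamped line, tagged with the process id, to a shared log file.
log_progress <- function(msg, log_file = "progress.log") {
  line <- sprintf("[%s] pid %d: %s",
                  format(Sys.time(), "%H:%M:%S"), Sys.getpid(), msg)
  cat(line, "\n", sep = "", file = log_file, append = TRUE)
  invisible(line)
}

# Stand-alone demonstration with a temporary file and two example zip codes.
log_file <- tempfile(fileext = ".log")
for (i in c(2324, 2329)) {
  log_progress(paste("finished zip code", i), log_file)
}

# Inspect the log; during a long run the file can be opened or tailed instead.
log_lines <- readLines(log_file)
print(log_lines)
```

Short single-line appends from several processes usually end up intact in the file, but that is not guaranteed on every file system, so one log file per worker may be safer. Another option that is often suggested is to create the cluster explicitly with `parallel::makeCluster(6, outfile = "")` and pass it to `registerDoParallel()`; whether the worker output then becomes visible depends on the front end being used.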

Tip: Debug your code with %do% instead of %dopar%, and run only the first two elements of the list:

grand_list <- foreach(df = list.dfs[1:2], .packages = "sf") %do% { ... }

Fill this with debug statements etc. to your heart's content. When it works, remove [1:2] and change %do% back to %dopar%.
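Two points from the question are not covered by the loop itself: an interrupted run keeps nothing, and each forest part only yields the minimum distance to that part, not to all forests. The sketch below illustrates one possible way to handle both, assuming every part produces a numeric vector of minimum distances in the same building order. The folder name, file names, and the random placeholder distances are made up for illustration; in the real loop the vector would be the `min_distances` column.

```r
# Assumption: each forest part yields one numeric vector holding the minimum
# distance from every building to that part, in the same building order.
set.seed(1)
n_buildings <- 5
n_parts <- 3
out_dir <- file.path(tempdir(), "min_dist_parts")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# Stand-in for the foreach loop: write each part's result as soon as it is
# finished, so an interrupted run keeps the parts that were already done.
for (d in seq_len(n_parts)) {
  part_min <- runif(n_buildings, min = 0, max = 1000)  # placeholder distances
  saveRDS(part_min, file.path(out_dir, sprintf("part_%02d.rds", d)))
}

# Later, or after a restart: read whatever parts exist on disk.
files <- list.files(out_dir, pattern = "^part_[0-9]+\\.rds$", full.names = TRUE)
parts <- lapply(files, readRDS)

# The distance to the nearest forest overall is the elementwise minimum
# of the per-part minima.
overall_min <- do.call(pmin, parts)
print(overall_min)
```

With results on disk, a restarted run could also skip any part whose file already exists (for example by checking `file.exists()` before computing it), which avoids repeating hours of work.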