Getting weather data for multiple stations conditional for specific dates in R

253 Views Asked by At

I have the following problem:

For an analysis of weather effects on volunteers observing nature (animals, plants etc.) for a citizen science web page, I need to match the daily observations with the weather information of the nearest weather station. I'm using rdwd (for data of German weather service) and already managed to combine each observation location with the nearest weather station. So I now have a data frame (my_df_example) like this with 100 rows:

     ID      Date         lat     long      Station_id                   Stationname
   1317186439 2019-05-03 47.77411 9.540569        4094     Weingarten, Kr. Ravensburg
   -2117439060 2019-05-19 48.87217 9.396229       10510             Winterbach/Remstal
   -630183789 2019-04-30 48.86810 9.285427        4928      Stuttgart (Schnarrenberg)
   -390672435 2019-05-10 50.71187 8.706279        1639             Giessen/Wettenberg
   262182713 2019-05-01 50.82548 8.892961        3164 Coelbe, Kr. Marburg-Biedenkopf
   -373270631 2019-05-24 51.61666 7.950153        5480                           Werl

with dput(my_df_example):

   structure(list(ID = c(1317186439L, -2117439060L, -630183789L, -390672435L, 262182713L, -373270631L,...
   Datum = structure(c(1556841600, 1558224000, 1556582400, 1557446400, 1556668800, 1558656000, 1558224000, 1557532800,..., class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
   lat = c(47.7741093721703, 48.8721672952686, 48.8681024146134, 50.7118683229165, 50.8254843786222, 51.6166575725419, 48.7357007677785,...
   long = c(9.54056899481679, 9.3962287902832, 9.28542673587799, 8.70627880096436, 8.89296054840088, 7.95015335083008, 11.3105964660645,... 
   Stations_id = c(4094L, 10510L, 4928L, 1639L, 3164L, 5480L, 3484L,... 
   Stationsname = c("Weingarten, Kr. Ravensburg", "Winterbach/Remstal", "Stuttgart (Schnarrenberg)", "Giessen/Wettenberg", "Coelbe, Kr. Marburg-Biedenkopf", "Werl",... 
   row.names = c("58501", "89910", "69539", "24379", "45331", "77191", "50028", 
   class = "data.frame")

What I need to do now is get the weather information for each station on that specific date. I'm trying to use the rdwd package in R to do so. I tried two options so far, that both didn't work out.

Option 1:

    urls <- selectDWD(name=my_df_final$Stationsname, res="daily", var="kl", per="historical", outvec=TRUE)
    kl <- dataDWD(urls[1:100])

That gives me a list of 100 lists. Each list of the 100 includes the weather data for every recorded day of a certain station. So I would need to filter the data from those lists so that the date matches the dates in my_df_example. I don't know how to extract info from a list inside a list though.

Option 2:

   stat <- my_df_example$Stationname
   link <- selectDWD(c(stat), res="daily", var="kl", per="hist") 
  file <- dataDWD(link, read=FALSE)
  clim <- readDWD(file, varnames=TRUE)

The problem here is, that dataDWD doesn't work for lists. And since "link" includes multiple Station names it is not just a vector.

I don't really know if one of these options is the right way at all or if an alternative would make more sense.

Thank you for any advice you can give.

2

There are 2 best solutions below

2
On

According to your problem:

What I need to do now is get the weather information for each station on that specific date.

Then, once you have your list of lists (kl) then you can subset from this "meta"-list the information that you are looking for this way:

query <- lapply(kl, function(x) {
  x[which((as.Date(x$MESS_DATUM) %in% as.Date(my_final_df$Date)) &
           (x$STATIONS_ID %in% my_final_df$Station_id)), ]
})

x represents the object kl passed to the function definition. The %in% operator, as its letters indicate, will look for the elements in common between $MESS_DATUM and $Date variables and (&) also for the matches between STATIONS_ID and Station_id . which() ensures that no logical surprises occur while subsetting the data and as.Date() returns a common date format for both data frames.

After performing the extraction, you have to collapse the information into a single data frame. Since all the columns in all the lists inside the meta-list are the same, you can use do.call() + rbind() directly. Like:

query <- do.call(rbind,query)

To avoid messy rownames, call:

rownames(query) <- NULL

Then, to see the station names in the query data set, merge the query with my_final_df:

colnames(query)[1] <- "Station_id" # the key needs to have the samen name in both data frames
query <- merge(query,my_final_df, by = "Station_id", all = TRUE)

The final result looks like this:

   Station_id MESS_DATUM QN_3   FX  FM QN_4 RSK RSKF    SDK SHK_TAG  NM  VPM     PM  TMK   UPM  TXK TNK  TGK eor          ID       Date
2        1639 2019-05-01   10  7.1 2.0    3 0.0    0 11.383      NA 0.3  9.0 991.15 12.6 65.67 20.6 3.3 -0.4 eor  -390672435 2019-05-10
7        3164 2019-04-30   NA   NA  NA    3 0.0    0     NA       0  NA  8.9     NA 12.3 64.92 18.7 5.4  3.4 eor   262182713 2019-05-01
16       4094 2019-05-10   10 10.3 3.4    3 5.7    4  5.933      NA  NA 10.4     NA 11.9 76.04 16.8 8.5  6.8 eor  1317186439 2019-05-03
21       4928 2019-05-03   10 10.0 3.2    3 0.4    6  3.183      NA 7.5  9.0 973.66 10.4 72.38 14.2 7.8  7.3 eor  -630183789 2019-04-30
29       5480 2019-05-19   10 11.0 1.8    3 1.0    6  5.000      NA 7.2 13.0 995.10 14.0 82.38 21.8 6.8  5.2 eor  -373270631 2019-05-24
36      10510 2019-05-24   10  5.9 1.4   NA  NA   NA     NA      NA  NA   NA     NA   NA    NA   NA  NA   NA eor -2117439060 2019-05-19
        lat     long                    Stationname
2  50.71187 8.706279             Giessen/Wettenberg
7  50.82548 8.892961 Coelbe, Kr. Marburg-Biedenkopf
16 47.77411 9.540569     Weingarten, Kr. Ravensburg
21 48.86810 9.285427      Stuttgart (Schnarrenberg)
29 51.61666 7.950153                           Werl
36 48.87217 9.396229             Winterbach/Remstal

This data set matches the dates and station's ids and names you first provided in the my_df_example.

Provided more time, maybe someone will tell us how to solve this with tidyverse notation, because I suspect it is even more straightforward to do the subsetting-extraction algorithm with this package.

0
On

I would suggest a data.table solution:

library(data.table)

full = rbindlist(kl) # Convert list to one huge DF
setDT(my_df_final) # Convert your df to DT

new_df <- merge(my_df_final, full, by.x = c("ID", "Datum"), by.y = c("STATIONS_ID", "MESS_DATUM"), all.x = T) # Merge full and your df

new_df
      ID      Datum      lat     long Stations_id                   Stationsname QN_3   FX  FM QN_4 RSK RSKF   SDK
1:  1639 2019-05-10 50.71187 8.706279        1639             Giessen/Wettenberg   10  9.1 3.3    3 9.3    6 4.000
2:  3164 2019-05-01 50.82548 8.892961        3164 Coelbe, Kr. Marburg-Biedenkopf   NA   NA  NA    3 0.0    0    NA
3:  4094 2019-05-03 47.77411 9.540569        4094     Weingarten, Kr. Ravensburg   10  6.4 2.2    3 5.2    4 0.000
4:  4928 2019-04-30 48.86810 9.285427        4928      Stuttgart (Schnarrenberg)   10  7.9 2.7    3 0.0    6 3.583
5: 10510 2019-05-19 48.87217 9.396229       10510             Winterbach/Remstal   10 11.3 1.8   NA  NA   NA    NA
   SHK_TAG  NM  VPM     PM  TMK   UPM  TXK TNK TGK eor
1:      NA 6.6 10.2 985.16 11.1 78.21 15.9 7.7 5.9 eor
2:      NA  NA  9.7     NA 12.3 71.00 20.0 3.2 1.4 eor
3:      NA  NA 10.0     NA  8.7 88.92 11.6 5.3 3.0 eor
4:       0 4.9  9.3 981.55 10.5 75.58 15.3 7.3 3.7 eor
5:      NA  NA   NA     NA   NA    NA   NA  NA  NA eor

(should also work in base R, but is certainly faster this way)