Downloading NOAA data


I'm trying to download NOAA data using the rnoaa package and I'm running into a bit of trouble.

I took a vector from a dataframe and it looks like this:

df <- dataframe$ghcnd  ## grabbing the necessary column

This gives me an output like:

[1] "GHCND:US1AKAB0058" "GHCND:US1AKAB0015" "GHCND:US1AKAB0021" "GHCND:US1AKAB0061"
 [5] "GHCND:US1AKAB0055" "GHCND:US1AKAB0038" "GHCND:US1AKAB0051" "GHCND:US1AKAB0052"
 [9] "GHCND:US1AKAB0060" "GHCND:US1AKAB0065" "GHCND:US1AKAB0062" "GHCND:US1AKFN0016"
[13] "GHCND:US1AKFN0018" "GHCND:US1AKFN0015" "GHCND:US1AKFN0011" "GHCND:US1AKFN0013"
[17] "GHCND:US1AKFN0030" "GHCND:US1AKJB0011" "GHCND:US1AKJB0014" "GHCND:US1AKKP0005"
[21] "GHCND:US1AKMS0011" "GHCND:US1AKMS0019" "GHCND:US1AKMS0012" "GHCND:US1AKMS0020"
[25] "GHCND:US1AKMS0018" "GHCND:US1AKMS0014" "GHCND:US1AKPW0001" "GHCND:US1AKSH0002"
[29] "GHCND:US1AKVC0006" "GHCND:US1AKWH0012" "GHCND:US1AKWP0001" "GHCND:US1AKWP0002"
[33] "GHCND:US1ALAT0014" "GHCND:US1ALAT0013" "GHCND:US1ALBW0095" "GHCND:US1ALBW0087"
[37] "GHCND:US1ALBW0020" "GHCND:US1ALBW0066" "GHCND:US1ALBW0031" "GHCND:US1ALBW0082"
[41] "GHCND:US1ALBW0099" "GHCND:US1ALBW0040" "GHCND:US1ALBW0004" "GHCND:US1ALBW0085"
[45] "GHCND:US1ALBW0009" "GHCND:US1ALBW0001" "GHCND:US1ALBW0094" "GHCND:US1ALBW0013"
[49] "GHCND:US1ALBW0079" "GHCND:US1ALBW0060"

In reality, I have about 22,000 weather stations. This is just showing the first 50.

rnoaa code

library(rnoaa)
options("noaakey" = Sys.getenv("noaakey"))
Sys.getenv("noaakey")

weather <- ncdc(datasetid = 'GHCND', stationid = df, var = 'PRCP', startdate = "2020-05-30",
                enddate = "2020-05-30", add_units = TRUE)

Which produces the following error: Error: Request-URI Too Long (HTTP 414)

However, when I subset df to, say, the first 100 entries, I still can't get data for more than the first 25 stations, even though the package details say I should be able to run 10,000 queries a day.

Loop Attempt

df1 <- df[1:125]  ## subsetting the vector; the full set is too big for one request

for (i in 1:length(df1)) {
  weather2 <- ncdc(datasetid = 'GHCND', stationid = df1[i], var = 'PRCP',
                   startdate = '2020-06-30', enddate = '2020-06-30',
                   add_units = TRUE)
}

But this just produces a data frame of a single row, that row being the 125th weather station.

If anyone could give advice on what to try next, that would be great :)

Also, cross-linked: https://discuss.ropensci.org/t/rnoaa-getting-county-level-rain-data/2403


2 Answers

Accepted Answer

Figured it out, with a lot of help from @Dave2e and a bud on the ropensci link above.

library(dplyr)  ## for bind_rows()

df <- cleaned_emshr$ghcnd  ## grabbing the necessary column

## split the station IDs into chunks of at most 100,
## matching the `limit` passed to ncdc() below
z <- split(df, ceiling(seq_along(df) / 100))
out <- list()
for (i in seq_along(z)) {
  out[[i]] <- ncdc(datasetid = 'GHCND', stationid = z[[i]], var = 'PRCP',
                   startdate = "2020-05-30", enddate = "2020-05-30",
                   add_units = TRUE, limit = 100)
}

## pull the $data element from each response and stack them
weather <- bind_rows(lapply(out, "[[", "data"))
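The chunking step can be checked on its own, without an API key. A minimal sketch with made-up station IDs (the `ids` vector below is hypothetical, just to show the bucket sizes):

```r
# split() buckets the IDs into groups of at most 100;
# ceiling(seq_along(ids) / 100) produces the group labels 1, 1, ..., 2, 2, ...
ids <- sprintf("GHCND:US1AKAB%04d", 1:250)  # 250 made-up station IDs
chunks <- split(ids, ceiling(seq_along(ids) / 100))

length(chunks)   # 3 chunks
lengths(chunks)  # 100 100 50
```

Each chunk is then small enough to fit in one request URL, which avoids the HTTP 414 error.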
Second Answer

In your loop attempt, weather2 is overwritten on each iteration of the loop.

Since the number of requests and the length of each return is unknown, one way to solve this problem is to wrap the call to ncdc inside an lapply statement and save each response in a list. Then, at the end, merge all of the responses into one large data frame.

library(rnoaa)
library(dplyr)

stationlist <- ghcnd_stations() %>% filter(state == "DE")
df <- paste0("GHCND:", stationlist$id[1:10])

## request data for each station and store the individual results in a list
output <- lapply(df, function(station) {
  weather <- ncdc(datasetid = 'GHCND', stationid = station, var = 'PRCP',
                  startdate = "2020-05-30", enddate = "2020-05-30",
                  add_units = TRUE)
  # weather$data                  # the records alone
  # to include the metadata as well:
  data.frame(t(unlist(weather$meta)), weather$data)
})

## merge into one data frame
answer <- bind_rows(output)
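The `data.frame(t(unlist(...)), ...)` idiom can be seen in isolation with a mock response (the `mock` list below is made up; a real `ncdc()` response has the same `$meta`/`$data` shape):

```r
# a fake ncdc()-style response: a metadata list plus a data frame of records
mock <- list(
  meta = list(totalCount = 2, pageCount = 25, offset = 1),
  data = data.frame(station = c("GHCND:A", "GHCND:B"), value = c(10, 20))
)

# t(unlist(meta)) flattens the metadata into a one-row matrix, which
# data.frame() then recycles alongside every record row
combined <- data.frame(t(unlist(mock$meta)), mock$data)
names(combined)  # "totalCount" "pageCount" "offset" "station" "value"
nrow(combined)   # 2
```

So every record row carries the request metadata with it, which is handy after `bind_rows()` stacks responses from many stations.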

I would verify this process on a small subset of stations, as the calls to NOAA can be slow. I would also try to reduce the number of stations searched to the area of interest and to the ones still actively collecting data.

Also, concerning the record limit, from the help page: "Note that the default limit (no. records returned) is 25. Look at the metadata in $meta to see how many records were found. If more were found than 25, you could set the parameter limit to something higher than 25."
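Putting that together, a hedged sketch (it needs a valid NOAA key to actually run; `df1` is the station vector from the question) that raises `limit` and inspects `$meta` to see whether more paging is needed:

```r
library(rnoaa)

# raise the per-request limit above the default of 25
# (the NCDC web service caps a single request at 1000 records)
res <- ncdc(datasetid = 'GHCND', stationid = df1[1:25], var = 'PRCP',
            startdate = '2020-05-30', enddate = '2020-05-30',
            add_units = TRUE, limit = 1000)

# total matching records; if this exceeds the limit, page through the
# remainder with the `offset` argument on follow-up calls
res$meta$totalCount
```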