I'm using the thredds package to get a DatasetNode from a THREDDS Data Server, opening it with nc_open, and reading a subset of the data with ncvar_get by specifying start and count. Reproducible example below:

library(thredds)
library(ncdf4)

Top <- CatalogNode$new("https://oceanwatch.pifsc.noaa.gov/thredds/catalog.xml") 
DD <- Top$get_datasets() 
dnames <- names(DD)
dname <- dnames[4] # "Chlorophyll a Concentration, Aqua MODIS - Monthly, 2002-present. v.2018.0"   
D <- DD[[dname]]

dl_url <- file.path("https://oceanwatch.pifsc.noaa.gov/thredds/dodsC", D$url)
dataset <- nc_open(dl_url)

dataset_lon <- ncvar_get(dataset, "lon") # Get longitude values
dataset_lat <- ncvar_get(dataset, "lat")  # Get latitude values
dataset_time <- ncvar_get(dataset, "time") # get time values in tidy format

# specify lon/lat boundaries for data subset:
lonmin <- 160
lonmax <- 161
latmin <- -1
latmax <- 0

LonIdx <- which(dataset_lon >= lonmin & dataset_lon <= lonmax)
LatIdx <- which(dataset_lat >= latmin & dataset_lat <= latmax)

# read the data for first 10 timesteps:
dataset_array <- ncvar_get(dataset, varid = "chlor_a",
  start = c(findInterval(lonmin, dataset_lon), findInterval(latmax, sort(dataset_lat)), 1),
  count = c(length(LonIdx), length(LatIdx), 10), verbose = TRUE)
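For reference, the ncdf4 handle already carries per-variable metadata before any data is read. A small sketch of pulling that out (describe_nc_var is a hypothetical helper, not part of ncdf4; it assumes the prec and varsize fields of ncdf4's ncvar4 objects, which report the on-disk storage type and the full dimension lengths):

```r
# Hedged sketch: inspect a variable's storage metadata from an open
# ncdf4 handle without downloading any data values.
describe_nc_var <- function(nc, varid) {
  v <- nc$var[[varid]]          # ncvar4 object for this variable
  list(prec = v$prec,           # on-disk storage type, e.g. "float"
       varsize = v$varsize)     # full dimension lengths (lon, lat, time)
}

# e.g. describe_nc_var(dataset, "chlor_a")
```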


Is there a way to calculate the approximate size of the data for the ncvar_get call before reading it?


Many thanks to both @michael-delgado and @robert-wilson for the above. I've edited the original post to include a reproducible example and answered my own question in case it helps anyone else later down the line.

If I understand correctly, the values in this dataset are stored as 32-bit floats (4 bytes per value). Using the example Aqua MODIS Chlorophyll dataset in the post above:

An upper bound on the data size (assuming no NA values) before downloading with ncvar_get would be 23,040 bytes:

(length(LonIdx) * length(LatIdx) * 10) * 4 # based on 10 time steps 

which is confirmed by the dimensions of the data after downloading:

(dim(dataset_array)[1] * dim(dataset_array)[2] * dim(dataset_array)[3]) * 4
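The same arithmetic can be wrapped in a small helper that reads the storage type from the ncdf4 metadata rather than hard-coding 4 bytes. A hedged sketch, not a definitive implementation: estimate_nc_bytes and bytes_per_value are hypothetical helpers, not part of ncdf4, and they assume the prec field of ncdf4's ncvar4 objects reports the variable's on-disk type:

```r
# Hedged sketch: estimate the uncompressed byte size of an ncvar_get()
# request from the requested count and the variable's stored precision.
bytes_per_value <- function(prec) {
  switch(prec,
         "double" = 8,
         "float"  = 4,
         "int"    = 4,
         "short"  = 2,
         "byte"   = 1,
         "char"   = 1,
         stop("unknown precision: ", prec))  # fall through for odd types
}

estimate_nc_bytes <- function(nc, varid, count) {
  prec <- nc$var[[varid]]$prec   # e.g. "float" for chlor_a
  prod(count) * bytes_per_value(prec)
}

# e.g. estimate_nc_bytes(dataset, "chlor_a",
#                        c(length(LonIdx), length(LatIdx), 10))
# reproduces the 23,040-byte upper bound for the subset above
```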

Writing the output array to disk produces a 20,444 byte file:

dataset_output <-  as.data.frame.table(dataset_array)
saveRDS(dataset_output, "dataset_output.rds")

which is close to the calculated upper limit (23,040 bytes). For me this approach is useful for obtaining an upper limit and an approximate size before downloading the data with ncvar_get. Many thanks to both of you.

(Out of interest, excluding NA values in the above example leaves 4,559 of 5,760 cells: sum(!is.na(dataset_array)) * 4 gives 18,236 bytes, smaller than the actual file size of 20,444 bytes.)