I am working with a NetCDF file in R using the ncdf4 library. I want to subset the file based on specific conditions. The file contains oceanographic data (temperature) with dimensions for longitude, latitude, depth, and time. My goal is to subset the data by longitude and depth, considering west (>= 37), south (< 37), shelf (<= 266.04314), and slope (above 266.04314). Additionally, I want to calculate the mean value of temperature based on each of these factors and based on time as well. The file is from Copernicus (https://data.marine.copernicus.eu/product/GLOBAL_MULTIYEAR_PHY_001_030/description)
I'm facing the challenge of handling a large dataset, and I'm wondering if there's a more efficient way to subset the NetCDF file based on my conditions and calculate the mean temperature for each region (west shelf, south shelf, west slope, south slope) without having to load the entire dataset into a DataFrame. Any insights on optimizing this process would be greatly appreciated.
Thank you!
Here is my code:
library(lubridate)
library(tidyverse)
library(ncdf4)
# OPEN the NCD file
ncfname <- paste("path/Temp_data_1993-01-02.nc") # path of the file and file name
ncin <- nc_open(ncfname)
# Names of the variables
names(ncin$var) # thetao
# Names of the dimensions
names(ncin$dim) # depth # latitude #longitude #time
# Get coordinates, time and depth variables
longitude <- ncin$dim[[3]]$vals
latitude <- ncin$dim[[2]]$vals
time <- ncin$dim[[4]]$vals
depth <- ncin$dim[[1]]$vals
# Get the time variable
time
nt <- dim(time)
nt # 2 days
# Get the time units
t_units <- ncatt_get(ncin, "time", "units")
t_units # its in hours since 1950-01-01
# convert time -- split the time units string into fields
t_ustr <- strsplit(t_units$value, " ")
t_dstr <- strsplit(unlist(t_ustr)[3], "-")
date <- ymd(t_dstr) + dhours(time)
date
# Get the oceanographic variable
#sea_water_potential_temperature
T_array <- ncvar_get(ncin,ncin$var[[1]])
T_vector <- as.vector(T_array)
# ----- How I would subset it in a dataframe ----#
# Get all the variables into a dataframe
longitude <- as.matrix(longitude)
latitude <- as.matrix(latitude)
depth <- as.matrix(depth)
coords <- expand.grid(latitude, longitude, depth, date)
coords <- data.frame(coords)
data_fram <- cbind(coords, T_vector)
# Create bathymetric factor (slope and shelf)
data_fram$bath <- factor(ifelse(data_fram$depth <=266.04314, "shelf", "slope"))
# Create coast line factor (west and south)
data_fram$coast <- factor(ifelse(data_fram$longitude <=37, "south", "west"))
# Create the area factor (bathymetric + coast)
data_fram$area <- paste(data_fram$bath,"_",data_fram$coast)
# Extract the mean
data_mean_area <- aggregate(Temp ~ area * date, FUN = mean, data = data_fram)