Using open_dataset fails when a partition field's values look numeric but the column stored in the files is character:
Reprex:
td <- tempfile()
dir.create(td)
dat <- data.frame(fieldA=rep(as.character(1:3), each=3), fieldB=1:9)
for (d in unique(dat$fieldA)) {
  dir.create(file.path(td, paste0("fieldA=", d)), showWarnings = FALSE)
  arrow::write_parquet(subset(dat, fieldA == d), file.path(td, paste0("fieldA=", d), "1.pq"))
}
list.files(td, recursive=TRUE)
# [1] "fieldA=1/1.pq" "fieldA=2/1.pq" "fieldA=3/1.pq"
If I had instead used arrow::write_dataset, fieldA would have been omitted from the parquet files themselves, which may be the only way I have around this ... but for the sake of the reprex, assume that the partitioning variable is still present in the data.
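For comparison, a sketch of that write_dataset behavior (the part file names shown are just illustrative; the exact names arrow chooses may differ):
td2 <- tempfile()
# write_dataset() encodes fieldA only in the directory names and drops it
# from the files themselves
arrow::write_dataset(dat, td2, partitioning = "fieldA")
list.files(td2, recursive = TRUE)
# e.g. "fieldA=1/part-0.parquet" "fieldA=2/part-0.parquet" "fieldA=3/part-0.parquet"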
arrow::open_dataset(file.path(td, "fieldA=1")) |>
dplyr::collect()
# # A tibble: 3 × 2
# fieldA fieldB
# <chr> <int>
# 1 1 1
# 2 1 2
# 3 1 3
arrow::open_dataset(td) |>
dplyr::collect()
# Error in arrow::open_dataset(td) :
# Type error: Unable to merge: Field fieldA has incompatible types: string vs int32
The class of fieldA is character going into the write, and when a single file is read it retains that class. If we write the "normal way", where fieldA is omitted from the target parquet file, then its values are inferred from the directory names and it is read in as integer. That is the collision behind the error above: the files say fieldA is string, while partition inference says it is int32, and the two schemas cannot be merged.
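Checking that inference (continuing from the td2 sketch above):
arrow::open_dataset(td2) |>
  dplyr::collect()
# fieldA comes back as <int>, inferred from the "fieldA=1" directory names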
Because the real data has a number of columns, I would prefer to use the schema already defined in the data, and I don't want to have to define the schema for all columns by hand. If I try to override just one with schema = arrow::schema(fieldA = arrow::string()), then that's the only column retrieved: the schema argument replaces the whole schema rather than amending it.
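One workaround shape I can imagine (untested; I have not confirmed it avoids the merge error) is to pull the complete schema from a single partition, where fieldA is read correctly as string, and pass it back in whole:
# the schema of one partition already has every column typed correctly
sch <- arrow::open_dataset(file.path(td, "fieldA=1"))$schema
arrow::open_dataset(td, schema = sch) |>
  dplyr::collect()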
There are times when it is necessary to read individual files instead of relying on arrow's nice lazy filtering on a connection, so I'd like to be able to read an individual file without losing the partition-directory variable (which would happen if I used write_dataset).
Am I stuck with removing the partitioning variable from the data? Is there another way to specify the type of a partition field without having to spell out the schema for every field?
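For reference, this is the shape of what I'm hoping for, sketched with arrow's hive_partition() (untested; I don't know whether an explicit partitioning spec wins over, or still collides with, the copy of fieldA stored inside the files):
arrow::open_dataset(
  td,
  # declare only the partition field's type, leave the rest to the files
  partitioning = arrow::hive_partition(fieldA = arrow::string())
) |>
  dplyr::collect()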