Apply wildcard when reading multiple sheets from an .xls file using rio::import_list

430 Views Asked by At

I have been trying to write a code that will read selected sheets in an .xls file using import_list (rio package), then rbinding and converting them to a .csv file. So far, I have been able to make the code read either all the sheets in the xls file at the same time, sequential sheets (eg 1:4) but not selected sheets based on their names. Please help. I have attached the current iteration of my code and a part of my data.

library(rio)
d <- import_list("allTP.xls", which = (("201901,Xizhi,PM10")), setclass="data.table",range = "A2:Y33", rbind = T, col_names=T, rbind_label = "_file", rbind_fill = TRUE) 

Sample data

1

There are 1 best solutions below

7
On

The which argument has to be a vector of sheet names. That is, it needs to be in this format, using the c function:

which = c("201901,Xizhi,PM10", "201902,Xizhi,PM10", "201902,Wanli,PM10")

or whatever sheets you want to import.

The which argument doesn't appear to support regular expression, so importing sheets using wildcards is not possible. But you can import all sheets and then use the _file column to select the data to individually export (if you know which sheet number is which since the sheet names are not saved, only the number).

In your data, you have all the information in the first row, but you don't import this because you specify the range as "A2:Y33". If you select "A1:Y33" then you'll get "characters" for all columns, and you don't want that since the column names start in row 2.

One solution is to import all sheets, range="A2:Y33", as you did, and then repeat for range "A1:Y2", saving the result as d2. This range contains the information you need to select. You can then merge d with d2 on _file and export individual csv files based on the information in d2.

library(rio)

d1 <- import_list("sample_data.xlsx", 
                  setclass="data.table", 
                  range = "A2:Y33", rbind = T, 
                  col_names=T, 
                  rbind_label = "_file", 
                  rbind_fill = TRUE) 

For d2 we import only range "A1:Y1", then subset columns 1, 7, 13 and 26, and set the names appropriately.

d2 <- setNames(
  subset(
    import_list("sample_data.xlsx", 
                setclass="data.table", 
                range = "A1:Y1", rbind = T, 
                col_names=F, 
                rbind_label = "_file"), 
    select=c(1,7,13,26)), 
  c("Site","Param","YM","_file"))

Then merge the two datasets on the _file column.

d3 <- merge(d1, d2, by="_file")

Check the data.

table(d3$Site, d3$Param, d3$YM)
, ,  = YM: 2019/01                     
                     Param: AMB_TEMP(℃) Param: PM10(μg/m3)
  Site: Wanli                        31                 31
  Site: Xizhi                        31                 31

, ,  = YM: 2019/02                     
                     Param: AMB_TEMP(℃) Param: PM10(μg/m3)
  Site: Wanli                        31                 31
  Site: Xizhi                        31                 31

So there's eight sheets each containing 31 records. But we have to clean these names because if we use these for the filenames, R will complain.

d3$Site <- sub(".*: (.+)", "\\1", d3$Site)
d3$Param <- sub(".*: (.+)\\(.+", "\\1", d3$Param)
d3$YM <- sub(".*: (\\d{4})\\/(.+)", "\\1_\\2", d3$YM)

Now split on the interaction of all three contextual variables

f <- with(d3, split(d3, list(Site, Param, YM)))

And save to individual CSV files.

lapply(names(f), function(x) write.csv(f[[x]], file=paste(x, ".csv", sep="")))

dir()
[1] "Wanli.AMB_TEMP.2019_01.csv" "Wanli.AMB_TEMP.2019_02.csv" "Wanli.PM10.2019_01.csv"    
[4] "Wanli.PM10.2019_02.csv"     "Xizhi.AMB_TEMP.2019_01.csv" "Xizhi.AMB_TEMP.2019_02.csv"
[7] "Xizhi.PM10.2019_01.csv"     "Xizhi.PM10.2019_02.csv"

You can then easily choose to export only the data you want using a subset of f or d3.