How do I include empty columns when reading an xlsx spreadsheet into R?

4.3k Views Asked by At

I'm attempting to write an R script that will loop through about 100 small Excel spreadsheets and stack the 15-20 rows that have full cells (the exact number of rows varies in each small file) into one.

This code that I'm using (contained within the loop) reads the files in, but the columns that do not contain any entries are skipped.

newDf <- read.xlsx(filename,sheetName="Coding",header=FALSE,
    rowIndex=rows,colIndex=columns)

My current workaround is to create a dummy first dataframe with values in every column; when I then rbind each newDf to it, the columns are forced to line up. But I'm hoping there's a better way.

Any help would be appreciated. Thank you.

3

There are 3 best solutions below

0
On BEST ANSWER

If you're willing to use the XLConnect package rather than xlsx, the function XLConnect::readWorksheet allows you to specify the startCol, endCol, and autofitCol arguments, which would solve this issue for you:

library(XLConnect)
##
# wb1 <- loadWorkbook("~/tmp/tmp1.xlsx", create = FALSE)
# wb2 <- loadWorkbook("~/tmp/tmp2.xlsx", create = FALSE)
df1 <- readWorksheet(
  object = wb1,
  sheet = "sheet1",
  startCol = 1,
  endCol = 3,
  autofitCol = FALSE)
##
df2 <- readWorksheet(
  object = wb2,
  sheet = "sheet1",
  startCol = 1,
  endCol = 3,
  autofitCol = FALSE)
##
R> head(df1,3)
  A B Col3
1 1 6   NA
2 2 7   NA
3 3 8   NA
R> head(df2,3)
  A B  C
1 1 6 11
2 2 7 12
3 3 8 13

where df1 was read from a worksheet containing only two non-empty columns (see data below).


Data:

wb1 <- loadWorkbook("~/tmp/tmp1.xlsx", create = TRUE)
createSheet(wb1, "sheet1")
writeWorksheet(
  object = wb1,
  data = data.frame(
    A = 1:5,
    B = 6:10),
  sheet = "sheet1")
saveWorkbook(wb1)
##
wb2 <- loadWorkbook("~/tmp/tmp2.xlsx", create = TRUE)
createSheet(wb2, "sheet1")
writeWorksheet(
  object = wb2,
  data = data.frame(
    A = 1:5,
    B = 6:10,
    C = 11:15),
  sheet = "sheet1")
saveWorkbook(wb2)
0
On

I'm not sure why read.xlsx has this behaviour...

There are a couple of potential solutions. The first is to use the readColumns function in the XLSX package, instead of read.xlsx:

wb     <- loadWorkbook("c:/path/to/your/xlsx/file")
sheets <- getSheets(wb)
sheet <- sheets[["NAMEOFSHEET"]]
newDf <- readColumns(sheet, startColumn=1, endColumn=3, startRow=1, endRow=4)

Alternatively, you can continue reading the files as you are, using read.xlsx, and just add blank columns in afterwards:

if("COLUMN_THAT_SHOULD_BE_THERE" %in% colnames(newDf) == FALSE){
    newDf$COLUMN_THAT_SHOULD_BE_THERE <- NA
}
1
On

Maybe try using read.xlsx2?

I saved an Excel file with this format and named it test.xlsx:

    id blank   names
 1:  1          John
 2:  2         Jacob
 3:  3        Jingel
 4:  4        Heimer
 5:  5       Schmidt
 6:  6           Joe
 7:  7        Public
 8:  8          Jane
 9:  9           Doe
10: 10        Ramsey

And another where I simply deleted the column header, which i named test2.xlsx:

    id    names
 1:  1     John
 2:  2    Jacob
 3:  3   Jingel
 4:  4   Heimer
 5:  5  Schmidt
 6:  6      Joe
 7:  7   Public
 8:  8     Jane
 9:  9      Doe
10: 10   Ramsey

When I read them in with read.xlsx, blank is skipped only when it's not named:

> read.xlsx("test.xlsx",sheetIndex=1)
   id blank   names
1   1    NA    John
2   2    NA   Jacob
3   3    NA  Jingel
4   4    NA  Heimer
5   5    NA Schmidt
6   6    NA     Joe
7   7    NA  Public
8   8    NA    Jane
9   9    NA     Doe
10 10    NA  Ramsey

> read.xlsx("test2.xlsx",sheetIndex=1)
   id   names
1   1    John
2   2   Jacob
3   3  Jingel
4   4  Heimer
5   5 Schmidt
6   6     Joe
7   7  Public
8   8    Jane
9   9     Doe
10 10  Ramsey

However, if I use read.xlsx2 on test2.xlsx:

> read.xlsx2("test.xlsx",sheetIndex=1)
   id X.   names
1   1       John
2   2      Jacob
3   3     Jingel
4   4     Heimer
5   5    Schmidt
6   6        Joe
7   7     Public
8   8       Jane
9   9        Doe
10 10     Ramsey

Now it's read in as a factor column named X..

If I try and extend this to have a bunch of empty columns, here's what I get:

> read.xlsx2("test3.xlsx",sheetIndex=1)
   id X.   names X..1  names2 X..2  names3 X..3  names4
1   1       John         John         John         John
2   2      Jacob        Jacob        Jacob        Jacob
3   3     Jingel       Jingel       Jingel       Jingel
4   4     Heimer       Heimer       Heimer       Heimer
5   5    Schmidt      Schmidt      Schmidt      Schmidt
6   6        Joe          Joe          Joe          Joe
7   7     Public       Public       Public       Public
8   8       Jane         Jane         Jane         Jane
9   9        Doe          Doe          Doe          Doe
10 10     Ramsey       Ramsey       Ramsey       Ramsey

So the name creation by read.xlsx2 is predictable, as well. Also (I can't pin down where I heard this so I can't source it), I think it's good practice to use read.xlsx2 in general anyway--faster, etc.

I can also add this from ?read.xlsx:

The read.xlsx2 function does more work in Java so it achieves better performance (an order of magnitude faster on sheets with 100,000 cells or more). The result of read.xlsx2 will in general be different from read.xlsx, because internally read.xlsx2 uses readColumns which is tailored for tabular data.

So @user1578653's suggestion to use the underlying readColumns is covered by just using read.xlsx.