R sqldf SELECT when col may be missing

107 Views Asked by At

I have a number of .csv files. Certain columns appear in all files, but some others only appear in some files. There are also many other columns I don't want.

I'm using R sqldf to iterate through the .csv files, select the desired columns, and write it to a new .csv file. However, I don't know how to tell sqldf that some columns may not be found in some of the .csv files, in which case I want it to create it anyway (even though it will be blank).

In the example below, the column userlocation1 only appears in some of the .csv files. So the code doesn't work if a file doesn't have it.

I've tried try-catch but that just caused that some files weren't processed at all.

for (fileNumber in fileNumbers){
  newfileName <- paste(sub(paste("\\.", sep = ""), "", fileNames[fileNumber]),
                       "-clean", ".csv", sep = "")
  tx <- paste(readLines(fileNames[fileNumber], encoding="UTF-8"), collapse="%%%")  
  tx_as_table <- fread(text = tx, header = TRUE, sep = ",", nrows = 501, 
                 stringsAsFactors = FALSE, fill = TRUE, select = 1:50)  
  dfmediaurl <- sqldf('select created_at, text, id_str, in_reply_to_status_id, 
                username1, userscreen_name1, userlocation1 from tx_as_table')
  write.table(x = dfmediaurl, file = newfileName, append = FALSE, quote = TRUE, sep 
              = ";", row.names = FALSE, col.names = TRUE)
  
}


1

There are 1 best solutions below

0
On

To refer to all columns use * as shown in this example with the builtin anscombe data frame:

sqldf("select * from anscombe limit 3")
##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71

or to show only certain columns such that the table may not have some of them then intersect the column names with those actually in the table creating a character vector of column names to use, convert that to a comma separated string using toString and insert that into the SQL statement. Note that fn$ must prefix sqldf to enable string interpolation. verbose=TRUE will also display the SQL statement that is run (i.e. it will also show the output lines that begin with sqldf: below -- note that the second last sqldf: line in the output displays the generated SQL statement):

# use those columns among x3, x4 and x5 that are in anscombe
cols <- toString(intersect(c("x3", "x4", "x5"), names(anscombe)))

fn$sqldf("select $cols from anscombe limit 3", verbose = TRUE)
## sqldf: library(RSQLite)
## sqldf: m <- dbDriver("SQLite")
## sqldf: connection <- dbConnect(m, dbname = ":memory:")
## sqldf: initExtension(connection)
## sqldf: dbWriteTable(connection, 'anscombe', anscombe, row.names = FALSE)
## sqldf: dbGetQuery(connection, 'select x3, x4 from anscombe limit 3')
## sqldf: dbDisconnect(connection)
##   x3 x4
## 1 10  8
## 2  8  8
## 3 13  8