I have a number of .csv files. Certain columns appear in all files, but some others only appear in some files. There are also many other columns I don't want.
I'm using R sqldf to iterate through the .csv files, select the desired columns, and write it to a new .csv file. However, I don't know how to tell sqldf that some columns may not be found in some of the .csv files, in which case I want it to create it anyway (even though it will be blank).
In the example below, the column userlocation1
only appears in some of the .csv files. So the code doesn't work if a file doesn't have it.
I've tried try-catch but that just caused that some files weren't processed at all.
for (fileNumber in fileNumbers){
newfileName <- paste(sub(paste("\\.", sep = ""), "", fileNames[fileNumber]),
"-clean", ".csv", sep = "")
tx <- paste(readLines(fileNames[fileNumber], encoding="UTF-8"), collapse="%%%")
tx_as_table <- fread(text = tx, header = TRUE, sep = ",", nrows = 501,
stringsAsFactors = FALSE, fill = TRUE, select = 1:50)
dfmediaurl <- sqldf('select created_at, text, id_str, in_reply_to_status_id,
username1, userscreen_name1, userlocation1 from tx_as_table')
write.table(x = dfmediaurl, file = newfileName, append = FALSE, quote = TRUE, sep
= ";", row.names = FALSE, col.names = TRUE)
}
To refer to all columns use
*
as shown in this example with the builtinanscombe
data frame:or to show only certain columns such that the table may not have some of them then
intersect
the column names with those actually in the table creating a character vector of column names to use, convert that to a comma separated string usingtoString
and insert that into the SQL statement. Note thatfn$
must prefixsqldf
to enable string interpolation.verbose=TRUE
will also display the SQL statement that is run (i.e. it will also show the output lines that begin withsqldf:
below -- note that the second lastsqldf:
line in the output displays the generated SQL statement):