I have many txt files that contain the same type of numerical data in columns separated by ;. Some files have column headers with spaces and some don't (they were created by different people). Some also have extra columns that I don't want.
e.g. one file might have a header like:
ASomeName; BSomeName; C(someName%)
whereas another file header might be
A Some Name; B Some Name; C(someName%); D some name
How can I clean the spaces out of the names before I call a "read" command?
#These are the files I have
filenames <- list.files(pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
#These are the columns I would like:
colSelect=c("Date","Time","Timestamp" ,"PM2_5(ug/m3)","PM10(ug/m3)","PM01(ug/m3)","Temperature(C)", "Humidity(%RH)", "CO2(ppm)")
#This is how I read them if they have the same columns
ldf <- vroom::vroom(filenames, col_select = colSelect,delim=";",id = "sensor" )%>%janitor::clean_names()
Clean Headers script
I've written a destructive script that reads each file in full, cleans the spaces out of the header, deletes the file, and rewrites it under the same name (vroom sometimes complained that it could not open X thousands of files). This is not an efficient way of doing things.
cleanHeaders<-function(filename){
d<-vroom::vroom(filename,delim=";")%>%janitor::clean_names()
#print(head(d))
if (file.exists(filename)) {
#Delete file if it exists
file.remove(filename)
}
vroom::vroom_write(d,filename,delim = ";")
}
lapply(filenames,cleanHeaders)
fread's select parameter accepts integer indexes. If the desired columns are always in the same position, your job is done. I imagine vroom also has this capability, but since you are already selecting your desired columns, I don't think lazily evaluating your character columns would be helpful at all, so I advise you to stick to data.table.
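For the fixed-position case, a minimal sketch might look like the following. The helper name read_by_position and the default positions 1:9 are assumptions made for illustration; the question does not say where the wanted columns actually sit.

```r
library(data.table)

# Hypothetical helper: read only the columns at the given positions.
# `positions` is an assumption; adjust it to wherever the wanted columns are.
read_by_position <- function(path, positions = 1:9) {
  fread(path, sep = ";", header = TRUE, select = positions)
}
```

This only works if every file keeps the wanted columns at the same positions, which extra columns in some files can easily break.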
For a more robust solution though, since you have no control over the structure of the tables: you can read one row of each file, capture and clean the column names, and then match them against a clean version of your colSelect vector. (Here purrr can easily be replaced with traditional lapply calls; I opted for purrr because of its cleaner formula notation.)
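One way this could be sketched is shown below. The helpers normalise, read_selected and read_all are hypothetical names introduced for illustration, and the cleaning rule (drop everything except letters and digits, then lower-case) is an assumption standing in for whatever name cleaning you prefer. The header is read with readLines rather than a full parse, and columns are selected by position in file order before the canonical names are restored.

```r
library(data.table)
library(purrr)

col_select <- c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "PM10(ug/m3)",
                "PM01(ug/m3)", "Temperature(C)", "Humidity(%RH)", "CO2(ppm)")

# Hypothetical helper: keep only letters and digits, lower-cased, so that
# "PM2_5 (ug/m3)" and "PM2_5(ug/m3)" compare equal.
normalise <- function(x) tolower(gsub("[^[:alnum:]]", "", x))

# Hypothetical helper: read only the wanted columns of one file.
read_selected <- function(path, wanted = col_select) {
  # Read just the header line and split it on the delimiter.
  header <- strsplit(readLines(path, n = 1, warn = FALSE), ";", fixed = TRUE)[[1]]
  # Position of each wanted column in this file (NA when it is absent).
  idx <- match(normalise(wanted), normalise(header))
  found <- !is.na(idx)
  if (!any(found)) return(NULL)  # nothing to read; rbindlist skips NULLs
  ord <- order(idx[found])
  # Select by position in file order, then restore the canonical names.
  dt <- fread(path, sep = ";", header = TRUE, select = idx[found][ord])
  setnames(dt, wanted[found][ord])
  dt
}

# Hypothetical helper: read every file and stack the results, recording the
# source path in a "sensor" column as in the question.
read_all <- function(paths) {
  rbindlist(map(set_names(paths), ~ read_selected(.x)),
            use.names = TRUE, fill = TRUE, idcol = "sensor")
}

# Usage, assuming `filenames` is a character vector of paths:
# ldf <- read_all(filenames)
```

Because the files are never rewritten, this avoids the destructive clean-up step; columns missing from a given file come back as NA after the fill.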