I am trying to read a text file (https://www.bls.gov/bdm/us_age_naics_00_table5.txt) into R, but I am not sure how to go about parsing it. As you can see, the column names (years) are not located all on the same row, and the space between data is not consistent from column to column. I am familiar with using read.csv() and read.delim(), but I'm not sure how to go about reading a complex file like this one.

1

There are 1 best solutions below

2
On

Here is a manual parse:

require(readr)
string = read_lines(file="https://www.bls.gov/bdm/us_age_naics_00_table5.txt")
string = string[nchar(string) != 0]
string = string[-c(1,2)]  # don't contain information
string = string[string != " "]
string = string[-151]     # footnote
sMatrix = matrix(string, nrow = 30)
dfList = sapply(1:ncol(sMatrix), function(x) readr::read_table(paste(sMatrix[,x])))
df = do.call(cbind,dfList)
df = df[,!duplicated(colnames(df))] # removes columns with duplicate names

If you then want to recode "_" as NA, and format the numbers:

df[df == "_"] = NA
df = as.data.frame(sapply(df, function(x) gsub(",","",x)))
i <- apply(df, 2, function(x) !any(is.na(as.numeric(na.omit(x))))) # if a column can be converted to numeric without any NAs, e.g. column 1 can't
df[,i] = lapply(df[,i], as.numeric)