Issues Reading the table in R

Question

215 Views Asked by Betty At 15 November 2014 at 00:56

I'm trying to read a text file into R with read.table. My data looks like the following:

a b c d e
Australia 1 2 4 3 2
United States 1 2 4 2 2

The problems with reading this table are that:

1) Line 1 has only 5 elements (a to e), whereas every row below it has 6. The first column is supposed to have a name such as "Country"; then a corresponds to the first number (1), b to 2, ..., and e to 2 (in the case of Australia). How do I add a name for the first column so that R doesn't raise the error "line 1 did not have 6 elements"?

2) "United States" is two words rather than one, so when R reads the data it puts "States" into the second column instead of reading "United States" as a single value.

(A friend advised me to use rownames. Does anyone know how to go about using rownames?)

How can I fix these issues and correctly read my data?

Thank you very much!!

There are 2 best solutions below

**akrun** · Answer 1 · 2014-11-15T05:47:11.077000

Assuming that the example data mimics the content in the file, we could read it using readLines and then use regex to separate the country names from the rest. The separated country names can be added as a new column.

lines <- readLines('Betty2.txt')
lines
#[1] "a b c d e"               "Australia 1 2 4 3 2"    
#[3] "United States 1 2 4 2 2"

dat <-  read.table(text=c(lines[1], gsub('[A-Za-z]+\\s+', '',
                lines[-1])), header=TRUE)

In the code above, gsub removes every run of letters followed by whitespace, i.e. the country names, leaving only the numbers:

 gsub('[A-Za-z]+\\s+', '',  lines[-1])
 #[1] "1 2 4 3 2" "1 2 4 2 2"

 dat1 <- data.frame(Country= gsub(" \\d+.*", '', lines[-1]),
                               dat, stringsAsFactors=FALSE)

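Similarly, the Country column is built by removing everything from the first space that is followed by a number (" \\d+") through the end of the line (.* matches zero or more characters), which leaves only the country names: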

 gsub(" \\d+.*", '', lines[-1])
 #[1] "Australia"     "United States"


dat1
#        Country a b c d e
#1     Australia 1 2 4 3 2
#2 United States 1 2 4 2 2
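
As a variation on the approach above, the split can be keyed on the first digit in each line rather than on runs of letters, which should also cope with names that contain punctuation. This is a minimal sketch and not part of the original answer: the `lines` vector is a hypothetical in-memory stand-in for `readLines('Betty2.txt')`, and it assumes that country names never contain digits.

```r
# Hypothetical stand-in for readLines('Betty2.txt')
lines <- c("a b c d e",
           "Australia 1 2 4 3 2",
           "United States 1 2 4 2 2")

body <- lines[-1]

# Everything before the first digit is the country name;
# trimws() drops the space left at the end
country <- trimws(sub("\\d.*$", "", body))

# Everything from the first digit onward is the numeric part
values <- sub("^\\D*", "", body)

# Parse the numbers under the original header, then attach the names
dat <- read.table(text = c(lines[1], values), header = TRUE)
dat1 <- data.frame(Country = country, dat, stringsAsFactors = FALSE)
dat1
```

Because the country names never pass through read.table here, spaces inside a name cannot be mistaken for field separators.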

**Rich Scriven** · Answer 2 · 2014-11-15T06:34:48.583000

Here's another possibility. This one wraps the first pair of adjacent words (two or more letters each) in each line in quotes, so that read.table reads them as a single field:

x <- readLines("your.txt")
x[1] <- paste("Country", x[1])
read.table(text=sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x), header=TRUE)
#         Country a b c d e
# 1     Australia 1 2 4 3 2
# 2 United States 1 2 4 2 2

With regard to @akrun's comment about countries containing more than two words, I think this will work:

x[4] <- 'Papua New Guinea 3 4 3 2 5'
xx <- sub("([A-Za-z]{2,}(\\s[A-Za-z]{2,})+)", "'\\1'", x)
read.table(text = xx, header = TRUE)
#            Country a b c d e
# 1        Australia 1 2 4 3 2
# 2    United States 1 2 4 2 2
# 3 Papua New Guinea 3 4 3 2 5

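It also occurred to me that the country names might be the row names for the data frame. If that's the case, skip the step that adds "Country" to the header: because the header then has one fewer field than the data rows, read.table uses the first column as the row names.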

x <- readLines("your.txt")
read.table(text = sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x))
#               a b c d e
# Australia     1 2 4 3 2
# United States 1 2 4 2 2
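
Since the question asks how to go about using rownames, here is a small sketch of what can be done once the names are in place. It is not from the original answer: the vector `x` is a hypothetical in-memory stand-in for `readLines("your.txt")`, the anchored pattern is an alternative to the one above, and it assumes that names contain no digits and no quote characters.

```r
# Hypothetical stand-in for readLines("your.txt")
x <- c("a b c d e",
       "Australia 1 2 4 3 2",
       "United States 1 2 4 2 2",
       "Papua New Guinea 3 4 3 2 5")

# On the data lines only, quote the leading non-digit text that sits
# before the first number, so the whole name is read as one field
x[-1] <- sub("^([^0-9]*[^0-9 ])\\s+([0-9])", "'\\1' \\2", x[-1])

# The header has one fewer field than the data rows, so read.table
# uses the first column as row names
dat <- read.table(text = x)

rownames(dat)                  # the country names
dat["United States", ]         # look up a whole row by country
dat["Papua New Guinea", "e"]   # look up a single value

# Turn the row names back into an ordinary column if needed
dat2 <- data.frame(Country = rownames(dat), dat,
                   row.names = NULL, stringsAsFactors = FALSE)
```

Row names must be unique, so this layout only suits files in which each country appears once; otherwise keeping a regular Country column is the safer choice.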
