Why can't read.table recognize a tab-separated header correctly in R?

I have the following data

GOBPID  Term    ADX_KD_06.ip    ADX_KD_24.ip    ADX_LG_06.ip (more columns)
GO:0000003  reproduction    0   0   0
GO:0000165  MAPK cascade    0   0   0
(more rows)

When I read it in as follows:

d1 <- read.table("http://dpaste.com/1487049/plain/",sep="\t",header=TRUE)

I expect d1$GOBPID to contain values like GO:0000003, but it returns the values of the Term column instead.

> d1$GOBPID
[1] reproduction   MAPK cascade ....

Basically, the header isn't being assigned to the columns as it should be. Why is that? What's the right way to read this file?

There are 2 answers below.

BEST ANSWER

How big are your actual data?

As Richie Cotton pointed out, count.fields is useful for identifying how many delimiters there are in each row of your data. In this case, however, it was a little more useful to open the file in a decent text editor that shows tab characters: you would then see that every line except the first has a trailing tab. Because all the other rows have one more tab than the first line, R assumes the first "column" holds the row.names, which leads to the problem you're having.
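
To see why the shift happens, here is a minimal sketch with hypothetical inline data (not your file) reproducing the behaviour: the data lines end with a trailing tab, so they have one more field than the header, and read.delim promotes the first column to row names.

txt <- c("GOBPID\tTerm",
         "GO:0000003\treproduction\t",
         "GO:0000165\tMAPK cascade\t")
d <- read.delim(text = txt)
rownames(d)  # "GO:0000003" "GO:0000165" -- the GO identifiers became row names
d$GOBPID     # "reproduction" "MAPK cascade" -- shifted over, as in your question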

Here are two possible options for this data:

Option 1

This is convenient if your data are small: use gsub to strip the trailing tab from each line returned by readLines, then pass the result to read.delim via its text argument:

read.delim(text = gsub("\\t$", "", 
                       readLines("http://dpaste.com/1487049/plain/")))
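
Assigning that result to, say, d1 (a name I'm making up here) lets you confirm the fix; the column names below come from the sample data in your question:

d1 <- read.delim(text = gsub("\\t$", "",
                             readLines("http://dpaste.com/1487049/plain/")))
head(d1$GOBPID)  # GO:0000003, GO:0000165, ...
head(d1$Term)    # reproduction, MAPK cascade, ...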

Option 2

Read the table in, skipping the first line; drop the last column (which should be all NA values); and add the names by reading just the first line with scan:

out <- read.delim("http://dpaste.com/1487049/plain/", 
                  skip = 1, header = FALSE)
out <- out[-length(out)]
names(out) <- scan("http://dpaste.com/1487049/plain/", 
                   what="", n=length(out), sep = "\t")

ANOTHER ANSWER

You may have space characters instead of tabs somewhere, or other malformed data. Run

count.fields("http://dpaste.com/1487049/plain/", sep = "\t")

to see which lines are causing the problems.
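
For example, tabulating the field counts makes any odd rows easy to spot (a small sketch on the same file):

cf <- count.fields("http://dpaste.com/1487049/plain/", sep = "\t")
table(cf)               # distribution of field counts across lines
which(cf != median(cf)) # lines whose field count differs from a typical row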