Skip some lines with fread

I would like to skip some lines of my data before the header names. How can I do this by skipping all the lines before ID_REF or, if ID_REF is not present, by checking for the pattern ILMN_ and dropping all the lines before it, keeping the line immediately before the first ILMN_ row if that line does not contain #?

# GEOarchive matrix file.               
ID_REF  1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 1688628068_A.BEAD_STDERR 1688628068_A.Detection Pval
ILMN_1343291    62821.84         135                               413.9399                       0
ILMN_1343292    3255.167         131                               47.76587                       0
ILMN_1343293    42924.91         152                               539.3026                       0
ILMN_1343294    55255.21         100                               746.1457                       0
Best answer

On Linux, you can use awk together with fread, or pipe its output into read.table. Here, awk also changes the delimiter to a comma:

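pth <- '/home/akrun/file.txt' # change it to your path
# Start printing at the first line that begins with ID_REF or ILMN;
# $1=$1 rebuilds each line with OFS (a comma) as the field separator
v1 <- sprintf("awk '/^(ID_REF|ILMN)/{ matched = 1} matched {$1=$1; print}' OFS=\",\" %s", pth)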

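and read the output of that command with fread

library(data.table)
fread(cmd = v1) # recent data.table versions take shell commands via cmd=; older ones accept fread(v1)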
#         ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
#1: ILMN_1343291               62821.840                     135
#2: ILMN_1343292                3255.167                     131
#3: ILMN_1343293               42924.910                     152
#4: ILMN_1343294               55255.210                     100
#   1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval
#1:                413.93990                           0
#2:                 47.76587                           0
#3:                539.30260                           0
#4:                746.14570                           0
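
To check what the awk filter passes through before handing it to R, the same program can be run directly in a shell. This is only a minimal sketch: the file name sample.txt and its shortened contents are made up for illustration, and the file is assumed to be whitespace-delimited.

```shell
# Build a small sample file (hypothetical name) with a comment line before the header.
cat > sample.txt <<'EOF'
# GEOarchive matrix file.
ID_REF  1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
ILMN_1343291    62821.84         135
ILMN_1343292    3255.167         131
EOF

# 'matched' is set at the first line starting with ID_REF or ILMN;
# from then on every line is printed. $1 = $1 forces awk to rebuild
# the line, so runs of whitespace are replaced by OFS (a comma here).
awk '/^(ID_REF|ILMN)/{ matched = 1 } matched { $1 = $1; print }' OFS="," sample.txt
# ID_REF,1688628068_A.AVG_Signal,1688628068_A.Avg_NBEADS
# ILMN_1343291,62821.84,135
# ILMN_1343292,3255.167,131
```

Because the pattern also matches ILMN, printing starts at the first data row when the file has no ID_REF header line.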

Or using read.table

read.table(pipe(v1), header=TRUE, sep=',', check.names=FALSE)
#       ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
#1 ILMN_1343291               62821.840                     135
#2 ILMN_1343292                3255.167                     131
#3 ILMN_1343293               42924.910                     152
#4 ILMN_1343294               55255.210                     100
#  1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval
#1                413.93990                           0
#2                 47.76587                           0
#3                539.30260                           0
#4                746.14570                           0

NOTE: I changed the column name from 1688628068_A.Detection Pval to 1688628068_A.Detection_Pval
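
One plausible reason the rename is needed (my assumption, the answer does not state it): if the file is whitespace-delimited, the space inside "Detection Pval" makes the header line one field longer than the data rows. A quick way to see this is to count the fields per line. The file name header_demo.txt and its shortened contents are hypothetical.

```shell
# Hypothetical sample that keeps the original header, where the last column name contains a space.
cat > header_demo.txt <<'EOF'
ID_REF  1688628068_A.AVG_Signal 1688628068_A.Detection Pval
ILMN_1343291    62821.84         0
EOF

# Print the number of whitespace-separated fields on each line.
awk '{ print NF }' header_demo.txt
# 4
# 3
```

If the real file turns out to be tab-delimited, splitting on tabs only (for example with awk -F'\t') would keep the column name in one piece, and the rename would not be necessary.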

For some reason, the extra spaces between columns create problems with fread. With read.table they are not an issue, so the following, which leaves the whitespace untouched, also works:

 v2 <- sprintf("awk '/^(ID_REF|ILMN)/{ matched = 1} matched { print}' %s", pth)

 read.table(pipe(v2), header=TRUE, check.names=FALSE)
 #       ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
 #1 ILMN_1343291               62821.840                     135
 #2 ILMN_1343292                3255.167                     131
 #3 ILMN_1343293               42924.910                     152
 #4 ILMN_1343294               55255.210                     100
 #  1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval
 #1                413.93990                           0
 #2                 47.76587                           0
 #3                539.30260                           0
 #4                746.14570                           0