I need to split a large file into smaller chunks based on the last occurrence of a pattern, using a shell script. For example:
Sample.txt (the file is sorted on the third field, which is the one the pattern is searched in):
NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
SOUTHWEST|3456|01134|GDFSG|EWRER|
"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt
NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
I used

awk -F'|' -v pattern='00003' '$3 ~ pattern' big_file > smallfile

and grep commands, but they were very time-consuming since the file is 300+ MB in size.
Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching. It processes the lookup values in a loop and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.
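A sketch of that loop-based approach (big_file, the lookup values, and the smallfile<n> names are placeholders taken from the question):

#!/usr/bin/env bash
# Loop-based sketch: one value per iteration; big_file is read several times.
prev=0  # last line written by the previous iteration
i=0     # chunk index
for val in '00003' '00112' '|'; do
  i=$(( i + 1 ))
  # Last line whose 3rd field string-equals the value (0 if nothing matches).
  last=$(awk -F'|' -v val="$val" '$3 == val { n = NR } END { print n + 0 }' big_file)
  # The dummy value '|' can never equal a field (it is the separator),
  # so flush all remaining rows into the final chunk instead.
  [ "$last" -eq 0 ] && last=$(wc -l < big_file)
  # Write lines (prev, last] to the current chunk file.
  awk -v from=$(( prev + 1 )) -v to="$last" 'NR >= from && NR <= to' big_file > "smallfile$i"
  prev=$last
done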
Note that the dummy value | ensures that any remaining rows after the last true value to match are saved to a chunk file too.

Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:
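A minimal single-pass sketch, assuming the file is sorted on the 3rd field as stated in the question (this is one possible reconstruction, not necessarily the exact script meant above; the trailing | again serves as the dummy value):

awk -F'|' -v vals='00003 00112 |' '
  BEGIN { n = split(vals, v, " "); i = 1 }   # lookup values, in file order
  {
    # Field 3 has moved past the current value: switch to the next chunk.
    if (seen && $3 != v[i] && i < n) { i++; seen = 0 }
    if ($3 == v[i]) seen = 1
    print > ("smallfile" i)                  # one output file per chunk
  }
' big_file

With the sample input this writes the NORTH EAST and SOUTH rows to smallfile1, the EAST and WEST rows to smallfile2, and the trailing SOUTHWEST row to smallfile3.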