Perl: reading only specific lines of a gz file


I'm trying to make a parsing script that parses a huge text file (2 million+ lines) that is gzip compressed. I only want to parse a range of lines in the text file. So far I've used zgrep -n to find the two lines that mention the string that I know will start and end the section of the file I'm interested in.

In my test case file I am interested in only reading in lines 123080 to 139361. I've found Tie::File, which accesses the file's lines through the array it ties, but unfortunately this won't work for the gzipped file I'm working with.

Is there something like the following for a gunzipped file?

use Tie::File;
tie my @fileLinesArray, 'Tie::File', "hugeFile.txt.gz";
my $startLine = 123080;

my $endLine = 139361;
my $lineCount = $startLine;
while ($lineCount <= $endLine) {
    my $line = $fileLinesArray[$lineCount];   # Tie::File indexes lines from 0
    # blah blah...
    $lineCount++;
}
There are 3 answers below.

Answer by choroba (accepted)

Use IO::Uncompress::Gunzip, which is a core module:

use IO::Uncompress::Gunzip;

my ($start_line, $end_line) = (123080, 139361);

my $z = IO::Uncompress::Gunzip->new('file.gz');
$z->getline for 1 .. $start_line - 1;    # skip lines before the range
for ($start_line .. $end_line) {
    my $line = $z->getline;
    ...
}

Tie::File gets very slow and memory hungry when processing large files.
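For completeness, here is a minimal self-contained sketch along the same lines with error checking via the module's $GunzipError variable; the file name and line range are assumed from the question, and the parsing body is only a placeholder:

use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

my $file = 'hugeFile.txt.gz';                    # file name assumed from the question
my ($start_line, $end_line) = (123080, 139361);

my $z = IO::Uncompress::Gunzip->new($file)
    or die "Cannot open $file: $GunzipError\n";

$z->getline for 1 .. $start_line - 1;            # skip lines before the range
for ($start_line .. $end_line) {
    my $line = $z->getline;
    last unless defined $line;                   # stop early if the file is shorter
    # ... parse $line here ...
}
$z->close;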

Answer by mob

Tie::File is a bad idea for large files, as it needs to store the whole file in memory at once. It is also impractical, if not impossible, for compressed files. Instead, you will want to operate on an input stream of your data and, if you are going to modify the data, an output stream to a new copy of it. Perl has pretty good support for gzip compression through the PerlIO::gzip layer, but you could also pipe the data through one or two gzip processes.

# I/O stream initialization
use PerlIO::gzip;
open my $input, "<:gzip", "data.gz";
open my $output, ">:gzip", "data.new.gz";    # if $output is needed

# I/O stream initialization without PerlIO::gzip
open my $input, "gzip -dc data.gz |";
open my $output, "| gzip -c > data.new.gz";

Once the input (and optional output) streams are set up, you can use Perl's I/O facilities on them just like any other file handles.

# copy lines before $startLine unedited
while (<$input>) {
    print $output $_;
    last if $. >= $startLine - 1;
}

while (my $line = <$input>) {
    # blah blah blah
    # manipulate $line
    print $output $line;
    last if $. >= $endLine;
}

print $output <$input>; # write remaining input to output stream
close $input;
close $output;
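Putting those pieces together, here is a minimal end-to-end sketch using the PerlIO::gzip layer, written as a single loop with a range test on $. instead of the separate stages above; the file names and line range are assumptions based on the question, and the per-line edit is only a placeholder:

use strict;
use warnings;
use PerlIO::gzip;

my ($startLine, $endLine) = (123080, 139361);

open my $input,  "<:gzip", "hugeFile.txt.gz"
    or die "Can't read hugeFile.txt.gz: $!";
open my $output, ">:gzip", "hugeFile.new.txt.gz"
    or die "Can't write hugeFile.new.txt.gz: $!";

while (my $line = <$input>) {
    if ($. >= $startLine && $. <= $endLine) {
        # ... manipulate $line here ...
    }
    print $output $line;    # copy every line (edited or not) to the new file
}

close $input;
close $output;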
Answer by Kjetil S.

You write: "In my test case file I am interested in only reading in lines 123080 to 139361".

This can be done in the shell as well (16282 is the number of lines in the range, 139361 - 123080 + 1):

zcat file | tail -n +123080 | head -16282

Or by:

my $file = 'the_file.gz';
my ($from, $to) = (123080, 139361);
my @lines = qx( zcat $file | tail -n +$from | head -n @{[ $to - $from + 1 ]} );

This might be faster than a normal single-core pure Perl solution, since zcat, tail and head inside the qx become three separate processes and Perl is a fourth, so all four might get a CPU core of their own. You might want to test the speed with different line ranges.
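If holding all 16282 lines in @lines at once is more memory than you want, a variation on the same idea (an untested sketch using the same zcat/tail/head pipeline) is to open the pipeline as a filehandle and read it line by line:

use strict;
use warnings;

my $file = 'the_file.gz';
my ($from, $to) = (123080, 139361);
my $count = $to - $from + 1;

# read the pipeline's output line by line instead of slurping it into an array
open my $fh, '-|', "zcat $file | tail -n +$from | head -n $count"
    or die "Can't start pipeline: $!";
while (my $line = <$fh>) {
    # ... process $line here ...
}
close $fh;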