Perl: reading only specific lines of a gz file


I'm trying to make a parsing script that parses a huge text file (2 million+ lines) that is gzip compressed. I only want to parse a range of lines in the text file. So far I've used zgrep -n to find the two lines that mention the string that I know will start and end the section of the file I'm interested in.

In my test case file I am interested in only reading in lines 123080 to 139361. I've found Tie::File, which accesses the file's lines through the array it ties, but unfortunately this won't work for the gzipped file I'm working with.

Is there something like the following for a gunzipped file?

use Tie::File;
tie my @fileLinesArray, 'Tie::File', "hugeFile.txt.gz";
my $startLine = 123080;

my $endLine = 139361;
my $lineCount = $startLine;
while ($lineCount <= $endLine) {
    my $line = $fileLinesArray[$lineCount];   # Tie::File indexes lines from 0
    # blah blah...
    $lineCount++;
}
There are 3 answers below.

Answer by choroba (accepted)

Use IO::Uncompress::Gunzip, which is a core module:

use IO::Uncompress::Gunzip;

my ($start_line, $end_line) = (123080, 139361);

my $z = IO::Uncompress::Gunzip->new('file.gz');
$z->getline for 1 .. $start_line - 1;    # skip lines before the range
for ($start_line .. $end_line) {
    my $line = $z->getline;
    ...
}

Tie::File gets very slow and memory hungry when processing large files.
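For completeness, here is a minimal self-contained sketch along the same lines with error checking via the module's $GunzipError variable; the file name and line range are assumed from the question, and the parsing body is only a placeholder:

use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

my $file = 'hugeFile.txt.gz';                    # file name assumed from the question
my ($start_line, $end_line) = (123080, 139361);

my $z = IO::Uncompress::Gunzip->new($file)
    or die "Cannot open $file: $GunzipError\n";

$z->getline for 1 .. $start_line - 1;            # skip lines before the range
for ($start_line .. $end_line) {
    my $line = $z->getline;
    last unless defined $line;                   # stop early if the file is shorter
    # ... parse $line here ...
}
$z->close;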

Answer by mob

Tie::File is a bad idea for large files, as it needs to store the whole file in memory at once. It is also impractical, if not impossible, for compressed files. Instead, you will want to operate on an input stream of your data and, if you are going to modify the data, an output stream to a new copy of it. Perl has pretty good support for gzip compression through the PerlIO::gzip layer, but you could also pipe the data through one or two gzip processes.

# I/O stream initialization
use PerlIO::gzip;
open my $input, "<:gzip", "data.gz";
open my $output, ">:gzip", "data.new.gz";    # if $output is needed

# I/O stream initialization without PerlIO::gzip
open my $input, "gzip -dc data.gz |";
open my $output, "| gzip -c > data.new.gz";

Once the input (and optional output) streams are set up, you can use Perl's I/O facilities on them just like any other file handles.

# copy lines before $startLine unedited
while (<$input>) {
    print $output $_;
    last if $. >= $startLine - 1;
}

while (my $line = <$input>) {
    # blah blah blah
    # manipulate $line
    print $output $line;
    last if $. >= $endLine;
}

print $output <$input>; # write remaining input to output stream
close $input;
close $output;
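Putting those pieces together, here is a minimal end-to-end sketch using the PerlIO::gzip layer, written as a single loop with a range test on $. instead of the separate stages above; the file names and line range are assumptions based on the question, and the per-line edit is only a placeholder:

use strict;
use warnings;
use PerlIO::gzip;

my ($startLine, $endLine) = (123080, 139361);

open my $input,  "<:gzip", "hugeFile.txt.gz"
    or die "Can't read hugeFile.txt.gz: $!";
open my $output, ">:gzip", "hugeFile.new.txt.gz"
    or die "Can't write hugeFile.new.txt.gz: $!";

while (my $line = <$input>) {
    if ($. >= $startLine && $. <= $endLine) {
        # ... manipulate $line here ...
    }
    print $output $line;    # copy every line (edited or not) to the new file
}

close $input;
close $output;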
Answer by Kjetil S.

You write: "In my test case file I am interested in only reading in lines 123080 to 139361".

This can be done in the shell as well (16282 is the number of lines in the range, 139361 - 123080 + 1):

zcat file | tail -n +123080 | head -16282

Or by:

my $file = 'the_file.gz';
my ($from, $to) = (123080, 139361);
my @lines = qx( zcat $file | tail -n +$from | head -n @{[ $to - $from + 1 ]} );

This might be faster than a normal single-core pure Perl solution, since zcat, tail and head inside the qx become three separate processes and Perl is a fourth, so all four might get a CPU core of their own. You might want to test the speed with different line ranges.
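If holding all 16282 lines in @lines at once is more memory than you want, a variation on the same idea (an untested sketch using the same zcat/tail/head pipeline) is to open the pipeline as a filehandle and read it line by line:

use strict;
use warnings;

my $file = 'the_file.gz';
my ($from, $to) = (123080, 139361);
my $count = $to - $from + 1;

# read the pipeline's output line by line instead of slurping it into an array
open my $fh, '-|', "zcat $file | tail -n +$from | head -n $count"
    or die "Can't start pipeline: $!";
while (my $line = <$fh>) {
    # ... process $line here ...
}
close $fh;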