Perl read seek tell, and text files. Too many bytes being read. Layers and newline handling

I have a Perl script which analyses a text file (which can have UNIX or Windows line endings), storing file offsets when it finds something of interest.

open(my $fh, '<', $filename) or die "Can't read $filename: $!";
my $groups = 0;
my %hash;
while(<$fh>) {
   if($_ =~ /interesting/ ) {
      $hash{$groups++}{offset} = tell($fh);
   }
}
close $fh;

Then later on in the script I want to produce 'n' copies of the text file but with additional content at each 'interesting' area. To achieve this I loop through the hash of offsets:

foreach my $group (keys %hash) {
   my $href = $hash{$group};
   my $offset = $href->{offset};

   my $top;
   open( $fh, '<', $filename) or die "Can't read $filename: $!";
   read( $fh, $top, $offset);
   my $bottom = do{local $/; <$fh>};
   close $fh;

   $href->{modified} = $top . "Hello World\n" . $bottom;
}

The problem is that the read command is reading too many bytes. I suspect this is a line ending issue, as the number of extra bytes (chars?) read is the same as the line number. Checked in Notepad++, the value returned by tell() is the real offset of the point of interest, but using that offset value in read() returns characters past the point of interest.

I've tried adding binmode($fh) straight after the open() command prior to the read(). This does find the correct position in the text file, but then I get (CR + CRLF) output and the text file is full of double carriage returns.

I've played with layers :crlf, :bytes, but no improvement.

Bit stuck!

There are 3 best solutions below

0
On

From perldoc -f read:

read FILEHANDLE,SCALAR,LENGTH,OFFSET
read FILEHANDLE,SCALAR,LENGTH

So, when you do:

read( $fh, $top, $offset);

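your $offset is actually a length. read does not respect line endings: it reads the number of characters specified, counted after any layer translation. With CRLF translation in effect (the default for text files on Windows), each CR+LF pair counts as one character, whereas tell reports a position in bytes. Decide how many characters you need to read.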
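The mismatch between tell and read can be reproduced with a minimal sketch like the one below. It is not the asker's script: the sample file contents and the variable names ($top_text, $top_raw) are made up for illustration, and the :crlf layer is requested explicitly so the behaviour should be the same on any platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file with CRLF line endings (contents are illustrative).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "one\r\n", "two interesting\r\n", "three\r\n";
close $tmp or die "Can't close $path: $!";

# Pass 1: scan line by line in text mode and remember where the match ends.
open my $in, '<:crlf', $path or die "Can't read $path: $!";
my $offset;
while (my $line = <$in>) {
    $offset = tell $in if $line =~ /interesting/;   # position in bytes
}
close $in;

# Pass 2a: under :crlf, read() counts characters after CRLF -> LF translation,
# so asking for $offset characters runs past the byte offset.
open $in, '<:crlf', $path or die "Can't read $path: $!";
defined(read $in, my $top_text, $offset) or die "Read failed: $!";
close $in;

# Pass 2b: under :raw, read() counts bytes, which is the unit tell reported.
open $in, '<:raw', $path or die "Can't read $path: $!";
defined(read $in, my $top_raw, $offset) or die "Read failed: $!";
close $in;

printf "offset=%d  text-mode tail=%s  raw tail is CRLF=%s\n",
    $offset, substr($top_text, -2), ($top_raw =~ /\r\n\z/ ? 'yes' : 'no');
```

If the two halves are read with :raw, writing them back through a handle that is also opened with :raw (or has binmode applied) should avoid the CR + CRLF doubling, because the "\r\n" pairs are then passed through untouched instead of having "\n" expanded again on output.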

If you want to read a line, then don't use read, use:

seek($fh, $offset, 0);
$top = <$fh>;

Is your file really full of doubled newlines, or are you adding an extra one with a print statement?

2
On
  • A hash with a continuous range of integers as keys should be an array.

  • You are storing a copy of the entire file for every occurrence of /interesting/

  • It sounds like what you need to do is this

    open(my $fh, '<', $filename) or die "Can't read $filename: $!";
    while (<$fh>) {
      print;
      print "Hello World\n" if /interesting/;
    }
    
0
On

My standard way to handle this, when the input file isn't ginormous, is to slurp the file in and normalize line endings, storing each line as an array element. I sometimes have to deal with Windows (CR+LF) and UNIX (LF only) and Mac (CR only) line endings in the same batch of files. The same script needs to run correctly across all three platforms too.

I generally take a belt-and-braces approach when having to deal with such things. One way that ought to work:

sub read_file_into_array
{
    my $file = shift;
    my ($len, $cnt, $data, @file);

    # Read raw bytes so the byte count from tell matches what read returns.
    open my $fh, "<:raw", $file     or die "Can't read $file: $!";
    seek $fh, 0, 2                  or die "Can't seek $file: $!";
    $len = tell $fh;
    seek $fh, 0, 0                  or die "Can't seek $file: $!";

    $cnt = read $fh, $data, $len;
    close $fh;

    $cnt == $len or die "Attempted to read $len bytes; got $cnt";

    $data =~ s/\r\n/\n/g;       # Convert DOS line endings to UNIX
    $data =~ s/\r/\n/g;         # Convert Mac line endings to UNIX

    @file = split /\n/, $data;  # Split on UNIX line endings

    return \@file;
}

Then do all your processing on the lines in @file. For your 'interesting' tags, you would store an array index rather than a file offset. The array index is essentially the line number in the original file, counting starting at 0 instead of 1.
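Recording array indices rather than file offsets could look roughly like this. The $lines array ref here is a hypothetical stand-in for whatever read_file_into_array returns, and @interesting is just an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the array ref returned by read_file_into_array.
my $lines = [
    'alpha',
    'something interesting',
    'beta',
    'more interesting stuff',
];

# Collect the 0-based indices of matching lines instead of byte offsets.
my @interesting = grep { $lines->[$_] =~ /interesting/ } 0 .. $#{$lines};

print "index $_: $lines->[$_]\n" for @interesting;
```

Because the indices refer to normalized lines, they stay valid whichever line-ending convention the original file used.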

To actually augment the files, instead of looping through hash keys, why not construct a hash consisting of line-number => thing-to-append pairs, generating the augmented file like this:

sub generate_augmented_file
{
    my $file   = shift @_;   # array ref
    my $extras = shift @_;   # hash ref of line => extra pairs
    my $text = '';

    # $file is an array ref, so use its last index rather than scalar($file)
    foreach my $line ( 0 .. $#{$file} )
    {
        $text .= $file->[$line];
        $text .= $extras->{$line} if defined $extras->{$line};
        $text .= "\n";
    }

    return $text;
}
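To get the 'n' copies the question asks for, one copy per interesting line, the same idea can be wrapped in a loop. This is only a sketch under assumptions: augmented_copies is a hypothetical helper, not part of the answer above, and it inserts the extra text as its own line after the matching line (as in the question) rather than appending it to that line.

```perl
use strict;
use warnings;

# Hypothetical helper: build one augmented copy of the text per interesting index.
sub augmented_copies {
    my ($lines, $indices, $extra) = @_;   # array ref, array ref, string
    my @copies;

    for my $idx (@$indices) {
        my $text = '';
        for my $n (0 .. $#{$lines}) {
            $text .= "$lines->[$n]\n";
            $text .= $extra if $n == $idx;   # insert only at this copy's index
        }
        push @copies, $text;
    }

    return \@copies;
}

# Illustrative data; in practice these would come from the file and the scan.
my $lines  = [ 'alpha', 'interesting one', 'beta', 'interesting two' ];
my $copies = augmented_copies($lines, [ 1, 3 ], "Hello World\n");

print "--- copy $_ ---\n$copies->[$_]" for 0 .. $#{$copies};
```

Each copy is built from the shared array of lines, so the whole file is only parsed once even though one full string is still produced per interesting line.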