Perl: Efficiently store/get 2D array of constrained integers in file


This is an attempt to improve my earlier question, "Perl: seek to and read bits, not bytes", by explaining more thoroughly what I was trying to do.

I have x, a 9136 x 42 array of integers that I want to store super-efficiently in a file. The integers have the following constraints:

  • All of the 9136 integers in x[0..9135][0] are between -137438953472 and 137438953471, and can therefore be stored using 38 bits.

  • All of the 9136 integers in x[0..9135][1] are between -16777216 and 16777215, and can therefore be stored using 25 bits.

  • And so on... (the integer bit constraints are known in advance; Perl doesn't have to compute them)
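As a quick sanity check of the bit counts quoted above, here is a minimal sketch that finds the smallest two's-complement width for a given range. The helper name `signed_bits` is made up for illustration, and the code assumes a Perl built with 64-bit integers.

```perl
use strict;
use warnings;

# Hypothetical helper: smallest n such that every value in [$min, $max]
# fits in an n-bit two's-complement integer, i.e.
# -2**(n-1) <= $min and $max <= 2**(n-1) - 1.
# Assumes a 64-bit Perl and ranges well inside 64 bits.
sub signed_bits {
    my ($min, $max) = @_;
    my $n = 1;
    $n++ until $min >= -(1 << ($n - 1)) && $max <= (1 << ($n - 1)) - 1;
    return $n;
}

print signed_bits(-137438953472, 137438953471), "\n";   # prints 38
print signed_bits(-16777216,     16777215),     "\n";   # prints 25
```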

Question: Using Perl, how do I efficiently store this array in a file?

Notes:

  • If an integer can be stored in 25 bits, it can also be stored in 4 bytes (32 bits), if you're willing to waste 7 bits. In my situation, however, every bit counts.

  • I want to use file seek() to find data quickly, not read sequentially through the file.

  • The array will normally be accessed as x[i]. In other words, I'll want the 42 integers corresponding to a given x[i], so these 42 integers should be stored close to each other (ideally, they should be stored adjacent to each other in the file).

  • My initial approach was to just lay down a bitstream, and then find a way to read it back and change it back into an integer. My original question focused on that, but perhaps there's a better solution to the bigger problem that I'm not seeing.

Far too much detail on what I'm doing:


There are 1 best solutions below


I'm not sure I should be encouraging you, but it looks like Data::BitStream will do what you ask.

The program below writes a 38-bit value and a 25-bit value to a file, and then opens and retrieves the values intact.

#!/usr/bin/perl

use strict;
use warnings;

use Data::BitStream;

{
   my $bs_out = Data::BitStream->new(
      mode => 'w',
      file => 'bits.dat',
   );

   printf "Maximum %d bits per word\n", $bs_out->maxbits;

   $bs_out->write(38, 137438953471);
   $bs_out->write(25, 16777215);

   printf "Total %d bits written\n\n", $bs_out->len;
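}

# The stream has gone out of scope, so the file is complete; report its size
printf "File size %d bytes\n", -s 'bits.dat';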

{
   my $bs_in = Data::BitStream->new(
      mode => 'ro',
      file => 'bits.dat',
   );

   printf "Total %d bits read\n\n", $bs_in->len;
   print "Data:\n";

   print $bs_in->read(38), "\n";
   print $bs_in->read(25), "\n";
}

output

Maximum 64 bits per word
Total 63 bits written

File size 11 bytes
Total 63 bits read

Data:
137438953471
16777215

38 plus 25 makes 63 bits of data written, which the module confirms. But there is clearly some additional housekeeping data involved, as the resulting file is eleven bytes long rather than the eight bytes that would be the minimum necessary. Note that, when reopened, the stream remembers that it is 63 bits long. Even so, the file is shorter than the sixteen bytes that would be needed to hold two plain 64-bit integers.
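If seeking straight to row i matters more than using a module, the same packing can be sketched with core Perl only. This is not how Data::BitStream lays out its file; it is a separate, minimal illustration. Because every row has the same known bit length, row i starts at bit i times the row length, so a reader can seek to the byte containing that bit and skip the few leading bits. The names `@widths`, `row_to_bits`, `write_rows` and `read_row` are invented here, only two columns are shown, and a Perl with 64-bit integers is assumed.

```perl
use strict;
use warnings;
no warnings 'portable';   # oct("0b...") warns on fields wider than 32 bits

# Hypothetical per-column bit widths (two columns here, 42 in the question).
my @widths   = (38, 25);
my $row_bits = 0;
$row_bits += $_ for @widths;

# Encode one row as a string of '0'/'1' characters. Each signed value is
# offset by 2**(w-1) so that it becomes an unsigned w-bit number.
sub row_to_bits {
    my @row = @_;
    return join '', map {
        my $w = $widths[$_];
        sprintf "%0${w}b", $row[$_] + (1 << ($w - 1));
    } 0 .. $#widths;
}

# Write all rows as one continuous bit stream; pack 'B*' zero-pads only
# the final byte, so at most 7 bits are wasted in the whole file.
sub write_rows {
    my ($file, @rows) = @_;
    open my $fh, '>:raw', $file or die "open $file: $!";
    print {$fh} pack 'B*', join '', map { row_to_bits(@$_) } @rows;
    close $fh or die "close $file: $!";
}

# Fetch row $i: seek to the byte holding its first bit, read just the
# bytes that cover the row, then slice the fields out of the bit string.
sub read_row {
    my ($file, $i) = @_;
    my $first_bit  = $i * $row_bits;
    my $first_byte = int($first_bit / 8);
    my $last_byte  = int(($first_bit + $row_bits - 1) / 8);
    my $want       = $last_byte - $first_byte + 1;

    open my $fh, '<:raw', $file or die "open $file: $!";
    seek $fh, $first_byte, 0 or die "seek $file: $!";
    my $got = read $fh, my $buf, $want;
    die "short read on $file" unless defined $got && $got == $want;
    close $fh;

    my $bits = unpack 'B*', $buf;
    my $pos  = $first_bit % 8;      # bits to skip inside the first byte
    my @row;
    for my $w (@widths) {
        push @row, oct('0b' . substr($bits, $pos, $w)) - (1 << ($w - 1));
        $pos += $w;
    }
    return @row;
}
```

Calling `write_rows('rows.dat', @x)` with a list of array references and then `read_row('rows.dat', $i)` should return the original integers for row `$i`. Building the whole bit string in memory is fine for 9136 rows but would need chunking for much larger arrays, and there is no header, so the reader must already know the column widths.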

What you do with this information is up to you, but remember that data packed in this way will be extremely difficult to debug with a hex editor. You may be shooting yourself in the foot if you adopt something like this.