How can I navigate Word tables using the Win32::OLE Perl package?


I have a directory with hundreds of Word documents, each containing a standardized set of tables. I need to parse these tables and extract the data in them. I wrote the following script, which prints out each table in full.

#!/usr/bin/perl
use strict;
use warnings;

use Carp qw( croak );
use Cwd qw( abs_path );
use Path::Class;
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;
=d
my $datasheet_dir = "./path/to/worddocs";
my @files = glob "$datasheet_dir/*.doc";
print "scalar: ".scalar(@files)."\n";
foreach my $f (@files){
    print $f."\n";
}
=cut
#my $file = $files[0];
my $file = "word.doc";
print "file: $file\n";

run([$file]);  # @files is inside the POD block above, so pass the single test file

sub run {
    my $argv = shift;
    my $word = get_word();

    $word->{DisplayAlerts} = wdAlertsNone;
    $word->{Visible}       = 1;

    for my $word_file ( @$argv ) {
        print_tables($word, $word_file);
    }

    return;
}

sub print_tables {
    my $word = shift;
    my $word_file = file(abs_path(shift));

    my $doc = $word->{Documents}->Open("$word_file");
    my $tables = $word->ActiveDocument->{Tables};

    for my $table (in $tables) {
        my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
        $text =~ s/\r/\n/g;
        print $text, "\n";
    }

    $doc->Close(0);
    return;
}

sub get_word {
    my $word;
    eval { $word = Win32::OLE->GetActiveObject('Word.Application'); 1 }
        or die "$@\n";
    $word and return $word;
    $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
        or die "Oops, cannot start Word: ", Win32::OLE->LastError, "\n";
    return $word;
}

Is there a way to navigate the individual cells? I only want to return rows that have a specific value in the first column.

For example, for the following table, I only want to keep the rows that have a fruit in the first column.

apple       pl
banana      xml
California  csv
pickle      txt
Illinois    gov
pear        doc

There is 1 best solution below

You could use OLE to access the individual cells of the table through its Cell(row, column) method, after first getting the dimensions from the table's Rows and Columns collections (each has a Count property).
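
A minimal sketch of the cell-by-cell route, assuming the tables are uniform (no merged or split cells, since Cell(row, column) raises an error for a cell that does not exist) and that the look-up hash is something you populate yourself. The helper names cell_text and rows_matching are hypothetical, not part of Win32::OLE. Word ends each cell's text with an end-of-cell marker (a carriage return followed by chr(7)), which the sketch strips before comparing.

```perl
use strict;
use warnings;

# Hypothetical helper: return a cell's text without Word's trailing
# end-of-cell marker ("\r" followed by "\x07").
sub cell_text {
    my ($table, $row, $col) = @_;
    my $text = $table->Cell($row, $col)->Range->Text;
    $text =~ s/[\r\x07]+\z//;
    return $text;
}

# Hypothetical helper: collect the rows whose first-column text is a key
# of %$wanted. Each returned row is an array ref of cell strings.
# Assumes a uniform table (no merged or split cells).
sub rows_matching {
    my ($table, $wanted) = @_;
    my $n_rows = $table->Rows->Count;
    my $n_cols = $table->Columns->Count;

    my @matches;
    for my $r (1 .. $n_rows) {    # Word indexes rows and columns from 1
        next unless exists $wanted->{ cell_text($table, $r, 1) };
        push @matches, [ map { cell_text($table, $r, $_) } 1 .. $n_cols ];
    }
    return @matches;
}
```

Inside the for my $table (in $tables) loop, this could stand in for the ConvertToText call, for example: print join("\t", @$_), "\n" for rows_matching($table, \%fruit);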

Or you could post-process the text into a Perl array, and iterate that. Instead of

my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
$text =~ s/\r/\n/g;
print $text, "\n";

something like

my %fruit; # populating this look-up table of fruit is omitted here

my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
my @lines = split /\r/, $text;
for my $line ( @lines ) {
    my @fields = split /\t/, $line;

    next unless exists $fruit{$fields[0]};

    print "$line\n";
}

Refinements for case sensitivity, etc., can be added as needed.
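
As one possible refinement, the look-up can be made case-insensitive by lower-casing both the hash keys and the first field of each row. The sketch below is pure Perl and runs on a hard-coded string standing in for the ConvertToText output. The fruit list and the helper name filter_rows are illustrative assumptions.

```perl
use strict;
use warnings;

# Illustrative look-up table; replace with the real list of values.
my %fruit = map { lc($_) => 1 } qw( apple banana pear );

# Hypothetical helper: keep the "\r"-separated rows whose first
# tab-separated field is a key of %$wanted, ignoring case and
# surrounding whitespace.
sub filter_rows {
    my ($text, $wanted) = @_;
    my @kept;
    for my $line ( split /\r/, $text ) {
        my ($first) = split /\t/, $line;
        next unless defined $first;      # skip empty lines
        $first =~ s/^\s+|\s+$//g;
        push @kept, $line if exists $wanted->{ lc $first };
    }
    return @kept;
}

# Stand-in for $table->ConvertToText(wdSeparateByTabs)->Text
my $text = join "\r", "Apple\tpl", "banana\txml", "California\tcsv",
                      "pickle\ttxt", "Illinois\tgov", "pear\tdoc";

print "$_\n" for filter_rows($text, \%fruit);
```

With the sample data this keeps the Apple, banana and pear rows and drops the rest.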