combine keys of hashes for output (outer join of hashes)

189 Views Asked by At

I'm analysing a log file with Perl 5.8.8.[1] I'm searching for days that expose some of two trigger patterns, maybe one of it, maybe both (I changed the actual patterns in the code snippet shown below). I'm interested in the count of occurrences per day, next step will be to make a spreadsheet of it, that's why the output formatting with tabs.

Because only one of the patterns may occur in a day, I need a way to combine the keys of both hashes. I did by generating a new hash. Is there a built-in function for that? I searched the web and stack overflow without any result, the only hit I got here was Build a string from 2 hashes, but in that case the key sets were identical.

#!/usr/bin/perl -w
use strict;
use warnings;
use locale;

# input analysis: searching for two patterns:
my %pattern_a = ();
my %pattern_b = ();
foreach my $line (<>) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        if ($what =~ m/beendet/) {
            $pattern_a{$day} ++;
        } elsif ($what =~ m/ohne/) {
            $pattern_b{$day} ++;
        }
    }
}

# generate the union of hash keys:        <-- In Question
my %union = ();
$union{$_} = 1 for keys %pattern_a;
$union{$_} = 1 for keys %pattern_b;

# formatted output sorted by day:
foreach my $day (sort keys %union) {
    print join "\t", $day, 
            ($pattern_a{$day} || 0), 
            ($pattern_b{$day} || 0)."\n";
}

The expected output would look like this:

2017-02-01      0       1
2017-02-18      0       592
2017-02-19      2       0

[1] I'm aware that this Perl version is quite outdated. But I'm using Perl rarely, but when I do, it has to go fast. So figuring out Perl versions and so on gets done later. But the Perl version is not so important for the actual question, at least I hope so...

2

There are 2 best solutions below

11
On BEST ANSWER

It's easier to structure your data first by day, then by pattern. That can be done using a hash reference.

use strict;
use warnings;

my %matches;
while ( my $line = <DATA> ) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        if ($what =~ m/beendet/) {
            $matches{$day}->{a} ++;
        } elsif ($what =~ m/ohne/) {
            $matches{$day}->{b} ++;
        }
    }
}

# formatted output sorted by day:
foreach my $day (sort keys %matches) {
    print join(
        "\t",
        $day,
        $matches{$day}->{a} || 0,
        $matches{$day}->{b} || 0,
    ), "\n";
}

__DATA__
2017-02-01 einmal Pommes ohne
2017-02-02 Wartung gestartet
2017-02-02 Wartung beendet
2017-02-03 ohne Moos nix los

That program produces output as follows

2017-02-01  0   1
2017-02-02  1   0
2017-02-03  0   1

To understand the data structure, you can use Data::Dumper to output it (though I suggest using Data::Printer instead, as that's intended for human consumption and not as a serialization).

use Data::Dumper;
print Dumper \%matches;
__END__

$VAR1 = {
          '2017-02-03' => {
                            'b' => 1
                          },
          '2017-02-02' => {
                            'a' => 1
                          },
          '2017-02-01' => {
                            'b' => 1
                          }
        };

As you can see, the data is structured first by date. Each key represents one day. Inside, there is an additional hash reference that only holds one key. That's the pattern. Later we iterate the day first. Then we get

{
    'b' => 1
}

in the first iteration. Then we iterate all the patterns. The above program does this not by actually iterating, but by explicitly stating each possible key. If it's there it's used. If it's not defined, it's set to 0 with the || operator.


The program can be further simplified to use arbitrary patterns. If you don't care about the order of the patterns in the output, include a header and you can easily add more patterns later.

I used a config hash for the patterns, and Text::Table to create the output.

use strict;
use warnings;
use Text::Table;

my %matches;
my %patterns = (
    beendet => qr/beendet/,
    ohne    => qr/ohne/,
    komplex => qr/foo\sbar?/, # or whatever
);
while ( my $line = <DATA> ) {
    if ($line =~ m/^(\d{4}-\d{2}-\d{2})(.+)$/) {
        my $day = $1;
        my $what = $2;
        foreach my $name ( sort keys %patterns ) {
            if ( $what =~ $patterns{$name} ) {
                $matches{$day}->{$name}++ ;
                last;
            }
        }
    }
}

# formatted output sorted by day:
my @head = sort keys %patterns;
my $tb = Text::Table->new( 'Tag', @head );

foreach my $day (sort keys %matches) {
    $tb->load([ $day, map { $matches{$day}->{$_} || 0 } @head ]);
}

print $tb;

__DATA__
2017-02-01 einmal Pommes ohne
2017-02-02 Wartung gestartet
2017-02-02 Wartung beendet
2017-02-03 ohne Moos nix los

This prints

Tag        beendet komplex ohne
2017-02-01 0       0       1   
2017-02-02 1       0       0   
2017-02-03 0       0       1   

If you don't want to install an additional module, maybe just create a CSV file. Since you're from Germany, I suggest a semicolon ; as the separator, because German Excel uses that as the default.

Here is a verbose example of how to do this instead of Text::Table.

my @head = sort keys %patterns;
print join( ';', @head ), "\n";
foreach my $day (sort keys %matches) {
    my @cols;
    push @cols, $matches{$day}->{$_} || 0 for @head;
    print join ';', $day, @cols;
    print "\n";
}

And the output is

beendet;komplex;ohne
2017-02-01;0;0;1
2017-02-02;1;0;0
2017-02-03;0;0;1

But you should also look into Text::CSV if you don't want this to go to the screen.

4
On

Wouldn't it be easier to use a single hash?

#!/usr/bin/perl
use strict;
use warnings;

my %stats;

while (my $line = readline) {
    my ($day, $pattern) = $line =~ /^(\d{4}-\d{2}-\d{2}).*(beendet|ohne)/
        or next;

    $stats{$day}{$pattern}++;
}

for my $day (sort keys %stats) {
    printf "%s\t%d\t%d\n",
        $day,
        $stats{$day}{beendet} // 0,
        $stats{$day}{ohne} // 0;
}

If you're using a perl before 5.10, replace // by ||; it makes no effective difference in this case. (But consider upgrading: 5.8.8 is from 2006. It's now more than a decade old. The officially maintained perl versions are 5.22 (2015) and 5.24 (2016).)