Remove duplicate keys from a hash of hashes and arrays (and ensure any resulting empty hashs are also removed)

826 Views Asked by At

I have a data stream in JSON format that my script accesses from an internal website. My script converts the JSON to a perl hash using JSON.pm (I'm using perl 5.10.1 on RHEL 6.9)

Within this hash are multiple nested hashes, and nested arrays, some of which are nested within other hashes/arrays inside of the big hash.

I need to walk the entire structure of the hash, including all of the arrays and nested hashes, and remove any keys anywhere in the entire structure, which share the same name as any other key (only for a specific key name though).

Additionally, because of how the data is structured, some nested hashes have ONLY keys that are now deleted, leaving the value for some keys as an empty hash. I also need to remove those keys which have an empty hash for its value

Here is my data after its conversion to perl:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'husky' => {
                                                'name' => 'fred'
                                             },
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'husky' => 'wilma',
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles' # yes this is intentionally a scalar in the example
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                     {
                        'e' => {
                                  'husky' => 'dino'
                               },
                     },
                   ],
        }

We want to remove all keys named 'husky'

Here is what it should look like:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'labrador' => 'betty'
                               },
                        'c' => 'pebbles'
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                   ],
        }

Here is what I get after I added @Shawn's code and made a tweak to it (this is very close, but we need to account for the empty hashes:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles' # yes this is intentionally a scalar in the example
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                     {
                        'e' => {},
                     },
                   ]
        }

I've tried a few variations found elsewhere on SO and on perlmonks. keys %$_ == 0, !%$_ to name a few. But none seem to work with this hash slice.

Code:

use 5.008008;
use strict;
use warnings;
use English; # I know I know, don't use English...
use JSON;
use YAML::Tiny qw(Dump);
# proprietary modules I wrote added here, which themselves load in LWP, HTTP::Cookie and others, and they do the bulk of building and sending the request. They are the back end to this script's front end.

[-snipped a ton of code-]

sub _count_keys
{
    my ($j, $seen) = @ARG;
    my $type = ref $j;
    if ($type eq "ARRAY")
    {
        for (@{$j})
        {
            _count_keys($ARG, $seen);
        }
    }
    elsif ($type eq "HASH")
    {
        while (my ($key, $val) = each %{$j})
        {
            $seen->{$key}++;
            if (ref $val)
            {
                _count_keys($val, $seen);
            }
        }
    }
    return $seen;
}

sub _remove_duplicate_keys
{
    my ($j, $seen) = @ARG;
    $seen //= _count_keys($j, {});

    my $type = ref $j;
    if ($type eq "ARRAY")
    {
        return [ map { _remove_duplicate_keys($ARG, $seen) } @{$j} ];
    }
    elsif ($type eq "HASH")
    {
        my %obj = %{$j};
        delete @obj{grep { $seen->{$ARG} > 1 and $ARG eq 'keyNameToBeExcluded'} keys %obj};
# Here is where I have been putting another delete line but I can't seem to find the right parameters for the grep to make it delete the empty anon hashes. Example of what I tried is the next comment below
#        delete @obj{grep { $seen->{$ARG} > 1 and keys $ARG{assetDetails} == 0 } keys %obj};

        while (my ($key, $val) = each %obj)
        {
            if (ref $val)
            {
                $obj{$key} = _remove_duplicate_keys($val, $seen);
            }
        }
        return \%obj;
    }
    else
    {
        return $j;
    }
}

sub _process_json
{
    my $JSONOUTPUT   = shift;
    my $OPTIONS      = shift;

    # Change true to 1 and false to 0 to prevent blessed objects from appearing in the JSON, which prevents the YAML::Tiny module from barfing
    foreach (@{$JSONOUTPUT})
    {
        s{true(,\n)}{1$1}gxms;
        s{false(,\n)}{0$1}gxms;
    }

    my $JSONPERLOBJ  = JSON->new->utf8->decode(@{$JSONOUTPUT});

# Test code below here; real code not in use while I test getting the output right.
use Data::Dumper;
my $BEFORE = $JSONPERLOBJ;
my $AFTER = _remove_duplicate_keys($JSONPERLOBJ);
#    $JSONPERLOBJ = _remove_duplicate_keys($JSONPERLOBJ);
#print Dumper $BEFORE;
print Dumper $AFTER;
exit 1;
# End test code
}
sub _main
{
    [-snip private code-]
    my @JSONOUTPUT = $RESPONSE->decoded_content;
    my $RC = _process_json(\@JSONOUTPUT, $OPTIONS);

    exit ($RC == 1)?0:1;
}
2

There are 2 best solutions below

17
On BEST ANSWER

I think this does what you want:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use JSON::XS; # Better than JSON; also see JSON::MaybeXS

my $j = <<EOJSON;
{
  "foo": 1,
  "bar": {
      "foo": true,
      "baz": false
      },
  "dog": "woof",
  "cat": [ { "foo": 3 } ]
}
EOJSON

sub count_keys {
  my ($j, $seen) = @_;
  my $type = ref $j;
  if ($type eq "ARRAY") {
    count_keys($_, $seen) for @$j;
  } elsif ($type eq "HASH") {
    while (my ($key, $val) = each %$j) {
      $seen->{$key}++;
      count_keys($val, $seen) if ref $val;
    }
  }
  return $seen;
}

sub remove_dups {
  my ($j, $seen) = @_;
  $seen //= count_keys($j, {});

  my $type = ref $j;
  if ($type eq "ARRAY") {
    return [ map { remove_dups($_, $seen) } @$j ];
  } elsif ($type eq "HASH") {
    my %obj = %$j;
    delete @obj{grep { $seen->{$_} > 1 } keys %obj};
    while (my ($key, $val) = each %obj) {
      $obj{$key} = remove_dups($val, $seen) if ref $val;
    }
    return \%obj;
  } else {
    return $j;
  }
}

my $parsed = decode_json $j;
my $printer = JSON::XS->new->pretty->canonical;
say "Before:";
print $printer->encode($parsed);
say "After:";
my $dedup = remove_dups $parsed;
print $printer->encode($dedup);

produces

Before:
{
   "bar" : {
      "baz" : false,
      "foo" : true
   },
   "cat" : [
      {
         "foo" : 3
      }
   ],
   "dog" : "woof",
   "foo" : 1
}
After:
{
   "bar" : {
      "baz" : false
   },
   "cat" : [
      {}
   ],
   "dog" : "woof"
}

Edit for explanation:

The first time remove_dups is called on a perl data structure representing a json value (Which doesn't have to be a json object), it calls count_keys to recursively walk the structure and create a hash of all the keys and the number of times each one occurs. Then it again recursively walks the structure, returning a deep copy without keys that appeared more than once in the original.

This line is the real magic:

delete @obj{grep { $seen->{$_} > 1 } keys %obj};

It uses a hash slice to delete a bunch of keys all at once, with the grep bit returning a list of keys that appeared more than once. More information on slices.

6
On

I take it that Shawn's answer works for removing duplicates, what it looks good for.

The follow up problem is that we may end up with empty structures and those need be removed as well. But then there may also be structures that contain only empty structures, etc, and I assume that all such need be gone.

I use the desired-result-hashref from the question (from which I remove one name=>... so that there are no duplicates) and add some empty trouble.

use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd pp);

my $hr = {
    'cat' => 'meow',
    'dog' => [
        { 'a' => { 'chow' =>  { 'name' => 'barney' } }  },
        { 'b' => { 'lab' => 'betty' }, 'c' => 'pebbles' },
        { 'd' => { 'shihtzu' => 'bambam' }              },
        {   # all of the following need to go, and this hashref
            'e' => { },  
            'f' => { noval => { } },
            'g' => [ { }, { nada => { } }, [ ] ],
        },
    ],  
};
dd $hr; say '';

for my $k (sort keys %$hr) {
    next_level($hr, $k, $hr->{$k}, 'key');
}

# Takes: data structure (reference), key/index at which it is found, 
# its value for it, and description string of which it is, 'key|idx'
sub next_level {
    my ($ds, $index, $val, $kind) = @_;
    my $type = ref $val;

    if ($type eq 'ARRAY') {
        for my $i (0..$#$val) {
            next_level(
                ( $kind eq 'key' ? $ds->{$index} : $ds->[$index] ),
                $i, $val->[$i], 'idx' 
            );
        }

        # Collect indices for and delete elements that are empty
        my @to_delete;
        for my $i (0..$#$val) {
            if ( (ref $val->[$i] eq 'HASH'  and not keys %{$val->[$i]}) or
                 (ref $val->[$i] eq 'ARRAY' and not      @{$val->[$i]})  )
            {
                say "No value/empty for index $i, record for deletion";
                push @to_delete, $i;
            }
        }
        if (@to_delete) { 
            my %ref_idx = map { $_ => 1 } @to_delete;
            @$val = @$val[ grep { not exists $ref_idx{$_} } 0..$#$val ];
        }
    }
    elsif ($type eq 'HASH') {
        for my $k (sort keys %{$val}) {
            my $ds_next_level = 
                ($kind eq 'key') ? $ds->{$index} : $ds->[$index];

            next_level( $ds_next_level, $k, $val->{$k}, 'key' );

            # Delete if empty 
            if ( (ref $val->{$k} eq 'HASH'  and not keys %{$val->{$k}}) or
                 (ref $val->{$k} eq 'ARRAY' and not      @{$val->{$k}})  )
            {
                say "No value/empty for key $k, delete";
                delete $ds_next_level->{$k}
            }
        }
    }
    #elsif (not $type) { say "A value: ", $val }
}
say ''; dd $hr;

This is a normal recursive traversal of a complex data structure but with a twist: in order to be able to delete components the recursive sub also needs the data structure itself, at which key (in a hashref) or index (in an arrayref) it is found, and which of the two it was, a key or an index.

After the recursion the target is deleted if it is empty, if it's in a hashref. An arrayref is scanned for all empty elements first and then they are removed by overwriting the arrayref, with an array slice that excludes indices for elements that contain only empty data structures.

For the exclusion of "bad" indices a reference hash is used for efficiency. Overwriting the array may be faster using map (see this post), or it may not be if slicing allows specific (interpreter) optimizations.

The output

{
  cat => "meow",
  dog => [
           { a => { chow => { name => "barney" } } },
           { b => { lab => "betty" }, c => "pebbles" },
           { d => { shihtzu => "bambam" } },
           { e => {}, f => { noval => {} }, g => [{}, { nada => {} }, []] },
         ],
}

No value/empty for key e, delete
No value/empty for key noval, delete
No value/empty for key f, delete
No value/empty for key nada, delete
No value/empty for index 0, record for deletion
No value/empty for index 1, record for deletion
No value/empty for index 2, record for deletion
No value/empty for key g, delete
No value/empty for index 3, record for deletion

{
  cat => "meow",
  dog => [
           { a => { chow => { name => "barney" } } },
           { b => { lab => "betty" }, c => "pebbles" },
           { d => { shihtzu => "bambam" } },
         ],
}