Perl "scrub" characters while parsing


I'm parsing a file. The first thing I do is concatenate the first three fields and prepend the result to each record. Then I want to scrub the data of any colons, single quotes, double quotes, or backslashes. The code below is how I'm doing it now. Is there a more efficient way to do the scrubbing on the $line variable directly?

# Read the lines one by one.
while($line = <$FH>) {

    # Split the fields, concatenate the first three fields,
    # and add the result to the beginning of each line in the file.
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];

    # Scrub data of characters that cause scripting problems down the line.
    $_ =~ s/:/ /g for @fields[0..39];
    $_ =~ s/\'/ /g for @fields[0..39];
    $_ =~ s/"/ /g for @fields[0..39];
    $_ =~ s/\\/ /g for @fields[0..39];
}

There are 2 best solutions below


I am certain that I have seen a very similar question before, but my simple searches don't find it. What stands out is adding a new field, derived from the original values, in front of all the others.

You've already described that best in Perl terms:

unshift @fields, join '_', @fields[0..2];

So the only step left is the removal of rogue characters: single and double quotes, colons, and backslashes.

Your code seems to work fine. The only changes I would make are:

  • Use the default variable $_ properly. I think this is what newcomers hate most about Perl, and then come to love most once they understand it.

  • Use tr///d instead of s///. It may add a little speed, but above all it frees you from regex syntax when all you want is to list the characters to delete.

I think this should do what you need:

use strict;
use warnings 'all';

while ( <DATA> ) {

    chomp;
    my @fields = split /,/;

    unshift @fields, join '_', @fields[0..2];

    tr/:"'\\//d for @fields; # Delete colons, quotes, and backslash

    print join(',', @fields), "\n";
}

__DATA__
a:a,b"bb",c'ccc',ddd,e,f,g,h

output

aa_bbb_cccc,aa,bbb,cccc,ddd,e,f,g,h
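
Note that tr///d deletes the listed characters, whereas the s/.../ /g lines in the question replace each one with a space. If the spaces matter, tr can do that as well. The following is a minimal sketch on made-up sample fields; the names @spaced and @deleted are only illustrative and do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical sample fields, not taken from the asker's real data.
# The last one contains a single backslash.
my @fields = ( 'a:a', 'b"bb"', q{c'ccc'}, 'd\\d' );

# Without /d, a replacement list shorter than the search list is padded
# with its last character, so every listed character becomes one space.
my @spaced = @fields;
tr/:"'\\/ / for @spaced;

# With /d and an empty replacement list, the listed characters are deleted.
my @deleted = @fields;
tr/:"'\\//d for @deleted;

print join( ',', @spaced ),  "\n";    # prints: a a,b bb ,c ccc ,d d
print join( ',', @deleted ), "\n";    # prints: aa,bbb,cccc,dd
```

Both forms modify the array elements in place, because $_ is aliased to each element inside the for statement modifier.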

What I would find cleaner is to scrub $line once, with a single character class, before splitting it:

while($line = <$FH>) {
    chomp($line);

    $line =~ s/[:\'"\\]/ /g;

    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
}

And as @HunterMcMillen said, if this is a standard CSV file, it would be better to use a parsing module. A plain split(/,/) breaks on quoted fields that contain commas, so a real parser will be easier to live with down the road.
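
As a rough sketch of the module route, here is the same prepend-and-scrub step using Text::CSV. This assumes Text::CSV is installed from CPAN (it is not part of core Perl), and the sample record is invented for illustration.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module; assumed to be installed

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );

# Hypothetical record: the second field holds a comma inside quotes,
# which a plain split /,/ would cut in two.
my $line = 'a:a,"b,bb",ccc,ddd';

$csv->parse($line) or die "Cannot parse line: $line";
my @fields = $csv->fields;

# Same steps as in the question: prepend the joined key, then scrub.
unshift @fields, join '_', @fields[ 0 .. 2 ];
tr/:"'\\/ / for @fields;

# Let the module re-quote fields as needed when writing the record back.
$csv->combine(@fields) or die "Cannot combine fields";
print $csv->string, "\n";
```

Here the scrub has to happen after parsing, because the double quotes in the raw line are part of the CSV syntax and not part of the data.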