I have written a Perl script that loads a list of "stop words" into an array.
It then reads a directory containing ".ner" files. Each file is opened, and each word is split out and compared against the words in the stop list. If a word matches a stop word, it is changed to "" (nothing, so it gets removed). I then copy the file to another location so that I can tell the files with stop words apart from the files without. But does this change the file so that it no longer contains stop words, or will it revert to the original?
#!/usr/bin/perl
#use strict;
#use warnings;
my @stops;
my @file;
use File::Copy;
open( STOPWORD, "/Users/jen/stopWordList.txt" ) or die "Can't Open: $!\n";
@stops = <STOPWORD>;
while (<STOPWORD>) #read each line into $_
{
    chomp @stops; # Remove newline from $_
    push @stops, $_; # add the line to @triggers
}
close STOPWORD;
$dirtoget="/Users/jen/temp/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD);
foreach $f (@thefiles){
    if ($f =~ m/\.ner$/){
        print $f,"\n";
        open (FILE, "/Users/jen/temp/$f")or die"Cannot open FILE";
        if ( FILE eq "" ) {
            close FILE;
        }
        else{
            while (<FILE>) {
                foreach $word(split(/\|/)){
                    foreach $x (@stops) {
                        if ($x =~ m/\b\Q$word\E\b/) {
                            $word = '';
                            copy("/Users/jen/temp/$f","/Users/jen/correct/$f")or die "Copy failed: $!";
                            close FILE;
                        }
                    }
                }
            }
        }
    }
}
closedir(IMD);
exit 0;
The format of the file I am splitting and comparing is as follows:
'<title>|NN|O Woman|NNP|O jumped|VBD|O for|IN|O life|NN|O after|IN|O firebomb|NN|O attack|NN|O -|:|O National|NNP|I-ORG News|NNP|I-ORG ,|,|I-ORG Frontpage|NNP|I-ORG -|:|I-ORG Independent.ie</title>|NNP|'
Should I be specifying where the words should be split, i.e. split(/|/)?
@jenniem001,
That will remove stop words from your file and create a duplicate. Just give $duplicate a name :)
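For illustration, here is a minimal sketch of that idea, not a drop-in replacement for your script. It assumes each token has the form word|TAG|LABEL and that tokens are separated by spaces, as in your sample line, and that stop words should match case-insensitively. The helper names strip_stop_words and write_without_stops are made up for this example; $duplicate is whatever output path you choose. Note that changing $word in memory never touches the file on disk, so the filtered text has to be written out explicitly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Drop every "word|TAG|LABEL" token whose word part is a stop word.
# $stops is a hash reference keyed by lowercased stop word (an assumption
# of this sketch; the original script keeps the stop words in an array).
sub strip_stop_words {
    my ($line, $stops) = @_;
    my @kept;
    for my $token (split ' ', $line) {
        # The pipe must be escaped: split /|/ matches the empty string
        # and would split between every character.
        my ($word) = split /\|/, $token;
        next if defined $word && exists $stops->{ lc $word };
        push @kept, $token;
    }
    return join ' ', @kept;
}

# Read $source and write the filtered text to $duplicate.
# $source itself is left unchanged.
sub write_without_stops {
    my ($source, $duplicate, $stops) = @_;
    open my $in,  '<', $source    or die "Cannot open $source: $!";
    open my $out, '>', $duplicate or die "Cannot write $duplicate: $!";
    while (my $line = <$in>) {
        chomp $line;
        print {$out} strip_stop_words($line, $stops), "\n";
    }
    close $in;
    close $out or die "Cannot close $duplicate: $!";
}

# Small in-memory demonstration with two hypothetical stop words.
my %stops;
$stops{ lc $_ } = 1 for qw(for after);
print strip_stop_words('jumped|VBD|O for|IN|O life|NN|O', \%stops), "\n";
# prints: jumped|VBD|O life|NN|O
```

To process a real file you would call something like write_without_stops("/Users/jen/temp/$f", $duplicate, \%stops) inside your loop over the ".ner" files. A hash lookup is used instead of looping over the stop array with a regex, so each word is checked in constant time and partial matches inside longer stop-list lines are avoided.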