How to use Lingua::EN::Ngram for multiple files

I am implementing a naive Bayes classification algorithm. My training set consists of a number of abstracts in separate files. I want to use n-grams to get the term frequency weights, but my code does not handle multiple files.

I edited my code, and now the error I get is: Can't call method "tscore" on an undefined value. To check this, I printed @ngrams, and it shows junk values like hash0*29G45 or something like that.

  #!c:\perl\bin\perl.exe -w

  use warnings;

  use Algorithm::NaiveBayes;
  use Lingua::EN::Splitter qw(words);
  use Lingua::StopWords qw(getStopWords);
  use Lingua::Stem;
  use Algorithm::NaiveBayes;
  use Lingua::EN::Ngram;
  use Data::Dumper;
  use Text::Ngram;
  use PPI::Tokenizer;
  use Text::English;
  use Text::TFIDF;
  use File::Slurp;

  my $pos_file  = 'D:\aminoacids';
  my $neg_file  = 'D:\others';
  my $test_file = 'D:\testfiles';
  my @vectors   = ();

  my $categorizer = Algorithm::NaiveBayes->new;

  my @files = <$pos_file/*>;
  my @ngrams;
  for my $filename (@files) {

    open(FH, $filename);

    my $ngram = Lingua::EN::Ngram->new($filename);

    my $tscore = $ngram->tscore;

    foreach (sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore) {
      print "$$tscore{ $_ }\t" . "$_\n";
    }

    my $trigrams = $ngram->ngram(2);

    foreach my $trigram (sort { $$trigrams{$b} <=> $$trigrams{$a} } keys %$trigrams) {
      print $$trigrams{$trigram}, "\t$trigram\n";
    }

    my %positive;

    $positive{$_}++ for @files;

    $categorizer->add_instance(
      attributes => \%positive,
      label      => 'positive'
    );
  }

  close FH;

There is 1 best solution below

Your code <$pos_file/*> should work fine (thanks @borodir). Still, here is an alternative, which I am leaving in so as not to mess up the history. Try

opendir (DIR, $directory) or die $!;

and then

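 while (my $filename = readdir(DIR)) {

    next if $filename eq '.' or $filename eq '..';   # skip the directory entries

    # readdir returns bare names, so prepend the directory before opening
    open ( my $fh, '<', "$directory/$filename" ) or die "$directory/$filename: $!";

    # work with filehandle

    close $fh;

}

closedir DIR;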

If called in list context, readdir returns all remaining directory entries as a list:

my @filenames = readdir(DIR);
# you could pass this list to the function you wanted to call; the names are
# relative to the directory, and each file would still need to be opened
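
Since readdir hands back bare entry names (including . and ..) rather than paths, it can help to turn them into full paths before opening anything. A minimal sketch using only core modules; the helper name list_files is made up for illustration:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full paths of the plain files in $directory.
sub list_files {
    my ($directory) = @_;
    opendir(my $dh, $directory) or die "Cannot open $directory: $!";
    my @paths =
        grep { -f $_ }                                # keep plain files only; drops . and ..
        map  { File::Spec->catfile($directory, $_) }  # prepend the directory
        readdir($dh);
    closedir $dh;
    my @sorted = sort @paths;
    return @sorted;
}
```

The returned paths could then be used wherever @filenames appears, without a separate open per file.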

Another point:

If you want to pass a reference to an array, do it like so:

function( list => \@stems );
# thus, your ngram line should probably rather be

my $ngram = Lingua::EN::Ngram->new (file => \@stems );

However, the docs for Lingua::EN::Ngram only talk about a scalar for file and so on; the constructor does not seem to expect an array as input. (The exception is the 'intersection' method.)

So you would have to put it in a loop and cycle through the files, or use map:

my @ngrams = map { Lingua::EN::Ngram->new( file => $_ ) } @filenames;

Opening a filehandle first seems unnecessary; Lingua::EN::Ngram opens the file by itself.

If you prefer a loop:

my @ngrams;
for my $filename ( @filenames ){ 
   push @ngrams, Lingua::EN::Ngram->new( file => $filename );
}
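
With one object per file, each abstract can also contribute its own hash of counts as classifier attributes (the question's code counts file names instead). A minimal sketch in plain Perl of what such a per-file bigram table looks like; bigram_counts is a hypothetical stand-in written for illustration, and its simple \w+ tokenizer is an assumption, not the module's own:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bigram table: returns a hash reference
# mapping "word1 word2" to the number of times that pair occurs in $text.
sub bigram_counts {
    my ($text) = @_;
    my @words = map { lc } $text =~ /(\w+)/g;   # naive tokenizer (assumption)
    my %count;
    for my $i (0 .. $#words - 1) {
        $count{"$words[$i] $words[$i + 1]"}++;
    }
    return \%count;
}

# Print the bigrams of a sample sentence, most frequent first.
my $counts = bigram_counts('the amino acid binds the amino group');
for my $bigram (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$counts->{$bigram}\t$bigram\n";
}
```

A hash reference of this shape is what add_instance takes as attributes in the question's code, e.g. attributes => $counts with label => 'positive'.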

I think I now understand what you actually want to do.

To get the t-score: you wrote $tscore = $ngram->tscore, but at that point $ngram is no longer defined.

I am not sure how to get the t-score for a single word; the description ("significance of word in text") suggests that it needs a text.

Thus: make an ngram object not for each word, but for each sentence or each file. Then you can determine the t-score of a word in that sentence or file (text).

for my $filename ( @files ){
   my $ngram = Lingua::EN::Ngram->new( file => $filename );

   my $tscore = $ngram->tscore(); 
   # tscore returns a hash reference: keys are bigrams, values are t-scores.
   # Now you can do with the t-scores what you like. Note that tscore only covers
   # bigrams; for n-grams of arbitrary length you would have to compute it yourself.