I am implementing a naive Bayes classification algorithm. My training set consists of a number of abstracts, each in a separate file. I want to use n-grams to get the term-frequency weights, but my code does not handle multiple files.
I edited my code, and now the error I am getting is: Can't call method "tscore" on an undefined value. To check this, I printed @ngrams, and it shows me junk values like hash0*29G45 or something like that.
#!c:\perl\bin\perl.exe -w
use warnings;
use Algorithm::NaiveBayes;
use Lingua::EN::Splitter qw(words);
use Lingua::StopWords qw(getStopWords);
use Lingua::Stem;
use Algorithm::NaiveBayes;
use Lingua::EN::Ngram;
use Data::Dumper;
use Text::Ngram;
use PPI::Tokenizer;
use Text::English;
use Text::TFIDF;
use File::Slurp;
my $pos_file = 'D:\aminoacids';
my $neg_file = 'D:\others';
my $test_file = 'D:\testfiles';
my @vectors = ();
my $categorizer = Algorithm::NaiveBayes->new;
my @files = <$pos_file/*>;
my @ngrams;
for my $filename (@files) {
open(FH, $filename);
my $ngram = Lingua::EN::Ngram->new($filename);
my $tscore = $ngram->tscore;
foreach (sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore) {
print "$$tscore{ $_ }\t" . "$_\n";
}
my $trigrams = $ngram->ngram(2);
foreach my $trigram (sort { $$trigrams{$b} <=> $$trigrams{$a} } keys %$trigrams) {
print $$trigrams{$trigram}, "\t$trigram\n";
}
my %positive;
$positive{$_}++ for @files;
$categorizer->add_instance(
attributes => \%positive,
label => 'positive'
);
}
close FH;
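Your code <$pos_file/*> should work fine (thanks @borodir). Still, here is an alternative, kept so as not to mess up the edit history: open the directory with opendir and then read its entries with readdir.
Called in list context, readdir should give you a list of all entries in the directory. Note that these are bare names without the directory prefix, and that they include . and ..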
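A minimal sketch of the opendir/readdir approach, using only core Perl. The helper name list_files is made up for this illustration. It prepends the directory to each name and keeps only plain files.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full paths of all plain files in $dir.
sub list_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    # In list context readdir returns every entry, including '.' and '..'
    my @entries = readdir $dh;
    closedir $dh;
    # Prepend the directory, then keep plain files only
    # (this drops '.', '..' and any subdirectories).
    my @files = sort grep { -f $_ }
                map { File::Spec->catfile($dir, $_) } @entries;
    return @files;
}

# Usage: perl list_files.pl D:/aminoacids
if (@ARGV) {
    print "$_\n" for list_files($ARGV[0]);
}
```

Forward slashes in the path work on Windows too, which avoids having to escape backslashes.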
Another point: if you want to pass a reference to an array, write it as \@files.
However, the docs for Lingua::EN::Ngram only mention a scalar for the file argument; the constructor does not seem to expect an array as input (the exception being the 'intersection' method).
So you would have to cycle through the files and create one object per file, either with map or with a loop. Opening a filehandle first seems unnecessary, since Lingua::EN::Ngram reads the file by itself.
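A rough sketch of the loop variant, assuming the constructor form new( file => $path ) from the Lingua::EN::Ngram documentation. The helper name bigrams_per_file is made up for this illustration.

```perl
use strict;
use warnings;
use Lingua::EN::Ngram;

# Hypothetical helper: build one Lingua::EN::Ngram object per file and
# return a hash ref mapping each file name to its bigram counts.
sub bigrams_per_file {
    my (@files) = @_;
    my %counts;
    for my $file (@files) {
        # The constructor reads the file itself, so no open() is needed.
        my $ngram = Lingua::EN::Ngram->new( file => $file );
        # ngram(2) returns a hash ref: bigram => frequency
        $counts{$file} = $ngram->ngram(2);
    }
    return \%counts;
}

# Usage: perl bigrams.pl file1.txt file2.txt
if (@ARGV) {
    my $counts = bigrams_per_file(@ARGV);
    for my $file (sort keys %$counts) {
        my $bigrams = $counts->{$file};
        print "$file\t$bigrams->{$_}\t$_\n"
            for sort { $bigrams->{$b} <=> $bigrams->{$a} } keys %$bigrams;
    }
}
```

Each object is created inside the loop body, so it is still defined when its methods are called.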
I think I now understand what you actually want to do: get the t-score. You wrote $tscore = $ngram->tscore, but $ngram is not defined anymore at that point. I am not sure how to get the t-score for a single word; the description ("significance of word in text") implies that a text is needed.
Thus: make an ngram object not for each word, but for each sentence or each file. Then you can determine the t-score of a word within that sentence or file (the text).
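The per-file idea could look roughly like this. It assumes that tscore returns a hash ref of bigram => score, as the loop in the question suggests. The helper name top_tscores and its $limit parameter are made up for this illustration.

```perl
use strict;
use warnings;
use Lingua::EN::Ngram;

# Hypothetical helper: return up to $limit [bigram, t-score] pairs for
# one file, highest score first.
sub top_tscores {
    my ($file, $limit) = @_;
    # One object per file, so the t-score is computed against that text.
    my $ngram  = Lingua::EN::Ngram->new( file => $file );
    my $tscore = $ngram->tscore;    # hash ref: bigram => t-score
    my @ranked = sort { $tscore->{$b} <=> $tscore->{$a} } keys %$tscore;
    # Truncate the ranking to the requested number of entries.
    $#ranked = $limit - 1 if @ranked > $limit;
    return map { [ $_, $tscore->{$_} ] } @ranked;
}

# Usage: perl tscore.pl abstract.txt
if (@ARGV) {
    printf "%.3f\t%s\n", $_->[1], $_->[0] for top_tscores($ARGV[0], 10);
}
```

Calling this once per training file keeps the object and its scores in the same scope, which avoids the "undefined value" error from the question.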