Delphi Phrase Count / Keyword Density


Does anyone know how to, or have some code for, counting the number of unique phrases in a document? (Single words, two-word phrases, three-word phrases.)

Thanks

Example of what I'm looking for: I have a text document, and I need to see which word phrases are the most popular. Example text:

I took the car to the car wash.

I : 1
took : 1
the : 2
car: 2
to : 1
wash : 1
I took : 1
took the : 1
the car : 2
car to : 1
to the : 1
car wash : 1
I took the : 1
took the car : 1
the car to : 1
car to the : 1
to the car : 1
the car wash : 1
I took the car to : 1
took the car to the : 1
the car to the car : 1
car to the car wash : 1

I need each phrase and the number of times it shows up.

Any help would be appreciated. The closest thing I found to this was a PHP script from http://tools.seobook.com/general/keyword-density/source.php

I used to have some code for this, but I cannot find it.


There are 4 answers below

On BEST ANSWER

Here is some initial code that solves your problem.

function CountWordSequences(const s:string; Counts:TStrings = nil):TStrings;
var
  words, seqs : TStrings;
  nw,i,j:integer;
  t :string;
begin
  if Counts=nil then Counts:=TStringList.Create;
  words:=TStringList.Create;        // build a list of all words
  words.DelimitedText:=s;
  seqs:=TStringList.Create;
  for nw:=1 to words.Count do       // build a list of all word sequences
   begin
    for i:=0 to words.Count-nw do
     begin
      t:='';
      for j:=0 to nw-1 do
       begin
        t:=t+words[i+j];
        if j<>nw-1 then t:=t+' ';
       end;
      seqs.Add(t);
     end;
   end;
  words.Free;                       // done with the flat word list
  for i:=0 to seqs.Count-1 do         // count repeated sequences
   begin
    j:=Counts.IndexOf(seqs.Strings[i]);
    if j=-1 then
      Counts.AddObject(seqs.Strings[i],TObject(1))
    else
      Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
   end;
  seqs.Free;
  result:=Counts;
end;

You will need to elaborate on this code for real-world use, for example by recognizing more word delimiters (not only blanks; note the "wash." in the output below) and by implementing some sort of case insensitivity.
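For instance, a minimal normalization pass (a hypothetical NormalizeText helper, assuming a Unicode Delphi where SysUtils.CharInSet is available) could lower-case the text and map anything that is not a letter or digit to a space before the split:

function NormalizeText(const s: string): string;
var
  i: Integer;
begin
  // Hypothetical helper: lower-case the input and turn anything that is not
  // a letter or digit into a space, so "Wash" and "wash." both become "wash".
  // Requires SysUtils in the uses clause.
  Result := LowerCase(s);
  for i := 1 to Length(Result) do
    if not CharInSet(Result[i], ['a'..'z', '0'..'9']) then
      Result[i] := ' ';
end;

You would then call CountWordSequences(NormalizeText(Edit1.Text)). The extra spaces are harmless because DelimitedText skips runs of whitespace while StrictDelimiter is False (the default); note that this also splits contractions such as don't, so adjust the character set to taste.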

To test it, put a Button, an Edit and a Memo on a Form, and add the following code.

procedure TForm1.Button1Click(Sender: TObject);
var i:integer; l:TStrings;
begin
  l:=CountWordSequences(Edit1.Text,TStringList.Create);
  try
    for i:=0 to l.Count-1 do
      Memo1.Lines.Add('"'+l.Strings[i]+'": '+IntToStr(Integer(l.Objects[i])));
  finally
    l.Free;   // the caller owns the returned list
  end;
end;

A first try with I took the car to the car wash.

gives

"I": 1
"took": 1
"the": 2
"car": 2
"to": 1
"wash.": 1
"I took": 1
"took the": 1
"the car": 2
"car to": 1
"to the": 1
"car wash.": 1
"I took the": 1
"took the car": 1
"the car to": 1
"car to the": 1
"to the car": 1
"the car wash.": 1
"I took the car": 1
"took the car to": 1
"the car to the": 1
"car to the car": 1
"to the car wash.": 1
"I took the car to": 1
"took the car to the": 1
"the car to the car": 1
"car to the car wash.": 1
"I took the car to the": 1
"took the car to the car": 1
"the car to the car wash.": 1
"I took the car to the car": 1
"took the car to the car wash.": 1
"I took the car to the car wash.": 1
On

This is how I would go about solving the problem, assuming that each pass through the data file creates a new data file for the next step. The control character mentioned below can be any character that does not naturally appear in the data. When you write a control character, don't write duplicates.

  1. Run through your document and count each word separately.
  2. Run through your document again and replace any word used only once with a control character, adding to a new list the pairs that occur (words A B C become item A B and item B C). Control characters act as hard delimiters. Any word standing alone between control characters should also be converted, since it can't form part of a pair.
  3. Run through your document again and replace any pair used only once with a control character, adding to a new list any triplets that occur. Convert pairs between control characters to control characters.

Repeat, adding another word level on each pass, until you get an empty list or reach the maximum phrase length you want to support.

This method relies on the fact that your most common phrases can never contain a smaller phrase that is used less often; a rough sketch of the pruning pass in step 2 follows.
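The sketch below assumes the document's words have already been loaded into a TStrings and that WordCounts maps each word to its count via Objects[] (as in the accepted answer); PrunePass and CtrlChar are made-up names for illustration.

const
  CtrlChar = #1;  // stands in for any character that never appears in the data

procedure PrunePass(Words, WordCounts, Pairs: TStrings);
var
  i: Integer;
  prev: string;
begin
  // Replace words used only once with the control character and collect the
  // pairs of surviving neighbours; merging duplicate control characters and
  // converting a lone word between two control characters are omitted here.
  prev := CtrlChar;
  for i := 0 to Words.Count - 1 do
  begin
    if (Words[i] <> CtrlChar) and
       (Integer(WordCounts.Objects[WordCounts.IndexOf(Words[i])]) = 1) then
      Words[i] := CtrlChar;
    if (prev <> CtrlChar) and (Words[i] <> CtrlChar) then
      Pairs.Add(prev + ' ' + Words[i]);
    prev := Words[i];
  end;
end;

The resulting Pairs list can then be counted with the same IndexOf/AddObject technique used in the accepted answer, and step 3 repeats the procedure with pairs in place of words.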

On

The number of possible combinations rises very quickly. Assume 30,000 words in mainstream use in a language; then the number of three-word combinations is on the order of 30000^3.

Anyway, the level-zero implementation would be to build a (hash) list of words and, if necessary, filter out very common words (the, of, etc.) to reduce the number of phrases. Other things you might want to do are reducing plurals to singular forms, removing trailing 's, normalizing case, and so on.

Then walk the text word by word (tokenizer style), skipping the common words, and simply keep an ordered list of the phrases you encounter along with a count, and hope your memory doesn't run out, since Delphi has no 64-bit version :)
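A minimal sketch of that ordered, counted list, using a sorted TStringList so IndexOf becomes a binary search, with a tiny made-up stop-word array as the filter (requires Classes, SysUtils and StrUtils in the uses clause):

procedure CountFilteredWords(Words: TStrings; Counts: TStringList);
const
  StopWords: array[0..3] of string = ('the', 'of', 'a', 'to');  // example subset only
var
  i, idx: Integer;
  w: string;
begin
  Counts.Sorted := True;        // keeps the list ordered; IndexOf uses a binary search
  for i := 0 to Words.Count - 1 do
  begin
    w := LowerCase(Words[i]);
    if AnsiIndexText(w, StopWords) >= 0 then
      Continue;                 // skip very common words
    idx := Counts.IndexOf(w);
    if idx = -1 then
      Counts.AddObject(w, TObject(1))
    else
      Counts.Objects[idx] := TObject(Integer(Counts.Objects[idx]) + 1);
  end;
end;

The same list works for two- and three-word phrases if you add the joined phrase strings instead of single words.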

Didn't Knuth have a whole book about combinations?

On

From the Delphi Basics website:

var
  position : Integer;

begin
  // Look for the word 'Cat' in a sentence
  // Note : that this search is case sensitive, so that
  //        the first 'cat' is not matched
  position := AnsiPos('Cat', 'The cat sat on the Cat mat');
  if position = 0
  then ShowMessage('''Cat'' not found in the sentence')
  else ShowMessage('''Cat'' was found at character '+IntToStr(position));
end;

Maybe it will help.
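If it does, one way to build on it for this problem is to count how often a given phrase occurs using PosEx from StrUtils; CountOccurrences is a hypothetical helper, and case insensitivity would need AnsiUpperCase applied to both strings first.

function CountOccurrences(const Phrase, S: string): Integer;
var
  p: Integer;
begin
  // Walk the text with PosEx, counting every (possibly overlapping) match.
  Result := 0;
  p := PosEx(Phrase, S, 1);
  while p > 0 do
  begin
    Inc(Result);
    p := PosEx(Phrase, S, p + 1);
  end;
end;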