Stemming a word using Linux command line

I have a corpus file that I need to compare with another file "vertically" and list unique left-over strings. For example:

exclude.txt:

ed
s
ing

And the second file is:

corpus.txt:

worked
working
works
tested
tests
find
found

Expected output:

work/ed,ing,s
test/ed,s

The other words (find and found) can also be returned optionally.

I tried like this:

import re

with open('/home/corpus.txt') as corpus:
    for i in corpus:
        i = i.strip('\n')
        with open('/home/exclude.txt') as exclude:
            for x in exclude:
                x = x.strip('\n')
                if i.endswith(x):
                    print(x, i, re.sub(r'(.*)' + x, r'\1', i) + '/' + x)

The output is something like this...

ed worked work/ed
ing working work/ing
s works work/s
ed tested test/ed
s tests test/s

As you can see, I am not getting the expected output. Nor is this Python code practical if the corpus file is large.
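
In other words, the matched suffixes need to be grouped per base word. A minimal sketch of that grouping in Python (same file paths as above), reading exclude.txt once up front instead of re-opening it for every corpus line:

from collections import defaultdict

# Read the suffix list once.
with open('/home/exclude.txt') as f:
    suffixes = [line.strip() for line in f if line.strip()]

groups = defaultdict(list)
with open('/home/corpus.txt') as f:
    for word in (line.strip() for line in f):
        for s in suffixes:
            if word.endswith(s) and len(word) > len(s):
                groups[word[:-len(s)]].append(s)   # "worked" -> groups["work"] += ["ed"]

for base, sfxs in groups.items():
    print(f"{base}/{','.join(sfxs)}")              # e.g. work/ed,ing,s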


Here's some slightly less trivial input to test with:

$ head -100 exclude.txt corpus.txt
==> exclude.txt <==
works
ed
s
ing
ings

==> corpus.txt <==
worked
working
works
tested
tests
find
found
workings

The expected output given the above would be (printing the unmatched words is optional, as is printing matched/unmatched at the start of each line):

unmatched find
unmatched found
matched working/s
matched work/ed,ing,s,ings
matched test/ed,s

4 Answers

Best answer (score 0)

Using any awk:

$ cat tst.awk
{ lineLgth = length($0) }                      # length of the current line, either file
NR == FNR {                                    # first file (exclude.txt):
    suffixes[$0]                               #   save each suffix as an array key
    sfxLgths[lineLgth]                         #   and save the set of suffix lengths
    next
}
{                                              # second file (corpus.txt):
    base = ""
    for ( sfxLgth in sfxLgths ) {              # loop once per distinct suffix length
        baseLgth = lineLgth - sfxLgth
        if ( baseLgth > 0 ) {                  # skip words that are all suffix
            sfx = substr($0,baseLgth+1)        # candidate suffix of this length
            if ( sfx in suffixes ) {           # literal hash lookup, no regexp
                base = substr($0,1,baseLgth)
                bases2sfxs[base] = bases2sfxs[base] "," sfx
            }
        }
    }
    if ( base == "" ) {
        print "unmatched", $0
    }
}
END {
    for ( base in bases2sfxs ) {
        sub(/,/,"/",bases2sfxs[base])          # turn the first "," into "/"
        print "matched", base bases2sfxs[base]
    }
}

The above uses literal string functionality, so it'll work even if the strings in either input file contain regexp metacharacters, backreference characters/strings, typical string/regexp delimiters, escape sequences, or any other potentially problematic characters (the OP may not have all of those cases, but others with similar problems reading this in future might).
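
As a quick illustration of the pitfall being avoided (hypothetical Python, not part of the script above): feeding such strings to a regexp engine either over-matches or errors out, while a literal comparison stays safe:

import re

word = 'worked'

sfx = '.*'
print(bool(re.search(sfx + '$', word)))   # True: as a regexp, ".*$" matches any word
print(word.endswith(sfx))                 # False: no word literally ends in ".*"

sfx = r'\1'
try:
    re.search(sfx + '$', word)            # invalid backreference as a regexp
except re.error as e:
    print('regexp error:', e)
print(word.endswith(sfx))                 # False: harmless as a literal string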

For example, given this input which tests all of the above occurring in both files:

$ cat exclude.txt
works
ed
s
ing
ings
.
*
.*
/
&
'
"
\
\1
\t

$ cat corpus.txt
worked
working
works
tested
tests
find
found
workings
abacus
formless
impress
stories
amusing
linings
toppings
underlings
x.
x*
x.*
x/
x&
x'
x"
x\
x\1
x\t

the above script would produce this output (output sorted just for ease of seeing which words got which suffixes):

$ awk -f tst.awk exclude.txt corpus.txt | sort
matched abacu/s
matched amus/ing
matched formles/s
matched impres/s
matched lin/ings
matched lining/s
matched storie/s
matched test/ed,s
matched topp/ings
matched topping/s
matched underl/ings
matched underling/s
matched work/ed,ing,s,ings
matched working/s
matched x./*
matched x/.,*,.*,/,&,',",\,\1,\t
unmatched find
unmatched found

I'm also looping on the length of the strings in exclude.txt (contents of suffixes[]) rather than looping on the actual strings for efficiency.

If we have, say, 10 3-letter words in suffixes that we want to match against the last 3 letters of a line from corpus.txt, then we'd populate our suffixes array from exclude.txt and loop through corpus.txt either way, but our options for implementing the "check for a match" step are (pseudo-code):

a) Compare each suffix from exclude.txt to the last 3 letters on each line from corpus.txt (this will loop 10 times as we have 10 suffixes):

NR == FNR {
    suffixes[$0]
    next
}
{
    for ( sfx in suffixes ) {     # loops 10 times
        wordEnd = substr($0,length($0)-length(sfx)+1)
        if ( wordEnd == sfx ) {
            it's a match
        }
    }
}

b) Compare the last 3 letters of each line from corpus.txt to all suffixes from exclude.txt at once via a hash lookup (this will loop 1 time as all suffixes are 3 chars long in this example):

NR == FNR {
    suffixes[$0]
    sfxLgths[length($0)]
    next
}
{
    for ( sfxLgth in sfxLgths ) {    # loops 1 time
        wordEnd = substr($0,length($0)-sfxLgth+1)
        if ( wordEnd in suffixes ) {
            it's a match
        }
    }
}

So I implemented option "b", which loops just once for all suffixes that are N characters long instead of once per suffix.

Regarding how much more efficient looping by length is than looping by suffix - for what it's worth, according to https://en.wiktionary.org/wiki/Appendix:English_suffixes there are 392 English language suffixes but 368 of those are in a bell curve between 2 and 7 characters long averaging 61 suffixes per length in that range. So looping one time on a length in that range would save on average about 60 iterations compared to looping on each of the suffixes of that length.
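
Here's the same length-bucketing idea transliterated into Python (just a sketch, assuming the same two files), in case the hash-lookup structure is easier to follow there:

from collections import defaultdict

suffixes, lengths = set(), set()
with open('exclude.txt') as f:
    for s in (line.strip() for line in f):
        if s:
            suffixes.add(s)
            lengths.add(len(s))                # the set of distinct suffix lengths

bases = defaultdict(list)
with open('corpus.txt') as f:
    for word in (line.strip() for line in f):
        matched = False
        for n in lengths:                      # one iteration per distinct length...
            if len(word) > n and word[-n:] in suffixes:   # ...one O(1) lookup each
                bases[word[:-n]].append(word[-n:])
                matched = True
        if not matched:
            print('unmatched', word)

for base, sfxs in bases.items():
    print('matched', f"{base}/{','.join(sfxs)}")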

The test for if ( baseLgth > 0 ) is necessary to not output <null>/<suffix> if, say, a word from corpus.txt also exists in exclude.txt.

Tweak it to modify the output format and/or add a variable to control printing the unmatched words if you like as your question isn't clear on how to handle that.

Answer (score 5)

TXR Lisp:

(let ((suffix-regex (flow "exclude.txt"
                      file-get-lines
                      (cons 'or)
                      regex-compile))
      (wh (hash)))
  (with-stream (s (open-file "corpus.txt"))
    (whilet ((w (get-line s)))
      (iflet ((rng (r$ suffix-regex w)))
        (push [w rng] [wh [w 0..(from rng)]]))))
  (dohash (w suffs wh)
    (put-line `@w/@{(reverse suffs) ","}`)))

$ txr condense.tl
work/ed,ing,s
test/ed,s

Compile:

$ txr --compile=condense.tl
work/ed,ing,s
test/ed,s

Now there is a compiled, faster condense.tlo.

If txr condense is used (no suffix), it will choose the .tlo over .tl.

The trick used in this code is to generate a regular expression which matches the word suffixes. This is done by taking the list of words from the file like ("ed" "ing" "s") and tacking on the or symbol with cons to produce (or "ed" "ing" "s"). This expression is the abstract syntax for the regular expression ed|ing|s; we just hand that off to regex-compile to produce the compiled regex object. No escaping is required because the string terms in the abstract syntax denote fixed, literal matches.
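
For comparison, a rough Python analogue of the same trick (a sketch, not the TXR code): since Python regexps are textual rather than abstract syntax, re.escape() is needed to get the literal-match guarantee that the TXR version gets for free:

import re

# Build one end-anchored alternation like (ing|ed|s)$ from the suffix list.
with open('exclude.txt') as f:
    suffixes = [line.strip() for line in f if line.strip()]
suffixes.sort(key=len, reverse=True)           # prefer longer suffixes
suffix_re = re.compile('(' + '|'.join(map(re.escape, suffixes)) + ')$')

bases = {}
with open('corpus.txt') as f:
    for word in (line.strip() for line in f):
        m = suffix_re.search(word)
        if m and m.start() > 0:                # ignore words that are all suffix
            bases.setdefault(word[:m.start()], []).append(m.group(1))

for base, sfxs in bases.items():
    print(f"{base}/{','.join(sfxs)}")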

Answer (score 4)

$ awk 'NR==FNR {a[$1]; next}                   # first file: collect suffixes
               {for(k in a) {                  # try every suffix on each word
                  i=length($1)-length(k);
                  if(i>0 && substr($1,i+1)==k) {   # word ends with suffix k (i>0 skips whole-word matches)
                     w=substr($1,1,i);             # base = word minus suffix
                     b[w]=b[w] sep[w] k;
                     sep[w]=","}}}
       END     {for(k in b) print k"/"b[k]}' exclude corpus

work/ed,ing,s
test/ed,s

Answer (score 2)

This might work for you (GNU sed):

sed 's#.*#s@\\B&$@/&@p#' file1 |
sed -nf - file2 |
sed -E ':a;N;s#(.*)(/.*)\n\1/#\1\2,#;ta;P;D'

The first sed invocation builds a sed script that inserts a / between the base and the matching word ending of each word in an input file.
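
For the three-suffix exclude.txt from the question, that generated script looks like this (\B stops a suffix from matching an entire word):

$ sed 's#.*#s@\\B&$@/&@p#' exclude.txt
s@\Bed$@/ed@p
s@\Bs$@/s@p
s@\Bing$@/ing@p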

The second sed invocation applies the above sed script and throws away non-matches.

The third sed invocation gathers up duplicate keys (the front half of the word) and concatenates the endings with a , separator.

N.B. The solution assumes the input file2 is already sorted. If not, a simple sort can be placed between the second and third sed invocations.
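
i.e. the same pipeline with the sort added:

sed 's#.*#s@\\B&$@/&@p#' file1 |
sed -nf - file2 |
sort |
sed -E ':a;N;s#(.*)(/.*)\n\1/#\1\2,#;ta;P;D'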