I have a corpus file that I need to compare "vertically" with another file and list the unique left-over strings. For example:
exclude.txt:
ed
s
ing
And the second file is:
corpus.txt:
worked
working
works
tested
tests
find
found
Expected output:
work/ed,ing,s
test/ed,s
The other words (find and found) can also be returned optionally.
I tried this:

import re

with open('/home/corpus.txt') as corpus:
    for i in corpus:
        i = i.strip('\n')
        with open('/home/exclude.txt') as exclude:
            for x in exclude:
                x = x.strip('\n')
                if i.endswith(x):
                    print(x, i, re.sub(r'(.*)' + x, r'\1', i) + '/' + x)
The output is something like this...
ed worked work/ed
ing working work/ing
s works work/s
ed tested test/ed
s tests test/s
As you can see, I am not getting the expected output: the suffixes aren't grouped per stem. This code is also no use if the corpus file is large, since it re-reads exclude.txt for every word in the corpus.
Here's some slightly less trivial input to test with:
$ head -100 exclude.txt corpus.txt
==> exclude.txt <==
works
ed
s
ing
ings
==> corpus.txt <==
worked
working
works
tested
tests
find
found
workings
The expected output given the above would be (printing the unmatched words is optional, as is printing matched/unmatched at the start of each line):
unmatched find
unmatched found
matched working/s
matched work/ed,ing,s,ings
matched test/ed,s
Using any awk:
The above uses literal string functionality, so it'll work even if the strings in either input file contain regexp metacharacters, backreference characters/strings, typical string/regexp delimiters, escape sequences, or any other potentially problematic characters (the OP may not have all of those cases, but others with similar problems reading this in future might).
For example, given input that exercises all of the above in both files, the script would still produce correct output (sorted just for ease of seeing which words got which suffixes).
I'm also looping on the length of the strings in exclude.txt (the contents of suffixes[]) rather than looping on the actual strings, for efficiency. If we have, say, 10 3-letter words in suffixes that we want to match against the last 3 letters of a line from corpus.txt, then we'd populate our suffixes array from exclude.txt and loop through corpus.txt either way, but then our options for implementing "check for a match" are:

a) Compare each suffix from exclude.txt to the last 3 letters of each line from corpus.txt (this will loop 10 times, as we have 10 suffixes).

b) Compare the last 3 letters of each line from corpus.txt to all suffixes from exclude.txt at once via a hash lookup (this will loop 1 time, as all the suffixes are 3 chars long in this example).

So I implemented option "b" to loop only once for all suffixes that are N characters long instead of looping once per suffix.
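To illustrate the difference, here are the two strategies side by side for a single corpus word, using an invented list of ten 3-letter suffixes (the word and the suffix list are just for this demo):

```shell
word=working

# a) one comparison per suffix: 10 iterations for 10 suffixes
printf 'ing\ners\nest\nish\nism\nist\nive\nful\ndom\nage\n' |
awk -v w="$word" '
    substr(w, length(w) - length($0) + 1) == $0 { print w "/" $0 }'

# b) one hash lookup covering all 10 suffixes, since they all share length 3
printf 'ing\ners\nest\nish\nism\nist\nive\nful\ndom\nage\n' |
awk -v w="$word" '
    { sfx[$0] }                         # build the hash of suffixes
    END {
        tail = substr(w, length(w) - 2) # last 3 characters of the word
        if (tail in sfx) print w "/" tail
    }'
```

Both variants print working/ing, but (b) does one slice and one hash lookup per suffix length instead of one comparison per suffix.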
Regarding how much more efficient looping by length is than looping by suffix: for what it's worth, according to https://en.wiktionary.org/wiki/Appendix:English_suffixes there are 392 English-language suffixes, and 368 of those fall in a bell curve between 2 and 7 characters long, averaging 61 suffixes per length in that range. So looping once on a length in that range would save on average about 60 iterations compared to looping on each of the suffixes of that length.
The test for if ( baseLgth > 0 ) is necessary to avoid outputting <null>/<suffix> if, say, a word from corpus.txt also exists in exclude.txt. Tweak it to modify the output format and/or add a variable to control printing the unmatched words if you like, as your question isn't clear on how to handle that.