I have two text files with the same content but different formatting. In one file there are extra spaces between words or letters, and in some places the space between words is missing. The line breaks differ as well. For example:
File1:
The annotation framework we presented is
embedded in the Knowledge Management and
Acquisition Platform Semantic Turkey (Pazienza, et
al., 2012), and comes out-the-box with a few
annotation families which differ in the underlying
annotation model and, notably, in the tasks they
support. The default handlers take into consideration
the annotation of atomic ontological resources, and
complex activities that are provided as macros, e.g.
the creation of new instances, the definition of new
subclasses in OWL, or of narrower concepts in
SKOS.
File2:
Theannotationframework we presented is
embedded in th e K n o w l e d ge Management and
Acquisition Platform Semantic Turkey (Pazienza, et
al., 2012), and comes out-the-
box with a few
annotation families which differ in the underlying
annotation model and, notably, in the tasks they
support. The default handlers take into consideration
the a n n o t a t i o n o f a t o m i c ontological resources, and
complex activities that are provided as macros, e.g.
the creation of new instances, the definition of new
subclasses in OWL, or of narrower concepts in
SKOS.
Suppose I select the string "the Knowledge Management" from File1 and want to match it with the string "th e K n o w l e d ge Management" in File2.
How can I achieve this? The deformities in the second file follow no fixed pattern. The only certainty is that the characters appear in the same order in both files; they may be separated by extra spaces, or the space between them may be missing.
I thought of applying Sellers' algorithm or the Viterbi algorithm, but I am not sure about either. Approximate string matching could be expensive as well.
Any lead would be helpful. Thanks a lot!
You could read both files into strings and remove all whitespace (spaces and line breaks) from each. After that it is a plain exact substring search.
If you also need the start index of the match in the original file, find the start index of the match in the collapsed string, then run a for loop over the spaced-out version, counting only non-whitespace characters until you reach that index.
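A minimal sketch of this approach in Python, assuming both files fit in memory. The helper names collapse and find_ignoring_whitespace are made up for illustration. Instead of a second counting loop, the sketch records the original index of every kept character while collapsing, which gives the same mapping back to the spaced-out text.

```python
def collapse(text):
    """Return text with all whitespace removed, plus a list that maps
    each kept character to its index in the original text."""
    chars = []
    positions = []
    for i, ch in enumerate(text):
        if not ch.isspace():
            chars.append(ch)
            positions.append(i)
    return "".join(chars), positions


def find_ignoring_whitespace(haystack, needle):
    """Find needle in haystack, ignoring whitespace on both sides.

    Returns (start, end) indices into the original haystack, with end
    exclusive, or None if there is no match."""
    collapsed_haystack, positions = collapse(haystack)
    collapsed_needle, _ = collapse(needle)
    if not collapsed_needle:
        return None
    start = collapsed_haystack.find(collapsed_needle)
    if start == -1:
        return None
    last = start + len(collapsed_needle) - 1
    return positions[start], positions[last] + 1


# A fragment of File2 from the question, used as sample input.
file2 = "embedded in th e K n o w l e d ge Management and\nAcquisition Platform"
span = find_ignoring_whitespace(file2, "the Knowledge Management")
if span is not None:
    start, end = span
    print(file2[start:end])  # prints: th e K n o w l e d ge Management
```

Because all whitespace is discarded, the search also ignores word boundaries, so a short pattern such as "the" can match inside a longer word like "other".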