Training a CRF without sentence boundaries

I need to tag parts of the text in an HTML document. However, the text mostly consists of dates, company names, addresses, etc. I plan to use a CRF (sklearn-crfsuite).

My problem is that it is difficult to divide the dataset into sentences. Can we train a CRF model without sentence boundaries, treating everything as a single sequence? The tutorials for CRFSuite and sklearn-crfsuite do not talk about this.

If it cannot be done without sentence segmentation, any hints on how to divide such texts into sentences?

The data is something like this (I cannot share the actual data): [image]

There are 1 best solutions below

Yes, you can train without dividing the input sequence into sentences: just use one large sequence for everything. For example, https://github.com/scrapinghub/webstruct does this for HTML pages.

Splitting the sequence into sentences provides additional information (hard boundaries), but a CRF can work without it. See also: https://stats.stackexchange.com/questions/197291/sequence-length-when-training-a-conditional-random-field-crf.
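
As a rough illustration of the idea, here is a minimal sketch with sklearn-crfsuite where each document is one long sequence. It assumes the text has already been extracted from the HTML and tokenized, and that there is one label per token. The feature names, the helper functions (`token_features`, `doc_to_features`, `train_document_crf`) and the hyperparameter values are my own placeholders, not something prescribed by the library or by webstruct.

```python
def token_features(tokens, i):
    """Features for tokens[i], using a +/-1 token context window."""
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOD"] = True  # beginning of document, not of a sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOD"] = True  # end of document
    return feats


def doc_to_features(tokens):
    """Turn one whole document (a list of tokens) into one feature sequence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_document_crf(docs, labels):
    """Fit a CRF where every training sequence is an entire document.

    docs:   list of documents, each a list of token strings
    labels: list of label sequences, aligned token by token with docs
    """
    # Imported here so the feature helpers work without the package installed.
    import sklearn_crfsuite

    X = [doc_to_features(tokens) for tokens in docs]
    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs",
        c1=0.1,  # placeholder regularization values, tune on your data
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True,
    )
    crf.fit(X, labels)
    return crf
```

To tag a new page, build its features with `doc_to_features` and pass a list containing that single sequence to `crf.predict`. The only difference from the sentence-based tutorials is that the "beginning" and "end" markers now refer to the document rather than to a sentence.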