Training a CRF without sentence boundaries

422 Views Asked by sir_osthara At 28 July 2025 at 02:21

I need to tag parts of the text in an HTML document. However, it mostly consists of text in the form of dates, company names, addresses, etc. I plan to use a CRF (sklearn-crfsuite).

My problem is that it is difficult to divide the dataset into sentences. Can we train a CRF model without sentence boundaries, treating everything as a single sequence? The tutorials for CRFSuite and sklearn-crfsuite do not talk about this.

If it cannot be done without sentence segmentation, any hints on how to divide such texts into sentences?

The data is something like this (I cannot share the actual data):


There is 1 best solution below

Mikhail Korobov On 16 October 2017 at 08:16

Yes, you can train without dividing the input into sentences: just treat each whole document as one large sequence. For example, https://github.com/scrapinghub/webstruct does this for HTML pages.

Splitting a sequence into sentences provides additional information (hard boundaries), but a CRF can work without it. See also: https://stats.stackexchange.com/questions/197291/sequence-length-when-training-a-conditional-random-field-crf.
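
As a minimal sketch of the "one large sequence" idea, assuming the usual sklearn-crfsuite input format (a list of sequences, each a list of per-token feature dicts): every document becomes exactly one sequence, with no sentence splitting. The tokens, labels, feature names, and helper functions below are hypothetical, made up for illustration. This is not how webstruct does it internally.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "is_digit": tok.isdigit(),
        "has_digit": any(c.isdigit() for c in tok),
    }
    # Context features; the only "boundaries" are the start and end of the document.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


def document_to_sequence(tokens):
    """Turn all tokens of one document into a single CRF sequence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_crf(X, y):
    """Fit a CRF; requires sklearn-crfsuite to be installed."""
    import sklearn_crfsuite

    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs",
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True,
    )
    crf.fit(X, y)
    return crf


# Hypothetical toy document: tokens extracted from one HTML page, in order.
doc_tokens = ["Acme", "Corp", "12", "March", "2017", "10", "Main", "Street"]
doc_labels = ["B-ORG", "I-ORG", "B-DATE", "I-DATE", "I-DATE",
              "B-ADDR", "I-ADDR", "I-ADDR"]

# One document = one sequence; add more documents as further list items.
X = [document_to_sequence(doc_tokens)]
y = [doc_labels]

# crf = train_crf(X, y)
# predicted = crf.predict(X)
```

With real data, `X` and `y` would hold one entry per HTML page. If the pages have natural block boundaries (table rows, paragraphs), encoding them as features instead of hard splits is one way to keep that information.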
