Sample size for Named Entity Recognition gold standard corpus

Question

Sample size for Named Entity Recognition gold standard corpus

1k Views Asked by roelmetgevoel At 28 November 2025 at 00:01

I have a corpus of 170 Dutch literary novels on which I will apply Named Entity Recognition. For an evaluation of existing NER taggers for Dutch I want to manually annotate Named Entities in a random sample of this corpus – I use brat for this purpose. The manually annotated random sample will function as the 'gold standard' in my evaluation of the NER taggers. I wrote a Python script that outputs a random sample of my corpus on the sentence level.

My question is: what is the ideal size of the random sample in terms of the amount of sentences per novel? For now, I used a random 100 sentences per novel, but this leads to a pretty big random sample containing almost 21626 lines (which is a lot to manually annotate, and which leads to a slow working environment in brat).

Original Q&A

There are 2 best solutions below

**fnl** · Answer 1

NB, before the actual answer: The biggest issue I see is that you only can evaluate the tools wrt. those 170 books. So at best, it will tell you how good the NER tools you evaluate will work on those books or similar texts. But I guess that is obvious...

As to sample sizes, I would guesstimate that you need no more than a dozen random sentences per book. Here's a simple way to check if your sample size is already big enough: Randomly choose only half of the sentences (stratified per book!) you annotated and evaluate all the tools on that subset. Do that a few times and see if results for the same tool varies widely between runs (say, more than +/- 0.1 if you use F-score, for example - mostly depending on how "precise" you have to be to detect significant differences between the tools). If the variances are very large, continue to annotate more random sentences. If the numbers start to stabilize, you're good and can stop annotating.

**eldams** · Answer 2

Indeed, the "ideal" size would be... the whole corpus :)

Results will be correlated to the degree of detail of the typology: just PERS, LOC, ORG would require require a minimal size, but what about a fine-grained typology or even full disambiguation (linking)? I suspect good performance wouldn't need much data (just enough to validate), whilst low performance should require more data to have a more detailed view of errors.

As an indicator, cross-validation is considered as a standard methodology, it often uses 10% of the corpus to evaluate (but the evaluation is done 10 times).

Besides, if working with ancient novels, you will probably face lexical coverage problem: many old proper names would not be included in available softwares lexical resources and this is a severe drawback for NER accuracy. Thus it could be a nice idea to split corpus according to decades / centuries and conduct multiple evaluation so as to measure the impact of this trouble on performances.

Sample size for Named Entity Recognition gold standard corpus

There are 2 best solutions below

Related Questions in PYTHON

Related Questions in NLP

Related Questions in NAMED-ENTITY-RECOGNITION

Related Questions in SAMPLE-SIZE

Related Questions in BRAT

Trending Questions

Popular # Hahtags

Popular Questions