Correct way to count types in a whole corpus


I'm struggling to find the correct way to count types (unique word forms) in a quanteda corpus. ntype() gives the number of types per document, but not for the corpus as a whole.
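For example, here is a minimal sketch of what ntype() reports, one count per document rather than a single total:

library(quanteda)

# ntype() on a tokens object returns a named vector with one
# type count per document, not a corpus-wide total
ntype(tokens(corpus(data_char_ukimmig2010), remove_punct = TRUE))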

I found two ways to get a whole-corpus count, but they give different results and I don't understand why.

Reproducible code:

library(quanteda)

corp_uk <- corpus(data_char_ukimmig2010)
corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)

nfeat(dfm(corp_uk_tokens))
length(types(corp_uk_tokens))

nfeat(dfm(corp_uk_tokens)) outputs 1648

length(types(corp_uk_tokens)) outputs 1804

Which one is correct, and why don't the two calculations give the same result?

Thanks a lot for helping!


Accepted answer (Ken Benoit):

It's because dfm() has tolower = TRUE as a default, so nfeat() has merged some types through lowercasing. If you turn this off, you will get the same result as length(types()).

library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.

corp_uk <- corpus(data_char_ukimmig2010)
corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)

# length of types vector
length(types(corp_uk_tokens))
#> [1] 1800

# gives the types after lowercasing, default for dfm()
nfeat(dfm(corp_uk_tokens))
#> [1] 1644

# without lowercasing, it's the same
nfeat(dfm(corp_uk_tokens, tolower = FALSE))
#> [1] 1800

Created on 2024-03-28 with reprex v2.1.0
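As a further sanity check, a minimal sketch along the same lines: lowercasing the tokens with tokens_tolower() before counting types should reproduce the default dfm() feature count, since that is exactly the case-folding dfm() applies by default.

# lowercase the tokens first, then count distinct types; this should match
# nfeat(dfm(corp_uk_tokens)) obtained with the default tolower = TRUE
length(types(tokens_tolower(corp_uk_tokens)))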