I have a question about the logDice association measure for collocational analysis.
This is the formula for logDice:
logDice = 14 + log2(2 * w1w2 / (w1 + w2))
Where:
w1w2 = the co-occurrence frequency of the words x and y
w1 = the frequency of the word x (the keyword)
w2 = the frequency of the word y (the collocate)
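To make the formula concrete, here is a minimal sketch of how I understand it in base R. The function name `log_dice` and its `base` argument are my own (hypothetical, not taken from the Russian National Corpus or any package); the counts are those of the first row of the data shown below.

```r
# Hypothetical helper: logDice with a selectable logarithm base.
# base = 2 follows the formula above; base = exp(1) gives the natural log.
log_dice <- function(w1w2, w1, w2, base = 2) {
  14 + log(2 * w1w2 / (w1 + w2), base = base)
}

log_dice(256, 3035, 1000)                 # base 2
log_dice(256, 3035, 1000, base = exp(1))  # natural log
```

Keeping the base as a parameter makes it easy to compare both variants on the same counts.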
This is the data frame that I got from the Russian National Corpus:
df <- structure(list(lex_1 = c("гей", "гей", "гей", "гей",
"гей", "гей"), lex_2 = c("лесбиянка", "бисексуал",
"-пропаганда", "трансгендер", "-активист",
"пропаганда"), w1w2 = c(256L, 56L, 33L, 40L, 22L, 109L
), w1 = c(3035L, 3035L, 3035L, 3035L, 3035L, 3035L), w2 = c(1000L,
214L, 33L, 1125L, 25L, 14989L), dice = c("11.935563044335458",
"10.632396335625995", "10.16087357953928", "10.048756281418573",
"9.758019438971836", "9.585035580677008"), loglikelihood = c("5257.044796946131",
"1149.0014239624418", "NaN", "650.7437731125242", "529.44817318798",
"1426.7781683883695"), mi3 = c("22.1740166089487", "19.156318611675747",
"19.43925467742511", "16.487339602195437", "18.500491089698897",
"16.905211323448984"), tscore = c("15.999754219819755", "7.483202316520131",
"5.744540056143745", "6.323855817680788", "4.690394799619232",
"10.434660699000974"), agr = c("18.83557314788643", "11.972911417482786",
"10.567382701314845", "10.210952867456713", "9.315288895011557",
"13.281571601474706")), row.names = c(NA, 6L), class = "data.frame")
This is the data printed as a tibble (the first 10 rows; the dput() output above contains only the first 6):
lex_1 lex_2 w1w2 w1 w2 dice loglikelihood mi3 tscore agr
<chr> <chr> <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
1 гей лесбиянка 256 3035 1000 11.935563044335458 5257.044796946131 22.1740166089487 15.999754219819755 18.8355…
2 гей бисексуал 56 3035 214 10.632396335625995 1149.0014239624418 19.156318611675747 7.483202316520131 11.9729…
3 гей -пропаганда 33 3035 33 10.16087357953928 NaN 19.43925467742511 5.744540056143745 10.5673…
4 гей трансгендер 40 3035 1125 10.048756281418573 650.7437731125242 16.487339602195437 6.323855817680788 10.2109…
5 гей -активист 22 3035 25 9.758019438971836 529.44817318798 18.500491089698897 4.690394799619232 9.31528…
6 гей пропаганда 109 3035 14989 9.585035580677008 1426.7781683883695 16.905211323448984 10.434660699000974 13.2815…
7 гей смущать 33 3035 2437 9.582255282724038 472.3465013559598 15.137239185266385 5.742894380148091 9.32370…
8 гей транссексуал 19 3035 294 9.527158922151362 332.24632808876066 15.59597672465279 4.358633704547329 8.24482…
9 гей гей 34 3035 3035 9.508393821122564 473.70304645444645 15.007354424847035 5.82890504455946 9.35288…
10 гей лгбт 32 3035 3453 9.381173487564423 433.62119700658593 14.696448565501338 5.6544538228949675 9.11594…
The dice column is supposed to hold the logDice value (as stated on the Russian National Corpus website). It cannot be a plain Dice coefficient, since the formula for the Dice coefficient is 2 * w1w2 / (w1 + w2), which typically yields a small number (between 0 and 1).
However, the dice column does not actually follow the formula above: it does not use log base 2. It appears to use ln (the natural log), as this check shows:
library(dplyr)

df2 <- df %>%
  mutate(lndice = 14 + log((2 * w1w2) / (w1 + w2))) %>%  # log() is the natural log in R
  select(dice, lndice)
This is the head of df2:
dice lndice
1 11.935563044335458 11.935563
2 10.632396335625995 10.632396
3 10.16087357953928 10.160874
4 10.048756281418573 10.048756
5 9.758019438971836 9.758019
6 9.585035580677008 9.585036
This is my attempt to calculate logDice following the formula, using log base 2:
df3 <- df %>%
  mutate(log2Dice = 14 + log2((2 * w1w2) / (w1 + w2))) %>%
  select(dice, log2Dice)
With the following result:
dice log2Dice
1 11.935563044335458 11.021647
2 10.632396335625995 9.141575
3 10.16087357953928 8.461311
4 10.048756281418573 8.299560
5 9.758019438971836 7.880116
6 9.585035580677008 7.630553
The log2Dice values differ from the dice column.
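As far as I can tell, the two sets of values are related by the change-of-base rule log2(x) = ln(x) / ln(2). The sketch below checks this in base R with the counts of the first six rows retyped by hand; the variable names are my own, and it does not tell me which base is the "right" one for logDice.

```r
# Counts copied from the first six rows of df above.
w1w2 <- c(256, 56, 33, 40, 22, 109)
w1   <- rep(3035, 6)
w2   <- c(1000, 214, 33, 1125, 25, 14989)

ln_dice   <- 14 + log(2 * w1w2 / (w1 + w2))   # natural log, matches the dice column
log2_dice <- 14 + log2(2 * w1w2 / (w1 + w2))  # base 2, as in the formula

# Change of base: divide the log term by ln(2) to go from ln to log2.
converted <- 14 + (ln_dice - 14) / log(2)
all.equal(converted, log2_dice)  # TRUE
```

If that is right, the same conversion could presumably be applied to the corpus column directly, e.g. 14 + (as.numeric(df$dice) - 14) / log(2), since dice is stored as character.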
Since I am new to R: am I making a mistake in my calculation of logDice? I am trying to be precise, since Rychlý (2008) states that the log in logDice is base 2, but the Russian National Corpus seems to use the natural log (ln) instead. Or can ln somehow be substituted for log2?
Sorry if the answer seems trivial, but I want to make sure everything is correct.