Can log2 be substituted with ln in the logDice association measure in R?


I have a question about the logDice association measure for collocation analysis.

This is the formula for logDice:

logDice = 14 + log2((2 * w1w2) / (w1 + w2))

Where:

w1w2 = the co-occurrence frequency of the words x and y

w1 = the frequency of the word x (the keyword)

w2 = the frequency of the word y (the collocate)

And this is the data frame that I got from the Russian National Corpus:

df <- structure(list(lex_1 = c("гей", "гей", "гей", "гей", 
"гей", "гей"), lex_2 = c("лесбиянка", "бисексуал", 
"-пропаганда", "трансгендер", "-активист", 
"пропаганда"), w1w2 = c(256L, 56L, 33L, 40L, 22L, 109L
), w1 = c(3035L, 3035L, 3035L, 3035L, 3035L, 3035L), w2 = c(1000L, 
214L, 33L, 1125L, 25L, 14989L), dice = c("11.935563044335458", 
"10.632396335625995", "10.16087357953928", "10.048756281418573", 
"9.758019438971836", "9.585035580677008"), loglikelihood = c("5257.044796946131", 
"1149.0014239624418", "NaN", "650.7437731125242", "529.44817318798", 
"1426.7781683883695"), mi3 = c("22.1740166089487", "19.156318611675747", 
"19.43925467742511", "16.487339602195437", "18.500491089698897", 
"16.905211323448984"), tscore = c("15.999754219819755", "7.483202316520131", 
"5.744540056143745", "6.323855817680788", "4.690394799619232", 
"10.434660699000974"), agr = c("18.83557314788643", "11.972911417482786", 
"10.567382701314845", "10.210952867456713", "9.315288895011557", 
"13.281571601474706")), row.names = c(NA, 6L), class = "data.frame")

This is the data printed as a tibble (10 rows are shown; the dput output above contains only the first 6):

  lex_1 lex_2         w1w2    w1    w2 dice               loglikelihood      mi3                tscore             agr     
   <chr> <chr>        <int> <int> <int> <chr>              <chr>              <chr>              <chr>              <chr>   
 1 гей   лесбиянка      256  3035  1000 11.935563044335458 5257.044796946131  22.1740166089487   15.999754219819755 18.8355…
 2 гей   бисексуал       56  3035   214 10.632396335625995 1149.0014239624418 19.156318611675747 7.483202316520131  11.9729…
 3 гей   -пропаганда     33  3035    33 10.16087357953928  NaN                19.43925467742511  5.744540056143745  10.5673…
 4 гей   трансгендер     40  3035  1125 10.048756281418573 650.7437731125242  16.487339602195437 6.323855817680788  10.2109…
 5 гей   -активист       22  3035    25 9.758019438971836  529.44817318798    18.500491089698897 4.690394799619232  9.31528…
 6 гей   пропаганда     109  3035 14989 9.585035580677008  1426.7781683883695 16.905211323448984 10.434660699000974 13.2815…
 7 гей   смущать         33  3035  2437 9.582255282724038  472.3465013559598  15.137239185266385 5.742894380148091  9.32370…
 8 гей   транссексуал    19  3035   294 9.527158922151362  332.24632808876066 15.59597672465279  4.358633704547329  8.24482…
 9 гей   гей             34  3035  3035 9.508393821122564  473.70304645444645 15.007354424847035 5.82890504455946   9.35288…
10 гей   лгбт            32  3035  3453 9.381173487564423  433.62119700658593 14.696448565501338 5.6544538228949675 9.11594…

The dice column holds the logDice value (as stated on the Russian National Corpus website). It cannot be the plain Dice coefficient, because that is computed as (2 * w1w2) / (w1 + w2) and always falls between 0 and 1.
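To illustrate why the column cannot be the plain Dice coefficient, here is a minimal base-R sketch that computes 2 * w1w2 / (w1 + w2) for the six rows above. The vector names (including `dice_raw`) are only for illustration; the frequencies are copied from df.

```r
# Frequencies copied from the six rows of df above
w1w2 <- c(256, 56, 33, 40, 22, 109)
w1   <- rep(3035, 6)
w2   <- c(1000, 214, 33, 1125, 25, 14989)

# Plain Dice coefficient: 2 * f_xy / (f_x + f_y)
# (dice_raw is an illustrative name, not a column from the corpus)
dice_raw <- (2 * w1w2) / (w1 + w2)

print(round(dice_raw, 4))
print(range(dice_raw))  # smallest and largest plain Dice value
```

Since the co-occurrence frequency cannot exceed either word's own frequency, these values stay at or below 1, far from the 9 to 12 range seen in the dice column.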

But the dice column does not seem to follow the formula above: instead of log base 2, it appears to use the natural log (ln):

library(dplyr)

# logDice variant using the natural log (R's log() defaults to base e)
df2 <- df %>%
  mutate(lndice = 14 + log((2 * w1w2) / (w1 + w2))) %>%
  select(dice, lndice)

And this is the output for df2:

                dice    lndice
1 11.935563044335458 11.935563
2 10.632396335625995 10.632396
3  10.16087357953928 10.160874
4 10.048756281418573 10.048756
5  9.758019438971836  9.758019
6  9.585035580677008  9.585036
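The match can also be checked numerically rather than by eye. This is a rough base-R sketch: the dice strings are copied from df (the column is stored as character, so it needs `as.numeric()`), and the 1e-5 tolerance is an arbitrary choice of mine, not something the corpus documents.

```r
# Frequencies and the corpus-reported dice values, copied from df above
w1w2 <- c(256, 56, 33, 40, 22, 109)
w1   <- rep(3035, 6)
w2   <- c(1000, 214, 33, 1125, 25, 14989)
dice_rnc <- as.numeric(c("11.935563044335458", "10.632396335625995",
                         "10.16087357953928",  "10.048756281418573",
                         "9.758019438971836",  "9.585035580677008"))

# Natural-log variant of the formula
ln_dice <- 14 + log((2 * w1w2) / (w1 + w2))

# Largest absolute difference, and a check against an assumed tolerance
max_diff <- max(abs(dice_rnc - ln_dice))
print(max_diff)
print(all(abs(dice_rnc - ln_dice) < 1e-5))
```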

This is my attempt to calculate logDice from the formula, using log base 2:

df3 <- df %>%
  mutate(log2Dice = 14+log2((2*w1w2)/(w1+w2))) %>%
  select(dice, log2Dice)

With the following result:

                dice  log2Dice
1 11.935563044335458 11.021647
2 10.632396335625995  9.141575
3  10.16087357953928  8.461311
4 10.048756281418573  8.299560
5  9.758019438971836  7.880116
6  9.585035580677008  7.630553

These values differ from the dice column.

So, since I am new to R, am I making a mistake in my calculation of logDice? I want to be precise, since Rychlý (2008) states that the log in logDice is base 2, but the Russian National Corpus appears to use the natural log (ln). Or can ln somehow be substituted for log2?
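For reference, the two variants are tied together by the change-of-base identity log2(x) = ln(x) / ln(2). The sketch below relies only on that identity and on the assumption that the dice column really is 14 + ln(...); it says nothing about which base the corpus intends. The names `dice_rnc` and `log2_dice` are illustrative.

```r
# Corpus-reported dice values, copied from df above
dice_rnc <- c(11.935563044335458, 10.632396335625995, 10.16087357953928,
              10.048756281418573, 9.758019438971836,  9.585035580677008)

# Undo the "+ 14", rescale from ln to log2, then add 14 back
# (assumes dice_rnc = 14 + ln(2 * w1w2 / (w1 + w2)))
log2_dice <- 14 + (dice_rnc - 14) / log(2)
print(round(log2_dice, 6))

# The rescaling is strictly increasing, so the ranking of collocates
# should be the same under either base
same_rank <- identical(order(dice_rnc, decreasing = TRUE),
                       order(log2_dice, decreasing = TRUE))
print(same_rank)
```

So the absolute scores are not interchangeable (the distance from 14 is scaled by a constant factor), but sorting collocates by either score gives the same order.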

Sorry if the answer seems trivial, but I want to make sure I get this right.
