Dominance analysis with Dirichlet regression: error related to formula syntax?

107 Views Asked by At

The goal

I want to run dominance analysis on a Dirichlet regression, to approximate the relative importance of a set of predictors (scaled continuous predictors, continuous predictors with splines, and factors). Dirichlet regression is an extension of beta regression to model proportions that are not derived from counts, and that are split between more than 2 categories, see Douma&weedon (2019).

The modelling approach: the syntax is potentially important

I am using the DirichletReg package to fit a Dirichlet regression, with an "alternative" parametrization: this allows to estimate simultaneously the parameters and the precision of the estimation. The syntax is: response ~ parameters | precision. The estimation of parameters can be done with different predictors from those used to estimate precision: response ~ predictor1 + predictor2 | predictor3. If left undeclared, the model assumes fixed precision: response ~ predictors, which can be declared explicilty as: response ~ predictors | 1.

I think that the error is related to the vertical bar in the formula, which separates the predictors used to estimate parameters from the predictors used to estimate precision.

I rely on performance::r2() to calculate a metric of model quality: Nagelkerke's pseudo-R2. However, for the actual analysis, I am thinking of either McFadden or Estrella's pseudo-R2, as they appear suitable to run dominance analysis on multinomial responses, see Luchman 2014.

The obstacle

I get the error message: "fitstat requires at least two elements".

A reproducible example

From data available in the DirichletReg package. The response is only two categories, but in any case, it yields the same error message as in the actual analysis.

library(DirichletReg)
#> Warning: package 'DirichletReg' was built under R version 4.1.3
#> Loading required package: Formula
#> Warning: package 'Formula' was built under R version 4.1.1
library(domir)
library(performance)
#> Warning: package 'performance' was built under R version 4.1.3

# Assemble data
RS <- ReadingSkills
RS$acc <- DR_data(RS$accuracy)
#> only one variable in [0, 1] supplied - beta-distribution assumed.
#> check this assumption.
RS$dyslexia <- C(RS$dyslexia, treatment)

# Fit Dirichlet regression
rs2 <- DirichReg(acc ~ dyslexia + iq | dyslexia + iq, data = RS, model = "alternative")

summary(rs2)
#> Call:
#> DirichReg(formula = acc ~ dyslexia + iq | dyslexia + iq, data = RS, model =
#> "alternative")
#> 
#> Standardized Residuals:
#>                   Min       1Q  Median      3Q     Max
#> 1 - accuracy  -1.5279  -0.7798  -0.343  0.6992  2.4213
#> accuracy      -2.4213  -0.6992   0.343  0.7798  1.5279
#> 
#> MEAN MODELS:
#> ------------------------------------------------------------------
#> Coefficients for variable no. 1: 1 - accuracy
#> - variable omitted (reference category) -
#> ------------------------------------------------------------------
#> Coefficients for variable no. 2: accuracy
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  2.22386    0.28087   7.918 2.42e-15 ***
#> dyslexiayes -1.81261    0.29696  -6.104 1.04e-09 ***
#> iq          -0.02676    0.06900  -0.388    0.698    
#> ------------------------------------------------------------------
#> 
#> PRECISION MODEL:
#> ------------------------------------------------------------------
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  1.71017    0.32697   5.230 1.69e-07 ***
#> dyslexiayes  2.47521    0.55055   4.496 6.93e-06 ***
#> iq           0.04097    0.27537   0.149    0.882    
#> ------------------------------------------------------------------
#> Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Log-likelihood: 61.26 on 6 df (33 BFGS + 1 NR Iterations)
#> AIC: -110.5, BIC: -99.81
#> Number of Observations: 44
#> Links: Logit (Means) and Log (Precision)
#> Parametrization: alternative
as.numeric(performance::r2(rs2))
#> [1] 0.4590758

# Run dominance analysis: error

# If left undeclared, the model assumes fixed precision: parameters |  1
domir::domin(acc ~ dyslexia + iq,
             reg =  function(y)  DirichletReg::DirichReg(y, data = RS, model = "alternative"),
             fitstat = list(\(x) list(r2.nagelkerke = as.numeric(performance::r2(x)), "r2.nagelkerke"))
)
#> Error in domir::domin(acc ~ dyslexia + iq, reg = function(y) DirichletReg::DirichReg(y, : fitstat requires at least two elements.

domir::domin(acc ~ dyslexia + iq | 1,
             reg =  function(y)  DirichletReg::DirichReg(y, data = RS, model = "alternative"),
             fitstat = list(\(x) list(r2.nagelkerke = as.numeric(performance::r2(x)), "r2.nagelkerke"))
             )
#> Error in domir::domin(acc ~ dyslexia + iq | 1, reg = function(y) DirichletReg::DirichReg(y, : fitstat requires at least two elements.

domir::domin(acc ~ dyslexia + iq | dyslexia + iq,
             reg =  function(y)  DirichletReg::DirichReg(y, data = RS, model = "alternative"),
             fitstat = list(\(x) list(r2.nagelkerke = as.numeric(performance::r2(x)), "r2.nagelkerke"))
             )
#> Error in domir::domin(acc ~ dyslexia + iq | dyslexia + iq, reg = function(y) DirichletReg::DirichReg(y, : fitstat requires at least two elements.

domir::domin(acc ~ dyslexia + iq,
             reg =  function(y)  DirichletReg::DirichReg(y, data = RS, model = "alternative"),
             fitstat = list(\(x) list(r2.nagelkerke = as.numeric(performance::r2(x)), "r2.nagelkerke")),
             consmodel = "| dyslexia + iq"
             )
#> Error in domir::domin(acc ~ dyslexia + iq, reg = function(y) DirichletReg::DirichReg(y, : fitstat requires at least two elements.

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] performance_0.10.0 domir_1.0.1        DirichletReg_0.7-1 Formula_1.2-4     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13  knitr_1.38       magrittr_2.0.3   insight_0.19.1  
#>  [5] lattice_0.20-44  rlang_1.1.0      fastmap_1.1.0    stringr_1.5.0   
#>  [9] highr_0.9        tools_4.1.0      grid_4.1.0       xfun_0.30       
#> [13] cli_3.6.0        withr_2.5.0      htmltools_0.5.2  maxLik_1.5-2    
#> [17] miscTools_0.6-28 yaml_2.3.5       digest_0.6.29    lifecycle_1.0.3 
#> [21] vctrs_0.6.1      fs_1.5.2         glue_1.6.2       evaluate_0.15   
#> [25] rmarkdown_2.13   sandwich_3.0-1   reprex_2.0.1     stringi_1.7.6   
#> [29] compiler_4.1.0   generics_0.1.2   zoo_1.8-9

Created on 2023-07-27 by the reprex package (v2.0.1)

References

Luchman Relative Importance Analysis With Multicategory Dependent Variables:: An Extension and Review of Best Practices (2014) Organizational research methods

Douma & Weedon. Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression (2019) Methods in Ecology and Evolution

2

There are 2 best solutions below

0
M. Riera On BEST ANSWER

Interestingly, the solution provided on July 2023 no longer worked in February 2024. Building on the code provided by Joseph Luchman, I could make domir compatible with a Dirichlet model, by fitting a Dirichlet model outside domir, and using update inside the domir call to pass the formula. This was also discussed over at GitHub. Infinite thanks to J. Luchman for the persistance and assistance!

library(DirichletReg)
#> Warning: package 'DirichletReg' was built under R version 4.1.3
#> Loading required package: Formula
#> Warning: package 'Formula' was built under R version 4.1.1
library(domir)
library(performance)
#> Warning: package 'performance' was built under R version 4.1.3

RS <- ReadingSkills
response.d <- DR_data(RS$accuracy)
#> only one variable in [0, 1] supplied - beta-distribution assumed.
#> check this assumption.


# Fit Dirichlet regression
rs2 <- DirichReg(response.d ~ dyslexia + iq | dyslexia + iq, data = RS, model = "alternative")
performance::r2( rs2)[[1]]
#> Nagelkerke's R2 
#>       0.4590758


domir::domir(response.d ~ dyslexia + iq,
             function(y)  {
               iv <- attr(terms(y), "term.labels")
               fml <- paste0("response.d ~ ", paste0(iv, collapse = "+"), "| dyslexia + iq", collapse = "")
               print(fml)
               performance::r2( update(rs2, formula(fml) ) )[[1]]})  
#> [1] "response.d ~ dyslexia+iq| dyslexia + iq"
#> [1] "response.d ~ dyslexia| dyslexia + iq"
#> [1] "response.d ~ iq| dyslexia + iq"
#> Overall Value:      0.4590758 
#> 
#> General Dominance Values:
#>          General Dominance Standardized Ranks
#> dyslexia       0.455042413   0.99121418     1
#> iq             0.004033357   0.00878582     2
#> 
#> Conditional Dominance Values:
#>          Subset Size: 1 Subset Size: 2
#> dyslexia    0.457263643    0.452821183
#> iq          0.006254587    0.001812127
#> 
#> Complete Dominance Designations:
#>                  Dmnated?dyslexia Dmnated?iq
#> Dmnates?dyslexia               NA       TRUE
#> Dmnates?iq                  FALSE         NA

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] performance_0.10.0 domir_1.0.1        DirichletReg_0.7-1 Formula_1.2-4     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13  knitr_1.38       magrittr_2.0.3   insight_0.19.1  
#>  [5] lattice_0.20-44  rlang_1.1.0      fastmap_1.1.0    stringr_1.5.0   
#>  [9] highr_0.9        tools_4.1.0      grid_4.1.0       xfun_0.39       
#> [13] cli_3.6.0        withr_2.5.2      htmltools_0.5.5  maxLik_1.5-2    
#> [17] miscTools_0.6-28 yaml_2.3.5       digest_0.6.29    lifecycle_1.0.4 
#> [21] vctrs_0.6.1      fs_1.5.2         glue_1.6.2       evaluate_0.15   
#> [25] rmarkdown_2.13   sandwich_3.0-1   reprex_2.0.1     stringi_1.7.6   
#> [29] compiler_4.1.0   generics_0.1.2   zoo_1.8-9

Created on 2024-02-26 by the reprex package (v2.0.1)

2
jluchman On

The issue asked about here is hinted at by domin as the list submitted to fitstat is length 1.

> list(\(x) list(r2.nagelkerke = as.numeric(performance::r2(x)), "r2.nagelkerke"))
[[1]]
\(x) list(r2.nagelkerke = as.numeric(performance::r2(x)), "r2.nagelkerke")

Moving the parentheses fixes it but reveals another that, I believe, is related to the design on DirichletReg::DirichReg.

> domir::domin(acc ~ dyslexia + iq,
+              reg =  function(y)  DirichletReg::DirichReg(y, data = RS, model = "alternative"),
+              fitstat = list(\(x) list(r2.nagelkerke = as.numeric(performance::r2(x))), "r2.nagelkerke")
+ )
Error in x$formula : object of type 'symbol' is not subsettable

Basically, it seems that DirichletReg::DirichReg cannot accept a lazily evaluated formula that is required for using domin.

For instance, most modeling functions with a formula allow for something like:

> lapply(list(mpg ~ am, mpg ~ vs), lm, data = datasets::mtcars)
[[1]]

Call:
FUN(formula = X[[i]], data = ..1)

Coefficients:
(Intercept)           am  
     17.147        7.245  


[[2]]

Call:
FUN(formula = X[[i]], data = ..1)

Coefficients:
(Intercept)           vs  
      16.62         7.94  

As you can see in the Call portion of the output, lm accepts arguments in a flexible way and evaluates the formula when it needs to, as applied to the data.

When trying something similar with DirichReg using parts of the focal model results in:

> lapply(list(acc ~ dyslexia, acc ~ iq), DirichReg, data = RS, model = "alternative")
Error in eval(x) : object 'X' not found

DirichReg actually needs to 'see' the formula as a string (as it uses match.call to parse arguments for processing; at least I believe this is the issue).

The resolution to this one is a little more complex. Have to, on the fly, take the formula domin (or in the case below, I use the more updated domir::domir; also note I'm using R v4.3 to allow the element selection with the base R pipe) submits to each function call to reconstruct a string formula that is then interpreted as.formula when submitted to DirichReg in the example below. The formulas produced are also printed.

> domir(acc ~ dyslexia + iq, function(y)  {
+     iv <- terms(y) |> attr("term.labels")
+     fml <- paste0("acc ~ ", paste0(iv, collapse = "+"), collapse = "")
+     print(fml)
+     DirichReg(as.formula(fml), data = RS, model = "alternative") |> performance::r2() |> _[[1]]})
[1] "acc ~ dyslexia+iq"
[1] "acc ~ dyslexia"
[1] "acc ~ iq"
Overall Value:      0.6568343 

General Dominance Values:
         General Dominance Standardized Ranks
dyslexia         0.4983012    0.7586406     1
iq               0.1585332    0.2413594     2

Conditional Dominance Values:
         Subset Size: 1 Subset Size: 2
dyslexia      0.6498178    0.346784532
iq            0.3100498    0.007016514

Complete Dominance Designations:
                 Dmnated?dyslexia Dmnated?iq
Dmnates?dyslexia               NA       TRUE
Dmnates?iq                  FALSE         NA