I was wondering if someone could help me with a statistical problem I have run into. Any help would be greatly appreciated.
Please note that I have simplified the description below for clarity. It leaves out details such as weighting and variable transformations.
I am conducting a project which investigates the association between the funding a hospital receives and the clinical need of that hospital. I am investigating whether the funding formula would match clinical need more closely if it also included a measure of the deprivation level of the area. The analysis starts with three variables (a fourth, Input_age, is introduced further down):
Output - measure of clinical need (how ill people are in the area)
Input_funding - measure of the funding
Input_deprivation - measure of the deprivation level
Input_funding is the result of a government formula calculated from six variables, including the age of the local population.
My plan was to conduct GLM analysis as follows:
Fit1 <- glm(Output ~ Input_funding, data = df)
Fit2 <- glm(Output ~ Input_funding + Input_deprivation, data = df)
I planned to calculate the adjusted R2 for the two models. The hypothesis was that R2(Fit2) > R2(Fit1), which would suggest that taking deprivation into account would improve current funding formulas by bringing them closer to clinical need.
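To make the comparison concrete, here is a minimal sketch of how I would compute it. It assumes the default Gaussian family, where the glm deviance is the residual sum of squares, so the result should match what lm reports. The data are simulated and purely illustrative (not my real data), and adj_r2 is a helper I made up for this post.

```r
# Simulated, purely illustrative data; the real df is not shown here.
set.seed(1)
n <- 200
df <- data.frame(Input_funding = rnorm(n), Input_deprivation = rnorm(n))
df$Output <- 1 + 0.8 * df$Input_funding + 0.3 * df$Input_deprivation + rnorm(n)

# Hypothetical helper: adjusted R2 for a Gaussian glm, from its deviances.
adj_r2 <- function(fit) {
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1                 # predictors, excluding the intercept
  r2 <- 1 - fit$deviance / fit$null.deviance  # ordinary R2
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

Fit1 <- glm(Output ~ Input_funding, data = df)
Fit2 <- glm(Output ~ Input_funding + Input_deprivation, data = df)

# Compare the two adjusted R2 values side by side.
c(Fit1 = adj_r2(Fit1), Fit2 = adj_r2(Fit2))
```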
Multiple studies have shown that, when controlling for age, Input_deprivation is positively associated with Output. However, because more deprived areas tend to have younger populations, Input_deprivation had a negative coefficient in Fit2. I therefore need to adjust for age in the above analysis, using the variable Input_age. My current plan is to compare the adjusted R2 of the following two models. The hypothesis is now that R2(Fit4) > R2(Fit3).
Fit3 <- glm(Output ~ Input_funding + Input_age, data = df)
Fit4 <- glm(Output ~ Input_funding + Input_age + Input_deprivation, data = df)
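To illustrate the sign flip I am describing, here is a small sketch on simulated data. All coefficients below are invented for illustration and are not estimates from my real data. The assumptions are that age drives Output strongly, deprivation is negatively correlated with age, and funding depends partly on age.

```r
# Simulated, purely illustrative data; every coefficient here is an assumption.
set.seed(2)
n <- 500
Input_age         <- rnorm(n)
Input_deprivation <- -0.7 * Input_age + rnorm(n, sd = 0.7)  # deprived areas are younger
Input_funding     <- 0.5 * Input_age + rnorm(n)             # funding formula uses age
Output            <- 2 * Input_age + 0.5 * Input_deprivation + rnorm(n)
df <- data.frame(Output, Input_funding, Input_age, Input_deprivation)

Fit2 <- glm(Output ~ Input_funding + Input_deprivation, data = df)
Fit4 <- glm(Output ~ Input_funding + Input_age + Input_deprivation, data = df)

# Deprivation coefficient without and with the age adjustment.
c(Fit2 = unname(coef(Fit2)["Input_deprivation"]),
  Fit4 = unname(coef(Fit4)["Input_deprivation"]))
```

In this toy setup the deprivation coefficient is negative in Fit2 and positive in Fit4, which is the pattern I believe is happening in my data.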
However, given that age is used to calculate Input_funding, I wondered if this would be problematic (it would certainly introduce some collinearity between Input_age and Input_funding). Additionally, I would ideally prefer that neither fit include age as its own variable, given that age has such a large effect. However, I see no other way to control for it. I thought of using Input_age*Input_deprivation, but in R this also includes the separate terms Input_deprivation and Input_age. Any ideas as to (1) whether this is an issue and (2) how else I could control for age would be hugely appreciated!
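In case it helps anyone answering, this is roughly how I was thinking of checking the collinearity: a variance inflation factor for each predictor in Fit4, computed by hand as 1 / (1 - R2) from regressing that predictor on the others. Again, the data are simulated and the dependence of funding on age is an assumed value, not the real formula. vif_manual is a hypothetical helper name.

```r
# Simulated, purely illustrative predictors; the 0.5 and -0.7 are assumptions.
set.seed(3)
n <- 500
Input_age         <- rnorm(n)
Input_funding     <- 0.5 * Input_age + rnorm(n)
Input_deprivation <- -0.7 * Input_age + rnorm(n, sd = 0.7)
df <- data.frame(Input_age, Input_funding, Input_deprivation)

# Hypothetical helper: VIF of one predictor given the other predictors.
vif_manual <- function(target, others, data) {
  f  <- reformulate(others, response = target)  # e.g. Input_age ~ Input_funding + ...
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}

preds <- c("Input_funding", "Input_age", "Input_deprivation")
vifs  <- sapply(preds, function(v) vif_manual(v, setdiff(preds, v), df))
vifs
```

A value close to 1 would mean the predictor is nearly uncorrelated with the others, and larger values mean more inflation of that coefficient's variance. I am not sure what threshold would count as a problem in my case.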