How to use ICD10 Code in a regression model in R?

I am trying to find which ICD10 codes are causing a certain disease. ICD10 uses an alphanumeric classification, e.g. A00.00, and there are thousands of such codes, so I am not sure how to use them in my regression model. Any suggestions, please?

Data

Patient  Existing ICD10  Diabetic (Y)
P1       A00.10          1
P2       A00.20          0
P1       C00.1           1
P3       Z01             1
....

There are 3 best solutions below

1
On BEST ANSWER

An effective way to do this is to use the concept of comorbidities. My R package icd does this for standardized sets of diseases, e.g. "Diabetes", "Cancer", "Heart Disease". There is a choice of comorbidity maps, so you can pick one that aligns with your interests: the PCCC maps in icd can be used for pediatrics, while the others are for adults and span a variety of disease states.

For example, as described in the introduction vignette (these are actually ICD-9 codes, but you can use ICD-10 in the same way):

patients <- data.frame(
   visit_id = c(1000, 1000, 1000, 1000, 1001, 1001, 1002),
   icd9 = c("40201", "2258", "7208", "25001", "34400", "4011", "4011"),
   poa = c("Y", NA, "N", "Y", "X", "Y", "E"),
   stringsAsFactors = FALSE
   )
patients
  visit_id  icd9  poa
1     1000 40201    Y
2     1000  2258 <NA>
3     1000  7208    N
4     1000 25001    Y
5     1001 34400    X
6     1001  4011    Y
7     1002  4011    E
icd::comorbid_ahrq(patients)
       CHF Valvular  PHTN   PVD  HTN Paralysis NeuroOther Pulmonary    DM  DMcx Hypothyroid Renal Liver
1000  TRUE    FALSE FALSE FALSE TRUE     FALSE      FALSE     FALSE  TRUE FALSE       FALSE FALSE FALSE
1001 FALSE    FALSE FALSE FALSE TRUE      TRUE      FALSE     FALSE FALSE FALSE       FALSE FALSE FALSE
1002 FALSE    FALSE FALSE FALSE TRUE     FALSE      FALSE     FALSE FALSE FALSE       FALSE FALSE FALSE
       PUD   HIV Lymphoma  Mets Tumor Rheumatic Coagulopathy Obesity WeightLoss FluidsLytes BloodLoss
1000 FALSE FALSE    FALSE FALSE FALSE      TRUE        FALSE   FALSE      FALSE       FALSE     FALSE
1001 FALSE FALSE    FALSE FALSE FALSE     FALSE        FALSE   FALSE      FALSE       FALSE     FALSE
1002 FALSE FALSE    FALSE FALSE FALSE     FALSE        FALSE   FALSE      FALSE       FALSE     FALSE
     Anemia Alcohol Drugs Psychoses Depression
1000  FALSE   FALSE FALSE     FALSE      FALSE
1001  FALSE   FALSE FALSE     FALSE      FALSE
1002  FALSE   FALSE FALSE     FALSE      FALSE

Here "DM" is diabetes mellitus and "DMcx" is diabetes with complications, e.g., retinopathy or renal failure. This uses the US AHRQ modification of the standard Elixhauser categories.

When you have binary flags for the disease states, you can use these in any statistical or machine learning model.
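
As a minimal sketch of that last step, the flags can go straight into a logistic regression. The data below are simulated stand-ins, not output from icd: the column names `DM` and `HTN` mirror the table above, while `outcome` and the effect sizes are assumptions made up for illustration. A real comorbidity matrix could be converted with `as.data.frame()` first.

```r
# Simulated stand-in for a comorbidity table: one row per visit,
# one logical column per disease group (hypothetical data).
set.seed(42)
n <- 200
flags <- data.frame(
  DM  = rbinom(n, 1, 0.3) == 1,
  HTN = rbinom(n, 1, 0.5) == 1
)

# Hypothetical binary outcome whose log-odds rise with DM and HTN.
log_odds <- -1 + 1.2 * flags$DM + 0.5 * flags$HTN
flags$outcome <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: each flag enters as a TRUE/FALSE predictor.
fit <- glm(outcome ~ DM + HTN, data = flags, family = binomial)
summary(fit)$coefficients
```

Each coefficient is then the estimated change in log-odds of the outcome when that comorbidity flag is present.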

1
On

You may want to recode ICD10 into a variable with one or more strata. One way is to generate a variable such as dat$diabetes with levels 0 (no disease) and 1 (disease), for example using grepl. By the way, a common pattern for diabetes in ICD10 codes is E08 (please check http://eicd10.com/index.php?srchtext=diabetes&Submit=Search&action=search), whereas A00 is cholera.

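# Flag rows whose ICD10 code matches the diabetes pattern (1 = disease, 0 = no disease)
dat$diabetes <- as.integer(grepl(pattern = "E08", x = dat$ICD10))
# Add any other common ICD10 pattern to `pattern`, e.g. "E08|E09"
# as.integer() already returns a numeric 0/1 vector, so no further conversion is needed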
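
One caveat with a bare pattern is that grepl matches it anywhere in the string. A small sketch of an anchored prefix match follows. The toy data frame, the made-up string "AE08", and the assumption that the diabetes block in ICD-10-CM is E08, E09, E10, E11 and E13 are mine, so please check the pattern against the code set you actually use (WHO ICD-10 numbers the block differently).

```r
# Toy data (hypothetical): one row per recorded diagnosis.
# "AE08" is not a real code; it only shows what anchoring prevents.
dat <- data.frame(
  Patient = c("P1", "P2", "P1", "P3", "P4"),
  ICD10   = c("A00.10", "E11.9", "C00.1", "Z01", "AE08"),
  stringsAsFactors = FALSE
)

# "^" anchors the match at the start of the code, so "AE08" is not flagged.
# The alternation covers E08, E09, E10, E11 and E13 (assumed diabetes block).
diabetes_pattern <- "^E(0[89]|1[013])"
dat$diabetes <- as.integer(grepl(diabetes_pattern, dat$ICD10))
dat
```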

If you have several different patterns, you can repeat the procedure for each pattern to generate new variables and then merge them. For example:

# Start at 0, then set to 1 wherever either indicator is 1
dat$diabetes_final <- 0
dat$diabetes_final[which(dat$diabetes1 == 1 | dat$diabetes2 == 1)] <- 1
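
Because a patient can appear on several rows (P1 does in the question), the merged indicator usually needs collapsing to one row per patient before modelling. This is a rough sketch of that step in base R. The data frame is modelled on the question's sample, the E11.9 row is added purely for illustration, and the two patterns are placeholders rather than a validated grouping.

```r
# Sample rows modelled on the question; the E11.9 row is a hypothetical addition.
dat <- data.frame(
  Patient = c("P1", "P2", "P1", "P3", "P2"),
  ICD10   = c("A00.10", "A00.20", "C00.1", "Z01", "E11.9"),
  stringsAsFactors = FALSE
)

# One indicator per pattern (illustrative patterns only).
dat$diabetes1 <- as.integer(grepl("^E08", dat$ICD10))
dat$diabetes2 <- as.integer(grepl("^E11", dat$ICD10))
dat$diabetes_final <- as.integer(dat$diabetes1 == 1 | dat$diabetes2 == 1)

# Collapse to one row per patient: flagged if any of the patient's codes matched.
per_patient <- aggregate(diabetes_final ~ Patient, data = dat, FUN = max)
per_patient
```

Taking the maximum per patient means a single matching code is enough to flag that patient.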
0
On

I would suggest setting "healthy" as the reference level of the factor variable containing the diagnosis, because this gives you coefficients that show how your dependent variable changes when you compare healthy patients with patients who have a certain disease. Of course you can group the diseases, as suggested by Jean-Claude Arbaut.

This could look something like this:

# your vector with the diagnosis
diagnosis <- c("healthy", "P1 A00.10 1", "P2 A00.20 0", "P1 C00.1 1", "P3 Z01 1")

# grouping your vector. I have no idea about ICD10 groups, so this is only to show how this would work in R
diagnosis[diagnosis %in% c("P1 A00.10 1", "P2 A00.20 0")] <- "diabetes"
diagnosis[diagnosis %in% c("P1 C00.1 1", "P3 Z01 1")] <- "cancer"

# make the vector a factor with healthy as the reference
diagnosis <- factor(diagnosis)
diagnosis <- relevel(diagnosis, ref = "healthy")

# now you can use the variable in a regression
set.seed(1) # making it reproducible
dv <- rnorm(length(diagnosis)) # generating a dependent variable
summary(lm(dv ~ diagnosis)) # linear regression

# the coefficients look like this
...
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -0.6265     0.8126  -0.771    0.521
diagnosiscancer     1.5888     0.9952   1.597    0.251
diagnosisdiabetes   0.3005     0.9952   0.302    0.791
...
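
With thousands of distinct codes, one simple way to group before building the factor is to keep only the three-character category in front of the dot. The sketch below assumes that grouping is clinically acceptable for your question, and the `codes` vector is made up for illustration.

```r
# Hypothetical diagnosis codes; "healthy" marks patients with no diagnosis.
codes <- c("healthy", "A00.10", "A00.20", "C00.1", "Z01", "healthy", "C00.3")

# Keep only the three-character category (the part before the dot).
category <- sub("\\..*$", "", codes)

# Factor with "healthy" as the reference level.
category <- relevel(factor(category), ref = "healthy")
levels(category)

# Design matrix: one 0/1 column per non-reference category.
X <- model.matrix(~ category)
colnames(X)
```

The regression then estimates one coefficient per category rather than one per full code, each relative to "healthy".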