How can I create a column to assign a region to cities?

41 Views Asked by At

I have a database with a list of patients and their city of residence (1-34) that I would like to assign regions to (1-8). How can I add a column that assigns a region to each patient given their city? For example, Region 1 is composed of cities 1, 2, 3, 4, 5, and 8.

Thank you for any help you can provide.

I attempted to use group_by commands but have been getting repeated errors.

df1 <- df %>%
  mutate(
    cities = fct_collapse(
      cities,
      "1" = c("1", "2", "3", "4","5","8"),  
      "2" = c("10", "24"),
      "3" = c("6", "7", "15", "16"),   
      "4" = c("20", "21", "22", "28", "29"),   
      "5" = c("9", "17", "18", "19"),   
      "6" = c("25", "26", "27", "30", "34"),   
      "7" = c("11", "12", "13", "14"),   
      "8" = c("23", "31", "32", "33"))) 

For some reason I am returning a new cities column with only 6 groups (regions). Any idea what may be going wrong? Thank you so much for any help you can offer.

3

There are 3 best solutions below

0
ZKA On

You're redefining the "cities" column in your mutate() call instead of creating a new "region" column.

Try:

df1 <- df %>% 
  mutate(region = fct_collapse(
    cities, 
    "1" = c("1", "2", "3", "4","5","8"),
    "2" = c("10", "24"), 
    "3" = c("6", "7", "15", "16"),
    "4" = c("20", "21", "22", "28", "29"),
    "5" = c("9", "17", "18", "19"),
    "6" = c("25", "26", "27", "30", "34"),
    "7" = c("11", "12", "13", "14"),
    "8" = c("23", "31", "32", "33")
  ))

Here's an alternative where "cities" doesn't have to be a factor:

df1 <- df %>% 
  mutate(region = case_when(
    cities %in% c("1", "2", "3", "4","5","8") ~ "1",
    cities %in% c("10", "24") ~ "2",
    cities %in% c("6", "7", "15", "16") ~ "3",
    cities %in% c("20", "21", "22", "28", "29") ~ "4",
    cities %in% c("9", "17", "18", "19") ~ "5",
    cities %in% c("25", "26", "27", "30", "34") ~ "6",
    cities %in% c("11", "12", "13", "14") ~ "7",
    cities %in% c("23", "31", "32", "33") ~ "8"
  ))
0
benetsugarboy On

Without the original data it's hard to tell for sure what's going on, but I think it has something to do with this. fct_collapse will not create levels if they're completey made of levels that don't exist in the original factor. For example:

fct_collapse(
    factor(1:5),
    A = 1:3,
    B = 4:6,
    C = 7:9
)

Returns a warning, indicating that levels 6 to 9 do not exist in the original factor.

[1] A A A B B
Levels: A B

Warning message:
Unknown levels in `f`: 6, 7, 8, 9

It created level B, even though it contains level 6, which doesn't exist. But it didn't create level C, because the original factor does not have any of the levels 7, 8 and 9.

In your case I would simply suggest to add a line of code at the top to ensure that your original factor is well defined. To ensure the order of the levels doesn't change, you can use union.

levels(df$cities) <- union(levels(df$cities), 1:34)

df1 <- df %>%
  mutate(
    cities = fct_collapse(
      cities,
      "1" = c("1", "2", "3", "4","5","8"),  
      "2" = c("10", "24"),
      "3" = c("6", "7", "15", "16"),   
      "4" = c("20", "21", "22", "28", "29"),   
      "5" = c("9", "17", "18", "19"),   
      "6" = c("25", "26", "27", "30", "34"),   
      "7" = c("11", "12", "13", "14"),   
      "8" = c("23", "31", "32", "33"))) 
0
Adriano Mello On

welcome to SO. Please, take a moment to read about how to write a good question here.

As @benetsugarboy said it above, without the original data it's hard to tell for sure what's going on. My take is: you don't need factors, just "wrangle" your region-city data in unique-pairs, longer format and dplyr::left_join() with your patient-city data.

Here's some toy data based on what you were trying:

library(tidyverse)
set.seed(100)

# -------------------------------------------------------------------------
# Patient-city
df_patients <- tibble(
  name = starwars$name,
  city_id = sample(x = 1:34, size = length(starwars$name), replace = TRUE))

# Region-city
df_regions <- tribble(
  ~region_id, ~city_id,
  1, c(1, 2, 3, 4,5,8),  
  2, c(10, 24),
  3, c(6, 7, 15, 16),   
  4, c(20, 21, 22, 28, 29),   
  5, c(9, 17, 18, 19),   
  6, c(25, 26, 27, 30, 34),   
  7, c(11, 12, 13, 14),   
  8, c(23, 31, 32, 33))

df_regions <- unnest_longer(df_regions, city_id)

Patients and their cities:

> df_patients
# A tibble: 87 × 2
   name               city_id
   <chr>                <int>
 1 Luke Skywalker          10
 2 C-3PO                   25
 3 R2-D2                   14
 4 Darth Vader             23
 5 Leia Organa             22
 6 Owen Lars                6
 7 Beru Whitesun Lars       4
 8 R5-D4                    6
 9 Biggs Darklighter       34
10 Obi-Wan Kenobi           7
# ℹ 77 more rows

And, more important, region-city formatted like unique pairs:

# A tibble: 34 × 2
   region_id city_id
       <dbl>   <dbl>
 1         1       1
 2         1       2
 3         1       3
 4         1       4
 5         1       5
 6         1       8
 7         2      10
 8         2      24
 9         3       6
10         3       7
# ℹ 24 more rows

Since the region-city data is in a unique-pair and longer format, just perform the left_join():

# Join and arrange (you can just ` %>% ` the commands)
df_patients <- left_join(df_patients, df_regions, by = "city_id")
df_patients <- arrange(df_patients,region_id, city_id, name)

> df_patients
# A tibble: 87 × 3
   name               city_id region_id
   <chr>                <dbl>     <dbl>
 1 Sebulba                  1         1
 2 Lobot                    2         1
 3 R4-P17                   2         1
 4 Wedge Antilles           2         1
 5 Adi Gallia               3         1
 6 BB8                      3         1
 7 Ratts Tyerel             3         1
 8 Ben Quadinaros           4         1
 9 Beru Whitesun Lars       4         1
10 Jek Tono Porkins         4         1
# ℹ 77 more rows

That's it. Maybe count it?

> count(df_patients, region_id, sort = TRUE, name = "n_pct")
# A tibble: 8 × 2
  region_id n_pct
      <dbl> <int>
1         1    17
2         6    15
3         7    12
4         3    11
5         4    10
6         8    10
7         5     8
8         2     4

Patients from region "1"? No problem:

> filter(df_patients, region_id == 1)
# A tibble: 17 × 3
   name               city_id region_id
   <chr>                <dbl>     <dbl>
 1 Sebulba                  1         1
 2 Lobot                    2         1
 3 R4-P17                   2         1
 4 Wedge Antilles           2         1
 5 Adi Gallia               3         1
 6 BB8                      3         1
 7 Ratts Tyerel             3         1
 8 Ben Quadinaros           4         1
 9 Beru Whitesun Lars       4         1
10 Jek Tono Porkins         4         1
11 Yoda                     4         1
12 Gasgano                  5         1
13 Raymus Antilles          5         1
14 Saesee Tiin              5         1
15 Han Solo                 8         1
16 Jocasta Nu               8         1
17 Lama Su                  8         1