Tags column to dummy columns

191 Views Asked by At

Imagine a very simple csv file that looks like this:

output,input,tags
0,0,sunday-hot
1,3,sunday
5,1,hot-random

where the tags column is made up of a series of keywords all separated by "-".
What I would like is to turn it into a dataset that looks like this:

output,input,sunday,hot,random
0,0,T,T,F
1,3,T,F,F
5,1,F,T,T
3

There are 3 best solutions below

0
On BEST ANSWER

Here is one way to do it with dplyr/tidyr. Use separate_rows() to separate the pasted tags column to a normal column which contains one value in each cell; Create a value column that contains all TRUE to spread on; Fill missing values with FALSE:

library(dplyr); library(tidyr)

df %>% separate_rows(tags) %>% 
       mutate(val = TRUE) %>% 
       spread(tags, val, fill = FALSE)

#  output input   hot random sunday
#1      0     0  TRUE  FALSE   TRUE
#2      1     3 FALSE  FALSE   TRUE
#3      5     1  TRUE   TRUE  FALSE
1
On

We can do this with a one-liner using mtabulate

library(qdapTools)
cbind(df1[-3], mtabulate(strsplit(df1$tags, "-"))!=0)
#   output input   hot random sunday
#1      0     0  TRUE  FALSE   TRUE
#2      1     3 FALSE  FALSE   TRUE
#3      5     1  TRUE   TRUE  FALSE

NOTE: If the "tags" column is factor class, wrap it with as.character as strsplit takes only character class as input

0
On

the way you want to store the data is not efficient at all because you will create an insane amount of columns whose values will be missing most of the time.

Instead, have a look at https://github.com/juliasilge/tidytext and the unnest_tokens function.