My data set spans one year. It originally had quarter-hourly values, which I then aggregated to hourly values. So now I have 8760 hours, with hours 0 to 23 for each day. For a linear regression, I would now like to create a dummy matrix with 24 rows and 24 columns, something like this:
| hour 1 | hour 2 | ... |
|---|---|---|
| 1 | 0 | ... |
| 0 | 1 | ... |
I tried it with different functions, but nothing works. I hope someone can help me. This is the code for the current dataset:
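
    library(dplyr)      # %>%, mutate, group_by, summarise
    library(lubridate)  # hour, floor_date

    # Keep the first 16 characters ("dd.mm.yyyy HH:MM") and parse them as a timestamp
    data$time <- substr(data$x, 1, 16)
    data$time <- as.POSIXct(data$time, format = "%d.%m.%Y %H:%M", tz = "UTC")
    data <- subset(data, select = -c(x, y))
    data$hour <- hour(data$time)
    head(data)

    # Aggregate the quarter-hourly values to hourly sums
    df <- data %>%
      mutate(data_aggregate = floor_date(time, unit = "hour")) %>%
      group_by(data_aggregate) %>%
      summarise(W = sum(W, na.rm = TRUE))

    # Hour of day (0 to 23) as a factor
    df1 <- df %>% mutate(hour = as.factor(hour(data_aggregate)))
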
As @Onyambu explained, you don't need to do this by hand in R. When you fit a linear regression (or another type of statistical model) and include a factor variable as a predictor, R automatically generates a dummy variable for each level of the factor except one, which is used as the reference level. This is known as "dummy coding" (treatment contrasts). It differs slightly from full "one-hot encoding", which keeps a column for every level.
In your case, when you create a factor variable for the hour, R will automatically create 23 dummy variables (since there are 24 hours, and one is used as the reference level).
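A minimal sketch of that automatic behaviour, using a made-up data frame (`toy`, with random `W` values) as a stand-in for `df1`, since the real data isn't shown:

```r
# Hypothetical stand-in for df1: one week of hourly values with random W.
set.seed(1)
toy <- data.frame(
  hour = factor(rep(0:23, times = 7)),
  W    = rnorm(24 * 7)
)

# lm() expands the factor into dummy variables on its own.
fit <- lm(W ~ hour, data = toy)

# One intercept (hour 0 is the reference level) plus 23 hour dummies.
length(coef(fit))
names(coef(fit))[1:3]
```

Each `hourN` coefficient is then the estimated difference from the reference hour 0.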
If you still want to create the dummy variables yourself, here's how to do it (not recommended).
To create a dummy matrix, you can use base R's `model.matrix` function with the formula `~ hour - 1`. The `- 1` removes the intercept, which is what gives all 24 columns, one for each hour. You can reattach this matrix to the original data frame using `cbind()`.
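As a rough sketch of those two steps, again on a made-up data frame (`toy` and `toy_wide` are illustrative names, not from the question):

```r
# Hypothetical stand-in for df1: two days of hourly values.
toy <- data.frame(
  hour = factor(rep(0:23, times = 2)),
  W    = seq_len(48)
)

# Full dummy matrix: no intercept, so every hour gets its own column.
dummies <- model.matrix(~ hour - 1, data = toy)
dim(dummies)            # one row per observation, one column per hour
colnames(dummies)[1:3]

# Reattach the dummy columns to the original data frame.
toy_wide <- cbind(toy, dummies)
```

Note that the matrix has one row per observation rather than 24 rows, and each row contains a single 1 in the column for its hour.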