How to create breaks using the cut function without numbers overlapping


I have a dataset and need to cut its age variable into 3 age categories, e.g. age group 1 (10-20 years old), age group 2 (21-30 years old), and age group 3 (31-40 years old).

If I pass breaks=c(10, 20, 30, 40) when calling cut, the outcome is as follows: age group 1 is 10-20, age group 2 is 20-30, and age group 3 is 30-40.

I do not want this! I need age group 2 to run from 21 to 30 years of age (however, 20 appears to be part of this age category now). I would appreciate some help, thank you.


There is 1 best solution below


I think you are misinterpreting the results. The intervals are half-open: they include the upper bound but not the lower bound. For example:

 age = sample(10:40, 50, replace=TRUE)
 cut(age, breaks=c(10, 20, 30, 40))
 [1] (30,40] (30,40] (30,40] (20,30] (30,40] (30,40] (30,40]
 [8] (30,40] (10,20] (30,40] (20,30] (30,40] (30,40] (10,20]
[15] (10,20] (30,40] (30,40] (20,30] (30,40] (30,40] (20,30]
[22] (30,40] (30,40] (30,40] (10,20] (20,30] (10,20] (10,20]
[29] (10,20] (10,20] (20,30] (10,20] (20,30] (30,40] (20,30]
[36] (20,30] (20,30] (20,30] (10,20] (30,40] (20,30] (20,30]
[43] (10,20] (20,30] (20,30] (30,40] (30,40] (20,30] (10,20]
[50] (20,30]
Levels: (10,20] (20,30] (30,40]

This means that the number 20 falls only in the first group (10,20], not in the second group (20,30], so the second group already starts at 21 for whole-number ages. Also note that by default the lowest break is excluded, so an age of exactly 10 would fall outside every interval and come back as NA. A better call than the one above is cut(age, breaks=c(10, 20, 30, 40), include.lowest = TRUE), which makes the lowest level the fully closed interval [10,20].
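To check the boundary behaviour yourself, here is a small sketch using a handful of hand-picked ages instead of a random sample. The `labels` values and the variable names `default_groups` and `age_group` are my own illustrative choices, not part of the answer above, and the "21-30" style labels only describe the groups correctly if ages are whole numbers.

```r
# Boundary ages chosen to show where each value lands.
age <- c(10, 20, 21, 30, 31, 40)

# Default: intervals are (lower, upper], so 10 is outside every interval
# and is returned as NA.
default_groups <- cut(age, breaks = c(10, 20, 30, 40))

# include.lowest = TRUE closes the first interval: [10,20], (20,30], (30,40].
# The labels are illustrative and assume integer ages.
age_group <- cut(age,
                 breaks = c(10, 20, 30, 40),
                 include.lowest = TRUE,
                 labels = c("10-20", "21-30", "31-40"))

# Side-by-side view of each age and the group it was assigned to.
print(data.frame(age, default_groups, age_group))
```

With these inputs, 20 should land in the first group and 21 in the second, which is the split asked for in the question.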