How to detect a specific value combination (or condition) of variables within a group


I have a survey dataset which contains household ids and individual ids within each household; individual 1 is the interviewee him/herself. One variable records each individual's relationship to the interviewee (for example, 2 for spouse, 3 for parents and so on). The data structure looks like the following:

???

Now what I want to do is detect the occurrence of certain values in var1 and, if it occurs, whether the values of var1 and var2 satisfy a certain condition.

For example, if var1 and var2 satisfy

(var1 == 3 & var2 == 1) | (var1 == 4 & var2 == 1)

then I want to assign the value 1 to a newly generated variable, say var3, for every individual in the same group (the household in this case, to represent family structure), and 0 otherwise.

This does not seem like a hard problem, and I suppose I should use some

by group: egen

or

by group: gen

command, but I'm not sure. I have tried commands like

gen l_w_p = 0
by hhid: replace l_w_p = 1 if (var1 == 3 & a2004 == 1) | (var2 == 4 & a2004 == 1)
by hhid: replace l_w_p = 2 if (var1 == 3 & a2004 == 2) & (var2 == 4 & a2004 == 2)

but that doesn't seem to work. Does this need some kind of loop?


There are 2 best solutions below


@Dimitriy V. Masterov provided a good specific answer, but there is scope to address the question more generally.

As his answer shows,

  1. Problems of the form "does any member of this group have this characteristic?" can be tackled by applying egen's max() function, over groups, to a true-or-false expression yielding 0 or 1, namely an indicator (or, in a poor terminology popular in some fields, a dummy).

A little thought shows that

  2. Problems of the form "do all members of this group have this characteristic?" can be tackled by applying egen's min() function, over groups, to a true-or-false expression yielding 0 or 1, etc.
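The two cases above can be sketched as follows. This is a minimal illustration on made-up data: the variable x, the value 2 and the names any_x2 and all_x2 are assumptions chosen for the example, not part of the original question.

```stata
clear
input hhid x
1 1
1 2
2 0
2 1
3 2
3 2
end

* any_x2 is 1 if at least one member of the household has x == 2
egen any_x2 = max(x == 2), by(hhid)

* all_x2 is 1 only if every member of the household has x == 2
egen all_x2 = min(x == 2), by(hhid)

list, sepby(hhid)
```

Note that x == 2 evaluates to 0 when x is missing, so missing values count as "does not have the characteristic" here; whether that is what you want depends on your data.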

The whole story is fleshed out in an FAQ, "How do I create a variable recording whether any members of a group (or all members of a group) possess some characteristic?" (so a meta-lesson is to make use of the resources available to you).

One step away are problems about the other members of a group, also discussed in an FAQ, "How do I create variables summarizing for each individual properties of the other members of a group?"
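One common approach to the "other members" problem, sketched here under assumed names (x, n_x2, n_x2_others and any_other_x2 are all hypothetical), is to compute the group total of an indicator and then subtract each individual's own contribution:

```stata
clear
input hhid x
1 1
1 2
2 0
2 1
3 2
3 2
end

* Number of household members with x == 2
egen n_x2 = total(x == 2), by(hhid)

* Remove each individual's own contribution to get the count among the others
gen n_x2_others = n_x2 - (x == 2)

* Indicator: does any OTHER member of the household have x == 2?
gen byte any_other_x2 = n_x2_others > 0

list, sepby(hhid)
```

In household 1 the member with x equal to 2 gets 0, because nobody else in that household has the characteristic, while the other member gets 1.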

For fuller discussions that may be useful, see this article and this article

Two further comments:

a. In code like this

gen l_w_p = 0
by hhid: replace l_w_p = 1 if (var1 == 3 & a2004 == 1) | (var2 == 4 & a2004 == 1)
by hhid: replace l_w_p = 2 if (var1 == 3 & a2004 == 2) & (var2 == 4 & a2004 == 2)

the by: prefix makes no difference to what is done. Each replace still evaluates its if condition observation by observation, and the prefix doesn't spread the result to the other members of the group. That is why it "doesn't work" (which is, in itself, normally a fairly useless error report).
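To make the difference concrete, here is a sketch on invented data that uses the condition stated in the question, (var1 == 3 & var2 == 1) | (var1 == 4 & var2 == 1). The data values and the name row_flag are assumptions for illustration only.

```stata
clear
input hhid var1 var2
1 1 1
1 3 1
2 1 1
2 2 2
3 1 2
3 4 1
end

* Observation-level flag: only the rows that themselves satisfy the
* condition get 1, with or without a by: prefix
gen byte row_flag = (var1 == 3 & var2 == 1) | (var1 == 4 & var2 == 1)

* Group-level flag: every member gets 1 if any row in the household matches
egen var3 = max((var1 == 3 & var2 == 1) | (var1 == 4 & var2 == 1)), by(hhid)

list, sepby(hhid)
```

In households 1 and 3 only one row has row_flag equal to 1, but var3 is 1 for both members; in household 2 both are 0 throughout.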

b. Mild abstraction is useful in explaining problems, but abstraction in naming variables just makes your code more difficult to read. I wouldn't use variable names such as var1, var2, etc., which just impose a burden of remembering what is what. Use evocative names such as any_unemployed or any_married or whatever. This is more than personal style, as when you are asking others to think about your code (as here), being able to read it easily is a great help.

0
On

I have a hard time figuring out what you are asking. A good strategy is to give an example of your data and desired output, simplified as far as possible to the essence of your problem. This is much easier than describing the data in words.

Let's start simple. Suppose you have data that looks like this:

hhid    x
1       1
1       2
2       0
2       1

and you want to tag households where x is ever 2. One way is

bys hhid: egen tag=max(cond(x==2,1,0))

This will produce:

hhid   x   tag  
   1   1     1  
   1   2     1  
   2   0     0  
   2   1     0  

Working from the inside out: for each member, cond() checks whether x is 2. If it is, the member gets a 1; if not, a 0. The max() then takes the maximum of this binary indicator over the entire household, so every member gets a 1 if anyone in the household has x equal to 2.

The conditions can get more complicated and the condition functions can be nested like Russian dolls.

Here's a more complicated example. Suppose you want to tag households where someone has x = 2 (tag with a 1) or y >= 5 (tag with a 2) in this dataset:

hhid   x   y  
   1   1   1  
   1   2   2  
   2   0   3  
   2   1   4  
   3   1   5  
   3   3   5  

We check x first, and then check y if the x condition is false:

bys hhid : egen tag=max(cond(x==2,1,cond(y>=5,2,0)))

This yields:

hhid   x   y   tag  
   1   1   1     1  
   1   2   2     1  
   2   0   3     0  
   2   1   4     0  
   3   1   5     2  
   3   3   5     2
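One caveat with the nested cond() inside max(): the priority of x over y holds only within each observation. If one member of a household has x equal to 2 and a different member has y >= 5, the maximum over the household is 2, not 1. The sketch below is an alternative, not the method above; it assumes you want x to take priority at the household level, and the extra household 4 and the names any_x2 and any_y5 are invented for illustration.

```stata
clear
input hhid x y
1 1 1
1 2 2
2 0 3
2 1 4
3 1 5
3 3 5
4 2 1
4 1 6
end

* Household-level indicators, one per condition
egen any_x2 = max(x == 2), by(hhid)

* Missing values count as larger than any number in Stata,
* so exclude them explicitly from the y >= 5 test
egen any_y5 = max(y >= 5 & !missing(y)), by(hhid)

* Apply the priority at the household level: x first, then y
gen byte tag = cond(any_x2, 1, cond(any_y5, 2, 0))

list, sepby(hhid)
```

Households 1 to 3 get the same tags as before (1, 0 and 2), while household 4, which satisfies both conditions through different members, is tagged 1 rather than 2.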