Creating an ID variable for duplicates in SAS

2k Views Asked by At

I have a SAS dataset with an ID variable which is supposed to be unique at the person level, but in reality there are duplicates. What I'd like to do is create a duplicate ID which only fills when a person has duplicate values of ID, like this:

ID      Dupe_ID
1 
2          1
2          1
3
4          2
4          2

Any help is much appreciated!

4

There are 4 best solutions below

1
On

It's handling it slightly differently, but just in case it's of use to you and/or others - proc sort has a handy simple dupout= option for seperating out non-unique key observations:

proc sort data=have out=want dupout=dups ;
  by id ;
run ;

The first occurence of each id will go to the want dataset. Any subsequent observations with the same id will go to the dups dataset

0
On
proc sort data = dataset out = sortdata;
by id;
run;

data younameit;
length dup_id 1;
set sortdata;
by id;
if first.id and last.id then dup_id =;
else dup_id =1;
run;

My approach is to use Data Step with First. and Last. You need to perform sorting at both PROCEDURE proc sort and DATA step "by" immediately after set statementbefore First. and Last. could work in data step.

if an observation itself is the FIRST and the LAST observation of the by-group(i.e. Id), then it must be an unique item. Say if there are two observations with id =2 , the earlier observation would be the first.id and the later would be the last.id.

1
On
data have;
    input ID;
    cards;
1 
2 
2 
3
4 
4 
;

/*if sorted*/
data want;
    set have;
    by id;

    if first.id and not last.id then
        _dup+1;
    dup_id=_dup;

    if first.id and last.id then
        call missing (dup_id);
    drop _dup;
run;
0
On

In SAS 9.3+, there is a new option on proc sort which can be of use. If you want to split your data into "actually unique" and "not unique" datasets (in order to later process the not uniques and work out what they should be), the following will do that:

proc sort data=have out=nonuniquedata nouniquekey uniqueout=uniquedata;
  by id;
run;

NOUNIQUEKEY is basically the opposite of NODUPKEY: it only keeps records that are NOT unique. So here the "primary" output dataset will have the nonunique records, and the "uniqueout" dataset will have the unique ones.