how could I handle an iterative naming system for this?

95 Views Asked by At

I have a dataset of this kind, this dataset contains some missing values (I'm representing them with X).

ID  VA  VB  ... Vn      |       ID  VA  VB  ... Vn  |   
1   a1  b1  ... n1      |       1   a1  X   ... n1  |
2   a2  b2  ... n2      |===    2   X   b2  ... X   |
3   a3  b3  ... n3      |===    3   a3  b3  ... n3  |
..................      |       ..................  |
N   aN  bN  ... nN      |       N   X   bN  ... nN  |

I want to add observations by ID using only one variable column, I will call that variable VAR: something like an inverted proc format by ID; var VAR;.

ID  VAR
1   a1
..
1   n1 
2   b2
..
3   a3
3   b3
..
3   n3
..
N   bN
..  
N   nN

so I tried to split OLDdataset in different datasets (NEWa NEWb ...NEWn) where, in each dataset we have all the not-missing observations stored in a column called VAR. Then I will merge NEWa NEWb ... NEWn in NEWdataset and I will apply a proc sort for restoring the order by ID.

The problem arised when I realized that "n" is not known prior the analysis 'cause I want to setup a generalized code that won't work for only one dataset, and VA VB VN are the result of a proc format step.

If n is knows I would use something like this:

data NEWa NEWb NEWc;
set OLDdataset;
array try[3] VA VB VC ;
if try[1] ne '.' then output NEWa;
if try[1] ne '.' then output NEWb;
if try[1] ne '.' then output NEWc;
run;

But now i need some iterative naming system that maybe could sounds like:

data NEW_i;                     <-------- "i" must be assigned
set OLDdataset;
array try[*] V: ;
do i=1 dim(try);
if try[i] ne '.' then output NEW_i; <---- "i" must be assigned
run;

Hope this sounds clear. Any hints? Thanks.

2

There are 2 best solutions below

1
On BEST ANSWER

Try a proc transpose instead, you can delete the missing in a second step or add a where clause to the out data set. If you want to create subsets after that based on ID separate them out in a data step.

*generate sample data;
data have;
array V(20) v1-v20;
do ID=1 to 10;
    do i=1 to 20;
    v(i)=rand('normal', 0, 1);
    if v(i) < -2 then v(i)=.;
    end;
    output;
end;
drop i;
run;

*Flip data;
proc transpose data=have out=want(where=(var1 ne .)) prefix=Var;
    by id;
run;

*Separate into different data sets;
data _null_;
    set want;
    by id;
    call execute(catt("data var", id, "; set want; where id=", id, ";run;"));
run;
1
On

If I've understood the scenario correctly you are dealing with numeric variables, so on that basis you might not necessarily have to split the data by variable into different data sets, as you should be able to use the nmiss() function which returns the number of missing numeric values, e.g. in a subsetting if:

data newdataset;
  set olddataset;
  if nmiss(&varlist)=0;
run;

where &varlist could be created just before using olddataset in the above data step (as you said the number of variables can be dynamic) using:

/* note literals are case sensitive */
proc sql noprint;
  select name into :varlist separated by ','
  from dictionary.columns
  where libname='WORK'
    and memname='OLDDATASET'
    /* use following if you also have non-numeric variables */
    /* and type='num' */
  ;
quit;

You can use the following to see the contents of &varlist:

%put varlist=&varlist;

or even use the following option before the macro variable is referenced:

options symbolgen;