I have a dataset like this
data have;
do i = 1 to 1000;
y = ranuni(0);
x1 = y ** 2;
x2 = x1 ** 3;
x3 = x2 - x1/2;
output;
end;
run;
I build a correlation matrix like this:
proc corr
data = have
out = correlation_matrix
(where = (_TYPE_ = "CORR"))
noprint;
run;
I've attempted to think out loud some code which would achieve something similar to what I'm looking for, the syntax nor the logic is correct but I'm having a go at describing what I'm looking for
proc sort
data = correlation_matrix
by _NAME_;
run;
data _temp;
set correlation_matrix;
array col[*] _numeric_;
by _NAME_;
do i = 1 to dim(col);
if col(i) > 0.6 then do;
%let list = append(vname(col));
end;
run;
And from the correlation matrix, I'm looking for a way to return pairs which have a correlation of 60% or is above some threshold, the pairs I will then use to build scatter plot / histogram matrices like this
proc contents;
data = high_correlation_pairs
out = contents
noprint;
run;
proc sort
data = contents
nodupkey;
by name;
run;
proc sql noprint;
select name INTO: highly_correlated_pairs
separated by " "
from contents
;
quit;
ODS GRAPHICS /
IMAGEMAP=OFF;
OPTIONS VALIDVARNAME=ANY;
PROC SGSCATTER
DATA=have;
TITLE "Scatter Plot Matrix";
FOOTNOTE;
MATRIX &highly_correlated_pairs
/
DIAGONAL=(HISTOGRAM )
START=TOPLEFT
NOLEGEND
;
RUN;
TITLE; FOOTNOTE;
I'm just unsure how to select variables from the matrix which have a pair of over 60% correlation, it could even by a by NAME return the columns with corr over 60%
You can get the pairs like this - the key is the
vname
function, which returns the name of an array element:Perhaps from there you can work out the rest?