I am strugling to query a claims database due to computational efficency and runtime. The server I'm using runs out of resources and does not execute the code bellow as a result. I'm essentially trying to look within a large claims database (2-3 billion rows) and find anyone with hypertension defined by the codes and columns listed bellow. I was hoping someone could provide a suggestion as how to achieve the same goal but improve runtime and not have the server crash.
proc sql;
create table hyper as
select var1, var2
from claim_db
WHERE SUBSTR(col1, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col2, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col3, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col4, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col5, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col6, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col7, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col8, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col9, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col10, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col11, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col12, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405') or
SUBSTR(col13, 1, 3) IN ('I10', 'I11', 'I12', 'I13', 'I15', 'I16', '401','402','403','404','405');
quit;
I've even tried to do just one column at a time in hopes of just having 13 tables and then appending the tables later with the same logic as the code bellow, but the server still ran out of resources after just trying the first where statement. Does anyone have any suggestions how to achieve the same thing in this code with maximum efficency in mind?
For billions of records, you probably need to work with the DBA who owns the database. You'd want to understand what database it is (Oracle? SQL server?), if there is an index on the columns, etc. Then try to figure out how to write the query efficiently in the native SQL. Then change to use an explicit pass-through query, to let the database do all the work before passing to SAS.
Your current code is probably requiring that the billions of records all be transferred from the database to SAS, so that SAS can subset them. You'll definitely want to change to explicit pass-through, so that the database runs the query and subsets the data, then only returns the selected records to SAS.