I'm getting a "The remote Process is out of memory" in SAS DIS (Data Integration Studio):
Since it is possible that my approach is wrong, I'll explain the problem I'm working on and the solution I've decided on:
I have a large list of customer names which need cleaning. In order to achieve this, I use a .csv file containing regular expression patterns and their corresponding replacements; (I use this approach since it is easier to add new patterns to the file and upload it to the server for the deployed job to read from rather than harcoding new rules and redeploying the job).
In order to get my data step to make use of the rules in the file I add the patterns and their replacements to an array in the first iteration of my data step then apply them to my names. Something like:
DATA &_OUPUT;
ARRAY rule_nums{1:&NOBS} _temporary_;
IF(_n_ = 1) THEN
DO i=1 to &NOBS;
SET WORK.CLEANING_RULES;
rule_nums{i} = PRXPARSE(CATS('s/',rule_string_match,'/',rule_string_replace,'/i'));
END;
SET WORK.CUST_NAMES;
customer_name_clean = customer_name;
DO i=1 to &NOBS;
customer_name_clean = PRXCHANGE(a_rule_nums{i},1,customer_name_clean);
END;
RUN;
When I run this on around ~10K rows or less, it always completes and finishes extremely quickly. If I try on ~15K rows it chokes for a super long time and eventually throws an "Out of memory" error.
To try and deal with this I built a loop (using the SAS DIS loop transformation) wherein I number the rows of my dataset first, then apply the preceding logic in batches of 10000 names at a time. After a very long time I got the same out of memory error, but when I checked my target table (Teradata) I noticed that it ran and loaded the data for all but the last iteration. When I switched the loop size from 10000 to 1000 I saw exactly the same behaviour.
For testing purposes I've been working with only around ~500K rows but will soon have to handle millions and am worried about how this is going to work. For reference, the set of cleaning rules I'm applying is currently 20 rows but will grow to possibly a few hundred.
- Is it significantly less efficient to use a file with rules rather than hard coding the regular expressions directly in my datastep?
- Is there any way to achieve this without having to loop?
- Since my dataset gets overwritten on every loop iteration, how can there be an out of memory error for datasets that are 1000 rows long (and like 3 columns)?
- Ultimately, how do I solve this out of memory error?
Thanks!
The issue turned out to be that the log that the job was generating was too large. The possible solutions are to disable logging or to redirect the log to a location which can be periodically purged and/or has enough space.