How to unzip multiple TAQ data files in SAS

I have daily TAQ data for one month and am trying to unzip the files using SAS, without success. The folder is named EQY_US_ALL_TRADE_202107. It contains one gzipped (.gz) file per trading day, named EQY_US_ALL_TRADE_202210701, EQY_US_ALL_TRADE_202210702, EQY_US_ALL_TRADE_202210703, ..., EQY_US_ALL_TRADE_202210729.

I have tried the following code. As a first test I tried to unzip only two files (hence do _n_ = 1 to 2 in line 4). It does not work at all.

data "D:\EQY_US_ALL_TRADES_202107\MainDataset";
  rc=filename("folderef","D:\EQY_US_ALL_TRADES_202107");
  did = dopen("folderef");
  do _n_ = 1 to 2;
    filename = dread(did,_n_);
    if scan(filename,-1,'.') ne 'gz' then continue;
    fullname = pathname("folderef") || '/' || filename;
    do while(1);
      infile archive zip filevar=fullname gzip dlm='|' firstobs=2 eof=nextfile;
      OUTPUT;
    end;
nextfile:
  end;
  stop;
run;

Proc contents data = "D:\EQY_US_ALL_TRADES_202107\MainDataset";
run;

There are 1 best solutions below

So you have three problems.

The primary one is understanding how to read ONE of the files. If you downloaded these from NYSE then they should be pipe-delimited text files, and the variable definitions are published. So first work on code that can read one of the files.

To read a pipe-delimited text file, just use a simple data step. Say, for example, you have the daily quotes file. The documentation says that file has 23 variables. Reading delimited files is simple: define the variables and then input them. Make sure to remove the summary line at the bottom.

data want;
  infile 'myfile.gz' zip gzip dsd dlm='|' termstr=lf truncover firstobs=2 ;
  attrib Time length=$15 label='Timestamp Time the quote was published by the SIP';
  attrib Exchange length=$1 label='The Exchange that issued the quote';
  attrib Symbol length=$17 label='Symbol Stock symbol';
  attrib BidPrice length=8 label='The highest price any buyer is willing to pay for shares of this security';
  attrib BidSize length=8 label='The maximum number of shares the highest bidder is willing to buy, in round lots';
/* you can type the rest */
  attrib SecurityStatus length=$2 label='The Security Status Indicator field is used to report trading suspensions';
  input Time -- SecurityStatus ;
  if time='END' then delete;
run;
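
Because TAQ files are large, it may help to check the layout on a small slice before typing out all of the variables. The following is only a sketch under some assumptions: 'myfile.gz' is the same placeholder path as above, the dataset name TEST is made up for illustration, and only the first three columns are defined (list input simply ignores the rest of each line). OBS=101 tells INFILE to stop after line 101 of the file, so together with FIRSTOBS=2 it should read 100 data lines.

```sas
/* Sketch: read only the first 100 data lines to check the layout.    */
/* 'myfile.gz' is a placeholder path; TEST is a hypothetical name.    */
/* Only the first three columns are defined; the rest are not read.   */
data test;
  infile 'myfile.gz' zip gzip dsd dlm='|' termstr=lf truncover firstobs=2 obs=101;
  length Time $15 Exchange $1 Symbol $17;
  input Time Exchange Symbol;
run;

/* Look at a few rows to confirm the values landed in the right columns. */
proc print data=test(obs=10);
run;
```

If the values look shifted or truncated, adjust the lengths and the variable order before moving on to the full file.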

The second problem is how to get the list of files to be read.

How to get the list of files in a directory is a common question here and on SAS Communities. Your current code is close to doing that using the DOPEN() and DREAD() functions.

data files;
  length fileref $8 filename fullname $256 ;
  rc=filename(fileref,"D:\EQY_US_ALL_TRADES_202107");
  did = dopen(fileref);
  do _n_ = 1 to dnum(did);
    filename = dread(did,_n_);
    if scan(filename,-1,'.') = 'gz' then do;
      fullname = catx('/',pathname(fileref),filename);
      output;
    end;
  end;
  keep fullname;
run;

Once you have solved those two problems you can move on to reading ALL of the files into one dataset. You can do that by driving the data step that reads the TAQ files with the dataset that holds the list of files, using the FILEVAR= option of the INFILE statement. So if you have a dataset named FILES with a variable named FULLNAME that holds the names of the GZIP files you want to read, the basic structure would look like this:

data want;
  set files ;
  infile dummy zip gzip filevar=FULLNAME end=eof dsd dlm='|' termstr=lf truncover firstobs=2 ;
  attrib Time length=$15 label='Timestamp Time the quote was published by the SIP';
  attrib Exchange length=$1 label='The Exchange that issued the quote';
  attrib Symbol length=$17 label='Symbol Stock symbol';
  attrib BidPrice length=8 label='The highest price any buyer is willing to pay for shares of this security';
  attrib BidSize length=8 label='The maximum number of shares the highest bidder is willing to buy, in round lots';
/* you can type the rest */
  attrib SecurityStatus length=$2 label='The Security Status Indicator field is used to report trading suspensions';
  do while (not eof);
    input Time -- SecurityStatus ;
    if time ne 'END' then output;
  end;
run;
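
One thing to be aware of with this structure: the variable named in FILEVAR= is not written to the output dataset, so the combined data does not record which daily file each row came from. If you need that, one option is to copy the name into another variable first. This is a sketch only; the variable name SOURCE is made up for illustration, and again only the first three columns are defined to keep it short.

```sas
/* Sketch: keep the name of the source file on every row.               */
/* SOURCE is a hypothetical variable name; FILES and FULLNAME come from */
/* the directory-listing step above.                                    */
data want;
  set files;
  length source $256;
  source = fullname;  /* copy it, since the FILEVAR= variable is not kept */
  infile dummy zip gzip filevar=fullname end=eof dsd dlm='|' termstr=lf truncover firstobs=2;
  length Time $15 Exchange $1 Symbol $17;
  do while (not eof);
    input Time Exchange Symbol;
    if Time ne 'END' then output;
  end;
run;

/* Count the rows read from each file as a sanity check. */
proc freq data=want;
  tables source;
run;
```

The PROC FREQ output should show one line per daily file, which makes it easy to spot a file that was skipped or came back empty.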