PROC APPEND Alternatives

From sasCommunity
Jump to: navigation, search

Introduction

Suppose you have 10,000 small data sets, with identical structures, that are to be appended to a template or base data set. Can it be done without running PROC APPEND 10,000 times?

Here is an example, starting with the base:

data base;
length source a $ 3 b $ 5 d $ 4;
retain source a     b     d     ;
stop;
run;

Here are three data sets representing the 10,000:

data one
     two
     e04
     ;
length a b c $ 4;
input (a b c)($);
source='one'; output one;
source='two'; output two;
source='e04'; output e04;
cards;
a1-- b1-- c1--
a2-- b2-- c2--
a2-- b3-- c3--
a2-- b2-- c2--
;

The three are alike in structure, but they differ in structure from the base. One variable is longer, one is shorter, one is absent in the base, and one is present in the base but not in the three sources.

PROC APPEND

This is the obvious:

proc append base=base data=one force; run;
proc append base=base data=two force; run;
proc append base=base data=e04 force; run;

The FORCE option is needed because of the structure discrepancies.

The downside is the overhead of 10,000 steps and all the messages in the log. Each step generates a series of messages like:

NOTE: Appending WORK.ONE to WORK.BASE.
WARNING: Variable c was not found on BASE file. The variable will not be added to the BASE file.
WARNING: Variable a has different lengths on BASE and DATA files (BASE 3 DATA 4).
WARNING: Variable b has different lengths on BASE and DATA files (BASE 5 DATA 4).
WARNING: Variable d was not found on DATA file.
NOTE: FORCE is specified, so dropping/truncating will occur.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: 4 observations added.
NOTE: The data set WORK.BASE has 4 observations and 4 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.15 seconds
      cpu time            0.06 seconds

PROC DATASETS

PROC DATASETS can be used instead:

proc datasets nolist;
append base=work.base data=work.one force;
 run;
append base=work.base data=work.two force;
 run;
append base=work.base data=work.e04 force;
 run;
quit;

All is done in one step. However, the same cycle of messages for each append appears:

NOTE: Appending WORK.ONE to WORK.BASE.
WARNING: Variable c was not found on BASE file. The variable will not be added to the BASE file.
WARNING: Variable a has different lengths on BASE and DATA files (BASE 3 DATA 4).
WARNING: Variable b has different lengths on BASE and DATA files (BASE 5 DATA 4).
WARNING: Variable d was not found on DATA file.
NOTE: FORCE is specified, so dropping/truncating will occur.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: 4 observations added.
NOTE: The data set WORK.BASE has 16 observations and 4 variables.

It seems that behind the scenes PROC APPEND and PROC DATASETS use the same code, so little performance gain should be expected.

PROC SQL

The INSERT statement can be used:

proc sql;
insert into base ( source, a, b, d   )
 select            source, a, b, ' '
  from one;
insert into base ( source, a, b, d   )
 select            source, a, b, ' '
  from two;
insert into base ( source, a, b, d   )
 select            source, a, b, ' '
  from e04;
quit;

The alignment of source and destination is based on position, not column name. Each statement generates just one message, like this one:

NOTE: 4 rows were inserted into WORK.BASE.

The processor opens and closes BASE for each statement, so it's probably not too efficient.

DATA step

The MODIFY statement can be used to open a data set for writing additional observations, as in:

data base;
if 0 then modify base;
set one two e04 open=defer;
output;
run;

There is one NOTE for each data set opened by SET and one overall NOTE generated by MODIFY:

NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 4 observations read from the data set WORK.TWO.
NOTE: There were 4 observations read from the data set WORK.E04.
NOTE: The data set WORK.BASE has been updated.  There were 0 observations rewritten, 12
      observations added and 0 observations deleted.

This should be rather efficient, since BASE is opened only once.

Concern: Will this break or degrade in performance when the number of data sets read gets really large?

Instead of using the SET statement, the source data sets can be specified at run time and harvested using hash objects. For example:

data base;
if 0 then modify base;
if 0 then call missing(source, a, b);
* CALL MISSING here is workaround for bug described
  in http://support.sas.com/kb/19/207.html .
  Bug appears to affect OUTPUT as well as REPLACE.
  Bug appears to be fixed in 9.2.
  ;
do dsn = 'one', 'two', 'e04';
  declare hash hh(dataset:dsn);                               
  declare hiter ii  ('hh') ;                             
  hh.definekey  ('source', 'a', 'b') ;                           
  hh.definedata ('source', 'a', 'b') ;                        
  hh.definedone () ;   
  rc = ii.first();
  do while (rc=0);
     output;
     rc = ii.next();
     end; 
  hh.delete();
  end; 
  stop;                                           
run;

The log shows:

NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 4 observations read from the data set WORK.TWO.
NOTE: There were 4 observations read from the data set WORK.E04.
NOTE: The data set WORK.BASE has been updated.  There were 0 observations rewritten, 9
      observations added and 0 observations deleted.

Note that only nine observations were added. That's because the hash object purges duplicate observations. A related side effect: order of observations is not preserved. Question: Can SAS 9.2 hash enhancements mitigate these side effects?

The hash approach should be more robust than the SET statement for really large numbers of data sets, and ought to be at least as fast.

Macro loop/function

Another option is using a loop. Especially if the datasets are numbered or stored in a common location, a macro can run through them in an automated way. If the files are numbered, the code below can be simplified to append a number to a filename, e.g.:

%let num_files = 10000;

%do i=1 %to &num_files;
    %let padded_number = %sysfunc(putn(&i, z6.));
    proc append base=lib.full_set
                data=lib.table_%padded_number;
    run;
 %end;

This example will read a list of files in a directory (here are two other examples). It can be easily modified to list files from a Linux server by modifying the filename statement:

filename file_list pipe "cd &folder_location; ls -1 file_template*.ext";

You can then format the list of files, and put their names into macro variables:

data WORK.file_list;
    infile file_list truncover; /* for more on truncover, see this paper and this blog post */
    input filename $20.; /* change if you expect longer file names */
run;

proc sql;
    select count(*)
    into :num_files
    from WORK.file_list;
quit;

data _null_;
    set WORK.file_list;
    call symputx(cats("file_", _N_), trim(file));
run;

A macro %do loop can then be used on the files.

%do i=1 %to &num_files;
    proc append base=lib.full_set
                data=&file_i;
    run;
 %end;

This will need to be modified to suit your needs. The general concept can also be applied to the other methods above. Macros are powerful, and very useful for automation.