Adding statistics to the PROC MEANS/SUMMARY default output dataset.

From sasCommunity
Jump to: navigation, search

Adding statistics to the PROC MEANS/SUMMARY default output dataset. This article is very much like PROC_MEANS_-_Improve_on_the_default, but includes code and macro code for study not a complete production macro.

The default data set create by PROC MEANS/SUMMAY has qualities that I find desirable for further processing, especially if I’m trying to make a typical clinical trials summary table “Summary Statistics in Rows” The only catch is that it only contains five statistics N MEAN STD MIN and MAX. I usually want to include the MEDIAN and often Q1, Q3 and perhaps LCLM and UCLM. Other statistics are more rare but also needed from time to time. I’m hopping that SI will add an option in the future to do this so we don’t have to work so hard to get output like this.

The code to produce the 5 statistics from the default data set is about as simple as can be. PROC MEANS will summarize all numeric variables not used in other statements or the variables named in the VAR statement. I would usually include an output data set name but you don’t have to.

proc means data=sashelp.class noprint;
   class sex;
   output;
   run;
Obs    Sex    _TYPE_    _FREQ_    _STAT_      Age       Height     Weight

  1              0        19       N        19.0000    19.0000     19.000
  2              0        19       MIN      11.0000    51.3000     50.500
  3              0        19       MAX      16.0000    72.0000    150.000
  4              0        19       MEAN     13.3158    62.3368    100.026
  5              0        19       STD       1.4927     5.1271     22.774
  6     F        1         9       N         9.0000     9.0000      9.000
  7     F        1         9       MIN      11.0000    51.3000     50.500
  8     F        1         9       MAX      15.0000    66.5000    112.500
  9     F        1         9       MEAN     13.2222    60.5889     90.111
 10     F        1         9       STD       1.3944     5.0183     19.384
 11     M        1        10       N        10.0000    10.0000     10.000
 12     M        1        10       MIN      11.0000    57.3000     83.000
 13     M        1        10       MAX      16.0000    72.0000    150.000
 14     M        1        10       MEAN     13.4000    63.9100    108.950
 15     M        1        10       STD       1.6465     4.9379     22.727

As you can see all numeric variables have been summarized and the attributes, (FORMAT, INFORMAT, and LABEL) of the original variables have been inherited by the summarized variables, (Age, Height, and Weight). Although in this example using SASHELP.CLASS the variables don’t have any of those attributes.

To add one or two statistics to the default data step is easy enough as follows.

proc means data=sashelp.class noprint;
   class sex;
   output out=stats;
   output out=median median=;
   run;
data stats;
   set stats median(in=in2);
   by _type_ sex;
   if in2 then _STAT_ = 'MEDIAN';
   run;
Obs    Sex    _TYPE_    _FREQ_    _STAT_      Age       Height     Weight

  1              0        19      N         19.0000    19.0000     19.000
  2              0        19      MIN       11.0000    51.3000     50.500
  3              0        19      MAX       16.0000    72.0000    150.000
  4              0        19      MEAN      13.3158    62.3368    100.026
  5              0        19      STD        1.4927     5.1271     22.774
  6              0        19      MEDIAN    13.0000    62.8000     99.500
  7     F        1         9      N          9.0000     9.0000      9.000
  8     F        1         9      MIN       11.0000    51.3000     50.500
  9     F        1         9      MAX       15.0000    66.5000    112.500
 10     F        1         9      MEAN      13.2222    60.5889     90.111
 11     F        1         9      STD        1.3944     5.0183     19.384
 12     F        1         9      MEDIAN    13.0000    62.5000     90.000
 13     M        1        10      N         10.0000    10.0000     10.000
 14     M        1        10      MIN       11.0000    57.3000     83.000
 15     M        1        10      MAX       16.0000    72.0000    150.000
 16     M        1        10      MEAN      13.4000    63.9100    108.950
 17     M        1        10      STD        1.6465     4.9379     22.727
 18     M        1        10      MEDIAN    13.5000    64.1500    107.250

That was easy and you would just add more output statements for other statistics and concatenate them in a data step or data step view. The all works great unless you want two-tailed confidence limit for the mean, (UCLM and LCLM). Using the output model we devised we would expect to use the following.

   output out=LCLM LCLM=;
   output out=UCLM UCLM=;

This would not produce two-tailed CLM however it would produce both upper and lower one-tailed CLM, Unusually, we only want one of the one-tailed CLM and when we ask for both we want two-tailed CLM. See the TIP regarding this behavior in the PROC MEANS documentation. [1] So to keep the output model consistent and get two-tailed CLM I want to request both UCLM and LCLM on the output statement. I have chosen to use two output statements and create dummy variables that are discarded. This achieves the objective and keeps this output consistent with other statistics.

   output out=LCLM(drop=dummy: ) LCLM= UCLM=dummy1-dummyN;
   output out=UCLM(drop=dummy: ) UCLM= LCLM=dummy1-dummyN;

Here the N in “dummyN” is replaced with a number equal to the number of variables in the VAR statement. If the program we are writing is a one off the number of variables in the VAR statement can be determined by counting. If on the other had we want a smarter program or reusable code we will need a way to count the variables in the VAR statement. But, what if the VAR statement is not used or contains one or more SAS Variable List we can’t just count the words in VAR. We will need a more sophisticated method. PROC TANSPOSE is my preferred method because it is simple and provides all the features I need. See this article for details regarding expanding variable lists with PROC TRANSPOSE [2]

To accommodate the two-tailed CLM and expand the code to output more statistics it we will need to write some code to write some code. The SAS macro language is well suited to the task as we can package the whole process into macro that is parameterized to function in a way similar to PROC MEANS. I won’t try to accommodate the complete syntax as that would be rather difficult and probably not necessary. Multiple CLASS statements and class statement options will not be supported by this simple example macro.

%macro 
   defaultMeans
      (
         data     =  ,                 /* input data */
         out      =  stats,            /* output data set name */ 
         class    =  ,                 /* CLASS statement variable(s) */
         var      =  ,                 /* VAR statement variable(s), blank for ALL */
         weight   =  ,                 /* WEIGHT statement variable */
         freq     =  ,                 /* FREQ statement variable */
         stats    =  median lclm uclm, /* optional statistics */
         poptions =  missing,          /* PROC statement options */
         ooptions =  ways levels       /* OUTPUT statement options */  
      );
 
   %local dummy i w tails statslist;
   %let tails = 0;
   %let stats = %upcase(&stats);
   %if       %sysfunc(indexW(&stats,UCLM)) and %sysfunc(indexW(&stats,LCLM)) %then %let tails = 2;
   %else %if %sysfunc(indexW(&stats,UCLM)) or  %sysfunc(indexW(&stats,LCLM)) %then %let tails = 1;
 
   %if &tails eq 2 %then %do;
      proc transpose out=&sysmacroname._VARS data=&data(obs=0 drop=&class &weight &freq);
         var %sysfunc(coalesceC(&var,_NUMERIC_));
         run;
      data _null_;
         call symputX('dummy',cats('_dummy_1-_dummy_',nobs));
         stop;
         set &sysmacroname._VARS nobs=nobs;
         run;
      %put NOTE: DUMMY=&dummy;
      %end;
   proc means data=&data noprint &poptions;
      class &class;
      freq &freq;
      weight &weight;
      var &var;
      output out=&sysmacroname._OPTIONS(keep=_: &class drop=_freq_) n=  / &ooptions;
      output out=&sysmacroname._STATS;
      %let i = %eval(&i + 1);
      %let w = %scan(&stats,&i,%str( ));
      %do %while(%superQ(w) ne);         
         %if not %sysfunc(indexW(N MEAN STD MIN MAX,&w)) %then %do; 
            %let statslist=&statslist &w;
            %if %sysfunc(indexW(UCLM LCLM,&w)) and tails eq 2 %then %do;
               %if &w eq UCLM %then %do;
                  output out=&sysmacroname._&w(drop=&dummy) &w= LCLM=&dummy;
                  %end;
               %else %if &w eq LCLM %then %do;
                  output out=&sysmacroname._&w(drop=&dummy) &w= UCLM=&dummy;
                  %end;
               %end;
            %else %do;
               output out=&sysmacroname._&w &w=;
               %end;
            %end;
         %else %put NOTE: Requested statistics &w is a default statistic;   
         %let i = %eval(&i + 1);
         %let w = %scan(&stats,&i,%str( ));
         %end;
      run;
   %if &tails %then %put NOTE: CLM %sysfunc(compress(%sysfunc(putN(&tails,words))))-tailed Statistics Requested;
   data &sysmacroname._STATSV  / view=&sysmacroname._STATSV;
      set 
         &sysmacroname._STATS
         %let i = 1;
         %let w = %scan(&statslist,&i,%str( ));
         %do %while(%superQ(w) ne);         
            &sysmacroname._&w(in=_IN_&i)
            %let i = %eval(&i + 1);
            %let w = %scan(&statslist,&i,%str( ));
            %end;
         ;
      by _type_ &class;
      %if %superQ(statslist) ne %then %do;
         select;
            %let i = 1;
            %let w = %scan(&statslist,&i,%str( ));
            %do %while(%superQ(w) ne);         
               when (_IN_&i) _STAT_ = "&w";
               %let i = %eval(&i + 1);
               %let w = %scan(&statslist,&i,%str( ));
               %end;
            otherwise;
            end;
         %end;
      run;
   data &out;
      merge &sysmacroname._STATSV &sysmacroname._OPTIONS;
      by _type_ &class;
      run;
   proc datasets library=work memtype=(data view);
      delete &sysmacroname._:;
      run;
      quit;
 
   %mend defaultMeans;

Only one thing you need to see is your class variable, if this is character then it will be not sorted in proper manner. Like if class variables is Visit 1, Visit 2 and Visit 11 then it will be sorted as Visit 11 - Visit 1 - Visit 2. To avoid this make new column which is numerical form of visit and then use new column in this macro's class statement. Further you can use proc format to show them properly in your output. --sanket sinojia 07:33, 3 December 2012 (EST)