# Adding statistics to the PROC MEANS/SUMMARY default output dataset.

Adding statistics to the PROC MEANS/SUMMARY default output dataset. This article is very much like PROC_MEANS_-_Improve_on_the_default, but includes code and macro code for study not a complete production macro.

The default data set create by PROC MEANS/SUMMAY has qualities that I find desirable for further processing, especially if I’m trying to make a typical clinical trials summary table “Summary Statistics in Rows” The only catch is that it only contains five statistics N MEAN STD MIN and MAX. I usually want to include the MEDIAN and often Q1, Q3 and perhaps LCLM and UCLM. Other statistics are more rare but also needed from time to time. I’m hopping that SI will add an option in the future to do this so we don’t have to work so hard to get output like this.

The code to produce the 5 statistics from the default data set is about as simple as can be. PROC MEANS will summarize all numeric variables not used in other statements or the variables named in the VAR statement. I would usually include an output data set name but you don’t have to.

proc means data=sashelp.class noprint; class sex; output; run;

Obs Sex _TYPE_ _FREQ_ _STAT_ Age Height Weight 1 0 19 N 19.0000 19.0000 19.000 2 0 19 MIN 11.0000 51.3000 50.500 3 0 19 MAX 16.0000 72.0000 150.000 4 0 19 MEAN 13.3158 62.3368 100.026 5 0 19 STD 1.4927 5.1271 22.774 6 F 1 9 N 9.0000 9.0000 9.000 7 F 1 9 MIN 11.0000 51.3000 50.500 8 F 1 9 MAX 15.0000 66.5000 112.500 9 F 1 9 MEAN 13.2222 60.5889 90.111 10 F 1 9 STD 1.3944 5.0183 19.384 11 M 1 10 N 10.0000 10.0000 10.000 12 M 1 10 MIN 11.0000 57.3000 83.000 13 M 1 10 MAX 16.0000 72.0000 150.000 14 M 1 10 MEAN 13.4000 63.9100 108.950 15 M 1 10 STD 1.6465 4.9379 22.727

As you can see all numeric variables have been summarized and the attributes, (FORMAT, INFORMAT, and LABEL) of the original variables have been inherited by the summarized variables, (Age, Height, and Weight). Although in this example using SASHELP.CLASS the variables don’t have any of those attributes.

To add one or two statistics to the default data step is easy enough as follows.

proc means data=sashelp.class noprint; class sex; output out=stats; output out=median median=; run; data stats; set stats median(in=in2); by _type_ sex; if in2 then _STAT_ = 'MEDIAN'; run;

Obs Sex _TYPE_ _FREQ_ _STAT_ Age Height Weight 1 0 19 N 19.0000 19.0000 19.000 2 0 19 MIN 11.0000 51.3000 50.500 3 0 19 MAX 16.0000 72.0000 150.000 4 0 19 MEAN 13.3158 62.3368 100.026 5 0 19 STD 1.4927 5.1271 22.774 6 0 19 MEDIAN 13.0000 62.8000 99.500 7 F 1 9 N 9.0000 9.0000 9.000 8 F 1 9 MIN 11.0000 51.3000 50.500 9 F 1 9 MAX 15.0000 66.5000 112.500 10 F 1 9 MEAN 13.2222 60.5889 90.111 11 F 1 9 STD 1.3944 5.0183 19.384 12 F 1 9 MEDIAN 13.0000 62.5000 90.000 13 M 1 10 N 10.0000 10.0000 10.000 14 M 1 10 MIN 11.0000 57.3000 83.000 15 M 1 10 MAX 16.0000 72.0000 150.000 16 M 1 10 MEAN 13.4000 63.9100 108.950 17 M 1 10 STD 1.6465 4.9379 22.727 18 M 1 10 MEDIAN 13.5000 64.1500 107.250

That was easy and you would just add more output statements for other statistics and concatenate them in a data step or data step view. The all works great unless you want two-tailed confidence limit for the mean, (UCLM and LCLM). Using the output model we devised we would expect to use the following.

output out=LCLM LCLM=; output out=UCLM UCLM=;

This would not produce two-tailed CLM however it would produce both upper and lower one-tailed CLM, Unusually, we only want one of the one-tailed CLM and when we ask for both we want two-tailed CLM. See the TIP regarding this behavior in the PROC MEANS documentation. [1] So to keep the output model consistent and get two-tailed CLM I want to request both UCLM and LCLM on the output statement. I have chosen to use two output statements and create dummy variables that are discarded. This achieves the objective and keeps this output consistent with other statistics.

output out=LCLM(drop=dummy: ) LCLM= UCLM=dummy1-dummyN; output out=UCLM(drop=dummy: ) UCLM= LCLM=dummy1-dummyN;

Here the N in “dummyN” is replaced with a number equal to the number of variables in the VAR statement. If the program we are writing is a one off the number of variables in the VAR statement can be determined by counting. If on the other had we want a smarter program or reusable code we will need a way to count the variables in the VAR statement. But, what if the VAR statement is not used or contains one or more SAS Variable List we can’t just count the words in VAR. We will need a more sophisticated method. PROC TANSPOSE is my preferred method because it is simple and provides all the features I need. See this article for details regarding expanding variable lists with PROC TRANSPOSE [2]

To accommodate the two-tailed CLM and expand the code to output more statistics it we will need to write some code to write some code. The SAS macro language is well suited to the task as we can package the whole process into macro that is parameterized to function in a way similar to PROC MEANS. I won’t try to accommodate the complete syntax as that would be rather difficult and probably not necessary. Multiple CLASS statements and class statement options will not be supported by this simple example macro.

%macro defaultMeans ( data = , /* input data */ out = stats, /* output data set name */ class = , /* CLASS statement variable(s) */ var = , /* VAR statement variable(s), blank for ALL */ weight = , /* WEIGHT statement variable */ freq = , /* FREQ statement variable */ stats = median lclm uclm, /* optional statistics */ poptions = missing, /* PROC statement options */ ooptions = ways levels /* OUTPUT statement options */ ); %local dummy i w tails statslist; %let tails = 0; %let stats = %upcase(&stats); %if %sysfunc(indexW(&stats,UCLM)) and %sysfunc(indexW(&stats,LCLM)) %then %let tails = 2; %else %if %sysfunc(indexW(&stats,UCLM)) or %sysfunc(indexW(&stats,LCLM)) %then %let tails = 1; %if &tails eq 2 %then %do; proc transpose out=&sysmacroname._VARS data=&data(obs=0 drop=&class &weight &freq); var %sysfunc(coalesceC(&var,_NUMERIC_)); run; data _null_; call symputX('dummy',cats('_dummy_1-_dummy_',nobs)); stop; set &sysmacroname._VARS nobs=nobs; run; %put NOTE: DUMMY=&dummy; %end; proc means data=&data noprint &poptions; class &class; freq &freq; weight &weight; var &var; output out=&sysmacroname._OPTIONS(keep=_: &class drop=_freq_) n= / &ooptions; output out=&sysmacroname._STATS; %let i = %eval(&i + 1); %let w = %scan(&stats,&i,%str( )); %do %while(%superQ(w) ne); %if not %sysfunc(indexW(N MEAN STD MIN MAX,&w)) %then %do; %let statslist=&statslist &w; %if %sysfunc(indexW(UCLM LCLM,&w)) and tails eq 2 %then %do; %if &w eq UCLM %then %do; output out=&sysmacroname._&w(drop=&dummy) &w= LCLM=&dummy; %end; %else %if &w eq LCLM %then %do; output out=&sysmacroname._&w(drop=&dummy) &w= UCLM=&dummy; %end; %end; %else %do; output out=&sysmacroname._&w &w=; %end; %end; %else %put NOTE: Requested statistics &w is a default statistic; %let i = %eval(&i + 1); %let w = %scan(&stats,&i,%str( )); %end; run; %if &tails %then %put NOTE: CLM %sysfunc(compress(%sysfunc(putN(&tails,words))))-tailed Statistics Requested; data &sysmacroname._STATSV / view=&sysmacroname._STATSV; set &sysmacroname._STATS %let i = 1; %let w = %scan(&statslist,&i,%str( )); %do %while(%superQ(w) ne); &sysmacroname._&w(in=_IN_&i) %let i = %eval(&i + 1); %let w = %scan(&statslist,&i,%str( )); %end; ; by _type_ &class; %if %superQ(statslist) ne %then %do; select; %let i = 1; %let w = %scan(&statslist,&i,%str( )); %do %while(%superQ(w) ne); when (_IN_&i) _STAT_ = "&w"; %let i = %eval(&i + 1); %let w = %scan(&statslist,&i,%str( )); %end; otherwise; end; %end; run; data &out; merge &sysmacroname._STATSV &sysmacroname._OPTIONS; by _type_ &class; run; proc datasets library=work memtype=(data view); delete &sysmacroname._:; run; quit; %mend defaultMeans;

Only one thing you need to see is your class variable, if this is character then it will be not sorted in proper manner. Like if class variables is Visit 1, Visit 2 and Visit 11 then it will be sorted as Visit 11 - Visit 1 - Visit 2. To avoid this make new column which is numerical form of visit and then use new column in this macro's class statement. Further you can use proc format to show them properly in your output. --sanket sinojia 07:33, 3 December 2012 (EST)