Missing Values in Classification Variables
Classification variables are used by procedures to form groups. There are two basic kinds of classification variables:
- implied (e.g. PROC FREQ and PROC REPORT) where there is no CLASS statement
- explicit (e.g. PROC MEANS/SUMMARY, PROC TABULATE, and PROC UNIVARIATE) where there is a CLASS statement
Regardless of the kind of classification variable, when a classification variable has a missing value that observation is excluded from the analysis or report.
You can alter this behavior though the use of the MISSING option. Depending on the procedure this option is applied in one of several ways. When it appears on the PROC statement it applies to all of the classification variables used in the procedure. However it can often be used on other statements within the PROC step to further refine how it is to be applied.
For procedures that support the CLASS statement (those that have explicit classification variables), the MISSING option can now be applied directly.
class age sex region / missing;
Since multiple CLASS statements are allowed we could further refine which classification variables are to receive the MISSING option by specifying multiple CLASS statements. In the following example the MISSING only applies to the variable SEX.
class age; class sex / missing; class region;
For procedures that have implied classification variables the MISSING option can generally be applied to specific statements within the procedure step. In PROC FREQ the MISSING option can be used on the TABLES statement.
proc freq data=clinics; tables age*sex / missing;
In a REPORT step the MISSING option can be used on the DEFINE statement.
proc report data=clinics; column region age wt; define region / group missing;
If there are two or more CLASS variables not subject to the MISSING option, exclusion of observations is determined jointly. That is, if any of those variables has a missing value for some observation, that observation is excluded from all counts, calculations and results. The excluded observations are part of the count shown in the log (NOTE: There were ___ observations read from the data set ____.____.), but otherwise they behave as if they were filtered by means of a WHERE condition.
PROC FREQ works differently. If two or more variables are designated in TABLE statements (and not subject to the MISSING option), exclusion of observations is determined independently for each tabulation.
For example, consider this data set:
data demo ; input Aa Bb ; format Aa Bb Questfmt. ; cards ; 1 1 2 2 2 2 . 8 8 8 8 8 ;
The format is based on one presented in the documentation.
proc format ; value Questfmt 1 ='Yes' 2 ='No' 8,. ='N/A' ; run ;
Printed without the format in effect, the data set looks like this.
Obs Aa Bb 1 1 1 2 2 2 3 2 2 4 . 8 5 8 8 6 8 8
With the format it looks like this.
Obs Aa Bb 1 Yes Yes 2 No No 3 No No 4 N/A N/A 5 N/A N/A 6 N/A N/A
When the following PROC MEANS code
proc means data=demo order=formatted ; class Aa ; class Bb ; ways 1 ; run ;
runs, this is the output
N Bb Obs N/A 2 No 2 Yes 1 N Aa Obs N/A 2 No 2 Yes 1
Observation #4 has been excluded, for both Aa and Bb. Now try this similar PROC FREQ code.
proc freq data=demo order=formatted nlevels ; table Aa ; table Bb ; run ;
The result is
Number of Variable Levels Missing Nonmissing Variable Levels Levels Levels Aa 3 1 2 Bb 3 0 3 Cumulative Cumulative Aa Frequency Percent Frequency Percent No 2 66.67 2 66.67 Yes 1 33.33 3 100.00 Frequency Missing = 3 Cumulative Cumulative Bb Frequency Percent Frequency Percent N/A 3 50.00 3 50.00 No 2 33.33 5 83.33 Yes 1 16.67 6 100.00
First notice that the one-way tabulation for Bb incorporates all 6 observations. That's because there are no missing values for Bb. It doesn't matter whether there are missing values for other classification variables.
In the table for Aa, observations are excluded. Typically, it would only happen to observation #4, where a missing value is in fact stored. Here however we have a special case. The format (Questfmt) groups missing values with the non-missing value 8. PROC FREQ treats such a "mixed" (missing and non-missing) level as a missing level. That's corroborated by the Number of Variable Levels report triggered by the NLEVELS option and shown above. The documentation is a bit sketchy about this. It says "When you use PROC FORMAT to create a user-written format that combines missing and nonmissing values into one category, PROC FREQ treats the entire category of formatted values as missing." That would lead us to expect the level labeled N/A for variable Bb to be considered missing, since its format meets the condition described. However, it is clearly considered non-missing, indicating that there is a second condition: that at least one missing value is encountered in the data set. That's why the mixed (N/A) level is considered missing for Aa but not for Bb.
Finally, there is some not exactly intuitive behavior when the MISSING option is brought into effect. Applying MISSING to one CLASS variable can change results for a different CLASS variable. This is a consequence of the "joint" filtering discussed above. Suppose the PROC MEANS code is altered by invoking the missing option for Aa
proc means data=demo order=formatted ; class Aa / missing ; class Bb ; ways 1 ; run ;
The results are
N Bb Obs N/A 3 No 2 Yes 1 N Aa Obs N/A 3 No 2 Yes 1
The inclusion of observation #4 in the counts for Aa is a direct effect of the MISSING option for Aa. But Bb also includes more observations, as a consequence of the joint filter.
Code and inline data to create the CLINIC data set referenced above can be found in SAS Example 59224.
(Issues to be explored: crosstabs in PROC FREQ; other procs, esp. REPORT)