Visualize patterns of missing values

Heat map of missing values among among 8 variables

Missing values present challenges for the statistical analyst and data scientist. Many modeling techniques (such as regression) exclude observations that contain missing values, which can reduce the sample size and reduce the power of a statistical analysis. Before you try to deal with missing values in an analysis (for example, by using multiple imputation), you need to understand which variables contain the missing values and to examine the patterns of missing values. I have previously written about how to use SAS to do the following:

An example of a visualization of a missing value pattern is shown to the right. The heat map shows the missing data for eight variables and 5209 observations in the Sashelp.Heart data set. It is essentially the heat map of a binary matrix where 1 indicates nonmissing values (white) and 0 indicates missing values (black).

In this article, I present a bar chart that helps to visualize the "Missing Data Patterns" table from PROC MI in SAS/STAT software. I also show two SAS tricks:

The pattern of missing data

The MI procedure can perform multiple imputation of missing data, but it also can be used as a diagnostic tool to group observations according to the pattern of missing data. You can use the NIMPUTE=0 option to display the pattern of missing value. By default, the "Missing Data Patterns" table is very wide because it includes the group means for each variable. However, you can use the DISPLAYPATTERN=NOMEANS option (SAS9.4M5) to suppress the group means, as follows:

%let Vars = AgeAtStart Height Weight Diastolic Systolic MRW Smoking Cholesterol;
%let numVars = %sysfunc(countw(&Vars));    /* &numVars = 8 for this example */
proc mi data=Sashelp.Heart nimpute=0 displaypattern=nomeans;   /* SAS 9.4M5 option */
var &Vars;
ods output MissPattern=Pattern;        /* output missing value pattern to data set */

In the table, an X indicates that a variable does not contain any missing values, whereas a dot (.) indicates that a variable contains missing values. The table shows that Group 1 consists of about 5000 observations that are complete. Groups 2, 3, and 6 contains observations for which exactly one variable contains missing values. Groups 4 and 5 contain observations for which two variables are jointly missing. Group 7 contains observations for which three variables are missing. You can see that the size of the groups vary widely. Some patterns of missing value appear only a few times, whereas others appear much more often.

By inspecting the table, you can determine which combinations of variables are missing for each group. However, imagine a dataset that has 20 variables. The "Missing Data Patterns" table for 20 variables would be very wide, and it would be cumbersome to determine which combination of variables are jointly missing. The next section condenses the table into a smaller format and then creates a graph to summarize the pattern of missing values.

Shorten the labels for the missing data patterns

The information in the "Missing Data Patterns" table can be condensed. I saw the following idea in Chapter 10 of Gerhard Svolba's Data Quality for Analytics Using SAS, who credits the idea to a display that appears in JMP software. (Svolba does not use PROC MI, but defines a macro that you can download from the book's website.)

The table is wide because there is a column for every variable in the analysis. But the information in those columns is binary: for the i_th group does the j_th variable have a missing value? If the analysis involves k variables, you can replace the k columns by a single binary string that has k digits. The j_th digit in the string indicates whether the j_th variable has a missing value.

In the preceding analysis, the ODS OUTPUT statement wrote the "Missing Data Patterns" table to a data set. The following SAS DATA step reads the data and constructs a binary string of length k from the k character variables in the data. The binary string is formed by using the CATT function to concatenate the 'X' and '.' values in the table. The TRANSLATE function then converts those characters to '0' and '1' characters. The DATA step also computes the NumMiss variable, which counts the number of variables that have missing values for each row:

data Miss;
set Pattern;
array vars[*] _CHARACTER_;    /* or use &Vars */
length Pattern $&numVars.;    /* length = number of variables in array */
Pattern = translate(catt(of vars[*]), '01', 'X.');     /* Ex: 00100100 */
NumMiss = countc(pattern, '1');
proc print data=Miss noobs;
   var Group Pattern NumMiss Freq;

The table shows that the eight columns for the analysis variables have been replaced by a single column that displays an eight-character binary string. For Group 1, the string is all zeros, which indicates that no variable has missing values. For Group 2, the binary string contains all zeros except for a 1 in the last position. This means that Group 1 is the set of observations for which the last variable contains a missing value. Similarly, Group 7 is the set of observations for which the second, third, and sixth variables contain a missing value. Notice that this table does not provide the names of the variables. If you cannot remember the name of the second, third, and sixth variables, you need to look them up in the VARS macro variable.

Create a bar chart that has a logarithmic scale

The condensed version of the "Missing Data Patterns" table is suitable to graph as a bar chart. However, for these data the size of the groups vary widely. Consequently, you might want to plot the frequencies of the groups on a logarithmic scale. (Or you might not! There is considerable debate about whether you should display a bar chart that has a logarithmic axis. For a discussion and alternatives, see Sanjay's article "Graphs with log axis." My opinion is that a log axis is fine to use for a technical audience such as statisticians.)

By default, you cannot use a logarithmic scale on a bar chart because the baseline for the vertical axis (the frequency or count axis) starts at 0 and the LOG function is not defined at 0. If you have count data and no category has zero counts, then you can set the baseline of the graph to 1. The bars then indicate the log-frequency of the counts. The following call to PROC SGPLOT creates the bar chart and a marginal table:

title "Pattern of Missing Values";
proc sgplot data=Miss;
   hbar Pattern /response=Freq datalabel
                       baseline=1;  /* set BASELINE=1 for log scale */
   yaxistable NumMiss / valuejustify=left label="Num Miss"
                       valueattrs=GraphValueText labelattrs=GraphLabelText; /* use same attributes as axis */
   yaxis labelposition=top; 
   xaxis grid type=log logbase=10 label="Frequency (log10 scale)";

This graph displays the counts of the number of observations in each pattern group. The labels on the Y axis indicate which variables have missing data. The values in the right margin indicate how many variables have missing data. If you are opposed to drawing bar charts on a logarithmic axis, you can use one of Sanjay's alternative visualizations.

In summary, you can create a bar chart that visualizes the number of observations that have a similar pattern of missing values. The summarization comes from PROC MI in SAS, which has a new DISPLAYPATTERN=NOMEANS option. A short DATA step can create a binary string for each group. That string can be used to indicate the missing data pattern in each group. If the groups vary widely in size, you can use a logarithmic axis to display the data. To create a bar chart on a logarithmic scale in SAS, set BASELINE=1 in the HBAR statement and use TYPE=LOG on the XAXIS statement in PROC SGPLOT.

The post Visualize patterns of missing values appeared first on The DO Loop.


A SAS programming technique to modify ODS templates

This article demonstrates a SAS programming technique that I call Kuhfeld's template modification technique. The technique enables you to dynamically modify an ODS template and immediately call the modified template to produce a new graph or table. By following the five steps in this article, you can implement the technique in less than 20 lines of SAS code.

Kuhfeld's Template Modification Technique (TMT)

The template modification technique (TMT) is one of several advanced techniques that Warren Kuhfeld created and promoted in a series of articles that culminated with Kuhfeld (2016). (Slides are also available.) The technique that I call the TMT is a DATA _NULL_ program that modifies a SAS-supplied template, uses CALL EXECUTE to compiles it "on the fly," and immediately calls the modified template to render a new graph (pp. 12ff in the paper; p. 14 of his slides).

When I attended Warren's 2016 presentation, I felt a bit overwhelmed by the volume of information. It was difficult for me to see the forest because of all the interconnected "trees." Graph customization is easier to understand if you restrict your attention only to GTL templates. You can make simple modifications to a template (for example, changing the value of an option) by using the following five steps:

  1. Prepend a private item store to the ODS search path. The private item store will store the modified template.
  2. Write the existing template to a temporary text file.
  3. Run a DATA step that reads the existing template one line at a time. Check each line to see if it contains the option that you want to change. If so, modify the line to change the template's behavior.
  4. Compile the modified template into the private item store.
  5. Call PROC SGRENDER to render the graph by using the new template.

This article is intentionally brief. For more information see the chapter "ODS Graphics Template Modification" in the SAS/STAT documentation.

Most of Kuhfeld's examples use the TMT to modify ODS templates that are distributed with SAS. These templates can be complex and often contain conditional logic. In an effort to produce a simpler example, the next section creates a simple template that uses a heat map to display a response variable on a uniform grid.

Example: A template that specifies a color ramp

Programmers who use the GTL to create graphs have probably encountered dynamic variables and run-time macro variables. You can use these to specify certain options at run time, rather than hard-coding information into a template. However, the documentation states that "dynamic variables ... cannot resolve to statement or option keywords." In particular, the GTL does not support changing the color ramp at run time: whatever color ramp is hard-coded in the template is the color ramp that is used. (Note: If you have SAS 9.4M3, the HEATMAPPARM statement in PROC SGPLOT supports specifying a color ramp at run time.)

However, you can use Kuhfeld's TMT to modify the color ramp at run time. The following GTL defines a template named "HEATMAP." The template uses the HEATMAPPARM statement and hard-codes the COLORMODEL= option to be the "ThreeColorRamp" in the current ODS style. The template is adapted from the article "Creating a basic heat map in SAS."

/* Construct example template: a basic heat map with continuous color ramp */
proc template; 
define statgraph HEATMAP;
dynamic _X _Y _Z;           /* dynamic variables */
 layout overlay;
    heatmapparm x=_X y=_Y colorresponse=_Z /  /* specify variables at run time */
       name="heatmap" primary=true xbinaxis=false ybinaxis=false
       colormodel = THREECOLORRAMP;           /* <== hard-coded color ramp */
    continuouslegend "heatmap";

When the PROC TEMPLATE step runs, the HEATMAP template is stored in a location that is determined by your ODS search path. You can run ods path show; to display the search path. You can discover where the HEATMAP template is stored by running proc template; list HEATMAP; run;.

The following DATA step defines (x,y) data on a uniform grid and defines z to be a cubic function of x and y. The heat map shows an approximate contour plot of the cubic function:

/* Construct example data: A cubic function z = f(x,y) on regular grid */
do x = -1 to 1 by 0.05;
   do y = -1 to 1 by 0.05;
      z = x**3 - y**2 - x + 0.5;
/* visualize the data using the built-in color ramp */
proc sgrender data=Cubic template=Heatmap; 
   dynamic _X='X' _Y='Y' _Z='Z';
Heat map rendered with original template

The remainder of this article shows how to use Kuhfeld's TMT to create a new template which is identical to the HEATMAP template except that it uses a different color ramp. Kuhfeld's technique copies the existing template line by line but replaces the text "THREECOLORRAMP" with text that defines a new custom color ramp.

In the following program, capital letters are used for the name of the existing template, the name of the new template, and the value of the new color ramp. Most of the uncapitalized terms are "boilerplate" for the TMT, although each application will require unique logic in the DATA _NULL_ step.

Step 1: Prepend a private item store to the ODS search path

PROC TEMPLATE writes templates to the first (writable) item store on the ODS search path. PROC SGRENDER looks through the ODS search path until it locates a template with a given name. The first step is to prepend a private item store to the search path, as shown by the following statement:

/* 1. Modify the search path for ODS. Prepend an item store in WORK */
ods path (prepend) work.modtemp(update);

The ODS PATH statement ensures that the modified template will be stored in WORK.MODTEMP. You can provide any name you want. The name WORK.TEMPLAT is a convention that is used in the SAS documentation. Because you might already have WORK.TEMPLAT on your ODS path, I used a different name. (Note: Any item stores in WORK are deleted when SAS exits.) The SAS/STAT documentation provides more information about the template search path.

Step 2: Write the existing template to a temporary text file

You need to write the source code for the HEATMAP template to a plain text file so that it can be read by the SAS DATA step. You can use the FILENAME statement to point to a hard-coded location (such as filename t "C:/temp/tmplt.txt";), but I prefer to use the SAS keyword TEMP, which is more portable. The TEMP keyword tells SAS to create a temporary file name in a location that is independent of the operating system and the file system:

/* 2. Write body of existing template to text file */
filename _tmplt TEMP;             /* create temporary file and name */
proc template;
   source HEATMAP / file=_tmplt;  /* write template source to text file */

If you omit the FILE= option, the template source is written to the SAS log. The text file contains only the body of the HEATMAP template.

Steps 3 and 4: Create a new template from the old template

The next step reads and modifies the existing template line by line. There are many SAS character functions that enable you to discover whether some keyword exists in a line of text. Examples include the INDEX and FIND family of functions, the PRXCHANGE function (which uses regular expressions), and the TRANWRD function.

The goal of the current example is to replace the word "THREECOLORRAMP" by another string. The TRANWRD function replaces the string if it is found; the line is unchanged if the word is not found. Therefore it is safe to call the TRANWRD function on every line in the existing template.

You can use the CALL EXECUTE function to place each (possibly modified) line of the template onto a queue, which will be executed when the DATA step completes. However, the text file does not start with a PROC TEMPLATE statement nor end with a RUN statement. You must add those statements to the execution queue in order to compile the template, as follows:

/* one possible way to specify a new color ramp */
%let NEWCOLORRAMP = (blue cyan yellow red);
/* 3. Read old template; make modifications.
   4. Use CALL EXECUTE to compile (modified) template with a new name. */
data _null_;                       /* Modify graph template to replace default color ramp */
   infile _tmplt end=eof;          /* read from text file; last line sets EOF flag to true */
   input;                          /* read one line at a time */
   if _n_ = 1 then 
       call execute('proc template;');      /* add 'PROC TEMPLATE;' before */
   if findw(_infile_, 'define statgraph') then
      _infile_ = "define statgraph NEWHEATMAP;";  /* optional: assign new template name */
   /* use TRANWRD to make substitutions here */
   _infile_ = tranwrd(_infile_, 'THREECOLORRAMP', "&NEWCOLORRAMP"); /* replace COLORMODEL= option */
   call execute(_infile_);                  /* push line onto queue; compile after DATA step */
   if eof then call execute('run;');        /* add 'RUN;' at end */

The DATA step copies the existing template source, but modifies the template name and the COLORMODEL= option. The CALL EXECUTE statements push each (potentially modified) statement onto a queue that is executed when the DATA step exits. The result is that PROC TEMPLATE compiles a new template and stores it in the WORK.MODTEMP item store. Recall that the TRANWRD function and other string-manipulation functions are case sensitive, so be sure to use the exact capitalization that exists in the template. In other words, this technique requires that you know details of the template that you wish to modify!

If you are using this technique to modify one of the built-in SAS templates that are used by SAS procedure, do not modify the template name.

Inspect the modified template

Before you attempt to use the new template, you might want to confirm that it is correct. The following call to PROC TEMPLATE shows the location and creation date for the new template. The source for the template is displayed in the log so that you can verify that the substitutions were made correctly.

proc template;
   list NEWHEATMAP / stats=created; /* location of the modified template */
   source NEWHEATMAP;       /* confirm that substitutions were performed */

Step 5: Render the graph by using the new template

The final step is to call PROC SGRENDER to use the newly created template to create the new graph:

/* 5. Render graph by using new template */
proc sgrender data=Cubic template=NEWHEATMAP; 
   dynamic _X='X' _Y='Y' _Z='Z';
Render graph with modified template

The output demonstrates that the original color ramp ("ThreeColorRamp") has been replaced by a four-color ramp that uses the colors blue, cyan, yellow, and red.

When is this necessary?

You might wonder if the result is worth all this complexity and abstraction. Couldn't I have merely copied the template into a program editor and changed the color ramp? Yes, you can use a program editor to modify a template for your personal use. However, this programming technique enables you to write macros, tasks, and SAS/IML functions that other people can use. For example, you could encapsulate the program in this article into a %CreateHeatmap macro that accepts a color ramp and data set as an argument. Or in SAS/IML you can write modules such as the HEATMAPCONT and HEATMAPDISC subroutines, which enable SAS/IML programmers to specify any color ramp to visualize a matrix.

If you work in a regulated industry and are modifying templates that SAS distributes, this technique provides reproducibility. A reviewer or auditor can run a program to reproduce a graph.


Kuhfeld's template modification technique (TMT) enables you to write a SAS program that modifies an ODS template. The TMT combines many SAS programming techniques and can be difficult to comprehend when you first see it. This article breaks down the technique into five simpler steps. You can use the TMT to edit ODS templates that are distributed by SAS and used in SAS procedures, or (as shown in this article) to change options at run time for a template that you wrote yourself. You can download the SAS program for this article.

Happy Halloween to all my readers. I hope you'll agree that Kuhfeld's TRICKS are a real TREAT!

The post A SAS programming technique to modify ODS templates appeared first on The DO Loop.


The diffogram and other graphs for multiple comparisons of means

In a previous article, I discussed the lines plot for multiple comparisons of means. Another graph that is frequently used for multiple comparisons is the diffogram, which indicates whether the pairwise differences between means of groups are statistically significant. This article discusses how to interpret a diffogram. Two related plots are also discussed.

How to create a diffogram in SAS

The diffogram (also called a mean-mean scatter diagram) is automatically created when you use the PDIFF=ALL option on the LSMEANS statement in several SAS/STAT regression procedures such as PROC GLM and PROC GLMMIX, assuming that you have enabled ODS graphics. The documentation for PROC GLM contains an example that uses data about the chemical composition of shards of pottery from four archaeological sites in Great Britain. Researchers want to determine which sites have shards that are chemically similar. The data are contained in the pottery data set. The following call to PROC GLM performs an ANOVA of the calcium oxide in the pottery at the sites. The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of calcium oxide for the different sites. The ADJUST=TUKEY option is one way to adjust the confidence intervals for multiple comparisons.

ods graphics on;
proc glm data=pottery plots(only)=( diffplot(center)          /* diffogram */
                                    meanplot(cl ascending) ); /* plot of means and CIs */
   label Ca = "Calcium Oxide (%)";
   class Site;
   model Ca = Site;
   lsmeans Site / pdiff=all cl adjust=tukey;  /* all pairwise comparisons of means w/ adjusted CL */
   ods output LSMeanDiffCL=MeanDiff;          /* optional: save mean differences and CIs */

Two graphs are requested: the diffogram (or "diffplot") and a "mean plot" that shows the group means and 95% confidence intervals. The ODS OUTPUT statement creates a data set from a table that contains the mean differences between pairs of groups, along with 95% confidence intervals for the differences. You can use that information to construct a plot of the mean differences, as shown later in this article.

How to interpret a diffogram

Diffogram in SAS showing multiple comparisons of means

The diffogram, which is shown to the right (click to enlarge), is my favorite graph for multiple comparisons of means. Every diffogram displays a diagonal reference line that has unit slope. Horizontal and vertical reference lines are placed along the axes at the location of the means of the groups. For these data, there are four vertical and four horizontal reference lines. At the intersection of most reference lines there is a small colored line segment. These segments indicate which of the 4(4-1)/2 = 6 pairwise comparisons of means are significant.

Let's start at the top of the diffogram. The mean for the Caldecot site is about 0.3, as shown by a horizontal reference line near Y=0.3. On the reference line for Caldecot are three line segments that are centered at the mean values for the other groups, which are (from left to right) IslandThorns, AshleyRails, and Llanederyn. The first two line segments (in blue) do not intersect the dashed diagonal reference line, which indicates that the means for the (IslandThorns, Caldecot) and (AshleyRails, Caldecot) pairs are significantly different. The third line segment (in red) intersects the diagonal reference line, which indicates that the (Llanederyn, Caldecot) comparison is not significant.

Similarly, the mean for the Llanederyn site is about 0.2, as shown by a horizontal reference line near Y=0.2. On the reference line for Llanederyn are two line segments, which represent the (IslandThorns, Llanederyn) and (AshleyRails, Llanederyn) comparisons. Neither line segment intersects the diagonal reference line, which indicates that the means for those pairs are significantly different.

Lastly, the mean for the AshleyRails site is about 0.05, as shown by a horizontal reference line near Y=0.05. The line segment on that reference line represents the (IslandThorns, AshleyRails) comparison. It intersects the diagonal reference line, which indicates that those means are not significantly different.

The colors for the significant/insignificant pairs depends on the ODS style, but it is easy to remember the simple rule: if a line segment intersects the diagonal reference line, then the corresponding group means are not significantly different. A mnemonic is "Intersect? Insignificant!"

The mean plot with confidence interval: Use wisely!

Mean plot and 95% confidence intervals, created with SAS

The previous section shows how to use the diffogram to visually determine which pairs of means are significantly different. This section reminds you that you should not try to use the mean plot (shown at the right) for making those inferences.

I have seen presentations in which the speaker erroneously claims that "the means of these groups are significantly different because their 95% confidence intervals do not overlap." That is not a correct inference. In general, the overlap (or lack thereof) between two (1 – α)100% confidence intervals does not give sufficient information about whether the difference between the means is significant at the α level. In particular, you can construct examples where

  • Two confidence intervals overlap, but the difference of means is significant. (See Figure 2 in High (2014).)
  • Two confidence intervals do not overlap, but the difference of means is not significant.

The reason is twofold. First, the confidence intervals in the plot are constructed by using the sample sizes and standard deviations for each group, whereas tests for the difference between the means are constructed by using pooled standard deviations and sample sizes. Second, if you are making multiple comparisons, you need to adjust the widths of the intervals to accommodate the multiple (simultaneous) inferences.

"But Rick," you might say, "in the mean plot for these data, the Llanederyn and Caldecot confidence intervals overlap. The intervals for IslandThorns and AshleyRails also overlap. And these are exactly the two pairs that are not significantly different, as shown by the diffogram!" Yes, that is true for these data, but it is not true in general. Use the diffogram, not the means plot, to visualize multiple comparisons of means.

Plot the pairwise difference of means and confidence intervals

In addition to the diffogram, you can visualize comparisons of means by plotting the confidence intervals for the pairwise mean differences. Those intervals that contain 0 represent insignificant differences. Recall that the call to PROC GLM included an ODS output statement that created a data set (MeanDiff) that contains the mean differences. The following DATA step constructs labels for each pair and computes whether each pairwise difference is significant:

data Intervals;
length Pair $28.;
length S $16.;
set MeanDiff;
/* The next line is data dependent. For class variable 'C', concatenate C and _C variables */
Pair = catx(' - ', Site, _Site);           
Significant = (0 < LowerCL | UpperCL < 0); /* is 0 in interior of interval? */
S = ifc(Significant, "Significant", "Not significant");
title "Pairwise Difference of LSMeans (Tukey Adjustment)";
title2 "95% Confidence Intervals of Mean Difference";
footnote J=L "Pairs Whose Intervals Contain 0 Are Not Significantly Different";
proc sgplot data=Intervals;
   scatter y=Pair x=Difference / group=S name="CIs"
           xerrorlower=LowerCL xerrorupper=UpperCL;
   refline 0 / axis=x;
   yaxis reverse colorbands=odd display=(nolabel) 
                 offsetmin=0.06 offsetmax=0.06;
   keylegend "CIs" / sortorder=ascending;
Plot of mean differences and confidence intervals, created with SAS

The resulting graph is shown to the right. For each pair of groups, the graph shows an estimate for the difference of means and the Tukey-adjusted 95% confidence intervals for the difference. Intervals that contain 0 indicate that the difference of means is not significant. Intervals that do not contain 0 indicate significant differences.

Although the diffogram and the difference-of-means plot provide the same information, I prefer the diffogram because it shows the values for the means rather than for the mean differences. Furthermore, the height of the difference-of-means plot needs to be increased as more groups are added (there are k(k-1)/2 rows for k groups), whereas the diffogram can accommodate a moderate number of groups without being rescaled. On the other hand, it can be difficult to read the reference lines in the diffogram when there are many groups, especially if some of the groups have similar means.

For more about multiple comparisons tests and how to visualize the differences between means, see the following references:

The post The diffogram and other graphs for multiple comparisons of means appeared first on The DO Loop.


Graphs for multiple comparisons of means: The lines plot

Last week Warren Kuhfeld wrote about a graph called the "lines plot" that is produced by SAS/STAT procedures in SAS 9.4M5. (Notice that the "lines plot" has an 's'; it is not a line plot!) The lines plot is produced as part of an analysis that performs multiple comparisons of means. Although the graphical version of the lines plot is new in SAS 9.4M5, the underlying analysis has been available in SAS for decades. If you have an earlier version of SAS, the analysis is presented as a table rather than as a graph.

Warren's focus was on the plot itself, with an emphasis on how to create it. However, the plot is also interesting for the statistical information it provides. This article discusses how to interpret the lines plot in a multiple comparisons of means analysis.

The lines plot in SAS

You can use the LINES option in the LSMEANS statement to request a lines plot in SAS 9.4M5. The following data are taken from Multiple Comparisons and Multiple Tests (p. 42-53 of the First Edition). Researchers are studying the effectiveness of five weight-loss diets, denoted by A, B, C, D, and E. Ten male subjects are randomly assigned to each method. After a fixed length of time, the weight loss of each subject is recorded, as follows:

/* Data and programs from _Multiple Comparisons and Multiple Tests_
   Westfall, Tobias, Rom, Wolfinger, and Hochberg (1999, First Edition) */
data wloss;
do diet = 'A','B','C','D','E';
   do i = 1 to 10;  input WeightLoss @@;  output;  end;
12.4 10.7 11.9 11.0 12.4 12.3 13.0 12.5 11.2 13.1
 9.1 11.5 11.3  9.7 13.2 10.7 10.6 11.3 11.1 11.7
 8.5 11.6 10.2 10.9  9.0  9.6  9.9 11.3 10.5 11.2
 8.7  9.3  8.2  8.3  9.0  9.4  9.2 12.2  8.5  9.9
12.7 13.2 11.8 11.9 12.2 11.2 13.7 11.8 11.5 11.7

You can use PROC GLM to perform a balanced one-way analysis of variance and use the MEANS or LSMEANS statement to request pairwise comparisons of means among the five diet groups:

proc glm data=wloss;
   class diet;
   model WeightLoss = diet;
   *means diet / tukey LINES; 
   lsmeans diet / pdiff=all adjust=tukey LINES;

In general, I use the LSMEANS statement rather than the MEANS statement because LS-means are more versatile and handle unbalanced data. (More about this in a later section.) The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of weight loss for the different diets. The ADJUST=TUKEY option is a common way to adjust the widths of confidence intervals to accommodate the multiple comparisons. The analysis produces several graphs and tables, including the following lines plot.

Lines plot for multiple comparisons of means in SAS

How to interpret a lines plot

In the lines plot, the vertical lines visually connect groups whose LS-means are "statistically indistinguishable." Statistically speaking, two means are "statistically indistinguishable" when their pairwise difference is not statistically significant.

If you have k groups, there are k(k-1)/2 pairwise differences that you can examine. The lines plot attempts to summarize those comparisons by connecting groups whose means are statistically indistinguishable. Often there are fewer lines than pairwise comparisons, so the lines plot displays a summary of which groups have similar means.

In the previous analysis, there are five groups, so there are 10 pairwise comparisons of means. The lines plot summarizes the results by using three vertical lines. The leftmost line (blue) indicates that the means of the 'B' and 'C' groups are statistically indistinguishable (they are not significantly different). Similarly, the upper right vertical bar (red) indicates that the means of the pairs ('E','A'), ('E','B'), and ('A','B') are not significantly different from each other. Lastly, the lower right vertical bar (green) indicates that the means for groups 'C' and 'D' are not significantly different. Thus in total, the lines plot indicates that five pairs of means are not significantly different.

The remaining pairs of mean differences (for example, 'E' and 'D') are significantly different. By using only three vertical lines, the lines plot visually associates pairs of means that are essentially the same. Those pairs that are not connected by a vertical line are significantly different.

Advantages and limitations of the lines plot

Advantages of the lines plot include the following:

  • The groups are ordered according to the observed means of the groups.
  • The number of vertical lines is often much smaller than the number of pairwise comparisons between groups.

Notice that the groups in this example are the same size (10 subjects). When the group sizes are equal (the so-called "balanced ANOVA" case), the lines plot can always correctly represent the relationships between the group means. However, that is not always true for unbalanced data. Westfall et al. (1999, p. 69) provide an example in which using the LINES option on the MEANS statement produces a misleading plot.

The situation is less severe when you use the LSMEANS statement, but for unbalanced data it is sometimes impossible for the lines plot to accurately connect all groups that have insignificant mean differences. In those cases, SAS appends a footnote to the plot that alerts you to the situation and lists the additional significances not represented by the plot.

In my next blog post, I will show some alternative graphical displays that are appropriate for multiple comparisons of means for unbalanced groups.


In summary, the new lines plot in SAS/STAT software is a graphical version of an analysis that has been in SAS for decades. You can create the plot by using the LINES option in the LSMEANS statement. The lines plot indicates which groups have mean differences that are not significant. For balanced data (or nearly balanced), it does a good job of summarizes which differences of means are not significant. For highly unbalanced data, there are other graphs that you can use. Those graphs will be discussed in a future article.

The post Graphs for multiple comparisons of means: The lines plot appeared first on The DO Loop.


Create and interpret a weighted histogram

If you perform a weighted statistical analysis, it can be useful to produce a statistical graph that also incorporates the weights. This article shows how to construct and interpret a weighted histogram in SAS.

How to construct a weighted histogram

Before constructing a weighted histogram, let's review the construction of an unweighted histogram. A histogram requires that you specify a set of evenly spaced bins that cover the range of the data. An unweighted histogram of frequencies is constructed by counting the number of observations that are in each bin. Because counts are dependent on the sample size, n, histograms often display the proportion (or percentage) of values in each bin. The proportions are the counts divided by n. On the proportion scale, the height of each bin is the sum of the quantity 1/n, where the sum is taken over all observations in the bin.

That fact is important because it reveals that the unweighted histogram is a special case of the weighted histogram. An unweighted histogram is equivalent to a weighted histogram in which each observation receives a unit weight. Therefore the quantity 1/n is the standardized weight of each observation: the weight divided by the sum of the weights. The formula is the same for non-unit weights: the height of each bin is the sum of the quantity wi / Σ wi, where the sum is taken over all observations in the bin. That is, you add up all the standardized weights in each bin to produce the bin height.

An example of a weighted histogram

The SAS documentation for the WEIGHT statement includes the following example. Twenty subjects estimate the diameter of an object that is 30 cm across. Some people are placed closer to the object than others. The researcher believes that the precision of the estimate is inversely proportional to the distance from the object. Therefore the researcher weights each subject's estimate by using the inverse distance.

The following DATA step creates the data, and PROC SGPLOT creates a weighted histogram of the data by using the WEIGHT= option on the HISTOGRAM option. (The WEIGHT= option was added in SAS 9.4M1.)

data Size;
input Distance ObjectSize @@;
Wt = 1 / distance;        /* precision */
x = ObjectSize; 
label x = "Estimate of Size";
1.5 30   1.5 20   1.5 30   1.5 25
3   43   3   33   3   25   3   30
4.5 25   4.5 36   4.5 48   4.5 33
6   43   6   36   6   23   6   48
7.5 30   7.5 25   7.5 50   7.5 38
title "Weighted Histogram of Size Estimate";
proc sgplot data=size noautolegend;
   histogram x / WEIGHT=Wt scale=proportion datalabel binwidth=5;
   fringe x / lineattrs=(thickness=2 color=black) transparency=0.6;
   yaxis grid offsetmin=0.05 label="Weighted Proportion";
   refline 30 / axis=x lineattrs=(pattern=dash);
Weighted histogram in SAS; weights proportional to inverse variance

The weighted histogram is shown to the right. The data values are shown in the fringe plot beneath the histogram. The height of each bin is the sum of the weights of the observations in that bin. The dashed line represents the true diameter of the object. Most estimates are clustered around the true value, except for a small cluster of larger estimates. Notice that I use the SCALE=PROPORTION option to plot the weighted proportion of observations in each bin, although the default behavior (SCALE=PERCENT) would also be acceptable.

If you remove the WEIGHT= option and study the unweighted graph, you will see that the average estimate for the unweighted distribution (33.6) is not as close to the true diameter as the weighted estimate (30.1). Furthermore, the weighted standard deviation is about half the unweighted standard deviation, which shows that the weighted distribution of these data has less variance than the unweighted distribution.

By the way, although PROC UNIVARIATE can produce weighted statistics, it does not create weighted graphics as of SAS 9.4M5. One reason is that the graphics statements (CDFPLOT, HISTOGRAM, QQPLOT, etc) not only create graphs but also fit distributions and produce goodness-of-fit statistics, and those analyses do not support weight variables.

Checking the computation

Although a weighted histogram is not conceptually complex, I understand a computation better when I program it myself. You can write a SAS program that computes a weighted histogram by using the following algorithm:

  1. Construct the bins. For this example, there are eight bins of width 5, and the first bin starts at x=17.5. (It is centered at x=20.) Initialize all bin heights to zero.
  2. For each observation, find the bin that contains it. Increment the bin height by the weight of that observation.
  3. Standardize the heights by dividing by the sum of weights. You can skip this step if the weights sum to unity.

A SAS/IML implementation of this algorithm requires only a few lines of code. A DATA step implementation that uses arrays is longer, but probably looks more familiar to many SAS programmers:

data BinHeights(keep=height:);
array EndPt[8] _temporary_;
binStart = 17.5;  binWidth = 5;  /* anchor and width for bins */
do i = 1 to dim(EndPt);          /* define endpoints of bins */
   EndPt[i] = binStart + (i-1)*binWidth;
array height[7];                 /* height of each bin */
set Size end=eof;                /* for each observation ... */
sumWt + Wt;                      /* compute sum of weights */
do i = 1 to dim(EndPt)-1 while (^Found); /* find bin for each obs */
   Found = (EndPt[i] <= x < EndPt[i+1]);
   if Found then height[i] + Wt; /* increment bin height by weight */
if eof then do;            
   do i = 1 to dim(height);      /* scale heights by sum of weights */
      height[i] = height[i] / sumWt;
proc print noobs data=BinHeights; run;
Heights of bars in weighted histogram

The computations from the DATA step match the data labels that appear on the weighted histogram in PROC SGPLOT.


In SAS, the HISTOGRAM statement in PROC SGPLOT supports the WEIGHT= option, which enables you to create a weighted histogram. A weighted histogram shows the weighted distribution of the data. If the histogram displays proportions (rather than raw counts), then the heights of the bars are the sum of the standardized weights of the observations within each bin. You can download the SAS program that computes the quantities in this article.

How can you interpret a weighted histogram? That depends on the meaning of the weight variables. For survey data and sampling weights, the weighted histogram estimates the distribution of a quantity in the population. For inverse variance weights (such as were used in this article), the weighted histogram overweights precise measurements and underweights imprecise measurements. When the weights are correct, the weighted histogram is a better estimate of the density of the underlying population and the weighted statistics (mean, variance, quantiles,...) are better estimates of the corresponding population quantities.

Have you ever plotted a weighted histogram? What was the context? Leave a comment.

The post Create and interpret a weighted histogram appeared first on The DO Loop.


The path of zip codes

Toe bone connected to the foot bone,
Foot bone connected to the leg bone,
Leg bone connected to the knee bone,...
             — American Spiritual, "Dem Bones"

Last week I read an interesting article on Robert Kosara's data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the SasHelp.zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data and exclude certain American territories before graphing the locations of the US zip code in the contiguous US. (Click on a graph to enlarge it.):

proc sort data=sashelp.zipcode(where=(StateCode 
     NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI")  /* exclude territories */
     AND ZIP_Class = " "))                              /* exclude "special" zip codes */
     out=zipcode(keep=ZIP x y City StateCode);
by zip;
title "Path of US Zip Codes in Numerical Order";
proc sgplot data=zipcode noborder;
   where StateCode NOT IN ("AK", "HI");                 /* contiguous US */
   series x=X y=Y / group=StateCode;
   xaxis  display=none;
   yaxis  display=none;
Map formed by connecting the US ZIP codes in order

Note that the coordinates are "raw" longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static "zoomed-in" version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):

Commect zip codes in order for four western US states

This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.

You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.

data Zip1;
   length SCF $5.;
   set zipcode;
   FloorZip = floor(zip/100);         /* round down to nearest 100 */
   SCF = putn(FloorZip,"Z3.")||"xx";  /* Sectional Center Facility, eg, 841xx */
   keep x y zip StateCode City SCF;
proc freq data=Zip1;
   where StateCode='UT';
   tables SCF / nocum;
Counts of zip codes within each SCF code in Utah

If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zig-zag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.

Map that connects the Sectional Center Facility (SCF) codes in Utah

For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don't care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled '1' and the last SCF (349xx) is labeled '25'.

Map that connects the Sectional Center Facility (SCF) codes in Florida

Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large "jumps" across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix '59') down to Illinois (prefix '60'). Another big jump is from Nebraska in the Midwest (prefix '69') down to Louisiana (prefix '70') in the South-Central region.

Map showing US Sectional Center Facilitiey (SCF) codes connected in order

In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara's blog. Kosara's image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.

If you would like to explore these data yourself, you can download the SAS program that analyzes the zip code data.

The post The path of zip codes appeared first on The DO Loop.


3 ways to visualize prediction regions for classification problems

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

Prediction regions for the binary classification problem. Graph created in SAS.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)

This article shows three ways to visualize prediction regions in SAS:

  1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
  2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
  3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variable are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   store work.LogiModel;                /* save model to item store */
data Grid;                              /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
proc plm restore=work.LogiModel;        /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;  /* evaluate the model on new data */

The polygon method

Parameter estimates for  logistic model

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize the regions. The graph is shown at the top of this article. The drawbacks of this method are that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!

The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   effectplot contour(x=x1 y=x2);       /* 2. contour plot with scatter plot overlay */

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

The background grid method

Prediction region for a classification problem with two outcomes

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.


This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first model is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.

You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment?

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.


On the SMOOTHCONNECT option in the SERIES statement

By default, when you use the SERIES statement in PROC SGPLOT to create a line plot, the observations are connected (in order) by straight line segments. However, SAS 9.4m1 introduced the SMOOTHCONNECT option which, as the name implies, uses a smooth curve to connect the observations. In Sanjay Matange's blog, he shows an example where the SMOOTHCONNECT option interpolates points in a time series by using a smooth curve. In the example, the smooth curve does a good job of smoothly connecting the evenly spaced data points.

However, in a second post, Sanjay shows how the SMOOTHCONNECT option can result in curves that are less ideal. Here is a SAS program that reproduces the gist of Sanjay's example:

data Ex1;
input segment $ x y;
A 0  0 
A 1  0
A 2 -3
A 3 -1
B 0  0 
B 1  0
B 2  2
B 3  1
proc sgplot data=Ex1;
   series x=x y=y / group=segment markers smoothconnect;

As you can see, groups A and B have identical values for the first two observations. However, the remaining Y values for group A are negative, whereas the remaining Y values for group B are positive. Notice that the interpolating curve for group A rises up before it dives down. Similarly, the curve for group B dips down before it rises up.

This visualization could give the wrong impression to the viewer of the graph, especially if the markers are not displayed. We only have data for four values of X. However, the SMOOTHCONNECT option makes it appear that the graph for group A was higher than group B on the interval (0, 1). That may or may not be true. The curves also give the impression that the curve for group A began to decrease prior to x=1, whereas in fact we have no idea when the curve reached a maximum and began to decrease.

5 things to remember when using the SMOOTHCONNECT option in SGPLOT #DataViz
Click To Tweet

The SMOOTHCONNECT option can be useful for visualizing time series data, but as with any tool, it is important to know how to use it responsibly. In thinking about these issues, I compiled a list of five ways that the SMOOTHCONNECT option might give a misleading view of the data. By understanding these issues, you can use the SMOOTHCONNECT option wisely. Here are some potential pitfalls to connecting points by using a smooth curve.

  1. Before the curve goes down, it often goes up.
  2. The curve gives the impression that we know the location of peaks and valleys.
  3. The high and low points on the curve might exceed the range of the data.
  4. The curve can display quick (possibly unrealistic) changes in direction.
  5. The curve can bend backward, or even create a loop.

Sanjay's example demonstrates the first three issues. Notice that the maximum Y value for group A in the data is Y=0, but the interpolating spline exceeds that value.

To demonstrate the fourth and fifth items requires an artificial example and data that is not evenly spaced along the X variable:

data Ex2;
input x y;
0    0 
0.95 0.95
1    1
2   -2
proc sgplot data=Ex2;
   series x=x y=y / markers smoothconnect;
   yaxis max=1.5;

There are only four points in the data set, but the SMOOTHCONNECT curve for these data exhibits a sharp turn (in fact, a loop). I intentionally created data that would create this behavior by making the data linear for X < 2. I also made the second data point very close to the third data point so that the interpolating curve would have to gyrate widely to pass through the points before sloping down to pass through the fourth point.

Implications for data visualization

It is well known among applied mathematicians that interpolation can lead to issues like these. These issues are not caused by a bug in SAS, but by the fact that you are trying to force a smooth curve to pass through every data point. I point them out so that SAS users who are using SGPLOT can be aware of them.

If your data are evenly spaced in the horizontal direction, the first three issue are usually small and innocuous. The last two issue will not occur. Therefore, it is safe to use the SMOOTHCONNECT option for evenly spaced data.

If your data are not evenly spaced, here are a few guidelines to help you avoid wild interpolating curves:

In summary, if you use the SMOOTHCONNECT option in the SERIES statement, SAS will pass a smooth curve through the points. When the points are unevenly spaced in X, the curve might need to quickly change direction to pass through all the data. The resulting curve might not be a good representation of the data. In this case, you might need to relax the requirements of a smooth interpolating curve, either by using a piecewise linear interpolation or by using a smooth curve that is not constrained to pass through the data.

The post On the SMOOTHCONNECT option in the SERIES statement appeared first on The DO Loop.


Perceptions of probability

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance, a one-in-ten chance, or some other value?

Analysts often use vague statements like "probably" or "chances are slight" to convey their beliefs that a future event will or will not occur. Government officials and policy-makers who read reports from analysts must interpret and act on these vague statements. If the reader of a report interprets a phrase different from what the writer intended, that can lead to bad decisions.

Assigning probabilities to statements

Original box plot: Distribution of probabilities for word phrases

In the book Psychology of Intelligence Analysis (Heuer, 1999), the author presents "the results of an experiment with 23 NATO military officers accustomed to reading intelligence reports. They were given a number of sentences such as: "It is highly unlikely that ...." All the sentences were the same except that the verbal expressions of probability changed. The officers were asked what percentage probability they would attribute to each statement if they read it in an intelligence report."

The results are summarized in the adjacent dot plot from Heuer (Chapter 12), which summarizes how the officers assess the probability of various statements. The graph includes a gray box for some statements. The box is not a statistical box plot. Rather it indicates the probability range according to a nomenclature proposed by Kent (1964), who tried to get the intelligence community to agree that certain phrases would be associated with certain probability ranges.

For some statements (such as "better than even" and "almost no chance") there was general agreement among the officers. For others, there was large variability in the probability estimates. For example, many officers interpreted "probable" as approximately a 75% chance, but quite a few interpreted it as less than 50% chance.

A modern re-visualization

Zonination's box plot: Distribution of probabilities for word phrases

The results of this experiment are interesting on many levels, but I am going to focus on the visualization of the data. I do not have access to the original data, but this experiment was repeated in 2015 when the user "Zonination" got 46 users on Reddit (who were not military experts) to assign probabilities to the statements. His visualization of the resulting data won a 2015 Kantar Information is Beautiful Award. The visualization uses box plots to show the schematic distribution and overlays the 46 individual estimates by using a jittered, semi-transparent, scatter plot. The Zonination plot is shown at the right (click to enlarge). Notice that the "boxes" in this second graph are determined by quantiles of the data, whereas in the first graph they were theoretical ranges.

Creating the graph in SAS

I decided to remake Zonination's plot by using PROC SGPLOT in SAS. I made several modifications that improve the readability and clarity of the plot.

  • I sorted the categories by the median probability. The median is a robust estimate of the "consensus probability" for each statement. The sorted categories indicate the relative order of the statements in terms of perceived likelihood. For example, an "unlikely" event is generally perceived as more probable than an event that has "little chance." For details about sorting the variables in SAS, see my article about how to sort variables by a statistic.
  • I removed the colors. Zonination's rainbow-colored chart is aesthetically pleasing, but the colors do not add any new information about the data. However, the colors help the eye track horizontally across the graph, so I used alternating bands to visually differentiate adjacent categories. You can create color bands by using the COLORBANDS= option in the YAXIS statement.
  • To reduce overplotting of markers, I used systematic jittering instead of random jittering. In random jittering, each vertical position is randomly offset. In systematic (centered) jittering, the markers are arranged so that they are centered on the "spine" of the box plot. Vertical positions are changed only when the markers would otherwise overlap. You can use the JITTER option in the SCATTER statement to systematically jitter marker positions.
  • Zonination's plot displays some markers twice, which I find confusing. Outliers are displayed once by the box plot and a second time by the jittered scatter plot. In my version, I suppress the display of outliers by the box plot by using the NOOUTLIERS option in the HBOX statement.

You can download the SAS code that creates the data, sorts the variables by median, and creates the plot. The following call to PROC SGPLOT shows the HBOX and SCATTER statements that create the plot:

title "Perceptions of Probability";
proc sgplot data=Long noautolegend;
   hbox _Value_ / category=_Label_ nooutliers nomean nocaps;  
   scatter x=_Value_ y=_Label_ / jitter transparency=0.5
                     markerattrs=GraphData2(symbol=circlefilled size=4);
   yaxis reverse discreteorder=data labelpos=top labelattrs=(weight=bold)
                     colorbands=even colorbandsattrs=(color=gray transparency=0.9)
                     offsetmin=0.0294 offsetmax=0.0294; /* half of 1/k, where k=number of catgories */
   xaxis grid values=(0 to 100 by 10);
   label _Value_ = "Assigned Probability (%)" _label_="Statement";
SAS box plot: Distribution of probabilities for word phrases

The graph indicates that some responders either didn't understand the task or intentionally gave ridiculous answers. Of the 17 categories, nine contain extreme outliers, such as assigning certainty (100%) to the phrases "probably not," "we doubt," and "little chance." However, the extreme outliers do not affect the statistical conclusions about the distribution of probabilities because box plots (which use quartiles) are robust to outliers.

The SAS graph, which uses systematic jittering, reveals a fact about the data that was hidden in the graphs that used random jittering. Namely, most of the data values are multiples of 5%. Although a few people responded with values such as 88.7%, 1%, or 3%, most values (about 80%) are rounded to the nearest 5%. For the phrases "likely" and "we believe," 44 of 46 responses (96%) were a multiple of 5%. In contrast, the phrase "almost no chance" had only 18 of 46 responses (39%) were multiples of 5% because many responses were 1%, 2%, or 3%.

Like the military officers in the original study, there is considerable variation in the way that the Reddit users assign a probability to certain phrases. It is interesting that some phrases (for example, "We believe," "Likely," and "Probable") have the same median value but wildly different interquartile ranges. For clarity, speakers/writers should use phrases that have small variation or (even better!) provide their own assessment of probability.

Does something about this perception study surprise you? Do you have an opinion about the best way to visualize these data? Leave a comment.

The post Perceptions of probability appeared first on The DO Loop.


Visualize a design matrix

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how are they used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures. In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or encodings that result in different design matrices.

Heat maps of design matrices: GLM parameterization

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150
                keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol 
          Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
ods exclude none;
ods graphics / width=500px height=800px;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
Design matrix for the GLM parameterization in SAS

Click on the heat map to enlarge it. Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with k levels is represented by k columns. The black and white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

  • This first column is all black, which indicates the intercept column.
  • Columns 2-4 represent the BP variable. For each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal cholesterol.
  • In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because each it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal columns is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.

Heat maps of design matrices: Reference parameterization

There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
ods exclude none;
Design matrix for the REFERENCE parameterization in SAS

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but categorical variables that have k levels are replaced by k–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, but that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.

Nonbinary designs: The EFFECT parameterization

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus there are three values in the design matrix.

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable') 
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
ods exclude none;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design; read all var _NUM_ into X[c=varNames]; close;
run HeatmapDisc(X) title="Effect Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
Design matrix for the EFFECT parameterization in SAS

In the adjacent graph, blue indicates that the value for the patient was the reference category. White and black indicates that the value for the patient was a nonreference category, and the black rectangle appears in the column that indicates the value of the nonreference category. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels for a categorical variable.

Heat maps in Base SAS

In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value)*/
data Long;
set Design;
array dummy[*] _NUMERIC_;
do varNum = 1 to dim(dummy);
   rowNum = _N_;
   value = dummy[varNum];
keep varNum rowNum value;
proc sgplot data=Long;
/* the observation values are in the order {1, 0, -1}; use STYLEATTRIBS to set colors */
styleattrs  DATACOLORS=(Black White LightBlue);
heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
xaxis type=discrete; /* values=(1 to 11)  valuesdisplay=("A" "B" ... "J" "K"); */
yaxis reverse;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterizations schemes for classification variables.

The post Visualize a design matrix appeared first on The DO Loop.

Back to Top