18
Oct

The diffogram and other graphs for multiple comparisons of means

In a previous article, I discussed the lines plot for multiple comparisons of means. Another graph that is frequently used for multiple comparisons is the diffogram, which indicates whether the pairwise differences between means of groups are statistically significant. This article discusses how to interpret a diffogram. Two related plots are also discussed.

How to create a diffogram in SAS

The diffogram (also called a mean-mean scatter diagram) is automatically created when you use the PDIFF=ALL option on the LSMEANS statement in several SAS/STAT regression procedures such as PROC GLM and PROC GLMMIX, assuming that you have enabled ODS graphics. The documentation for PROC GLM contains an example that uses data about the chemical composition of shards of pottery from four archaeological sites in Great Britain. Researchers want to determine which sites have shards that are chemically similar. The data are contained in the pottery data set. The following call to PROC GLM performs an ANOVA of the calcium oxide in the pottery at the sites. The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of calcium oxide for the different sites. The ADJUST=TUKEY option is one way to adjust the confidence intervals for multiple comparisons.

ods graphics on;
proc glm data=pottery plots(only)=( diffplot(center)          /* diffogram */
                                    meanplot(cl ascending) ); /* plot of means and CIs */
   label Ca = "Calcium Oxide (%)";
   class Site;
   model Ca = Site;
   lsmeans Site / pdiff=all cl adjust=tukey;  /* all pairwise comparisons of means w/ adjusted CL */
   ods output LSMeanDiffCL=MeanDiff;          /* optional: save mean differences and CIs */
quit;

Two graphs are requested: the diffogram (or "diffplot") and a "mean plot" that shows the group means and 95% confidence intervals. The ODS OUTPUT statement creates a data set from a table that contains the mean differences between pairs of groups, along with 95% confidence intervals for the differences. You can use that information to construct a plot of the mean differences, as shown later in this article.

How to interpret a diffogram

Diffogram in SAS showing multiple comparisons of means

The diffogram, which is shown to the right (click to enlarge), is my favorite graph for multiple comparisons of means. Every diffogram displays a diagonal reference line that has unit slope. Horizontal and vertical reference lines are placed along the axes at the location of the means of the groups. For these data, there are four vertical and four horizontal reference lines. At the intersection of most reference lines there is a small colored line segment. These segments indicate which of the 4(4-1)/2 = 6 pairwise comparisons of means are significant.

Let's start at the top of the diffogram. The mean for the Caldecot site is about 0.3, as shown by a horizontal reference line near Y=0.3. On the reference line for Caldecot are three line segments that are centered at the mean values for the other groups, which are (from left to right) IslandThorns, AshleyRails, and Llanederyn. The first two line segments (in blue) do not intersect the dashed diagonal reference line, which indicates that the means for the (IslandThorns, Caldecot) and (AshleyRails, Caldecot) pairs are significantly different. The third line segment (in red) intersects the diagonal reference line, which indicates that the (Llanederyn, Caldecot) comparison is not significant.

Similarly, the mean for the Llanederyn site is about 0.2, as shown by a horizontal reference line near Y=0.2. On the reference line for Llanederyn are two line segments, which represent the (IslandThorns, Llanederyn) and (AshleyRails, Llanederyn) comparisons. Neither line segment intersects the diagonal reference line, which indicates that the means for those pairs are significantly different.

Lastly, the mean for the AshleyRails site is about 0.05, as shown by a horizontal reference line near Y=0.05. The line segment on that reference line represents the (IslandThorns, AshleyRails) comparison. It intersects the diagonal reference line, which indicates that those means are not significantly different.

The colors for the significant/insignificant pairs depends on the ODS style, but it is easy to remember the simple rule: if a line segment intersects the diagonal reference line, then the corresponding group means are not significantly different. A mnemonic is "Intersect? Insignificant!"

The mean plot with confidence interval: Use wisely!

Mean plot and 95% confidence intervals, created with SAS

The previous section shows how to use the diffogram to visually determine which pairs of means are significantly different. This section reminds you that you should not try to use the mean plot (shown at the right) for making those inferences.

I have seen presentations in which the speaker erroneously claims that "the means of these groups are significantly different because their 95% confidence intervals do not overlap." That is not a correct inference. In general, the overlap (or lack thereof) between two (1 – α)100% confidence intervals does not give sufficient information about whether the difference between the means is significant at the α level. In particular, you can construct examples where

  • Two confidence intervals overlap, but the difference of means is significant. (See Figure 2 in High (2014).)
  • Two confidence intervals do not overlap, but the difference of means is not significant.

The reason is twofold. First, the confidence intervals in the plot are constructed by using the sample sizes and standard deviations for each group, whereas tests for the difference between the means are constructed by using pooled standard deviations and sample sizes. Second, if you are making multiple comparisons, you need to adjust the widths of the intervals to accommodate the multiple (simultaneous) inferences.

"But Rick," you might say, "in the mean plot for these data, the Llanederyn and Caldecot confidence intervals overlap. The intervals for IslandThorns and AshleyRails also overlap. And these are exactly the two pairs that are not significantly different, as shown by the diffogram!" Yes, that is true for these data, but it is not true in general. Use the diffogram, not the means plot, to visualize multiple comparisons of means.

Plot the pairwise difference of means and confidence intervals

In addition to the diffogram, you can visualize comparisons of means by plotting the confidence intervals for the pairwise mean differences. Those intervals that contain 0 represent insignificant differences. Recall that the call to PROC GLM included an ODS output statement that created a data set (MeanDiff) that contains the mean differences. The following DATA step constructs labels for each pair and computes whether each pairwise difference is significant:

data Intervals;
length Pair $28.;
length S $16.;
set MeanDiff;
/* The next line is data dependent. For class variable 'C', concatenate C and _C variables */
Pair = catx(' - ', Site, _Site);           
Significant = (0 < LowerCL | UpperCL < 0); /* is 0 in interior of interval? */
S = ifc(Significant, "Significant", "Not significant");
run;
 
title "Pairwise Difference of LSMeans (Tukey Adjustment)";
title2 "95% Confidence Intervals of Mean Difference";
footnote J=L "Pairs Whose Intervals Contain 0 Are Not Significantly Different";
proc sgplot data=Intervals;
   scatter y=Pair x=Difference / group=S name="CIs"
           xerrorlower=LowerCL xerrorupper=UpperCL;
   refline 0 / axis=x;
   yaxis reverse colorbands=odd display=(nolabel) 
                 offsetmin=0.06 offsetmax=0.06;
   keylegend "CIs" / sortorder=ascending;
run;
Plot of mean differences and confidence intervals, created with SAS

The resulting graph is shown to the right. For each pair of groups, the graph shows an estimate for the difference of means and the Tukey-adjusted 95% confidence intervals for the difference. Intervals that contain 0 indicate that the difference of means is not significant. Intervals that do not contain 0 indicate significant differences.

Although the diffogram and the difference-of-means plot provide the same information, I prefer the diffogram because it shows the values for the means rather than for the mean differences. Furthermore, the height of the difference-of-means plot needs to be increased as more groups are added (there are k(k-1)/2 rows for k groups), whereas the diffogram can accommodate a moderate number of groups without being rescaled. On the other hand, it can be difficult to read the reference lines in the diffogram when there are many groups, especially if some of the groups have similar means.

For more about multiple comparisons tests and how to visualize the differences between means, see the following references:

The post The diffogram and other graphs for multiple comparisons of means appeared first on The DO Loop.

16
Oct

Graphs for multiple comparisons of means: The lines plot

Last week Warren Kuhfeld wrote about a graph called the "lines plot" that is produced by SAS/STAT procedures in SAS 9.4M5. (Notice that the "lines plot" has an 's'; it is not a line plot!) The lines plot is produced as part of an analysis that performs multiple comparisons of means. Although the graphical version of the lines plot is new in SAS 9.4M5, the underlying analysis has been available in SAS for decades. If you have an earlier version of SAS, the analysis is presented as a table rather than as a graph.

Warren's focus was on the plot itself, with an emphasis on how to create it. However, the plot is also interesting for the statistical information it provides. This article discusses how to interpret the lines plot in a multiple comparisons of means analysis.

The lines plot in SAS

You can use the LINES option in the LSMEANS statement to request a lines plot in SAS 9.4M5. The following data are taken from Multiple Comparisons and Multiple Tests (p. 42-53 of the First Edition). Researchers are studying the effectiveness of five weight-loss diets, denoted by A, B, C, D, and E. Ten male subjects are randomly assigned to each method. After a fixed length of time, the weight loss of each subject is recorded, as follows:

/* Data and programs from _Multiple Comparisons and Multiple Tests_
   Westfall, Tobias, Rom, Wolfinger, and Hochberg (1999, First Edition) */
data wloss;
do diet = 'A','B','C','D','E';
   do i = 1 to 10;  input WeightLoss @@;  output;  end;
end;
datalines;
datalines;
12.4 10.7 11.9 11.0 12.4 12.3 13.0 12.5 11.2 13.1
 9.1 11.5 11.3  9.7 13.2 10.7 10.6 11.3 11.1 11.7
 8.5 11.6 10.2 10.9  9.0  9.6  9.9 11.3 10.5 11.2
 8.7  9.3  8.2  8.3  9.0  9.4  9.2 12.2  8.5  9.9
12.7 13.2 11.8 11.9 12.2 11.2 13.7 11.8 11.5 11.7
;

You can use PROC GLM to perform a balanced one-way analysis of variance and use the MEANS or LSMEANS statement to request pairwise comparisons of means among the five diet groups:

proc glm data=wloss;
   class diet;
   model WeightLoss = diet;
   *means diet / tukey LINES; 
   lsmeans diet / pdiff=all adjust=tukey LINES;
quit;

In general, I use the LSMEANS statement rather than the MEANS statement because LS-means are more versatile and handle unbalanced data. (More about this in a later section.) The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of weight loss for the different diets. The ADJUST=TUKEY option is a common way to adjust the widths of confidence intervals to accommodate the multiple comparisons. The analysis produces several graphs and tables, including the following lines plot.

Lines plot for multiple comparisons of means in SAS

How to interpret a lines plot

In the lines plot, the vertical lines visually connect groups whose LS-means are "statistically indistinguishable." Statistically speaking, two means are "statistically indistinguishable" when their pairwise difference is not statistically significant.

If you have k groups, there are k(k-1)/2 pairwise differences that you can examine. The lines plot attempts to summarize those comparisons by connecting groups whose means are statistically indistinguishable. Often there are fewer lines than pairwise comparisons, so the lines plot displays a summary of which groups have similar means.

In the previous analysis, there are five groups, so there are 10 pairwise comparisons of means. The lines plot summarizes the results by using three vertical lines. The leftmost line (blue) indicates that the means of the 'B' and 'C' groups are statistically indistinguishable (they are not significantly different). Similarly, the upper right vertical bar (red) indicates that the means of the pairs ('E','A'), ('E','B'), and ('A','B') are not significantly different from each other. Lastly, the lower right vertical bar (green) indicates that the means for groups 'C' and 'D' are not significantly different. Thus in total, the lines plot indicates that five pairs of means are not significantly different.

The remaining pairs of mean differences (for example, 'E' and 'D') are significantly different. By using only three vertical lines, the lines plot visually associates pairs of means that are essentially the same. Those pairs that are not connected by a vertical line are significantly different.

Advantages and limitations of the lines plot

Advantages of the lines plot include the following:

  • The groups are ordered according to the observed means of the groups.
  • The number of vertical lines is often much smaller than the number of pairwise comparisons between groups.

Notice that the groups in this example are the same size (10 subjects). When the group sizes are equal (the so-called "balanced ANOVA" case), the lines plot can always correctly represent the relationships between the group means. However, that is not always true for unbalanced data. Westfall et al. (1999, p. 69) provide an example in which using the LINES option on the MEANS statement produces a misleading plot.

The situation is less severe when you use the LSMEANS statement, but for unbalanced data it is sometimes impossible for the lines plot to accurately connect all groups that have insignificant mean differences. In those cases, SAS appends a footnote to the plot that alerts you to the situation and lists the additional significances not represented by the plot.

In my next blog post, I will show some alternative graphical displays that are appropriate for multiple comparisons of means for unbalanced groups.

Summary

In summary, the new lines plot in SAS/STAT software is a graphical version of an analysis that has been in SAS for decades. You can create the plot by using the LINES option in the LSMEANS statement. The lines plot indicates which groups have mean differences that are not significant. For balanced data (or nearly balanced), it does a good job of summarizes which differences of means are not significant. For highly unbalanced data, there are other graphs that you can use. Those graphs will be discussed in a future article.

The post Graphs for multiple comparisons of means: The lines plot appeared first on The DO Loop.

4
Oct

Create and interpret a weighted histogram

If you perform a weighted statistical analysis, it can be useful to produce a statistical graph that also incorporates the weights. This article shows how to construct and interpret a weighted histogram in SAS.

How to construct a weighted histogram

Before constructing a weighted histogram, let's review the construction of an unweighted histogram. A histogram requires that you specify a set of evenly spaced bins that cover the range of the data. An unweighted histogram of frequencies is constructed by counting the number of observations that are in each bin. Because counts are dependent on the sample size, n, histograms often display the proportion (or percentage) of values in each bin. The proportions are the counts divided by n. On the proportion scale, the height of each bin is the sum of the quantity 1/n, where the sum is taken over all observations in the bin.

That fact is important because it reveals that the unweighted histogram is a special case of the weighted histogram. An unweighted histogram is equivalent to a weighted histogram in which each observation receives a unit weight. Therefore the quantity 1/n is the standardized weight of each observation: the weight divided by the sum of the weights. The formula is the same for non-unit weights: the height of each bin is the sum of the quantity wi / Σ wi, where the sum is taken over all observations in the bin. That is, you add up all the standardized weights in each bin to produce the bin height.

An example of a weighted histogram

The SAS documentation for the WEIGHT statement includes the following example. Twenty subjects estimate the diameter of an object that is 30 cm across. Some people are placed closer to the object than others. The researcher believes that the precision of the estimate is inversely proportional to the distance from the object. Therefore the researcher weights each subject's estimate by using the inverse distance.

The following DATA step creates the data, and PROC SGPLOT creates a weighted histogram of the data by using the WEIGHT= option on the HISTOGRAM option. (The WEIGHT= option was added in SAS 9.4M1.)

data Size;
input Distance ObjectSize @@;
Wt = 1 / distance;        /* precision */
x = ObjectSize; 
label x = "Estimate of Size";
datalines;
1.5 30   1.5 20   1.5 30   1.5 25
3   43   3   33   3   25   3   30
4.5 25   4.5 36   4.5 48   4.5 33
6   43   6   36   6   23   6   48
7.5 30   7.5 25   7.5 50   7.5 38
;
 
title "Weighted Histogram of Size Estimate";
proc sgplot data=size noautolegend;
   histogram x / WEIGHT=Wt scale=proportion datalabel binwidth=5;
   fringe x / lineattrs=(thickness=2 color=black) transparency=0.6;
   yaxis grid offsetmin=0.05 label="Weighted Proportion";
   refline 30 / axis=x lineattrs=(pattern=dash);
run;
Weighted histogram in SAS; weights proportional to inverse variance

The weighted histogram is shown to the right. The data values are shown in the fringe plot beneath the histogram. The height of each bin is the sum of the weights of the observations in that bin. The dashed line represents the true diameter of the object. Most estimates are clustered around the true value, except for a small cluster of larger estimates. Notice that I use the SCALE=PROPORTION option to plot the weighted proportion of observations in each bin, although the default behavior (SCALE=PERCENT) would also be acceptable.

If you remove the WEIGHT= option and study the unweighted graph, you will see that the average estimate for the unweighted distribution (33.6) is not as close to the true diameter as the weighted estimate (30.1). Furthermore, the weighted standard deviation is about half the unweighted standard deviation, which shows that the weighted distribution of these data has less variance than the unweighted distribution.

By the way, although PROC UNIVARIATE can produce weighted statistics, it does not create weighted graphics as of SAS 9.4M5. One reason is that the graphics statements (CDFPLOT, HISTOGRAM, QQPLOT, etc) not only create graphs but also fit distributions and produce goodness-of-fit statistics, and those analyses do not support weight variables.

Checking the computation

Although a weighted histogram is not conceptually complex, I understand a computation better when I program it myself. You can write a SAS program that computes a weighted histogram by using the following algorithm:

  1. Construct the bins. For this example, there are eight bins of width 5, and the first bin starts at x=17.5. (It is centered at x=20.) Initialize all bin heights to zero.
  2. For each observation, find the bin that contains it. Increment the bin height by the weight of that observation.
  3. Standardize the heights by dividing by the sum of weights. You can skip this step if the weights sum to unity.

A SAS/IML implementation of this algorithm requires only a few lines of code. A DATA step implementation that uses arrays is longer, but probably looks more familiar to many SAS programmers:

data BinHeights(keep=height:);
array EndPt[8] _temporary_;
binStart = 17.5;  binWidth = 5;  /* anchor and width for bins */
do i = 1 to dim(EndPt);          /* define endpoints of bins */
   EndPt[i] = binStart + (i-1)*binWidth;
end;
 
array height[7];                 /* height of each bin */
set Size end=eof;                /* for each observation ... */
sumWt + Wt;                      /* compute sum of weights */
Found=0;
do i = 1 to dim(EndPt)-1 while (^Found); /* find bin for each obs */
   Found = (EndPt[i] <= x < EndPt[i+1]);
   if Found then height[i] + Wt; /* increment bin height by weight */
end;
if eof then do;            
   do i = 1 to dim(height);      /* scale heights by sum of weights */
      height[i] = height[i] / sumWt;
   end;
   output;
end;
run;
 
proc print noobs data=BinHeights; run;
Heights of bars in weighted histogram

The computations from the DATA step match the data labels that appear on the weighted histogram in PROC SGPLOT.

Summary

In SAS, the HISTOGRAM statement in PROC SGPLOT supports the WEIGHT= option, which enables you to create a weighted histogram. A weighted histogram shows the weighted distribution of the data. If the histogram displays proportions (rather than raw counts), then the heights of the bars are the sum of the standardized weights of the observations within each bin. You can download the SAS program that computes the quantities in this article.

How can you interpret a weighted histogram? That depends on the meaning of the weight variables. For survey data and sampling weights, the weighted histogram estimates the distribution of a quantity in the population. For inverse variance weights (such as were used in this article), the weighted histogram overweights precise measurements and underweights imprecise measurements. When the weights are correct, the weighted histogram is a better estimate of the density of the underlying population and the weighted statistics (mean, variance, quantiles,...) are better estimates of the corresponding population quantities.

Have you ever plotted a weighted histogram? What was the context? Leave a comment.

The post Create and interpret a weighted histogram appeared first on The DO Loop.

18
Sep

The path of zip codes

Toe bone connected to the foot bone,
Foot bone connected to the leg bone,
Leg bone connected to the knee bone,...
             — American Spiritual, "Dem Bones"

Last week I read an interesting article on Robert Kosara's data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the SasHelp.zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data and exclude certain American territories before graphing the locations of the US zip code in the contiguous US. (Click on a graph to enlarge it.):

proc sort data=sashelp.zipcode(where=(StateCode 
     NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI")  /* exclude territories */
     AND ZIP_Class = " "))                              /* exclude "special" zip codes */
     out=zipcode(keep=ZIP x y City StateCode);
by zip;
run;
 
title "Path of US Zip Codes in Numerical Order";
proc sgplot data=zipcode noborder;
   where StateCode NOT IN ("AK", "HI");                 /* contiguous US */
   series x=X y=Y / group=StateCode;
   xaxis  display=none;
   yaxis  display=none;
run;
Map formed by connecting the US ZIP codes in order

Note that the coordinates are "raw" longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static "zoomed-in" version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):

Commect zip codes in order for four western US states

This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.

You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.

data Zip1;
   length SCF $5.;
   set zipcode;
   FloorZip = floor(zip/100);         /* round down to nearest 100 */
   SCF = putn(FloorZip,"Z3.")||"xx";  /* Sectional Center Facility, eg, 841xx */
   keep x y zip StateCode City SCF;
run;
proc freq data=Zip1;
   where StateCode='UT';
   tables SCF / nocum;
run;
Counts of zip codes within each SCF code in Utah

If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zig-zag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.

Map that connects the Sectional Center Facility (SCF) codes in Utah

For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don't care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled '1' and the last SCF (349xx) is labeled '25'.

Map that connects the Sectional Center Facility (SCF) codes in Florida

Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large "jumps" across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix '59') down to Illinois (prefix '60'). Another big jump is from Nebraska in the Midwest (prefix '69') down to Louisiana (prefix '70') in the South-Central region.

Map showing US Sectional Center Facilitiey (SCF) codes connected in order

In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara's blog. Kosara's image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.

If you would like to explore these data yourself, you can download the SAS program that analyzes the zip code data.

The post The path of zip codes appeared first on The DO Loop.

17
Jul

3 ways to visualize prediction regions for classification problems

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

Prediction regions for the binary classification problem. Graph created in SAS.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)

This article shows three ways to visualize prediction regions in SAS:

  1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
  2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
  3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variable are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   store work.LogiModel;                /* save model to item store */
run;
 
data Grid;                              /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
      output;
   end;
end;
run;
 
proc plm restore=work.LogiModel;        /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;  /* evaluate the model on new data */
run;

The polygon method

Parameter estimates for  logistic model

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize the regions. The graph is shown at the top of this article. The drawbacks of this method are that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!

The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   effectplot contour(x=x1 y=x2);       /* 2. contour plot with scatter plot overlay */
run;

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

The background grid method

Prediction region for a classification problem with two outcomes

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.

Summary

This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first model is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.

You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment?

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.

30
May

On the SMOOTHCONNECT option in the SERIES statement

By default, when you use the SERIES statement in PROC SGPLOT to create a line plot, the observations are connected (in order) by straight line segments. However, SAS 9.4m1 introduced the SMOOTHCONNECT option which, as the name implies, uses a smooth curve to connect the observations. In Sanjay Matange's blog, he shows an example where the SMOOTHCONNECT option interpolates points in a time series by using a smooth curve. In the example, the smooth curve does a good job of smoothly connecting the evenly spaced data points.

However, in a second post, Sanjay shows how the SMOOTHCONNECT option can result in curves that are less ideal. Here is a SAS program that reproduces the gist of Sanjay's example:

data Ex1;
input segment $ x y;
datalines;
A 0  0 
A 1  0
A 2 -3
A 3 -1
B 0  0 
B 1  0
B 2  2
B 3  1
;
 
proc sgplot data=Ex1;
   series x=x y=y / group=segment markers smoothconnect;
run;

As you can see, groups A and B have identical values for the first two observations. However, the remaining Y values for group A are negative, whereas the remaining Y values for group B are positive. Notice that the interpolating curve for group A rises up before it dives down. Similarly, the curve for group B dips down before it rises up.

This visualization could give the wrong impression to the viewer of the graph, especially if the markers are not displayed. We only have data for four values of X. However, the SMOOTHCONNECT option makes it appear that the graph for group A was higher than group B on the interval (0, 1). That may or may not be true. The curves also give the impression that the curve for group A began to decrease prior to x=1, whereas in fact we have no idea when the curve reached a maximum and began to decrease.


5 things to remember when using the SMOOTHCONNECT option in SGPLOT #DataViz
Click To Tweet


The SMOOTHCONNECT option can be useful for visualizing time series data, but as with any tool, it is important to know how to use it responsibly. In thinking about these issues, I compiled a list of five ways that the SMOOTHCONNECT option might give a misleading view of the data. By understanding these issues, you can use the SMOOTHCONNECT option wisely. Here are some potential pitfalls to connecting points by using a smooth curve.

  1. Before the curve goes down, it often goes up.
  2. The curve gives the impression that we know the location of peaks and valleys.
  3. The high and low points on the curve might exceed the range of the data.
  4. The curve can display quick (possibly unrealistic) changes in direction.
  5. The curve can bend backward, or even create a loop.

Sanjay's example demonstrates the first three issues. Notice that the maximum Y value for group A in the data is Y=0, but the interpolating spline exceeds that value.

To demonstrate the fourth and fifth items requires an artificial example and data that is not evenly spaced along the X variable:

data Ex2;
input x y;
datalines;
0    0 
0.95 0.95
1    1
2   -2
;
proc sgplot data=Ex2;
   series x=x y=y / markers smoothconnect;
   yaxis max=1.5;
run;

There are only four points in the data set, but the SMOOTHCONNECT curve for these data exhibits a sharp turn (in fact, a loop). I intentionally created data that would create this behavior by making the data linear for X < 2. I also made the second data point very close to the third data point so that the interpolating curve would have to gyrate widely to pass through the points before sloping down to pass through the fourth point.

Implications for data visualization

It is well known among applied mathematicians that interpolation can lead to issues like these. These issues are not caused by a bug in SAS, but by the fact that you are trying to force a smooth curve to pass through every data point. I point them out so that SAS users who are using SGPLOT can be aware of them.

If your data are evenly spaced in the horizontal direction, the first three issue are usually small and innocuous. The last two issue will not occur. Therefore, it is safe to use the SMOOTHCONNECT option for evenly spaced data.

If your data are not evenly spaced, here are a few guidelines to help you avoid wild interpolating curves:

In summary, if you use the SMOOTHCONNECT option in the SERIES statement, SAS will pass a smooth curve through the points. When the points are unevenly spaced in X, the curve might need to quickly change direction to pass through all the data. The resulting curve might not be a good representation of the data. In this case, you might need to relax the requirements of a smooth interpolating curve, either by using a piecewise linear interpolation or by using a smooth curve that is not constrained to pass through the data.

The post On the SMOOTHCONNECT option in the SERIES statement appeared first on The DO Loop.

3
May

Perceptions of probability

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance, a one-in-ten chance, or some other value?

Analysts often use vague statements like "probably" or "chances are slight" to convey their beliefs that a future event will or will not occur. Government officials and policy-makers who read reports from analysts must interpret and act on these vague statements. If the reader of a report interprets a phrase different from what the writer intended, that can lead to bad decisions.

Assigning probabilities to statements

Original box plot: Distribution of probabilities for word phrases

In the book Psychology of Intelligence Analysis (Heuer, 1999), the author presents "the results of an experiment with 23 NATO military officers accustomed to reading intelligence reports. They were given a number of sentences such as: "It is highly unlikely that ...." All the sentences were the same except that the verbal expressions of probability changed. The officers were asked what percentage probability they would attribute to each statement if they read it in an intelligence report."

The results are summarized in the adjacent dot plot from Heuer (Chapter 12), which summarizes how the officers assess the probability of various statements. The graph includes a gray box for some statements. The box is not a statistical box plot. Rather it indicates the probability range according to a nomenclature proposed by Kent (1964), who tried to get the intelligence community to agree that certain phrases would be associated with certain probability ranges.

For some statements (such as "better than even" and "almost no chance") there was general agreement among the officers. For others, there was large variability in the probability estimates. For example, many officers interpreted "probable" as approximately a 75% chance, but quite a few interpreted it as less than 50% chance.

A modern re-visualization

Zonination's box plot: Distribution of probabilities for word phrases

The results of this experiment are interesting on many levels, but I am going to focus on the visualization of the data. I do not have access to the original data, but this experiment was repeated in 2015 when the user "Zonination" got 46 users on Reddit (who were not military experts) to assign probabilities to the statements. His visualization of the resulting data won a 2015 Kantar Information is Beautiful Award. The visualization uses box plots to show the schematic distribution and overlays the 46 individual estimates by using a jittered, semi-transparent, scatter plot. The Zonination plot is shown at the right (click to enlarge). Notice that the "boxes" in this second graph are determined by quantiles of the data, whereas in the first graph they were theoretical ranges.

Creating the graph in SAS

I decided to remake Zonination's plot by using PROC SGPLOT in SAS. I made several modifications that improve the readability and clarity of the plot.

  • I sorted the categories by the median probability. The median is a robust estimate of the "consensus probability" for each statement. The sorted categories indicate the relative order of the statements in terms of perceived likelihood. For example, an "unlikely" event is generally perceived as more probable than an event that has "little chance." For details about sorting the variables in SAS, see my article about how to sort variables by a statistic.
  • I removed the colors. Zonination's rainbow-colored chart is aesthetically pleasing, but the colors do not add any new information about the data. However, the colors help the eye track horizontally across the graph, so I used alternating bands to visually differentiate adjacent categories. You can create color bands by using the COLORBANDS= option in the YAXIS statement.
  • To reduce overplotting of markers, I used systematic jittering instead of random jittering. In random jittering, each vertical position is randomly offset. In systematic (centered) jittering, the markers are arranged so that they are centered on the "spine" of the box plot. Vertical positions are changed only when the markers would otherwise overlap. You can use the JITTER option in the SCATTER statement to systematically jitter marker positions.
  • Zonination's plot displays some markers twice, which I find confusing. Outliers are displayed once by the box plot and a second time by the jittered scatter plot. In my version, I suppress the display of outliers by the box plot by using the NOOUTLIERS option in the HBOX statement.

You can download the SAS code that creates the data, sorts the variables by median, and creates the plot. The following call to PROC SGPLOT shows the HBOX and SCATTER statements that create the plot:

title "Perceptions of Probability";
proc sgplot data=Long noautolegend;
   hbox _Value_ / category=_Label_ nooutliers nomean nocaps;  
   scatter x=_Value_ y=_Label_ / jitter transparency=0.5
                     markerattrs=GraphData2(symbol=circlefilled size=4);
   yaxis reverse discreteorder=data labelpos=top labelattrs=(weight=bold)
                     colorbands=even colorbandsattrs=(color=gray transparency=0.9)
                     offsetmin=0.0294 offsetmax=0.0294; /* half of 1/k, where k=number of catgories */
   xaxis grid values=(0 to 100 by 10);
   label _Value_ = "Assigned Probability (%)" _label_="Statement";
run;
SAS box plot: Distribution of probabilities for word phrases

The graph indicates that some responders either didn't understand the task or intentionally gave ridiculous answers. Of the 17 categories, nine contain extreme outliers, such as assigning certainty (100%) to the phrases "probably not," "we doubt," and "little chance." However, the extreme outliers do not affect the statistical conclusions about the distribution of probabilities because box plots (which use quartiles) are robust to outliers.

The SAS graph, which uses systematic jittering, reveals a fact about the data that was hidden in the graphs that used random jittering. Namely, most of the data values are multiples of 5%. Although a few people responded with values such as 88.7%, 1%, or 3%, most values (about 80%) are rounded to the nearest 5%. For the phrases "likely" and "we believe," 44 of 46 responses (96%) were a multiple of 5%. In contrast, the phrase "almost no chance" had only 18 of 46 responses (39%) were multiples of 5% because many responses were 1%, 2%, or 3%.

Like the military officers in the original study, there is considerable variation in the way that the Reddit users assign a probability to certain phrases. It is interesting that some phrases (for example, "We believe," "Likely," and "Probable") have the same median value but wildly different interquartile ranges. For clarity, speakers/writers should use phrases that have small variation or (even better!) provide their own assessment of probability.

Does something about this perception study surprise you? Do you have an opinion about the best way to visualize these data? Leave a comment.

The post Perceptions of probability appeared first on The DO Loop.

26
Apr

Visualize a design matrix

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how are they used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures. In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or encodings that result in different design matrices.

Heat maps of design matrices: GLM parameterization

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150
                keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol 
          Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;
 
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
ods graphics / width=500px height=800px;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;
Design matrix for the GLM parameterization in SAS

Click on the heat map to enlarge it. Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with k levels is represented by k columns. The black and white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

  • This first column is all black, which indicates the intercept column.
  • Columns 2-4 represent the BP variable. For each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal cholesterol.
  • In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because each it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal columns is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.

Heat maps of design matrices: Reference parameterization

There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
Design matrix for the REFERENCE parameterization in SAS

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but categorical variables that have k levels are replaced by k–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, but that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.

Nonbinary designs: The EFFECT parameterization

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus there are three values in the design matrix.

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable') 
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design; read all var _NUM_ into X[c=varNames]; close;
run HeatmapDisc(X) title="Effect Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
QUIT;
Design matrix for the EFFECT parameterization in SAS

In the adjacent graph, blue indicates that the value for the patient was the reference category. White and black indicates that the value for the patient was a nonreference category, and the black rectangle appears in the column that indicates the value of the nonreference category. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels for a categorical variable.

Heat maps in Base SAS

In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value)*/
data Long;
set Design;
array dummy[*] _NUMERIC_;
do varNum = 1 to dim(dummy);
   rowNum = _N_;
   value = dummy[varNum];
   output;
end;
keep varNum rowNum value;
run;
 
proc sgplot data=Long;
/* the observation values are in the order {1, 0, -1}; use STYLEATTRIBS to set colors */
styleattrs  DATACOLORS=(Black White LightBlue);
heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
xaxis type=discrete; /* values=(1 to 11)  valuesdisplay=("A" "B" ... "J" "K"); */
yaxis reverse;
run;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterizations schemes for classification variables.

The post Visualize a design matrix appeared first on The DO Loop.

24
Apr

Visualize an ANOVA with two-way interactions

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

17
Apr

Visualize the 68-95-97.5 rule in SAS

Illustration of the 68-95-99.7 rule

A reader commented on last week's article about constructing symmetric intervals. He wanted to know if I created it in SAS.

Yes, the graph, which illustrates the so-called 68-95-99.7 rule for the normal distribution, was created by using several statements in the SGPLOT procedure in Base SAS

  • The SERIES statement creates the bell-shaped curve.
  • The BAND statement creates the shaded region under the curve.
  • The DROPLINE statement creates the vertical lines from the curve to the X axis.
  • The HIGHLOW statement creates the horizontal lines that indicate interval widths.
  • The TEXT statement creates the text labels "68%", "95%", and "99.7%".

If you follow some simple rules, it is easy to use PROC SGPLOT the overlay multiple curves and lines on a graph. The key is to organize the underlying data into a block form, as shown conceptually in the plot to the right. I use different variable names for each component of the plot, and I set the values of the other variables to missing when they are no longer relevant. Each statement uses only the variables that are relevant for that overlay.

The following DATA step creates the data for the 68-95-99.7 graph. Can you match up each section with the SGPLOT statements that overlay various components to create the final image?

data NormalPDF;
mu = 50; sigma = 8;     /* parameters for the normal distribution N(mu, sigma) */
/* 1. Data for the SERIES and BAND statements */
do m = -4 to 4 by 0.05;                /* x in [mu-4*sigma, mu+4*sigma] */
   x = mu + m*sigma;
   f = pdf("Normal", x, mu, sigma);    /* height of normal curve */
   output;
end;
x=.; f=.;
 
/* 2. Data for vertical lines at mu + m *sigma, m=-3, -2, -1, 1, 2, 3 */
do m =-3 to 3;
   if m=0 then continue;               /* skip m=0 */
   Lx = mu + m*sigma;                  /* horiz location of segment */
   Lf = pdf("Normal", Lx, mu, sigma);  /* vertical height of segment */
   output;
end;
LX = .; Lf = .;
 
/* 3. Data for horizontal lines. Heights are 1.1, 1.2, and  1.3 times the max height of the curve */
Tx = mu;                               /* text centered at mu */
fMax = pdf("Normal", mu, mu, sigma);   /* highest point of curve */
Text = "68%  ";                        /* 68% interval */
TL = mu - sigma;  TR = mu + sigma;     /* Left/Right endpoints of interval */
Ty = 1.1 * fMax;                       /* height of label and segment */
output;
 
Text = "95%  ";                        /* 95% interval */
TL = mu - 2*sigma;  TR = mu + 2*sigma; /* Left/Right endpoints of interval */
Ty = 1.2 * fMax;                       /* height of label and segment */
output;
 
Text = "99.7%";                        /* 99.7% interval */
TL = mu - 3*sigma;  TR = mu + 3*sigma; /* Left/Right endpoints of interval */
Ty = 1.3 * fMax;                       /* height of label and segment */
output;
 
keep x f Lx Lf Tx Ty TL TR Text;
run;
 
proc sgplot data=NormalPDF noautolegend;
   band     x=x upper=f lower=0;
   series   x=x y=f             / lineattrs=(color=black);
   dropline x=Lx y=Lf           / dropto=x lineattrs=(color=black);
   highlow  y=Ty low=TL high=TR / lowcap=serif highcap=serif lineattrs=(thickness=2);
   text     x=Tx y=Ty text=Text / backfill fillattrs=(color=white) textattrs=(size=14); 
   yaxis offsetmin=0 min=0 label="Density";
   xaxis values=(20 to 80 by 10) display=(nolabel);
run;

The post Visualize the 68-95-97.5 rule in SAS appeared first on The DO Loop.

Back to Top