17
Jul

3 ways to visualize prediction regions for classification problems

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

Prediction regions for the binary classification problem. Graph created in SAS.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)

This article shows three ways to visualize prediction regions in SAS:

  1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
  2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
  3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variable are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   store work.LogiModel;                /* save model to item store */
run;
 
data Grid;                              /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
      output;
   end;
end;
run;
 
proc plm restore=work.LogiModel;        /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;  /* evaluate the model on new data */
run;

The polygon method

Parameter estimates for  logistic model

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize the regions. The graph is shown at the top of this article. The drawbacks of this method are that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!

The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   effectplot contour(x=x1 y=x2);       /* 2. contour plot with scatter plot overlay */
run;

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

The background grid method

Prediction region for a classification problem with two outcomes

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.

Summary

This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first model is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.

You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment?

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.

30
May

On the SMOOTHCONNECT option in the SERIES statement

By default, when you use the SERIES statement in PROC SGPLOT to create a line plot, the observations are connected (in order) by straight line segments. However, SAS 9.4m1 introduced the SMOOTHCONNECT option which, as the name implies, uses a smooth curve to connect the observations. In Sanjay Matange's blog, he shows an example where the SMOOTHCONNECT option interpolates points in a time series by using a smooth curve. In the example, the smooth curve does a good job of smoothly connecting the evenly spaced data points.

However, in a second post, Sanjay shows how the SMOOTHCONNECT option can result in curves that are less ideal. Here is a SAS program that reproduces the gist of Sanjay's example:

data Ex1;
input segment $ x y;
datalines;
A 0  0 
A 1  0
A 2 -3
A 3 -1
B 0  0 
B 1  0
B 2  2
B 3  1
;
 
proc sgplot data=Ex1;
   series x=x y=y / group=segment markers smoothconnect;
run;

As you can see, groups A and B have identical values for the first two observations. However, the remaining Y values for group A are negative, whereas the remaining Y values for group B are positive. Notice that the interpolating curve for group A rises up before it dives down. Similarly, the curve for group B dips down before it rises up.

This visualization could give the wrong impression to the viewer of the graph, especially if the markers are not displayed. We only have data for four values of X. However, the SMOOTHCONNECT option makes it appear that the graph for group A was higher than group B on the interval (0, 1). That may or may not be true. The curves also give the impression that the curve for group A began to decrease prior to x=1, whereas in fact we have no idea when the curve reached a maximum and began to decrease.


5 things to remember when using the SMOOTHCONNECT option in SGPLOT #DataViz
Click To Tweet


The SMOOTHCONNECT option can be useful for visualizing time series data, but as with any tool, it is important to know how to use it responsibly. In thinking about these issues, I compiled a list of five ways that the SMOOTHCONNECT option might give a misleading view of the data. By understanding these issues, you can use the SMOOTHCONNECT option wisely. Here are some potential pitfalls to connecting points by using a smooth curve.

  1. Before the curve goes down, it often goes up.
  2. The curve gives the impression that we know the location of peaks and valleys.
  3. The high and low points on the curve might exceed the range of the data.
  4. The curve can display quick (possibly unrealistic) changes in direction.
  5. The curve can bend backward, or even create a loop.

Sanjay's example demonstrates the first three issues. Notice that the maximum Y value for group A in the data is Y=0, but the interpolating spline exceeds that value.

To demonstrate the fourth and fifth items requires an artificial example and data that is not evenly spaced along the X variable:

data Ex2;
input x y;
datalines;
0    0 
0.95 0.95
1    1
2   -2
;
proc sgplot data=Ex2;
   series x=x y=y / markers smoothconnect;
   yaxis max=1.5;
run;

There are only four points in the data set, but the SMOOTHCONNECT curve for these data exhibits a sharp turn (in fact, a loop). I intentionally created data that would create this behavior by making the data linear for X < 2. I also made the second data point very close to the third data point so that the interpolating curve would have to gyrate widely to pass through the points before sloping down to pass through the fourth point.

Implications for data visualization

It is well known among applied mathematicians that interpolation can lead to issues like these. These issues are not caused by a bug in SAS, but by the fact that you are trying to force a smooth curve to pass through every data point. I point them out so that SAS users who are using SGPLOT can be aware of them.

If your data are evenly spaced in the horizontal direction, the first three issue are usually small and innocuous. The last two issue will not occur. Therefore, it is safe to use the SMOOTHCONNECT option for evenly spaced data.

If your data are not evenly spaced, here are a few guidelines to help you avoid wild interpolating curves:

In summary, if you use the SMOOTHCONNECT option in the SERIES statement, SAS will pass a smooth curve through the points. When the points are unevenly spaced in X, the curve might need to quickly change direction to pass through all the data. The resulting curve might not be a good representation of the data. In this case, you might need to relax the requirements of a smooth interpolating curve, either by using a piecewise linear interpolation or by using a smooth curve that is not constrained to pass through the data.

The post On the SMOOTHCONNECT option in the SERIES statement appeared first on The DO Loop.

3
May

Perceptions of probability

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance, a one-in-ten chance, or some other value?

Analysts often use vague statements like "probably" or "chances are slight" to convey their beliefs that a future event will or will not occur. Government officials and policy-makers who read reports from analysts must interpret and act on these vague statements. If the reader of a report interprets a phrase different from what the writer intended, that can lead to bad decisions.

Assigning probabilities to statements

Original box plot: Distribution of probabilities for word phrases

In the book Psychology of Intelligence Analysis (Heuer, 1999), the author presents "the results of an experiment with 23 NATO military officers accustomed to reading intelligence reports. They were given a number of sentences such as: "It is highly unlikely that ...." All the sentences were the same except that the verbal expressions of probability changed. The officers were asked what percentage probability they would attribute to each statement if they read it in an intelligence report."

The results are summarized in the adjacent dot plot from Heuer (Chapter 12), which summarizes how the officers assess the probability of various statements. The graph includes a gray box for some statements. The box is not a statistical box plot. Rather it indicates the probability range according to a nomenclature proposed by Kent (1964), who tried to get the intelligence community to agree that certain phrases would be associated with certain probability ranges.

For some statements (such as "better than even" and "almost no chance") there was general agreement among the officers. For others, there was large variability in the probability estimates. For example, many officers interpreted "probable" as approximately a 75% chance, but quite a few interpreted it as less than 50% chance.

A modern re-visualization

Zonination's box plot: Distribution of probabilities for word phrases

The results of this experiment are interesting on many levels, but I am going to focus on the visualization of the data. I do not have access to the original data, but this experiment was repeated in 2015 when the user "Zonination" got 46 users on Reddit (who were not military experts) to assign probabilities to the statements. His visualization of the resulting data won a 2015 Kantar Information is Beautiful Award. The visualization uses box plots to show the schematic distribution and overlays the 46 individual estimates by using a jittered, semi-transparent, scatter plot. The Zonination plot is shown at the right (click to enlarge). Notice that the "boxes" in this second graph are determined by quantiles of the data, whereas in the first graph they were theoretical ranges.

Creating the graph in SAS

I decided to remake Zonination's plot by using PROC SGPLOT in SAS. I made several modifications that improve the readability and clarity of the plot.

  • I sorted the categories by the median probability. The median is a robust estimate of the "consensus probability" for each statement. The sorted categories indicate the relative order of the statements in terms of perceived likelihood. For example, an "unlikely" event is generally perceived as more probable than an event that has "little chance." For details about sorting the variables in SAS, see my article about how to sort variables by a statistic.
  • I removed the colors. Zonination's rainbow-colored chart is aesthetically pleasing, but the colors do not add any new information about the data. However, the colors help the eye track horizontally across the graph, so I used alternating bands to visually differentiate adjacent categories. You can create color bands by using the COLORBANDS= option in the YAXIS statement.
  • To reduce overplotting of markers, I used systematic jittering instead of random jittering. In random jittering, each vertical position is randomly offset. In systematic (centered) jittering, the markers are arranged so that they are centered on the "spine" of the box plot. Vertical positions are changed only when the markers would otherwise overlap. You can use the JITTER option in the SCATTER statement to systematically jitter marker positions.
  • Zonination's plot displays some markers twice, which I find confusing. Outliers are displayed once by the box plot and a second time by the jittered scatter plot. In my version, I suppress the display of outliers by the box plot by using the NOOUTLIERS option in the HBOX statement.

You can download the SAS code that creates the data, sorts the variables by median, and creates the plot. The following call to PROC SGPLOT shows the HBOX and SCATTER statements that create the plot:

title "Perceptions of Probability";
proc sgplot data=Long noautolegend;
   hbox _Value_ / category=_Label_ nooutliers nomean nocaps;  
   scatter x=_Value_ y=_Label_ / jitter transparency=0.5
                     markerattrs=GraphData2(symbol=circlefilled size=4);
   yaxis reverse discreteorder=data labelpos=top labelattrs=(weight=bold)
                     colorbands=even colorbandsattrs=(color=gray transparency=0.9)
                     offsetmin=0.0294 offsetmax=0.0294; /* half of 1/k, where k=number of catgories */
   xaxis grid values=(0 to 100 by 10);
   label _Value_ = "Assigned Probability (%)" _label_="Statement";
run;
SAS box plot: Distribution of probabilities for word phrases

The graph indicates that some responders either didn't understand the task or intentionally gave ridiculous answers. Of the 17 categories, nine contain extreme outliers, such as assigning certainty (100%) to the phrases "probably not," "we doubt," and "little chance." However, the extreme outliers do not affect the statistical conclusions about the distribution of probabilities because box plots (which use quartiles) are robust to outliers.

The SAS graph, which uses systematic jittering, reveals a fact about the data that was hidden in the graphs that used random jittering. Namely, most of the data values are multiples of 5%. Although a few people responded with values such as 88.7%, 1%, or 3%, most values (about 80%) are rounded to the nearest 5%. For the phrases "likely" and "we believe," 44 of 46 responses (96%) were a multiple of 5%. In contrast, the phrase "almost no chance" had only 18 of 46 responses (39%) were multiples of 5% because many responses were 1%, 2%, or 3%.

Like the military officers in the original study, there is considerable variation in the way that the Reddit users assign a probability to certain phrases. It is interesting that some phrases (for example, "We believe," "Likely," and "Probable") have the same median value but wildly different interquartile ranges. For clarity, speakers/writers should use phrases that have small variation or (even better!) provide their own assessment of probability.

Does something about this perception study surprise you? Do you have an opinion about the best way to visualize these data? Leave a comment.

The post Perceptions of probability appeared first on The DO Loop.

26
Apr

Visualize a design matrix

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how are they used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures. In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or encodings that result in different design matrices.

Heat maps of design matrices: GLM parameterization

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150
                keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol 
          Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;
 
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
ods graphics / width=500px height=800px;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;
Design matrix for the GLM parameterization in SAS

Click on the heat map to enlarge it. Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with k levels is represented by k columns. The black and white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

  • This first column is all black, which indicates the intercept column.
  • Columns 2-4 represent the BP variable. For each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal cholesterol.
  • In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because each it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal columns is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.

Heat maps of design matrices: Reference parameterization

There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
Design matrix for the REFERENCE parameterization in SAS

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but categorical variables that have k levels are replaced by k–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, but that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.

Nonbinary designs: The EFFECT parameterization

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus there are three values in the design matrix.

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable') 
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
 
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design; read all var _NUM_ into X[c=varNames]; close;
run HeatmapDisc(X) title="Effect Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
QUIT;
Design matrix for the EFFECT parameterization in SAS

In the adjacent graph, blue indicates that the value for the patient was the reference category. White and black indicates that the value for the patient was a nonreference category, and the black rectangle appears in the column that indicates the value of the nonreference category. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels for a categorical variable.

Heat maps in Base SAS

In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value)*/
data Long;
set Design;
array dummy[*] _NUMERIC_;
do varNum = 1 to dim(dummy);
   rowNum = _N_;
   value = dummy[varNum];
   output;
end;
keep varNum rowNum value;
run;
 
proc sgplot data=Long;
/* the observation values are in the order {1, 0, -1}; use STYLEATTRIBS to set colors */
styleattrs  DATACOLORS=(Black White LightBlue);
heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
xaxis type=discrete; /* values=(1 to 11)  valuesdisplay=("A" "B" ... "J" "K"); */
yaxis reverse;
run;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterizations schemes for classification variables.

The post Visualize a design matrix appeared first on The DO Loop.

24
Apr

Visualize an ANOVA with two-way interactions

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.

An ANOVA for two-way interactions

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
input  T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
Box plots for a binary 'category' variable and a binary 'group' variable

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

Method 1: Use the EFFECTPLOT statement

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction /  obs(jitter);  /* or use interaction plot to see raw data */
run;
Box plots of joint levels created by the EFFECTPLOT statement in SAS

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

Method 2: Concatenate values to form joint levels of categories

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;
 
proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;
 
proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
Distribution of response variable in two-way ANOVA: box plots and raw data overlaid

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

17
Apr

Visualize the 68-95-97.5 rule in SAS

Illustration of the 68-95-99.7 rule

A reader commented on last week's article about constructing symmetric intervals. He wanted to know if I created it in SAS.

Yes, the graph, which illustrates the so-called 68-95-99.7 rule for the normal distribution, was created by using several statements in the SGPLOT procedure in Base SAS

  • The SERIES statement creates the bell-shaped curve.
  • The BAND statement creates the shaded region under the curve.
  • The DROPLINE statement creates the vertical lines from the curve to the X axis.
  • The HIGHLOW statement creates the horizontal lines that indicate interval widths.
  • The TEXT statement creates the text labels "68%", "95%", and "99.7%".

If you follow some simple rules, it is easy to use PROC SGPLOT the overlay multiple curves and lines on a graph. The key is to organize the underlying data into a block form, as shown conceptually in the plot to the right. I use different variable names for each component of the plot, and I set the values of the other variables to missing when they are no longer relevant. Each statement uses only the variables that are relevant for that overlay.

The following DATA step creates the data for the 68-95-99.7 graph. Can you match up each section with the SGPLOT statements that overlay various components to create the final image?

data NormalPDF;
mu = 50; sigma = 8;     /* parameters for the normal distribution N(mu, sigma) */
/* 1. Data for the SERIES and BAND statements */
do m = -4 to 4 by 0.05;                /* x in [mu-4*sigma, mu+4*sigma] */
   x = mu + m*sigma;
   f = pdf("Normal", x, mu, sigma);    /* height of normal curve */
   output;
end;
x=.; f=.;
 
/* 2. Data for vertical lines at mu + m *sigma, m=-3, -2, -1, 1, 2, 3 */
do m =-3 to 3;
   if m=0 then continue;               /* skip m=0 */
   Lx = mu + m*sigma;                  /* horiz location of segment */
   Lf = pdf("Normal", Lx, mu, sigma);  /* vertical height of segment */
   output;
end;
LX = .; Lf = .;
 
/* 3. Data for horizontal lines. Heights are 1.1, 1.2, and  1.3 times the max height of the curve */
Tx = mu;                               /* text centered at mu */
fMax = pdf("Normal", mu, mu, sigma);   /* highest point of curve */
Text = "68%  ";                        /* 68% interval */
TL = mu - sigma;  TR = mu + sigma;     /* Left/Right endpoints of interval */
Ty = 1.1 * fMax;                       /* height of label and segment */
output;
 
Text = "95%  ";                        /* 95% interval */
TL = mu - 2*sigma;  TR = mu + 2*sigma; /* Left/Right endpoints of interval */
Ty = 1.2 * fMax;                       /* height of label and segment */
output;
 
Text = "99.7%";                        /* 99.7% interval */
TL = mu - 3*sigma;  TR = mu + 3*sigma; /* Left/Right endpoints of interval */
Ty = 1.3 * fMax;                       /* height of label and segment */
output;
 
keep x f Lx Lf Tx Ty TL TR Text;
run;
 
proc sgplot data=NormalPDF noautolegend;
   band     x=x upper=f lower=0;
   series   x=x y=f             / lineattrs=(color=black);
   dropline x=Lx y=Lf           / dropto=x lineattrs=(color=black);
   highlow  y=Ty low=TL high=TR / lowcap=serif highcap=serif lineattrs=(thickness=2);
   text     x=Tx y=Ty text=Text / backfill fillattrs=(color=white) textattrs=(size=14); 
   yaxis offsetmin=0 min=0 label="Density";
   xaxis values=(20 to 80 by 10) display=(nolabel);
run;

The post Visualize the 68-95-97.5 rule in SAS appeared first on The DO Loop.

6
Feb

What colors does PROC SGPLOT use for markers?

Suppose you create a scatter plot in SAS with PROC SGPLOT. What color does PROC SGPLOT use for the markers? If you specify the GROUP= option so that markers are colored by a grouping variable, what colors are used to represent the various groups? The following scatter plot shows the colors that are used by default for the HTMLBlue style. They are shades of blue, red, green, brown, and magenta.

data A;        /* example data with groups 1, 2, ..., 5 */
do Color = 1 to 5;
   x = Color; y = Color;  output;
end;
run;
 
title "Marker Colors Used for GROUP= Option";
title2 "HTMLBlue Style";
proc sgplot data=A;
xaxis grid; yaxis grid;
scatter x=x y=y / group=Color markerattrs=(size=24 symbol=SquareFilled);
run;
stylescolors1

Notice that these marker colors are not fully saturated colors, so they are not the SAS color names RED, BLUE, GREEN, BROWN, and MAGENTA. So what colors are these? What are their RGB values?


What colors does PROC SGPLOT uses for groups? #SASTip
Click To Tweet


Colors come from styles

Colors are defined by styles, and you can use ODS style elements to set marker colors. A style defines elements called GraphDataDefault, GraphData1, GraphData2, GraphData3, and so forth. Each element contains several attributes such as colors and line patterns. The complete list of style elements and attributes for ODS graphics is in the documentation, but for this article, the important fact is that the GraphDatan:ContrastColor attribute determines the marker color for the nth group. These are the colors in the previous scatter plot.

Display an ODS style template

Styles are defined by ODS templates. You can use the SOURCE statement in PROC TEMPLATE to display a template. In the style template, the GraphDatan:ContrastColor attributes are set by using keywords named gcdata1, gcdata2, gcdata3, etc.

If you display the template for the styles.HTMLBlue template, you will see that the HTMLBlue style inherits from the Statistical style, and it is the Statistical style that defines the contrast colors. The following statements display the contents of the Statistical template to the SAS log:

proc template;
source styles.statistical;
quit;

The template is long, and it is hard to scroll through the log to discover which colors are associated with each attribute. But that's no problem: you can use SAS to find and display only the information about contrast colors.

Display the marker colors as RGB and hexadecimal values

For years Warren Kuhfeld has been showing SAS customers how to view, edit, and use ODS templates to customize the graphs that are produced by SAS statistical procedures. A powerful technique that he uses is to write a template to a file and then use the DATA step to modify the template.

I will not modify the template but merely display information from it. The following DATA step writes the template to a text file and then uses the DATA step to find all instances of the keyword 'gcdata' in the template. For lines that contain the string 'gcdata', the program extracts the color for each keyword. The keyword-value pairs are saved to a data set, which is sorted and displayed:

libname temp "C:/temp";
proc template;
source styles.statistical / file='temp.tmp'; /* write template to text file */
quit;
 
data Colors;
keep Num Name Color R G B;
length Name Color $8;
infile 'temp.tmp';                    /* read from text file */
input;
/* example string:  'gcdata1' = cx445694 */
k = find(_infile_,'gcdata','i');      /* if k=0 then string not found */
if k > 0 then do;                     /* Found line that contains 'gcdata' */
   s = substr(_infile_, k);           /* substring from 'gcdata' to end of line */
   j = index(s, "'");                 /* index of closing quote  */
   Name = substr(s, 1, j-1);          /* keyword                 */
   if j = 7 then Num = 0;             /* string is 'gcdata'      */
   else                               /* extract number 1, 2, ... for strings */
      Num = inputn(substr(s, 7, j-7), "best2.");  /* gcdata1, gcdata2,...     */
   j = index(s, "=");                 /* index of equal sign     */
   Color = compress(substr(s, j+1));  /* color value for keyword */
   R = inputn(substr(Color, 3, 2), "HEX2.");   /* convert hex to RGB */
   G = inputn(substr(Color, 5, 2), "HEX2.");
   B = inputn(substr(Color, 7, 2), "HEX2.");
end;
if k > 0;
run;
 
proc sort data=Colors; by Num; run;
 
proc print data=Colors; 
var Name Color R G B;
run;
stylescolors2

Success! The output shows the contrast colors for the HTMLBlue style. The 'gcdata' color is the fill color (a dark blue) for markers when no GROUP= option is specified. The 'gcdatan' colors are used for markers that are colored by group membership. Obviously you could use this same technique to display other style attributes, such as line patterns or bar colors ('gdata').

If you prefer a visual summary of the attributes for an ODS style, see section "ODS Style Comparisons" in the SAS/STAT documentation. That section is part of the chapter "Statistical Graphics Using ODS," which could have been titled "Everything you always wanted to know about ODS graphics but were afraid to ask."

An application of setting marker colors

I prefer to style elements and discrete attribute maps to set colors for markers. But if you are rushed for time, you might want to use the STYLEATTRS statement to set the colors that are used for the GROUP= option. The STYLEATTRS statement requires a color list of hexadecimal colors or SAS color names. The following call to PROC SGPLOT uses the RGB/hex values for GraphData1:ContrastColor and so forth:

/* use colors for HTMLBlue style */
%let gcdata1 = cx445694;        
%let gcdata2 = cxA23A2E;
%let gcdata3 = cx01665E;
title "Origin in {Europe, USA}";
proc sgplot data=sashelp.cars;
where origin^='Asia' && type^="Hybrid";                 /* omit first category */
   styleattrs DataContrastColors = (&gcdata2 &gcdata3); /* use 2nd and 3rd colors */
   scatter x=weight y=mpg_city / group=Origin markerattrs=(symbol=CircleFilled);
   keylegend / location=inside position=TopRight across=1;
run;

It would be great if you could specify a style-independent syntax such as

styleattrs DataContrastColors=(GraphData2:ContrastColor GraphData3:ContrastColor);

Unfortunately, that syntax is not supported. The STYLEATTRS statement requires a list of color values or SAS color names.

Although this trick is interesting, in general I prefer to use styles (rather than hard-coded color values) in production code. However, if you want to know the RGB/hex values for a style, this trick shows how you can get them from an ODS template.

tags: SAS Programming, Statistical Graphics

The post What colors does PROC SGPLOT use for markers? appeared first on The DO Loop.

30
Jan

Automate the creation of a discrete attribute map

If you are a SAS programmer and use the GROUP= option in PROC SGPLOT, you might have encountered a thorny issue: if you use a WHERE clause to omit certain observations, then the marker colors for groups might change from one plot to another. This happens because the marker colors depend on the data by default. If you change the number of groups (or the order of groups), the marker colors also change.

A simple example demonstrates the problem. The following scatter plots are colored by the ORIGIN variable in the SasHelp.Cars data. (Click to enlarge.) The ORIGIN variable has three levels: Asia, Europe, and USA. On the left, all values of the ORIGIN variable are present in the graph. On the right, the Asian vehicles are excluded by using WHERE ORIGIN^="Asia". Notice that the colors of the markers on the right are not consistent with the values on the left.

Warren Kuhfeld wrote an excellent introduction to legend order and group attributes, and he describes several other ways that group colors can change from one plot to another. To solve this problem, Kuhfeld and other experts recommend that you create a discrete attribute map. A discrete attribute map is a SAS data set that specifies the colors to use for each group value. If you are not familiar with discrete attribute maps, I provide several references at the end of this article.


Automatically create a discrete attribute map for PROC SGPLOT #SASTip
Click To Tweet


Automatic creation of a discrete attribute map

The discrete attribute map is powerful, flexible, and enables the programmer to completely determine the legend order and color for all categories. However, I rarely use discrete attribute maps in my work because the process requires the manual creation of a data set. The data set has to contain all categories (spelled and capitalized correctly) and you have to remember (or look up) the structure of the data set. Furthermore, many examples use hard-coded color values such as CXFFAAAA or "LightBlue," whereas I prefer to use the GraphDatan style elements in the current ODS style.

However, I recently realized that PROC FREQ and the SAS DATA step can lessen the burden of creating a discrete attribute map. The documentation for the discrete attribute map mentions that you can define a column named MarkerStyleElement (or MarkerStyle), which specifies the names of styles elements such as GraphData1, GraphData2, and so on. Therefore, you can use PROC FREQ to write the category levels to a data set, and use a simple DATA step to add the MarkerStyleElement variable. For example, you can create a discrete attribute map for the ORIGIN variable, as follows:

/* semi-automatic way to create a DATTRMAP= data set */
%let VarName = Origin;           /* specify name of grouping variable */
proc freq data=sashelp.cars ORDER=FORMATTED;   /* or ORDER=DATA|FREQ  */
   tables &VarName / out=Attrs(rename=(&VarName=Value));
run;
 
data DAttrs;
ID = "&VarName";                 /* or "ID_&VarName" */
set Attrs(keep=Value);
length MarkerStyleElement $11.;
MarkerStyleElement = cats("GraphData", 1+mod(_N_-1, 12)); /* GraphData1, GraphData2, etc */
run;
 
proc print; run;
Structure of discrete attribute maps for DATTRMAP= option in PROC SGPLOT

Voila! The result is a valid discrete attribute data set for the ORIGIN variable. The DATTRS data set contains all the information you need to ensure that the first category is always displayed by using the GraphData1 element, the second category is displayed by using GraphData2, and so on. The program does not require that you manually type the categories or even know how many categories there are. Obviously, you could write a macro that makes it easy to generate these statements.

This data set uses the alphabetical order of the formatted values to determine the group order. However, you can use the ORDER=DATA option in PROC FREQ to order by the order of categories in the data set. You can also use the ORDER=FREQ option to order by the most frequent categories. Because most SAS-supplied styles define 12 style elements, the MOD function is used to handle categorical variable that have more than 12 levels.

Use the discrete attribute map

To use the discrete attribute map, you need to specify the DATTRMAP= option on the PROC SGPLOT statement. You also need to specify the ATTRID= option on every SGPLOT statements that will use the map. Notice that I set the value of the ID variable to be the name of the GROUP= variable. (If that is confusing, you could choose a different value, as noted in the comments of the program.) The following statements are similar to the statements that create the right-hand graph at the top of this article, except this call to PROC SGPLOT uses the DATTRS discrete attribute map:

proc sgplot data=sashelp.cars DATTRMAP=DAttrs;
where origin^='Asia' && type^="Hybrid";
   scatter x=weight y=mpg_city / group=Origin ATTRID=Origin 
                                 markerattrs=(symbol=CircleFilled);
   keylegend / location=inside position=TopRight across=1;
run;
Markers colored by attributes specified in a discrete attribute data set, using PROC SGPLOT and the DATTRMAP= option

Notice that the colors in this scatter plot are the same as for the left-hand graph at the top of this article. The group colors are now consistent, even though the number of groups is different.

Generalizing the automatic creation of a discrete attribute map

The previous section showed how to create a discrete attribute map for one variable. You can use a similar approach to automatically create a discrete data map that contains several variables. The main steps are as follows:

  1. Use ODS OUTPUT to save the OneWayFreqs tables from PROC FREQ to a SAS data set.
  2. Use the SUBSTR function to extract the variable name into the ID variable.
  3. Use the COALESCEC function to form a Value column that contains the values of the categorical variables.
  4. Use BY-group processing and the UNSORTED option to assign the style elements GraphDatan.
ods select none;
proc freq data=sashelp.cars;
   tables Type Origin;        /* specify VARS here */ 
   ods output OneWayFreqs=Freqs;
run;
ods select all;
 
data Freqs2;
set Freqs;
length ID $32.;
ID = substr(Table, 6);        /* original values are "Table VarName" */
Value = COALESCEC(F_Type, F_Origin);  /* also specify F_VARS here */
keep ID Value;
run;
 
data DAttrs(drop=count);
set Freqs2;
length MarkerStyleElement $11.;
by ID notsorted;
if first.ID then count = 0;
count + 1;
MarkerStyleElement = cats("GraphData", 1 + mod(count-1, 12));
run;

The preceding program is not completely general, but it shows the main ideas. You can adapt the program to your own data. If you are facile with the SAS macro language, you can even write a macro that generates appropriate code for an arbitrary number of variables. Leave a comment if this technique proves useful in your work or if you have ideas for improving the technique.

References

tags: Statistical Graphics, Statistical Programming

The post Automate the creation of a discrete attribute map appeared first on The DO Loop.

6
Jan

Is "La Quinta" Spanish for "Next to Denny's"?

“La Quinta” is Spanish for “next to Denny’s.”
     -- Mitch Hedberg, comedian

Mitch Hedberg's joke resonates with travelers who drive on the US interstate system because many highway exits feature both a La Quinta Inn™ and a Denny's® restaurant within a short distance of each other. But does a statistical data analysis support this anecdotal evidence?

In 2014 John Reiser wrote a blog post that uses the Python language to scrape the web for the locations of La Quinta Inns and Denny's restaurants. He then analyzed the data to show that, yes, in general, a guest at a La Quinta Inn does not have far to travel if he wants to eat at a Denny's restaurant. This work inspired Colin Rundel and Mine Cetckaya-Rundel to assign this analysis as a project for their students at Duke University and to write an article in CHANCE magazine (29(2), 2016) about the assignment.

Rundel and Cetckaya-Rundel posted CSV files on the CHANCE web site that contain the longitude, latitude, and addresses of 851 La Quinta Inns and 1,634 Denny's restaurants in the contiguous US (as of Dec 2015). This article follows their presentation and shows how to analyze the La Quinta-Denny's spatial data in SAS. The analysis is a straightforward extension of the nearest-neighbor techniques shown in article "Distances between observations in two groups." You can download the SAS program that imports the data and creates all the graphs and tables in this article.

Visualizing the locations of La Quinta Inns and Denny's restaurants

Locations of La Quinta Inns and Denny's restaurants in the contiguous US

You can use PROC HTTP in SAS to read CSV files directly from a URL. After importing the locations of La Quinta Inns and Denny's restaurants from the CHANCE web site, you can use PROC SGPLOT to plot the (unprojected) locations as longitudes and latitudes. To help visualize the locations, the adjacent graph overlays the data on an outline of the lower-48 states in the US. (Click the graph to enlarge.)

In the graph, the La Quinta Inns are represented by blue circles and Denny's by a red cross. In the enlarged version you can see that many circles enclose a red cross, which indicates that the inn and restaurant are very close. On the other hand, there are a few La Quinta locations that seem to be far away from any Denny's. Montana, North Dakota, Louisiana, Kansas, Nevada, and southwest Texas are some geographic regions in which a blue circle is not close to a red cross.

The distance to the nearest Denny's for each La Quinta Inn

Imagine that a husband and wife are spending their retirement by crisscrossing the US. The wife wants to sleep each night at a La Quinta Inn. The husband wants to eat breakfast each morning at a Denny's restaurant. If they both get their wishes, how far will they need to travel to get breakfast each morning?

I've previously written about how to compute the distance to the nearest neighbor between observations that are in different groups. For these data, the La Quinta Inns form one group and the Denny's restaurants form a second group. Because the coordinates for these data are longitude and latitude, you need to use the GEODIST function in Base SAS to compute the distance ( in kilometers, as the crow flies) between each hotel and the nearest Denny's.

For each of the 851 hotels, you can find the distance to the nearest Denny's. The following table summarizes the distribution of distances, in kilometers:

lqdennys2

The table shows that the closest Denny 's is a mere 11 meters from the adjacent La Quinta Inn! For 25% of the inns, the nearest Denny's is within 1.3 km, which is a short walk. Fifty percent of the inns are within 5 km (an easy drive) of a Denny's, and 75% are within 17 km. It seems that the data supports a modified version of Hedberg's joke: “La Quinta” is Spanish for “often close to a Denny’s.”

The following histogram shows the distribution of the 851 distances from each La Quinta Inn to the nearest Denny's. The histogram shows that about 65% of the inns are within 10 km and about 77% are within 20 km. As long as the husband and wife stay at these inns, they will both be happy, well-rested, and well-fed!

lqdennys3

La Quinta Inns that are far from a Denny's

lqdennys4

Clearly the couple can happily sleep at a La Quinta Inn and eat at a Denny's provided that they avoid the few La Quinta Inns that are not located near a Denny's. The scatter plot and map near the top of this article gives some indications about where these inns are located, but we can identify these inns more precisely. For the sake of the couple's marital bliss, let's enumerate the inns that are farthest from a Denny's.

The adjacent table shows 10 La Quinta Inns that are farthest from a Denny's restaurant. The farthest distance is the La Quinta Inn in Glendive, Montana, which is 282 km from the nearest Denny's. La Quinta Inns in Kansas, Nevada, Texas, and Louisiana are also more than 150 km away from a Denny's.

If possible, the couple should avoid these inns, but what if their travels bring them through these cities? In that case, they need to know that distance and location to the nearest Denny's. The next graph might be helpful.

The following graph shows all the La Quinta Inns. The inns shown in red are those for which the distance to the nearest Denny's is more than 80 km away. For each of these inns, an arrow is drawn from the La Quinta Inn to the location of the nearest Denny's. (I explained this nearest-neighbor plot in a previous article.) This helps the couple know what direction they need to drive in order to reach breakfast—or perhaps lunch! For some La Quinta locations (MT, SC, TN) the couple will need to drive into an adjacent state.

Nearest Denny's for La Quinta Inns That Are Far from a Denny's

Summary

Mitch Hedberg's joke is funny because it has an element of truth. For most La Quinta Inns, the nearest Denny's restaurant is a short walk or drive away. By using nearest-neighbor computations and the GEODIST function in SAS, you can compute the distances from each inn to the nearest Denny's. You can graph the set of distances and compute statistics such as quantiles. You can visualize the co-locations of the inns and restaurants, and even direct travelers to the nearest Denny's. I agree with Rundel and Cetckaya-Rundel that this exercise provides a fun activity in data analysis for students.

For professional SAS programmers, the exercise demonstrates how to conduct a particular kind of co-location analysis. Instead of Denny's restaurants, the professional analyst might be interested in the distance to the nearest hospital, distribution center, or cell-phone tower. SAS statistical graphics and SAS/IML, provides the tools for analyzing the distance between groups of spatial data.

If you want to examine or extend my analysis of these data, you can download the SAS program.

tags: Data Analysis, Spatial Data, Statistical Graphics

The post Is "La Quinta" Spanish for "Next to Denny's"? appeared first on The DO Loop.

30
Nov

Append data to add markers to SAS graphs

Do you want to create customized SAS graphs by using PROC SGPLOT and the other ODS graphics procedures? An essential skill that you need to learn is how to merge, join, append, and concatenate SAS data sets that come from different sources. The SAS statistical graphics procedures (SG procedures) enable you to overlay all kinds of customized curves, markers, and bars. However, the SG procedures expect all the data for a graph to be in a single SAS data set. Therefore it is often necessary to append two or more data sets before you can create a complex graph.

This article discusses two ways to combine data sets in order to create ODS graphics. An alternative is to use the SG annotation facility to add extra curves or markers to the graph. Personally, I prefer to use the techniques in this article for simple features, and reserve annotation for adding highly complex and non-standard features.

Overlay curves

sgplotoverlay

In a previous article, I discussed how to structure a SAS data set so that you can overlay curves on a scatter plot.

The diagram at the right shows the main idea of that article. The X and Y variables contain the original data, which are the coordinates for a scatter plot. Secondary information was appended to the end of the data. The X1 and Y1 variables contain the coordinates of a custom scatter plot smoother. The X2 and Y2 variables contain the coordinates of a different scatter plot smoother.

This structure enables you to use the SGPLOT procedure to overlay two curves on the scatter plot. You use a SCATTER statement and two SERIES statements to create the graph. See the previous article for details.

Overlay markers: Wide form

In addition to overlaying curves, I sometimes want to add special markers to the scatter plot. In this article I will show how to add a marker that shows the location of the sample mean. This article shows how to use PROC MEANS to create an output data set that contains the coordinates of the sample mean, then append that data set to the original data.


Add special markers to a graph using PROC SGPLOT #SASTip
Click To Tweet


The following statements use PROC MEANS to compute the sample mean for four variables in the SasHelp.Iris data set, which contains the measurements for 150 iris flowers. To emphasize the general syntax of this computation, I use macro variables, but that is not necessary:

%let DSName = Sashelp.Iris;
%let VarNames = PetalLength PetalWidth SepalLength SepalWidth;
 
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean= / autoname;
run;

The AUTONAME option on the OUTPUT statement tells PROC MEANS to append the name of the statistic to the variable names. Thus the output data set contains variables with names like PetalLength_Mean and SepalWidth_Mean. As shown in the diagram in the previous section, this enables you to append the new data to the end of the old data in "wide form" as follows:

data Wide;
   set &DSName Means; /* add four new variables; pad with missing values */
run;
 
ods graphics / attrpriority=color subpixel;
proc sgplot data=Wide;
scatter x=SepalWidth y=PetalLength / legendlabel="Data";
ellipse x=SepalWidth y=PetalLength / type=mean;
scatter x=SepalWidth_Mean y=PetalLength_Mean / 
         legendlabel="Sample Mean" markerattrs=(symbol=X color=firebrick);
run;
Scatter plot with markers for sample means

The first SCATTER statement and the ELLIPSE statement use the original data. Recall that the ELLIPSE statement draws an approximate confidence ellipse for the mean of the population. The second SCATTER statement uses the sample means, which are appended to the end of the original data. The second SCATTER statement draws a red marker at the location of the sample mean.

You can use this same method to plot other sample statistics (such as the median) or to highlight special values such as the origin of a coordinate system.

Overlay markers: Long form

In some situations it is more convenient to append the secondary data in "long form." In the long form, the secondary data set contains the same variable names as in the original data. You can use the SAS data step to create a variable that identifies the original and supplementary observations. This technique can be useful when you want to show multiple markers (sample mean, median, mode, ...) by using the GROUP= option on one SCATTER statement.

The following call to PROC MEANS does not use the AUTONAME option. Therefore the output data set contains variables that have the same name as the input data. You can use the IN= data set option to create an ID variable that identifies the data from the computed statistics:

/* Long form. New data has same name but different group ID */
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean=;
run;
 
data Long;
set &DSName Means(in=newdata);
if newdata then 
   GroupID = "Mean";
else GroupID = "Data";
run;

The DATA step created the GroupID variable, which has the values "Data" for the original observations and the value "Mean" for the appended observations. This data structure is useful for calling PROC SGSCATTER, which supports the GROUP= option, but does not support multiple PLOT statements, as follows:

ods graphics / attrpriority=none;
proc sgscatter data=Long 
   datacontrastcolors=(steelblue firebrick)
   datasymbols=(Circle X);
plot (PetalLength PetalWidth)*(SepalLength SepalWidth) / group=groupID;
run;
Scatter plot matrix with markers for sample means

In conclusion, this article demonstrates a useful technique for adding markers to a graph. The technique requires that you concatenate the original data with supplementary data. Appending and merging data is a technique that is used often when creating ODS statistical graphics in SAS. It is a great technique to add to your programming toolbox.

tags: SAS Programming, Statistical Graphics, Tips and Techniques

The post Append data to add markers to SAS graphs appeared first on The DO Loop.

Back to Top