3
Jan

## The top 10 posts from The DO Loop in 2017

I wrote more than 100 posts for The DO Loop blog in 2017. The most popular articles were about SAS programming tips, statistical data analysis, and simulation and bootstrap methods. Here are the most popular articles from 2017 in each category.

### Statistics and Data Analysis

• M&M Colors: It's no surprise that a statistical analysis of the color distribution of M&M candies was one of the most popular articles. Some people are content to know that the candies are delicious, but thousands wanted to read about whether blue and orange candies occur more often than brown.
• Interpretation of Correlation: Correlation is one of the simplest multivariate statistics, but it can be interpreted in many ways: algebraic, geometric, in terms of regression, and more. This article describes seven ways to view correlation?
• Winsorize Data: Before you ask "how can I Winsorize data" to eliminate outliers, you should ask "what is Winsorization" and "what are the pitfalls?" This article presents the advantages and disadvantages of Winsorizing data.

### Simulation and Bootstrapping

Was you New Year's resolution to learn more about SAS? Did you miss any of these popular posts? Take a moment to read (or re-read!) one of these top 10 posts from the past year.

The post The top 10 posts from <em>The DO Loop</em> in 2017 appeared first on The DO Loop.

18
Dec

## Visualize multivariate regression models by slicing continuous variables

Slice, slice, baby! You've got to slice, slice, baby!

When you fit a regression model that has multiple explanatory variables, it is a challenge to effectively visualize the predicted values. This article describes how to visualize the regression model by slicing the explanatory variables. In SAS, you can use the SLICEFIT option in the EFFECTPLOT statement visualize a slice of a regression surface.

### Why the naive visualization fails

For a regression model that contains one explanatory variable and (optionally) one classification variable, it is easy to visualize the predicted values. Most statistical software packages make it easy to create a "fit plot." For example, the following call to PROC GLM in SAS fits a model to some patients in a heart study:

```data Heart; /* create example data */ set sashelp.heart(obs=500); where cholesterol < 400; run;   ods graphics / attrpriority=none /* groups determine symbols and line patterns */ imagemap tipmax=1500; /* enable tool tips */   /* easy to visualize predicted values for 1 continuous and 1 categorical explanatory variable */ proc glm data=Heart plots=meanplot; /* PLOTS= option supported in many procedures */ class Sex; model Cholesterol = Sex Systolic; quit;```

The graph shows the observed responses versus the continuous explanatory variable and overlays two curves: one for the predicted values when Sex='Male' and the other when Sex='Female'. Creating this graph is easy because the procedure does all the work.

What happens if you add additional explanatory variables into the model and try to create the same graph? For reasons that will soon be apparent, the procedure will not automatically create the graph when there are additional variables in the model. However, you can use the OUTPUT statement to write the predicted values to a SAS data set and use PROC SGPLOT to create the graph. You will need to sort by the variable that you are plotting on the X axis, as follows:

```proc glm data=Heart; class Sex Smoking_status; model Cholesterol = Sex Smoking_Status /* two classification variables */ Systolic Weight; /* two continuous variables */ output out=GLMOut p=Pred; /* output data set contains predicted values */ quit;   proc sort data=GLMOut; by Systolic Sex; run; /* sort by X variable for graphing */   title "Predicted Values"; proc sgplot data=GLMOut; styleattrs datalinepatterns=(solid solid); scatter x=Systolic y=Cholesterol / group=Sex transparency=0.75; series x=Systolic y=Pred / group=Sex tip=(Smoking_Status Weight); /* add tool tips */ yaxis min=180 max=300; /* zoom in on predicted values */ footnote J=L "Jagged Lines Because Covariates Have Multiple Values"; run;```

This graph looks strange. The regression model is linear, but a plot of the predicted values shows a jagged line for the predicted values. What is going on?

You can use the tool tips feature of the graph to understand why the curves are jagged. If you hover the cursor near a point on the jagged line, the values of the hidden explanatory variables (Weight and Smoking_Status) appear. The graph shows the tool tip at a point that corresponds to a male patient who weighs 160 pounds and who is a moderate smoker. By moving the cursor, you can discover that the previous point along the red line corresponds to a male patient who weighs 155 pounds and is a non-smoker. The subsequent point corresponds to a heavy smoker who weighs 151 pounds.

Because Weight and Smoking_Status were included in the model, the predicted values "jump" up or down as you move along the Systolic axis. Two observations that have similar Systolic values might have very different values for other (hidden) components. Geometrically, this graph displays the projection of the predicted values onto the two-dimensional (Systolic, Cholesterol) plane. To obtain a smooth curve, you must "slice" a response surface rather than project it.

### Slice the response surfaces

The predicted values for this model form a set of 10 planes in the three-dimensional space (x, y, z) = (Systolic, Weight, Cholesterol). Each plane is the graph of predicted values for a combination of the 2 genders and 5 levels of smokers. There is one plane is for the ('Male', 'Non-smoker') patients, another for the ('Female', 'Light (1-5)') patients, and so on.

A "slice" through the response surfaces is accomplished by evaluating the model at a particular value of one of the continuous variables. This gives a two-dimensional plot that has 10 lines on it. Because 10 lines might overcrowd the display, it is common to pick a reference value for one of the classification variables and plot only the lines that are indexed by that value. For example, if you choose the reference value Smoking_Status = 'Non-smoker', the plot contains two lines that correspond to ('Male', 'Non-smoker') and ('Female', 'Non-smoker').

This might sound complicated, but SAS provides an easy implementation: the SLICEFIT option in the EFFECTPLOT statement, which is supported in several regression procedures, enables you to specify how you want to slice the surfaces and which combinations of levels you want to display.

By default, the EFFECTPLOT SLICEFIT statement creates a "sliced fit plot" that graphs the response variable versus the first continuous variable and shows the predicted values for each level of the first class variable. "First" is determined by the order in which the variables are listed on the MODEL statement. Other continuous variables are sliced (evaluated) at their mean value; other classification variables are evaluated at their last level.

PROC GLM does not support the EFFECTPLOT statement, but PROC GENMOD does. The following call to PROC GENMOD fits the same model and creates a "sliced fit plot" of the predicted values. The sliced fit plot will show the response variable (Cholesterol) versus the first continuous variable (Systolic) overlaid with predictions for males and females. The value of the Weight variable is set to 151.7, which is the mean value of the sample. The value of the Smoking_Status variable is set to 'Very Heavy (> 25)', which is the last level in alphanumeric order.

```title; footnote; ods graphics / attrpriority=none imagemap=off; proc genmod data=Heart; class Sex Smoking_status; model Cholesterol = Sex Smoking_Status /* classification variables */ Systolic Weight; /* continuous variables */ /* Plot response vs first cont var for each level of first class var */ /* Set other cont vars to MEAN; set other class vars to last level */ effectplot slicefit / obs; /* add scatter plot of observations */ run;```

The sliced fit plot shows smooth (not jagged) lines because the model is evaluated at constant values of the hidden variables. The values (Weight, Smoking_Status) = (151.7, 'Very Heavy (> 25)') are held constant while the model is evaluated over the range of the Systolic and Sex variables.

### Other ways to slice the response surfaces

The SLICEFIT option in the EFFECTPLOT statement supports many suboptions that enable you to control the way that the model is sliced:

• You can plot any two variables, one continuous and one categorical. Use the X= option to specify the continuous variable and the SLICEBY= option to specify the categorical variable.
• You can specify the statistics that are used to slice the continuous covariates. By default the covariates are sliced at their mean values. You can use the AT option to specify the following keywords: MEAN (the default), MIN, MAX, MEDIAN, or MIDRANGE. (Recall that the midrange is the value (min+max)/2.) For class variables, the REF option specifies that the last level be used.
• You can use the AT option to specify particular values for slicing the continuous covariates and class variables.
• You can specify multiple values for the AT option. The EFFECTPLOT statement will create a panel of sliced fit plots, one for each joint combination of specified values.

The following four EFFECTPLOT statements correspond to the four items in the previous list:

```proc genmod data=Heart; class Sex Smoking_status; model Cholesterol = Sex Smoking_Status /* classification variables */ Systolic Weight; /* continuous variables */ /* specify the X and categorical variables */ effectplot slicefit(X=weight sliceby=Smoking_status) / obs;   /* specify statistics used to slice the covariates */ effectplot slicefit / at MIDRANGE /* new default for continuous vars */ REF; /* default for classification vars */   /* specify explicit values of the covariates */ effectplot slicefit / at(Weight=150 Smoking_Status='Non-smoker');   /* specify multiple values of the covariates to get a panel */ effectplot slicefit / at(Weight=150 200 Smoking_Status='Non-smoker' 'Heavy (16-25)'); quit;```

To save space, only the last sliced fit plot (the panel) is shown below. I have linked to the other three plots: the plot of Weight and Smoking_Status, the plot at midrange, and the plot at specified values.

In summary, you can use the SLICEFIT option in the EFFECTPLOT statement in SAS to visualize regression models that contain many explanatory variables. The AT option enables you to specify values for the covariates. The resulting graph displays a slice through the response surface.

The EFFECTPLOT statement is also available in PROC PLM. PROC PLM enables you to visualize a model that has been saved to an item store. The OBS option (which overlays the predicted values and a scatter plot) is not available in PROC PLM because the item store does not include the observations.

The post Visualize multivariate regression models by slicing continuous variables appeared first on The DO Loop.

4
Dec

## Mean imputation in SAS

Imputing missing data is the act of replacing missing data by nonmissing values. Mean imputation replaces missing data in a numerical variable by the mean value of the nonmissing values. This article shows how to perform mean imputation in SAS. It also presents three statistical drawbacks of mean imputation.

### How to perform mean imputation in SAS

The easiest way to perform mean imputation in SAS is to use PROC STDIZE. PROC STDIZE supports the REPONLY and the METHOD=MEAN options, which tells it to replace missing values with the mean for the variables on the VAR statement. To demonstrate mean imputation, the following statements randomly add missing values to the Sashelp.Class data set. The call to PROC STDIZE then replaces the missing values and creates a data set called IMPUTED that contains the results:

```/* Create "original data" by randomly inserting missing values for some heights */ data Have; set sashelp.class; call streaminit(12345); Replaced = rand("Bernoulli", 0.4); /* indicator variable is 1 about 40% of time */ if Replaced then Height = .; run;   /* Mean imputation: Use PROC STDIZE to replace missing values with mean */ proc stdize data=Have out=Imputed oprefix=Orig_ /* prefix for original variables */ reponly /* only replace; do not standardize */ method=MEAN; /* or MEDIAN, MINIMUM, MIDRANGE, etc. */ var Height; /* you can list multiple variables to impute */ run;   proc print data=Imputed; format Orig_Height Height BESTD8.1; var Name Orig_Height Height Weight Replaced; run;```

The output shows that the missing data (such as observations 6 and 8) are replaced by 61.5, which is the mean value of the observed heights. For a subsequent visualization, I have included a binary variable (Replaced) that indicates whether an observation was originally missing. The METHOD= option in PROC STDIZE supports several statistics. You can use METHOD=MEDIAN to replace missing values by the median, METHOD=MINIMUM to replace by the minimum value, and so forth.

### Problems with mean imputation

Most software packages deal with missing data by using listwise deletion: observations that have missing data are dropped from the analysis. Throwing away hard-collected data is painful and can result in a substantial loss of power for statistical tests. Mean imputation, which is easy to implement, enables analysts to use every observation. However, mean imputation has three serious disadvantages that can lead to problems in your statistical analysis. Mean imputation is a univariate method that ignores the relationships between variables and makes no effort to represent the inherent variability in the data. In particular, when you replace missing data by a mean, you commit three statistical sins:

• Mean imputation reduces the variance of the imputed variables.
• Mean imputation shrinks standard errors, which invalidates most hypothesis tests and the calculation of confidence interval.
• Mean imputation does not preserve relationships between variables such as correlations.

These problems are discussed further in my next blog post. Most experts agree that the drawbacks far outweigh the advantages, especially since most software supports modern alternatives to single imputation, such as multiple imputation. My advice: don't use mean imputation if you can use a more sophisticated alternative.

### Epilogue

When I was in college, an actor friend smoked cigarettes. He knew that he should stop, but his addiction was too strong. When he lit up he would recite the following verse and dramatically punctuate the final phrase by blowing a smoke ring:

If you don't smoke, don't start.
If you do smoke, stop.
If you do smoke and won't stop, smoke with style. (*blows smoke ring*)

I don't recommend mean imputation. It is bad for the health of your data. But I can't dissuade from using mean imputation, remember the following verse:

If you don't use mean imputation, don't start.
If you do use mean imputation, stop.
If you do use mean imputation and won't stop, use PROC STDIZE.

The post Mean imputation in SAS appeared first on The DO Loop.

23
Oct

## Principal component regression in SAS

A common question on discussion forums is how to compute a principal component regression in SAS. One reason people give for wanting to run a principal component regression is that the explanatory variables in the model are highly correlated which each other, a condition known as multicollinearity. Although principal component regression (PCR) is a popular technique for dealing with almost collinear data, PCR is not a cure-all. This article shows how to compute a principal component regression in SAS; a subsequent article discusses the problems with PCR and presents alternative techniques.

### Multicollinearity in regression

Near collinearity among the explanatory variables in a regression model requires special handling because:

• The crossproduct matrix X`X is ill-conditioned (nearly singular), where X is the data matrix.
• The standard errors of the parameter estimates are very large. The variance inflation factor (VIF), which is computed by PROC REG, is one way to measure how collinearities inflate the variances of the parameter estimates.
• The model parameters are highly correlated, which makes interpretation of the parameters difficult.

Principal component regression keeps only the most important principal components and discards the others. This means that you compute the principal components for the explanatory variables and drop the components that correspond to the smallest eigenvalues of X`X. If you keep k principal components, then those components enable you to form a rank-k approximation to the crossproduct matrix. If you regress the response variable onto those k components, you obtain a PCR. Usually the parameter estimates are expressed in terms of the original variables, rather than in terms of the principal components.

In SAS there are two easy ways to compute principal component regression:

• The PLS procedure supports the METHOD=PCR to perform principal component regression. You can use the NFAC= option to determine the number of principal components to keep.
• The MODEL statement in PROC REG supports the PCOMIT= option. (This option is read as "PC omit.") The argument to the PCOMIT= option is the number of principal components to drop (omit) from the regression.

Notice that neither of these methods calls PROC PRINCOMP. You could call PROC PRINCOMP, but it would be more complicated than the previous methods. You would have to extract the first principal components (PCs), then use PROC REG to compute the regression coefficients for the PCs, then use matrix computations to convert the parameter estimates from the PCs to the original variables.

Principal component regression is also sometimes used for general dimension reduction. Instead of projecting the response variable onto a p-dimensional space of raw variables, PCR projects the response onto a k-dimensional space where k is less than p. For dimension reduction, you might want to consider another approach such as variable selection by using PROC GLMSELECT or PROC HPGENSELECT. The reason is that the PCR model retains all of the original variables whereas variable selection procedures result in models that have fewer variables.

### Use PROC PLS for principal component regression

I recommend using the PLS procedure to compute a principal component regression in SAS. As mentioned previously, you need to use the METHOD=PCR and NFAC= options. The following data for 31 men at a fitness center is from the documentation for PROC REG. The goal of the study is to predict oxygen consumption from age, weight, and various physiological measurements before and during exercise. The following call to PROC PLS computes a PCR that keeps four principal components:

```data fitness; input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@; datalines; 44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185 44 85.84 54.297 8.65 45 156 168 42 68.15 59.571 8.17 40 166 172 38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176 176 40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170 44 81.42 39.442 13.08 63 174 176 38 81.87 60.055 8.63 48 170 186 44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03 56 186 192 45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164 54 83.12 51.855 10.33 50 166 170 49 81.42 49.156 8.95 44 180 185 51 69.63 40.836 10.95 57 168 172 51 77.91 46.672 10.00 48 162 168 48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168 57 73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165 52 76.32 45.441 9.63 48 164 166 50 70.87 54.625 8.92 48 146 155 51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168 172 51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155 49 76.32 48.673 9.40 56 186 188 48 61.24 47.920 11.50 52 170 176 52 82.78 47.467 10.50 53 170 172 ;   proc pls data=fitness method=PCR nfac=4; /* PCR onto 4 factors */ model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / solution; run;```

The output includes the parameter estimates table, which gives the estimates for the four-component regression in terms of the original variables. Another table (not shown) shows that the first four principal components explain 93% of the variation in the explanatory variables and 78% of the variation in the response variable.

For another example of using PROC PLS to combat collinearity, see Yu (2011), "Principal Component Regression as a Countermeasure against Collinearity."

### Use PROC REG for principal component regression

I recommend PROC PLS for principal component regression, but you can also compute a PCR by using the PCOMIT= option on the MODEL statement in PROC REG. However, the parameter estimates are not displayed in any table but must be written to OUTEST= data set, as follows:

```proc reg data=fitness plots=none outest=PE; /* write PCR estimates to PE data set */ model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / PCOmit=2; /* omit 2 PCs ==> keep 6-2=4 PCs */ quit;   proc print data=PE(where=(_Type_="IPC")) noobs; var Intercept--MaxPulse; run;```

Notice that the PCOMIT=2 option specifies that two PCs should be dropped, which is equivalent to keeping four components in this six-variable model. The parameter estimates are written to the PE data set and are displayed by PROC PRINT. The estimates the same as those found by PROC PLS. In the PE data, the PCR estimates are indicated by the value "IPC" for the _TYPE_ variable, which stands for incomplete principal component regression. The word "incomplete" indicates that not all the principal components are used.

It is worth noting that even though the principal components themselves are based on centered and scaled data, the parameter estimates are reported for the original (raw) variables. It is also worth noting that you can use the OUTSEB option on the PROC REG statement to obtain standard errors for the parameter estimates.

### Should you use principal component regression?

This article shows you how to perform principal component regression in SAS by using PROC PLS with METHOD=PCR. However, I must point out that there are statistical drawbacks to using principal component regression. The primary issue is that principal component regression does not use any information about the response variable when choosing the principal components. Before you decide to use PCR, I urge you to read my next post about the drawbacks with the technique. You can then make an informed decision about whether you want to use principal component regression for your data.

The post Principal component regression in SAS appeared first on The DO Loop.

27
Sep

## Data-driven simulation

In a large simulation study, it can be convenient to have a "control file" that contains the parameters for the study. My recent article about how to simulate multivariate normal clusters demonstrates a simple example of this technique. The simulation in that article uses an input data set that contains the parameters (mean, standard deviations, and correlations) for the simulation. A SAS procedure (PROC SIMNORMAL) simulates data based on the parameters in the input data set.

This is a powerful paradigm. Instead of hard-coding the parameters in the program (or as macro variables), the parameters are stored in a data set that is processed by the program. This is sometimes called data-driven programming. (Some people call it dynamic programming, but there is an optimization technique of the same name so I will use the term "data-driven.") In a data-driven program, when you want to run the program with new parameters, you merely modify the data set that contains the control parameters.

I have previously written about a different way to control a batch program by passing in parameters on the command line when you invoke the SAS program.

### Static programming and hard-coded parameters

Before looking at data-driven programming, let's review the static approach. I will simulate clusters of univariate normal data as an example.

Suppose that you want to simulate normal data for three different groups. Each group has its own sample size (N), mean, and standard deviation. In my book Simulating Data with SAS (p. 206), I show how to simulate this sort of ANOVA design by using arrays, as follows.

```/* Static simulation: Parameters embedded in the simulation program */ data AnovaStatic; /* define parameters for three simulated group */ array N[3] _temporary_ (50, 50, 50); /* sample sizes */ array Mean[3] _temporary_ (14.6, 42.6, 55.5); /* center for each group */ array StdDev[3] _temporary_ ( 1.7, 4.7, 5.5); /* spread for each group */   call streaminit(12345); do k = 1 to dim(N); /* for each group */ do i = 1 to N[k]; /* simulate N[k] observations */ x = rand("Normal", Mean[k], StdDev[k]); /* from k_th normal distribution */ output; end; end; run;```

The DATA step contains two loops, one for the groups and the other for the observations within each group. The parameters for each group are stored in arrays. Notice that if you want to change the parameters (including the number of groups), you need to edit the program. I call this method "static programming" because the behavior of the program is determined at the time that the program is written. This is a perfectly acceptable method for most applications. It has the advantage that you know exactly what the program will do by looking at the program.

### Data-driven programming: Put parameters in a file

An alternative is to put the parameters for each group into a file or data set. If the k_th row in the data set contains the parameters for the k_th group, then the implicit loop in the DATA step will iterate over all groups, regardless of the number of groups. The following DATA step creates the parameters for three groups, which are read and processed by the second DATA step. The parameter values are the same as for the static example, but are transposed and processed row-by-row instead of via arrays:

```/* Data-driven simulation: Parameters in a data set, processed by the simulation program */ data params; /* define parameters for each simulated group */ input N Mean StdDev; datalines; 50 14.6 1.7 50 42.6 4.7 50 55.5 5.5 ;   data AnovaDynamic; call streaminit(12345); set params; /* implicit loop over groups k=1,2,... */ do i = 1 to N; /* simulate N[k] observations */ x = rand("Normal", Mean, StdDev); /* from k_th normal distribution */ output; end; run;```

Notice the difference between the static and dynamic techniques. The static technique simulates data from three groups whose parameters are specified in temporary arrays. The dynamic technique simulates data from an arbitrary number of groups. Currently, the PARAMS data specifies three groups, but if I change the PARAMS data set to represent 10 or 1000 groups, the AnovaDynamic DATA step will simulate data from the new design without any modification.

### Generate the parameters from real data

The data-driven technique is useful when the parameters are themselves the results of an analysis. For example, a common simulation technique is to generate the moments of real data (mean, variance, skewness, and so forth) and to use those statistics in place of the population parameters that they estimate. (See Chapter 16, "Moment Matching," in Simulating Statistics with SAS.)

The following call to PROC MEANS generates the sample mean and standard deviation for real data and writes those values to a data set:

```proc means data=sashelp.iris N Mean StdDev stackods; class Species; var PetalLength; ods output Summary=params; run;```

The output data set from PROC MEANS creates a PARAMS data set that contains the variables (N, MEAN, and STDDEV) that are read by the simulation program. Therefore, you can immediately run the AnovaDynamic DATA step to simulate normal data from the sample statistics. A visualization of the resulting simulated data is shown below.

You can run PROC MEANS on other data and other variables and the AnovaDynamic step will continue to work without any modification. The simulation is controlled entirely by the values in the "control file," which is the PARAMS data set.

You can generalize this technique by wrapping the program in a SAS macro in which the name of the parameter file and the name of the simulated data set are provided at run time. With a macro implementation, you can read from multiple input files and write to multiple output data sets. You could use such a macro, for example, to break up a large simulation study into smaller independent sub-simulations, each controlled by its own file of input parameters. In a gridded environment, each sub-simulation can be processed independently and in parallel, thus reducing the total time required to complete the study.

Although this article discusses control files in the context of statistical simulation, other applications are possible. Have you used a similar technique to control a program by using an input file that contains the parameters for the program? Leave a comment.

The post Data-driven simulation appeared first on The DO Loop.

21
Aug

## 6 tips for timing the performance of algorithms

When you implement a statistical algorithm in a vector-matrix language such as SAS/IML, R, or MATLAB, you should measure the performance of your implementation, which means that you should time how long a program takes to analyze data of varying sizes and characteristics. There are some general tips that can help you eliminate bottlenecks in your program so that your program is fast as lightning! In fact, it is a little bit frightening how quickly you can become an expert in timing.

The following general principles apply regardless of the language that you use to implement the algorithm:

1. Use simulation to construct test data of varying sizes. By simulating data, you can easily vary the size of the size while preserving distributional characteristics. For example, you might want to test an algorithm on data that are normally distributed and contain 25, 50, 75, and 100 thousand observations. (Each language has different techniques to simulate data efficiently. For SAS software, see Simulating Data with SAS.)
2. Vary the distribution of the test data. Concentrate mainly on how the algorithm performs on typical data, but also test examples of "best case" and "worst case" scenarios for which the algorithm might run very quickly or very slowly. If the performance is affected by other factors such as the proportion of outliers or missing values, then include those factors in the study.
3. If possible, construct timing tests that run between 0.1 and 1 seconds. This length is long enough to be reliably measured but short enough that you can run many tests. If you try to time very short intervals (less than a millisecond), you will discover that the operating system constantly performs unrelated background tasks that can interfere with your timings, which leads to noisy measurements. For ultra-short intervals (less than a microsecond), you can encounter an "uncertainty principle" phenomena in which the very act measuring the performance of an algorithm affects the performance that you are trying to measure.
4. To reduce timing noise, call the routine multiple times and report the mean or median. If the performance can vary greatly (±20% or more), report the distribution of times by using quantiles or by displaying a box plot.
5. Use a "burn-in" call to reduce the impact of overhead costs. The first time a routine is called, it might require the operating system to load DLLs or allocate a huge block of memory. When the routine is called a second time, the operating system might have cached the needed information. If so, subsequent calls could be considerably faster than the first.
6. Use a line plot or box plots to present the performance results, where the vertical axis represents time and the horizontal axis represents the size of the data. If the performance characteristics depend on other factors, you can represent those factors by overlaying multiple lines or by constructing a panel of graphs.

### Timing the performance of algorithms in SAS/IML

This article is motivated by an R programmer who was trying to find the fastest way (on average) to check whether a vector contains more than one unique value. The programmer was timing the performance of five different methods in R in hopes of finding the fastest. I will use the same example, but will examine only two SAS/IML methods:

• The ALL function in the SAS/IML language tests whether all elements of a vector are equal to some specified value. Thus the expression ALL(x = x[1]) is true only when all elements of a vector are the same.
• The UNIQUE function in the SAS/IML language returns an ordered list of the distinct elements of a vector. Thus the expression (ncol(UNIQUE(x))=1) is true only when a vector containsone distinct value.

The ALL function should be fast to discard non-constant vectors because it a "short-circuiting" operation. That is, as soon as it detects two different values, it returns 0 (false). If you have a vector with 100,000 elements, the ALL function might only examine a small number of elements to determine that the vector is not constant. In contrast, the UNIQUE function should be relatively slow: it always examines all the elements, and it allocates memory to return a sorted list of all unique values.

The following program illustrates many of the best practices. It constructs random binary vectors that contain between 100,000 and 1 million elements. Most elements are 0, but there is a small probability that an element could be 1. Some of the vectors will contain a 1 near the beginning of the vector (the best case for the ALL function), others will contain a 1 near the end (the worst case for ALL).

```proc iml; /* TIP: "Burn-in" by calling each important function once before you start timing */ x = randfun(N, "Bernoulli", 1/N); /* Simulate data for size */ r = all(x = x[1]); /* method 1: The ALL function */ r = (ncol(unique(x)) = 1); /* method 2: The UNIQUE function */ /* end burn-in */   sizes = do(2,10,2)*1e5; /* TIP: choose sizes so test runs in reasonable time */ NReps = 300; /* TIP: call functions multiple times */   TotalTime = j(ncol(sizes), 2); /* result matrix: for each size, save average time */ do j = 1 to ncol(sizes); /* TIP: use vectors of different sizes */ N = sizes[j]; x = j(N, 1); /* TRICK: estimate time to generate the random vectors; subtract that time later */ t0 = time(); do i = 1 to NReps; call randgen(x, "Bernoulli", 1/N); end; tRand = time() - t0;   /* Method 1: time for NReps calls */ t0 = time(); do i = 1 to NReps; call randgen(x, "Bernoulli", 1/N); /* TIP: simulate data for size */ r = all(x = x[1]); /* Method 1: The ALL function */ end; TotalTime[j,1] = time() - t0 - tRand; /* subtract time to generate random numbers */   /* Method 2: time for NReps calls */ t0 = time(); do i = 1 to NReps; call randgen(x, "Bernoulli", 1/N); /* TIP: simulate data for size */ r = (ncol(unique(x)) = 1); /* Method 2: The UNIQUE function */ end; TotalTime[j,2] = time() - t0 - tRand; /* subtract time to generate random numbers */ end; AvgTime = TotalTime / NReps; /* compute average time per call */ print AvgTime[c={ALL UNIQUE} r=(putn(sizes,'comma9.')) format=6.4];   /* TIP: create a series plot that overlays both curves */ Size = Sizes` // Sizes`; /* convert from wide to long data */ Time = AvgTime[,1] // AvgTime[,2]; Group = j(ncol(sizes), 1, "ALL") // j(ncol(sizes), 1, "UNIQUE"); title; call series(Size, Time) group=Group grid={x y} option="curvelabel" label={"Size", "Average Time per Call"};```

The results are shown to the right. The graph shows the average time to determine whether the elements of a vector of size N are unique for N in the range [1e5, 1e6]. Notice that the graph shows the average time out of 300 different samples of data. As expected, the average time for the ALL function is less than the average time for the UNIQUE function. For the ALL function, some of the tests run almost instantaneously, whereas others require longer run times. The method that calls the UNIQUE function has less variation, although the variation is not shown in this graph.

In terms of absolute time, both methods require only a few milliseconds. In relative terms, however, the ALL method is much faster, and the relative difference increases as the size of the data increases.

Notice that the program demonstrates a useful trick. The ALL function runs much faster than the time required to generate a random vector with a million elements. Therefore the time required to generate a vector and determine whether it is constant is dominated by generating the data. To estimate ONLY the time spent by the ALL and UNIQUE functions, you can either pre-compute the data or you can estimate how long it takes to generate the data and subtract that estimate from the total time. Because this particular test requires generating 300 vectors with 1 million elements, pre-computing and storing the vectors would require a lot of RAM, therefore I used the estimation trick.

In conclusion, when you are timing an algorithm in a vector-matrix language, follow the best practices in this article. The most important tip is to call the method multiple times and plot the average (as I have done here) or plot the distribution of times (not shown). Plotting the average gives you an estimate for the expected time whereas plotting the distribution enables you to estimate upper and lower bounds on the times.

SAS/IML programmers can find additional examples of timing performance in the following articles:

The post 6 tips for timing the performance of algorithms appeared first on The DO Loop.

23
Jan

## Five reasons to check out the new SAS analytical documentation

The SAS analytical documentation has a new look.

Beginning with the 14.2 release of the SAS analytical products (which shipped with SAS 9.4m4 in November 2016), the HTML version of the online documentation has moved to a new framework called the Help Center. The URL for the online documentation is easy to remember:
http://support.sas.com/documentation/

This article shows the 14.2 documentation for the SAS analytical products, as highlighted in the adjacent image. Documentation for previous releases is also available.

The 14.2 link takes you to a new page that contains links for the User's Guides for each SAS analytical product, such as SAS/STAT, SAS/ETS, SAS/IML, and so on. When you click on a User's Guide, you are taken to the new Help Center.

An example page for the SAS/STAT documentation is shown in the following image (click to enlarge). As in previous versions of the help system, the Help Center provides drop-down lists (Overview, Getting Started, Syntax, etc.) for quick navigation within a procedure. There are also arrows (now in the upper right corner) that take you to the previous or next page in the book.

The following list describes five features of the Help Center that are either new or that extend features of the older HTML format. The locations of these features are highlighted with red rectangles in the previous image.

Five new features in the #SAS analytical documentation
Click To Tweet

1. What's New in 14.2: No matter how the information is delivered, content is king. I really appreciate that SAS publishes a "What's New" chapter, which highlights new features and enhancements for each product. For long-time users of the software, the "What's New" chapter is the first place to go to determine what new options are available in the new release of your favorite SAS product.
2. Toggle the Table of Contents (TOC): Like the older HTML documentation, the new Help Center shows the TOC for each book and chapter in a left-side pane. The Help Center enables you to toggle the TOC by clicking the icon in the upper left corner (or use CTRL+SHIFT+C). Removing the TOC provides more screen area for the documentation. This is especially important on a small display, such as on a laptop or tablet, where the TOC is hidden by default. Click the icon again to restore the TOC pane.
3. Enhanced search within a book: In the older HTML doc, search results appear on a separate HTML page. In the new Help Center, the search facility (click the magnifying class icon or CTRL+SHIFT+S) displays the results in a pop-up scrollable window as shown in the adjacent image. When you click a search result, the Help Center updates to display the new page. If you search again, the search window remembers your previous query. If you want to close the search window, press the ESC key. The search facility supports complex expressions, Boolean operators, and proximity searches.
4. Links to the SAS Sample Library: My favorite new feature is that the documentation now links directly to the SAS Sample Library. For decades, SAS has distributed the Sample Library, which provides the complete data and programming statements to reproduce advanced examples. However, many SAS programmers do not know how to access the SAS Sample Library. The new 14.2 documentation now links directly to the sample programs for each analytical procedure. Simply click the link at the top of each example to see the complete data and sample program.
5. Links to Videos: Some SAS programmers prefer watch videos to learn about new features in SAS. Over the past few years, SAS R&D has recorded dozens of video presentations about the analytical procedures. Now the documentation contains links to these videos, which often provide an overview of a procedure, analysis, or set of options. The Videos tab appears for chapters that contain videos.

In summary, the new Help Center framework provides additional ways for SAS customers to learn about the syntax, options, and output of SAS analytical procedures. At the present time, only analytical products use the Help Center. The documentation for Base SAS continues to be provided in HTML and PDF formats.

Check out the SAS analytical products 14.2 documentation and let me know what you think. Do you like something that I didn't mention? Post a comment.

tags: Tips and Techniques

The post Five reasons to check out the new SAS analytical documentation appeared first on The DO Loop.

16
Jan

## PUT it there! Six tips for using PUT and %PUT statements in SAS

For SAS programmers, the PUT statement in the DATA step and the %PUT macro statement are useful statements that enable you to display the values of variables and macro variables, respectively. By default, the output appears in the SAS log. This article shares a few tips that help you to use these statements more effectively.

### Tip 1: Display the name and value of a variable

The PUT statement supports a "named output" syntax that enables you to easily display a variable name and value. The trick is to put an equal sign immediately after the name of a variable: PUT varname=; For example, the following statement displays the text "z=" followed by the value of z:

```data _null_; x = 9.1; y = 6; z = sqrt(x**2 + y**2); put z=; /* display variable and value */ run;```
`z=10.9`

### Tip 2: Display values of arrays

You can extend the previous tip to arrays and to sets of variables. The PUT statement enables you to display elements of an array (or multiple variables) by specifying the array name in parentheses, followed by an equal sign in parentheses, as follows:

```data _null_; array x[5]; do k = 1 to dim(x); x[k] = k**2; end; put (x[*]) (=); /* put each element of array on separate lines */ put (x1 x3 x5) (=); /* put each variable/value on separate lines */ run;```
```x1=1 x2=4 x3=9 x4=16 x5=25 x1=1 x3=9 x5=25```

This syntax is not supported for _TEMPORARY_ arrays. However, as a workaraoun, you can use the CATQ function to concatenate array values into a character variable, as follows:

```temp = catq('d', ',', of x[*]); /* x can be _TEMPORARY_ array */ put temp=;```

Incidentally, if you ever want to apply a format to the values, the format name goes inside the second set of parentheses, after the equal sign: put (x1 x3 x5) (=6.2);

### Tip 3: Display values on separate lines

The previous tip displayed all values on a single line. Sometimes it is useful to display each value on its own line. To do that, put a slash after the equal sign, as follows:

```... put (x[*]) (=/); /* put each element on separate lines */ ...```
```x1=1 x2=4 x3=9 x4=16 x5=25```

### Tip 4: Display all name-value pairs

You can display all values of all variables by using the _ALL_ keyword, as follows:

```data _null_; x = 9.1; y = 6; z = sqrt(x**2 + y**2); A = "SAS"; B = "Statistics"; put _ALL_; /* display all variables and values */ run;```
`x=9.1 y=6 z=10.9 A=SAS B=Statistics _ERROR_=0 _N_=1`

Notice that in addition to the user-defined variables, the _ALL_ keyword also prints the values of two automatic variables named _ERROR_ and _N_.

### Tip 5: Display the name and value of a macro variable

Just as the PUT statement displays the value of an ordinary variable, you can use the %PUT statement to display the value of a macro variable. If you use the special "&=" syntax, SAS will display the name and value of a macro variable. For example, to display your SAS version, you can display the value of the SYSVLONG automatic system macro variable, as follows:

`%put &=SYSVLONG;`
```SYSVLONG=9.04.01M4P110916
```

The results above are for my system, which is running SAS 9.4M4. Your SAS version might be different.

### Tip 6: Display all name-value pairs for macros

You can display the name and value of all user-defined macros by using the _USER_ keyword. You can display the values of all SAS automatic system macros by using the _AUTOMATIC_ keyword.

```%let N = 50; %let NumSamples = 1e4; %put _USER_;```
```GLOBAL N 50
GLOBAL NUMSAMPLES 1e4
```

### Conclusion and References

There you have it: six tips to make it easier to display the value of SAS variables and macro variables. Thanks to Jiangtang Hu who pointed out the %PUT &=var syntax in his blog in 2012. For additional features of the PUT and %PUT statements, see:

The post PUT it there! Six tips for using PUT and %PUT statements in SAS appeared first on The DO Loop.

30
Nov

## Append data to add markers to SAS graphs

Do you want to create customized SAS graphs by using PROC SGPLOT and the other ODS graphics procedures? An essential skill that you need to learn is how to merge, join, append, and concatenate SAS data sets that come from different sources. The SAS statistical graphics procedures (SG procedures) enable you to overlay all kinds of customized curves, markers, and bars. However, the SG procedures expect all the data for a graph to be in a single SAS data set. Therefore it is often necessary to append two or more data sets before you can create a complex graph.

This article discusses two ways to combine data sets in order to create ODS graphics. An alternative is to use the SG annotation facility to add extra curves or markers to the graph. Personally, I prefer to use the techniques in this article for simple features, and reserve annotation for adding highly complex and non-standard features.

### Overlay curves

In a previous article, I discussed how to structure a SAS data set so that you can overlay curves on a scatter plot.

The diagram at the right shows the main idea of that article. The X and Y variables contain the original data, which are the coordinates for a scatter plot. Secondary information was appended to the end of the data. The X1 and Y1 variables contain the coordinates of a custom scatter plot smoother. The X2 and Y2 variables contain the coordinates of a different scatter plot smoother.

This structure enables you to use the SGPLOT procedure to overlay two curves on the scatter plot. You use a SCATTER statement and two SERIES statements to create the graph. See the previous article for details.

### Overlay markers: Wide form

In addition to overlaying curves, I sometimes want to add special markers to the scatter plot. In this article I will show how to add a marker that shows the location of the sample mean. This article shows how to use PROC MEANS to create an output data set that contains the coordinates of the sample mean, then append that data set to the original data.

Add special markers to a graph using PROC SGPLOT #SASTip
Click To Tweet

The following statements use PROC MEANS to compute the sample mean for four variables in the SasHelp.Iris data set, which contains the measurements for 150 iris flowers. To emphasize the general syntax of this computation, I use macro variables, but that is not necessary:

```%let DSName = Sashelp.Iris; %let VarNames = PetalLength PetalWidth SepalLength SepalWidth;   proc means data=&DSName noprint; var &VarNames; output out=Means(drop=_TYPE_ _FREQ_) mean= / autoname; run;```

The AUTONAME option on the OUTPUT statement tells PROC MEANS to append the name of the statistic to the variable names. Thus the output data set contains variables with names like PetalLength_Mean and SepalWidth_Mean. As shown in the diagram in the previous section, this enables you to append the new data to the end of the old data in "wide form" as follows:

```data Wide; set &DSName Means; /* add four new variables; pad with missing values */ run;   ods graphics / attrpriority=color subpixel; proc sgplot data=Wide; scatter x=SepalWidth y=PetalLength / legendlabel="Data"; ellipse x=SepalWidth y=PetalLength / type=mean; scatter x=SepalWidth_Mean y=PetalLength_Mean / legendlabel="Sample Mean" markerattrs=(symbol=X color=firebrick); run;```

The first SCATTER statement and the ELLIPSE statement use the original data. Recall that the ELLIPSE statement draws an approximate confidence ellipse for the mean of the population. The second SCATTER statement uses the sample means, which are appended to the end of the original data. The second SCATTER statement draws a red marker at the location of the sample mean.

You can use this same method to plot other sample statistics (such as the median) or to highlight special values such as the origin of a coordinate system.

### Overlay markers: Long form

In some situations it is more convenient to append the secondary data in "long form." In the long form, the secondary data set contains the same variable names as in the original data. You can use the SAS data step to create a variable that identifies the original and supplementary observations. This technique can be useful when you want to show multiple markers (sample mean, median, mode, ...) by using the GROUP= option on one SCATTER statement.

The following call to PROC MEANS does not use the AUTONAME option. Therefore the output data set contains variables that have the same name as the input data. You can use the IN= data set option to create an ID variable that identifies the data from the computed statistics:

```/* Long form. New data has same name but different group ID */ proc means data=&DSName noprint; var &VarNames; output out=Means(drop=_TYPE_ _FREQ_) mean=; run;   data Long; set &DSName Means(in=newdata); if newdata then GroupID = "Mean"; else GroupID = "Data"; run;```

The DATA step created the GroupID variable, which has the values "Data" for the original observations and the value "Mean" for the appended observations. This data structure is useful for calling PROC SGSCATTER, which supports the GROUP= option, but does not support multiple PLOT statements, as follows:

```ods graphics / attrpriority=none; proc sgscatter data=Long datacontrastcolors=(steelblue firebrick) datasymbols=(Circle X); plot (PetalLength PetalWidth)*(SepalLength SepalWidth) / group=groupID; run;```

In conclusion, this article demonstrates a useful technique for adding markers to a graph. The technique requires that you concatenate the original data with supplementary data. Appending and merging data is a technique that is used often when creating ODS statistical graphics in SAS. It is a great technique to add to your programming toolbox.

The post Append data to add markers to SAS graphs appeared first on The DO Loop.

31
Oct

## Counting observations for which two events occur

Every year near Halloween I write an article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick (which features the CMISS function and the crossproducts matrix in SAS/IML) enables you to count the number of observations that are missing for pairs of columns. In other words, this trick counts how many missing values are in columns 1 and 2, columns 1 and 3, and so on for all "k choose 2" pairwise combinations of k columns.

This trick generalizes, so keep reading even if you don't care about counting missing values. Given ANY indicator matrix, you can use this trick to count the pairwise occurrence of events.

Counting pairwise events: A connection to matrix multiplication #Statistics
Click To Tweet

### Counting missing values

Of course, you can count combinations of missing values without using SAS/IML or matrices. I've previously shown how to use PROC MI to count all combinations of missing values among a set of k variables. However, as I said, this SAS/IML trick enables you to count more than just missing values: it generalizes to count ANY pairwise event.

To get started, read data into a SAS/IML matrix. The following statements read 5209 observations and eight variables into the matrix X:

```proc iml; varNames = {"AgeAtStart" "Diastolic" "Systolic" "Height" "Weight" "MRW" "Smoking" "Cholesterol"}; use Sashelp.Heart; /* open data set */ read all var varNames into X; /* create numeric data matrix, X */ close Sashelp.Heart;```

The first part of the trick is to convert the matrix into a 0/1 indicator matrix, where 1 indicates that a data value is missing and 0 indicates nonmissing. I like to use the CMISS function in Base SAS for this task.

The second part of the trick is to recognize that if C is an indicator matrix, the crossproducts matrix P=C`C is a matrix that gives the counts of the pairwise events in C. The element P[i,j] is the inner product of the i_th and j_th columns of C, and because C is an indicator matrix the inner product counts the number of simultaneous "events" for column i and column j.

In the SAS/IML language, the code to compute pairwise counts of missing values requires only two lines:

```C = cmiss(X); /* create 0/1 indicator matrix */ P = C`*C; /* pairwise counts */ print P[c=varNames r=varNames label="Pairwise Missing Counts"];```

The P matrix is symmetric. The diagonal elements P[i,i] show the number of missing values for the i_th variable. For example, the Height, Weight, and MRW variables have 6 missing values whereas the Cholesterol variable has 152 missing values. The off-diagonal elements P[i,j] show the number of observations that are simultaneously missing for the i_th and j_th variables. For example, 28 observations have missing values for both the Smoking and Cholesterol variables; two observations have missing values for both the Height and Weight variables.

### Counting any pair of events

If C is any indicator matrix, the crossproducts matrix P=C`C counts the number of observations for which two columns of C are simultaneously 1.

For example, the following statements standardize the data to create z-scores. You can form an indicator matrix that has the value 1 if a z-score is exceeds 3 in absolute value; these are outliers for normally distributed data. The crossproducts matrix counts the number of observations that are univariate outliers (on the diagonal) and the number that are outliers for a pair of variables:

```Z = (X - mean(X)) / std(X); /* standardize data */ L = (abs(Z) > 3); /* L indicates outliers */ Q = L`*L; /* counts for pairs of outliers */ print Q[c=varNames r=varNames label="Pairwise Outliers"];```

The table indicates that 52 patients have a diastolic blood pressure (BP) that is more than three standard deviations above the mean. Similarly, 83 patients are outliers for systolic BP, and 35 patients are outliers for both measurements.

The following graph visualizes the patients that have high blood pressure in each category:

```Outlier = j(nrow(L), 1, "No "); /* create grouping variable */ Outlier[ loc(L[,2] | L[,3]) ] = "Univariate"; Outlier[ loc(L[,2] & L[,3]) ] = "Bivariate"; title "Standardized Blood Pressure"; title2 "Outliers More than Three Standard Deviations from Mean"; call scatter(Z[,2], Z[,3]) group=Outlier label={"Standardized Diastolic" "Standardized Systolic"} option="transparency=0.5 markerattrs=(symbol=CircleFilled)" other="refline 3/axis=x; refline 3/axis=y;";```

In conclusion, you can create an indicator matrix for any set of conditions. By forming the crossproducts matrix, you get the counts of all observations that satisfy each condition individually and in pairs. I think this trick is a real treat!

The post Counting observations for which two events occur appeared first on The DO Loop.