27
Sep

Data-driven simulation

In a large simulation study, it can be convenient to have a "control file" that contains the parameters for the study. My recent article about how to simulate multivariate normal clusters demonstrates a simple example of this technique. The simulation in that article uses an input data set that contains the parameters (mean, standard deviations, and correlations) for the simulation. A SAS procedure (PROC SIMNORMAL) simulates data based on the parameters in the input data set.

This is a powerful paradigm. Instead of hard-coding the parameters in the program (or as macro variables), the parameters are stored in a data set that is processed by the program. This is sometimes called data-driven programming. (Some people call it dynamic programming, but there is an optimization technique of the same name so I will use the term "data-driven.") In a data-driven program, when you want to run the program with new parameters, you merely modify the data set that contains the control parameters.

I have previously written about a different way to control a batch program by passing in parameters on the command line when you invoke the SAS program.

Static programming and hard-coded parameters

Before looking at data-driven programming, let's review the static approach. I will simulate clusters of univariate normal data as an example.

Suppose that you want to simulate normal data for three different groups. Each group has its own sample size (N), mean, and standard deviation. In my book Simulating Data with SAS (p. 206), I show how to simulate this sort of ANOVA design by using arrays, as follows.

/* Static simulation: Parameters embedded in the simulation program */
data AnovaStatic;
/* define parameters for three simulated groups */
array N[3]       _temporary_ (50,   50,   50);   /* sample sizes */
array Mean[3]    _temporary_ (14.6, 42.6, 55.5); /* center for each group */
array StdDev[3]  _temporary_ ( 1.7,  4.7,  5.5); /* spread for each group */
 
call streaminit(12345);
do k = 1 to dim(N);              /* for each group */
   do i = 1 to N[k];             /* simulate N[k] observations */
      x = rand("Normal", Mean[k], StdDev[k]); /* from k_th normal distribution */
      output;
   end;
end;
run;

The DATA step contains two loops, one for the groups and the other for the observations within each group. The parameters for each group are stored in arrays. Notice that if you want to change the parameters (including the number of groups), you need to edit the program. I call this method "static programming" because the behavior of the program is determined at the time that the program is written. This is a perfectly acceptable method for most applications. It has the advantage that you know exactly what the program will do by looking at the program.

Data-driven programming: Put parameters in a file

An alternative is to put the parameters for each group into a file or data set. If the k_th row in the data set contains the parameters for the k_th group, then the implicit loop in the DATA step will iterate over all groups, regardless of the number of groups. The following DATA step creates the parameters for three groups, which are read and processed by the second DATA step. The parameter values are the same as for the static example, but are transposed and processed row-by-row instead of via arrays:

/* Data-driven simulation: Parameters in a data set, processed by the simulation program */
data params;                     /* define parameters for each simulated group */
input N Mean StdDev;
datalines; 
50 14.6 1.7
50 42.6 4.7
50 55.5 5.5
;
 
data AnovaDynamic;
call streaminit(12345);
set params;                      /* implicit loop over groups k=1,2,... */
do i = 1 to N;                   /* simulate N[k] observations */
   x = rand("Normal", Mean, StdDev); /* from k_th normal distribution */
   output;
end;
run;

Notice the difference between the static and dynamic techniques. The static technique simulates data from three groups whose parameters are specified in temporary arrays. The dynamic technique simulates data from an arbitrary number of groups. Currently, the PARAMS data specifies three groups, but if I change the PARAMS data set to represent 10 or 1000 groups, the AnovaDynamic DATA step will simulate data from the new design without any modification.

Generate the parameters from real data

The data-driven technique is useful when the parameters are themselves the results of an analysis. For example, a common simulation technique is to compute the moments of real data (mean, variance, skewness, and so forth) and to use those statistics in place of the population parameters that they estimate. (See Chapter 16, "Moment Matching," in Simulating Data with SAS.)

The following call to PROC MEANS generates the sample mean and standard deviation for real data and writes those values to a data set:

proc means data=sashelp.iris N Mean StdDev stackods;
   class Species;
   var PetalLength;
   ods output Summary=params;
run;

The ODS OUTPUT statement creates a PARAMS data set that contains the variables (N, MEAN, and STDDEV) that are read by the simulation program. Therefore, you can immediately run the AnovaDynamic DATA step to simulate normal data from the sample statistics. A visualization of the resulting simulated data is shown below.
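
A minimal sketch of one way to create such a graph is shown below; it assumes that the Species variable from the PARAMS data set is carried into the AnovaDynamic data set by the SET statement, so you can use it to identify the groups:

proc sgplot data=AnovaDynamic;
   vbox x / category=Species;   /* box plot of simulated values for each group */
run;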

You can run PROC MEANS on other data and other variables and the AnovaDynamic step will continue to work without any modification. The simulation is controlled entirely by the values in the "control file," which is the PARAMS data set.

You can generalize this technique by wrapping the program in a SAS macro in which the name of the parameter file and the name of the simulated data set are provided at run time. With a macro implementation, you can read from multiple input files and write to multiple output data sets. You could use such a macro, for example, to break up a large simulation study into smaller independent sub-simulations, each controlled by its own file of input parameters. In a gridded environment, each sub-simulation can be processed independently and in parallel, thus reducing the total time required to complete the study.
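
The following sketch shows one possible implementation. The macro name, parameter names, and default seed are hypothetical; they merely illustrate the idea:

/* Hypothetical macro: the name and parameters are for illustration only */
%macro SimAnova(ParamData=, OutData=, Seed=12345);
data &OutData;
   call streaminit(&Seed);
   set &ParamData;                  /* implicit loop over groups */
   do i = 1 to N;
      x = rand("Normal", Mean, StdDev);
      output;
   end;
run;
%mend SimAnova;
 
/* example call: simulate from the PARAMS control file */
%SimAnova(ParamData=params, OutData=AnovaSim1);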

Although this article discusses control files in the context of statistical simulation, other applications are possible. Have you used a similar technique to control a program by using an input file that contains the parameters for the program? Leave a comment.

The post Data-driven simulation appeared first on The DO Loop.

21
Aug

6 tips for timing the performance of algorithms

When you implement a statistical algorithm in a vector-matrix language such as SAS/IML, R, or MATLAB, you should measure the performance of your implementation, which means that you should time how long a program takes to analyze data of varying sizes and characteristics. There are some general tips that can help you eliminate bottlenecks in your program so that your program is fast as lightning! In fact, it is a little bit frightening how quickly you can become an expert in timing.

The following general principles apply regardless of the language that you use to implement the algorithm:

  1. Use simulation to construct test data of varying sizes. By simulating data, you can easily vary the size of the data while preserving distributional characteristics. For example, you might want to test an algorithm on data that are normally distributed and contain 25, 50, 75, and 100 thousand observations. (Each language has different techniques to simulate data efficiently. For SAS software, see Simulating Data with SAS.)
  2. Vary the distribution of the test data. Concentrate mainly on how the algorithm performs on typical data, but also test examples of "best case" and "worst case" scenarios for which the algorithm might run very quickly or very slowly. If the performance is affected by other factors such as the proportion of outliers or missing values, then include those factors in the study.
  3. If possible, construct timing tests that run between 0.1 and 1 seconds. This length is long enough to be reliably measured but short enough that you can run many tests. If you try to time very short intervals (less than a millisecond), you will discover that the operating system constantly performs unrelated background tasks that can interfere with your timings, which leads to noisy measurements. For ultra-short intervals (less than a microsecond), you can encounter an "uncertainty principle" phenomenon in which the very act of measuring the performance of an algorithm affects the performance that you are trying to measure.
  4. To reduce timing noise, call the routine multiple times and report the mean or median. If the performance can vary greatly (±20% or more), report the distribution of times by using quantiles or by displaying a box plot.
  5. Use a "burn-in" call to reduce the impact of overhead costs. The first time a routine is called, it might require the operating system to load DLLs or allocate a huge block of memory. When the routine is called a second time, the operating system might have cached the needed information. If so, subsequent calls could be considerably faster than the first.
  6. Use a line plot or box plots to present the performance results, where the vertical axis represents time and the horizontal axis represents the size of the data. If the performance characteristics depend on other factors, you can represent those factors by overlaying multiple lines or by constructing a panel of graphs.

Timing the performance of algorithms in SAS/IML

This article is motivated by an R programmer who was trying to find the fastest way (on average) to check whether a vector contains more than one unique value. The programmer was timing the performance of five different methods in R in hopes of finding the fastest. I will use the same example, but will examine only two SAS/IML methods:

  • The ALL function in the SAS/IML language tests whether all elements of a vector are equal to some specified value. Thus the expression ALL(x = x[1]) is true only when all elements of a vector are the same.
  • The UNIQUE function in the SAS/IML language returns an ordered list of the distinct elements of a vector. Thus the expression (ncol(UNIQUE(x))=1) is true only when a vector contains one distinct value.

The ALL function should quickly discard non-constant vectors because it is a "short-circuiting" operation. That is, as soon as it detects two different values, it returns 0 (false). If you have a vector with 100,000 elements, the ALL function might only examine a small number of elements to determine that the vector is not constant. In contrast, the UNIQUE function should be relatively slow: it always examines all the elements, and it allocates memory to return a sorted list of all unique values.

The following program illustrates many of the best practices. It constructs random binary vectors that contain between 100,000 and 1 million elements. Most elements are 0, but there is a small probability that an element could be 1. Some of the vectors will contain a 1 near the beginning of the vector (the best case for the ALL function), others will contain a 1 near the end (the worst case for ALL).

proc iml;
/* TIP: "Burn-in" by calling each important function once before you start timing */
   N = 1e5;                            /* define a size for the burn-in call */
   x = randfun(N, "Bernoulli", 1/N);   /* Simulate data for size N */
   r = all(x = x[1]);          /* method 1: The ALL function */
   r = (ncol(unique(x)) = 1);  /* method 2: The UNIQUE function */
/* end burn-in */
 
sizes = do(2,10,2)*1e5;        /* TIP: choose sizes so test runs in reasonable time */
NReps = 300;                   /* TIP: call functions multiple times */
 
TotalTime = j(ncol(sizes), 2);  /* result matrix: for each size, save average time */
do j = 1 to ncol(sizes);        /* TIP: use vectors of different sizes */
   N = sizes[j];
   x = j(N, 1);
   /* TRICK: estimate time to generate the random vectors; subtract that time later */
   t0 = time();
      do i = 1 to NReps;  call randgen(x, "Bernoulli", 1/N);  end;
   tRand = time() - t0;
 
   /* Method 1: time for NReps calls */
   t0 = time();
   do i = 1 to NReps;
      call randgen(x, "Bernoulli", 1/N);   /* TIP: simulate data for size */
      r = all(x = x[1]);                   /* Method 1: The ALL function */
   end;
   TotalTime[j,1] = time() - t0 - tRand;   /* subtract time to generate random numbers */
 
   /* Method 2: time for NReps calls */
   t0 = time();
   do i = 1 to NReps;
      call randgen(x, "Bernoulli", 1/N);   /* TIP: simulate data for size */
      r = (ncol(unique(x)) = 1);           /* Method 2: The UNIQUE function */
   end;
   TotalTime[j,2] = time() - t0 - tRand;   /* subtract time to generate random numbers */
end;
AvgTime = TotalTime / NReps;               /* compute average time per call */
print AvgTime[c={ALL UNIQUE} r=(putn(sizes,'comma9.')) format=6.4];
 
/* TIP: create a series plot that overlays both curves */
Size = Sizes` // Sizes`;                   /* convert from wide to long data */
Time = AvgTime[,1] // AvgTime[,2];
Group = j(ncol(sizes), 1, "ALL") // j(ncol(sizes), 1, "UNIQUE");
title;
call series(Size, Time) group=Group grid={x y} 
            option="curvelabel" label={"Size", "Average Time per Call"};

The results are shown to the right. The graph shows the average time to determine whether the elements of a vector of size N are unique for N in the range [1e5, 1e6]. Notice that the graph shows the average time out of 300 different samples of data. As expected, the average time for the ALL function is less than the average time for the UNIQUE function. For the ALL function, some of the tests run almost instantaneously, whereas others require longer run times. The method that calls the UNIQUE function has less variation, although the variation is not shown in this graph.

In terms of absolute time, both methods require only a few milliseconds. In relative terms, however, the ALL method is much faster, and the relative difference increases as the size of the data increases.

Notice that the program demonstrates a useful trick. The ALL function runs much faster than the time required to generate a random vector with a million elements. Therefore the time required to generate a vector and determine whether it is constant is dominated by generating the data. To estimate ONLY the time spent by the ALL and UNIQUE functions, you can either pre-compute the data or you can estimate how long it takes to generate the data and subtract that estimate from the total time. Because this particular test requires generating 300 vectors with 1 million elements, pre-computing and storing the vectors would require a lot of RAM, therefore I used the estimation trick.

In conclusion, when you are timing an algorithm in a vector-matrix language, follow the best practices in this article. The most important tip is to call the method multiple times and plot the average (as I have done here) or plot the distribution of times (not shown). Plotting the average gives you an estimate for the expected time whereas plotting the distribution enables you to estimate upper and lower bounds on the times.

SAS/IML programmers can find additional examples of timing performance in the following articles:

The post 6 tips for timing the performance of algorithms appeared first on The DO Loop.

23
Jan

Five reasons to check out the new SAS analytical documentation

The SAS analytical documentation has a new look.

Beginning with the 14.2 release of the SAS analytical products (which shipped with SAS 9.4m4 in November 2016), the HTML version of the online documentation has moved to a new framework called the Help Center. The URL for the online documentation is easy to remember:
http://support.sas.com/documentation/

This article shows the 14.2 documentation for the SAS analytical products, as highlighted in the adjacent image. Documentation for previous releases is also available.

The 14.2 link takes you to a new page that contains links for the User's Guides for each SAS analytical product, such as SAS/STAT, SAS/ETS, SAS/IML, and so on. When you click on a User's Guide, you are taken to the new Help Center.

An example page for the SAS/STAT documentation is shown in the following image (click to enlarge). As in previous versions of the help system, the Help Center provides drop-down lists (Overview, Getting Started, Syntax, etc.) for quick navigation within a procedure. There are also arrows (now in the upper right corner) that take you to the previous or next page in the book.

SAS Documentation in the Help Center

The following list describes five features of the Help Center that are either new or that extend features of the older HTML format. The locations of these features are highlighted with red rectangles in the previous image.


  1. What's New in 14.2: No matter how the information is delivered, content is king. I really appreciate that SAS publishes a "What's New" chapter, which highlights new features and enhancements for each product. For long-time users of the software, the "What's New" chapter is the first place to go to determine what new options are available in the new release of your favorite SAS product.
  2. Toggle the Table of Contents (TOC): Like the older HTML documentation, the new Help Center shows the TOC for each book and chapter in a left-side pane. The Help Center enables you to toggle the TOC by clicking the icon in the upper left corner (or use CTRL+SHIFT+C). Removing the TOC provides more screen area for the documentation. This is especially important on a small display, such as on a laptop or tablet, where the TOC is hidden by default. Click the icon again to restore the TOC pane.
  3. Enhanced search within a book: In the older HTML doc, search results appear on a separate HTML page. In the new Help Center, the search facility (click the magnifying glass icon or press CTRL+SHIFT+S) displays the results in a pop-up scrollable window, as shown in the adjacent image. When you click a search result, the Help Center updates to display the new page. If you search again, the search window remembers your previous query. To close the search window, press the ESC key. The search facility supports complex expressions, Boolean operators, and proximity searches.
  4. Links to the SAS Sample Library: My favorite new feature is that the documentation now links directly to the SAS Sample Library. For decades, SAS has distributed the Sample Library, which provides the complete data and programming statements to reproduce advanced examples. However, many SAS programmers do not know how to access the SAS Sample Library. The new 14.2 documentation now links directly to the sample programs for each analytical procedure. Simply click the link at the top of each example to see the complete data and sample program.
  5. Links to Videos: Some SAS programmers prefer to watch videos to learn about new features in SAS. Over the past few years, SAS R&D has recorded dozens of video presentations about the analytical procedures. Now the documentation contains links to these videos, which often provide an overview of a procedure, analysis, or set of options. The Videos tab appears for chapters that contain videos.

In summary, the new Help Center framework provides additional ways for SAS customers to learn about the syntax, options, and output of SAS analytical procedures. At the present time, only analytical products use the Help Center. The documentation for Base SAS continues to be provided in HTML and PDF formats.

Check out the SAS analytical products 14.2 documentation and let me know what you think. Do you like something that I didn't mention? Post a comment.

tags: Tips and Techniques

The post Five reasons to check out the new SAS analytical documentation appeared first on The DO Loop.

16
Jan

PUT it there! Six tips for using PUT and %PUT statements in SAS

For SAS programmers, the PUT statement in the DATA step and the %PUT macro statement are useful statements that enable you to display the values of variables and macro variables, respectively. By default, the output appears in the SAS log. This article shares a few tips that help you to use these statements more effectively.

Tip 1: Display the name and value of a variable

The PUT statement supports a "named output" syntax that enables you to easily display a variable name and value. The trick is to put an equal sign immediately after the name of a variable: PUT varname=; For example, the following statement displays the text "z=" followed by the value of z:

data _null_;
x = 9.1; y = 6; z = sqrt(x**2 + y**2);
put z=;           /* display variable and value */
run;
z=10.9

Tip 2: Display values of arrays

You can extend the previous tip to arrays and to sets of variables. The PUT statement enables you to display elements of an array (or multiple variables) by specifying the array name in parentheses, followed by an equal sign in parentheses, as follows:

data _null_;
array x[5];
do k = 1 to dim(x);
   x[k] = k**2;
end;
put (x[*]) (=);     /* display each element of the array */
put (x1 x3 x5) (=); /* display the specified variables and values */
run;
x1=1 x2=4 x3=9 x4=16 x5=25
x1=1 x3=9 x5=25

This syntax is not supported for _TEMPORARY_ arrays. However, as a workaround, you can use the CATQ function to concatenate the array values into a character variable, as follows:

temp = catq('d', ',', of x[*]);         /* x can be _TEMPORARY_ array */
put temp=;
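
For example, here is a minimal self-contained sketch that displays the values of a temporary array:

data _null_;
array x[5] _temporary_ (1 4 9 16 25);
temp = catq('d', ',', of x[*]);   /* concatenate the values, separated by commas */
put temp=;
run;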

Incidentally, if you ever want to apply a format to the values, the format name goes inside the second set of parentheses, after the equal sign: put (x1 x3 x5) (=6.2);
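
For example, the following short sketch writes each name-value pair by using the 6.2 format:

data _null_;
x1 = 1; x3 = 9; x5 = 25;
put (x1 x3 x5) (=6.2);
run;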

Tip 3: Display values on separate lines

The previous tip displayed all values on a single line. Sometimes it is useful to display each value on its own line. To do that, put a slash after the equal sign, as follows:

...
put (x[*]) (=/);                   /* put each element on separate lines */
...
x1=1
x2=4
x3=9
x4=16
x5=25

Tip 4: Display all name-value pairs

You can display all values of all variables by using the _ALL_ keyword, as follows:

data _null_;
x = 9.1; y = 6; z = sqrt(x**2 + y**2);
A = "SAS"; B = "Statistics";
put _ALL_;              /* display all variables and values */
run;
x=9.1 y=6 z=10.9 A=SAS B=Statistics _ERROR_=0 _N_=1

Notice that in addition to the user-defined variables, the _ALL_ keyword also prints the values of two automatic variables named _ERROR_ and _N_.

Tip 5: Display the name and value of a macro variable

Just as the PUT statement displays the value of an ordinary variable, you can use the %PUT statement to display the value of a macro variable. If you use the special "&=" syntax, SAS will display the name and value of a macro variable. For example, to display your SAS version, you can display the value of the SYSVLONG automatic system macro variable, as follows:

%put &=SYSVLONG;
SYSVLONG=9.04.01M4P110916

The results above are for my system, which is running SAS 9.4M4. Your SAS version might be different.

Tip 6: Display all name-value pairs for macros

You can display the name and value of all user-defined macros by using the _USER_ keyword. You can display the values of all SAS automatic system macros by using the _AUTOMATIC_ keyword.

%let N = 50;
%let NumSamples = 1e4;
%put _USER_;
GLOBAL N 50
GLOBAL NUMSAMPLES 1e4
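
Similarly, the following statement displays the name and value of every automatic system macro variable. The output is lengthy, so it is not shown here:

%put _AUTOMATIC_;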

Conclusion and References

There you have it: six tips to make it easier to display the value of SAS variables and macro variables. Thanks to Jiangtang Hu who pointed out the %PUT &=var syntax in his blog in 2012. For additional features of the PUT and %PUT statements, see:

tags: SAS Programming, Tips and Techniques

The post PUT it there! Six tips for using PUT and %PUT statements in SAS appeared first on The DO Loop.

30
Nov

Append data to add markers to SAS graphs

Do you want to create customized SAS graphs by using PROC SGPLOT and the other ODS graphics procedures? An essential skill that you need to learn is how to merge, join, append, and concatenate SAS data sets that come from different sources. The SAS statistical graphics procedures (SG procedures) enable you to overlay all kinds of customized curves, markers, and bars. However, the SG procedures expect all the data for a graph to be in a single SAS data set. Therefore it is often necessary to append two or more data sets before you can create a complex graph.

This article discusses two ways to combine data sets in order to create ODS graphics. An alternative is to use the SG annotation facility to add extra curves or markers to the graph. Personally, I prefer to use the techniques in this article for simple features, and reserve annotation for adding highly complex and non-standard features.

Overlay curves

In a previous article, I discussed how to structure a SAS data set so that you can overlay curves on a scatter plot.

The diagram at the right shows the main idea of that article. The X and Y variables contain the original data, which are the coordinates for a scatter plot. Secondary information was appended to the end of the data. The X1 and Y1 variables contain the coordinates of a custom scatter plot smoother. The X2 and Y2 variables contain the coordinates of a different scatter plot smoother.

This structure enables you to use the SGPLOT procedure to overlay two curves on the scatter plot. You use a SCATTER statement and two SERIES statements to create the graph. See the previous article for details.
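
A minimal sketch of such a call follows. The data set name (All) and the curve labels are hypothetical; the variable names come from the diagram:

proc sgplot data=All noautolegend;
   scatter x=x y=y;                              /* original data */
   series x=x1 y=y1 / curvelabel="Smoother 1";   /* first overlaid curve */
   series x=x2 y=y2 / curvelabel="Smoother 2";   /* second overlaid curve */
run;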

Overlay markers: Wide form

In addition to overlaying curves, I sometimes want to add special markers to the scatter plot. In this article I show how to add a marker that indicates the location of the sample mean: use PROC MEANS to create an output data set that contains the coordinates of the sample mean, then append that data set to the original data.


The following statements use PROC MEANS to compute the sample mean for four variables in the SasHelp.Iris data set, which contains the measurements for 150 iris flowers. To emphasize the general syntax of this computation, I use macro variables, but that is not necessary:

%let DSName = Sashelp.Iris;
%let VarNames = PetalLength PetalWidth SepalLength SepalWidth;
 
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean= / autoname;
run;

The AUTONAME option on the OUTPUT statement tells PROC MEANS to append the name of the statistic to the variable names. Thus the output data set contains variables with names like PetalLength_Mean and SepalWidth_Mean. As shown in the diagram in the previous section, this enables you to append the new data to the end of the old data in "wide form" as follows:

data Wide;
   set &DSName Means; /* add four new variables; pad with missing values */
run;
 
ods graphics / attrpriority=color subpixel;
proc sgplot data=Wide;
scatter x=SepalWidth y=PetalLength / legendlabel="Data";
ellipse x=SepalWidth y=PetalLength / type=mean;
scatter x=SepalWidth_Mean y=PetalLength_Mean / 
         legendlabel="Sample Mean" markerattrs=(symbol=X color=firebrick);
run;
Scatter plot with markers for sample means

The first SCATTER statement and the ELLIPSE statement use the original data. Recall that the ELLIPSE statement draws an approximate confidence ellipse for the mean of the population. The second SCATTER statement uses the sample means, which are appended to the end of the original data. The second SCATTER statement draws a red marker at the location of the sample mean.

You can use this same method to plot other sample statistics (such as the median) or to highlight special values such as the origin of a coordinate system.
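
For example, a minimal sketch that appends and plots the sample medians (rather than the means) might look like the following; only the statistic keyword and the marker attributes change:

proc means data=&DSName noprint;
var &VarNames;
output out=Medians(drop=_TYPE_ _FREQ_) median= / autoname;
run;
 
data WideMedian;
   set &DSName Medians;   /* append medians; pad with missing values */
run;
 
proc sgplot data=WideMedian;
scatter x=SepalWidth y=PetalLength / legendlabel="Data";
scatter x=SepalWidth_Median y=PetalLength_Median / 
         legendlabel="Sample Median" markerattrs=(symbol=X color=darkgreen);
run;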

Overlay markers: Long form

In some situations it is more convenient to append the secondary data in "long form." In the long form, the secondary data set contains the same variable names as in the original data. You can use the SAS data step to create a variable that identifies the original and supplementary observations. This technique can be useful when you want to show multiple markers (sample mean, median, mode, ...) by using the GROUP= option on one SCATTER statement.

The following call to PROC MEANS does not use the AUTONAME option. Therefore the output data set contains variables that have the same name as the input data. You can use the IN= data set option to create an ID variable that identifies the data from the computed statistics:

/* Long form. New data has same name but different group ID */
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean=;
run;
 
data Long;
set &DSName Means(in=newdata);
if newdata then 
   GroupID = "Mean";
else GroupID = "Data";
run;

The DATA step created the GroupID variable, which has the values "Data" for the original observations and the value "Mean" for the appended observations. This data structure is useful for calling PROC SGSCATTER, which supports the GROUP= option, but does not support multiple PLOT statements, as follows:

ods graphics / attrpriority=none;
proc sgscatter data=Long 
   datacontrastcolors=(steelblue firebrick)
   datasymbols=(Circle X);
plot (PetalLength PetalWidth)*(SepalLength SepalWidth) / group=groupID;
run;
Scatter plot matrix with markers for sample means

In conclusion, this article demonstrates a useful technique for adding markers to a graph. The technique requires that you concatenate the original data with supplementary data. Appending and merging data is a technique that is used often when creating ODS statistical graphics in SAS. It is a great technique to add to your programming toolbox.

tags: SAS Programming, Statistical Graphics, Tips and Techniques

The post Append data to add markers to SAS graphs appeared first on The DO Loop.

31
Oct

Counting observations for which two events occur

Every year near Halloween I write an article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick (which features the CMISS function and the crossproducts matrix in SAS/IML) enables you to count the number of observations that are missing for pairs of columns. In other words, this trick counts how many missing values are in columns 1 and 2, columns 1 and 3, and so on for all "k choose 2" pairwise combinations of k columns.

This trick generalizes, so keep reading even if you don't care about counting missing values. Given ANY indicator matrix, you can use this trick to count the pairwise occurrence of events.


Counting missing values

Of course, you can count combinations of missing values without using SAS/IML or matrices. I've previously shown how to use PROC MI to count all combinations of missing values among a set of k variables. However, as I said, this SAS/IML trick enables you to count more than just missing values: it generalizes to count ANY pairwise event.

To get started, read data into a SAS/IML matrix. The following statements read 5209 observations and eight variables into the matrix X:

proc iml;
varNames = {"AgeAtStart" "Diastolic" "Systolic" 
            "Height" "Weight" "MRW" "Smoking" "Cholesterol"};
use Sashelp.Heart;               /* open data set */
read all var varNames into X;    /* create numeric data matrix, X */
close Sashelp.Heart;

The first part of the trick is to convert the matrix into a 0/1 indicator matrix, where 1 indicates that a data value is missing and 0 indicates nonmissing. I like to use the CMISS function in Base SAS for this task.

The second part of the trick is to recognize that if C is an indicator matrix, the crossproducts matrix P=C`C is a matrix that gives the counts of the pairwise events in C. The element P[i,j] is the inner product of the i_th and j_th columns of C, and because C is an indicator matrix the inner product counts the number of simultaneous "events" for column i and column j.

In the SAS/IML language, the code to compute pairwise counts of missing values requires only two lines:

C = cmiss(X);                    /* create 0/1 indicator matrix */
P = C`*C;                        /* pairwise counts */
print P[c=varNames r=varNames label="Pairwise Missing Counts"];

The P matrix is symmetric. The diagonal elements P[i,i] show the number of missing values for the i_th variable. For example, the Height, Weight, and MRW variables have 6 missing values whereas the Cholesterol variable has 152 missing values. The off-diagonal elements P[i,j] show the number of observations that are simultaneously missing for the i_th and j_th variables. For example, 28 observations have missing values for both the Smoking and Cholesterol variables; two observations have missing values for both the Height and Weight variables.

Counting any pair of events

If C is any indicator matrix, the crossproducts matrix P=C`C counts the number of observations for which two columns of C are simultaneously 1.

For example, the following statements standardize the data to create z-scores. You can form an indicator matrix that has the value 1 if a z-score exceeds 3 in absolute value; these are outliers for normally distributed data. The crossproducts matrix counts the number of observations that are univariate outliers (on the diagonal) and the number that are outliers for a pair of variables:

Z = (X - mean(X)) / std(X);      /* standardize data */
L = (abs(Z) > 3);                /* L indicates outliers */
Q = L`*L;                        /* counts for pairs of outliers */
print Q[c=varNames r=varNames label="Pairwise Outliers"];

The table indicates that 52 patients have a diastolic blood pressure (BP) that is more than three standard deviations above the mean. Similarly, 83 patients are outliers for systolic BP, and 35 patients are outliers for both measurements.

The following graph visualizes the patients that have high blood pressure in each category:

Outlier = j(nrow(L), 1, "No         ");       /* create grouping variable */
Outlier[ loc(L[,2] | L[,3]) ] = "Univariate";
Outlier[ loc(L[,2] & L[,3]) ] = "Bivariate";
title "Standardized Blood Pressure";
title2 "Outliers More than Three Standard Deviations from Mean";
call scatter(Z[,2], Z[,3]) group=Outlier 
         label={"Standardized Diastolic" "Standardized Systolic"}
         option="transparency=0.5 markerattrs=(symbol=CircleFilled)"
         other="refline 3/axis=x; refline 3/axis=y;";

In conclusion, you can create an indicator matrix for any set of conditions. By forming the crossproducts matrix, you get the counts of all observations that satisfy each condition individually and in pairs. I think this trick is a real treat!

tags: Data Analysis, Matrix Computations, Tips and Techniques

The post Counting observations for which two events occur appeared first on The DO Loop.

22
Jun

Use the EFFECTPLOT statement to visualize regression models in SAS

An effect plot (created by using the EFFECTPLOT statement) that visualizes a complex regression model

Graphs enable you to visualize how the predicted values for a regression model depend on the model effects. You can gain an intuitive understanding of a model by using the EFFECTPLOT statement in SAS to create graphs like the one shown at the top of this article.

Many SAS regression procedures automatically create ODS graphics for simple regression models. For more complex models (including interaction effects and link functions), you can use the EFFECTPLOT statement to construct effect plots. An effect plot shows the predicted response as a function of certain covariates while other covariates are held constant.


The EFFECTPLOT statement was introduced in SAS 9.22, but it is not as well known as it should be. Although many procedures include an EFFECTPLOT statement as part of their syntax, I will use the PLM procedure (PLM = post-linear modeling) to show how to construct effect plots. I have previously shown how to use the PLM procedure to score regression models. A good introduction to the PLM procedure is Tobias and Cai (2010), "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models."

The data for this article is the Sashelp.BWeight data set, which is distributed with SAS. There are 50,000 records. Each row gives information about the birth weight of a baby, including information about the mother. This article uses the following variables:

  • MomAge: The mothers were between the ages of 18 and 45. The MomAge variable is centered at the mean age, which is 27. Thus MomAge=-7 means the mother was 20 years old whereas MomAge=5 means that the mother was 32 years old.
  • CigsPerDay: The average number of cigarettes per day that the mother smoked during pregnancy.
  • Boy: An indicator variable. If the baby was a boy, then Boy=1; otherwise Boy=0.

The following DATA step creates a SAS view that creates an indicator variable, Underweight, which has the value 1 if the baby's birth weight was less than 2500 grams and 0 otherwise:

/* Underweight=1 if the birth weight is <2500 grams and Underweight=0 otherwise */
data babyWeight / view=BabyWeight;
   set sashelp.bweight;
   Underweight = (Weight < 2500);
run;

A logistic model with a continuous-continuous interaction

To illustrate the capabilities of the EFFECTPLOT statement, the following statements use PROC LOGISTIC to model the probability of having an underweight boy baby (less than 2500 grams). The explanatory effects are MomAge, CigsPerDay, and the interaction effect between those two variables. The STORE statement creates an item store called logiModel. The item store is read by PROC PLM, which creates the effect plot:

proc logistic data=babyWeight;
   where Boy=1;                  /* restrict to baby boys */
   model Underweight(event='1') = MomAge | CigsPerDay;
   store logiModel;
run;
 
title "Probability of Underweight Boy Baby";
proc plm source=logiModel;
   effectplot fit(x=MomAge plotby=CigsPerDay);
run;
Effect plot (created by using the EFFECTPLOT statement): Predicted probability of underweight boy by mother's age and daily cigarettes

In this example, the output is a panel of plots that show the predicted probability of having an underweight boy baby as a function of the mother's relative age. (Remember: the age is centered at 27 years.) The panel shows slices of the continuous CigsPerDay variable, which enables you to see how the predicted response changes with increasing cigarette use.

The graphs indicate that the probability of an underweight boy is very low in nonsmoking mothers, regardless of the mother's age. In smoking mothers, however, the probability of having an underweight boy increases with age. For mothers of a given age, the probability of an underweight boy increases with the number of cigarettes smoked.

The example shows a panel of fit plots, where the paneling variable is determined by the PLOTBY= option. You can also "stack" the predicted probability curves by using a slice plot. You can specify a slice plot by using the SLICEFIT keyword. You specify the slicing variable by using the SLICEBY= option, as follows:

proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=CigsPerDay);
run;

An example of a slice plot is shown in the next section.

You can also use the EFFECTPLOT statement to create a contour plot of the predicted response as a function of the two continuous covariates, which is also shown in the next section.

A logistic model with categorical-continuous interactions

The effect plot is especially useful when visualizing complex models. When there are several independent variables and interactions, you can create multiple plots that show the predicted response at various levels of categorical or continuous variables. By default, covariates that do not appear in the plots are fixed at their mean level (for continuous variables) or their reference level (for classification variables).

The previous example used a WHERE clause to restrict the data to boy babies. Suppose that you want to include the gender of the baby as a covariate in the regression model. The following call to PROC LOGISTIC includes the main effects and two-way interactions between two continuous and one classification variable. The call to PROC PLM creates a panel of slice plots. Each slice plot shows predicted probability curves for slices of the CigsPerDay variable. The panels are determined by levels of the Boy variable, which is specified on the PLOTBY= option:

proc logistic data=babyWeight;
   class Boy;
   model Underweight(event='1') = MomAge | CigsPerDay | Boy @2;
   store logiModel;
run;
 
proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=CigsPerDay plotby=Boy);
run;

The output is shown in the graph at the top of this article. The right side of the panel shows the predicted probabilities for boys. These curves are similar to those in the previous example, but now they are overlaid on a single plot. The left side of the panel shows the corresponding curves for girl babies. In general, the model predicts that girl babies have a higher probability to be underweight (relative to boys) in smoking mothers. The effect is noticeable most dramatically for younger mothers.

If you want to add confidence limits for the predicted curves, you can use the CLM option: effectplot slicefit(...) / CLM.
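
For example, a complete call might look like this sketch:

proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=CigsPerDay plotby=Boy) / clm;
run;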

You can specify the levels of a continuous variable that are used to slice or panel the curves. For example, most cigarettes come in a pack of 20, so the following EFFECTPLOT statement visually compares the effect of smoking for pregnant women who smoke zero, one, or two packs per day:

   effectplot slicefit(x=MomAge sliceby=CigsPerDay=0 20 40 plotby=Boy);

Notice that there are no parentheses around the argument to the SLICEBY= option. That is, you might expect the syntax to be sliceby=(CigsPerDay=0 20 40), but that syntax is not supported.

If you want to directly compare the probabilities for boys and girls, you might want to interchange the SLICEBY= and PLOTBY= variables. The following statements create a graph that has three panels, and each panel directly compares boys and girls:

proc plm source=logiModel;
   effectplot slicefit(x=MomAge sliceby=boy plotby=CigsPerDay=0 20 40);
run;

As mentioned previously, you can also create contour plots that display the predicted response as a function of two continuous variables. The following statements create two contour plots, one for boy babies and one for girls:

proc plm restore=logiModel;
   effectplot contour(x=MomAge y=CigsPerDay plotby=Boy);
run;
An effect plot (created by using the EFFECTPLOT statement) that visualizes the response surface for each level of a categorical variable

Summary of the EFFECTPLOT statement

The EFFECTPLOT statement enables you to create plots that visualize interaction effects in complex regression models. The EFFECTPLOT statement is a hidden gem in SAS/STAT software that deserves more recognition. The easiest way to create an effect plot is to use the STORE statement in a regression procedure to create an item store, then use PROC PLM to create effect plots. In that way, you only need to fit a model once, but you can create many plots that help you to understand the model.

You can overlay curves, create panels, and even create contour plots. Several other plot types are also possible. See the documentation for the EFFECTPLOT statement for the full syntax, options, and additional examples of how to create plots that visualize interactions in generalized linear models.

tags: Statistical Graphics, Statistical Programming, Tips and Techniques

The post Use the EFFECTPLOT statement to visualize regression models in SAS appeared first on The DO Loop.

6
Jun

How to write CONTRAST and ESTIMATE statements in SAS regression procedures

I got several positive comments about a recent tip, "How to fit a variety of logistic regression models in SAS." A reader asked if I knew any other similar resources about statistical analysis in SAS.

Absolutely! One gem that comes to mind is "Examples of writing CONTRAST and ESTIMATE statements." SAS statistical programmers often ask how to write CONTRAST and ESTIMATE statements on discussion forums such as the SAS Support Community for Statistical Procedures.


The Knowledge Base article features regression models that you might encounter in PROC GLM, PROC LOGISTIC, and PROC GENMOD. The article includes the following topics:

  • How to express certain hypotheses as linear combinations of regression coefficients.
  • Why you must know the order of parameters for classification variables to properly write CONTRAST and ESTIMATE statements.
  • How to write CONTRAST and ESTIMATE statements for interaction effects.
  • How to specify linear combinations that include fractions like 1/3 or 1/6 that cannot be expressed as a terminating decimal value.
  • How to estimate or test contrasts of log odds in logistic models that use either GLM or EFFECT (deviation from the mean) encodings.
  • How to use the CONTRAST statement to compare nested models.

The article is written for a technical audience, and the examples are complex. However, if you are a statistically sophisticated analyst, then "Examples of Writing CONTRAST and ESTIMATE Statements" is an excellent tutorial. Bookmark this page in case you need it!

And if you still have questions after reading the article, remember that the SAS Support Community is just a click away.

tags: Tips and Techniques

The post How to write CONTRAST and ESTIMATE statements in SAS regression procedures appeared first on The DO Loop.

11
May

Ten tips before you run an optimization

Optimization is a primary tool of computational statistics. SAS/IML software provides a suite of nonlinear optimizers that makes it easy to find an optimum for a user-defined objective function. You can perform unconstrained optimization, or define linear or nonlinear constraints for constrained optimization.

Over the years I have seen many questions about optimization (which is also called nonlinear programming, or NLP) posted to discussion forums such as the SAS/IML Support Community. A typical question describes a program that produces an error, will not converge, or runs exceedingly slowly. Such problems can be frustrating, and some programmers express their frustration by punctuating their questions with multiple exclamation points:

  • "Why is the Newton-Raphson (NLPNRA) routine broken?!!!"
  • "Why is SAS/IML taking forever to solve this problem????!!!!!"

To paraphrase Shakespeare, in most cases the fault lies not in our optimization routines, but in ourselves. The following checklist describes 10 common SAS/IML issues that lead to errors or lack of convergence in nonlinear optimization. If you check this list before attempting to optimize a function, then you increase the chances that the SAS/IML optimization routines will produce an optimal solution quickly and efficiently.


  1. Thoroughly test and debug the objective function before you optimize. This is the most common reason that people post questions to a discussion forum. The programmer reports that the optimization routine is giving an error message, but the real problem is that there is a logical error in the objective function. When the optimizer calls the objective function, BLAM!—the program stops with an error. The solution is to test the objective function before you call the optimizer. Make sure that the objective function produces correct results for a range of possible input values.

  2. Vectorize the objective function, when possible. The objective function will be called many times. Therefore make sure it is as efficient as possible. In a matrix-vector language such as SAS/IML, you should vectorize, which means replacing loops by using a vector computation.

  3. Provide a good initial guess. Unfortunately, many programmers don't put much thought into the initial guess. They use a vector of zeros or ones or assign some arbitrary values. The initial guess is important. Newton's method converges quadratically to a solution if the initial guess is close to the solution. A good guess might converge in a few iterations, whereas a bad guess might not converge at all or might require dozens of iterations. You can use graphical techniques to find an initial guess. If you are running a constrained optimization, ensure that the initial guess satisfies the constraints.

  4. Specify derivatives correctly. Some optimizers (such as Newton's method) use derivative information to find an optimum of a smooth objective function. If your optimization is running slowly or not converging, you might want to specify the analytical derivatives. I've written a previous article about how to specify derivatives and ensure that they are correct. If you do not write a derivative function, SAS/IML will use finite differences. Finite differences require evaluating the objective function many times, which can be slow. For objective functions that are extremely flat near the optimum, finite differences can be inaccurate or lead to convergence issues.

  5. Use global variables to specify constants. In the SAS/IML language, the objective function for the NLP routines accepts a single vector of parameters. The optimizer tries to adjust the values of those parameters in order to optimize the objective function, subject to any constraints. But sometimes the objective function requires additional (constant) parameters. For example, in maximum likelihood estimation, the objective function needs access to the data. When you define the objective function, use the GLOBAL clause to specify the data and other unchanging parameters. (See the sketch that follows this list.)

  6. Check your data and your model. Sometimes there is nothing wrong with the objective function; it is the data that is causing the problem. In maximum likelihood estimation (MLE), the data is part of the optimization. Degenerate data, constant data, collinearities, and singular matrices can arise when you attempt to fit an inappropriate model to data. Lack of convergence could mean that the model does not fit the data. Furthermore, not every MLE problem has a solution! Use a combination of research and visualization to make sure that a solution exists and is not degenerate.

  7. During the debugging phase, print the iteration history. When you use the NLP routines in SAS/IML, you create a vector of options that controls the optimizer algorithm. The second element of this vector specifies the amount of output that you want to create. When I am debugging a problem, I usually specify a value of 4, which means that the optimizer prints the iteration history and other details. By examining the output, you can often discover whether the initial guess is converging to an optimal value, is bouncing around, or is diverging. If the optimization is not converging, examining the iteration history is often the first step to understanding why.

  8. Define constraints correctly. In many statistical problems, parameters are constrained. For example, in maximum likelihood estimation, scale parameters must be positive. The simplest linear-constraint matrix has two rows and a column for each parameter. The first row represents the lower bound for each parameter and the second row represents the upper bound. Put the value 0 in the first row for parameters that must be nonnegative, or use a small number such as 1E-6 if you need to bound the parameter away from zero. The SAS/IML documentation shows how you can add additional rows to specify additional constraints. For nonlinear constraints, you can use the NLC= option to specify a module that defines the feasible region.

  9. Trap out-of-domain conditions in the objective function. When you use the NLC= option to define nonlinear constraints, the final solution should lie in the feasible region. However, the optimizer is entitled to evaluate the objective function in the infeasible region. Consequently, if the objective function contains logs, square roots, or other functions that are undefined in some regions of parameter space, you should trap and handle bad parameter values within the objective function. If you are maximizing the objective function, return a large negative value (such as -1E6) when the input parameters are outside the domain of the function.

  10. Do not attempt 10,000 optimizations until you can successfully complete one optimization. Sometimes optimizations are part of a simulation study. Make sure that the optimization is robust and completes successfully for one simulated sample. Then try five samples. Then 100. Time the performance of your simulation on 100 samples and use that time to estimate the run time for 10,000 samples. If you encounter a simulated sample for which the optimization does not converge, save that sample so that you can carefully analyze it and discover why the optimization failed. Remember that you can check the return code from the NLP routines to determine if the optimization converged.
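
To make some of these tips concrete, the following sketch shows a small maximum likelihood estimation that passes data through a GLOBAL clause (tip 5), prints the iteration history (tip 7), bounds the scale parameter away from zero (tip 8), and traps out-of-domain parameter values (tip 9). The data set, module name, and initial guess are arbitrary choices for illustration:

proc iml;
use Sashelp.Class;  read all var {Height} into x;  close;   /* example data */
 
start NormalLL(param) global(x);            /* tip 5: pass data via the GLOBAL clause */
   mu = param[1];  sigma = param[2];
   if sigma <= 0 then return(-1E6);         /* tip 9: trap out-of-domain values */
   return( sum(logpdf("Normal", x, mu, sigma)) );
finish;
 
con = {. 1E-6, . .};     /* tip 8: row 1 = lower bounds (sigma > 0); row 2 = upper bounds (none) */
p0  = {60 10};           /* tip 3: plausible initial guess */
opt = {1 4};             /* maximize; tip 7: print the iteration history */
call nlpnra(rc, result, "NormalLL", p0, opt, con);   /* Newton-Raphson optimization */
print rc, result[colname={"mu" "sigma"}];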

Optimization can be challenging, but you can increase the chance of success if you follow the tips and strategies in this checklist. These tips help you to avoid common pitfalls. By following these suggestions, your SAS/IML optimization will be more efficient, more accurate, and less frustrating.

tags: Optimization, Tips and Techniques

The post Ten tips before you run an optimization appeared first on The DO Loop.

24
Feb

Four ways to create a design matrix in SAS

SAS programmers sometimes ask, "How do I create a design matrix in SAS?" A design matrix is a numerical matrix that represents the explanatory variables in regression models. In simple models, the design matrix contains one column for each continuous variable and multiple columns (called dummy variables) for each classification variable.

I previously wrote about how to create dummy variables in SAS by using the GLMMOD procedure to create binary indicator variables for each categorical variable. But PROC GLMMOD is not the only way to generate design matrices in SAS. This article demonstrates four SAS procedures that create design matrices: GLMMOD, LOGISTIC, TRANSREG, and GLIMMIX. Of the four, the LOGISTIC procedure provides the most flexibility for creating design matrices and also supports an easy-to-use syntax.

How categorical variables are represented in a design matrix in SAS

The CLASS statement in a SAS procedure specifies categorical variables that should be replaced by dummy variables when forming the design matrix. The process of forming columns in a design matrix is called parameterization or encoding. The three most popular parameterizations are the GLM encoding, the EFFECT encoding, and the REFERENCE encoding. For a detailed explanation of these encodings, see the section "Parameterization of Model Effects" in the SAS/STAT documentation. For applications and interpretation of different parameterizations, see Pasta (2005).

The following DATA step creates an example data set with 10 observations. It has three fixed effects: one continuous variable (Cholesterol) and two categorical variables. One categorical variable (Sex) has two levels and the other (BP_Status) has three levels. It also has a categorical variable (HospitalID) that will be used as a random effect.

data Patients;
   HospitalID = mod(_N_, 4);
   keep HospitalID Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
 
proc print; run;
Example data set for creating design matrices in SAS

PROC GLMMOD: Design matrices that use the GLM encoding

The simplest way to create dummy variables is by using the GLMMOD procedure, which can produce a basic design matrix with GLM encoding. The GLM encoding is a singular parameterization in which each categorical variable is represented by k binary variables, where k is the number of levels in the variable. There is also an intercept column that has all 1s. The GLMMOD procedure uses a syntax that is identical to the MODEL statement in PROC GLM, so it is very easy to create interaction effects. See my previous article for an example of how to use PROC GLMMOD to create a design matrix and how the singular parameterization affects parameter estimates in regression.
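
For reference, here is a minimal sketch that generates the GLM-encoded design matrix for the Patients data; the output data set names are arbitrary:

proc glmmod data=Patients outdesign=GLMDesign outparm=GLMParm;
   class Sex BP_Status;
   model Cholesterol = Sex BP_Status;
run;
 
proc print data=GLMDesign; run;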

PROC LOGISTIC: Design matrices for any parameterization

You can also create a design matrix in SAS by using the LOGISTIC procedure. The PROC LOGISTIC statement supports a DESIGNONLY option, which prevents the procedure from running the analysis. Instead it only forms the design matrix and writes it to a data set. By default, PROC LOGISTIC uses the EFFECT encoding for classification variables, but you can use the PARAM= option on the CLASS statement to specify any parameterization.

A drawback of using PROC LOGISTIC is that you must supply a binary response variable on the MODEL statement, which might require you to run an extra DATA step. The following DATA step creates a view that contains a variable that has the constant value 0. This variable is used on the left-hand side of the MODEL statement in PROC LOGISTIC, but is dropped from the design matrix:

data Temp / view=Temp;
   set Patients;
   FakeY = 0;
run;
 
proc logistic data=Temp outdesign=EffectDesign(drop=FakeY) outdesignonly;
   class sex BP_Status / param=effect; /* also supports REFERENCE & GLM encoding */
   model FakeY = Cholesterol Sex BP_Status;
run;
 
proc print data=EffectDesign; run;
Design matrix in SAS with effect encoding

The design matrix shows the effect encoding, which uses –1 to indicate the reference level, which by default is the last level in alphabetical order. The name of a dummy variable is the concatenation of the original variable name and a level. For example, the Sex variable is replaced by the dummy variable named SexFemale, which has the value 1 to represent females and –1 to represent the reference level ("Male"). The BP_Status variable is replaced by two variables. The BP_StatusHigh variable contains 1 for patients that have high blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise. Similarly, the BP_StatusNormal dummy variable has the value 1 for patients with normal blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise.

The effect encoding produces k-1 columns for a categorical variable that has k levels. This results in a nonsingular design matrix.

You can use the REF= option after each classification variable to specify the reference level. You can also use the PARAM= option on the CLASS statement to specify a different parameterization. For example, the following statements create a design matrix that uses the REFERENCE parameterization. The reference level for the Sex variable is set to "Female" and the reference level for the BP_Status variable is set to "Normal."

proc logistic data=Temp outdesign=RefDesign(drop=FakeY) outdesignonly;
   class sex(ref="Female") BP_Status(ref="Normal") / param=reference; 
   model FakeY = Sex BP_Status;
run;
 
proc print data=RefDesign; run;

Parameterizations affect the way that parameter estimates are interpreted in a regression analysis. For the reference encoding, parameter estimates of main effects indicate the difference of each level as compared to the effect of the reference level. For the effect encoding, the comparison is to the average effect over all levels.

PROC TRANSREG: Design matrices and a macro for variable names

Using PROC LOGISTIC is very flexible, but it has two drawbacks: You have to create a fake response variable, and you have to look at the output data set to discover the names of the dummy variables. In contrast, PROC TRANSREG does not require that you specify a response variable when you generate the design matrix. Furthermore, the procedure creates a macro variable (&_TRGIND, for "TRANSREG indicator" variables) that contains the names of the columns of the design matrix. Another nice feature is that the output data set contains the original variables, and you can use the ID variable to output additional variables.

However, the syntax for the TRANSREG procedure is different from most other SAS regression procedures. Instead of a CLASS statement, you specify classification effects in a CLASS() transformation list. You can use the ZERO= option to control reference levels, and the procedure supports the GLM and EFFECT parameterization, but not the REFERENCE encoding. The following statements show an example that generates a design matrix with the effect encoding:

proc transreg data=Patients design;
   model identity(Cholesterol) 
         class(Sex BP_Status / EFFECT zero="Female" "Normal");
   output out=B;
run;
 
proc print data=B; 
   var Intercept &_TrgInd; 
run;

The output is not shown because it is identical to the EffectDesign data set in the previous section. Notice that the output is displayed by using the &_TRGIND macro variable. For details about generating design matrices, see the TRANSREG documentation section "Using the DESIGN Output Option."

PROC GLIMMIX: Design matrices for fixed and random effects

PROC GLIMMIX enables you to construct two design matrices: one for the fixed effects and another for the random effects. The PROC GLIMMIX statement supports an OUTDESIGN= option that you can use to specify the output data set and a NOFIT option that ensures that the procedure will not try to fit the model.

The following statements create an output data set that contains two design matrices:

proc glimmix data=Patients outdesign(names novar)=MixedDesign nofit;
   class sex BP_Status HospitalID;
   model Cholesterol = Sex BP_Status;
   random HospitalID;
   ods select ColumnNames;
run;
 
proc print data=MixedDesign; run;
Design matrix in SAS for fixed and random effects

Dummy variables for the fixed effects are prefixed by "_X" and dummy variables for the random effects are prefixed by "_Z." Two additional tables (not shown) associate the levels of the original variables with the columns of the design matrices.

The GLIMMIX procedure uses only the GLM parameterization. Consequently, there is little advantage to using PROC GLIMMIX instead of PROC GLMMOD. You can generate the same designs by calling PROC GLMMOD twice, once for the fixed effects and once for the random effects.

Summary

In summary, SAS provides four procedures that you can use to generate design matrices for continuous variables, classification variables, and their interactions. The GLMMOD procedure is ideal for creating design matrices that use the GLM encoding. PROC LOGISTIC supports all encodings in SAS and provides an easy-to-use syntax for specifying interactions. PROC TRANSREG supports fewer parameterizations, but does not require that you manufacture a response variable. Lastly, the GLIMMIX procedure produces design matrices for both fixed and random effects.

tags: Data Analysis, Tips and Techniques

The post Four ways to create a design matrix in SAS appeared first on The DO Loop.
