How to use FIRST.variable and LAST.variable in a BY-group analysis in SAS

My article about the difference between CLASS variables and BY variables in SAS focused on SAS analytical procedures. However, the BY statement is also useful in the SAS DATA step where it is used to merge data sets and to analyze data at the group level. When you use the BY statement in the DATA step, the DATA step creates two temporary indicator variables for each variable in the BY statement. The names of these variables are FIRST.variable and LAST.variable, where variable is the name of a variable in the BY statement. For example, if you use the statement BY Sex, then the names of the indicator variables are FIRST.Sex and LAST.Sex.

This article gives several examples of using the FIRST.variable and LAST.variable indicator variables for BY-group analysis in the SAS DATA step. The first example shows how to compute counts and cumulative amounts for each BY group. The second example shows how to compute the time between the first and last visit of a patient to a clinic, as well as the change in a measured quantity between the first and last visit. BY-group processing in the DATA step is a fundamental operation that belongs in every SAS programmer's tool box.

Use FIRST. and LAST. variables to count the size of groups

The first example uses data from the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. The data are distributed with SAS. The following call to PROC SORT extracts the Smoking_Status and Weight variables and sorts the data by the Smoking_Status variable:

proc sort data=Sashelp.Heart(keep=Smoking_Status Weight) out=Heart;
   by Smoking_Status;
run;

Because the data are sorted by the Smoking_Status variable, you can use the FIRST.Smoking_Status and LAST.Smoking_Status temporary variables to count the number of observations in each level of the Smoking_Status variable. (PROC FREQ computes the same information, but does not require sorted data.) When you use the BY Smoking_Status statement, the DATA step automatically creates the FIRST.Smoking_Status and LAST.Smoking_Status indicator variables. As its name implies, the FIRST.Smoking_Status variable has the value 1 for the first observation in each BY group and the value 0 otherwise. (More correctly, the value is 1 for the first record and for records for which the Smoking_Status variable is different than it was for the previous record.) Similarly, the LAST.Smoking_Status indicator variable has the value 1 for the last observation in each BY group and 0 otherwise.

The following DATA step defines a variable named Count and initializes Count=0 at the beginning of each BY group. For every observation in the BY group, the Count variable is incremented by 1. When the last record in each BY group is read, that record is written to the Count data set.

data Count;
   set Heart;                 /* data are sorted by Smoking_Status */
   BY Smoking_Status;         /* automatically creates indicator vars */
   if FIRST.Smoking_Status then
      Count = 0;              /* initialize Count at beginning of each BY group */
   Count + 1;                 /* increment Count for each record */
   if LAST.Smoking_Status;    /* output only the last record of each BY group */
run;

proc print data=Count noobs; 
   format Count comma10.;
   var Smoking_Status Count;
run;
Use FIRST.variable and LAST.variable to count the size of groups

The same technique enables you to accumulate values of a variable within a group. For example, you can accumulate the total weight of all patients in each smoking group by using the following statements:

if FIRST.Smoking_Status then
   cumWt = 0;
cumWt + Weight;

This same technique can be used to accumulate revenue from various sources, such as departments, stores, or regions.
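For example, the following sketch accumulates the total revenue within each region. The Sales data set and the Region and Revenue variables are hypothetical names used for illustration; the data are assumed to be sorted by Region:

data RegionTotals;
   set Sales;                 /* hypothetical data set, sorted by Region */
   by Region;
   if FIRST.Region then
      TotalRevenue = 0;       /* reset the accumulator for each region */
   TotalRevenue + Revenue;    /* accumulate revenue within the region */
   if LAST.Region;            /* output one summary record per region */
run;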

Use FIRST. and LAST. variables to compute duration of treatment

Another common use of the FIRST.variable and LAST.variable indicator variables is to determine the length of time between a patient's first visit and his last visit. Consider the following DATA step, which defines the dates and weights for four male patients who visited a clinic as part of a weight-loss program:

data Patients;
informat Date date7.;
format Date date7. PatientID Z4.;
input PatientID Date Weight @@;
datalines;
1021 04Jan16  302  1042 06Jan16  285
1053 07Jan16  325  1063 11Jan16  291
1053 01Feb16  299  1021 01Feb16  288
1063 09Feb16  283  1042 16Feb16  279
1021 07Mar16  280  1063 09Mar16  272
1042 28Mar16  272  1021 04Apr16  273
1063 20Apr16  270  1053 28Apr16  289
1053 13May16  295  1063 31May16  269
;

For these data, you can sort by the patient ID and by the date of visit. After sorting, the first record for each patient contains the first visit to the clinic and the last record contains the last visit. You can subtract the patient's weight for these dates to determine how much the patient gained or lost during the trial. You can also use the INTCK function to compute the elapsed time between visits. If you want to measure time in days, you can simply subtract the dates, but the INTCK function enables you to compute duration in terms of years, months, weeks, and other time units.

proc sort data=Patients;
   by PatientID Date;
run;

data weightLoss;
   set Patients;
   BY PatientID;
   retain startDate startWeight;                 /* RETAIN the starting values */
   if FIRST.PatientID then do;
      startDate = Date; startWeight = Weight;    /* remember the initial values */
   end;
   if LAST.PatientID then do;
      endDate = Date; endWeight = Weight;
      elapsedDays = intck('day', startDate, endDate); /* elapsed time (in days) */
      weightLoss = startWeight - endWeight;           /* weight loss */
      AvgWeightLoss = weightLoss / elapsedDays;       /* average weight loss per day */
      output;                                         /* output only the last record in each group */
   end;
run;

proc print data=weightLoss noobs; 
   var PatientID elapsedDays startWeight endWeight weightLoss AvgWeightLoss;
run;
Use FIRST.variable and LAST.variable to compute elapsed time and average quantities in a group

The output data set summarizes each patient's activities at the clinic, including his average weight loss and the duration of his treatment.

Some programmers think that the FIRST.variable and LAST.variable indicator variables require that the data be sorted, but that is not true. The temporary variables are created whenever you use a BY statement in a DATA step. You can use the NOTSORTED option on the BY statement to process records regardless of the sort order.
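For example, the following sketch counts the observations in each run of identical Smoking_Status values without sorting the data. Keep in mind that with the NOTSORTED option a "group" is a run of adjacent records that have the same BY value, so ungrouped data can produce more groups than there are distinct values:

data RunCounts;
   set Sashelp.Heart(keep=Smoking_Status);  /* data are not sorted */
   by Smoking_Status NOTSORTED;             /* FIRST./LAST. flag changes in the BY value */
   if FIRST.Smoking_Status then Count = 0;
   Count + 1;
   if LAST.Smoking_Status;                  /* one record per run of identical values */
run;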


In summary, the BY statement in the DATA step automatically creates two indicator variables. You can use the variables to determine the first and last record in each BY group. Typically the FIRST.variable indicator is used to initialize summary statistics and to remember the initial values of measurement. The LAST.variable indicator is used to output the result of the computations, which often includes simple descriptive statistics such as a sum, difference, maximum, minimum, or average values.

BY-group processing in the DATA step is a common topic that is presented at SAS conferences. Some authors use FIRST.BY and LAST.BY as the name of the indicator variables. For further reading, I recommend the paper "The Power of the BY Statement" (Choate and Dunn, 2007). SAS also provides several samples about BY-group processing in the SAS DATA step.



The difference between CLASS statements and BY statements in SAS

When I first learned to program in SAS, I remember being confused about the difference between CLASS statements and BY statements. A novice SAS programmer recently asked when to use one instead of the other, so this article explains the difference between the CLASS statement and BY variables in SAS procedures.

The BY statement and the CLASS statement in SAS both enable you to specify one or more categorical variables whose levels define subgroups of the data. (For simplicity, we consider only a single categorical variable.) The primary difference is that the BY statement computes many analyses, each on a subset of the data, whereas the CLASS statement computes a single analysis of all the data. Specifically,

  • The BY statement repeats an analysis on every subgroup. The subgroups are treated as independent samples. If a BY variable defines k groups, the output will contain k copies of every table and graph: one copy for the first group, one copy for the second group, and so on.
  • The CLASS statement enables you to include a categorical variable as part of an analysis. Often the CLASS variable is used to compare the groups, such as in a t test or an ANOVA analysis. In regression models, the CLASS statement enables you to estimate parameters for the levels of a categorical variable, thereby estimating the effect of each level on the response. Another use of a CLASS variable is to define categories for a classification task, such as a discriminant analysis.

To illustrate the differences between an analysis that uses a BY statement and one that uses a CLASS statement, let's create a subset (called Cars) of the Sashelp.Cars data. The levels of the Origin variable indicate whether a vehicle is manufactured in "Asia", "Europe", or the "USA". For efficiency reasons, most classical SAS procedures require that you sort the data when you use a BY statement. Therefore, a call to PROC SORT creates a sorted version of the data called CarsSorted, which will be used for the BY-group analyses.

data Cars;
   set Sashelp.Cars;
   where cylinders in (4,6,8) and type ^= 'Hybrid'; 
run;

proc sort data=Cars out=CarsSorted; 
   by Origin; 
run;

Descriptive statistics for grouped data

When you generate descriptive statistics for groups of data, the univariate statistics are identical whether you use a CLASS statement or a BY statement. What changes is the way that the statistics are displayed. When you use the CLASS statement, you get one table that contains all statistics or one graph that shows the distribution of each subgroup. However, when you use the BY statement you get multiple tables and graphs.

The following statements use the CLASS statement to produce descriptive statistics. PROC UNIVARIATE displays one (paneled) graph that shows a comparative histogram for the vehicles that are made in Asia, Europe, and USA. PROC MEANS displays one table that contains descriptive statistics:

proc univariate data=Cars;
   class Origin;
   var Horsepower;
   histogram Horsepower / nrows=3; /* must use NROWS= to get panel */
   ods select histogram;
run;

proc means data=Cars N Mean Std;
   class Origin;
   var Horsepower Weight Mpg_Highway;
run;

In contrast, if you run a BY-group analysis on the levels of the Origin variable, you will see three times as many tables and graphs. Each analysis is preceded by a label that identifies each BY group. Notice that the BY-group analysis uses the sorted data.

proc means data=CarsSorted N Mean Std;
   by Origin;
   var Horsepower Weight Mpg_Highway;
run;

Always remember that the output from a BY statement is equivalent to the output from running the procedure multiple times on subsets of the data. For example, the previous statistics could also be generated by calling PROC MEANS three times, each call with a different WHERE clause, as follows:

proc means N Mean Std data=CarsSorted( where=(origin='Asia') );
   var Horsepower Weight Mpg_Highway;
run;
proc means N Mean Std data=CarsSorted( where=(origin='Europe') );
   var Horsepower Weight Mpg_Highway;
run;
proc means N Mean Std data=CarsSorted( where=(origin='USA') );
   var Horsepower Weight Mpg_Highway;
run;

In fact, if you ever find yourself repeating an analysis many times (perhaps by using a macro loop), you should consider whether you can rewrite your program to be more efficient by using a BY statement.

Comparing groups: Use the CLASS statement

As a general rule, you should use a CLASS statement when you want to compare or contrast groups. For example, the following call to PROC GLM performs an ANOVA analysis on the horsepower (response variable) for the three groups defined by the Origin variable. The procedure automatically creates a graph that displays three boxplots, one for each group. The procedure also computes parameter estimates for the levels of the CLASS variable (not shown).

proc glm data=Cars; /* by default, create graph with side-by-side boxplots */
   class Origin;
   model Horsepower = Origin / solution;
quit;

You can specify multiple variables on the CLASS statement to include multiple categorical variables in a model. Any variables that are not listed on the CLASS statement are assumed to be continuous. Thus the following call to PROC GLM analyzes a model that has one continuous and one classification variable. The procedure automatically produces a graph that overlays the three regression curves on the data:

ods graphics / antialias=on;
title "CLASS Variable Regression: One Model with Multiple Parameters";
proc glm data=Cars plots=FitPlot;
   class Origin;
   model Horsepower = Origin | Weight / solution;
   ods select ParameterEstimates ANCOVAPlot;
quit;

In contrast, if you use a BY statement, the Origin variable cannot be part of the model but is used only to subset the data. If you use a BY statement, you obtain three different models of the form Horsepower = Weight. You get three parameter estimates tables and three graphs, each showing one regression line overlaid on a subset of the data.
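A sketch of that BY-group analysis (which uses the sorted data) looks like the following:

proc glm data=CarsSorted;
   by Origin;
   model Horsepower = Weight / solution;  /* three separate models, one per group */
quit;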

Predicted values: CLASS variable versus BY variable

When you use a BY statement and fit three models of the form Horsepower = Weight, the procedure fits a total of six parameters. Notice that when you use the CLASS statement and fit the model Horsepower = Origin | Weight, you also fit six free parameters. It turns out that these two methods produce the same predicted values. In fact, you can combine the parameter estimates (for the GLM parameterization) from the CLASS model to obtain the parameter estimates from the BY-variable analysis: each parameter estimate for the BY-variable models is the sum of two estimates from the CLASS-variable analysis.

For many regression models, the predicted values for the BY-variable analyses are the same as for a particular model that uses a CLASS variable. As shown above, you can even see how the parameters are related when you use a GLM or reference parameterization. However, the CLASS variable formulation can fit models (such as the equal-slope model Horsepower = Origin Weight) that are not available when you use a BY variable to fit three separate models. Furthermore, the CLASS statement provides parameter estimates so that you can see the effect of the groups on the response variable. It is more difficult to compare the models that are produced by using the BY statement.

Other CLASS-like statements in SAS

Some SAS procedures use other syntax to analyze groups. In particular, the SGPLOT procedure calls classification variables "group variables." If you want to overlay graphs for multiple groups, you can use the GROUP= option on many SGPLOT statements. (Some statements support the CATEGORY= option, which is similar.) For example, to replicate the two-variable regression analysis from PROC GLM, you can use the following statements in PROC SGPLOT:

proc sgplot data=Cars;
   reg y=Horsepower x=Weight / group=Origin; /* Horsepower = Origin | Weight */
run;


In summary, use the BY statement in SAS procedures when you want to repeat an analysis for every level of one or more categorical variables. The variables define the subsets but are not otherwise part of the analysis. In classical SAS procedures, the data must be sorted by the BY variables. A BY-group analysis can produce many tables and graphs, so you might want to suppress the ODS output and write the results to a SAS data set.
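For example, a minimal sketch of that approach for the PROC MEANS analysis above is shown below; it assumes that the ODS table that PROC MEANS produces is named "Summary" (you can confirm the name with ODS TRACE ON):

ods exclude all;                   /* suppress the displayed tables */
ods output Summary=MeansByOrigin;  /* write the summary statistics to a data set */
proc means data=CarsSorted N Mean Std;
   by Origin;
   var Horsepower;
run;
ods exclude none;                  /* resume displaying tables */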

Use the CLASS statement when you want to include a categorical variable in a model. A CLASS statement often enables you to compare or contrast subgroups. For example, in regression models you can evaluate the relative effect of each level on the response variable.

In some cases, the BY statement and the CLASS statement produce identical statistics. However, the CLASS statement enables you to fit a wider variety of models.



What is rank correlation?

When someone refers to the correlation between two variables, they are probably referring to the Pearson correlation, which is the standard statistic that is taught in elementary statistics courses. Elementary courses do not usually mention that there are other measures of correlation.

Why would anyone want a different estimate of correlation? Well, the Pearson correlation, which is also known as the product-moment correlation, uses empirical moments of the data (means and standard deviations) to estimate the linear association between two variables. However, means and standard deviations can be unduly influenced by outliers in the data, so the Pearson correlation is not a robust statistic.

A simple robust alternative to the Pearson correlation is called the Spearman rank correlation, which is defined as the Pearson correlation of the ranks of each variable. (If a variable contains tied values, replace those values by their average rank.) The Spearman rank correlation is simple to compute and conceptually easy to understand. Some advantages of the rank correlation are

  • The rank correlation is always in the interval [-1, 1]. For "tame" data, the Spearman and Pearson correlations are close to each other. In fact, if X and Y are bivariate normal random variables with Pearson correlation ρ, then the Spearman correlation is 6/π arcsin(ρ/2), which is very close to the identity function on [-1, 1]. (A short computation after this list evaluates the formula on a grid of values.)
  • The rank correlation is robust to outliers. For example, the data set X={1, 2, 2, 5} has the same ranks as the set Y={1, 2, 2, 500}. Therefore for any third variable Z, the rank correlation between X and Z is the same as the rank correlation between Y and Z.
  • The rank correlation is invariant under any monotonic increasing transformation of the data, such as LOG, EXP, and SQRT. In the previous example, the rank correlation between Z and X is the same as the rank correlation between Z and the log-transform of X, which is {log(1), log(2), log(2), log(5)}. This is in contrast to the Pearson correlation, which is only invariant under affine transformations with positive scaling factors (X → a*X + b, where a > 0).
  • The rank correlation can be used for any ordinal variable. For example, if the variable X has the ordinal values {"Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied"}, and the variable Y has the ordinal values {"Low", "Medium", "High"}, then you can compute a rank correlation between X and Y.
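The following short SAS/IML computation evaluates the bivariate normal formula on a grid of Pearson correlations; the Spearman values are nearly identical to the Pearson values:

proc iml;
rho = do(-0.9, 0.9, 0.3);                    /* grid of Pearson correlations */
spearman = 6/constant('pi') # arsin(rho/2);  /* Spearman correlation for bivariate normal data */
print (rho` || spearman`)[colname={"Pearson" "Spearman"}];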

Compute rank correlation in SAS

PROC CORR in SAS supports several measures of correlation, including the Pearson and Spearman correlations. For data without outliers, the two measures are often similar. For example, the following call to PROC CORR computes the Spearman rank correlation between three variables in the Sashelp.Class data set:

/* Compute PEARSON and SPEARMAN rank correlation by using PROC CORR in SAS */
proc corr data=sashelp.class noprob nosimple PEARSON SPEARMAN;
   var height weight age;
run;
Spearman rank correlation compared to the Pearson correlation in SAS

According to both statistics, these variables are very positively correlated, with correlations in the range [0.7, 0.88]. Notice that the rank correlations (the lower table) are similar to the Pearson correlations for these data. However, if the data contain outliers, the rank correlation estimate is less influenced by the magnitude of the outliers.
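To see the robustness numerically, the following sketch uses small made-up data that are based on the example in the previous section: the x and y variables have the same ranks, but y contains an extreme value. The Spearman correlations of x and y with a third variable z are identical, whereas the Pearson correlations differ:

data Robust;
input x y z;
datalines;
1   1   3
2   2   5
2   2   4
5 500   9
;

proc corr data=Robust noprob nosimple PEARSON SPEARMAN;
   var x y z;
run;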

Compute rank correlation manually

As mentioned earlier, the Spearman rank correlation is conceptually easy to understand. It consists of two steps: compute the ranks of each variable and compute the Pearson correlation between the ranks. It is instructive to reproduce each step in the Spearman computation. You can use PROC RANK in SAS to compute the ranks of the variables, then use PROC CORR with the PEARSON option to compute the Pearson correlation of the ranks. If the data do not contain any missing values, then the following statements implement the two steps that compute the Spearman rank correlation:

/* Compute the Spearman rank correlation "manually" by explicitly computing ranks */
/* First compute ranks; use average rank for ties */
proc rank data=sashelp.class out=classRank ties=mean;
   var height weight age;
   ranks RankHeight RankWeight RankAge;
run;

/* Then compute Pearson correlation on the ranks */
proc corr data=classRank noprob nosimple PEARSON;
   var RankHeight RankWeight RankAge;
run;

The resulting table of correlations is the same as in the previous section and is not shown. Although PROC CORR can compute the rank correlation directly, it is comforting that these two steps produce the same answer. Furthermore, this two-step method can be useful if you decide to implement a rank-based statistic that is not produced by any SAS procedure. This two-step method is also the way to compute the Spearman correlation of character ordinal variables because PROC CORR does not analyze character variables. However, PROC RANK supports both character and numeric variables.

If you have missing values in your data, then make sure you delete the observations that contain missing values before you call PROC RANK. Equivalently, you can use a WHERE statement to omit the missing values. For example, you could insert the following statement into the PROC RANK statements:
   where height^=. & weight^=. & age^=.;

Compute rank correlation in SAS/IML software

In the SAS/IML language, the CORR function computes the Spearman rank correlation directly, as follows. The results are the same as the results from PROC CORR, and are not shown.

proc iml;
use sashelp.class;
   read all var {height weight age} into X;
RankCorr = corr(X, "Spearman");  /* compute rank correlation */

If you ever need to compute a rank-based statistic manually, you can also use the RANKTIE function to compute the ranks of the elements in a numerical vector, such as
   ranktie(X[ ,1], "Mean");
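For example, a minimal sketch that continues the PROC IML session above and computes the Spearman rank correlation from the RANKTIE ranks (the result should match the earlier output from the CORR function):

R = j(nrow(X), ncol(X));              /* allocate a matrix for the ranks */
do j = 1 to ncol(X);
   R[ ,j] = ranktie(X[ ,j], "Mean");  /* average ranks for tied values */
end;
SpearmanCorr = corr(R);               /* Pearson correlation of the ranks */
print SpearmanCorr;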


The Spearman rank correlation is a robust measure of the linear association between variables. It is related to the classical Pearson correlation because it is defined as the Pearson correlation between the ranks of the individual variables. It has some very nice properties, including being robust to outliers and being invariant under monotonic increasing transformations of the data. For other measures of correlation that are supported in SAS, see the PROC CORR documentation.



The IFN function versus the IF-THEN/ELSE statement in SAS

I have previously discussed how to define functions that safely evaluate their arguments and return a missing value if the argument is not in the domain of the function. The canonical example is the LOG function, which is defined only for positive arguments. For example, to evaluate the LOG function on a sequence of (possibly non-positive) values, you can use the following IF-THEN/ELSE logic:

data Log;
input x @@;
if x > 0 then 
   logX = log(x);
else
   logX = .;
datalines;
-1 1 2 . 0 10  
;
run;

proc print data=Log; run;

On SAS discussion forums, I sometimes see questions from people who try to use the IFN function to accomplish the same logic. That is, in place of the IF-THEN/ELSE logic, they try to use the following one-liner in the DATA step:

logX = ifn(x>0, log(x), .);

Although this looks like the same logic, there is a subtle difference. All three arguments to the IFN function are evaluated BEFORE the function is called, and the results of the evaluation are then passed to the function. For example, if x= -1, then the SAS DATA step does the following:

  1. Evaluate the Boolean expression x>0. When x= -1, the expression evaluates to 0.
  2. Evaluate the second argument log(x). This happens regardless of the result of evaluating the first expression. When x= -1, the expression log(x) is invalid and the SAS log will report
    NOTE: Invalid argument to function LOG
    NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values.
  3. Call the IFN function as logX = IFN(0, ., .), which results in assigning a missing value to logX.

In Step 2, SAS evaluates log(x) unconditionally for every value of x, which leads to out-of-domain errors when x is not positive. This is exactly the situation that the programmer was trying to avoid! In contrast, the IF-THEN/ELSE logic only evaluates log(x) when x is positive. Consequently, the SAS log is clean when you use the IF-THEN/ELSE statement.

There are plenty of situations in which the IFN function (and its cousin, the IFC function) is useful, but for testing out-of-domain conditions, use IF-THEN/ELSE logic instead.
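For example, here is a minimal sketch of a safe use of IFN, in which the second and third arguments are valid for every observation. (The data and the bonus rule are made up for illustration.)

data Bonus;
input Name $ Sales;
/* both branches are always valid, so evaluating them unconditionally is harmless */
Bonus = ifn(Sales > 10000, 0.05*Sales, 100);   /* 5% of sales, or a flat 100 */
datalines;
Ann   15000
Bob    8000
;

proc print data=Bonus; run;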



A simple trick to construct symmetric intervals

Many intervals in statistics have the form p ± δ, where p is a point estimate and δ is the radius (or half-width) of the interval. (For example, many two-sided confidence intervals have this form, where δ is proportional to the standard error.) Many years ago I wrote an article that mentioned that you can construct these intervals in the SAS/IML language by using a concatenation operator (|| or //). The concatenation creates a two-element vector, like this:

proc iml;
mu = 50; 
delta = 1.5;
CI = mu - delta  || mu + delta;   /* horizontal concatenation ==> 1x2 vector */

Last week it occurred to me that there is a simple trick that is even easier: use the fact that SAS/IML is a matrix-vector language to encode the "±" sign as a vector {-1, 1}. When SAS/IML sees a scalar multiplied by a vector, the result will be a vector:

CI = mu + {-1  1}*delta;         /* vector operation ==> 1x2 vector */
print CI;

You can extend this example to compute many intervals by using a single statement. For example, in elementary statistics we learn the "68-95-99.7 rule" for the normal distribution. The rule says that in a random sample drawn from a normal population, about 68% of the observations will be within 1 standard deviation of the mean, about 95% will be within 2 standard deviations, and about 99.7% will be within 3 standard deviations of the mean. You can construct those intervals by using a "multiplier matrix" whose first row is {-1, +1}, whose second row is {-2, +2}, and whose third row is {-3, +3}. The following SAS/IML statements construct the three intervals of the 68-95-99.7 rule for a normal population with mean 50 and standard deviation 8:

mu = 50;  sigma = 8;
m = {-1 1,
     -2 2,
     -3 3};
Intervals = mu + m*sigma;
ApproxPct = {"68%", "95%", "99.7"};
print Intervals[rowname=ApproxPct];

Just for fun, let's simulate a large sample from the normal population and empirically confirm the 68-95-99.7 rule. You can use the RANDFUN function to generate a random sample and use the BIN function to detect which observations are in each interval:

call randseed(12345);
n = 10000;                                /* sample size */
x = randfun(n, "Normal", mu, sigma);      /* simulate normal sample */
ObservedPct = j(3,1,0);
do i = 1 to 3;
   b = bin(x, Intervals[i,]);             /* b[i]=1 if x[i] in interval */
   ObservedPct[i] = sum(b) / n;           /* proportion of x in interval */
end;
results = Intervals || {0.68, 0.95, 0.997} || ObservedPct;
print results[colname={"Lower" "Upper" "Predicted" "Observed"}
              label="Probability of Normal Variate in Intervals: X ~ N(50, 8)"];

The simulation confirms the 68-95-99.7 rule. Remember that the rule is a mnemonic device. You can compute the exact probabilities by using the CDF function. In SAS/IML, the exact computation is p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);
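For completeness, here is a short sketch that continues the PROC IML session and compares the exact probabilities to the simulated proportions:

ExactPct = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);  /* P(mu-k*sigma < X < mu+k*sigma) */
print (Intervals || ExactPct || ObservedPct)[colname={"Lower" "Upper" "Exact" "Observed"}];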

In summary, the SAS/IML language provides an easy syntax to construct intervals that are symmetric about a central value. You can use a vector such as {-1, 1} to construct an interval of the form p ± δ, or you can use a k x 2 matrix to construct k symmetric intervals.



LEAVE and CONTINUE: Two ways to control the flow in a SAS DO loop

SAS programmers who have experience with other programming languages sometimes wonder whether the SAS language supports statements that are equivalent to the "break" and "continue" statements in other languages. The answer is yes. The LEAVE statement in the SAS DATA step is equivalent to the "break" statement. It provides a way to immediately exit from an iterative loop. The CONTINUE statement in the SAS DATA step skips over any remaining statements in the body of a loop and starts the next iteration.

Not all languages in SAS support those statements. For example, the SAS/IML language does not. However, you can use an alternative syntax to implement the same logical behavior, as shown in this article.

To review the syntax of various DO, DO-WHILE, and DO-UNTIL loops in SAS, see "Loops in SAS."

The LEAVE statement

The LEAVE statement exits a DO loop, usually as part of an IF-THEN statement to test whether a certain condition is met.

To illustrate the LEAVE statement consider a DATA step that simulates tossing a coin until "heads" appears. (Represent "tails" by a 0 and "heads" by a 1, chosen at random.) In the following SAS DATA step, the DO WHILE statement is always true, so the program potentially contains an infinite loop. However, the LEAVE statement is used to break out of the loop when "heads" appears. This coin-tossing experiment is repeated 100 times, which is equivalent to 100 draws from the geometric distribution:

/* toss a coin until "heads" (1) */
data Toss;
call streaminit(321);
do trial = 1 to 100;               /* simulate an experiment 100 times */
   count = 0;                      /* how many tosses until heads? */
   do while (1);                   /* loop forever */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      if coin = 1 then LEAVE;      /* exit loop when "heads" */
      count + 1;                   /* otherwise increment count */
   end;
   output;                         /* write one record per trial */
end;
keep trial count;
run;

Some people like this programming paradigm (set up an infinite loop, break out when a condition is satisfied), but I personally prefer a DO UNTIL loop because the exit condition is easier to see when I read the program. For example, the code for each trial could be rewritten as

   count  = 0;                     /* how many tosses until heads? */
   done = 0;                       /* initialize flag variable */
   do until (done);                /* exit loop when "heads" */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      done = (coin = 1);           /* update flag variable */
      if ^done then                /* otherwise increment count */
         count + 1;
   end;

Notice that the LEAVE statement exits only the inner loop. In this example, the LEAVE statement does not affect the iteration of the DO TRIAL loop.

The CONTINUE statement

The CONTINUE statement tells the DATA step to skip the remaining body of the loop and go to the next iteration. It is used to skip processing when a condition is true. To illustrate the CONTINUE statement, let's simulate a coin-tossing experiment in which you toss a coin until "heads" appears OR until you have tossed the coin five times. In the following SAS DATA step, if tails (0) appears the CONTINUE statement executes, which skips the remaining statements and begins the next iteration of the loop. Consequently, the DONE=1 assignment is not executed when coin=0. Only if heads (1) appears does the DONE variable get assigned to a nonzero value, thereby ending the DO-UNTIL loop:

data Toss2;
call streaminit(321);
do trial = 1 to 100;               /* simulate an experiment 100 times */
   done = 0;                       /* initialize flag variable */
   do count = 0 to 4 until (done); /* iterate at most 5 times */
      coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
      if coin = 0 then CONTINUE;   /* tails: go to next iteration */
      done = 1;                    /* exit loop when "heads" */
   end;
   output;                         /* write one record per trial */
end;
keep trial count;
run;

The CONTINUE statement is not strictly necessary, although it can be convenient. You can always use an IF-THEN statement to bypass the remainder of the loop, as follows:

   coin = rand("Bernoulli", 0.5);  /* random 0 or 1 */
   /* wrap the remainder of the body in an IF-THEN statement */
   if coin ^= 0 then do;           /* heads: abort the loop */
      done = 1;                    /* exit loop when "heads" */
      /* other computations, as needed */
   end;

The CONTINUE and the LEAVE statements are examples of "jump statements" that tell the program the location of the next statement to execute. Both statements jump to a statement that might be far away. Consequently, programs that contain these statements are less structured than programs that avoid them. I try to avoid these statements in my programs, although sometimes the LEAVE statement is the simplest way to abort a loop when the program has to check for multiple exit conditions.

While we are on the topic, another jump statement in SAS is the GOTO statement, which can be used to emulate the behavior of LEAVE and CONTINUE. I avoid the GOTO statement because I think programs are easier to read and maintain when the logical flow of the program is controlled by using the DO-WHILE, DO-UNTIL, and IF-THEN statements. I also use those control statements in the SAS/IML language, which does not support the CONTINUE or LEAVE statements (although it does support the GOTO statement).
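For example, a sketch of a single coin-tossing trial in the SAS/IML language, which uses a DO UNTIL loop in place of the LEAVE statement, might look like this:

proc iml;
call randseed(321);
count = 0;                               /* how many tosses until heads? */
done = 0;
do until (done);                         /* exit loop when "heads" */
   coin = randfun(1, "Bernoulli", 0.5);  /* random 0 or 1 */
   done = (coin = 1);
   if ^done then count = count + 1;      /* otherwise increment count */
end;
print count;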

What are your views? Do you use the CONTINUE or LEAVE statement to simplify the error-handling logic of your programs? Or do you avoid them in favor of a more structured programming style? Why?



An easy way to run thousands of regressions in SAS

A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated samples or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of times are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.


Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   run; quit;
   /* ...then accumulate statistics... */
%end;
%mend;

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
   output;                      /* write one observation per variable */
end;
drop x:;
run;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;

/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;

The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.


You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar result would work for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.
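For example, a sketch of the analogous BY-group call for those two-variable models might look like the following; the classification variable A is hypothetical and is assumed to be a column in the long data set:

proc glm data=Long noprint outstat=Stats;  /* 'A' is a hypothetical classification variable */
   by VarName;
   class A;
   model Y = A|Value;
quit;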

Can you think of other ways to use this trick? Leave a comment.




ODS OUTPUT: Store any statistic created by any SAS procedure

In the beginning SAS created procedures and output. The output was formless and void. Then SAS said, "Let there be ODS," and there was ODS. Customers saw that ODS was good, and SAS separated the computation from the display and management of output.

The preceding paragraph oversimplifies the SAS Output Delivery System (ODS), but the truth is that ODS is a powerful feature of SAS. You can use ODS to send SAS tables and graphics to various output destinations, including HTML, PDF, RTF, and PowerPoint. You can control the style and attributes of the output, thus creating a customized report. There have been hundreds of papers and books written about ODS. A very basic introduction is Olinger (2000) "ODS for Dummies."

To a statistical programmer the most useful destination is the OUTPUT destination. The OUTPUT destination sends a table or graph to a SAS data set. Consequently, you can programmatically access each element of the output.

The implications of the previous statement are monumental. I cannot overstate the importance of the OUTPUT destination, so let me say it again:

The ODS OUTPUT destination enables you to store any value that is produced by any SAS procedure. You can then read that value by using a SAS program.

The ODS OUTPUT destination answers a common question that is asked by new programmers on SAS discussion forums: "How can I get a statistic into a data set or into a macro variable?" The steps are as follows:

  1. Use ODS TRACE ON (or the SAS documentation) to find the name of the ODS table that contains the statistic that you want.
  2. Use the ODS OUTPUT statement to specify the table name and a data set name. The syntax is ODS OUTPUT TableName=DataSetName. Then run the procedure to generate the table.
  3. Read the data set to obtain the value of the statistic.


Find the name of the ODS table

As an example, suppose that you intend to use PROC REG to perform a linear regression, and you want to capture the R-square value in a SAS data set. The documentation for the procedure lists all ODS tables that the procedure can create, or you can use the ODS TRACE ON statement to display the table names that are produced by PROC REG. The data are the 428 vehicles in the Sashelp.Cars data set, which is distributed with SAS:

ods trace on;                           /* write ODS table names to log */
proc reg data=Sashelp.Cars plots=none;
   model Horsepower = EngineSize Weight;
quit;
ods trace off;                          /* stop writing to log */

For each table, the SAS log shows an entry such as the following:

Output Added:
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics

By looking at the output, you can see that the third table contains the R-square value. By looking at the SAS log, you can see that the name of the third table is "FitStatistics."

Save the table to a SAS data set

Now that you know the name of the ODS table is "FitStatistics," use the ODS OUTPUT destination to write that table to a SAS data set, as follows:

ods output FitStatistics=Output;        /* the data set name is 'Output' */
proc reg data=Sashelp.Cars plots=none;  /* same procedure call */
   model Horsepower = EngineSize Weight;
quit;

proc print data=Output noobs;
run;

The output from PROC PRINT shows the structure of the output data set. Notice that the data set often looks different from the original displayed table. The data set contains non-printing columns (like Model and Dependent) that do not appear in the displayed table. The data set also contains columns that contain the raw numerical values and the (formatted) character values of the statistics. The columns cValue1 and nValue1 represent the same information, except that the cValue1 is a character column whereas nValue1 is a numerical column. The same applies to the cValue2 and nValue2 columns. The character values might contain formatted or rounded values.

Read the value of the statistic into a macro variable

From the previous PROC PRINT output, you can see that the numerical value of the R-square statistic is in the first row and is in the nValue2 column. You can therefore read and process that value by using a standard WHERE clause. For example, the following statements use the SYMPUTX subroutine to create a macro variable that contains the value of the R-square statistic:

data _null_;
set Output;
if Label2="R-Square" then call symputx("RSq", nValue2);
run;

%put RSq = &RSq;
RSq = 0.6201360929

The SAS log shows that the R-square value is now contained in the Rsq macro variable.

Storing the statistic in a macro variable is only one way to use the data set. You could also read the statistics into PROC IML or PROC SQL for further computation, or show the value of the statistic in a graph.
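For example, here is a sketch of the PROC SQL approach, which reads the same statistic from the Output data set into a macro variable:

proc sql noprint;
   select nValue2 into :RSq trimmed   /* read the R-square value */
   from Output
   where Label2 = "R-Square";
quit;
%put RSq = &RSq;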

BY-group processing: Multiple samples and multiple statistics

The previous sections show how to save a single table to a SAS data set. It is just as easy to create a data set that contains multiple statistics, one for each level in a BY-group analysis.

Suppose that you want to run several regressions, one for each value of the Origin variable, which has the values "Asia," "Europe," and "USA." The following call to PROC SORT sorts the data by the Origin variable. The sorted data is stored in the CARS data set.

proc sort data=Sashelp.Cars out=Cars;
   by Origin;
run;

You can then specify Origin on the BY statement in PROC REG to carry out three regression analyses. When you run a BY-group analysis, you might not want to see all of the results displayed on the computer screen, especially if your goal is to save the results in an output data set. You can use the ODS EXCLUDE statement to suppress SAS output.

ods exclude all;                    /* suppress tables to screen */
ods output FitStatistics=Output2;   /* 'Output2' contains results for each BY group */
proc reg data=Cars plots=none;
   by Origin;
   model Horsepower = EngineSize Weight;
quit;
ods exclude none;                   /* no longer suppress tables */

proc print data=Output2 noobs;
   where Label2="R-Square";
   var Origin Label2 nValue2;
run;

The output from PROC PRINT shows the R-square statistics for each model. Notice that the BY-group variables (in this case, Origin) are added to output data sets when you run a BY-group analysis. You can now use the statistics in programs or graphs.

Alternatives to using ODS OUTPUT

Some procedures provide an alternative option for creating an output data set that contains statistics. Always check the SAS documentation to see if the procedure provides an option that writes common statistics to an output data set. For example, the documentation for the PROC REG statement states that you can use the OUTEST= option with the RSQUARE option to obtain an output data set that contains the parameter estimates and other model statistics such as the R-square value. Thus for this example, you do not need to use the ODS OUTPUT statement to direct the FitStatistics table to a data set. Instead, you can obtain the statistic as follows:

proc reg data=Cars NOPRINT outest=Output3 RSQUARE; /* statistics in 'Output3' */
   by Origin;
   model Horsepower = EngineSize Weight;
quit;

proc print data=Output3 noobs;
   format _RSQ_ 8.6;
   var Origin _RSQ_;
run;


In summary, the ODS OUTPUT statement enables you to create a data set that contains any statistic that is produced by a SAS procedure. You can use the ODS OUTPUT statement to capture a statistic and use it later in your program.




Solve linear programming problems in SAS

In some applications, you need to optimize a linear objective function of many variables, subject to linear constraints. Solving this problem is called linear programming or linear optimization. This article shows two ways to solve linear programming problems in SAS: You can use the OPTMODEL procedure in SAS/OR software or use the LPSOLVE function in SAS/IML software.

Formulating the optimization

A linear programming problem can always be written in a standard vector form. In the standard form, the unknown variables are nonnegative, which is written in vector form as x ≥ 0. Because the objective function is linear, it can be expressed as the inner product of a known vector of coefficients (c) with the unknown variable vector: c^T x. Because the constraints are also linear, the inequality constraints can always be written as Ax ≤ b for a known constraint matrix A and a known vector of values b.

In practice, it is often convenient to be able to specify the problem in a non-standardized form in which some of the constraints represent "greater than," others represent "less than," and others represent equality. Computer software can translate the problem into a standardized form and solve it.

A linear programming problem example

As an example of how to solve a linear programming problem in SAS, let's pose a particular two-variable problem:

  • Let x = {x1, x2} be the vector of unknown positive variables.
  • Define the objective function to maximize as 3*x1 + 5*x2. In vector form, the objective function is c^T x, where c = {3, 5}.
  • Define the linear constraints as follows:
    3*x1 + -2*x2 ≤ 10
    5*x1 + 10*x2 ≤ 56
    4*x1 +  2*x2 ≥  7
    In matrix form, you can define the 3 x 2 matrix of constraint coefficients A = {3  -2, 5  10, 4  2} and the right-hand-side vector b = {10, 56, 7}. There is not a standard way to express a "vector of symbols," but we can invent one. Let LEG = {≤, ≤, ≥} be a vector of symbols. ("LEG" is a mnemonic for "Less than," "Equal to," or "Greater than.") With this notation, the vector of constraints is written as Ax LEG b.
Feasible region for the linear programming problem in SAS. The optimal solution is marked.

The graph shows the polygonal region in the plane that satisfies the constraints for this problem. The region (called the feasible region) is defined by the two axes and the three linear inequalities. The color in the interior of the region indicates the value of the objective function within the feasible region. The green star indicates the optimal solution, which is x = {5.3, 2.95}.

The theory of linear programming says that an optimal solution will always be found at a vertex of the feasible region, which in 2-D is a polygon. In this problem, the feasible polygon has only five vertices, so you could evaluate the objective function at each vertex by hand to find the optimal solution. For high-dimensional problems, however, you definitely want to use a computer!
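As a quick sanity check, the following short SAS/IML sketch evaluates the objective function at the optimal vertex that was reported above:

proc iml;
c = {3, 5};               /* coefficients of the objective function */
xOpt = {5.3, 2.95};       /* optimal solution reported in the text */
objVal = c` * xOpt;       /* 3*5.3 + 5*2.95 = 30.65 */
print objVal;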

Linear programming in SAS by using the OPTMODEL procedure

The OPTMODEL procedure is part of SAS/OR software. It provides a natural programming language with which to define and solve all kinds of optimization problems. The linear programming example in this article is similar to the "Getting Started" example in the PROC OPTMODEL chapter about linear programming. The following statements use syntax that is remarkably similar to the mathematical formulation of the problem:

proc optmodel;
var x{i in 1..2} >= 0;           /* information about the variables */
max z = 3*x[1] +  5*x[2];        /* define the objective function */
con c1: 3*x[1] + -2*x[2] <= 10;  /* specify linear constraints */
con c2: 5*x[1] + 10*x[2] <= 56;
con c3: 4*x[1] +  2*x[2] >=  7;
solve;                           /* solve the LP problem */
print x;
Solution to the linear programming problem with PROC OPTMODEL in SAS/OR software

The OPTMODEL procedure prints two tables. The first (not shown) is a table that describes the algorithm that was used to solve the problem. The second table is the solution vector, which is x = {5.3, 2.95}.

For an introduction to using the OPTMODEL procedure to solve linear programming problems, see the 2011 paper by Rob Pratt and Ed Hughes.

Linear programming by using the LPSOLVE subroutine in SAS/IML

Not every SAS customer has a license for SAS/OR software, but hundreds of thousands of people have access to the SAS/IML matrix language. In addition to companies that license SAS/IML software, SAS/IML is part of the free SAS University Edition, which has been downloaded almost one million times by students, teachers, researchers, and self-learners.

Whereas the syntax in PROC OPTMODEL closely reflects the mathematical formulation, the SAS/IML language uses matrices and vectors to specify the problem. You then pass those matrices as arguments to the LPSOLVE subroutine. The subroutine returns the solution (and other information) in output arguments. The following list summarizes the information that you must provide:

  • Define the range of the variables: You can specify a vector for the lower bounds and/or upper bounds of the variables.
  • Define the objective function: Specify the vector of coefficients (c) such that c`*x is the linear objective function. Specify whether the goal is to minimize or maximize the objective function.
  • Specify the linear constraints: Specify the matrix of constraint coefficients (A), the right-hand side vector of values (b), and a vector (LEG) that indicates whether each constraint is an inequality or an equality. (The documentation calls the LEG vector the "rowsense" vector because it specifies the "sense" or direction of each row in the constraint matrix.)

The following SAS/IML program defines and solves the same LP problem as in the previous section. I've added plenty of comments so you can see how the elements in this program compare to the more compact representation in PROC OPTMODEL:

proc iml;
/* information about the variables */
LowerB = {0, 0};           /* lower bound constraints on x */
UpperB = {., .};           /* upper bound constraints on x */
/* define the objective function */
c = {3, 5};                /* vector for objective function c*x */
/* control vector for optimization */
ctrl = {-1,                /* maximize objective */
         1};               /* print some details */
/* specify linear constraints */
A = {3 -2,                 /* matrix of constraint coefficients */
     5 10,
     4  2};
b = {10,                   /* RHS of constraint eqns (column vector) */
     56,
      7};
LEG = {L, L, G};           /* specify symbols for constraints:
                              'L' for less than or equal
                              'E' for equal
                              'G' for greater than or equal */
/* solve the LP problem */
CALL LPSOLVE(rc, objVal, result, dual, reducost, /* output variables */
             c, A, b,      /* objective and linear constraints */
             ctrl,         /* control vector */
             LEG, /*range*/ , LowerB, UpperB); 
print rc objVal, result[rowname={x1 x2}];
Solution to the linear programming problem by using the LPSOLVE function in SAS/IML software

In the call to LPSOLVE, the first five arguments are output arguments. The first three of these are printed:

  • If the call finds the optimal solution, then the return code (rc, first parameter) is 0.
  • The value of the objective function at the optimal value is returned in the second parameter (objVal).
  • The solution to the LP problem (the optimal value of the variables) is returned in the third argument.

Although the LPSOLVE function was not as simple to use as PROC OPTMODEL, it obtains the same solution. The LPSOLVE subroutine supports many features that are not mentioned here. For details, see the documentation for LPSOLVE.

The LPSOLVE subroutine was introduced in SAS/IML 13.1, which was shipped with SAS 9.4m1. The LPSOLVE function replaces the older LP subroutine, which is deprecated. SAS 9.3 customers can call the LP subroutine, which works similarly.




Use SAS formats to bin numerical variables

SAS formats are flexible, dynamic, and have many uses. For example, you can use formats to count missing values and to change the order of a categorical variable in a table or plot. Did you know that you can also use SAS formats to bin a numerical variable into categories? This can be very convenient because you do not need to create a new variable in the data set; you merely apply a format to an existing variable.

Income categories: Are you middle class?

Many people use several IF-THEN/ELSE statements in the DATA step (or in a DATA step view) to create a new discrete variable that represents binned values of a continuous variable. That is a fine technique, but an alternative technique is to create a user-defined format that bins a continuous variable. One advantage of a custom format is that you can apply the format to multiple variables in multiple data sets.

For example, suppose that you want to define income categories such as "working class," "middle class," and the ultra-rich "1 percenters." According to a 2012 Money magazine article, the following cut points divide US household income into seven categories:

/* 2012 income categories for US according to Money magazine */
proc format;
value IncomeFmt  
      low   -<  23000 = "Poverty"        /* < 23 thousand         */
      23000 -<  32500 = "Working"        /* [ 23,  32.5) thousand */
      32500 -<  60000 = "Lower Middle"   /* [ 32.5, 60) thousand  */
      60000 -< 100000 = "Middle"         /* [ 60, 100) thousand   */
     100000 -< 150000 = "Upper Middle"   /* [100, 150) thousand   */
     150000 -< 250000 = "5 Percenter"    /* [150, 250) thousand   */
      250000 -   high  = "1 Percenter";   /* > 250 thousand        */
run;

The call to PROC FORMAT creates a custom format called IncomeFmt. When you assign the IncomeFmt format to a numerical variable, SAS will look at the value of each observation and determine the formatted value from the raw value. For example, a value of 18,000 is less than 23,000, so that value is formatted as "Poverty." A value of 85,000 is in the half-open interval [60000, 100000), so that value is formatted as "Middle."

The following DATA step defines the incomes for 26 fictitious individuals. The IncomeFmt format is assigned to the Income variable:

data incomes;
length Name $10.;
input Name Income @@;
format Income IncomeFmt.;     /* assign IncomeFmt format to Income variable */
datalines;
Amy        65100 Brad      146500 
Carlos    113300 Dimtri     28800 
Eduardo   233300 Felicity   14600 
Guo        43400 Hector    141700 
Irene      53400 Jacob     170300 
Katerina   43100 Liu        66800 
Michael    15800 Nancy      30900 
Oscar      31800 Pablo      65800 
Quentin    40000 Rick       62200 
Stephan    32900 Tracy      64000 
Umberto   124000 Victoria  220700 
Wanda     263800 Xie         9300 
Yolanda    23400 Zachary    93800 
;

The Income variable is a continuous variable, but the format bins each value into one of seven discrete values. Consequently, SAS procedures can analyze the Income variable as if it were a discrete variable. For example, you can count the number of individuals in each income category by calling PROC FREQ:

proc freq data=incomes; 
   tables Income / nocum norow nocol;
run;

Assigning or unassigning formats at run time

The underlying data values are not lost. You can use a FORMAT statement in a SAS procedure to temporarily assign or unassign a format. If you remove the format, you can analyze the underlying raw data. For example, the following call to PROC UNIVARIATE analyzes the raw incomes:

proc univariate data=incomes;
   format Income;     /* remove the format for this analysis */
   var Income;
run;

In a similar way, if you specify the Income variable on a CLASS statement in a regression procedure, the formatted values are used for the analysis. However, if you do NOT include it on the CLASS statement, then the variable is treated as a continuous variable and the unformatted values are used.

Subset data by using formatted values

If you run PROC PRINT on the income data, it LOOKS like the Income variable is a character variable. Furthermore, it is analyzed like a character variable when used in some SAS procedures such as PROC FREQ. Consequently, you might forget that the Income variable is actually numeric. However, if you treat Income as a character variable in a DATA step or a WHERE clause, then SAS will report an error. For example, the following WHERE clause is incorrect:

proc print data=incomes; 
where Income in ("5 Percenter", "1 Percenter"); /* WRONG: Income is numeric */
run;

ERROR: WHERE clause operator requires compatible variables.

SAS reports an error because the WHERE clause cannot compare the raw (numeric) values of the Income variable with elements of a set that contains two strings. When you see an error message like this, use PROC CONTENTS to investigate the attributes of the variable:

ods select Position;
proc contents data=incomes order=varnum; run;

The output from PROC CONTENTS informs you that the Income variable is numeric and displays the name of the format that is attached to it.

If you know the cutoff values that are used for the format, you could create a valid WHERE clause that uses numeric values: where Income GE 150000. However, usually it makes more sense to create a valid WHERE clause by using the PUT function to apply the format to the raw data and compare formatted values:

/* use formatted values to subset data */
proc print data=incomes; 
where put(Income, IncomeFmt.) in ("5 Percenter", "1 Percenter");
run;

You can use other DATA step functions when constructing WHERE clauses. A typical example is when a variable is a SAS date. For example, the Sashelp.Air data set contains a variable named Date. You can use the following WHERE clause to analyze the subset of data that corresponds to the month of January in years prior to 1955:

proc print data=Sashelp.Air;
where month(date)=1 and year(date)<1955;  /* all January dates prior to 1955 */
run;


As shown in this article, SAS formats are very useful and flexible:

  • You can use a custom format to bin a continuous variable into categories.
  • Within a SAS procedure, you can temporarily assign or unassign a format to change the way that the data are analyzed.
  • The WHERE clause looks at raw data values, so use the PUT function in a WHERE clause if you want to subset the data according to the formatted values.

