Compute with combinations: Maximize a function over combinations of variables

About once a month I see a question on the SAS Support Communities that involves what I like to call "computations with combinations." A typical question asks how to find k values (from a set of p values) that maximize or minimize some function, such as "I have 5 variables, and for each observation I want to find the largest product among any 3 values."

These types of problems are specific examples of a single abstract problem, as follows:

  1. From a set of p values, generate all subsets that contain k < p elements. Call the subsets Y1, Y2, ..., Yt, where t equals "p choose k". (In SAS, use the COMB function to compute the number of combinations: t = comb(p,k).)
  2. For each subset, evaluate some function on the subset. Call the values z1, z2, ..., zt.
  3. Return some statistic of the zi. Often the statistic is a maximum or minimum, but it could also be a mean, variance, or percentile.

This is an "exhaustive" method that explicitly generates all subsets, so clearly this technique is impractical for large values of p. The examples that I've seen on discussion forums often use p ≤ 10 and small values of k (often 2, 3, or 4). For parameters in this range, an exhaustive solution is feasible.

This general problem includes "leave-one-out" or jackknife estimates as a special case (k = p – 1), so clearly this formulation is both general and powerful. This formulation also includes the knapsack problem in discrete optimization. In the knapsack problem, you have p items and a knapsack that can hold k items. You want to choose the items so that the knapsack holds as much value as possible. The knapsack problem maximizes the sum of the values whereas the general problem in this article can handle nonlinear functions of the values.

Example data

You can use the following DATA step to simulate integer data with a specified number of columns and rows. I use the relatively new "Integer" distribution to generate uniformly distributed integers in the range [-3, 9].

%let p = 5;      /* number of variables */
%let NObs = 6;   /* number of observations */
data have(drop=i j);
call streaminit(123);
array x[&p];
do i = 1 to &NObs;
   do j = 1 to dim(x);
      x[j] = rand("Integer", -3, 9); /* SAS 9.4M4 */
   end;
   output;
end;
run;

proc print data=have; run;
[Figure: Sample data]

Computing with combinations in SAS/IML

For p = 5 and k = 3, the problem is: "For each observation of the 5 variables, find the largest product among any 3 values." In the SAS/IML language, you can solve problems like this by using the ALLCOMB function to generate all combinations of size k from the index set {1,2,...,p}. These values are indices that you can use to reference each combination of values. You can evaluate your function on each combination and then compute the max, min, mean, etc. For example, the following SAS/IML statements generate all combinations of 3 values from the set {1, 2, 3, 4, 5}:

proc iml;
p = 5; k = 3;
c = allcomb(p, k);  /* combinations of p items taken k at a time */
print c;
[Figure: All combinations of 5 elements taken 3 at a time]

A cool feature of the SAS/IML language is that you can use these values as column subscripts! In particular, the expression X[i, c] generates all 3-fold combinations of values in the i_th row. You can then use the SHAPE function to reshape the values into a matrix that has 3 columns, as follows:

/* Example: find all combinations of elements in the first row */
varNames = "x1":"x5";
use have; read all var varNames into X; close;
Y = X[1, c];                 /* all combinations of columns for 1st row */
M = shape(Y, nrow(c), k);    /* reshape so each row has k elements */
prod = M[, #];               /* product of elements across columns */
print M prod;
[Figure: A function evaluated on each subset of size 3]

Notice that each row of the matrix M contains k = 3 elements of Y. There are "5 choose 3" = 10 possible ways to choose 3 items from a set of 5, so the M matrix has 10 rows. Notice that you can use a subscript reduction operator (#) to compute the product of elements for each combination of elements. The maximum three-value product for the first row of data is 24.

The following loop performs this computation for each observation. The result is a vector that contains the maximum three-value product of each row. The original data and the results are then displayed side by side:

/* for each row and for x1-x5, find maximum product of three elements */
result = j(nrow(X), 1);
do i = 1 to nrow(X);
   Y = X[i, c];                  /* get i_th row and all combinations of columns */
   M = shape(Y, nrow(c), k);     /* reshape so each row has k elements */
   result[i] = max( M[,#] );     /* max of the row products */
end;
print X[colname=varNames] result[L="maxMul"];
[Figure: Maximum product of three elements from a set of 5]

Of course, if the computation for each observation is more complicated than in this example, you can define a function that computes the result and then call the module like this: result[i] = MyFunc(M);
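For example, the following sketch computes the largest range (instead of the largest product) among the three-element subsets. The module name MyFunc and the statistic it computes are hypothetical, not part of the original program:

/* sketch: evaluate a user-defined module on each subset (each row of M) */
start MyFunc(M);
   range = M[, <>] - M[, ><];    /* range of each k-element combination */
   return ( max(range) );        /* largest range among the combinations */
finish;

result = j(nrow(X), 1);
do i = 1 to nrow(X);
   Y = X[i, c];
   M = shape(Y, nrow(c), k);
   result[i] = MyFunc(M);        /* call the module instead of max(M[,#]) */
end;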

Generate all combinations in the DATA step

You can perform a similar computation in the DATA step, but it requires more loops. You can use the ALLCOMBI subroutine (or the LEXCOMBI function) to generate all k-fold combinations of the indices {1, 2, ..., p}. Call the ALLCOMBI subroutine inside a loop that runs from 1 to NCOMB(p, k). Inside the loop, you can evaluate the objective function on each combination of data values. Many DATA step functions such as MAX, MIN, SMALLEST, and LARGEST accept arrays of variables, so you probably want to store the variables and the indices in arrays. The following DATA step contains comments that describe each step of the program:

%let p = 5;
%let k = 3;
%let NChooseK = %sysfunc(comb(&p,&k));  /* N choose k */
data Want(keep=x1-x&p maxProd);
set have;
array x[&p] x1-x&p;        /* array of data */
array c[&k];               /* array of indices */
array r[&NChooseK];        /* array of results for each combination */
ncomb = comb(&p, &k);      /* number of combinations */
do i=1 to &k; c[i]=0; end; /* zero the array of indices before first call to ALLCOMBI */
do j = 1 to ncomb;
   call allcombi(&p, &k, of c[*]);  /* generate j_th combination of indices */
   /* evaluate function of the array {x[c[1]], x[c[2]], ..., x[c[k]]} */
   r[j] = 1;               /* initialize product to 1 */
   do i=1 to &k; r[j] = r[j] * x[c[i]]; end;  /* product of j_th combination */
end;
maxProd = max(of r[*]);    /* max of products */
run;

proc print data=Want; run;
[Figure: Maximum product of three elements from a set of five]

The DATA step uses an array (R) of values to store the result of the function evaluated on each subset. For a MAX or MIN computation, this array is not necessary because you can keep track of the current MAX or MIN inside the loop over combinations. However, for more general problems (for example, find the median value), an array might be necessary.
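For instance, if you want the median product instead of the maximum, a sketch of the change is a single statement at the end of the DATA step, which requires the full array of results:

medProd = median(of r[*]);   /* sketch: median (rather than max) of the products */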

In summary, this article shows how to solve a general class of problems. The general problem generates all subsets of size k from a set of size p. For each subset, you evaluate a function and produce a statistic. From among the "p choose k" statistics, you then choose the max, min, or some other measure. This article shows how to solve these problems efficiently in the SAS/IML language or in the SAS DATA step. Because this is a "brute force" technique, it is limited to small values of p. I suggest p ≤ 25.



Fit a distribution from quantiles

Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are "most likely" to have produced the observed data. However, how can you fit a distribution if you do not have access to the data?

This question was asked by a SAS programmer who wanted to fit a gamma distribution by using sample quantiles of the data. In particular, the programmer said, "we have the 50th and 90th percentile" of the data and "want to find the parameters for the gamma distribution [that fit] our data."

This is an interesting question. Recall that the method of moments uses sample moments (mean, variance, skewness,...) to estimate parameters in a distribution. When you use the method of moments, you express the moments of the distribution in terms of the parameters, set the distribution's moments equal to the sample moments, and solve for the parameter values for which the equation is true.

In a similar way, you can fit a distribution by matching quantiles: equate the sample and distributional quantiles and solve for the parameters of the distribution. This is sometimes called quantile-matching estimation (QME). Because the quantiles involve the cumulative distribution function (CDF), the equation does not usually have a closed-form solution and must be solved numerically.

Fit a two-parameter distribution from two quantiles

[Figure: CDF that matches quantiles of a sample]

To answer the programmer's question, suppose you do not have the original data, but you are told that the 50th percentile (median) of the data is x = 4 and the 90th percentile is x = 8. You suspect that the data are distributed according to a gamma distribution, which has a shape parameter (α) and a scale parameter (β). To use quantile-matching estimation, set F(4; α, β) = 0.5 and F(8; α, β) = 0.9, where F is the cumulative distribution of the Gamma(α, β) distribution. You can then solve for the values of (α, β) that satisfy the equations. You will get a CDF that matches the quantiles of the data, as shown to the right.

I have previously written about four ways to solve nonlinear equations in SAS. One way is to use PROC MODEL, as shown below:

data initial;
   alpha=1; beta=1;    /* initial guess for finding root */
   p1=0.5;  X1 = 4;    /* eqn for 1st quantile: F(X1; alpha, beta) = p1 */
   p2=0.9;  X2 = 8;    /* eqn for 2nd quantile: F(X2; alpha, beta) = p2 */
run;

proc model data=initial;
  eq.one = cdf("Gamma", X1, alpha, beta) - p1;   /* find root of eq1 */
  eq.two = cdf("Gamma", X2, alpha, beta) - p2;   /*    and eq2 */
  solve alpha beta / solveprint out=solved outpredict;
quit;

proc print data=solved noobs;
   var alpha beta;
run;
[Figure: Quantile-matching estimates: parameters estimated from observed quantiles]

The output indicates that the parameters (α, β) = (2.96, 1.52) are the values for which the Gamma(α, β) quantiles match the sample quantiles. You can see this by graphing the CDF function and adding reference lines at the 50th and 90th percentiles, as shown at the beginning of this section. The following SAS code creates the graph:

/* Graph the CDF function to verify that the solution makes sense */
data Check;
set solved;            /* estimates of (alpha, beta) from solving eqns */
do x = 0 to 12 by 0.2;
   CDF = cdf("gamma", x, alpha, beta);
   output;
end;
run;

title "CDF of Gamma Distribution";
title2 "Showing 50th and 90th Percentiles";
proc sgplot data=Check;
   series x=x y=CDF / curvelabel;
   dropline y=0.5 x=4 / dropto=both;   /* first percentile */
   dropline y=0.9 x=8 / dropto=both;   /* second percentile */
   yaxis values=(0 to 1 by 0.1) label="Cumulative Probability";
   xaxis values=(0 to 12 by 2);
run;

Least squares estimates for matching quantiles

The previous section is relevant when you have as many sample quantiles as parameters. If you have more sample quantiles than parameters, then the system is overconstrained and you probably want to compute a least squares solution. If there are m sample quantiles, the least squares solution is the set of parameters that minimizes the sum of squares Σi=1..m (pi − F(xi; α, β))².

For example, the following DATA step contains five sample quantiles. The observation (p,q) = (0.1, 1.48) indicates that the 10th percentile is x=1.48. The second observation indicates that the 25th percentile is x=2.50. The last observation indicates that the 90th percentile is x=7.99. You can use PROC NLIN to find a least squares solution to the quantile-matching problem, as follows:

data SampleQntls;
input p q;  /* p is cumul probability; q is p_th sample quantile */
datalines;
0.1  1.48 
0.25 2.50
0.5  4.25
0.75 6.00
0.9  7.99 
;
run;

/* least squares fit of parameters */
proc nlin data=SampleQntls outest=PE;  /* sometimes the NOHALVE option is useful */
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   model p = cdf("Gamma", q, alpha, beta);
run;

proc print data=PE noobs;
   var alpha beta;
run;
[Figure: Least squares estimates based on matching sample quantiles]

The solution indicates that the parameter values (α, β) = (2.72, 1.70) minimize the sum of squares between the observed and theoretical quantiles. The following graph shows the observed quantiles overlaid on the CDF of the fitted Gamma(α, β) distribution. Alternatively, you can graph a quantile-quantile plot of the observed and fitted quantiles, as sketched after the graph.

[Figure: CDF for the fitted distribution, where parameters are based on matching sample quantiles]
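As a sketch of the quantile-quantile comparison, the following statements use the reported estimates to compute the fitted quantiles and plot them against the observed quantiles; points near the identity line indicate good agreement:

/* sketch: Q-Q comparison of observed and fitted quantiles */
data QQ;
   set SampleQntls;
   alpha = 2.72;  beta = 1.70;                /* estimates reported by PROC NLIN */
   fitQ = quantile("Gamma", p, alpha, beta);  /* fitted quantile for probability p */
run;

proc sgplot data=QQ noautolegend;
   scatter x=fitQ y=q;
   lineparm x=0 y=0 slope=1;                  /* identity line */
   xaxis label="Fitted Quantile";
   yaxis label="Observed Quantile";
run;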

Weighted least squares estimates for matching quantiles

For small samples, quantiles in the tail of a distribution have a large standard error, which means that the observed quantile might not be close to the theoretical quantile. One way to handle that uncertainty is to compute a weighted regression analysis where each sample quantile is weighted by the inverse of its variance. According to Stuart and Ord (Kendall's Advanced Theory of Statistics, 1994, section 10.10), the variance of the p_th sample quantile in a sample of size n is σ² = p(1−p) / (n f(ξp)²), where ξp is the p_th quantile of the distribution and f is the probability density function.

In PROC NLIN, you can perform weighted analysis by using the automatic variable _WEIGHT_. The following statements define the variance of the p_th sample quantile and define weights equal to the inverse variance. Notice the NOHALVE option, which can be useful for iteratively reweighted least squares problems. The option eliminates the requirement that the weighted sum of squares must decrease at every iteration.

/* weighted least squares fit where w[i] = 1/variance[i] */
proc nlin data=SampleQntls NOHALVE;
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   N = 80;                            /* sample size */
   xi = quantile("gamma", p, alpha, beta); /* quantile of distrib */
   f = pdf("Gamma", xi, alpha, beta); /* density at quantile */
   variance = p*(1-p) / (N * f**2);   /* variance of sample quantiles */
   _weight_ = 1 / variance;           /* weight for each observation */
   model p = cdf("Gamma", q, alpha, beta);
run;
[Figure: Parameter estimates based on a weighted matching of observed quantiles]

The parameter estimates for the weighted analysis are slightly different than for the unweighted analysis. The following graph shows the CDF for the weighted estimates, which does not pass as close to the 75th and 90th percentiles as does the CDF for the unweighted estimates. This is because the PDF of the gamma distribution is relatively small for those quantiles, which causes the regression to underweight those sample quantiles.

[Figure: CDF for the fitted distribution, where parameters are based on a weighted matching of observed quantiles]

In summary, this article shows how to use SAS to fit distribution parameters to observed quantiles by using quantile-matching estimation (QME). If the number of quantiles is the same as the number of parameters, you can numerically solve for the parameters for which the quantiles of the distribution equal the sample quantiles. If you have more quantiles than parameters, you can compute a least squares estimate of the parameters. Because quantile estimates in the tail of a distribution have larger uncertainty, you might want to underweight those quantiles. One way to do that is to run a weighted least squares regression where the weights are inversely proportional to the variance of the sample quantiles.



A Monte Carlo algorithm to estimate a median

This article describes and implements a fast algorithm that estimates a median for very large samples. The traditional median estimate sorts a sample of size N and returns the middle value (when N is odd). The algorithm in this article uses Monte Carlo techniques to estimate the median much faster.

Of course, there's no such thing as a free lunch. The Monte Carlo estimate is an approximation. It is useful when a quick-and-dirty estimate is more useful than a more precise value that takes longer to compute. For example, if you want to find the median value of 10 billion credit card transactions, it might not matter whether the median is $16.74 or $16.75. Either would be an adequate estimate for many business decisions.

Although the traditional sample median is commonly used, it is merely an estimate. All sample quantiles are estimates of a population quantity, and there are many ways to estimate quantiles in statistical software. To estimate these quantities faster, researchers have developed approximate methods such as the piecewise-parabolic algorithm in PROC MEANS or "probabilistic methods" such as the ones in this article.

Example data: A sample from a mixture distribution

[Figure: Sample from a mixture distribution, showing the sample median]

Consider the following data set, which simulates 10 million observations from a mixture of an exponential and a normal distribution. You can compute the exact median for the distribution, which is 8.065. A histogram of the data and the population median are shown to the right.

Because the sample size is large, PROC MEANS takes a few seconds to compute the sample median by using the traditional algorithm which sorts the data (an O(N log N) operation) and then returns the middle value. The sample median is 8.065.

/* simulate from a mixture distribution. See
   https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html */
%let N = 1e7;
data Have;
call streaminit(12345);
do i = 1 to &N;
   d = rand("Bernoulli", 0.6);
   if d = 0 then
      x = rand("Exponential", 0.5);
   else
      x = rand("Normal", 10, 2);
   output;
end;
keep x;
run;

/* quantile estimate */
proc means data=Have N Median;
   var x;
run;

A naive resampling algorithm

The basic idea of resampling methods is to extract a random subset of the data and compute statistics on the subset. Although the inferential statistics (standard errors and confidence intervals) on the subset are not valid for the original data, the point estimates are typically close, especially when the original data and the subsample are large.

The simplest form of a resampling algorithm is to generate a random subsample of the data and compute the median of the subsample. The following example uses PROC SURVEYSELECT to resample (with replacement) from the data. The subsample is one tenth as large as the original sample. The subsequent call to PROC MEANS computes the median of the subsample, which is 8.063.

proc surveyselect data=Have seed=1234567 NOPRINT 
     out=SubSample                            /* subsample data set */
     method=urs samprate=0.1;                 /* 10% sampling with replacement */
run;

title "Estimate from 10% of the data";
proc means data=SubSample N Median;
   freq NumberHits;
   var x;
run;

It takes less time to extract a 10% subset and compute the median than it takes to compute the median of the full data. This naive resampling is simple to implement and often works well, as in this case. The estimate on the resampled data is close to the true population median, which is 8.065. (However, the standard error of the traditional estimate is smaller.)

You might worry, however, that by discarding 90% of the data you might discard too much information. In fact, there is a 90% chance that we discarded the middle data value! Thus it is wise to construct the subsample more carefully. The next section constructs a subsample that excludes data from the tails of the distribution and keeps data from the middle of the distribution.

A probabilistic algorithm for the median

This section presents a more sophisticated approximate algorithm for the median. My presentation of this Monte Carlo algorithm is based on the lecture notes of Prof. H-K Hon and a Web page for computer scientists. Unfortunately, neither source credits the original author of this algorithm. If you know the original reference, please post it in the comments.

Begin with a set S that contains N observations. The following algorithm returns an approximate median with high probability:

  1. Choose a random sample (with replacement) of size k = N^(3/4) from the data. Call the sample R.
  2. Choose lower and upper "sentinels," L and U. The sentinels are the lower and upper quantiles of R that correspond to the ranks k/2 ± sqrt(N). These sentinels define a symmetric interval about the median of R.
  3. (Optional) Verify that the interval [L, U] contains the median value of the original data.
  4. Let C be the subset of the original observations that are within the interval [L, U]. This is a small subset.
  5. Return the median of C.

The idea is to use the random sample only to find an interval that probably contains the median. The width of the interval depends on the size of the data. If N=10,000, the algorithm uses the 40th and 60th percentiles of the random subset R. For N=1E8, it uses the 49th and 51st percentiles.

The following SAS/IML program implements the Monte Carlo estimate for the median:

proc iml;
/* Choose a set R of k = N##(3/4) elements of x, chosen at random with replacement. 
   Choose quantiles of R to form an interval [L,U].
   Return the median of the data that are within [L,U]. */
start MedianRand(x);
   N = nrow(x);
   k = ceil(N##0.75);    /* smaller sample size */
   R = T(sample(x, k));  /* bootstrap sample of size k (with replacement) */
   call sort(R);         /* sort this subset of data */
   L = R[ floor(k/2 - sqrt(N)) ]; /* lower sentinel */
   U = R[  ceil(k/2 + sqrt(N)) ]; /* upper sentinel */
   UTail = (x > L);       /* indicator for values above the lower sentinel */
   LTail = (x < U);       /* indicator for values below the upper sentinel */
   if sum(UTail) > N/2 & sum(LTail) > N/2 then do;
      C = x[loc(UTail & LTail)]; /* extract central portion of data */
      return ( median(C) );      
   end;
   else do;
      print "Median not between sentinels!";
      return (.);
   end;
finish;

/* run the algorithm on the example data */
use Have; read all var {"x"}; close;
call randseed(456);
t1 = time();
   approxMed = MedianRand(x);  /* compute approximate median; time it */
tApproxMed = time()-t1;

t0 = time();
   med = median(x);            /* compute traditional estimate; time it */
tMedian = time()-t0;

results = (approxMed || med) // (tApproxMed || tMedian);
print results[colname={"Approx" "Traditional"}
              rowname={"Estimate" "Time"} F=Best6.];

The output indicates that the Monte Carlo algorithm gives an estimate that is close to the traditional estimate, but produces it three times faster. If you repeat this experiment for 1E8 observations, the Monte Carlo algorithm computes an estimate in 2.4 seconds versus 11.4 seconds for the traditional estimate, which is almost five times faster. As the data size grows, the speed advantage of the Monte Carlo algorithm increases. See Prof. Hon's notes for more details.

In summary, this article shows how to implement a probabilistic algorithm for estimating the median for large data. The Monte Carlo algorithm uses a bootstrap subsample to estimate a symmetric interval that probably contains the true median, then uses the interval to strategically choose and analyze only part of the original data. For large data, this Monte Carlo estimate is much faster than the traditional estimate.



Compute the quantiles of any distribution

Your statistical software probably provides a function that computes quantiles of common probability distributions such as the normal, exponential, and beta distributions. Because there are infinitely many probability distributions, you might encounter a distribution for which a built-in quantile function is not implemented. No problem! This article shows how to numerically compute the quantiles of any probability distribution from the definition of the cumulative distribution (CDF).

In SAS, the QUANTILE function computes the quantiles for about 25 distributions. This article shows how you can use numerical root-finding methods (and possibly numerical integration) in SAS/IML software to compute the quantile function for ANY continuous distribution. I have previously written about related topics and particular examples.

The quantile is the root of an integral equation

[Figure: Quantiles are the solutions to the equation CDF(x) − p = 0, where p is a probability]

Computing a quantile would make a good final exam question for an undergraduate class in numerical analysis. Although some distributions have an explicit CDF, many distributions are defined only by a probability density function (the PDF, f(x)) and numerical integration must be used to compute the cumulative distribution (the CDF, F(x)). A canonical example is the normal distribution. I've previously shown how to use numerical integration to compute a CDF from a PDF by using the definition F(x) = ∫ f(t) dt, where the lower limit of the integral is –∞ and the upper limit is x.

Whether the CDF is defined analytically or through numerical integration, the quantile for p is found implicitly as the solution to the equation F(x) = p, where p is a probability in the interval (0,1). This is illustrated by the graph at the right.

Equivalently, you can define G(x; p) = F(x) – p so that the quantile is the root of the equation G(x; p) = 0. For well-behaved densities that occur in practice, a numerical root is easily found because the CDF is monotonically increasing. (If you like pathological functions, see the Cantor staircase distribution.)

Example: Create a custom distribution

SAS/IML software provides the QUAD subroutine, which performs numerical integration, and the FROOT function, which finds roots of a univariate function. Thus SAS/IML is an ideal computational environment for computing quantiles for custom distributions.

As an example, consider a distribution that is a mixture of an exponential and a normal distribution:
F(x) = 0.4 Fexp(x; 0.5) + 0.6 Φ(x; 10, 2),
where Fexp(x; 0.5) is the CDF of the exponential distribution with scale parameter 0.5 and Φ(x; 10, 2) is the normal CDF with mean 10 and standard deviation 2. In this case, you do not need to use numerical integration to compute the CDF. You can compute the CDF as a linear combination of the exponential and normal CDFs, as shown in the following SAS/IML function:

/* program to numerically find quantiles for a custom distribution */
proc iml;
/* Define the cumulative distribution function here. */
start CustomCDF(x); 
   F = 0.4*cdf("Exponential", x, 0.5) +
       0.6*cdf("Normal", x, 10, 2);
   return F;
finish;
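If the CDF did not have a closed form, you could integrate the PDF numerically instead. The following sketch shows the idea; the module names CustomPDF and CustomCDFQuad are hypothetical:

/* sketch: define the CDF by numerically integrating the PDF */
start CustomPDF(x); 
   f = 0.4*pdf("Exponential", x, 0.5) +
       0.6*pdf("Normal", x, 10, 2);
   return f;
finish;
start CustomCDFQuad(x); 
   F = j(nrow(x), ncol(x), .);
   do i = 1 to nrow(x)*ncol(x);
      call quad(val, "CustomPDF", .M || x[i]); /* integrate PDF on (-infinity, x] */
      F[i] = val;
   end;
   return F;
finish;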

The previous section shows the graph of the CDF on the interval [0, 16]. The vertical and horizontal lines correspond to the first, second and third quartiles of the distribution. The quartiles are close to the values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5. The next section shows how to compute the quantiles.

Compute quantiles for an arbitrary distribution

As long as you can define a function that evaluates the CDF, you can find quantiles. For unbounded distributions, it is usually helpful to plot the CDF so that you can visually estimate an interval that contains the quantile. (For bounded distributions, the support of the distribution contains all quantiles.) For the mixture distribution in the previous section, it is clear that the quantiles are in the interval [0, 16].
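For example, here is a quick sketch that graphs the CustomCDF function on [0, 16]:

/* sketch: plot the CDF to help choose an interval that contains the quantiles */
x = do(0, 16, 0.1);
F = CustomCDF(x);
title "CDF of Custom Distribution";
call series(x, F) grid={x y};    /* line plot of the CDF */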

The following program finds arbitrary quantiles for whichever CDF is evaluated by the CustomCDF function. To find quantiles for a different function, you can modify the CustomCDF and change the interval on which to find the quantiles. You do not need to modify the RootFunc or CustomQuantile functions.

/* Express CDF(x)=p as the root of the function CDF(x)-p. */
start RootFunc(x) global(gProb);
   return CustomCDF(x) - gProb;  /* quantile for p is root of CDF(x)-p */
finish;

/* You need to provide an interval on which to search for the quantiles. */
start CustomQuantile(p, Interval) global(gProb);
   q = j(nrow(p), ncol(p), .);              /* allocate result matrix */
   do i = 1 to nrow(p)*ncol(p);             /* for each element of p... */
      gProb = p[i];                         /*    set global variable   */
      q[i] = froot("RootFunc", Interval);   /*    find root (quantile)  */ 
   end;
   return q;
finish;

/* Example: look for quartiles in interval [0, 16] */
probs = {0.25 0.5 0.75};         /* Q1, Q2, Q3 */
intvl = {0 16};                  /* interval on which to search for quantiles */
quartiles = CustomQuantile(probs, intvl);
print quartiles[colname={Q1 Q2 Q3}];
[Figure: Quartiles for a mixture of exponential and normal distributions]

In summary, you can compute an arbitrary quantile of an arbitrary continuous distribution if you can (1) evaluate the CDF at any point and (2) numerically solve for the root of the equation CDF(x)-p for a probability value, p. Because the support of the distribution is arbitrary, the implementation requires that you provide an interval [a,b] that contains the quantile.

The computation should be robust and accurate for non-pathological distributions provided that the density is not tiny or zero at the value of the quantile. Although this example is illustrated in SAS, the same method will work in other software.



Use lists to pass parameters to SAS/IML functions

A popular way to use lists in the SAS/IML language is to pack together several related matrices into a single data structure that can be passed to a function. Imagine that you have written an algorithm that requires a dozen different parameters. Historically, you would have to pass those parameters to the function by explicitly listing each argument. If you use lists (introduced in SAS/IML 14.2, which is SAS 9.4M4), you can pack all parameters into one list and pass that list into the function module where the individual items can be extracted as needed.

Example: An algorithm that requires multiple parameters

To illustrate this list-passing technique, consider the algorithm for Newton's method, which is an iterative method for finding the roots of a nonlinear system of equations. An implementation of Newton's method is included in the SAS/IML documentation. Newton's method requires several arguments:

  • The name of a module that evaluates the function whose roots are sought.
  • The name of a module that evaluates the Jacobian matrix (the derivative of the function).
  • An initial guess for the solution.
  • The maximum number of iterations before the algorithm decides that the method is not converging.
  • A convergence criterion that determines the accuracy of the solution.

The example in the documentation (which was written in the 1980s) uses global variables and hard-codes the names of functions and the convergence criteria. The example is meant to demonstrate the basic ideas, but is not reusable. You can make the example more flexible and robust by passing parameters to the module as shown in the next section.

Create a list of parameters

Suppose that you define the modules F and DF to evaluate a multivariate function and the matrix of partial derivatives (the Jacobian matrix), respectively. The following modules evaluate the same mathematical function and derivatives that are used in the SAS/IML documentation:

proc iml;
/* Newton's method to solve for roots of the system of equations
   F(x1,x2) = [ x1+x2-x1*x2+2 ] = [ 0 ]
              [ x1*exp(-x2)-1 ]   [ 0 ]  */
/* 1. define a function F that evaluates a multivariate function */
start F(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   f = (x1+x2-x1*x2+2) // (x1*exp(-x2)-1); /* evaluate the function */
   return ( f );
finish F;
/* 2. define a function DF that evaluates the Jacobian of F */
start DF(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   J = {. ., . .};
   J[1,1] = 1-x2;       J[1,2] = 1-x1;
   J[2,1] = exp(-x2);   J[2,2] =  -x1*exp(-x2);
   return ( J );
finish DF;

As shown in a previous article, you can create a named list that contains the parameters for Newton's algorithm. In SAS/IML 14.3, you can create the list by specifying a sequence of name-value pairs:

/* 3. list of NAME  = VALUE pairs  (syntax uses SAS/IML 14.3) */
args = [#"Func"     = "F",         /* name of module that evaluates the function */
        #"Deriv"    = "DF",        /* name of module that evaluates the deivative */ 
        #"x0"       = {0.1, -2.0}, /* initial guess for solution */
        #"MaxIters" = 100,         /* maximum iterations */
        #"ConvCrit" = 1e-6 ];      /* convergence criterion */

In SAS/IML 14.3, you can use the list item operator ($) to specify an item in a list. You can also get the name of the function and use CALL EXECUTE to evaluate the function at the initial guess, as follows:

x = args$"x0";                           /* initial guess */ 
EvalFunc = "y=" + args$"Func" + "(x);";  /* string for "y=F(x);" */
call execute(EvalFunc);                  /* evaluates function at x */
print EvalFunc, y;

This is a useful trick for evaluating a function by name. The next section uses this trick inside a module that implements Newton's method.

Pass a list of parameters to and from a SAS/IML module

You can now implement Newton's method by defining a SAS/IML module that takes one argument, which is a list. Inside the module, the parameters are extracted from the list and used to evaluate the functions F and DF and to control the iteration and convergence of Newton's method, as follows:

/* 4. Newton's method for solving the roots of F(x)=0. The argument to the module is a list. */
start Newton( L );
   x = L$"x0";                              /* initial guess */
   EvalFunc = "y=" + L$"Func" + "(x);";     /* string for "y=F(x);" */
   EvalDeriv = "D=" + L$"Deriv" + "(x);";   /* string for "D=DF(x);" */
   call execute(EvalFunc);          /* evaluate F at starting values */
   converged = (max(abs(y)) < L$"ConvCrit");              /* 0 or 1 */
   do iter = 1 to L$"MaxIters"      /* iterate until max iterations */
             while( ^converged );   /*      or until convergence */
      call execute(EvalDeriv);      /* evaluate derivatives D=DF(x) */
      delta = -solve(D,y);          /* solve for correction vector */
      x = x+delta;                  /* update x with new approximation */
      call execute(EvalFunc);       /* evaluate the function y=F(x) */
      converged = (max(abs(y)) < L$"ConvCrit");
   end;
   /* you can return x or a list that summarizes the results */
   result = [#"x"          = x,           /* approximate root (if converged) */
             #"iterations" = iter,        /* total number of iterations */
             #"status"     = converged];  /* 1 if root found; 0 otherwise */
   return ( result );
finish Newton;

/* 5. call Newton's method and pass in a list of parameters */
soln = Newton( args );              
xSoln = soln$"x";                   /* get root */
if soln$"status" then               /* check convergence status */
   print "Newton's method converged in " (soln$"iterations") " iterations", xSoln;
else 
   print "Newton's method did not converge in " (soln$"iterations") " iterations";

The Newton module returns a named list that has three items. The item named "status" is a binary value that indicates whether the method converged. If the method converged, the item named "x" contains the approximate root. The item named "iterations" contains the number of iterations of the algorithm.
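If you want to inspect the entire returned list, a one-statement sketch displays every item and its name:

call ListPrint(soln);   /* sketch: display the names and values of all items in the list */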

In summary, you can use lists to create a structure that contains many parameters. You can pass the list to a SAS/IML module, which can extract and use the parameters. This enables you to simplify the calling syntax for user-defined modules that require many parameters.

For additional examples of using lists to pass heterogeneous data to SAS/IML modules, see "More Than Matrices: SAS/IML Software Supports New Data Structures" (Wicklin, 2017, pp. 11–12) and the SAS/IML documentation for lists.



Create lists by using a natural syntax in SAS/IML

SAS/IML 14.3 (SAS 9.4M5) introduced a new syntax for creating lists and for assigning and extracting items in a list. Lists (introduced in SAS/IML 14.2) are data structures that are convenient for holding heterogeneous data. A single list can hold character matrices, numeric matrices, scalar values, and other lists, as discussed in a previous article about how to use lists in SAS/IML.

The list creation operator

You can use square brackets to create a list. The elements of the list are separated by commas. For example, the following syntax creates a list that contains three elements. The list represents a hypothetical patient in a cholesterol-lowering study.

proc iml;
/* 1. New list creation syntax (L = [val1, val2,...]) */
/*                 Cholesterol at
       Name  Age   0mo  3mo  6mo  12mo */
P1 = ["Bob",  36, {182, 170, 170, 162}]; /* Patient 1 */

The name of the list is P1. The first element of the list is a string, the second is a scalar number, and the third is a numeric vector. In this case, the elements are specified by using literal values, but you can also use previously defined variables. The following statement is equivalent but defines the list by using existing variables:

Name = "Bob"; Age = 36; Chol = {182, 170, 170, 162};
P1 = [Name, Age, Chol];   /* define list from copies of existing variables */

As mentioned earlier, lists can contain other lists. For example, if you have multiple patients in a study, you can create a list of each patient's data, then create a list that contains all the patients' data, as follows:

P2 = ["Fred", 52, {175, 165, 155}     ]; /* Patient 2 */
P3 = ["Tom",  45, {160, 145,   ., 139}]; /* Patient 3 */
Patients = [P1, P2, P3];                 /* a list of patients */

Assign and extract list items

You can use the list item operator ($) to specify an item in a list. For each patient, the age is stored as the second item in a list. If you want the age of Bob (the first patient), you can use either of the following statements:

/* 2. New list item syntax ($) */
BobAge = P1$2;         /* get 2nd item from P1 list */
BobAge = Patients$1$2; /* get 1st item of Patients, then 2nd item from that list */

The first statement is straightforward: P1$2 means "get the second item from the P1 list." The second statement is parsed left to right. The syntax Patients$1 means "get the first item from the Patients list," which is equivalent to P1. Thus Patients$1$2 gets Bob's age.

The preceding example uses literal values to specify the position of an item in a list, but you can also use a variable. This makes it possible to extract items in a loop. For example, the following statements loop over all patients, extract the name and cholesterol values of the patient, and compute each patient's average cholesterol value during the study:

N = ListLen(Patients);      /* number of patients */
Name = j(N,1,"     ");      /* allocate vector for names */
AvgChol = j(N,1,.);         /* allocate vector for results */
do i = 1 to N;
   Name[i] = Patients$i$1;  /* get name of i_th patient */
   Chol = Patients$i$3;     /* get sequence of cholesterol values */
   AvgChol[i] = mean(Chol); /* average value */
end;
print AvgChol[rowname=Name];

You can use the list item operator on the left side of an assignment operator to change an item in a list. For example, suppose you discover that Bob's age was typed wrong: Bob is really 63, not 36. You can update Bob's data as follows:

P1$2 = 63;         /* update 2nd item in P1 list */
Patients$1 = P1;   /* update entire patient list */

Alternatively, you could use the syntax Patients$1$2 = 63 to update the data in place.

Extract sublists

To subset a list, use square brackets ([]) and specify the indices of the items. For example, the following statement extracts the second and third patients into a new list:

/* 3. Extract sublist SL = L[ {i1, i2,...} ] */
SubList = Patients[ {2 3} ]; /* Fred and Tom */

The sublist operator ([]) ALWAYS returns a list, even if you specify only one item! Thus Patients[1] is a LIST that contains one item. If you want the item itself, use Patients$1.
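A quick sketch of the distinction:

SL = Patients[1];   /* sublist: a LIST that contains one item (the P1 list) */
P  = Patients$1;    /* item: the P1 list itself */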

Named lists

In the previous example, the items in the list P1 are a patient's name, age, and cholesterol readings. If you want to extract Bob's age, you can write P1$2, but someone unfamiliar with the order of the list items would have no idea what that item represents. Thus it is helpful to define a named list, which is also called an associative array. When you specify a named list, you specify the items by using name-value pairs, as follows:

P = [#"Name" = "Bob",                        /* P$1 or P$"Name" */
     #"Age"  = 63,                           /* P$2 or P$"Age" */
     #"Cholesterol" = {182, 170, 170, 162}]; /* P$3 or P$"Cholesterol" */

You can use the names to refer to the items in a list. For example, the following statements extract the patient's name and cholesterol readings by using the list item operator:

Name = P$"Name";
Chol = P$"Cholesterol";       /* get Bob's measurements */
print Chol[Label=Name];

You can also use names in the sublist operator to extract a sublist:

L = P[ {"Age" "Cholesterol"} ];  /* sublist that contains two items */


In summary, SAS/IML 14.3 contains new syntax for creating a list and for extracting items and sublists. This syntax makes it easier to use lists and to read and write SAS/IML programs that use lists.



10 posts from 2017 that deserve a second look

Last week I wrote about the 10 most popular articles from The DO Loop in 2017. My most popular articles tend to be about elementary statistics or SAS programming tips. Less popular are the articles about advanced statistical and programming techniques. However, these technical articles fill an important niche. Not everyone needs to know how to interpret a diffogram that visually compares the differences between group means, but those who do often send me a note of thanks, which motivates me to research and write similar articles.

Statistical programmers might want to review the following technical articles from 2017. This ain't summertime, and the reading ain't easy, but I think these articles are worth the effort. I've broken them into three categories. Enjoy!

Statistical Concepts

[Figure: Visualization of regression that uses a weight variable in SAS]

Statistical Data Analysis and Visualization

[Figure: Visualize a multivariate regression model by slicing the continuous variables. Graph created by using the EFFECTPLOT SLICEFIT statement in SAS.]

Advanced SAS Programming Techniques

If you are searching for a way to enhance your SAS knowledge in this New Year, I think these 10 articles from The DO Loop are worth a second look. Was there an article from 2017 that you found particularly useful? Leave a comment.



The method of moments: A smart way to choose initial parameters for MLE

When you run an optimization, it is often not clear how to provide the optimization algorithm with an initial guess for the parameters. A good guess converges quickly to the optimal solution whereas a bad guess might diverge or require many iterations to converge. Many people use a default value (such as 1) for all initial guesses. I have previously written about other ways to obtain an initial guess for an optimization.

In the special case of maximum likelihood estimation, you can use a trick to provide a good guess to the optimization algorithm. For maximum likelihood estimation, the objective function is the log-likelihood function of a distribution. Consequently, you can use the method of moments to provide the initial guess for the parameters, which often results in fast convergence to the maximum likelihood estimates. The faster convergence can save you considerable time when you are analyzing thousands of data sets, such as during a simulation study.

The method of moments

Suppose you believe that some data can be modeled by a certain distribution such as a lognormal, Weibull, or beta distribution. A common way to estimate the distribution parameters is to compute the maximum likelihood estimates (MLEs). These are the values of the parameters that are "most likely" to have generated the observed data. To compute the MLEs you first construct the log-likelihood function and then use numerical optimization to find the parameter values that maximize the log-likelihood function.

The method of moments is an alternative way to fit a model to data. For a k-parameter distribution, you write the equations that give the first k central moments (mean, variance, skewness, ...) of the distribution in terms of the parameters. You then replace the distribution's moments with the sample mean, variance, and so forth. You invert the equations to solve for the parameters in terms of the observed moments.

For example, suppose you believe that some data are distributed according to a two-parameter beta distribution. If you look up the beta distribution on Wikipedia, you will see that the mean and variance of a Beta(a, b) distribution can be expressed in terms of the two parameters as follows:

E[X] = a / (a + b)
Var(X) = ab / [ (a + b)² (a + b + 1) ]

The first equation represents the expected (mean) value of a beta-distributed random variable X. The second equation represents the variance. If the sample mean of the data is m and the sample variance is v, you can "plug in" these values for the left-hand side of the equations and solve for the values of a and b that satisfy the equations. Solving the first equation for a yields a = b m / (1 − m). If you substitute that expression into the second equation and solve for b, you get b = m − 1 + (m/v)(1 − m)². Those equations give the parameter estimates from the method of moments.

Maximum likelihood estimation: Using an arbitrary guess

The method of moments can lead to improved convergence when fitting a distribution. For comparison, first consider what happens if you use an arbitrary initial guess for the optimization, such as a = b = 1. The following statements use PROC NLMIXED to compute maximum likelihood estimates for the parameters of the beta distribution. (You can read about how to use PROC NLMIXED to construct MLEs. You can also compute MLEs by using SAS/IML.)

data Beta;
input x @@;
datalines;
0.58 0.23 0.41 0.69 0.64 0.12 0.07 0.19 
0.47 0.05 0.76 0.18 0.28 0.06 0.34 0.52 
0.42 0.48 0.11 0.39 0.24 0.25 0.22 0.05 
0.23 0.49 0.45 0.29 0.18 0.33 0.09 0.16 
0.45 0.27 0.31 0.41 0.30 0.19 0.27 0.57 
;
run;

/* Compute parameter estimates for the beta distribution */
proc nlmixed data=Beta;
   parms a=1 b=1;                /* use arbitrary initial guesses */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;
[Figure: Solution to fitting a beta distribution in SAS]

The MLEs are a = 1.8487 and b = 3.9552. The "Iteration History" table (not shown) shows that 8 iterations of a quasi-Newton algorithm were required to converge to the MLEs, and the objective function (the LOGPDF function) was called 27 times during the optimization.

The following graph shows a histogram of the data overlaid with a Beta(a=1.85, b=3.96) density curve. The curve fits the data well.

[Figure: Histogram of the data overlaid with a beta density curve, fitted by maximum likelihood estimation]

Maximum likelihood estimation: Using method-of-moments for guesses

If you use the method of moments to provide an initial guess on the PARMS statement, the quasi-Newton algorithm might converge in fewer iterations. The following statements use PROC MEANS to compute the sample mean and variance, then use the DATA step to compute the method-of-moments estimates:

/* output mean and variance */
proc means data=Beta noprint;
   var x;
   output out=Moments mean=m var=v;  /* save sample moments */
run;

/* use method of moments to estimate parameters for the beta distribution */
data ParamEst;
   retain a b;               /* set order of output variables */
   set Moments;
   b = m*(1-m)**2/v + m - 1; /* estimate parameters by using sample moments */
   a = b*m / (1-m);
   keep a b;
run;

The method of moments estimates are a = 1.75 and b = 3.75, which are close to the MLE values. The DATA= option in the PARMS statement in PROC NLMIXED enables you to specify a data set that contains starting values for the parameters. The following call to PROC NLMIXED is exactly the same as the previous call except that the procedure uses the initial parameter values from the ParamEst data set:

proc nlmixed data=Beta;
   parms / data=ParamEst;        /* use initial guesses from method of moments */
   bounds a b > 0;
   LL = logpdf("beta", x, a, b); /* log-likelihood function */
   model x ~ general(LL);
run;
[Figure: Iteration history for the MLE of the beta distribution. Initial guesses are the method-of-moments estimates.]

The MLEs from this call (not shown) are exactly the same as before. However, the "Iteration History" table indicates that the quasi-Newton algorithm converged in 5 iterations and required only 14 calls to the objective function. This compares with 8 iterations and 27 calls for the previous computation. Thus the method of moments resulted in about half as many function evaluations. In my experience this example is typical: choosing initial parameters by using the method of moments often reduces the computational time that is required to obtain the MLEs.

It is worth mentioning that the PARMS option in PROC NLMIXED supports two nice features:

  • The DATA= option accepts both wide and long data sets. The example here is wide: there are k variables that supply the values of the k parameters. You can also specify a two-variable data set with the names "Parameter" and "Estimate," as shown in the sketch after this list.
  • The BYDATA option enables you to specify initial guesses for each BY group. In simulation studies, you might use BY-group processing to compute the MLEs for thousands of samples. The BYDATA option is essential to improving the efficiency of a large simulation study. It can also help ensure that the optimization algorithm converges when the parameters for the samples range over a large set of values.
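For example, here is a sketch of a long-format data set that is equivalent to the ParamEst data set; the values are the method-of-moments estimates from the previous section:

/* sketch: long-format alternative for the PARMS statement in PROC NLMIXED */
data ParamEstLong;
   length Parameter $32;
   input Parameter $ Estimate;
   datalines;
a  1.75
b  3.75
;
run;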

Summary and cautions

In summary, this article shows how to use the method of moments to provide a guess for optimizing a log-likelihood function, which often speeds up the time required to obtain the MLEs. The method requires that you can write the equations that relate the central moments of a distribution (mean, variance, ...) to the parameters for the distribution. (For a distribution that has k parameters, use the equations for the first k moments.) You then "plug in" the sample moments and algebraically invert the equations to obtain the parameter values in terms of the sample moments. This is not always possible analytically; you might need to use numerical methods for part of the computation.

A word of caution. For small samples, the method of moments might produce estimates that do not make sense. For example, the method might produce a negative estimate for a parameter that must be positive. In these cases, the method might be useless, although you could attempt to project the invalid parameters into the feasible region.



Compute the CDF and quantiles of discrete distributions

A statistical programmer read my article about the beta-binomial distribution and wanted to know how to compute the cumulative distribution (CDF) and the quantile function for this distribution. In general, if you know the PDF for a discrete distribution, you can also compute the CDF and quantile functions. This article shows how.

Continuous versus discrete distributions

Recall the four essential functions for any probability distribution: the probability density function (PDF), the cumulative distribution function (CDF), the quantile function, and the random-variate generator. SAS software provides the PDF, CDF, QUANTILE, and RAND functions, which support about 25 common families of distributions. However, there are infinitely many probability distributions. What can you do if you need a distribution that is not natively supported in SAS? The answer is that you can use tools in SAS (especially in SAS/IML software) to implement these functions yourself.

I have previously shown how to use numerical integration and root-finding algorithms to compute the CDF and quantile function for continuous distributions such as the folded normal distribution and the generalized Gaussian distribution. For discrete distributions, you can use a summation to obtain the CDF from the PDF. The CDF at X=x is the sum of the PDF values for all values of X that are less than or equal to x. The discrete CDF is a step function, so it does not have an inverse function. Given a probability, p, the quantile for p is defined as the smallest value of x for which CDF(x) ≥ p.

To simplify the discussion, assume that the random variable X takes values on the integers 0, 1, 2, .... Many discrete probability distributions in statistics satisfy this condition, including the binomial, Poisson, geometric, beta-binomial, and more.

Compute a CDF from a PDF

I will use SAS/IML software to implement the CDF and quantile functions. The following SAS/IML modules define the PDF function and the CDF function. If you pass in a scalar value for x, the CDF function computes the PDF from X=0 to X=x and sums up the individual probabilities. When x is a vector, the implementation is slightly more complicated, but the idea is the same.

proc iml;
/* PMF function for the beta-binomial distribution. See
   https://blogs.sas.com/content/iml/2017/11/20/simulate-beta-binomial-sas.html */
start PDFBetaBinom(x, nTrials, a, b);
   logPMF = lcomb(nTrials, x) + logbeta(x + a, nTrials - x + b) - logbeta(a, b);
   return ( exp(logPMF) );
finish;

/* CDF function for the beta-binomial distribution */
start CDFBetaBinom(x, nTrials, a, b);
   y = 0:max(x);                               /* all values of X up to max(x) */
   PMF = PDFBetaBinom(y, nTrials, a, b);       /* compute PMF(t) for t=0..max(x) */
   cumPDF = cusum(PMF);                        /* cdf(T) = sum( PMF(t), t=0..T ) */
   /* The input x can be a vector of values in any order.
      Use the LOC function to assign CDF(x) for each value of x */
   CDF = j(nrow(x), ncol(x), .);               /* allocate result (same size as x) */
   u = unique(x);                              
   do i = 1 to ncol(u);                        /* for each unique value of x... */
      idx = loc(x=u[i]);                       /*    find the indices */
      CDF[idx] = cumPDF[1+u[i]];               /*    assign the CDF   */
   end;
   return CDF;
finish;

/* test the CDF function */
a = 6; b = 4; nTrials = 10;
t = 0:nTrials;                                 /* points at which to evaluate the CDF */
c = CDFBetaBinom(t, nTrials, a, b);

Notice the computational trick that is used to compute the CDF for a vector of values. If you want to compute the CDF at two or more values (say X=5 and X=7), you can compute the PDF for X=0, 1, ... M where M is the maximum value that you need (such as 7). Along the way, you end up computing the CDF for all smaller values.

You can write the CDF values to a SAS data set and graph them to visualize the CDF of the beta-binomial distribution, as shown below:
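Here is one way to do that; the following sketch writes the vector of CDF values to a data set from within the same PROC IML session and uses a needle plot to display the discrete CDF:

/* sketch: write the CDF values to a data set and graph them */
t = T(0:nTrials);
CDF = CDFBetaBinom(t, nTrials, a, b);
create BBCdf var {"t" "CDF"};  append;  close BBCdf;

submit;
proc sgplot data=BBCdf;
   needle x=t y=CDF;
   yaxis label="Cumulative Probability" grid;
run;
endsubmit;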

[Figure: Beta-binomial cumulative distribution]

Compute quantiles from the CDF

Given a probability value p, the quantile for p is the smallest value of x for which CDF(x) ≥ p. Sometimes you can find a formula that enables you to find the quantile directly, but, in general, the definition of a quantile requires that you compute the CDF for X=0, 1, 2, ... and so forth until the CDF equals or exceeds p.

The support of the beta-binomial distribution is the set {0, 1, ..., nTrials}. The following function first computes the CDF function for all values. It then looks at each value of p and returns the corresponding value of X that satisfies the quantile definition.

start QuantileBetaBinom(_p, nTrials, a, b);
   p = colvec(_p);                         /* ensure input is column vector */
   y = 0:nTrials;                          /* all possible values of X   */
   CDF = CDFBetaBinom(y, nTrials, a, b);   /* all possible values of CDF */
   q = j(nrow(p), 1, .);                   /* allocate result vector */
   do i = 1 to nrow(p);
      idx = loc( p[i] <= CDF );            /* find all x such that p <= CDF(x) */ 
      q[i] = idx[1] - 1;                   /* smallest index; subtract 1 to convert index to x */
   end;
   return ( q );
finish;

p = {0.2, 0.5, 0.8};
q = QuantileBetaBinom(p, nTrials, a, b);
print p q;

The output shows the quantile values for p = {0.2, 0.5, 0.8}. You can find the quantiles visually by using the graph of the CDF function, shown in the previous section. For each value of p on the vertical axis, draw a horizontal line until the line hits a bar. The x value of the bar is the quantile for p.

Notice that the QuantileBetaBinom function uses the fact that the beta-binomial distribution has finite support. The function computes the CDF at the values 0, 1, ..., nTrials. From those values you can "look up" the quantile for any p. If you have a distribution with infinite support (such as the geometric distribution) or very large finite support (such as nTrials=1E9), it is more efficient to apply an iterative process such as bisection to find the quantiles, as sketched below.
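For example, here is a sketch of that idea for the geometric distribution, which uses the built-in CDF function. The module name and the doubling-plus-bisection scheme are illustrative, not from the original program:

/* sketch: quantile of a discrete distribution with infinite support */
start QuantileGeomSketch(p, prob);
   if cdf("Geometric", 0, prob) >= p then return ( 0 );
   hi = 1;                                /* double until CDF(hi) >= p */
   do while ( cdf("Geometric", hi, prob) < p );
      hi = 2*hi;
   end;
   lo = floor(hi/2);                      /* maintain CDF(lo) < p <= CDF(hi) */
   do while ( hi - lo > 1 );              /* bisection on the integers */
      mid = floor( (lo+hi)/2 );
      if cdf("Geometric", mid, prob) >= p then hi = mid;
      else lo = mid;
   end;
   return ( hi );
finish;

q = QuantileGeomSketch(0.5, 0.25);  /* median of the Geometric(0.25) distribution */
print q;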


In summary, you can compute the CDF and quantile functions for a discrete distribution directly from the PDF. The process was illustrated by using the beta-binomial distribution. The CDF at X=x is the sum of the PDF evaluated for all values less than x. The quantile for p is the smallest value of x for which CDF(x) ≥ p.



Catch run-time errors in SAS/IML programs

Did you know that a SAS/IML function can recover from a run-time error? You can specify how to handle run-time errors by using a programming technique that is similar to the modern "try-catch" technique, although the SAS/IML technique is an older implementation.

Preventing errors versus handling errors

In general, SAS/IML programmers should try to detect potential problems and prevent errors from occurring. For example, before you compute the square root of a number, you should test whether the number is greater than or equal to zero. If not, you can handle the bad input. By testing the input value before you call the SQRT function, you prevent a run-time error.

However, sometimes you don't know whether a computation will fail until you attempt it. Other times, the test to determine whether an error will occur is very expensive, and it is cheaper to attempt the computation and handle the error if it occurs.

A canonical example is computing the inverse of a square matrix. It is difficult to test whether a given matrix is numerically singular. You can compute the rank of the matrix or the condition number of the matrix, but both computations are expensive because they rely on the SVD or eigenvalue decomposition. Cheaper methods include computing a determinant or using Gaussian elimination, but these methods can be numerically unstable and are not good indicators of whether a matrix is numerically singular.

Run-time error in SAS/IML modules

To understand how to handle run-time errors in SAS/IML modules, recall how the execution queue works: when a run-time error occurs in a module, the module pauses and waits for additional commands. However, if there are statements in the execution queue, those statements will be executed.

You can use these facts to handle errors. At the top of the module, use the PUSH statement to add error-handling statements into the execution queue. If an error occurs, the module pauses, sees that there are statements in the queue, and executes those statements. If one of the statements contains a RESUME statement, the module resumes execution.

Push, pause, execute, and resume

Let's apply these ideas to computing the inverse of a square matrix. The following SAS/IML function attempts to compute the inverse of the input matrix. If the matrix is singular, the computation fails and the module handles that error. The main steps of the module are as follows:

  1. Define the statements that will be executed in the event of an error. The last executable statement should be the RESUME statement. (Technical point: The RESUME statement is only valid when the module is paused. Surround the statements with IF-THEN logic to prevent the error-handling statements from running when the module exits normally.)
  2. Push these statements into the execution queue.
  3. If an error occurs, the statements in the queue will be executed. The RESUME statement clears the error and resumes execution of the module. (If no error occurs, the queue is flushed after the RETURN statement.)
proc iml;
/* If the matrix is not invertible, return a missing value. 
   Otherwise, return the inverse. */
start InvEx(A); 
   errFlag = 1;           /* set flag. Assume we will get an error :-) */   
   on_error = "if errFlag then do;  AInv = .;  resume; end;";
   call push(on_error);   /* PUSH code that will be executed if an error occurs */
   AInv = inv(A);         /* if error, AInv set to missing and function resumes */
   errFlag = 0;           /* remove flag for normal exit from module */
   return ( AInv );
finish;

A1 = {1 1,
      0 1};
B1 = InvEx(A1);
A2 = {1 1,
      1 1};
B2 = InvEx(A2);
print B1, B2;

The function is called first with a nonsingular matrix, A1. The INV function does not fail, so the module exits normally. When the module exits, it flushes the execution queue. Because ErrFlag=0, the pushed statements have no effect. B1 is equal to the inverse of A1.

The function is called next with a singular matrix, A2. The INV function encounters an error and the module pauses. When the module pauses, it executes the statements in the queue. Because ErrFlag=1, the statements set AINV to a missing value and resume the module execution. The module exits normally and the B2 value is set to missing.


In summary, you can implement statements that handle run-time errors in SAS/IML modules. The key is to push error-handling code into a queue. The error-handling statements should include a RESUME statement. If the module pauses due to an error, the statements in the queue are executed and the module resumes execution. Be aware that the queue is flushed when the module exits normally, so use a flag variable and IF-THEN logic to control how the statements behave when there is no error.

