11
Sep

Symbolic derivatives in SAS

Did you know that you can get SAS to compute symbolic (analytical) derivatives of simple functions, including applying the product rule, quotient rule, and chain rule? SAS can form the symbolic derivatives of single-variable functions and partial derivatives of multivariable functions. Furthermore, the derivatives are output in a form that can be pasted into a SAS program. The trick is to use PROC NLIN with the LIST option.

In SAS/IML, the nonlinear optimization routines will use analytical derivatives if they are provided; otherwise, they will automatically generate numerical finite-difference approximations to the derivatives. I rarely specify derivatives because doing so can be time-consuming and prone to error. But recently I realized that PROC NLIN and other SAS procedures automatically generate symbolic derivatives, and you can "trick" the procedure into displaying the derivatives so that they can be used in other applications. To get PROC NLIN to output symbolic derivatives, do the following:

  • Create a data set to use for the DATA= option of PROC NLIN.
  • List the variables that you want to take derivatives of on the PARMS statement. In open code, assign a value to constants that appear in the function.
  • Specify the function to differentiate on the MODEL statement.

Symbolic derivatives of functions of one variable

Here is a one-variable example. Suppose you want to take the derivative of
f(x) = x^3 + x sin(x) + exp(x^2).
The function and its derivative are shown to the right.

The following SAS statements create a "fake" data set that defines a variable named F. The call to PROC NLIN sets up the problem. The LIST option on the PROC NLIN statement generates the 'ProgList' table, which contains the derivative function:

data _dummy;
   f = 1;                                    /* name of function */
run;
 
proc nlin data=_dummy list;
   parms x=0;                                /* list variables */
   model f = x**3 + x*sin(x) + exp(x**2);    /* define function */
   ods select ProgList;                      /* output symbolic derivatives */
run;
Symbolic derivatives in SAS for a univariate function

The output shows the expressions for the function ("MODEL.f") and the derivative with respect to x. The derivative df/dx is written as "@MODEL.f/@x", where the symbol '@' is used for the 'd'. You can see that SAS took the simple derivative of the x^3 term, applied the product rule to the x*sin(x) term, and applied the chain rule to the exp(x^2) term. You can also see that the expressions are written in SAS notation, with "**" indicating the power operator. SAS functions are written in upper case (SIN, COS, and EXP).

Symbolic partial derivatives of multivariate functions

In exactly the same way, you can obtain partial derivatives. In a partial derivative, all variables are held constant except one. Suppose you have the function for the normal density,
f(x) = 1/(sqrt(2 π) σ) exp( -(x - μ)**2 / (2 σ**2) )
Suppose that you want to compute the partial derivatives ∂f/∂μ and ∂f/∂σ. To compute these derivatives, list μ and σ on the PARMS statement and assign values to the other symbols (x and π) that appear in the function. It does not matter what values you assign to x and π; you are just preventing PROC NLIN from complaining that there are unassigned variables. You could assign the value 1 to both variables, but I prefer to give π its actual value. In the following program, note that PROC NLIN reuses the fake data set that defines the F variable:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   model f = exp( -(x-mu)**2 / (2*sigma**2) ) / (sqrt(2*pi)*sigma); /* define function */
   ods select ProgList;                      /* output symbolic derivatives */
run;
Symbolic partial derivatives in SAS for a multivariate function

The derivative with respect to σ requires using the chain rule and the quotient rule. Notice that the derivative includes a call to the original function ("MODEL.f"). Notice also that there is repetition of various expressions such as -(x - μ)**2 / (2 σ**2). If you copy and paste these expressions into a SAS program that computes the derivative, the computation will be slightly inefficient because the program will evaluate the same expression multiple times. One way to improve the efficiency is to collect sub-expressions into a temporary variable, as shown in the next section.

Sub-expressions and the chain rule

An experienced statistician will recognize that the expression z = (x - μ) / σ arises naturally as the standardized variable. Using such an expression might simplify the derivatives.

The following program computes the same derivatives as before, except this time the programmer defines the expression z = (x - μ) / σ and uses z in the function. When SAS takes the derivative, it computes dz/dμ and dz/dσ as part of the computation, as shown below:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   const = sqrt(2*pi);
   z = (x-mu) / sigma;                        /* intermediate expression */
   model f = exp(-z**2 / 2) / (const*sigma);  /* define function */
   ods select ProgList;                       /* output symbolic derivatives */
run;
Symbolic derivatives in SAS that use sub-expressions and apply the chain rule

Notice that the derivatives use several sub-expressions such as dz/dμ and dz/dσ. This computation is more efficient than the previous one because it reuses previously computed quantities. For another example, define f1 = 1 / (sqrt(2*pi)*sigma) and f2 = exp(-z**2 / 2) and define the MODEL as f = f1*f2. You will see that each term in the derivative is efficiently evaluated in terms of previously computed expressions.
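
If you want to try that variation, the following call is a minimal sketch; it reuses the _dummy data set and the intermediate expression z from the previous example:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   z = (x-mu) / sigma;                        /* intermediate expression */
   f1 = 1 / (sqrt(2*pi)*sigma);               /* sub-expression: normalizing constant */
   f2 = exp(-z**2 / 2);                       /* sub-expression: exponential kernel */
   model f = f1*f2;                           /* define function as a product */
   ods select ProgList;                       /* output symbolic derivatives */
run;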

Discussion of symbolic derivatives

As I said previously, I do not usually specify analytical derivatives for my optimizations. I find that numerical derivatives (which are automatically computed) are fast and accurate for 99% of the optimizations that I perform.

Furthermore, some of the objective functions that I optimize do not have analytical derivatives. Others are matrix computations for which it would be difficult and time-consuming to write down the analytical derivative. (For an example, see the article on the derivative of the bivariate normal cumulative distribution.)

If you decide to specify an analytical derivative in SAS, make sure that the derivative is correct and efficient. The technique in this article will help to ensure that the derivative is correct. By strategically defining sub-expressions as in the previous section, you can help SAS generate derivatives that are computationally efficient.

Be aware that the derivatives are generated by applying the chain rule, product rule, and quotient rule without any algebraic simplifications of the result. Thus the result can be unnecessarily complicated. For an example, ask SAS to display the derivative of log( (1+x)/(1-x) ) / 2. The naive result is complicated. If you rewrite the function as log(1+x)/2 - log(1-x)/2, then the derivative is much simpler. However, in neither case will SAS simplify the derivative to obtain 1 / (1-x**2).
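
If you want to see the difference for yourself, the following calls are a sketch (reusing the _dummy data set): the first uses the naive form and the second uses the rewritten form.

proc nlin data=_dummy list;
   parms x=0.5;                               /* list variable */
   model f = log( (1+x)/(1-x) ) / 2;          /* naive form: complicated derivative */
   ods select ProgList;
run;
 
proc nlin data=_dummy list;
   parms x=0.5;
   model f = log(1+x)/2 - log(1-x)/2;         /* rewritten form: simpler derivative */
   ods select ProgList;
run;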

For more of my thoughts on using derivatives in SAS, see "Two hints for specifying derivatives." What about your thoughts? What do you think about using PROC NLIN to generate automatic derivatives of simple functions? Do you think this technique will be useful to you?


28
Aug

The singular value decomposition: A fundamental technique in multivariate data analysis

The singular value decomposition (SVD) could be called the "billion-dollar algorithm" since it provides the mathematical basis for many modern algorithms in data science, including text mining, recommender systems (think Netflix and Amazon), image processing, and classification problems. Although the SVD was mathematically discovered in the late 1800s, computers have made the SVD an indispensable tool in computational statistics and data science.

SVD: A fundamental theorem of linear algebra

Mathematically, the singular value decomposition is a fundamental theorem of linear algebra. (You could argue that it is THE fundamental theorem, but Gil Strang names a different result.) The singular value decomposition says that every n x p matrix can be written as the product of three matrices: A = U Σ V^T, where

  • U is an orthogonal n x n matrix.
  • Σ is a diagonal n x p matrix. In practice, the diagonal elements are ordered so that Σ_ii ≥ Σ_jj for all i < j.
  • V is an orthogonal p x p matrix and V^T denotes its transpose.

The SVD represents the essential geometry of a linear transformation. It tells us that every linear transformation is a composition of three fundamental actions. Reading the equation from right to left:

  1. The matrix V represents a rotation or reflection of vectors in the p-dimensional domain.
  2. The matrix Σ represents a linear dilation or contraction along each of the p coordinate directions. If n ≠ p, this step also canonically embeds (or projects) the p-dimensional domain into (or onto) the n-dimensional range.
  3. The matrix U represents a rotation or reflection of vectors in the n-dimensional range.

Thus the SVD specifies that every linear transformation is fundamentally a rotation or reflection, followed by a scaling, followed by another rotation or reflection. The Strang (1993) article about the fundamental theorem of linear algebra includes the following geometric interpretation of the singular value decomposition of a 2 x 2 matrix:

Geometric interpretation of the singular value decomposition (SVD) as the product of a rotation/reflection, followed by a scaling, followed by another rotation/reflection.

The diagram shows that the transformation induced by the matrix A (the long arrow across the top of the diagram) is equivalent to the composition of the three fundamental transformations, namely a rotation, a scaling, and another rotation.

SVD: The fundamental theorem of multivariate data analysis

Because of its usefulness, the singular value decomposition is a fundamental technique for multivariate data analysis. A common goal of multivariate data analysis is to reduce the dimension of the problem by choosing a small linear subspace that captures important properties of the data. The SVD is used in two important dimension-reducing operations:

  • Low-rank approximations: Recall that the diagonal elements of the Σ matrix (called the singular values) in the SVD are computed in decreasing order. The SVD has a wonderful mathematical property: if you choose some integer k ≥ 1 and let D be the diagonal matrix formed by replacing all singular values after the kth by 0, then the matrix U D V^T is the best rank-k approximation to the original matrix A. (A short SAS/IML sketch follows this list.)
  • Principal components analysis: The principal component analysis is usually presented in terms of eigenvectors of a correlation matrix, but you can show that the principal component analysis follows directly from the SVD. (In fact, you can derive the eigenvalue decomposition of a matrix from the SVD.) The principal components of A^T A are the columns of the V matrix; the scores are the columns of U. Dimension reduction is achieved by truncating the number of columns in U, which results in the best rank-k approximation of the data.
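
The following SAS/IML statements sketch the low-rank idea on a tiny example matrix; the same pattern applies to a real n x p data matrix:

proc iml;
A = {1 2,  3 4,  5 6,  7 8};        /* small 4 x 2 example matrix */
call svd(U, D, V, A);               /* A = U*diag(D)*V` */
k = 1;                              /* keep only the k largest singular values */
Dk = D;
Dk[(k+1):nrow(D)] = 0;              /* zero out the remaining singular values */
Ak = U * diag(Dk) * V`;             /* best rank-k approximation to A */
print Ak[f=6.3];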

Compute the SVD in SAS

In SAS, you can use the SVD subroutine in SAS/IML software to compute the singular value decomposition of any matrix. To save memory, SAS/IML computes a "thin SVD" (or "economical SVD"), which means that the U matrix is an n x p matrix. This is usually what the data analyst wants, and it saves considerable memory in the usual case where the number of observations (n) is much larger than the number of variables (p). Technically speaking, the U for an "economical SVD" is suborthogonal: U^T U is the identity when n ≥ p and U U^T is the identity when n ≤ p.

As with eigenvectors, the columns of U and V are not unique, so be careful if you compare the results in SAS to the results from MATLAB or R. The following example demonstrates a singular value decomposition for a 3 x 2 matrix A. For the full SVD, the U matrix would be a 3 x 3 matrix and Σ would be a 3 x 2 diagonal matrix. For the "economical" SVD, U is 3 x 2 and Σ is a 2 x 2 diagonal matrix, as shown:

proc iml;
A = {0.3062 0.1768,
    -0.9236 1.0997,
    -0.4906 1.3497 };
 
call svd(U, D, V, A);  /* A = U*diag(D)*V` */
print U[f=6.3], D[f=6.3], V[f=6.3];

Notice that, to save memory, only the diagonal elements of the matrix Σ are returned in the vector D. You can explicitly form Σ = diag(D) if desired. Geometrically, the linear transformation that corresponds to V rotates a vector in the domain by about 30 degrees, the matrix D scales it by 2 and 0.5 in the coordinate directions, and U embeds it into a three-dimensional space and applies another rotation.
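
As a quick check (continuing the PROC IML session above), you can form Σ = diag(D) and verify the factorization numerically:

S = diag(D);                  /* 2 x 2 diagonal matrix of singular values */
Ahat = U * S * V`;            /* reconstruct A from the factors */
maxDiff = max(abs(A - Ahat)); /* should be zero except for rounding error */
print maxDiff;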

My next blog post shows how the SVD enables you to reduce the dimension of a data matrix by using a low-rank approximation, which has applications to image compression and de-noising.


23
Aug

The arithmetic-geometric mean

All statisticians are familiar with the classical arithmetic mean. Some statisticians are also familiar with the geometric mean. Whereas the arithmetic mean of n numbers is the sum divided by n, the geometric mean of n nonnegative numbers is the nth root of the product of the numbers. The geometric mean is never greater than the arithmetic mean.

Did you know that there is a hybrid quantity, called the arithmetic-geometric mean, which is defined by combining the two quantities? The arithmetic-geometric mean is only defined for two positive numbers, x and y. It is defined as the limit of an alternating iterative process:

  • Define a_1 = (x + y)/2 and g_1 = sqrt(x y) to be the arithmetic and geometric means, respectively.
  • Iteratively define a_{n+1} = (a_n + g_n)/2 and g_{n+1} = sqrt(a_n g_n).

Interestingly, this process always converges. The number that it converges to is called the arithmetic-geometric mean of x and y, which is denoted by AGM(x,y). The AGM is between the geometric mean and the arithmetic mean. I am not aware of a multivariate generalization of the AGM function.

In SAS you can use the SAS/IML language or PROC FCMP to create a user-defined function that computes the arithmetic-geometric mean. Since the AGM is not a vector operation, I show the PROC FCMP approach:

proc fcmp outlib=work.funcs.MathFuncs;
function AGM(x, y);
   if x<0 | y<0 then return( . );
   epsilon = 1e-10;               /* precision of computation */
   a = mean(x, y);   
   g = geomean(x,y);   
   do while (abs(a - g) > epsilon);
      a_new = mean(a, g);   
      g_new = geomean(a, g);   
      a = a_new; 
      g = g_new;
   end;
   return( mean(a, g) );
endsub;
quit;
 
/* test the function */
options cmplib=work.funcs;  /* define location of function */
data _NULL_;
   agm = AGM(1, 0.8); 
   put agm 18.14;
run;
   0.89721143211504

An example of calling the new AGM function is shown for the two numbers 1 and 0.8. The arithmetic mean of 1 and 0.8 is 0.9. The geometric mean of 1 and 0.8 is sqrt(0.8) = 0.8944. The computation shows that the arithmetic-geometric mean is 0.8972, which is between the arithmetic and geometric means.

The following program computes the graph of the AGM function and displays a contour plot of the result. Note that AGM(x, x) = x for any nonnegative value of x.

data grid;
do x = 0 to 5 by 0.1;
   do y = 0 to 5 by 0.1;
      agm = AGM(x, y);
      output;
   end;
end;
run;
 
/* see http://blogs.sas.com/content/iml/2012/07/02/create-a-contour-plot-in-sas.html */
proc sgrender data=grid template=ContourPlotParm;
dynamic _TITLE="Graph of Arithmetic-Geometric Mean AGM(x,y)"
        _X="x" _Y="y" _Z="AGM";
run;
Graph of the arithmetic-geometric mean

So, what is the arithmetic-geometric mean good for? The AGM seems mainly useful in applied mathematics and number theory for computing elliptic functions and elliptic integrals. I am not aware of any applications to statistics, perhaps because the AGM is not defined for a sample that contains n elements when n > 2.

Although the AGM function is not used in statistics, the iterative algorithm that defines the AGM is a technique that I have seen often in computational statistics. When a researcher wants two conditions to hold simultaneously, one possible method is to alternately solve for each condition and prove that this iterative process converges to a solution that satisfies both conditions simultaneously. The iterative process can be an effective way to implement an optimization problem if each sub-optimization is simple.

In multivariate statistics, this technique is used in the method of alternating least squares, which is used in several SAS procedures such as PROC TRANSREG, VARCLUS, and MDS. In linear algebra, this technique is used in the method of alternating projections. I have used this technique to compute the closest correlation matrix to an arbitrary matrix.

One last interesting fact. There is another kind of mean called the harmonic mean. You can compute the arithmetic-harmonic mean by applying the same iterative algorithm, but replace the GEOMEAN function with the HARMEAN function. The process converges, and the arithmetic-harmonic mean equals the geometric mean! Thus although this two-step definition might seem contrived, it is "natural" in the sense that it produces the geometric mean (of two numbers) from the arithmetic and harmonic means.
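
The following PROC FCMP sketch implements that variation by swapping the GEOMEAN call for HARMEAN in the earlier AGM function; the returned value should match the geometric mean of the inputs, geomean(1, 0.8) ≈ 0.8944:

proc fcmp outlib=work.funcs.MathFuncs;
function AHM(x, y);               /* arithmetic-harmonic mean */
   if x<=0 | y<=0 then return( . );
   epsilon = 1e-10;               /* precision of computation */
   a = mean(x, y);
   h = harmean(x, y);
   do while (abs(a - h) > epsilon);
      a_new = mean(a, h);
      h_new = harmean(a, h);
      a = a_new;
      h = h_new;
   end;
   return( mean(a, h) );
endsub;
quit;
 
options cmplib=work.funcs;        /* define location of function */
data _NULL_;
   ahm = AHM(1, 0.8);
   put ahm 18.14;
run;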


7
Aug

The curse of non-unique eigenvectors

A SAS customer asked, "I computed the eigenvectors of a matrix in SAS and in another software package. I got different answers. How do I know which answer is correct?"

I've been asked variations of this question dozens of times. The answer is usually "both answers are correct."

The mathematical root of the problem is that eigenvectors are not unique. It is easy to show this: If v is an eigenvector of the matrix A, then by definition A v = λ v for some scalar eigenvalue λ. Notice that if you define u = α v for a scalar α ≠ 0, then u is also an eigenvector because A u = α A v = α λ v = λ u. Thus a multiple of an eigenvector is also an eigenvector.

Most statistical software (including SAS) tries to partially circumvent this problem by standardizing an eigenvector to have unit length (|| v || = 1). However, note that v and -v are both eigenvectors that have the same length. Thus even a standardized eigenvector is only unique up to a ± sign, and different software might return eigenvectors that differ in sign. In fact for some problems, the same software can return different answers when run on different operating systems (Windows versus Linux), or when using vendor-supplied basic linear algebra subroutines such as the Intel Math Kernel Library (MKL).

To further complicate the issue, software might sort the eigenvalues and eigenvectors in different ways. Some software (such as MATLAB) orders eigenvalues by magnitude, which is the absolute value of the eigenvalue. Other software (such as SAS) orders eigenvalues according to the value (of the real part) of the eigenvalues. (For most statistical computations, the matrices are symmetric and positive definite (SPD). For SPD matrices, which have real nonnegative eigenvalues, these two orderings are the same.)

Eigenvectors of an example matrix

To illustrate the fact that different software and numerical algorithms can produce different eigenvectors, let's examine the eigenvectors of the following 3 x 3 matrix:

The eigenvectors of this matrix will be computed by using five different software packages: SAS, Intel's MKL, MATLAB, Mathematica, and R. The eigenvalues for this matrix are unique and are approximately 16.1, 0, and -1.1. Notice that this matrix is not positive definite, so the order of the eigenvectors will vary depending on the software. Let's compute the eigenvectors in five different ways.

Method 1: SAS/IML EIGEN Call: The following statements compute the eigenvalues and eigenvectors of M by using a built-in algorithm in SAS. This algorithm was introduced in SAS version 6 and was the default algorithm until SAS 9.4.

proc iml;
reset FUZZ;         /* print very small numbers as 0 */
M = {1  2  3,
     4  5  6,
     7  8  9};
reset EIGEN93;      /* use "9.3" algorithm; no vendor BLAS (option required for SAS 9.4m3) */
call eigen(EigVal, SASEigVec, M);
print EigVal, SASEigVec[colname=("E1":"E3")];

Notice that the eigenvalues are sorted by their real part, not by their magnitude. The eigenvectors are returned in a matrix. The ith column of the matrix is an eigenvector for the ith eigenvalue. Notice that the eigenvector for the largest eigenvalue (the first column) has all positive components. The eigenvector for the zero eigenvalue (the second column) has a negative component in the second coordinate. The eigenvector for the negative eigenvalue (the third column) has a negative component in the third coordinate.

Method 2: Intel MKL BLAS: Starting with SAS/IML 14.1, you can instruct SAS/IML to call the Intel Math Kernel Library for eigenvalue computation if you are running SAS on a computer that has the MKL installed. This feature is the default behavior in SAS/IML 14.1 (SAS 9.4m3), which is why the previous example used RESET EIGEN93 to get the older "9.3 and before" algorithm. The output for the following statements assumes that you are running SAS 9.4m3 or later and your computer has Intel's MKL.

reset NOEIGEN93;   /* use Intel MKL, if available */
call eigen(EigVal, MKLEigVec, M);
print MKLEigVec[colname=("E1":"E3")];

This is a different result than before, but it is still a valid set of eigenvectors. The first and third eigenvectors are the negatives of the eigenvectors in the previous experiment. The eigenvectors are sorted in the same order, but that is because SAS (for consistency with earlier releases) internally sorts the eigenvectors that the MKL returns.

Method 3: MATLAB: The following MATLAB statements compute the eigenvalue and eigenvectors for the same matrix:

M = [1, 2, 3; 4, 5, 6; 7, 8, 9];
[EigVec, EigVal] = eig(M);
EigVec =
-0.2320   -0.7858    0.4082
-0.5253   -0.0868   -0.8165
-0.8187    0.6123    0.4082

The eigenvalues are not displayed, but you can tell from the output that the eigenvalues are ordered by magnitude: 16.1, -1.1, and 0. The eigenvectors are the same as the MKL results (within rounding precision), but they are presented in a different order.

Method 4: Mathematica: This example matrix is used in the Mathematica documentation for the Eigenvectors function:

Eigenvectors[{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}]
0.283349  -1.28335    1
0.641675  -0.141675  -2
1          1          1

This is a different result, but still correct. The symbolic computations in Mathematica do not standardize the eigenvectors to unit length. Instead, they standardize them to have a 1 in the last component. The eigenvalues are sorted by magnitude (like the MATLAB output), but the first column has opposite signs from the MATLAB output.

Method 5: R: The R documentation states that the eigen function in R calls the LAPACK subroutines. Thus I expect it to get the same result as MATLAB.

M <- matrix(c(1:9), nrow=3, byrow=TRUE)
r <- eigen(M)
r$vectors
           [,1]        [,2]       [,3]
[1,] -0.2319707 -0.78583024  0.4082483
[2,] -0.5253221 -0.08675134 -0.8164966
[3,] -0.8186735  0.61232756  0.4082483

Except for rounding, this result is the same as the MATLAB output.

Summary

This article used a simple 3 x 3 matrix to demonstrate that different software packages might produce different eigenvectors for the same input matrix. There were four different answers produced, all of which are correct. This is a result of the mathematical fact that eigenvectors are not unique: any nonzero multiple of an eigenvector is also an eigenvector! Different numerical algorithms can produce different eigenvectors, and this is compounded by the fact that you can standardize and order the eigenvectors in several ways.

Although it is hard to compare eigenvectors from different software packages, it is not impossible. First, make sure that the eigenvectors are ordered the same way. (You can skip this step for symmetric positive definite matrices.) Then make sure they are standardized to unit length. If you do those two steps, then the eigenvectors will agree up to a ± sign.
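
As a sketch of that comparison, the following SAS/IML statements use two hypothetical eigenvector matrices (unit-length columns, sorted in the same order) and check whether corresponding columns agree up to a ± sign:

proc iml;
/* hypothetical eigenvector matrices from two packages (columns are eigenvectors) */
V1 = { 0.2320  0.7858  0.4082,
       0.5253  0.0868 -0.8165,
       0.8187 -0.6123  0.4082};
V2 = {-0.2320 -0.7858  0.4082,
      -0.5253 -0.0868 -0.8165,
      -0.8187  0.6123  0.4082};
agree = j(1, ncol(V1), 0);               /* does the ith pair agree up to sign? */
do i = 1 to ncol(V1);
   sameSign = max(abs(V1[,i] - V2[,i])) < 1e-4;
   oppSign  = max(abs(V1[,i] + V2[,i])) < 1e-4;
   agree[i] = (sameSign | oppSign);
end;
print agree;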


1
Mar

Monte Carlo estimates of joint probabilities

Monte Carlo techniques have many applications, but a primary application is to approximate the probability that some event occurs. The idea is to simulate data from the population and count the proportion of times that the event occurs in the simulated data.

For continuous univariate distributions, the probability of an event is the area under a density curve. The integral of the density from negative infinity to a particular value is the definition of the cumulative distribution function (CDF) for a distribution. Instead of performing numerical integration, you can use Monte Carlo simulation to approximate the probability.

One-dimensional CDFs

In SAS software, you can use the CDF function to compute the CDF of many standard univariate distributions. For example, the statement prob = cdf("Normal", -1) computes the probability that a standard normal random variable takes on a value less than -1.

The CDF function is faster and more accurate than a Monte Carlo approximation, but let's see how the two methods compare. You can estimate the probability P(X < -1) by generating many random values from the N(0,1) distribution and computing the proportion that is less than -1, as shown in the following SAS DATA step:

data _NULL_;
call streaminit(123456);
N = 10000;                       /* sample size */
do i = 1 to N;
   x = rand("Normal");           /* X ~ N(0,1) */
   cnt + (x < -1);               /* sum of counts for value less than -1 */
end;
Prob = cdf("Normal", -1);        /* P(x< -1) */
MCEst = cnt / N;                 /* Monte Carlo approximation */
put Prob=;
put MCEst=;
run;
Prob=0.1586552539
MCEst=0.1551

The Monte Carlo estimate is correct to two decimal places. The accuracy of this Monte Carlo computation is proportional to 1/sqrt(N), where N is the size of the Monte Carlo sample. Thus if you want to double the accuracy you need to quadruple the sample size.
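
The following DATA step is a small sketch of that scaling: it quadruples the sample size twice and prints the absolute error of each estimate. For any single run the decrease is only approximate, but on average the error shrinks by a factor of two each time the sample size is quadrupled.

data MCAccuracy;
call streaminit(123456);
Prob = cdf("Normal", -1);            /* exact probability */
do N = 1e4, 4e4, 16e4;               /* quadruple the sample size each time */
   cnt = 0;
   do i = 1 to N;
      cnt + (rand("Normal") < -1);   /* count values less than -1 */
   end;
   MCEst = cnt / N;
   AbsError = abs(MCEst - Prob);
   output;
end;
keep N MCEst AbsError;
run;
 
proc print data=MCAccuracy noobs; run;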

Two-dimensional CDFs

SAS provides the PROBBNRM function for computing the CDF of a bivariate normal distribution, but does not provide a built-in function that computes the CDF for other multivariate probability distributions. However, you can use Monte Carlo techniques to approximate multivariate CDFs for any multivariate probability distribution for which you can generate random variates.

I have previously blogged about how to use the PROBBNRM function to compute the bivariate normal CDF. The following SAS/IML statements demonstrate how to use a Monte Carlo computation to approximate the bivariate normal CDF. The example uses a bivariate normal random variable Z ~ MVN(0, Σ), where Σ is the correlation matrix with Σ_12 = 0.6.

The example computes the probability that a bivariate normal random variable is in the region G = {(x,y) | x<x0 and y<y0}. The program first calls the built-in PROBBNRM function to compute the probability. Then the program calls the RANDNORMAL function to generate 100,000 random values from the bivariate normal distribution. A binary vector (group) indicates whether each observation is in G. The MEAN function computes the proportion of observations that are in the region.

proc iml;
x0  = 0.3; y0 = 0.4; rho = 0.6; 
Prob = probbnrm(x0, y0, rho);      /* P(x<x0 and y<y0) */
 
call randseed(123456);
N = 1e5;                           /* sample size */
Sigma = (1 || rho) // (rho || 1);  /* correlation matrix with correlation rho */
mean = {0 0};
Z = randnormal(N, mean, Sigma);    /* sample from MVN(0, Sigma) */
group = (Z[,1] < x0 & Z[,2] < y0); /* binary vector */
MCEst = mean(group);               /* = sum(group=1) / N  */
print Prob MCEst;

You can use a scatter plot to visualize the Monte Carlo technique. The following statements create a scatter plot and use the DROPLINE statement in PROC SGPLOT to indicate the region G. Of the 100000 random observations, 49750 of them were in the region G. These observations are drawn in red. The observations that are outside the region are drawn in blue.

ods graphics / width=400px height=400px;
title "Estimate of P(x < x0  and y < y0) is 0.4978";
title2 "x0 = 0.3; y0 = 0.4; rho = 0.6";
call scatter(Z[,1], Z[,2]) group=group grid={x y}
     procopt="noautolegend aspect=1"
     option="transparency=0.9  markerattrs=(symbol=CircleFilled)"
     other="dropline x=0.3 y=0.4 / dropto=both;";

Higher dimensions

The Monte Carlo technique works well in low dimensions. As the dimensions get larger, you need to generate a lot of random variates in order to obtain an accurate estimate. For example, the following statements generate 10 million random values from the five-dimensional distribution of uncorrelated normal variates and estimate the probability of all components being less than 0:

d = 5;                         /* dimension */
N = 1e7;                       /* sample size */
mean = j(1, d, 0);             /* {0,0,...,0} */
Z = randnormal(N, mean, I(d)); /* Z ~  MVN (0, I)  */
v0 = {0 0 0 0 0};              /* cutoff values in each component */
ComponentsInRegion = (Z < v0)[,+]; /* number of components in region */
group = (ComponentsInRegion=d);    /* binary indicator vector */
MCEst = mean(group);               /* proportion of obs in region */
print (1/2**d)[label="Prob"] MCEst;

Because the normal components are independent, the joint probability is the product of the probabilities for each component: 1/2^5 = 0.03125. The Monte Carlo estimate is accurate to three decimal places.

The Monte Carlo technique can also handle non-rectangular regions. For example, you can compute the probability that a random variable is in a spherical region.
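
For example, the following statements (continuing the PROC IML session above) sketch the computation of the probability that a trivariate standard normal vector falls inside the unit sphere; the exact value is available for comparison because the squared distance from the origin follows a chi-square distribution with d degrees of freedom:

d = 3;                              /* dimension */
N = 1e6;                            /* sample size */
r = 1;                              /* radius of spherical region */
Z = randnormal(N, j(1,d,0), I(d));  /* Z ~ MVN(0, I) */
dist2 = Z[,##];                     /* squared distance from origin for each row */
MCEst = mean(dist2 < r**2);         /* proportion of obs inside the sphere */
Prob = cdf("ChiSquare", r**2, d);   /* exact probability */
print Prob MCEst;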

The Monte Carlo method is computationally expensive for high-dimensional distributions. In high dimensions (say, d > 10), you might need billions of random variates to obtain a reasonable approximation to the true probability. This is another example of the curse of dimensionality.


16
Nov

Need to log-transform a distribution? There's a SAS function for that!

At a conference last week, a presenter showed SAS statements that compute the logarithm of a probability density function (PDF). The log-PDF is a common computation because it occurs when maximizing the log-likelihood function.

The presenter computed this quantity in SAS by using an expression that looked like
y = LOG(PDF("distribution", x, params));
In other words, he computed the PDF and then transformed the density by applying the LOG function.

There is a better way. It is more efficient and more accurate to use the LOGPDF function to compute the log-PDF directly.

Log-transformed distributions

SAS provides several functions for computing with log-transformed distributions. In addition to the LOGPDF function, you can use the LOGCDF function to compute probabilities for the log-distribution.

Let's use the gamma distribution to see why the LOGPDF function is usually more efficient. The gamma density function with unit scale parameter is given by
f(x; a) = x^(a-1) exp(-x) / Γ(a)
where Γ(a) is the value of the gamma function evaluated at a.

The log-transform of the gamma density is
log(f(x; a)) = (a-1) log(x) - x - log(Γ(a))
This formula, which is used when you compute LOGPDF("gamma", x, 2), is cheap to evaluate and does not suffer from numerical underflow when x is large. In particular, recall that in double-precision arithmetic, exp(-x) underflows if x > 709.78, so the PDF function cannot support extremely large values of x. In contrast, the formula for the log-PDF does not experience any numerical problems when x is large. The log-PDF function is also very fast to compute.

The following DATA step illustrates how to use the LOGPDF function to compute the logarithm of the gamma density. It computes the log of the gamma PDF in two ways: by using the LOGPDF function and by using the log-transform of the gamma PDF. The results show that the numerical values are nearly the same, but that the PDF function fails for large values of x:

data LogPDF(drop=a);
a = 2;
do x = 100, 700, 720, 1000;
   LogOfPDF = log( pdf("Gamma", x, a) ); 
   LOGPDF = logpdf("Gamma", x, a);        /* faster and more accurate */
   diff = LOGPDF - LogOfPDF;
   output;
end;
label LOGOfPDF = "LOG(PDF(x))";
run;
 
proc print noobs label; run;
Density of the log-gamma(2) distribution for large x

By the way, SAS provides other functions that are optimized for computing the log-transformation of several well-known functions and quantities:

  • The LOGBETA function computes the logarithm of the beta function. You should use logbeta(a,b) instead of log(beta(a,b)).
  • The LGAMMA function computes the logarithm of the gamma function. You should use lgamma(a) instead of log(gamma(a)).
  • The LCOMB function computes the logarithm of the number of combinations of n objects taken k at a time. You should use lcomb(n,k) instead of log(comb(n,k)).
  • The LFACT function computes the quantity log(n!) for specified values of n. You should use lfact(n) instead of log(fact(n)).

In general, when software provides a function for directly computing the logarithm of a quantity, you should use it. The special-purpose function is typically faster, more accurate, and will handle arguments that are beyond the domain of the non-specialized function.
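
For example, the following DATA step is a small sketch that contrasts the naive and direct computations of log(Γ(a)) near the argument where Γ(a) overflows double precision; for a slightly larger argument the naive computation overflows, whereas LGAMMA continues to work.

data _NULL_;
a = 170;                             /* Gamma(170) is near the overflow limit */
naive  = log( gamma(a) );            /* forms a huge intermediate value first */
direct = lgamma(a);                  /* computes log(Gamma(a)) directly */
put naive= 18.10 direct= 18.10;
run;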

tags: Numerical Analysis


31
Aug

The Lambert W function in SAS

Branches of the real-valued Lambert W function

This article describes how you can evaluate the Lambert W function in SAS/IML software. The Lambert W function is defined implicitly: given a real value x, the function's value w = W(x) is the value of w that satisfies the equation w exp(w) = x. Thus W is the inverse of the function g(w) = w exp(w).

Because the function g has a minimum value at w=-1, there are two branches to the Lambert W function. Both branches are shown in the adjacent graph. The top branch is called the principal (or positive) branch and is denoted Wp. The principal branch has domain x ≥ -1/e and range W(x) ≥ -1. The lower branch, denoted by Wm, is defined on -1/e ≤ x < 0 and has range W(x) ≤ -1. The subscript "m" stands for "minus" since the range contains only negative values. Some authors use W0 for the upper branch and W-1 for the lower branch.

Both branches are used in applications, but the lower branch Wm is useful for the statistical programmer because you can use it to simulate data from a heavy-tailed distribution. Briefly: for a class of heavy-tailed distributions, you can simulate data by using the inverse CDF method, which requires evaluating the Lambert W function.

Defining the Lambert W function in SAS/IML

You can download the SAS/IML functions that evaluate each branch of the Lambert W function. Both functions use an analytical approximation for the Lambert W function, followed by one or two iterations of Halley's method to ensure maximum precision:

  • The LambertWp function implements the principal branch Wp. The algorithm constructs a direct global approximation on [-1/e, ∞) based on Boyd (1998).
  • The LambertWm function implements the second branch Wm. The algorithm follows Chapeau-Blondeau and Monir (2002) (hereafter CBM). The CBM algorithm approximates Wm(x) on three different domains. The approximation uses a series expansion on [-1/e, -0.333), a rational-function approximation on [-0.333, -0.033], and an asymptotic series expansion on (-0.033, 0).

Calling the Lambert W function in SAS/IML

Download the SAS/IML program for this article and save it to a file called LambertW.sas. Then the following SAS/IML program shows how to call the functions for the upper and lower branches and plot the graphs:

%include "C:\MyPath\LambertW.sas";      /* specify path to file */
proc iml;
load module=(LambertWp LambertWm);
 
title "Lambert W Function: Principal Branch";
x = do(-1/exp(1), 1, 0.01);
Wp = LambertWp(x);
call series(x, Wp) grid={x y} other="refline 0/axis=x; refline 0/axis=y";
 
title "Lambert W Function: Lower Branch";
x = do(-1/exp(1), -0.01, 0.001);
Wm = LambertWm(x);
call series(x, Wm) grid={x y};

The graphs are not shown because they are indistinguishable from the graph at the beginning of this article.

You can use a simple error analysis to investigate the accuracy of these functions. After computing w = W(x), apply the inverse function g(w) = w exp(w) and see how close the result is to the input values. The LambertWp and LambertWm functions support an optional second parameter that specifies how many Halley iterations are performed. For LambertWm, only one iteration is performed by default. You can specify "2" as the second parameter to get two Halley iterations. The following statements show that the maximum error for the default computation is 2E-11 on (-1/e, -0.01). The error decreases to 5E-17 if you request two Halley iterations.

x = do(-1/exp(1), -0.01, 0.001);
Wm = LambertWm(x);            /* default precision: 1 Halley iteration */
g = Wm # exp(Wm);             /* apply inverse function */
maxError1 = max(abs(x - g));  /* maximum error */
 
Wm = LambertWm(x, 2);         /* higher precision: 2 Halley iterations */
g = Wm # exp(Wm);
maxError2 = max(abs(x - g));
print maxError1 maxError2;

In summary, the SAS/IML language provides an excellent environment for implementing new functions or statistical procedures that are not built into SAS. In this article I implemented methods from journal articles for evaluating the upper and lower branches of the Lambert W function, which is the inverse function of g(w) = w exp(w). Although the Lambert W function does not have a closed-form solution, you can implement an approximation to the function and then use one or two Halley iterations to quickly converge within machine precision.

tags: Numerical Analysis


24
Aug

Halley's method for finding roots

Edmond Halley (1656-1742) is best known for computing the orbit and predicting the return of the short-period comet that bears his name. However, like many scientists of his era, he was involved in a variety of mathematical and scientific activities. One of his mathematical contributions is a numerical method for finding the roots of a real-valued function. The technique, which is called Halley's method, is similar to Newton's method, but converges more rapidly in the neighborhood of a root.

There are dozens of root-finding methods, but Newton's method is the gold standard for finding roots of general nonlinear functions. It converges quadratically and is easy to implement because the formula requires only the evaluation of a function and its first derivative. Other competing methods do not use derivatives at all, or they use higher-order derivatives.

Halley's method falls into the latter category. Like Newton's method, it requires an initial guess for the root. It evaluates the function and its first two derivatives at an approximate location of the root and uses that information to iteratively improve the approximation. This article compares Halley's method with Newton's method and suggests a class of functions for which Halley's method is preferable.

An example of finding a root in SAS/IML

Consider the function f(x) = x exp(x). If you want to find the values of x for which f(x)=c, you can recast the problem as a root-finding problem and find the roots of the function f(x)-c. For this particular function, the equation f(x) = c has one root if c ≥ 0 and two roots if -1/e < c < 0.

Halley's method: The roots of the function x*exp(x) + 1/(2e)

To be concrete, let c = -1/(2e) so that the equation has two roots. The function f(x)-c is shown to the right and the two roots are marked by red line segments. You can use the built-in FROOT function in SAS/IML to locate the roots. The FROOT function does not use Newton's method. Instead, it requires that you specify an interval on which to search for a root. From the graph, you can specify two intervals that bound roots, as follows:

proc iml;
/* The equation f(x)-c for f(x) = x exp(x) */
start func(x) global(g_target);
   return( x # exp(x) - g_target );
finish;
 
g_target = -1 / (2*constant('e')); /* parameter; function has two roots */
intervals = {-3 -2,                /* one root in [-3,-2] */
             -1  0};               /* another root in [-1,0] */
roots = froot("func", intervals);  /* find roots in each interval */
print roots;

The FROOT function is effective and efficient. It always finds an approximate root provided that you can supply a bounding interval on which the function changes signs. However, sometimes you don't know a bounding interval, you only know an approximate value for the root. In that case, Newton-type methods are preferable.

Newton's method versus Halley's method

Halley's method is an extension of Newton's method that incorporates the second derivative of the target function. Whereas Newton's method iterates the formula x_{n+1} = x_n - f(x_n) / f′(x_n), Halley's method contains an extra term in the denominator:
x_{n+1} = x_n - f(x_n) / [ f′(x_n) - 0.5 f(x_n) f″(x_n) / f′(x_n) ].
Notice that if f″(x_n) = 0, then Halley's method reduces to Newton's method.

For many functions, the computational cost of evaluating the extra term in Halley's formula does not offset the gain in convergence speed. In other words, it is often quicker and easier to use Newton's simpler formula than Halley's more complicated formula. However, Halley's method is worth implementing when the ratio f″(x) / f′(x) can be evaluated cheaply. The current function provides an example: f′(x) = (x+1) exp(x) and f″(x) = (x+2) exp(x), so the ratio is simply (x+2) / (x+1). This leads to the following SAS/IML functions. One function implements Newton's iteration and the other implements Halley's iteration:

/* Compute iterations of Newton's method for several initial guesses:
   f(x) = x#exp(x) - c */
start NewtonIters(x0, numIters=1) global(g_target);
   z = j(numIters+1, ncol(x0));
   z[1,] = x0;
   do i = 2 to numIters+1;
      x = z[i-1,];
      fx = x # exp(x) - g_target;         /* f(x)   */
      dfdx = (1+x) # exp(x);              /* f'(x)  */
      z[i,] = x - fx / dfdx;              /* new approximate root */
   end;
   return z;
finish;
 
/* Compute iterations of Halley's method for several initial guesses:
   f(x) = x#exp(x) - c */
start HalleyIters(x0, numIters=1) global(g_target);
   z = j(numIters+1, ncol(x0));
   z[1,] = x0;
   do i = 2 to numIters+1;
      x = z[i-1,];
      fx = x # exp(x) - g_target;         /* f(x)   */
      dfdx = (1+x) # exp(x);              /* f'(x)  */
      R = (2+x) / (1+x);                  /* ratio f''(x) / f'(x) */
      z[i,] = x - fx / (dfdx - 0.5*fx#R); /* new approximate root */
   end;
   return z;
finish;

Notice that the functions are practically identical. Halley's method computes an extra term (R) and includes that term in the iterative formula that converges to the root. I wrote the functions in vectorized form so that you can pass in multiple initial guesses; the functions iterate all guesses simultaneously. The following statements provide four initial guesses and apply both Newton's and Halley's method:

x0 = {-3 -2 -0.5 0.25};        /* multiple initial guesses */
N = NewtonIters(x0, 5);        /* compute 5 Newton iterations */
H = HalleyIters(x0, 3);        /* compute 3 Halley iterations */                
rNames = "Iter=0":"Iter=5";
print N[L="Newton's Method" c=("Guess1":"Guess4") r=rNames];
print H[L="Halley's Method" c=("Guess1":"Guess4") r=rNames];

The tables show that Halley's method tends to converge to a root faster than Newton's method. For the four initial conditions, Newton's method required three, four, or five iterations to converge to within 1e-6 of the root. In contrast, Halley's method used three or fewer iterations to reach the same level of convergence.

Again, for Halley's method to be useful in practice, the ratio f″(x) / f′(x) must be easy to evaluate. To generalize this example, the ratio simplifies if the function is the product of a simple function and an exponential: f(x) = P(x)*exp(x). For example, if P(x) is a polynomial, then f″ / f′ is a rational function.

In summary, Halley's method is a powerful alternative to Newton's method for finding roots of a function f for which the ratio f″(x) / f′(x) has a simple expression. In that case, Halley's root-finding method is easy to implement and converges to roots of f faster than Newton's method for the same initial guess.

tags: Numerical Analysis


18
May

All I really need to know about Newton's method I learned in primary school

I was eleven years old when I first saw Newton's method.

No, I didn't go to a school for geniuses. I didn't even know it was Newton's method until decades later. However, in sixth grade I learned an iterative algorithm that taught me (almost) everything I need to know about Newton's method for finding the roots of functions.

Newton's method and an iterative square root algorithm

The algorithm I learned in the sixth grade is an iterative procedure for computing square roots by hand. It seems like magic because it estimates a square root from an arbitrary guess. Given a positive number S, you can guess any positive number x0 and then apply the "magic formula" x_{n+1} = (x_n + S / x_n)/2 until the iterates converge to the square root of S. For details, see my article about the Babylonian algorithm.

It turns out that this Babylonian square-root algorithm is a special case of Newton's general method for finding the roots of functions. To see this, define the function f(x) = x^2 - S. Notice that the square root of S is the positive root of f. Newton's method says that you can find roots by forming the function Q(x) = x - f(x) / f′(x), where f′ is the derivative of f, which is 2x. With a little bit of algebra, you can show that Q(x) = (x + S / x)/2, which is exactly the "magic formula" in the Babylonian algorithm.
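
A short DATA step sketch of the iteration (using S = 20 and the initial guess 10, which are the values used in the graph below) looks like this:

data Babylonian;
S = 20;                          /* we want sqrt(S) */
x = 10;                          /* arbitrary initial guess */
do iter = 1 to 6;
   x = (x + S/x) / 2;            /* the "magic formula": Newton's method for x**2 - S */
   error = abs(x - sqrt(S));
   output;
end;
run;
 
proc print data=Babylonian noobs; run;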

Therefore, the square root algorithm is exactly Newton's method applied to a special quadratic polynomial whose root is the square root of S. The Newton iteration process is visualized by the following graph (click to enlarge), which shows the iteration history of the initial guess 10 when S = 20.

Iteration of initial guess under Babylonian square-root algorithm

Everything I need to know about Newton's method...

It turns out that almost everything I need to know about Newton's method, I learned in sixth grade. The Babylonian method for square roots provides practical tips about how to use Newton's method for finding roots:

  • You need to make an initial guess.
  • If you make a good initial guess, the iteration process is shorter.
  • When you get close to a solution, the number of correct digits doubles for each iteration. This is quadratic convergence in action!
  • Avoid inputs where the function is not defined. For the Babylonian method, you can't plug in x=0. In Newton's method, you need to avoid inputs where the derivative is close to zero.
  • Different initial guesses might iterate to different roots. In the square root algorithm, a negative initial guess converges to the negative square root.

In fact, I'll go further. The Babylonian method teaches important lessons in numerical analysis:

  • Root-finding is a fundamental tool that helps you solve all kinds of numerical problems.
  • Many problems that do not have exact answers can be solved with iterative methods.
  • Iterative methods often linearize the problem and use the linear formulation as part of the algorithm.

So you see, all I really needed to know about Newton's method I learned in sixth grade!

tags: Numerical Analysis


4
Nov

Trap and cap: Avoid division-by-zero and domain errors when evaluating functions

Statistical programmers often need to evaluate complicated expressions that contain square roots, logarithms, and other functions whose domain is restricted. Similarly, you might need to evaluate a rational expression in which the denominator of the expression can be zero. In these cases, it is important to avoid evaluating a function at any point that is outside the function's domain.

This article shows a technique that I call "trap and cap." The "trap" part of the technique requires using logical conditions to prevent your software from evaluating an expression that is outside the domain of the function. The "cap" part is useful for visualizing functions that have singularities or whose range spans many orders of magnitude.

This technique is applicable to any computer programming language, and it goes by many names. It is one tool in the larger toolbox of defensive programming techniques.

This article uses a SAS/IML function to illustrate the technique, but the same principles apply to the SAS DATA step and PROC FCMP. For an example that uses FCMP, see my article about defining a log transformation in PROC FCMP.

ERROR: Invalid argument to function

The following SAS/IML function involves a logarithm and a rational expression. If you naively try to evaluate the function on the interval [-2, 3], an error occurs:

proc iml;
/* Original function module: No checks on domain of function */
start Func(x);
   numer = exp(-2*x) # (4*x-3)##3 # log(3*x);       /* not defined for x <= 0 */
   denom = x-1;                                     /* zero when x=1 */
   return( numer / denom );
finish;
 
x = do(-2, 3, 0.01);
f = Func(x);
ERROR: (execution) Invalid argument to function.

 count     : number of occurrences is 200
 operation : LOG at line 881 column 40

The error says that there were 200 invalid arguments to the LOG function. The program stops running when it encounters the error, so it doesn't encounter the second problem, which is that the expression numer / denom is undefined when denom is zero.

Experienced SAS programmers might know that the DATA step automatically plugs in missing values and issues warnings for the LOG function. However, I don't like warnings any more than I like errors: I want my program to run cleanly!

Trap bad input values

Because the function contains a logarithm, the function is undefined for x ≤ 0. Because the function is a rational expression and the denominator is zero when x = 1, the function is also undefined at that value.

The goal of the "trap" technique is to prevent an expression from being evaluated at points where it is undefined. Accordingly, the best approach is to compute the arguments of any function that has a restricted domain (log, sqrt, arc-trigonometric functions,...) and then use the LOC function to determine the input values that are within the domain of all the relevant functions. The function is only evaluated at input values that pass all domain tests, as shown in the following example:

proc iml;
/* Trap bad domain values; return missing values when x outside domain */
start Func(x);
   f = j(nrow(x), ncol(x), .);    /* initialize return values to missing */
   /* trap bad domain values */
   d1 = 3*x;                      /* argument to LOG */
   d2 = x-1;                      /* denominator */
   domainIdx = loc( d1>0 & d2^=0 );     /* find values inside domain */
   if ncol(domainIdx)=0 then return(f); /* all inputs invalid; return missing */
 
   t = x[domainIdx];              /* elements of t are valid */
   numer = exp(-2*t) # (4*t-3)##3 # log(3*t);
   denom = t-1;
   f[domainIdx] = numer / denom;  /* evaluate f(t) */
   return( f );
finish;
 
x = do(-2, 3, 0.01);
f = Func(x);

This call does not encounter any errors. The function returns missing values for input values that are outside its domain.

In practice, I advise that you avoid testing floating-point numbers for equality with zero. You can use the ABS function to check that quantities are not close to zero. Consequently, a preferable trapping condition follows:

   domainIdx = loc( d1>0 & abs(d2)>1e-14 );  /* find values inside domain */

I also recommend that you use the CONSTANT function to check for domain values that would result in numerical underflows or overflows, which is especially useful if your expression contains an exponential term.
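
For example, because this function contains EXP(-2*x), a sketch of a stricter trapping condition might combine the previous checks with a bound on the exponential argument obtained from the CONSTANT function:

   logBig = constant('LOGBIG');   /* log of largest double, about 709.78 */
   domainIdx = loc( d1>0 & abs(d2)>1e-14 & abs(-2*x) < logBig );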

Cap it! Graphing functions with singularities

You have successfully evaluated the function even though it has two singularities. Let's see what happens if you naively graph the function. The function has a horizontal asymptote at y=0 and vertical asymptotes at x=0 and x=1, so you can use the REFLINE statement in PROC SGPLOT to draw reference lines that aid the visualization. In SAS/IML, you can use the SCATTER subroutine to create a scatter plot, and pass the REFLINE statements by using the OTHER= option, as follows:

refStmt = "refline 0/axis=y; refline 0 1/axis=x;"; /* add asymptotes and axes */
call scatter(x, f) other=refStmt;

The graph is not very useful (look at the vertical scale!) because the magnitude of the function at one point dwarfs the function values at other points. Consequently, you don't obtain a good visualization of the function's shape. This is often the case when visualizing a function that has singularities.

To obtain a better visualization, manually cap the minimum and maximum value of the graph of the function. In the SGPLOT procedure, you can use the MIN= and MAX= options on the YAXIS statement to cap the view range. If you are using the CALL SCATTER routine in SAS/IML, you can add the YAXIS statement to the OTHER= option, as follows:

/* cap the view at +/-5 */
title "Sketch of Function";
title2 "-5 < y < 5";
capStmt = "yaxis min=-5 max=5;";
call scatter(x,f) other=capStmt + refStmt;

This latest graph provides a good overview of the function's shape and behavior. For intervals on which the function is continuous, you can get a nicer graph if you use the SERIES subroutine to connect the points and hide the markers. The following statements re-evaluate the function on the interval (0, 1) and cap the vertical scale by using -1 < y < 1:

title "Zoom into [0, 1]";
title2 "-1 < y < 1";
x = do(0, 1, 0.01);
f = Func(x);
capStmt = "yaxis min=-1 max=1;";
call series(x,f) other=capStmt + refStmt;

The resulting graph shows that the function has two roots in the interval (0, 1) and a local maximum near x=0.43. A visualization like this is often an important step in finding the roots or local extrema of a function.

In summary, the trap-and-cap technique enables you to evaluate functions that have singularities or that have a restricted domain. For SAS/IML functions, you can use the LOC function to keep only the input values that are in the domain of the function. (In a scalar language, such as the SAS DATA step, you can use IF-THEN/ELSE statements instead.) For functions with singularities, you can use the YAXIS statement in PROC SGPLOT to ensure that a graph of the function reveals interesting features and not just extreme values.

tags: Numerical Analysis, Optimization, Tips and Techniques

