5 Jul

## Compute derivatives for nonparametric regression models

SAS enables you to evaluate a regression model at any location within the range of the data. However, sometimes you might be interested in how the predicted response is increasing or decreasing at specified locations. You can use finite differences to compute the slope (first derivative) of a regression model. This numerical approximation technique is most useful for nonparametric regression models that cannot be simply written in terms of an analytical formula.

### A nonparametric model of drug absorption

The following data are the hypothetical concentrations of a drug in a patient's bloodstream at times (measured in hours) during a 72-hour period after the drug is administered. For a real drug, a pharmacokinetic researcher might construct a parametric model and fit the model by using PROC NLMIXED. The following example uses the EFFECT statement to fit a regression model that uses cubic splines. Other nonparametric models in SAS include loess curves, generalized additive models, adaptive regression, and thin-plate splines.

```
data Drug;
input Time Concentration @@;
datalines;
 1  3    3  7    6 19   12 73   18 81
24 71   36 38   42 28   48 20   72 12
;

proc glmselect data=Drug;
   effect spl = spline(Time / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model Concentration = spl / selection=none;  /* fit model by using spline effects */
   store out=SplineModel;                       /* store model for future scoring */
quit;
```

Because the data are not evenly distributed in time, a graph of the spline fit evaluated at the data points does not adequately show the response curve. Notice that the call to PROC GLMSELECT used a STORE statement to store the model to an item store. You can use PROC PLM to score the model on a uniform grid of values to visualize the regression model:

```
/* use uniform grid to visualize curve */
data ScoreData;
do Time = 0 to 72;
   output;
end;
run;

/* score the model on the uniform grid */
proc plm restore=SplineModel noprint;
   score data=ScoreData out=ScoreResults;
run;

/* merge fitted curve with original data and plot the fitted curve */
data All;
set Drug ScoreResults;
run;

title "Observed and Predicted Blood Content of Drug";
proc sgplot data=All noautolegend;
   scatter x=Time y=Concentration;
   series x=Time y=Predicted / name="fit" legendlabel="Predicted Response";
   keylegend "fit" / position=NE location=inside opaque;
   xaxis grid values=(0 to 72 by 12) label="Hours";
   yaxis grid;
run;
```

### Finite difference approximations for derivatives of nonparametric models

A researcher might be interested in knowing the slope of the regression curve at certain time points. The slope indicates the rate of change of the response variable (the blood-level concentration). Because nonparametric regression curves do not have explicit formulas, you cannot use calculus to compute a derivative. However, you can use finite difference formulas to compute a numerical derivative at any point.

There are several finite difference formulas for the first derivative. The forward and backward formulas are less accurate than the central difference formula. Let h be a small value. Then the central difference approximation to the derivative of a function f at a point t is given by
f′(t) ≈ [ f(t + h) – f(t – h) ] / (2h)

The formula says that you can approximate the slope at t by evaluating the model at the points t ± h. You can use the DIF function to compute the difference between the response function at adjacent time points and divide that difference by 2h. The code that scores the model is similar to the more-familiar case of scoring on a uniform grid. The following statements evaluate the derivative of the model at six-hour intervals.
```
/* compute derivatives at specified points */
data ScoreData;
h = 0.5e-5;
do t = 6 to 48 by 6;      /* six-hour intervals */
   Time = t - h;  output;
   Time = t + h;  output;
end;
keep t Time h;
run;

/* score the model at each time point */
proc plm restore=SplineModel noprint;
   score data=ScoreData out=ScoreOut;   /* Predicted column contains f(x+h) and f(x-h) */
run;

/* compute first derivative by using central difference formula */
data Deriv;
set ScoreOut;
Slope = dif(Predicted) / (2*h);   /* [f(x+h) - f(x-h)] / 2h */
Time = t;                         /* estimate slope at this time */
if mod(_N_,2)=0;                  /* process observations in pairs; keep only even obs */
drop h t;
run;

proc print data=Deriv;
run;
```

The output shows that after six hours the drug is entering the bloodstream at 6.8 units per hour. By 18 hours, the rate of absorption has slowed to 0.6 units per hour. After 24 hours, the rate of absorption is negative, which means that the blood-level concentration is decreasing. At approximately 30 hours, the drug is leaving the bloodstream at a rate of 3.5 units per hour.

This technique generalizes to other nonparametric models. If you can score the model, you can use the central difference formula to approximate the first derivative in the interior of the data range.
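The recipe itself is language-agnostic. Here is a minimal Python sketch of the central-difference computation; the `model` argument is a stand-in for any function you can evaluate (not the SAS spline model above), so a known function like `math.sin` makes a convenient check:

```python
import math

def central_diff(model, t, h=1e-5):
    """Approximate the first derivative of `model` at t
    by using the central difference [f(t+h) - f(t-h)] / (2h)."""
    return (model(t + h) - model(t - h)) / (2 * h)

# check with a function whose derivative is known: d/dt sin(t) = cos(t)
slope = central_diff(math.sin, 0.3)
```

The approximation error is O(h²), so for smooth functions even a modest h gives many accurate digits.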

For more about numerical derivatives, including a finite-difference approximation of the second derivative, see Warren Kuhfeld's article on derivatives for penalized B-splines. Warren's article is focused on how to obtain the predicted values that are generated by the built-in regression models in PROC SGPLOT (LOESS and PBSPLINE), but it contains derivative formulas that apply to any regression curve.

The post Compute derivatives for nonparametric regression models appeared first on The DO Loop.

7 Mar

## Fit a distribution from quantiles

Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are "most likely" to have produced the observed data. However, how can you fit a distribution if you do not have access to the data?

This question was asked by a SAS programmer who wanted to fit a gamma distribution by using sample quantiles of the data. In particular, the programmer said, "we have the 50th and 90th percentile" of the data and "want to find the parameters for the gamma distribution [that fit] our data."

This is an interesting question. Recall that the method of moments uses sample moments (mean, variance, skewness,...) to estimate parameters in a distribution. When you use the method of moments, you express the moments of the distribution in terms of the parameters, set the distribution's moments equal to the sample moments, and solve for the parameter values for which the equation is true.
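As a concrete (and hypothetical) illustration of the method of moments: the Gamma(α, β) distribution has mean αβ and variance αβ², so equating these to a sample mean m and sample variance v gives α = m²/v and β = v/m. The numbers below are made up for illustration:

```python
# Method-of-moments estimates for a gamma distribution.
# The sample moments are hypothetical, chosen only to illustrate the algebra.
m, v = 4.5, 6.9      # sample mean and sample variance
alpha = m**2 / v     # shape: chosen so that alpha*beta   = m
beta  = v / m        # scale: chosen so that alpha*beta^2 = v
```

Substituting back confirms that the fitted distribution reproduces both sample moments exactly.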

In a similar way, you can fit a distribution matching quantiles: Equate the sample and distributional quantiles and solve for the parameters of the distribution. This is sometimes called quantile-matching estimation (QME). Because the quantiles involve the cumulative distribution function (CDF), the equation does not usually have a closed-form solution and must be solved numerically.
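To see the idea in its simplest form, consider a one-parameter example before tackling the gamma case: match the median of an exponential distribution by solving F(4; β) = 1 − exp(−4/β) = 0.5 for the scale β. The Python sketch below uses bisection; this example is mine (not from the original question), and it is chosen because the closed-form answer β = 4/ln 2 makes the numerical result easy to check:

```python
import math

def expo_cdf(x, beta):
    """CDF of the exponential distribution with scale parameter beta."""
    return 1.0 - math.exp(-x / beta)

def solve_scale(q, p, lo=0.01, hi=100.0, tol=1e-12):
    """Find beta such that F(q; beta) = p, by bisection.
    For fixed q, the CDF decreases as beta increases."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expo_cdf(q, mid) > p:
            lo = mid   # CDF too large => scale is too small
        else:
            hi = mid
    return 0.5 * (lo + hi)

beta = solve_scale(q=4.0, p=0.5)   # the median of the data is 4
```

With two parameters and two quantiles, as in the gamma problem below, the same matching principle requires a two-dimensional root finder instead of bisection.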

### Fit a two-parameter distribution from two quantiles

To answer the programmer's question, suppose you do not have the original data, but you are told that the 50th percentile (median) of the data is x = 4 and the 90th percentile is x = 8. You suspect that the data are distributed according to a gamma distribution, which has a shape parameter (α) and a scale parameter (β). To use quantile-matching estimation, set F(4; α, β) = 0.5 and F(8; α, β) = 0.9, where F is the cumulative distribution of the Gamma(α, β) distribution. You can then solve for the values of (α, β) that satisfy the equations. You will get a CDF that matches the quantiles of the data, as shown to the right.

I have previously written about four ways to solve nonlinear equations in SAS. One way is to use PROC MODEL, as shown below:

```
data initial;
alpha=1; beta=1;   /* initial guess for finding root */
p1=0.5; X1 = 4;    /* eqn for 1st quantile: F(X1; alpha, beta) = p1 */
p2=0.9; X2 = 8;    /* eqn for 2nd quantile: F(X2; alpha, beta) = p2 */
run;

proc model data=initial;
   eq.one = cdf("Gamma", X1, alpha, beta) - p1;   /* find root of eq1 */
   eq.two = cdf("Gamma", X2, alpha, beta) - p2;   /* and eq2 */
   solve alpha beta / solveprint out=solved outpredict;
run;quit;

proc print data=solved noobs;
   var alpha beta;
run;
```

The output indicates that the parameters (α, β) = (2.96, 1.52) are the values for which the Gamma(α, β) quantiles match the sample quantiles. You can see this by graphing the CDF function and adding reference lines at the 50th and 90th percentiles, as shown at the beginning of this section. The following SAS code creates the graph:

```
/* Graph the CDF function to verify that the solution makes sense */
data Check;
set solved;   /* estimates of (alpha, beta) from solving eqns */
do x = 0 to 12 by 0.2;
   CDF = cdf("gamma", x, alpha, beta);
   output;
end;
run;

title "CDF of Gamma Distribution";
title2 "Showing 50th and 90th Percentiles";
proc sgplot data=Check;
   series x=x y=CDF / curvelabel;
   dropline y=0.5 X=4 / dropto=both;   /* first percentile */
   dropline y=0.9 X=8 / dropto=both;   /* second percentile */
   yaxis values=(0 to 1 by 0.1) label="Cumulative Probability";
   xaxis values=(0 to 12 by 2);
run;
```

### Least squares estimates for matching quantiles

The previous section is relevant when you have as many sample quantiles as parameters. If you have more sample quantiles than parameters, then the system is overconstrained and you probably want to compute a least squares solution. If there are m sample quantiles, the least squares solution is the set of parameters that minimizes the sum of squares Σ (pi – F(xi; α, β))², where the sum is over the m sample quantiles.
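To see the least-squares idea without SAS, here is a hedged Python sketch. It fits a normal distribution instead of a gamma (so that the CDF is available from the standard library via `math.erf`), and the five "observed" quantiles are exact standard normal quantiles rather than the gamma sample quantiles below. A brute-force grid search over (μ, σ) stands in for the efficient Gauss-Newton iteration that PROC NLIN performs:

```python
import math

def norm_cdf(x, mu, sigma):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# five sample quantiles (here: exact quantiles of the standard normal)
p = [0.1, 0.25, 0.5, 0.75, 0.9]
q = [-1.2816, -0.6745, 0.0, 0.6745, 1.2816]

def sse(mu, sigma):
    """Sum of squares between observed probabilities and model CDF values."""
    return sum((pi - norm_cdf(qi, mu, sigma))**2 for pi, qi in zip(p, q))

# coarse grid search for the least squares parameters
best = min(((sse(-0.5 + 0.05*i, 0.5 + 0.05*j), -0.5 + 0.05*i, 0.5 + 0.05*j)
            for i in range(21) for j in range(21)))
best_mu, best_sigma = best[1], best[2]
```

The grid minimum lands at (μ, σ) ≈ (0, 1), as it should, because the quantiles were generated from the standard normal.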

For example, the following DATA step contains five sample quantiles. The observation (p,q) = (0.1, 1.48) indicates that the 10th percentile is x=1.48. The second observation indicates that the 25th percentile is x=2.50. The last observation indicates that the 90th percentile is x=7.99. You can use PROC NLIN to find a least squares solution to the quantile-matching problem, as follows:

```
data SampleQntls;
input p q;   /* p is cumul probability; q is p_th sample quantile */
datalines;
0.1  1.48
0.25 2.50
0.5  4.25
0.75 6.00
0.9  7.99
;

/* least squares fit of parameters */
proc nlin data=SampleQntls          /* sometimes the NOHALVE option is useful */
          outest=PE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   model p = cdf("Gamma", q, alpha, beta);
run;

proc print data=PE noobs;
   var alpha beta;
run;
```

The solution indicates that the parameter values (α, β) = (2.72, 1.70) minimize the sum of squares between the observed and theoretical quantiles. The following graph shows the observed quantiles overlaid on the CDF of the fitted Gamma(α, β) distribution. Alternatively, you can graph the quantile-quantile plot of the observed and fitted quantiles.

### Weighted least squares estimates for matching quantiles

For small samples, quantiles in the tail of a distribution have a large standard error, which means that the observed quantile might not be close to the theoretical quantile. One way to handle that uncertainty is to compute a weighted regression analysis where each sample quantile is weighted by the inverse of its variance. According to Stuart and Ord (Kendall's Advanced Theory of Statistics, 1994, section 10.10), the variance of the p_th sample quantile in a sample of size n is σ² = p(1–p) / (n f(ξp)²), where ξp is the p_th quantile of the distribution and f is the probability density function.

In PROC NLIN, you can perform weighted analysis by using the automatic variable _WEIGHT_. The following statements define the variance of the p_th sample quantile and define weights equal to the inverse variance. Notice the NOHALVE option, which can be useful for iteratively reweighted least squares problems. The option eliminates the requirement that the weighted sum of squares must decrease at every iteration.

```
/* weighted least squares fit where w[i] = 1/variance[i] */
proc nlin data=SampleQntls NOHALVE
          outest=WPE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   N = 80;                                   /* sample size */
   xi = quantile("gamma", p, alpha, beta);   /* quantile of distrib */
   f = pdf("Gamma", xi, alpha, beta);        /* density at quantile */
   variance = p*(1-p) / (N * f**2);          /* variance of sample quantiles */
   _weight_ = 1 / variance;                  /* weight for each observation */
   model p = cdf("Gamma", q, alpha, beta);
run;
```
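A quick stdlib Python check of the variance formula makes the downweighting visible. The example uses the standard normal distribution (my choice, for its simple density) and n = 80, the sample size assumed in the SAS code; the tail quantile gets a smaller weight than the median because the density is smaller in the tail:

```python
import math

def norm_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantile_weight(p, xi, n):
    """Weight = 1/variance, where variance = p(1-p) / (n f(xi)^2)."""
    variance = p * (1.0 - p) / (n * norm_pdf(xi)**2)
    return 1.0 / variance

n = 80
w_median = quantile_weight(0.5, 0.0,    n)   # 50th pctl of std normal
w_tail   = quantile_weight(0.9, 1.2816, n)   # 90th pctl of std normal
```

The weighted regression therefore trusts the central quantiles more than the tail quantiles, which explains the behavior of the fitted CDF described below.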

The parameter estimates for the weighted analysis are slightly different than for the unweighted analysis. The following graph shows the CDF for the weighted estimates, which does not pass as close to the 75th and 90th percentiles as does the CDF for the unweighted estimates. This is because the PDF of the gamma distribution is relatively small for those quantiles, which causes the regression to underweight those sample quantiles.

In summary, this article shows how to use SAS to fit distribution parameters to observed quantiles by using quantile-matching estimation (QME). If the number of quantiles is the same as the number of parameters, you can numerically solve for the parameters for which the quantiles of the distribution equal the sample quantiles. If you have more quantiles than parameters, you can compute a least squares estimate of the parameters. Because quantile estimates in the tail of a distribution have larger uncertainty, you might want to underweight those quantiles. One way to do that is to run a weighted least squares regression where the weights are inversely proportional to the variance of the sample quantiles.

The post Fit a distribution from quantiles appeared first on The DO Loop.

28 Feb

## Solve a system of nonlinear equations with SAS

This article shows how to use SAS to solve a system of nonlinear equations. When there are n unknowns and n equations, this problem is equivalent to finding a multivariate root of a vector-valued function F(x) = 0 because you can always write the system as
f1(x1, x2, ..., xn) = 0
f2(x1, x2, ..., xn) = 0
. . .
fn(x1, x2, ..., xn) = 0
Here the fi are the nonlinear component functions, F is the vector (f1, f2, ..., fn), and x is the vector (x1, x2, ..., xn).

In two dimensions, the solution can be visualized as the intersection of two planar curves. An example for n = 2 is shown at the right. The two curves meet at the solution (x, y) = (1, 2).

There are several ways to solve a system of nonlinear equations in SAS, including:

• In SAS/IML software, you can use the NLPLM or NLPHQN methods to solve the corresponding least-squares problem. Namely, find the value of x that minimizes || F(x) ||.
• In SAS/ETS software, you can use the SOLVE statement in PROC MODEL to solve the system.
• In SAS/STAT software, you can use the NLIN procedure to solve the system.
• In SAS/OR software, you can use PROC OPTMODEL to solve the system.

When n = 1, the problem is one-dimensional. You can use the FROOT function in SAS/IML software to find the root of a one-dimensional function. You can also use the SOLVE function in conjunction with PROC FCMP.

This article shows how to find a root for the following system of three equations:
f1(x, y, z) = log(x) + exp(-x*y) - exp(-2)
f2(x, y, z) = exp(x) - sqrt(z)/x - exp(1) + 2
f3(x, y, z) = x + y - y*z + 5
You can verify that the value (x, y, z)=(1, 2, 4) is an exact root of this system.
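Before turning to the SAS procedures, it may help to see the underlying algorithm in miniature. The following Python sketch applies Newton's method with a finite-difference Jacobian to this system, starting from a guess near the root. This is my illustration of the general technique, not what the SAS procedures below execute internally:

```python
import math

def F(v):
    """The three component functions f1, f2, f3 of the system."""
    x, y, z = v
    return [math.log(x) + math.exp(-x*y) - math.exp(-2),
            math.exp(x) - math.sqrt(z)/x - math.e + 2,
            x + y - y*z + 5]

def jacobian(v, h=1e-7):
    """Central-difference approximation of the 3x3 Jacobian of F."""
    J = [[0.0]*3 for _ in range(3)]
    for j in range(3):
        vp, vm = list(v), list(v)
        vp[j] += h
        vm[j] -= h
        Fp, Fm = F(vp), F(vm)
        for i in range(3):
            J[i][j] = (Fp[i] - Fm[i]) / (2*h)
    return J

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c+1, 3):
            m = M[r][c] / M[c][c]
            for k in range(c, 4):
                M[r][k] -= m * M[c][k]
    x = [0.0]*3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k]*x[k] for k in range(r+1, 3))) / M[r][r]
    return x

v = [1.2, 1.8, 3.5]        # initial guess near the root
for _ in range(25):        # Newton iteration: v <- v - J^{-1} F(v)
    step = solve3(jacobian(v), F(v))
    v = [vi - si for vi, si in zip(v, step)]
```

Newton's method converges quadratically near a simple root, so a handful of iterations reaches machine precision here; the production-quality routines below add line searches, bound constraints, and better convergence tests.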

### Solve a system of nonlinear equations in SAS/IML

You can use the NLPLM or NLPHQN methods in SAS/IML to solve nonlinear equations. You need to define a function that returns the value of the function as a row vector. This is very important: the function must return a row vector! If the domain of any component of the function is restricted (for example, because of LOG or SQRT functions), you can define a linear constraint matrix. You then supply an initial guess and call the NLPLM routine to solve the least-squares problem that minimizes (1/2)(f1² + ... + fn²). Obviously the minimum occurs when each component is zero, that is, when (x,y,z) is a root of the vector-valued function. You can solve for the root as follows:

```
proc iml;
start Fun(var);
   x = var[1]; y = var[2]; z = var[3];
   f = j(1, 3, .);   /* return a ROW VECTOR */
   f[1] = log(x) + exp(-x*y) - exp(-2);
   f[2] = exp(x) - sqrt(z)/x - exp(1) + 2;
   f[3] = x + y - y*z + 5;
   return (f);
finish;

/*    x[1] x[2] x[3] constraints. Lower bounds in 1st row; upper bounds in 2nd row */
con = {1e-6  .  1e-6,   /* x[1] > 0 and x[3] > 0; no bounds on y */
          .  .     .};
x0 = {1 1 1};           /* initial guess */
optn = {3               /* solve least square problem that has 3 components */
        1};             /* amount of printing */
call nlphqn(rc, Soln, "Fun", x0, optn) blc=con;   /* or use NLPLM */
print Soln;
quit;
```

The NLPHQN routine converges to the solution (1, 2, 4). Notice that the first element of the optn vector must contain n, the number of equations in the system.

### Solve a system of nonlinear equations with PROC MODEL

If you have access to SAS/ETS software, PROC MODEL provides a way to solve simultaneous equations. You first create a SAS data set that contains an initial guess for the solution. You then define the equations in PROC MODEL and use the SOLVE statement to solve the system, as follows:

```
data InitialGuess;
   x=1; y=1; z=1;   /* initial guess for Newton's method */
run;

proc model data=InitialGuess;
   bounds 0 < x z;
   eq.one   = log(x) + exp(-x*y) - exp(-2);
   eq.two   = exp(x) - sqrt(z)/x - exp(1) + 2;
   eq.three = x + y - y*z + 5;
   solve x y z / solveprint out=solved outpredict;
run;quit;

title "Solution from PROC MODEL in SAS/ETS";
proc print data=solved noobs;
   var x y z;
run;
```

A nice feature of PROC MODEL is that it automatically generates symbolic derivatives and uses them in the solution of the simultaneous equations. If you want to use derivatives in PROC IML, you must specify them yourself. Otherwise, the NLP routines use numerical finite-difference approximations.

### Solve a system of nonlinear equations with PROC NLIN

You can solve a system of equations by using only SAS/STAT software, but you need to know a trick. My colleague who supports PROC NLIN says he has "seen this trick before" but does not know who first thought of it. I saw it in a 2000 paper by Nam, Cho, and Shim (in Korean).

Because PROC NLIN is designed to solve regression problems, you need to recast the problem in terms of a response variable, explanatory variables, and parameters. Recall that ordinary least squares regression enables you to solve a linear system such as
0 = C1*v1 + C2*v2 + C3*v3
where the left-hand side is a response vector (the zero vector), the C_i are regression coefficients, and the v_i are explanatory variables. (You need three or more observations to solve this regression problem.) PROC NLIN enables you to solve more complex regression problems. In particular, the coefficients can be nonlinear functions of parameters. For example, if the parameters are (x,y,z), you can solve the following system:
0 = C1(x,y,z)*v1 + C2(x,y,z)*v2 + C3(x,y,z)*v3.

To solve this nonlinear system of equations, you can choose the explanatory variables to be coordinate basis functions: v1=(1,0,0), v2=(0,1,0), and v3=(0,0,1). These three observations define three equations for three unknown parameters. In general, if you have n equations in n unknowns, you can specify n coordinate basis functions.

To accommodate an arbitrary number of equations, the following DATA step generates n basis vectors, where n is given by the value of the macro variable numEqns. The BasisVectors data set contains a column of zeros (the LHS variable):

```
%let numEqns = 3;
data BasisVectors;
LHS = 0;
array v[&numEqns];
do i = 1 to dim(v);
   do j = 1 to dim(v);
      v[j] = (i=j);   /* 1 when i=j; 0 otherwise */
   end;
   output;
end;
drop i j;
run;

title "Solution from PROC NLIN in SAS/STAT";
proc nlin data=BasisVectors;
   parms x 1 y 1 z 1;   /* initial guess */
   bounds 0 < x z;      /* linear constraints */
   eq1 = log(x) + exp(-x*y) - exp(-2);
   eq2 = exp(x) - sqrt(z)/x - exp(1) + 2;
   eq3 = x + y - y*z + 5;
   model LHS = eq1*v1 + eq2*v2 + eq3*v3;
   ods select EstSummary ParameterEstimates;
run;
```

The problem contains three parameters and the data contains three observations. Consequently, the standard errors and confidence intervals are not meaningful. The parameter estimates are the solution to the nonlinear simultaneous equations.

### Solve a system of nonlinear equations with PROC OPTMODEL

With PROC OPTMODEL in SAS/OR software, you can express the system in a natural syntax. You can either minimize the objective function F = 0.5 * (f1**2 + f2**2 + f3**2) or solve the system directly by specifying constraints but not an objective function, as follows:

```
title "Solution from PROC OPTMODEL in SAS/OR";
proc optmodel;
   var x init 1, y init 1, z init 1;
   /* -or- var x >= 1e-6 init 1, y init 1, z >= 0 init 1;  to specify bounds */
   con c1: log(x) + exp(-x*y) = exp(-2);
   con c2: exp(x) - sqrt(z)/x = exp(1) - 2;
   con c3: x + y - y*z = -5;
   solve noobjective;
   print x y z;
quit;
```

The solution is (x,y,z)=(1,2,4) and is not shown.

### Summary

In summary, there are multiple ways to solve systems of nonlinear equations in SAS. My favorite ways are the NLPHQN function in SAS/IML and the SOLVE statement in PROC MODEL in SAS/ETS. However, you can also use PROC NLIN in SAS/STAT software or PROC OPTMODEL in SAS/OR. When you need to solve a system of simultaneous nonlinear equations in SAS, you can choose whichever method is most convenient for you.

The post Solve a system of nonlinear equations with SAS appeared first on The DO Loop.

19 Feb

## Compute the quantiles of any distribution

Your statistical software probably provides a function that computes quantiles of common probability distributions such as the normal, exponential, and beta distributions. Because there are infinitely many probability distributions, you might encounter a distribution for which a built-in quantile function is not implemented. No problem! This article shows how to numerically compute the quantiles of any probability distribution from the definition of the cumulative distribution (CDF).

In SAS, the QUANTILE function computes the quantiles for about 25 distributions. This article shows how you can use numerical root-finding methods (and possibly numerical integration) in SAS/IML software to compute the quantile function for ANY continuous distribution. I have previously written about related topics and particular examples.

### The quantile is the root of an integral equation

Computing a quantile would make a good final exam question for an undergraduate class in numerical analysis. Although some distributions have an explicit CDF, many distributions are defined only by a probability density function (the PDF, f(x)) and numerical integration must be used to compute the cumulative distribution (the CDF, F(x)). A canonical example is the normal distribution. I've previously shown how to use numerical integration to compute a CDF from a PDF by using the definition F(x) = ∫ f(t) dt, where the lower limit of the integral is –∞ and the upper limit is x.

Whether the CDF is defined analytically or through numerical integration, the quantile for p is found implicitly as the solution to the equation F(x) = p, where p is a probability in the interval (0,1). This is illustrated by the graph at the right.

Equivalently, you can define G(x; p) = F(x) – p so that the quantile is the root of the equation G(x; p) = 0. For well-behaved densities that occur in practice, a numerical root is easily found because the CDF is monotonically increasing. (If you like pathological functions, see the Cantor staircase distribution.)

### Example: Create a custom distribution

SAS/IML software provides the QUAD subroutine, which provides numerical integration, and the FROOT function, which solves for roots. Thus SAS/IML is an ideal computational environment for computing quantiles for custom distributions.

As an example, consider a distribution that is a mixture of an exponential and a normal distribution:
F(x) = 0.4 Fexp(x; 0.5) + 0.6 Φ(x; 10, 2),
where Fexp(x; 0.5) is the exponential distribution with scale parameter 0.5 and Φ(x; 10, 2) is the normal CDF with mean 10 and standard deviation 2. In this case, you do not need to use numerical integration to compute the CDF. You can compute the CDF as a linear combination of the exponential and normal CDFs, as shown in the following SAS/IML function:

```
/* program to numerically find quantiles for a custom distribution */
proc iml;
/* Define the cumulative distribution function here. */
start CustomCDF(x);
   F = 0.4*cdf("Exponential", x, 0.5) +
       0.6*cdf("Normal", x, 10, 2);
   return F;
finish;
```

The previous section shows the graph of the CDF on the interval [0, 16]. The vertical and horizontal lines correspond to the first, second and third quartiles of the distribution. The quartiles are close to the values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5. The next section shows how to compute the quantiles.

### Compute quantiles for an arbitrary distribution

As long as you can define a function that evaluates the CDF, you can find quantiles. For unbounded distributions, it is usually helpful to plot the CDF so that you can visually estimate an interval that contains the quantile. (For bounded distributions, the support of the distribution contains all quantiles.) For the mixture distribution in the previous section, it is clear that the quantiles are in the interval [0, 16].

The following program finds arbitrary quantiles for whichever CDF is evaluated by the CustomCDF function. To find quantiles for a different function, you can modify the CustomCDF and change the interval on which to find the quantiles. You do not need to modify the RootFunc or CustomQuantile functions.

```
/* Express CDF(x)=p as the root for the function CDF(x)-p. */
start RootFunc(x) global(gProb);
   return CustomCDF(x) - gProb;   /* quantile for p is root of CDF(x)-p */
finish;

/* You need to provide an interval on which to search for the quantiles. */
start CustomQuantile(p, Interval) global(gProb);
   q = j(nrow(p), ncol(p), .);             /* allocate result matrix */
   do i = 1 to nrow(p)*ncol(p);            /* for each element of p... */
      gProb = p[i];                        /* set global variable */
      q[i] = froot("RootFunc", Interval);  /* find root (quantile) */
   end;
   return q;
finish;

/* Example: look for quartiles in interval [0, 16] */
probs = {0.25 0.5 0.75};   /* Q1, Q2, Q3 */
intvl = {0 16};            /* interval on which to search for quantiles */
quartiles = CustomQuantile(probs, intvl);
print quartiles[colname={Q1 Q2 Q3}];
```
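For readers outside SAS, the same idea — find the root of CDF(x) − p on a bracketing interval — can be sketched in stdlib Python for the exponential/normal mixture defined above. Here I use simple bisection in place of the FROOT function:

```python
import math

def custom_cdf(x):
    """CDF of the mixture: 0.4*Exponential(scale=0.5) + 0.6*Normal(10, 2)."""
    F_exp  = 1.0 - math.exp(-x / 0.5) if x > 0 else 0.0
    F_norm = 0.5 * (1.0 + math.erf((x - 10.0) / (2.0 * math.sqrt(2.0))))
    return 0.4 * F_exp + 0.6 * F_norm

def custom_quantile(p, lo=0.0, hi=16.0, tol=1e-10):
    """Bisection for the root of CDF(x) - p on [lo, hi].
    Valid because the CDF is monotonically increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if custom_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

quartiles = [custom_quantile(p) for p in (0.25, 0.5, 0.75)]
```

The computed quartiles are near 0.5, 8, and 10.5, which agrees with the values read off the CDF graph earlier.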

In summary, you can compute an arbitrary quantile of an arbitrary continuous distribution if you can (1) evaluate the CDF at any point and (2) numerically solve for the root of the equation CDF(x)-p for a probability value, p. Because the support of the distribution is arbitrary, the implementation requires that you provide an interval [a,b] that contains the quantile.

The computation should be robust and accurate for non-pathological distributions provided that the density is not tiny or zero at the value of the quantile. Although this example is illustrated in SAS, the same method will work in other software.

The post Compute the quantiles of any distribution appeared first on The DO Loop.

20 Nov

## Simulate data from the beta-binomial distribution in SAS

This article shows how to simulate beta-binomial data in SAS and how to compute the density function (PDF). The beta-binomial distribution is a discrete compound distribution. The "binomial" part of the name means that the discrete random variable X follows a binomial distribution with parameters N (number of trials) and p, but there is a twist: The parameter p is not a constant value but is a random variable that follows the Beta(a, b) distribution.

The beta-binomial distribution is used to model count data where the counts are "almost binomial" but have more variance than can be explained by a binomial model. Therefore this article also compares the binomial and beta-binomial distributions.

### Simulate data from the beta-binomial distribution

To generate a random value from the beta-binomial distribution, use a two-step process. The first step is to draw p randomly from the Beta(a, b) distribution. Then you draw x from the binomial distribution Bin(p, N). The beta-binomial distribution is not natively supported by the RAND function in SAS, but you can call the RAND function twice to simulate beta-binomial data, as follows:

```
/* simulate a random sample from the beta-binomial distribution */
%let SampleSize = 1000;
data BetaBin;
a = 6; b = 4; nTrials = 10;   /* parameters */
call streaminit(4321);
do i = 1 to &SampleSize;
   p = rand("Beta", a, b);             /* p[i] ~ Beta(a,b)          */
   x = rand("Binomial", p, nTrials);   /* x[i] ~ Bin(p[i], nTrials) */
   output;
end;
keep x;
run;
```
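The same two-step draw works in any language that has a beta generator. A stdlib Python sketch of the process (my illustration, not a translation of the SAS code; the sample mean should be near N·a/(a+b) = 6):

```python
import random

random.seed(4321)
a, b, n_trials = 6, 4, 10

def rand_beta_binomial():
    """Two-step draw: p ~ Beta(a, b), then x ~ Binomial(n_trials, p)."""
    p = random.betavariate(a, b)
    return sum(random.random() < p for _ in range(n_trials))

sample = [rand_beta_binomial() for _ in range(20000)]
mean = sum(sample) / len(sample)
```

The Bernoulli sum stands in for a binomial generator; a library binomial draw (where available) is equivalent and faster.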

The result of the simulation is shown in the following bar chart. The expected values are overlaid. The next section shows how to compute the expected values.

### The PDF of the beta-binomial distribution

The Wikipedia article about the beta-binomial distribution contains a formula for the PDF of the distribution. Since the distribution is discrete, some references prefer to use "PMF" (probability mass function) instead of PDF. Regardless, if X is a random variable that follows the beta-binomial distribution, then the probability that X=x is given by

Pr(X = x) = comb(N, x) B(x + a, N – x + b) / B(a, b)

where comb(N, x) is the binomial coefficient and B is the complete beta function.

The binomial coefficients ("N choose x") and the beta function are defined in terms of factorials and gamma functions, which get big fast. For numerical computations, it is usually more stable to compute the log-transform of the quantities and then exponentiate the result. The following DATA step computes the PDF of the beta-binomial distribution. For easy comparison with the distribution of the simulated data, the DATA step also computes the expected count for each value in a random sample of size N. The PDF and the simulated data are merged and plotted on the same graph by using the VBARBASIC statement in SAS 9.4M3. The graph was shown in the previous section.

```
data PDFBetaBinom;   /* PMF function */
a = 6; b = 4; nTrials = 10;   /* parameters */
do x = 0 to nTrials;
   logPMF = lcomb(nTrials, x) +
            logbeta(x + a, nTrials - x + b) -
            logbeta(a, b);
   PMF = exp(logPMF);          /* probability that X=x */
   EX = &SampleSize * PMF;     /* expected value in random sample */
   output;
end;
keep x PMF EX;
run;

/* Merge simulated data and PMF. Overlay PMF on data distribution. */
data All;
merge BetaBin PDFBetaBinom(rename=(x=t));
run;

title "The Beta-Binomial Distribution";
title2 "Sample Size = &SampleSize";
proc sgplot data=All;
   vbarbasic x / barwidth=1 legendlabel='Simulated Sample';   /* requires SAS 9.4M3 */
   scatter x=t y=EX / legendlabel='Expected Value'
                      markerattrs=GraphDataDefault(symbol=CIRCLEFILLED size=10);
   inset "nTrials = 10" "a = 6" "b = 4" / position=topleft border;
   yaxis grid;
   xaxis label="x" integer type=linear;   /* force TYPE=LINEAR */
run;
```
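The log-space computation is easy to replicate elsewhere because both log C(N, x) and log B(u, v) reduce to log-gamma calls. A Python sketch with `math.lgamma`; the probabilities should sum to 1 and yield mean N·a/(a+b) = 6:

```python
import math

def log_beta(u, v):
    """log of the complete beta function B(u, v)."""
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)

def log_comb(n, k):
    """log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def beta_binomial_pmf(x, n, a, b):
    """Beta-binomial PMF, computed via logs to avoid overflow."""
    log_pmf = log_comb(n, x) + log_beta(x + a, n - x + b) - log_beta(a, b)
    return math.exp(log_pmf)

pmf = [beta_binomial_pmf(x, 10, 6, 4) for x in range(11)]
```

Working in logs matters more as N grows, when the factorials in the direct formula would overflow double precision.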

### Compare the binomial and beta-binomial distributions

One application of the beta-binomial distribution is to model count data that are approximately binomial but have more variance ("thicker tails") than the binomial model predicts. The expected value of a Beta(a, b) distribution is a/(a + b), so let's compare the beta-binomial distribution to the binomial distribution with p = a/(a + b).

The following graph overlays the two PDFs for a = 6, b = 4, and nTrials = 10. The blue distribution is the binomial distribution with p = 6/(6 + 4) = 0.6. The pink distribution is the beta-binomial. You can see that the beta-binomial distribution has a shorter peak and thicker tails than the corresponding binomial distribution. The expected value for both distributions is 6, but the variance of the beta-binomial distribution is greater. Thus you can use the beta-binomial distribution as an alternative to the binomial distribution when the data exhibit greater variance than expected under the binomial model (a phenomenon known as overdispersion).
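You can verify the overdispersion numerically by computing both variances from first principles. The binomial variance is Np(1−p) = 2.4; the beta-binomial variance, summed directly from the PMF, matches the standard closed form Np(1−p)(a+b+N)/(a+b+1) ≈ 4.36 for these parameters:

```python
import math

def bb_pmf(x, n, a, b):
    """Beta-binomial PMF, computed in log space via lgamma."""
    lb = lambda u, v: math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)
    lc = math.lgamma(n+1) - math.lgamma(x+1) - math.lgamma(n-x+1)
    return math.exp(lc + lb(x + a, n - x + b) - lb(a, b))

n, a, b = 10, 6, 4
p = a / (a + b)                                     # 0.6
pmf = [bb_pmf(x, n, a, b) for x in range(n + 1)]
mean    = sum(x * px for x, px in enumerate(pmf))   # expect n*p = 6
bb_var  = sum((x - mean)**2 * px for x, px in enumerate(pmf))
bin_var = n * p * (1 - p)                           # 2.4
```

The extra factor (a+b+N)/(a+b+1) > 1 is exactly the inflation that the binomial model cannot capture.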

### Summary

The beta-binomial distribution is an example of a compound distribution. You can simulate data from a compound distribution by randomly drawing the parameters from some distribution and then using those random parameters to draw the data. For the beta-binomial distribution, the probability parameter p is drawn from a beta distribution and then used to draw x from a binomial distribution where the probability of success is the value of p. You can use the beta-binomial distribution to model data that have greater variance than expected under the binomial model.

The post Simulate data from the beta-binomial distribution in SAS appeared first on The DO Loop.

11
Sep

## Symbolic derivatives in SAS

Did you know that you can get SAS to compute symbolic (analytical) derivatives of simple functions, including applying the product rule, quotient rule, and chain rule? SAS can form the symbolic derivatives of single-variable functions and partial derivatives of multivariable functions. Furthermore, the derivatives are output in a form that can be pasted into a SAS program. The trick is to use PROC NLIN with the LIST option.

In SAS/IML, the nonlinear optimization routines will use analytical derivatives if they are provided; otherwise, they will automatically generate numerical finite-difference approximations to the derivatives. I rarely specify derivatives because it can be time-consuming and prone to error. But recently I realized that PROC NLIN and other SAS procedures automatically generate symbolic derivatives, and you can "trick" the procedure into displaying the derivatives so that they can be used in other applications. To get PROC NLIN to output symbolic derivatives, do the following:

• Create a data set to use for the DATA= option of PROC NLIN.
• List the variables that you want to take derivatives of on the PARMS statement. Use programming statements to assign values to any constants that appear in the function.
• Specify the function to differentiate on the MODEL statement.

### Symbolic derivatives of functions of one variable

Here is a one-variable example. Suppose you want to take the derivative of
f(x) = x^3 + x sin(x) + exp(x^2).
The function and its derivative are shown to the right.

The following SAS statements create a "fake" data set that defines a variable named F. The call to PROC NLIN sets up the problem. The LIST option on the PROC NLIN statement generates the 'ProgList' table, which contains the derivative function:

```data _dummy; f = 1; /* name of function */ run;   proc nlin data=_dummy list; parms x=0; /* list variables */ model f = x**3 + x*sin(x) + exp(x**2); /* define function */ ods select ProgList; /* output symbolic derivatives */ run;```

The output shows the expressions for the function ("MODEL.f") and the derivative with respect to x. The derivative df/dx is written as "@MODEL.f/@x", where the symbol '@' is used for the 'd'. You can see that SAS took the simple derivative of the x^3 term, applied the product rule to the x*sin(x) term, and applied the chain rule to the exp(x^2) term. You can also see that the expressions are written in SAS notation, with "**" indicating the power operator. SAS functions are written in upper case (SIN, COS, and EXP).
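If you want to verify a symbolic derivative, a central finite difference is a convenient check. The following Python sketch is illustrative (the analytic expression is the hand-derived derivative of this function, written out after applying the same rules, not PROC NLIN output) and compares the two at a test point:

```python
from math import sin, cos, exp

f = lambda x: x**3 + x*sin(x) + exp(x**2)
# power rule, product rule, and chain rule give:
df = lambda x: 3*x**2 + sin(x) + x*cos(x) + 2*x*exp(x**2)

x, h = 0.5, 1e-5
central = (f(x + h) - f(x - h)) / (2*h)   # finite-difference approximation
print(abs(df(x) - central) < 1e-8)        # → True
```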

### Symbolic partial derivatives of multivariate functions

In exactly the same way, you can obtain partial derivatives. In a partial derivative, all variables are held constant except one. Suppose you have the function for the normal density,
f(x) = exp( -(x - μ)^2 / (2 σ^2) ) / (sqrt(2π) σ).
Suppose that you want to compute the partial derivatives ∂f/∂μ and ∂f/∂σ. To compute these derivatives, list μ and σ on the PARMS statement and assign values to the other symbols (x and π) that appear in the function. It does not matter what values you assign to x and π; you are just preventing PROC NLIN from complaining that there are unassigned variables. You could assign the value 1 to both variables, but I prefer to tell the procedure that π equals 3.14. In the following program, note that PROC NLIN reuses the fake data set that defines the F variable:

```proc nlin data=_dummy list; parms mu=0 sigma=1; /* list variables */ x=1; pi = constant('pi'); /* assign arbitrary values to constants */ model f = exp( -(x-mu)**2 / (2*sigma**2) ) / (sqrt(2*pi)*sigma); /* define function */ ods select ProgList; /* output symbolic derivatives */ run;```

The derivative with respect to σ requires using the chain rule and the quotient rule. Notice that the derivative includes a call to the original function ("MODEL.f"). Notice also that there is repetition of various expressions such as -(x - μ)**2 / (2 σ**2). If you copy and paste these expressions into a SAS program that computes the derivative, the computation will be slightly inefficient because the program will evaluate the same expression multiple times. One way to improve the efficiency is to collect sub-expressions into a temporary variable, as shown in the next section.
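You can apply the same finite-difference check to partial derivatives. The following Python sketch is illustrative: it uses hand-simplified forms of ∂f/∂μ and ∂f/∂σ (algebraically equivalent to, but tidier than, the expressions that PROC NLIN emits) and verifies them numerically:

```python
from math import exp, pi, sqrt

def f(x, mu, sigma):
    # normal density
    return exp(-(x - mu)**2 / (2*sigma**2)) / (sqrt(2*pi)*sigma)

def df_dmu(x, mu, sigma):
    z = (x - mu) / sigma
    return f(x, mu, sigma) * z / sigma        # ∂f/∂μ = f * z/σ

def df_dsigma(x, mu, sigma):
    z = (x - mu) / sigma
    return f(x, mu, sigma) * (z**2 - 1) / sigma   # ∂f/∂σ = f * (z²-1)/σ

x, mu, sigma, h = 1.0, 0.25, 1.5, 1e-6
fd_mu = (f(x, mu + h, sigma) - f(x, mu - h, sigma)) / (2*h)
fd_sigma = (f(x, mu, sigma + h) - f(x, mu, sigma - h)) / (2*h)
print(abs(df_dmu(x, mu, sigma) - fd_mu) < 1e-8,
      abs(df_dsigma(x, mu, sigma) - fd_sigma) < 1e-8)   # → True True
```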

### Sub-expressions and the chain rule

An experienced statistician will recognize that the expression z = (x - μ) / σ arises naturally as the standardized variable. Using such an expression might simplify the derivatives.

The following program computes the same derivatives as before, except this time the programmer defines the expression z = (x - μ) / σ and uses z in the function. When SAS takes the derivative, it computes dz/dμ and dz/dσ as part of the computation, as shown below:

```proc nlin data=_dummy list; parms mu=0 sigma=1; /* list variables */ x=1; pi = constant('pi'); /* assign arbitrary values to constants */ const = sqrt(2*pi); z = (x-mu) / sigma; /* intermediate expression */ model f = exp(-z**2 / 2) / (const*sigma); /* define function */ ods select ProgList; /* output symbolic derivatives */ run;```

Notice that the derivatives use several sub-expressions such as dz/dμ and dz/dσ. This computation is more efficient than the previous one because it reuses previously computed quantities. For another example, define f1 = 1 / (sqrt(2*pi)*sigma) and f2 = exp(-z**2 / 2) and define the MODEL as f = f1*f2. You will see that each term in the derivative is efficiently evaluated in terms of previously computed expressions.

### Discussion of symbolic derivatives

As I said previously, I do not usually specify analytical derivatives for my optimizations. I find that numerical derivatives (which are automatically computed) are fast and accurate for 99% of the optimizations that I perform.

Furthermore, some of the objective functions that I optimize do not have analytical derivatives. Others are matrix computations for which it would be difficult and time-consuming to write down the analytical derivative. (For an example, see the article on the derivative of the bivariate normal cumulative distribution.)

If you decide to specify an analytical derivative in SAS, make sure that the derivative is correct and efficient. The technique in this article will help to ensure that the derivative is correct. By strategically defining sub-expressions as in the previous section, you can help SAS generate derivatives that are computationally efficient.

Be aware that the derivatives are generated by applying the chain rule, product rule, and quotient rule without any algebraic simplifications of the result. Thus the result can be unnecessarily complicated. For an example, ask SAS to display the derivative of log( (1+x)/(1-x) ) / 2. The naive result is complicated. If you rewrite the function as log(1+x)/2 - log(1-x)/2, then the derivative is much simpler. However, in neither case will SAS simplify the derivative to obtain 1 / (1-x**2).
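You can verify that simplification numerically. The following Python sketch (illustrative only) checks that 1/(1 - x**2) matches a finite-difference derivative of log((1+x)/(1-x))/2 at several points:

```python
from math import log

g = lambda x: log((1 + x)/(1 - x)) / 2
# the simplified derivative mentioned in the text
dg = lambda x: 1 / (1 - x**2)

for x in (-0.5, 0.0, 0.3, 0.9):
    fd = (g(x + 1e-6) - g(x - 1e-6)) / 2e-6   # central difference
    assert abs(dg(x) - fd) < 1e-6
print("derivative matches 1/(1-x**2)")
```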

For more of my thoughts on using derivatives in SAS, see "Two hints for specifying derivatives." What about your thoughts? What do you think about using PROC NLIN to generate automatic derivatives of simple functions? Do you think this technique will be useful to you?

The post Symbolic derivatives in SAS appeared first on The DO Loop.

28
Aug

## The singular value decomposition: A fundamental technique in multivariate data analysis

The singular value decomposition (SVD) could be called the "billion-dollar algorithm" since it provides the mathematical basis for many modern algorithms in data science, including text mining, recommender systems (think Netflix and Amazon), image processing, and classification problems. Although the SVD was mathematically discovered in the late 1800s, computers have made the SVD an indispensable tool in computational statistics and data science.

### SVD: A fundamental theorem of linear algebra

Mathematically, the singular value decomposition is a fundamental theorem of linear algebra. (You could argue that it is THE fundamental theorem, but Gil Strang names a different result.) The singular value decomposition says that every n x p matrix can be written as the product of three matrices: A = U Σ V^T where

• U is an orthogonal n x n matrix
• Σ is a diagonal n x p matrix. In practice, the diagonal elements are ordered so that Σii ≥ Σjj for all i < j.
• V is an orthogonal p x p matrix and V^T represents the matrix transpose.

The SVD represents the essential geometry of a linear transformation. It tells us that every linear transformation is a composition of three fundamental actions. Reading the equation from right to left:

1. The matrix V represents a rotation or reflection of vectors in the p-dimensional domain.
2. The matrix Σ represents a linear dilation or contraction along each of the p coordinate directions. If n ≠ p, this step also canonically embeds (or projects) the p-dimensional domain into (or onto) the n-dimensional range.
3. The matrix U represents a rotation or reflection of vectors in the n-dimensional range.

Thus the SVD specifies that every linear transformation is fundamentally a rotation or reflection, followed by a scaling, followed by another rotation or reflection. The Strang (1993) article about the fundamental theorem of linear algebra includes the following geometric interpretation of the singular value decomposition of a 2 x 2 matrix:

The diagram shows that the transformation induced by the matrix A (the long arrow across the top of the diagram) is equivalent to the composition of the three fundamental transformations, namely a rotation, a scaling, and another rotation.

### SVD: The fundamental theorem of multivariate data analysis

Because of its usefulness, the singular value decomposition is a fundamental technique for multivariate data analysis. A common goal of multivariate data analysis is to reduce the dimension of the problem by choosing a small linear subspace that captures important properties of the data. The SVD is used in two important dimension-reducing operations:

• Low-rank approximations: Recall that the diagonal elements of the Σ matrix (called the singular values) in the SVD are computed in decreasing order. The SVD has a wonderful mathematical property: if you choose some integer k ≥ 1 and let D be the diagonal matrix formed by replacing all singular values after the kth by 0, then the matrix U D V^T is the best rank-k approximation to the original matrix A.
• Principal component analysis: The principal component analysis is usually presented in terms of eigenvectors of a correlation matrix, but you can show that the principal component analysis follows directly from the SVD. (In fact, you can derive the eigenvalue decomposition of a matrix from the SVD.) The principal components of A^TA are the columns of the V matrix; the scores are the columns of U. Dimension reduction is achieved by truncating the number of columns in U, which results in the best rank-k approximation of the data.

### Compute the SVD in SAS

In SAS, you can use the SVD subroutine in SAS/IML software to compute the singular value decomposition of any matrix. To save memory, SAS/IML computes a "thin SVD" (or "economical SVD"), which means that the U matrix is an n x p matrix. This is usually what the data analyst wants, and it saves considerable memory in the usual case where the number of observations (n) is much larger than the number of variables (p). Technically speaking, the U for an "economical SVD" is suborthogonal: U^TU is the identity when n ≥ p and UU^T is the identity when n ≤ p.

As with eigenvectors, the columns of U and V are not unique, so be careful if you compare the results in SAS to the results from MATLAB or R. The following example demonstrates a singular value decomposition for a 3 x 2 matrix A. For the full SVD, the U matrix would be a 3 x 3 matrix and Σ would be a 3 x 2 diagonal matrix. For the "economical" SVD, U is 3 x 2 and Σ is a 2 x 2 diagonal matrix, as shown:

```proc iml; A = {0.3062 0.1768, -0.9236 1.0997, -0.4906 1.3497 };   call svd(U, D, V, A); /* A = U*diag(D)*V` */ print U[f=6.3], D[f=6.3], V[f=6.3];```

Notice that, to save memory, only the diagonal elements of the matrix Σ are returned in the vector D. You can explicitly form Σ = diag(D) if desired. Geometrically, the linear transformation that corresponds to V rotates a vector in the domain by about 30 degrees, the matrix D scales it by 2 and 0.5 in the coordinate directions, and U embeds the result in a three-dimensional space and applies another rotation.
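For comparison, here is the analogous thin SVD in Python with NumPy, shown only to illustrate the "economical" shapes and the reconstruction A = U Σ V^T (note that NumPy returns V^T rather than V):

```python
import numpy as np

A = np.array([[ 0.3062, 0.1768],
              [-0.9236, 1.0997],
              [-0.4906, 1.3497]])

# economical (thin) SVD: U is 3 x 2, D holds the 2 singular values
U, D, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, D.shape, Vt.shape)            # (3, 2) (2,) (2, 2)
print(np.allclose(A, U @ np.diag(D) @ Vt))   # reconstruction A = U*diag(D)*V^T: True
print(np.allclose(U.T @ U, np.eye(2)))       # U^T U = I (suborthogonal U): True
```

The singular values come back as approximately 2 and 0.5, which are the two scale factors mentioned above.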

My next blog post shows how the SVD enables you to reduce the dimension of a data matrix by using a low-rank approximation, which has applications to image compression and de-noising.

The post The singular value decomposition: A fundamental technique in multivariate data analysis appeared first on The DO Loop.

23
Aug

## The arithmetic-geometric mean

All statisticians are familiar with the classical arithmetic mean. Some statisticians are also familiar with the geometric mean. Whereas the arithmetic mean of n numbers is the sum divided by n, the geometric mean of n nonnegative numbers is the nth root of the product of the numbers. The geometric mean is never greater than the arithmetic mean.

Did you know that there is a hybrid quantity, called the arithmetic-geometric mean, which is defined by combining the two quantities? The arithmetic-geometric mean is only defined for two positive numbers, x and y. It is defined as the limit of an alternating iterative process:

• Define a_1 = (x + y)/2 and g_1 = sqrt(x y) to be the arithmetic and geometric means, respectively.
• Iteratively define a_{n+1} = (a_n + g_n)/2 and g_{n+1} = sqrt(a_n g_n).

Interestingly, this process always converges. The number that it converges to is called the arithmetic-geometric mean of x and y, which is denoted by AGM(x,y). The AGM is between the geometric mean and the arithmetic mean. I am not aware of a multivariate generalization of the AGM function.

In SAS you can use the SAS/IML language or PROC FCMP to create a user-defined function that computes the arithmetic-geometric mean. Since the AGM is not a vector operation, I show the PROC FCMP approach:

```proc fcmp outlib=work.funcs.MathFuncs; function AGM(x, y); if x<0 | y<0 then return( . ); epsilon = 1e-10; /* precision of computation */ a = mean(x, y); g = geomean(x,y); do while (abs(a - g) > epsilon); a_new = mean(a, g); g_new = geomean(a, g); a = a_new; g = g_new; end; return( mean(a, g) ); endsub; quit;   /* test the function */ options cmplib=work.funcs; /* define location of function */ data _NULL_; agm = AGM(1, 0.8); put agm 18.14; run;```
` 0.89721143211504`

An example of calling the new AGM function is shown for the two numbers 1 and 0.8. The arithmetic mean of 1 and 0.8 is 0.9. The geometric mean of 1 and 0.8 is sqrt(0.8) = 0.8944. The computation shows that the arithmetic-geometric mean is 0.8972, which is between the arithmetic and geometric means.
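The same algorithm is only a few lines in any language. Here is an illustrative Python version that reproduces the value above:

```python
from math import sqrt

def agm(x, y, eps=1e-12):
    # alternate the arithmetic and geometric means until they agree
    a, g = (x + y)/2, sqrt(x*y)
    while abs(a - g) > eps:
        a, g = (a + g)/2, sqrt(a*g)
    return (a + g)/2

val = agm(1, 0.8)
print(val)   # ≈ 0.89721143211504, matching the PROC FCMP result
```

Because the iteration converges quadratically, only a handful of passes through the loop are needed.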

The following program computes the graph of the AGM function and displays a contour plot of the result. Note that AGM(x, x) = x for any nonnegative value of x.

```data grid; do x = 0 to 5 by 0.1; do y = 0 to 5 by 0.1; agm = AGM(x, y); output; end; end; run;   /* see http://blogs.sas.com/content/iml/2012/07/02/create-a-contour-plot-in-sas.html */ proc sgrender data=grid template=ContourPlotParm; dynamic _TITLE="Graph of Arithmetic-Geometric Mean AGM(x,y)" _X="x" _Y="y" _Z="AGM"; run;```

So, what is the arithmetic-geometric mean good for? The AGM seems mainly useful in applied mathematics and number theory for computing elliptic functions and elliptic integrals. I am not aware of any applications to statistics, perhaps because the AGM is not defined for a sample that contains n elements when n > 2.

Although the AGM function is not used in statistics, the iterative algorithm that defines the AGM is a technique that I have seen often in computational statistics. When a researcher wants two conditions to hold simultaneously, one possible method is to alternately solve for each condition and prove that this iterative process converges to a solution that satisfies both conditions simultaneously. The iterative process can be an effective way to implement an optimization problem if each sub-optimization is simple.

In multivariate statistics, this technique is used in the method of alternating least squares, which is used in several SAS procedures such as PROC TRANSREG, VARCLUS, and MDS. In linear algebra, this technique is used in the method of alternating projections. I have used this technique to compute the closest correlation matrix to an arbitrary matrix.

One last interesting fact. There is another kind of mean called the harmonic mean. You can compute the arithmetic-harmonic mean by applying the same iterative algorithm, but replace the GEOMEAN function with the HARMEAN function. The process converges, and the arithmetic-harmonic mean equals the geometric mean! Thus although this two-step definition might seem contrived, it is "natural" in the sense that it produces the geometric mean (of two numbers) from the arithmetic and harmonic means.
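Here is the same iteration with the harmonic mean swapped in (an illustrative Python sketch). Notice that the product a*h is invariant under the update, which is why the limit is exactly sqrt(x y):

```python
def arith_harmonic_mean(x, y, eps=1e-12):
    # same alternating scheme, with the harmonic mean replacing the geometric mean
    a, h = (x + y)/2, 2*x*y/(x + y)
    while abs(a - h) > eps:
        # note: a_{n+1} * h_{n+1} = a_n * h_n, so the product x*y is preserved
        a, h = (a + h)/2, 2*a*h/(a + h)
    return (a + h)/2

print(arith_harmonic_mean(2, 8))   # converges to sqrt(2*8) = 4
```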

The post The arithmetic-geometric mean appeared first on The DO Loop.

7
Aug

## The curse of non-unique eigenvectors

A SAS customer asked, "I computed the eigenvectors of a matrix in SAS and in another software package. I got different answers. How do I know which answer is correct?"

I've been asked variations of this question dozens of times. The answer is usually "both answers are correct."

The mathematical root of the problem is that eigenvectors are not unique. It is easy to show this: If v is an eigenvector of the matrix A, then by definition A v = λ v for some scalar eigenvalue λ. Notice that if you define u = α v for a scalar α ≠ 0, then u is also an eigenvector because A u = α A v = α λ v = λ u. Thus a multiple of an eigenvector is also an eigenvector.
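The argument above is easy to verify numerically. The following Python sketch uses a toy 2 x 2 symmetric matrix (chosen so that an eigenpair is known by hand) and confirms that every nonzero multiple of an eigenvector, including the sign flip, satisfies A u = λ u:

```python
# A = [[2, 1], [1, 2]] has eigenvalue 3 with eigenvector v = (1, 1)
A = [[2, 1], [1, 2]]
lam = 3

def matvec(M, v):
    # plain matrix-vector product
    return [sum(M[i][j]*v[j] for j in range(len(v))) for i in range(len(M))]

v = [1, 1]
for alpha in (1, -1, 0.5, -2.75):          # includes the sign flip v -> -v
    u = [alpha*x for x in v]
    Au = matvec(A, u)
    assert all(abs(Au[i] - lam*u[i]) < 1e-12 for i in range(2))
print("all scalar multiples are eigenvectors")
```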

Most statistical software (including SAS) tries to partially circumvent this problem by standardizing an eigenvector to have unit length (|| v || = 1). However, note that v and -v are both eigenvectors that have the same length. Thus even a standardized eigenvector is only unique up to a ± sign, and different software might return eigenvectors that differ in sign. In fact for some problems, the same software can return different answers when run on different operating systems (Windows versus Linux), or when using vendor-supplied basic linear algebra subroutines such as the Intel Math Kernel Library (MKL).

To further complicate the issue, software might sort the eigenvalues and eigenvectors in different ways. Some software (such as MATLAB) orders eigenvalues by magnitude, which is the absolute value of the eigenvalue. Other software (such as SAS) orders eigenvalues according to the value (of the real part) of the eigenvalues. (For most statistical computations, the matrices are symmetric and positive definite (SPD). For SPD matrices, which have real nonnegative eigenvalues, these two orderings are the same.)

### Eigenvectors of an example matrix

To illustrate the fact that different software and numerical algorithms can produce different eigenvectors, let's examine the eigenvectors of the following 3 x 3 matrix:

The eigenvectors of this matrix will be computed by using five different software packages: SAS, Intel's MKL, MATLAB, Mathematica, and R. The eigenvalues for this matrix are unique and are approximately 16.1, 0, and -1.1. Notice that this matrix is not positive definite, so the order of the eigenvectors will vary depending on the software. Let's compute the eigenvectors in five different ways.

Method 1: SAS/IML EIGEN Call: The following statements compute the eigenvalues and eigenvectors of M by using a built-in algorithm in SAS. This algorithm was introduced in SAS version 6 and was the default algorithm until SAS 9.4.

```proc iml; reset FUZZ; /* print very small numbers as 0 */ M = {1 2 3, 4 5 6, 7 8 9}; reset EIGEN93; /* use "9.3" algorithm; no vendor BLAS (option required for SAS 9.4m3) */ call eigen(EigVal, SASEigVec, M); print EigVal, SASEigVec[colname=("E1":"E3")];```

Notice that the eigenvalues are sorted by their real part, not by their magnitude. The eigenvectors are returned in a matrix. The i_th column of the matrix is an eigenvector for the i_th eigenvalue. Notice that the eigenvector for the largest eigenvalue (the first column) has all positive components. The eigenvector for the zero eigenvalue (the second column) has a negative component in the second coordinate. The eigenvector for the negative eigenvalue (the third column) has a negative component in the third coordinate.

Method 2: Intel MKL BLAS: Starting with SAS/IML 14.1, you can instruct SAS/IML to call the Intel Math Kernel Library for eigenvalue computation if you are running SAS on a computer that has the MKL installed. This feature is the default behavior in SAS/IML 14.1 (SAS 9.4m3), which is why the previous example used RESET EIGEN93 to get the older "9.3 and before" algorithm. The output for the following statements assumes that you are running SAS 9.4m3 or later and your computer has Intel's MKL.

```reset NOEIGEN93; /* use Intel MKL, if available */ call eigen(EigVal, MKLEigVec, M); print MKLEigVec[colname=("E1":"E3")];```

This is a different result than before, but it is still a valid set of eigenvectors. The first and third eigenvectors are the negative of the eigenvectors in the previous experiment. The eigenvectors are sorted in the same order, but that is because SAS (for consistency with earlier releases) internally sorts the eigenvectors that the MKL returns.

Method 3: MATLAB: The following MATLAB statements compute the eigenvalue and eigenvectors for the same matrix:

```M = [1, 2, 3; 4, 5, 6; 7, 8, 9]; [EigVec, EigVal] = eig(M);```
```EigVec = -0.2320 -0.7858 0.4082 -0.5253 -0.0868 -0.8165 -0.8187 0.6123 0.4082```

The eigenvalues are not displayed, but you can tell from the output that the eigenvalues are ordered by magnitude: 16.1, -1.1, and 0. The eigenvectors are the same as the MKL results (within rounding precision), but they are presented in a different order.

Method 4: Mathematica: This example matrix is used in the Mathematica documentation for the Eigenvectors function:

`Eigenvectors[{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}]`
```0.283349 -1.28335 1 0.641675 -0.141675 -2 1 1 1```

This is a different result, but still correct. The symbolic computations in Mathematica do not standardize the eigenvectors to unit length. Instead, they standardize them to have a 1 in the last component. The eigenvalues are sorted by magnitude (like the MATLAB output), but the first column has opposite signs from the MATLAB output.

Method 5: R: The R documentation states that the eigen function in R calls the LAPACK subroutines. Thus I expect it to get the same result as MATLAB.

```M <- matrix(c(1:9), nrow=3, byrow=TRUE) r <- eigen(M) r\$vectors```
``` [,1] [,2] [,3] [1,] -0.2319707 -0.78583024 0.4082483 [2,] -0.5253221 -0.08675134 -0.8164966 [3,] -0.8186735 0.61232756 0.4082483```

Except for rounding, this result is the same as the MATLAB output.

### Summary

This article used a simple 3 x 3 matrix to demonstrate that different software packages might produce different eigenvectors for the same input matrix. There were four different answers produced, all of which are correct. This is a result of the mathematical fact that eigenvectors are not unique: any multiple of an eigenvector is also an eigenvector! Different numerical algorithms can produce different eigenvectors, and this is compounded by the fact that you can standardize and order the eigenvectors in several ways.

Although it is hard to compare eigenvectors from different software packages, it is not impossible. First, make sure that the eigenvectors are ordered the same way. (You can skip this step for symmetric positive definite matrices.) Then make sure they are standardized to unit length. If you do those two steps, then the eigenvectors will agree up to a ± sign.

The post The curse of non-unique eigenvectors appeared first on The DO Loop.

1
Mar

## Monte Carlo estimates of joint probabilities

Monte Carlo techniques have many applications, but a primary application is to approximate the probability that some event occurs. The idea is to simulate data from the population and count the proportion of times that the event occurs in the simulated data.

For continuous univariate distributions, the probability of an event is the area under a density curve. The integral of the density from negative infinity to a particular value is the definition of the cumulative distribution function (CDF) for a distribution. Instead of performing numerical integration, you can use Monte Carlo simulation to approximate the probability.

### One-dimensional CDFs

In SAS software, you can use the CDF function to compute the CDF of many standard univariate distributions. For example, the statement prob = cdf("Normal", -1) computes the probability that a standard normal random variable takes on a value less than -1.

The CDF function is faster and more accurate than a Monte Carlo approximation, but let's see how the two methods compare. You can estimate the probability P(X < -1) by generating many random values from the N(0,1) distribution and computing the proportion that is less than -1, as shown in the following SAS DATA step:

```data _NULL_; call streaminit(123456); N = 10000; /* sample size */ do i = 1 to N; x = rand("Normal"); /* X ~ N(0,1) */ cnt + (x < -1); /* sum of counts for value less than -1 */ end; Prob = cdf("Normal", -1); /* P(x< -1) */ MCEst = cnt / N; /* Monte Carlo approximation */ put Prob=; put MCEst=; run;```
```Prob=0.1586552539 MCEst=0.1551```

The Monte Carlo estimate is correct to two decimal places. The accuracy of this Monte Carlo computation is proportional to 1/sqrt(N), where N is the size of the Monte Carlo sample. Thus if you want to double the accuracy you need to quadruple the sample size.
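The same Monte Carlo experiment is easy to replicate outside SAS. Here is an illustrative Python version that uses the standard library's NormalDist class for the exact value:

```python
import random
from statistics import NormalDist

random.seed(123456)
N = 10_000                                   # sample size
count = sum(random.gauss(0, 1) < -1 for _ in range(N))

prob = NormalDist().cdf(-1)                  # exact value: 0.15865...
mc_est = count / N                           # Monte Carlo approximation
print(round(prob, 4), round(mc_est, 4))
```

As in the SAS run, the estimate agrees with the exact probability to about two decimal places for a sample of this size.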

### Two-dimensional CDFs

SAS provides the PROBBNRM function for computing the CDF of a bivariate normal distribution, but does not provide a built-in function that computes the CDF for other multivariate probability distributions. However, you can use Monte Carlo techniques to approximate multivariate CDFs for any multivariate probability distribution for which you can generate random variates.

I have previously blogged about how to use the PROBBNRM function to compute the bivariate normal CDF. The following SAS/IML statements demonstrate how to use a Monte Carlo computation to approximate the bivariate normal CDF. The example uses a bivariate normal random variable Z ~ MVN(0, Σ), where Σ is the correlation matrix with Σ12 = 0.6.

The example computes the probability that a bivariate normal random variable is in the region G = {(x,y) | x<x0 and y<y0}. The program first calls the built-in PROBBNRM function to compute the probability. Then the program calls the RANDNORMAL function to generate 100,000 random values from the bivariate normal distribution. A binary vector (group) indicates whether each observation is in G. The MEAN function computes the proportion of observations that are in the region.

```proc iml; x0 = 0.3; y0 = 0.4; rho = 0.6; Prob = probbnrm(x0, y0, rho); /* P(x<x0 and y<y0) */   call randseed(123456); N = 1e5; /* sample size */ Sigma = (1 || rho) // (rho || 1); /* correlation matrix */ mean = {0 0}; Z = randnormal(N, mean, Sigma); /* sample from MVN(0, Sigma) */ group = (Z[,1] < x0 & Z[,2] < y0); /* binary vector */ MCEst = mean(group); /* = sum(group=1) / N */ print Prob MCEst;```
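As a cross-check, here is an illustrative Python version of the same estimate. It builds correlated normals from independent ones via z2 = ρ z1 + sqrt(1-ρ²) e, a standard construction that yields correlation ρ:

```python
import random
from math import sqrt

random.seed(123456)
x0, y0, rho = 0.3, 0.4, 0.6
N = 200_000

count = 0
for _ in range(N):
    z1 = random.gauss(0, 1)
    # conditional construction gives Corr(Z1, Z2) = rho
    z2 = rho*z1 + sqrt(1 - rho**2)*random.gauss(0, 1)
    count += (z1 < x0) and (z2 < y0)

print(round(count / N, 3))   # close to the PROBBNRM value of about 0.4978
```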

You can use a scatter plot to visualize the Monte Carlo technique. The following statements create a scatter plot and use the DROPLINE statement in PROC SGPLOT to indicate the region G. Of the 100000 random observations, 49750 of them were in the region G. These observations are drawn in red. The observations that are outside the region are drawn in blue.

```ods graphics / width=400px height=400px; title "Estimate of P(x < x0 and y < y0) is 0.4978"; title2 "x0 = 0.3; y0 = 0.4; rho = 0.6"; call scatter(Z[,1], Z[,2]) group=group grid={x y} procopt="noautolegend aspect=1" option="transparency=0.9 markerattrs=(symbol=CircleFilled)" other="dropline x=0.3 y=0.4 / dropto=both;";```

### Higher dimensions

The Monte Carlo technique works well in low dimensions. As the dimensions get larger, you need to generate a lot of random variates in order to obtain an accurate estimate. For example, the following statements generate 10 million random values from the five-dimensional distribution of uncorrelated normal variates and estimate the probability of all components being less than 0:

```d = 5; /* dimension */ N = 1e7; /* sample size */ mean = j(1, d, 0); /* {0,0,...,0} */ Z = randnormal(N, mean, I(d)); /* Z ~ MVN (0, I) */ v0 = {0 0 0 0 0}; /* cutoff values in each component */ ComponentsInRegion = (Z < v0)[,+]; /* number of components in region */ group = (ComponentsInRegion=d); /* binary indicator vector */ MCEst = mean(group); /* proportion of obs in region */ print (1/2**d)[label="Prob"] MCEst;```

Because the normal components are independent, the joint probability is the product of the probabilities for each component: (1/2)^5 = 0.03125. The Monte Carlo estimate is accurate to three decimal places.
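An illustrative Python version of the five-dimensional experiment (with a smaller sample than the SAS/IML run) shows the same agreement:

```python
import random

random.seed(1)
d, N = 5, 100_000
# event: all d independent standard normal components are negative
count = sum(all(random.gauss(0, 1) < 0 for _ in range(d)) for _ in range(N))

prob = (1/2)**d                 # by independence: product of P(X_i < 0) = 0.03125
mc_est = count / N
print(prob, round(mc_est, 4))
```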

The Monte Carlo technique can also handle non-rectangular regions. For example, you can compute the probability that a random variable is in a spherical region.

The Monte Carlo method is computationally expensive for high-dimensional distributions. In high dimensions (say, d > 10), you might need billions of random variates to obtain a reasonable approximation to the true probability. This is another example of the curse of dimensionality.

The post Monte Carlo estimates of joint probabilities appeared first on The DO Loop.