16
Aug

Use a bar chart to visualize pairwise correlations

Visualizing the correlations between variables often provides insight into their relationships. I've previously written about how to use a heat map to visualize a correlation matrix in SAS/IML, and Chris Hemedinger showed how to use Base SAS to visualize correlations between variables.

Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.

In SAS, PROC CORR can compute the correlations between variables and store them in matrix form in an output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.

ods exclude all;
proc corr data=sashelp.Heart;      /* pairwise correlation */
   var _NUMERIC_;
   ods output PearsonCorr = Corr;  /* write correlations, p-values, and sample sizes to data set */
run;
ods exclude none;

The CORR data set contains the correlation matrix, p-values, and sample sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:

proc iml;
use Corr;
   read all var "Variable" into ColNames;  /* get names of variables */
   read all var (ColNames) into mCorr;     /* matrix of correlations */
   ProbNames = "P"+ColNames;               /* variables for p-values are named PX, PY, PZ, etc */
   read all var (ProbNames) into mProb;    /* matrix of p-values */
close Corr;
 
call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames
     colorramp="ThreeColor" range={-1 1} title="Pairwise Correlation Matrix";
Heat map of correlations between variables

The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)

These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.

Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)

numCols = ncol(mCorr);                /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5;       /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols;                 /* construct the pairwise names */
   do col = 1 to row-1;
      PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]);
      i = i + 1;
   end;
end;
 
lowerIdx = loc(row(mCorr) > col(mCorr));  /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
Prob = mProb[ lowerIdx ];
Significant = choose(Prob > 0.05, "No ", "Yes");  /* use alpha=0.05 signif level */
 
create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
QUIT;

You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.

proc sort data=CorrPairs;  by Corr;  run;
 
ods graphics / width=600px height=800px;
title "Pairwise Correlations";
proc sgplot data=CorrPairs;
hbar PairNames / response=Corr group=Significant;
refline 0 / axis=x;
yaxis discreteorder=data display=(nolabel) 
      labelattrs=(size=6pt) fitpolicy=none 
      offsetmin=0.012 offsetmax=0.012  /* half of 1/k, where k=number of categories */
      colorbands=even colorbandsattrs=(color=gray transparency=0.9);
xaxis grid display=(nolabel);
keylegend / position=topright location=inside across=1;
run;
Bar chart of pairwise correlations between variables

The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.
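
For example, the following REFLINE statement is a minimal sketch of that idea (the 0.25 threshold is just the value mentioned above); you could add it to the PROC SGPLOT call in the previous section:

refline -0.25 0.25 / axis=x lineattrs=(pattern=dash);  /* mark |corr| = 0.25 */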

The bar chart is not perfect. This example, which analyzes 10 variables, is very tall with 45 rows. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).

Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.

The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.

14
Aug

What is rank correlation?

When someone refers to the correlation between two variables, they are probably referring to the Pearson correlation, which is the standard statistic that is taught in elementary statistics courses. Elementary courses do not usually mention that there are other measures of correlation.

Why would anyone want a different estimate of correlation? Well, the Pearson correlation, which is also known as the product-moment correlation, uses empirical moments of the data (means and standard deviations) to estimate the linear association between two variables. However, means and standard deviations can be unduly influenced by outliers in the data, so the Pearson correlation is not a robust statistic.

A simple robust alternative to the Pearson correlation is called the Spearman rank correlation, which is defined as the Pearson correlation of the ranks of each variable. (If a variable contains tied values, replace those values by their average rank.) The Spearman rank correlation is simple to compute and conceptually easy to understand. Some advantages of the rank correlation are

  • The rank correlation is always in the interval [-1, 1]. For "tame" data, the Spearman and Pearson correlations are close to each other. In fact, if X and Y are bivariate normal random variables with Pearson correlation ρ, then the Spearman correlation is 6/π arcsin(ρ/2), which is very close to the identity function on [-1, 1].
  • The rank correlation is robust to outliers. For example, the data set X={1, 2, 2, 5} has the same ranks as the set Y={1, 2, 2, 500}. Therefore for any third variable Z, the rank correlation between X and Z is the same as the rank correlation between Y and Z.
  • The rank correlation is invariant under any monotonic increasing transformation of the data, such as LOG, EXP, and SQRT. In the previous example, the rank correlation between Z and X is the same as the rank correlation between Z and the log-transform of X, which is {log(1), log(2), log(2), log(5)}. This is in contrast to the Pearson correlation, which is only invariant under affine transformations with positive scaling factors (X → a*X + b, where a > 0).
  • The rank correlation can be used for any ordinal variable. For example, if the variable X has the ordinal values {"Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied"}, and the variable Y has the ordinal values {"Low", "Medium", "High"}, then you can compute a rank correlation between X and Y.
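
The following sketch shows one way to carry out that computation in SAS. The Survey data set and its variable names are hypothetical; the idea is to assign numeric scores that preserve the ordinal ordering and then compute the Spearman correlation of the scores:

/* hypothetical example: Survey data set with ordinal character variables */
data SurveyScores;
   set Survey;
   select (Satisfaction);                    /* ordinal X variable */
      when ("Very Unsatisfied") xScore = 1;
      when ("Unsatisfied")      xScore = 2;
      when ("Satisfied")        xScore = 3;
      when ("Very Satisfied")   xScore = 4;
      otherwise                 xScore = .;
   end;
   select (Priority);                        /* ordinal Y variable */
      when ("Low")    yScore = 1;
      when ("Medium") yScore = 2;
      when ("High")   yScore = 3;
      otherwise       yScore = .;
   end;
run;
 
proc corr data=SurveyScores nosimple SPEARMAN;
   var xScore yScore;
run;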

Compute rank correlation in SAS

PROC CORR in SAS supports several measures of correlation, including the Pearson and Spearman correlations. For data without outliers, the two measures are often similar. For example, the following call to PROC CORR computes the Spearman rank correlation between three variables in the Sashelp.Class data set:

/* Compute PEARSON and SPEARMAN rank correlation by using PROC CORR in SAS */
proc corr data=sashelp.class noprob nosimple PEARSON SPEARMAN;
   var height weight age;
run;
Spearman rank correlation compared to the Pearson correlation in SAS

According to both statistics, these variables are very positively correlated, with correlations in the range [0.7, 0.88]. Notice that the rank correlations (the lower table) are similar to the Pearson correlations for these data. However, if the data contain outliers, the rank correlation estimate is less influenced by the magnitude of the outliers.
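
To see the effect of an outlier for yourself, you can perturb the data and compare how much each estimate changes. The following sketch is not part of the original example; it replaces one value of Weight by an extreme value and recomputes both sets of correlations:

/* hypothetical perturbation of the Sashelp.Class data */
data ClassOutlier;
   set sashelp.class;
   if _N_ = 1 then Weight = 500;   /* introduce a single extreme outlier */
run;
 
proc corr data=ClassOutlier noprob nosimple PEARSON SPEARMAN;
   var Height Weight Age;
run;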

Compute rank correlation manually

As mentioned earlier, the Spearman rank correlation is conceptually easy to understand. It consists of two steps: compute the ranks of each variable and compute the Pearson correlation between the ranks. It is instructive to reproduce each step in the Spearman computation. You can use PROC RANK in SAS to compute the ranks of the variables, then use PROC CORR with the PEARSON option to compute the Pearson correlation of the ranks. If the data do not contain any missing values, then the following statements implement the two steps that compute the Spearman rank correlation:

/* Compute the Spearman rank correlation "manually" by explicitly computing ranks */
/* First compute ranks; use average rank for ties */
proc rank data=sashelp.class out=classRank ties=mean;
   var height weight age;
   ranks RankHeight RankWeight RankAge;
run;
 
/* Then compute Pearson correlation on the ranks */
proc corr data=classRank noprob nosimple PEARSON;
   var RankHeight RankWeight RankAge;
run;

The resulting table of correlations is the same as in the previous section and is not shown. Although PROC CORR can compute the rank correlation directly, it is comforting that these two steps produce the same answer. Furthermore, this two-step method can be useful if you decide to implement a rank-based statistic that is not produced by any SAS procedure. This two-step method is also the way to compute the Spearman correlation of character ordinal variables because PROC CORR does not analyze character variables. However, PROC RANK supports both character and numeric variables.

If you have missing values in your data, then make sure you delete the observations that contain missing values before you call PROC RANK. Equivalently, you can use a WHERE statement to omit the missing values. For example, you could insert the following statement into the PROC RANK statements:
   where height^=. & weight^=. & age^=.;

Compute rank correlation in SAS/IML software

In the SAS/IML language, the CORR function computes the Spearman rank correlation directly, as follows. The results are the same as the results from PROC CORR, and are not shown.

proc iml;
use sashelp.class;
   read all var {height weight age} into X;
close;
 
RankCorr = corr(X, "Spearman");  /* compute rank correlation */

If you ever need to compute a rank-based statistic manually, you can also use the RANKTIE function to compute the ranks of the elements in a numerical vector, such as
   ranktie(X[ ,1], "Mean");

Summary

The Spearman rank correlation is a robust measure of the linear association between variables. It is related to the classical Pearson correlation because it is defined as the Pearson correlation between the ranks of the individual variables. It has some very nice properties, including being robust to outliers and being invariant under monotonic increasing transformations of the data. For other measures of correlation that are supported in SAS, see the PROC CORR documentation.

The post What is rank correlation? appeared first on The DO Loop.

9
Aug

Robust principal component analysis in SAS

Recently, I was asked whether SAS can perform a principal component analysis (PCA) that is robust to the presence of outliers in the data. A PCA requires a data matrix, an estimate for the center of the data, and an estimate for the variance/covariance of the variables. Classically, these estimates are the mean and the Pearson covariance matrix, respectively, but if you replace the classical estimates by robust estimates, you can obtain a robust PCA.

This article shows how to implement a classical (non-robust) PCA by using the SAS/IML language and how to modify that classical analysis to create a robust PCA.

A classical principal component analysis in SAS

In SAS, you can compute a classical principal component analysis by using the PRINCOMP procedure or the FACTOR procedure.

Let's use PROC PRINCOMP to perform a simple PCA. The PROC PRINCOMP results will be the basis of comparison when we implement the PCA in PROC IML. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation:

proc princomp data=Crime /* add COV option for covariance analysis */
              outstat=PCAStats_CORR out=Components_CORR 
              plots=score(ncomp=4);
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods select Eigenvectors ScreePlot '2 by 1' '4 by 3';
run;
 
proc print data=Components_CORR(obs=5);
   var Prin:;
run;

To save space, the output from PROC PRINCOMP is not shown, but it includes a table of the eigenvalues and principal component vectors (eigenvectors) of the correlation matrix, as well as a plot of the scores of the observations, which are the projection of the observations onto the principal components. The procedure writes two data sets: the eigenvalues and principal components are contained in the PCAStats_CORR data set and the scores are contained in the Components_CORR data set.

A classical principal component analysis in SAS/IML

Assume that the data consists of n observations and p variables and assume all values are nonmissing. If you are comfortable with multivariate analysis, a principal component analysis is straightforward: the principal components are the eigenvectors of a covariance or correlation matrix, and the scores are the projection of the centered data onto the eigenvector basis. However, if it has been a few years since you studied PCA, Abdi and Williams (2010) provides a nice overview of the mathematics. The following matrix computations implement the classical PCA in the SAS/IML language:

proc iml;
reset EIGEN93;       /* use "9.3" algorithm, not vendor BLAS (SAS 9.4m3) */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
  read all var varNames into X;  /* X = data matrix (assume nonmissing) */
close;
 
S = cov(X);                      /* estimate of covariance matrix */
R = cov2corr(S);                 /* = corr(X) */
call eigen(D, V, R);             /* D=eigenvalues; columns of V are PCs */
scale = T( sqrt(vecdiag(S)) );   /* = std(X) */
B = (X - mean(X)) / scale;       /* center and scale data about the mean */
scores = B * V;                  /* project standardized data onto the PCs */
 
print V[r=varNames c=("Prin1":"Prin7") L="Eigenvectors"];
print (scores[1:5,])[c=("Prin1":"Prin7") format=best9. L="Scores"];
Classical principal component analysis in SAS

The principal components (eigenvectors) and scores for these data are identical to the same quantities that were produced by PROC PRINCOMP. In the preceding program I could have directly computed R = corr(X) and scale = std(X), but I generated those quantities from the covariance matrix because that is the approach used in the next section, which computes a robust PCA.

If you want to compute the PCA from the covariance matrix, you would use S in the EIGEN call, and define B = X - mean(X) as the centered (but not scaled) data matrix.
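
In code, that change amounts to only a few lines. The following sketch reuses the matrices defined above:

call eigen(D, V, S);          /* use the covariance matrix instead of R */
B = X - mean(X);              /* center, but do not scale, the data */
scores = B * V;               /* project the centered data onto the PCs */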

A robust principal component analysis

This section is based on a similar robust PCA computation in Wicklin (2010). The main idea behind a robust PCA is that if there are outliers in the data, the covariance matrix will be unduly influenced by those observations. Therefore you should use a robust estimate of the covariance matrix for the eigenvector computation. Also, the arithmetic mean is unduly influenced by the outliers, so you should center the data by using a robust estimate of center before you form the scores.

SAS/IML software provides the MCD subroutine, which computes the minimum covariance determinant (MCD) estimate, a robust estimate of the location and covariance of multivariate data. The following statements compute the robust estimates and use them to carry out the PCA:

/* get robust estimates of location and covariance */
N = nrow(X); p = ncol(X);        /* number of obs and variables */
optn = j(8,1,.);                 /* default options for MCD */
optn[4]= floor(0.75*N);          /* override default: use 75% of data for robust estimates */
call MCD(rc,est,dist,optn,x);    /* compute robust estimates */
RobCov = est[3:2+p, ];           /* robust estimate of covariance */
RobLoc = est[1, ];               /* robust estimate of location */
 
/* use robust estimates to perform PCA */
RobCorr = cov2corr(RobCov);      /* robust correlation */
call eigen(D, V, RobCorr);       /* D=eigenvalues; columns of V are PCs */
RobScale = T( sqrt(vecdiag(RobCov)) ); /* robust estimates of scale */
B = (x-RobLoc) / RobScale;       /* center and scale data */
scores = B * V;                  /* project standardized data onto the PCs */

Notice that the SAS/IML code for the robust PCA is very similar to the classical PCA. The only difference is that the robust estimates of covariance and location (from the MCD call) are used instead of the classical estimates.

If you want to compute the robust PCA from the covariance matrix, you would use RobCov in the EIGEN call, and define B = X - RobLoc as the centered (but not scaled) data matrix.

Comparing the classical and robust PCA

Classical and robust principal component scores for crime data, computed in SAS

You can create a score plot to compare the classical scores to the robust scores. The Getting Started example in the PROC PRINCOMP documentation shows the classical scores for the first three components. The documentation suggests that Nevada, Massachusetts, and New York could be outliers for the crime data.

You can write the robust scores to a SAS data set and use the SGPLOT procedure to plot the scores of the classical and robust PCA on the same scale. The first and third component scores are shown to the right, with abbreviated state names serving as labels for the markers. (Click to enlarge.) You can see that the robust first-component scores for California and Nevada are separated from the other states, which makes them outliers in that dimension. (The first principal component measures overall crime rate.) The robust third-component scores for New York and Massachusetts are also well-separated and are outliers for the third component. (The third principal component appears to contrast murder with rape and auto theft with other theft.)

Because the crime data does not have huge outliers, the robust PCA is a perturbation of the classical PCA, which makes it possible to compare the analyses. For data that have extreme outliers, the robust analysis might not resemble its classical counterpart.

If you run an analysis like this on your own data, remember that eigenvectors are not unique and so there is no guarantee that the eigenvectors for the robust correlation matrix will be geometrically aligned with the eigenvectors from the classical correlation matrix. For these data, I multiplied the second and fourth robust components by -1 because that seems to make the score plots easier to compare.
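
In SAS/IML, flipping the signs is a small change. The following sketch uses the matrices from the robust analysis above:

V[ , {2 4}]      = -V[ , {2 4}];        /* flip signs of the 2nd and 4th robust PCs */
scores[ , {2 4}] = -scores[ , {2 4}];   /* flip the corresponding scores */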

Summary

In summary, you can implement a robust principal component analysis by using robust estimates for the correlation (or covariance) matrix and for the "center" of the data. The MCD subroutine in SAS/IML language is one way to compute a robust estimate of the covariance and location of multivariate data. The SAS/IML language also provides ways to compute eigenvectors (principal components) and project the (centered and scaled) data onto the principal components.

You can download the SAS program that computes the analysis and creates the graphs.

As discussed in Hubert, Rousseeuw, and Branden (2005), the MCD algorithm is most useful for 100 or fewer variables. They describe an alternative computation that can support a robust PCA in higher dimensions.

References

The post Robust principal component analysis in SAS appeared first on The DO Loop.

2
Aug

Dimension reduction: Guidelines for retaining principal components

Last week I blogged about the broken-stick problem in probability, which reminded me that the broken-stick model is one of the many techniques that have been proposed for choosing the number of principal components to retain during a principal component analysis. Recall that for a principal component analysis (PCA) of p variables, a goal is to represent most of the variation in the data by using k new variables, where hopefully k is much smaller than p. Thus PCA is known as a dimension-reduction algorithm.

Many researchers have proposed methods for choosing the number of principal components. Some methods are heuristic, others are statistical. No method is perfect. Often different techniques result in different suggestions.

This article uses SAS to implement the broken stick model and compares that method with three other simple rules for dimension reduction. A few references are provided at the end of this article.

A principal component analysis by using PROC PRINCOMP

Let's start with an example. In SAS, you can use the PRINCOMP procedure to conduct a principal component analysis. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation. The program analyzes seven crime rates for the 50 US states in 1977. (The documentation shows how to generate the data set.) The following call generates a scree plot, which shows the proportion of variance explained by each component. It also writes the Eigenvalues table to a SAS data set:

proc princomp data=Crime plots=scree;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods output Eigenvalues=Evals;
run;
Eigenvalues for a principal component analysis in SAS
Scree plot of eigenvalues for a principal component analysis in SAS

The panel shows two graphs that plot the numbers in the "Eigenvalues of the Correlation Matrix" table. The plot on the left is the scree plot, which is a graph of the eigenvalues. The sum of the eigenvalues is 7, which is the number of variables in the analysis. If you divide each eigenvalue by 7, you obtain the proportion of variance that each principal component explains. The graph on the right plots the proportions and the cumulative proportions.

The scree plot as a guide to retaining components

The scree plot is my favorite graphical method for deciding how many principal components to keep. If the scree plot contains an "elbow" (a sharp change in the slopes of adjacent line segments), that location might indicate a good number of principal components (PCs) to retain. For this example, the scree plot shows a large change in slopes at the second eigenvalue and a smaller change at the fourth eigenvalue. From the graph of the cumulative proportions, you can see that the first two PCs explain 76% of the variance in the data, whereas the first four PCs explain 91%.

If "detect the elbow" is too imprecise for you, a more precise algorithm is to start at the right-hand side of the scree plot and look at the points that lie (approximately) on a straight line. The leftmost point along the trend line indicates the number of components to retain. (In geology, "scree" is rubble at the base of a cliff; the markers along the linear trend represent the rubble that can be discarded.) For the example data, the markers for components 4–7 are linear, so components 1–4 would be kept. This rule (and the scree plot) was proposed by Cattell (1966) and revised by Cattell and Jaspers (1967).

How does the broken-stick model choose components?

D. A. Jackson (1993) says that the broken-stick method is one of the better methods for choosing the number of PCs. The method provides "a good combination of simplicity of calculation and accurate evaluation of dimensionality relative to the other statistical approaches" (p. 2212). The broken-stick model retains components that explain more variance than would be expected by randomly dividing the variance into p parts. As I discussed last week, if you randomly divide a quantity into p parts, the expected proportion of the kth largest piece is (1/p)Σ(1/i) where the summation is over the values i=k..p. For example, if p=7 then
E1 = (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.37,
E2 = (1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.228,
E3 = (1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.156, and so forth.

I think of the "expected proportions" as corresponding to a null model that contains uncorrelated (noise) variables. If you plot the eigenvalues of the correlation matrix against the broken-stick proportions, the observed proportions that are higher than the expected proportions indicate how many principal components to keep.

The broken-stick model for retaining components

Broken-stick method for retaining principal components

The plot to the right shows the scree plot overlaid on a dashed curve that indicates the expected proportions that result from a broken-stick model. An application of the broken-stick model keeps one PC because only the first observed proportion of variance is higher than the corresponding broken-stick proportion.

How can you compute the points on the dashed curve? The expected proportions in the broken-stick model for p variables are proportional to the cumulative sums of the sequence of ratios {1/p, 1/(p-1), ..., 1}. You can use the CUSUM function in SAS/IML to compute a cumulative sum of a sequence, as shown below. Notice that the previous call to PROC PRINCOMP used the ODS OUTPUT statement to create a SAS data set that contains the values in the Eigenvalue table. The SAS/IML program reads in that data and compares the expected proportions to the observed proportions. The number of components to retain is computed as the largest integer k for which the first k components each explain more variance than the broken-stick model (null model).

proc iml;
use Evals;  read all var {"Number" "Proportion"};  close;
 
/* Broken Stick (Jolliffe 1986; J. E. Jackson, p. 47) */
/* For random p-1 points in [0,1], expected lengths of the p subintervals are: */
p = nrow(Proportion);
g = cusum(1 / T(p:1)) / p;   /* expected lengths of intervals (smallest to largest) */
ExpectedLen = g[p:1];        /* reverse order: largest to smallest */
 
keep = 0;                    /* find first k for which ExpectedLen[i] < Proportion[i] if i<=k */
do i = 1 to p while(ExpectedLen[i] < Proportion[i]);
   keep = i;
end;
print Proportion ExpectedLen keep;
Broken-stick rule for retaining principal components

As seen in the graph, only the first component is retained under the broken-stick model.

Average of eigenvalues

The average-eigenvalue test (Kaiser-Guttman test) retains the eigenvalues that exceed the average eigenvalue. For a p x p correlation matrix, the sum of the eigenvalues is p, so the average value of the eigenvalues is 1. To account for sampling variability, Jolliffe (1972) suggested a more liberal criterion: retain eigenvalues greater than 0.7 times the average eigenvalue. These two suggestions are implemented below:

/* Average Root (Kaiser 1960; Guttman 1954; J. E. Jackson, p. 47) */
mean = mean(Proportion);
keepAvg = loc( Proportion >= mean )[<>];
 
/* Scaled Average Root (Jolliffe 1972; J. E. Jackson, p. 47-48) */
keepScaled = loc( Proportion >= 0.7*mean )[<>];
print keepAvg keepScaled;
Average eigenvalue rule for retaining principal components

Create the broken-stick graph

For completeness, the following statements write the broken-stick proportions to a SAS data set and call PROC SGPLOT to overlay the proportion of variance for the observed data on the broken-stick model:

/* write expected proportions for broken-stick model */
create S var {"Number" "Proportion" "ExpectedLen"}; append; close;
quit;
 
title "Broken Stick Method for Retaining Principal Components";
proc sgplot data=S;
label ExpectedLen = "Broken-Stick Rule"  Proportion = "Proportion of Variance"
      Number = "Number of Components";
   series x=Number y=ExpectedLen / lineattrs=(pattern=dash);
   series x=Number y=Proportion / lineattrs=(thickness=2);
   keylegend / location=inside position=topright across=1;
   xaxis grid;
   yaxis label = "Proportion of Variance";
run;

Summary

Sometimes people ask why PROC PRINCOMP doesn't automatically choose the "correct" number of PCs to use for dimension reduction. This article describes four popular heuristic rules, all of which give different answers! The rules in this article are the scree test (2 or 4 components), the broken-stick rule (1 component), the average eigenvalue rule (2 components), and the scaled eigenvalue rule (3 components).

So how should a practicing statistician decide how many PCs to retain? First, remember that these guidelines do not tell you how many components to keep, they merely make suggestions. Second, recognize that any reduction of dimension requires a trade-off between accuracy (high dimensions) and interpretability (low dimensions). Third, these rules—although helpful—cannot replace domain-specific knowledge of the data. Try each suggestion and see if the resulting model contains the features in the data that are important for your analysis.

Further reading

The post Dimension reduction: Guidelines for retaining principal components appeared first on The DO Loop.

19
Jul

A quantile definition for skewness

Skewness is a measure of the asymmetry of a univariate distribution. I have previously shown how to compute the skewness for data distributions in SAS. The previous article computes Pearson's definition of skewness, which is based on the standardized third central moment of the data.

Moment-based statistics are sensitive to extreme outliers. A single extreme observation can radically change the mean, standard deviation, and skewness of data. It is not surprising, therefore, that there are alternative definitions of skewness. One robust definition of skewness that is intuitive and easy to compute is a quantile definition, which is also known as the Bowley skewness or Galton skewness.

A quantile definition of skewness

The quantile definition of skewness uses Q1 (the lower quartile value), Q2 (the median value), and Q3 (the upper quartile value). You can measure skewness as the difference between the lengths of the upper quartile (Q3-Q2) and the lower quartile (Q2-Q1), normalized by the length of the interquartile range (Q3-Q1). In symbols, the quantile skewness γQ is

γQ = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1) = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)

You can visualize this definition by using the figure to the right, which shows the relevant lengths used to define the quantile skewness (Bowley skewness). For a symmetric distribution, the quantile skewness is 0 because the length Q3-Q2 is equal to the length Q2-Q1. If the right length (Q3-Q2) is larger than the left length (Q2-Q1), then the quantile skewness is positive. If the left length is larger, then the quantile skewness is negative. For the extreme cases when Q1=Q2 or Q2=Q3, the quantile skewness is ±1. Consequently, whereas the Pearson skewness can be any real value, the quantile skewness is bounded in the interval [-1, 1]. The quantile skewness is not defined if Q1=Q3, just as the Pearson skewness is not defined when the variance of the data is 0.

There is an intuitive interpretation for the quantile skewness formula. Recall that the relative difference between two quantities R and L can be defined as their difference divided by their average value. In symbols, RelDiff = (R - L) / ((R+L)/2). If you choose R to be the length Q3-Q2 and L to be the length Q2-Q1, then quantile skewness is half the relative difference between the lengths.
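
To see why, substitute R = Q3 - Q2 and L = Q2 - Q1 into the formula for the relative difference. Because R + L = Q3 - Q1, you get
RelDiff = (R - L) / ((R + L)/2) = 2*(Q3 - 2*Q2 + Q1) / (Q3 - Q1) = 2*γQ,
so the quantile skewness is half the relative difference.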

Compute the quantile skewness in SAS

It is instructive to simulate some skewed data and compute the two measures of skewness. The following SAS/IML statements simulate 1000 observations from a Gamma(a=4) distribution. The Pearson skewness of a Gamma(a) distribution is 2/sqrt(a), so the Pearson skewness for a Gamma(4) distribution is 1. For a large sample, the sample skewness should be close to the theoretical value. The QNTL call computes the quantiles of a sample.

/* compute the quantile skewness for data */
proc iml;
call randseed(12345);
x = j(1000, 1);
call randgen(x, "Gamma", 4);
 
skewPearson = skewness(x);           /* Pearson skewness */
call qntl(q, x, {0.25 0.5 0.75});    /* sample quartiles */
skewQuantile = (q[3] -2*q[2] + q[1]) / (q[3] - q[1]);
print skewPearson skewQuantile;
The Pearson and Bowley skewness statistics for skewed data

For this sample, the Pearson skewness is 1.03 and the quantile skewness is 0.174. If you generate a different random sample from the same Gamma(4) distribution, the statistics will change slightly.

Relationship between quantile skewness and Pearson skewness

In general, there is no simple relationship between quantile skewness and Pearson skewness for a data distribution. (This is not surprising: there is also no simple relationship between a median and a mean, nor between the interquartile range and the standard deviation.) Nevertheless, it is interesting to compare the Pearson skewness to the quantile skewness for a particular probability distribution.

For many probability distributions, the Pearson skewness is a function of the parameters of the distribution. To compute the quantile skewness for a probability distribution, you can use the quantiles for the distribution. The following SAS/IML statements compute the skewness for the Gamma(a) distribution for varying values of a.

/* For Gamma(a), the Pearson skewness is skewP = 2 / sqrt(a).  
   Use the QUANTILE function to compute the quantile skewness for the distribution. */
skewP = do(0.02, 10, 0.02);                  /* Pearson skewness for distribution */
a = 4 / skewP##2;        /* invert skewness formula for the Gamma(a) distribution */
skewQ = j(1, ncol(skewP));                   /* allocate vector for results       */
do i = 1 to ncol(skewP);
   Q1 = quantile("Gamma", 0.25, a[i]);
   Q2 = quantile("Gamma", 0.50, a[i]);
   Q3 = quantile("Gamma", 0.75, a[i]);
   skewQ[i] = (Q3 -2*Q2 + Q1) / (Q3 - Q1);  /* quantile skewness for distribution */
end;
 
title "Pearson vs. Quantile Skewness";
title2 "Gamma(a) Distributions";
call series(skewP, skewQ) grid={x y} label={"Pearson Skewness" "Quantile Skewness"};
Pearson skewness versus quantile skewness for the Gamma distribution

The graph shows a nonlinear relationship between the two skewness measures. This graph is for the Gamma distribution; other distributions would have a different shape. If a distribution has a parameter value for which the distribution is symmetric, then the graph will go through the point (0,0). For highly skewed distributions, the quantile skewness will approach ±1 as the Pearson skewness approaches ±∞.

Alternative quantile definitions

Several researchers have noted that there is nothing special about using the first and third quartiles to measure skewness. An alternative formula (sometimes called Kelly's coefficient of skewness) uses deciles: γKelly = ((P90 - P50) - (P50 - P10)) / (P90 - P10). Hinkley (1975) considered the qth and (1-q)th quantiles for arbitrary values of q.
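
If you want to compute Kelly's coefficient for the simulated data, you can continue the earlier SAS/IML session. The following sketch assumes that the vector x of Gamma(4) data is still in memory:

call qntl(d, x, {0.1 0.5 0.9});                       /* P10, P50, P90 */
skewKelly = (d[3] - 2*d[2] + d[1]) / (d[3] - d[1]);   /* decile-based skewness */
print skewKelly;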

Conclusions

The quantile definition of skewness is easy to compute. In fact, you can compute the statistic by hand without a calculator for small data sets. Consequently, the quantile definition provides an easy way to quickly estimate the skewness of data. Since the definition uses only quantiles, the quantile skewness is robust to extreme outliers.

At the same time, the Bowley-Galton quantile definition has several disadvantages. It uses only the central 50% of the data to estimate the skewness. Two different data sets that have the same quartile statistics will have the same quantile skewness, regardless of the shape of the tails of the distribution. And, as mentioned previously, the use of the 25th and 75th percentiles are somewhat arbitrary.

Although the Pearson skewness is widely used in the statistical community, it is worth mentioning that the quantile definition is ideal for use with a box-and-whisker plot. The Q1, Q2, and Q3 quartiles are part of every box plot. Therefore you can visually estimate the quantile skewness as the relative difference between the lengths of the upper and lower boxes.

The post A quantile definition for skewness appeared first on The DO Loop.

17
Jul

3 ways to visualize prediction regions for classification problems

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

Prediction regions for the binary classification problem. Graph created in SAS.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)

This article shows three ways to visualize prediction regions in SAS:

  1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
  2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
  3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variables are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   store work.LogiModel;                /* save model to item store */
run;
 
data Grid;                              /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
      output;
   end;
end;
run;
 
proc plm restore=work.LogiModel;        /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;  /* evaluate the model on new data */
run;

The polygon method

Parameter estimates for  logistic model

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize the regions. The graph is shown at the top of this article. The drawback of this method is that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!
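
Here is a minimal sketch of one way to construct and plot the regions. It is not the code from the downloadable program: the coefficients are the estimates quoted above, and the plotting ranges follow the grid defined earlier. Because the boundary is a straight line, each region is a quadrilateral:

data Regions;
length Region $ 8;
ymin = -7.5;  ymax = 7.5;                    /* vertical range of the data */
b0 = (-2.3565 + 4.7618*0) / 0.7959;          /* boundary value of x2 at x1=0 */
b1 = (-2.3565 + 4.7618*1) / 0.7959;          /* boundary value of x2 at x1=1 */
Region = "Y=1";                              /* region above the boundary line */
x1 = 0; x2 = b0;   output;
x1 = 1; x2 = b1;   output;
x1 = 1; x2 = ymax; output;
x1 = 0; x2 = ymax; output;
Region = "Y=0";                              /* region below the boundary line */
x1 = 0; x2 = b0;   output;
x1 = 1; x2 = b1;   output;
x1 = 1; x2 = ymin; output;
x1 = 0; x2 = ymin; output;
keep Region x1 x2;
run;
 
proc sgplot data=Regions noautolegend;
   polygon x=x1 y=x2 id=Region / group=Region fill transparency=0.6;
   /* to overlay the training data, append it to Regions and add a SCATTER statement */
run;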

The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   effectplot contour(x=x1 y=x2);       /* 2. contour plot with scatter plot overlay */
run;

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

The background grid method

Prediction region for a classification problem with two outcomes

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.
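
Here is a minimal sketch of the idea; the marker size is a guess that you will probably need to adjust, as discussed above. The sketch classifies each grid point from the earlier PROC PLM output and colors the markers by the predicted outcome:

data PredClass;
   set Pred;
   PredY = (Predicted >= 0.5);     /* predicted outcome for the binary model */
run;
 
proc sgplot data=PredClass;
   scatter x=gx y=gy / group=PredY markerattrs=(symbol=SquareFilled size=4);
run;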

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.

Summary

This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first method is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize prediction regions for the classification problem in machine learning.

You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment.

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.

5
Jul

Test for the equality of two proportions in SAS

A SAS customer asked how to use SAS to conduct a Z test for the equality of two proportions. He was directed to the SAS Usage Note "Testing the equality of two or more proportions from independent samples." The note says to "specify the CHISQ option in the TABLES statement of PROC FREQ to compute this test," and then adds "this is equivalent to the well-known Z test for comparing two independent proportions."

You might wonder why a chi-square test for association is equivalent to a Z test for the equality of proportions. You might also wonder if there is a direct way to test the equality of proportions. This article implements the well-known test for proportions in the DATA step and compares the results to the chi-square test results. It also shows how to get this test directly from PROC FREQ by using the RISKDIFF option.

A chi-square test for association in SAS

The SAS Usage Note poses the following problem: Suppose you want to compare the proportions responding "Yes" to a question in independent samples of 100 men and 100 women. The number of men responding "Yes" was observed to be 30 and the number of women responding "Yes" was 45.

You can create the data by using the following DATA step, then call PROC FREQ to analyze the association between the response variable and gender.

data Prop;
length Group $12 Response $3;
input Group Response N;
datalines;
Men          Yes  30
Men          No   70
Women        Yes  45
Women        No   55
;
 
proc freq data=Prop order=data;
   weight N;
   tables Group*Response / chisq;
run;
Test for association of two categorical variables in SAS

As explained in the PROC FREQ documentation, the Pearson chi-square statistic indicates an association between the variables in the 2 x 2 table. The results show that the chi-square statistic (for 1 degree of freedom) is 4.8, which corresponds to a p-value of 0.0285. The test indicates that we should reject the null hypothesis of no association at the 0.05 significance level.

As stated in the SAS Usage Note, this association test is equivalent to a Z test for whether the proportion of males who responded "Yes" equals the proportion of females who responded "Yes." The equivalence relies on a fact from probability theory: a chi-square random variable with 1 degree of freedom is the square of a random variable from the standard normal distribution. Thus the square root of the chi-square statistic is the Z statistic (up to a sign) that you get from the test of equality of two proportion. Therefore the Z statistic should be z = ±sqrt(4.8) = ±2.19. The p-value is unchanged.
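
You can verify the equivalence numerically with a few lines of DATA step code. This is a quick check that uses the rounded statistic from the output above:

data _null_;
   chiSq = 4.8;                          /* Pearson chi-square statistic (1 DF) */
   pChi  = sdf("ChiSquare", chiSq, 1);   /* P(chi-square > 4.8) */
   z     = sqrt(chiSq);                  /* |Z| statistic */
   pZ    = 2*sdf("Normal", z);           /* two-sided p-value for the Z test */
   put pChi= pZ=;                        /* both are approximately 0.0285 */
run;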

Z test for the equality of two proportions: A DATA step implementation

For comparison, you can implement the classical Z test by applying the formulas from a textbook or from the course material from Penn State, which includes a section about comparing two proportions. The following DATA step implements the Z test for equality of proportions:

/* Implement the Z test for pre-summarized statistics. Specify the group proportions and sizes. 
   For formulas, see https://onlinecourses.science.psu.edu/stat414/node/268 */
%let alpha = 0.05;
%let N1    = 100;   /* total trials in Group1 */
%let Event1=  30;   /* Number of events in Group1  */
%let N2    = 100;   /* total trials in Group2 */
%let Event2=  45;   /* Number of events in Group2  */
 
%let Side  =   2;   /* use L, U, or 2 for lower, upper, or two-sided test */
title "Test of H0: p1=p2 vs Ha: p1^=p2"; /* change for Side=L or U */
 
data zTestProp;
p1Hat = &Event1 / &N1;                /* observed proportion in Group1 */
var1  = p1Hat*(1-p1Hat) / &N1;        /* variance in Group1 */
p2Hat = &Event2 / &N2;                /* observed proportion in Group2 */
var2  = p2Hat*(1-p2Hat) / &N2;        /* variance in Group2 */
/* use pooled estimate of p for test */
Diff = p1Hat - p2Hat;                 /* estimate of p1 - p2 */
pHat = (&Event1 + &Event2) / (&N1 + &N2);
pVar = pHat*(1-pHat)*(1/&N1 + 1/&N2); /* pooled variance */
SE   = sqrt(pVar);                    /* estimate of standard error */
Z = Diff / SE;    
 
Side = "&Side";
if Side="L" then                      /* one-sided, lower tail */
   pValue = cdf("normal", z);
else if Side="U" then                 /* one-sided, upper tail */
   pValue = sdf("normal", Z);         /* SDF = 1 - CDF */
else if Side="2" then
   pValue = 2*(1-cdf("normal", abs(Z))); /* two-sided */
format pValue PVALUE6.4 Z 7.4;
label pValue="Pr < Z";
drop var1 var2 pHat pVar;
run;
 
proc print data=zTestProp label noobs; run;
Test for equality of two independent proportions in SAS

The DATA step obtains a test statistic of Z = –2.19, which is one of the square roots of the chi-square statistic in the PROC FREQ output. Notice also that the p-value from the DATA step matches the p-value from the PROC FREQ output.

Test equality of proportions by using PROC FREQ

There is actually a direct way to test for the equality of two independent proportions: use the RISKDIFF option in the TABLES statement in PROC FREQ. In the documentation, binomial proportions are called "risks," so a "risk difference" is a difference in proportions. (Also, a "relative risk" (the RELRISK option) measures the ratio of two proportions.) Equality of proportions is equivalent to testing whether the difference of proportions (risks) is zero.

As shown in the documentation, PROC FREQ supports many options for comparing proportions. You can use the following suboptions to reproduce the classical equality of proportions test:

  1. EQUAL requests an equality test for the difference in proportion. By default, the Wald interval (METHOD=WALD) is used, but you can choose other intervals.
  2. VAR=NULL specifies how to estimate the variance for the Wald interval.
  3. (optional) CL=WALD outputs the Wald confidence interval for the difference.

Combining these options gives the following direct computation of the difference between two proportions:

proc freq data=Prop order=data;
   weight N;
   tables Group*Response / riskdiff(equal var=null cl=wald); /* Wald test for equality */
run;
Test for difference of proportions in SAS

The 95% (Wald) confidence interval is shown in the first table. The confidence interval is centered on the point estimate of the difference (-0.15). The interval does not contain 0, so the difference is significantly different from 0 at the 0.05 significance level.

The second table shows the result of the Wald equality test. The "ASE (H0)" row gives the estimate for the (asymptotic) standard error, assuming the null hypothesis. The Z score and the two-sided p-value match the values from the DATA step computation, and the interpretation is the same.

Summary

In summary, the SAS Usage Note correctly states that the chi-square test of association is equivalent to the Z test for the equality of proportion. To run the Z test explicitly, this article uses the SAS DATA step to implement the test when you have summary statistics. As promised, the Z statistic is one of the square roots of the chi-square statistic and the p-values are the same. The DATA step removes some of the mystery regarding the equivalence between these two tests.

However, writing DATA step code cannot match the convenience of a procedure. For raw or pre-summarized data, you can use the RISKDIFF option in PROC FREQ to run the same test (recast as a difference of proportions or "risks"). To get exactly the same confidence intervals and statistics as the classical test (which is called the Wald test), you need to add a few suboptions. The resulting output matches the DATA step computations.

The post Test for the equality of two proportions in SAS appeared first on The DO Loop.

28
Jun

The average bootstrap sample omits 36.8% of the data

Suppose you roll six identical six-sided dice. Chances are that you will see at least one repeated number. The probability that you will see six unique numbers is very small: only 6! / 6^6 ≈ 0.015.

This example can be generalized. If you draw a random sample with replacement from a set of n items, duplicate values occur much more frequently than most people think. This is the source of the famous "birthday matching problem" in probability, which shows that the probability of duplicate birthdays among a small group of people is higher than most people would guess. (Among 23 people, the probability of a duplicate birthday is more than 50%.)

This fact has implications for bootstrap resampling. Recall that if a sample has n observations, then a bootstrap sample is obtained by sampling n times with replacement from the data. Since most bootstrap samples contain a duplicate of at least one observation, it is also true that most samples omit at least one observation. That raises the question: On average, how many of the original observations are not present in an average bootstrap sample?

The average bootstrap sample

I'll give you the answer: an average bootstrap sample contains 63.2% of the original observations and omits 36.8%. The book by Chernick and LaBudde (2011, p. 199) states the following result about bootstrap resamples: "If the sample size is large and we [generate many]  bootstrap samples, we will find that on average, approximately 36.8% of the original observations will be missing from the individual bootstrap samples. Another way to look at this is that for any particular observation, approximately 36.8% of the bootstrap samples will not contain it." (Emphasis added.)

You can use elementary probability to derive this result. Suppose the original data contains n observations. A bootstrap sample is generated by sampling with replacement from the data. The probability that a particular observation is not chosen from a set of n observations is 1 - 1/n, so the probability that the observation is not chosen n times is (1 - 1/n)^n. This is the probability that the observation does not appear in a bootstrap sample.

You might remember from calculus that the limit as n → ∞ of (1 - 1/n)^n is 1/e. Therefore, when n is large, the probability that an observation is not chosen is approximately 1/e ≈ 0.368.
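
If you want to see how quickly the finite-sample probability approaches the limit, you can evaluate (1 - 1/n)^n for a few values of n. The following is a quick sketch; the values of n are arbitrary:

proc iml;
n = {6, 20, 75, 1000};
probOmit = (1 - 1/n)##n;     /* probability that a particular observation is omitted */
limit = exp(-1);             /* 1/e, the limit as n goes to infinity */
print n probOmit limit;
quit;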

Simulation: the proportion of observations in bootstrap samples

If you prefer simulation to calculation, you can simulate many bootstrap samples and count the proportion of samples that do not contain observation 1, observation 2, and so forth. The average of those proportions should be close to 1/e.

The theoretical result is only valid for large values of n, but let's start with n=6 so that we can print out intermediate results during the simulation. The following SAS/IML program uses the SAMPLE function to simulate rolling six dice a total of 10 times. Each column of the matrix M contains the result of one trial:

options linesize=128;
proc iml;
call randseed(54321);
n = 6;
NumSamples = 10;
x = T(1:n);                     /* the original sample {1,2,...,n} */
M = sample(x, NumSamples//n);  /* each column is a draw with replacement */

The table shows that each sample (column) contains duplicates. The first column does not contain the values {4,5,6}. The second column does not contain the value 6.

Given these samples, what proportion of samples does not contain 1? What proportion does not contain 2, and so forth? For this small simulation, we can answer these questions by visual inspection. The number 1 does not appear in the sample S5, so it does not appear in 0.1 of the samples. The number 2 does not appear in S3, S5, S8, S9, or S10, so it does not appear in 0.5 of the samples. The following SAS/IML statements count the proportion of columns that do not contain each number and then take the average of those proportions:

/* how many samples do not contain x[1], x[2], etc */
cnt = j(n, 1, .);     /* allocate space for results */
do i = 1 to n;
   Y = (M=x[i]);      /* binary matrix */
   s = Y[+, ];        /* count for each sample (column) */
   cnt[i] = sum(s=0); /* number of samples that do not contain x[i] */
end;
prop = cnt / NumSamples;
avg_prop = mean(prop);
print avg_prop;

The result says that, on average, a sample omits 0.35 of the data values. Let's increase the sample size and the number of bootstrap samples. Change the parameters to the following and rerun the program:

n = 75;
NumSamples = 10000;

The new estimate is 0.365, which is close to the theoretical value of 1/e. If you like, you can plot a histogram of the n proportions that make up the average:

title "Proportion of Samples That Do Not Contain an Observation";
title2 "n=75; NumSamples=10000";
call histogram(prop) label="Proportion"
         other="refline "+char(avg_prop)+"/axis=x;";
Distribution of the proportion of bootstrap samples that do not contain an observation (n=75)

Elementary statistical theory tells us that the proportions are approximately normally distributed with mean p=1/e and standard deviation sqrt(p(1-p)/NumSamples) ≈ 0.00482. The mean and standard deviation of the simulated proportions are very close to the theoretical values.
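You can verify this by comparing the theoretical mean and standard deviation to the sample statistics of the simulated proportions. The following statements continue the PROC IML session and assume that the vector prop and the scalar avg_prop are still defined:

p = exp(-1);                            /* theoretical mean, 1/e */
theorySD = sqrt( p*(1-p)/NumSamples );  /* theoretical standard deviation */
simSD = std(prop);                      /* standard deviation of the simulated proportions */
print p avg_prop theorySD simSD;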

In conclusion, when you draw n items with replacement from a large sample of size n, on average the sample contains 63.2% of the original observations and omits 36.8%. In other words, the average bootstrap sample omits 36.8% of the original data.

The post The average bootstrap sample omits 36.8% of the data appeared first on The DO Loop.

14
Jun

Two ways to compute maximum likelihood estimates in SAS

In a previous article, I showed two ways to define a log-likelihood function in SAS. This article shows two ways to compute maximum likelihood estimates (MLEs) in SAS: the nonlinear optimization subroutines in SAS/IML and the NLMIXED procedure in SAS/STAT. To illustrate these methods, I will use the same data sets from my previous post. One data set contains binomial data, the other contains data that are lognormally distributed.

Maximum likelihood estimates for binomial data from SAS/IML

I previously wrote a step-by-step description of how to compute maximum likelihood estimates in SAS/IML. SAS/IML contains many algorithms for nonlinear optimization, including the NLPNRA subroutine, which implements the Newton-Raphson method.

In my previous article I used the LOGPDF function to define the log-likelihood function for the binomial data. The following statements define bounds for the parameter (0 < p < 1) and provide an initial guess of p0=0.5:

/* Before running program, create Binomial and LN data sets from previous post */
 
/* Example 1: MLE for binomial data */
/* Method 1: Use SAS/IML optimization routines */
proc iml;
/* log-likelihood function for binomial data */
start Binom_LL(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;
 
NTrials = 10;    /* number of trials (fixed) */
use Binomial; read all var "x"; close;
 
/* set constraint matrix, options, and initial guess for optimization */
con = { 0,      /* lower bounds: 0 < p     */
        1};     /* upper bounds:     p < 1 */
opt = {1,       /* find maximum of function   */
       2};      /* print some output      */
p0  = 0.5;      /* initial guess for solution */
call nlpnra(rc, p_MLE, "Binom_LL", p0, opt, con);
print p_MLE;
Maximum likelihood estimate for binomial data

The NLPNRA subroutine computes that the maximum of the log-likelihood function occurs for p=0.56, which agrees with the graph in the previous article. We conclude that the parameter p=0.56 (with NTrials=10) is "most likely" to be the binomial distribution parameter that generated the data.

Maximum likelihood estimates for binomial data from PROC NLMIXED

If you've never used PROC NLMIXED before, you might wonder why I am using that procedure, since this problem is not a mixed modeling regression. However, you can use the NLMIXED procedure for general maximum likelihood estimation. In fact, I sometimes joke that SAS could have named the procedure "PROC MLE" because it is so useful for solving maximum likelihood problems.

PROC NLMIXED has built-in support for computing maximum likelihood estimates of data that follow the Bernoulli (binary), binomial, Poisson, negative binomial, normal, and gamma distributions. (You can also use PROC GENMOD to fit these distributions; I have shown an example of fitting Poisson data.)

The syntax for PROC NLMIXED is very simple for the Binomial data. You use the PARMS statement to supply an initial guess for the parameter p. On the MODEL statement, you declare that you want to model the X variable as Binom(p) where NTrials=10. Be sure to ALWAYS check the documentation for the correct syntax for a binomial distribution. Some functions (like PDF, CDF, and RAND) use the p parameter as the first argument: Binom(p, NTrials). Some procedures (like MCMC and NLMIXED) use the p parameter as the second argument: Binom(NTrials, p).

/* Method 2: Use PROC NLMIXED solve using built-in modeling syntax */
proc nlmixed data=Binomial;
   parms p = 0.5;             * initial value for parameter;
   NTrials = 10;
   model x ~ binomial(NTrials, p);
run;
Maximum likelihood estimates for binomial data by using PROC NLMIXED in SAS

Notice that the output from PROC NLMIXED contains the parameter estimate, standard error, and 95% confidence intervals. The parameter estimate is the same value (0.56) as was found by the NLPNRA routine in SAS/IML. The confidence interval confirms what we previously saw in the graph of the log-likelihood function: the function is somewhat flat near the optimum, so a 95% confidence interval is wide: [0.49, 0.63].

Maximum likelihood estimates for lognormal data

You can use similar syntax to compute MLEs for lognormal data. The SAS/IML syntax is similar to the binomial example, so it is omitted. To view it, download the complete SAS program that computes these maximum likelihood estimates.
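If you prefer not to download the program, the following is a rough sketch of the omitted SAS/IML step. It assumes the LN data set exists and reuses the log-likelihood module from the previous article; details such as the printed output may differ from the downloadable program.

proc iml;
/* log-likelihood for lognormal data (same idea as LogNormal_LL1 in the previous article) */
start LogNormal_LL(param) global(x);
   mu = param[1];  sigma = param[2];
   return( sum( logpdf("Lognormal", x, mu, sigma) ) );
finish;
 
use LN;  read all var "x";  close;
 
con = {.  0,      /* lower bounds: sigma > 0       */
       .  .};     /* upper bounds: none            */
opt = {1,         /* find maximum of function      */
       2};        /* print some output             */
param0 = {1 1};   /* initial guess for (mu, sigma) */
call nlpnra(rc, est, "LogNormal_LL", param0, opt, con);
print est[colname={"mu" "sigma"}];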

PROC NLMIXED does not support the lognormal distribution as a built-in distribution, which means that you need to explicitly write out the log-likelihood function and specify it in the GENERAL function on the MODEL statement. Whereas in SAS/IML you have to use the SUM function to sum the log-likelihood over all observations, the syntax for PROC NLMIXED is simpler. Just as the DATA step has an implicit loop over all observations, the NLMIXED procedure implicitly sums the log-likelihood over all observations. You can use the LOGPDF function, or you can explicitly write the log-density formula for each observation.

If you look up the lognormal distribution in the list of "Standard Definition" in the PROC MCMC documentation, you will see that one parameterization of the lognormal PDF in terms of the log-mean μ and log-standard-deviation σ is
f(x; μ, σ) = 1/(sqrt(2π) σ x) * exp( -(log(x) - μ)**2 / (2σ**2) )
When you take the logarithm of this quantity, you get two terms, or three if you use the rules of logarithms to isolate quantities that do not depend on the parameters:

proc nlmixed data=LN;
parms mu 1 sigma 1;                 * initial values of parameters;
bounds 0 < sigma;                   * bounds on parameters;
sqrt2pi = sqrt(2*constant('pi'));
LL = -log(sigma) 
     - log(sqrt2pi*x)               /* this term is constant w/r/t (mu, sigma) */
     - (log(x)-mu)**2  / (2*sigma**2);
/* Alternative: LL = logpdf("Lognormal", x, mu, sigma); */
model x ~ general(LL);
run;
Maximum likelihood estimates for lognormal data by using PROC NLMIXED in SAS

The parameter estimates are shown, along with standard errors and 95% confidence intervals. The maximum likelihood estimates for the lognormal data are (μ, σ) = (1.97, 0.50). You will get the same answer if you use the LOGPDF function (inside the comment) instead of the "manual calculation." You will also get the same estimates if you omit the term log(sqrt2pi*x) because that term does not depend on the MLE parameters.

In conclusion, you can use nonlinear optimization in the SAS/IML language to compute MLEs. This approach is especially useful when the computation is part of a larger computational program in SAS/IML. Alternatively, the NLMIXED procedure makes it easy to compute MLEs for discrete and continuous distributions. For some simple distributions, the log-likelihood functions are built into PROC NLMIXED. For others, you can specify the log likelihood yourself and find the maximum likelihood estimates by using the GENERAL function.

The post Two ways to compute maximum likelihood estimates in SAS appeared first on The DO Loop.

12
Jun

Two simple ways to construct a log-likelihood function in SAS

Maximum likelihood estimation (MLE) is a powerful statistical technique that uses optimization techniques to fit parametric models. The technique finds the parameters that are "most likely" to have produced the observed data. SAS provides many tools for nonlinear optimization, so often the hardest part of maximum likelihood is writing down the log-likelihood function. This article shows two simple ways to construct the log-likelihood function in SAS. For simplicity, this article describes fitting the binomial and lognormal distributions to univariate data.

Always use the log-likelihood function!

Although the method is known as maximum likelihood estimation, in practice you should optimize the log-likelihood function, which is numerically much better behaved. For an introduction to MLE, including the definitions of the likelihood and log-likelihood functions, see the Penn State Online Statistics Course, which is a wonderful reference.

MLE assumes that the observed data x={x1, x2, ..., xn} are independently drawn from some population. You want to find the most likely parameters θ = (θ1,...,θk) such that the data are fit by the probability density function (PDF), f(x; θ). Since the data are independent, the probability of observing the data is the product Πi f(xi, θ), which is the likelihood function L(θ | x). If you take the logarithm, the product becomes a sum. The log-likelihood function is
LL(θ | x) = Σi log( f(xi, θ) )

This formula is the key. It says that the log-likelihood function is simply the sum of the log-PDF function evaluated at the data values. Always use this formula. Do not ever compute the likelihood function (the product) and then take the log, because the product is prone to numerical errors, including overflow and underflow.
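If you need convincing, compute both quantities for a moderately large simulated sample. The product of a thousand small density values underflows to zero, whereas the sum of the log densities is an ordinary negative number. (This illustration uses simulated standard normal data.)

proc iml;
call randseed(1);
x = j(1000, 1);
call randgen(x, "Normal");          /* 1,000 standard normal values */
L  = prod( pdf("Normal", x) );      /* likelihood: the product underflows to 0 */
LL = sum( logpdf("Normal", x) );    /* log likelihood: an ordinary negative number */
print L LL;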

Two ways to construct the log-likelihood function

There are two simple ways to construct the log-likelihood function in SAS:

1. Use the LOGPDF function to evaluate the log-density at each observation, then sum the values.
2. Apply the LOG function to the formula for the PDF, then sum the log-density terms manually.

Example: The log-likelihood function for the binomial distribution

A coin was tossed 10 times and the number of heads was recorded. This was repeated 20 times to get a sample. A student wants to fit the binomial model X ~ Binom(p, 10) to estimate the probability p of the coin landing on heads. For this problem, the vector of MLE parameters θ is merely the one parameter p.
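If you do not have the original Binomial data set, you can simulate a stand-in sample of 20 counts. (This is simulated data, not the author's data, so the resulting estimates will differ slightly from the values discussed below.)

/* stand-in data set (simulated); NOT the data used in the article */
data Binomial;
   call streaminit(4321);
   do i = 1 to 20;
      x = rand("Binomial", 0.5, 10);   /* number of heads in 10 tosses of a fair coin */
      output;
   end;
   keep x;
run;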

Recall that if you are using SAS/IML to optimize an objective function, the parameter that you are trying to optimize should be the only argument to the function, and all other parameters should be listed in the GLOBAL clause of the START statement. Thus one way to write a SAS/IML function for the binomial log-likelihood function is as follows:

proc iml;
/* Method 1: Use LOGPDF. This method works in DATA step as well */
start Binom_LL1(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;
 
/* visualize log-likelihood function, which is a function of p */
NTrials = 10;    /* number of trials (fixed) */
use Binomial; read all var "x"; close;
 
p = do(0.01, 0.99, 0.01);      /* vector of parameter values */
LL = j(1, ncol(p), .);
do i = 1 to ncol(LL);
   LL[i] = Binom_LL1( p[i] );  /* evaluate LL for a sequence of p */
end;
 
title "Graph of Log-Likelihood Function";
title2 "Binomial Distribution, NTrials=10";
call series(p, LL) grid={x y} xvalues=do(0,1,0.1)
                   label={"Probability of Sucess (p)", "Log Likelihood"};
Graph of log-likelihood function for the binomial distribution

Notice that the data are held fixed; the log likelihood is considered to be a function of the parameter p. Therefore you can graph the function for representative values of p, as shown. The graph clearly shows that the log likelihood is maximal near p=0.56, which is the maximum likelihood estimate. The graph is fairly flat near its optimal value, which indicates that the estimate has a wide standard error. A 95% confidence interval for the parameter is also wide. If the sample contained 100 observations instead of only 20, the log-likelihood function might have a narrower peak.

Notice also that the LOGPDF function made this computation very easy. You do not need to worry about the actual formula for the binomial density. All you have to do is sum the log-density at the data values.

In contrast, the second method requires a little more work, but can handle any distribution for which you can compute the density function. If you look up the formula for the binomial PDF in the MCMC documentation, you see that
PDF(x; p, NTrials) = comb(NTrials, x) * p**x * (1-p)**(NTrials-x)
where the COMB function computes the binomial coefficient "NTrials choose x." There are three terms in the PDF that are multiplied together. Therefore when you apply the LOG function, you get the sum of three terms. You can use the LCOMB function in SAS to evaluate the logarithm of the binomial coefficients in an efficient manner, as follows:

/* Method 2: Manually compute log likelihood by using formula */
start Binom_LL2(p) global(x, NTrials);
   LL = sum(lcomb(NTrials, x)) + log(p)*sum(x) + log(1-p)*sum(NTrials-x);
   return( LL );
finish;
 
LL2 = Binom_LL2(p);      /* vectorized function, so no need to loop */

The second formulation has an advantage in a vector language such as SAS/IML because you can write the function so that it can evaluate a vector of values with one call, as shown. It also has the advantage that you can modify the function to eliminate terms that do not depend on the parameter p. For example, if your only goal is to maximize the log-likelihood function, you can omit the term sum(lcomb(NTrials, x)) because that term is a constant with respect to p. That reduces the computational burden. Of course, if you omit the term, then you are no longer computing the exact binomial log likelihood.
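For example, a version of the module that drops the constant term might look like the following sketch; it differs from the exact log likelihood by an additive constant, so it has the same maximizer:

/* sketch: binomial log likelihood up to an additive constant (LCOMB term dropped) */
start Binom_LL2b(p) global(x, NTrials);
   return( log(p)*sum(x) + log(1-p)*sum(NTrials-x) );
finish;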

Example: The log-likelihood function for the lognormal distribution

In a similar way, you can use the LOGPDF or the formula for the PDF to define the log-likelihood function for the lognormal distribution. For brevity, I will only show the SAS/IML functions, but you can download the complete SAS program that defines the log-likelihood function and computes the graph.

The following SAS/IML modules show two ways to define the log-likelihood function for the lognormal distribution. For the lognormal distribution, the vector of parameters θ = (μ, σ) contains two parameters.

/* Method 1: use LOGPDF */
start LogNormal_LL1(param) global(x);
   mu = param[1];
   sigma = param[2];
   LL = sum( logpdf("Lognormal", x, mu, sigma) );
   return( LL );
finish;
 
/* Method 2: Manually compute log likelihood by using formula
   PDF(x; mu, sigma) = exp( -(log(x)-mu)##2 / (2*sigma##2) ) / (sqrt(2*pi) # sigma # x)
*/
start LogNormal_LL2(param) global(x);
   mu = param[1];
   sigma = param[2];
   twopi = 2*constant('pi');
   LL = -nrow(x)/2*log(twopi*sigma##2) 
        - sum( (log(x)-mu)##2 )/(2*sigma##2)
        - sum(log(x));  /* this term is constant w/r/t (mu, sigma) */
   return( LL );
finish;

The function that uses the LOGPDF function is simple to write. The second method is more complicated because the lognormal PDF is more complicated than the binomial PDF. Nevertheless, the complete log-likelihood function only requires a few SAS/IML statements.

For completeness, the contour plot on this page shows the log-likelihood function for 200 simulated observations from the Lognormal(2, 0.5) distribution. The parameter estimates are (μ, σ) = (1.97, 0.5).

Graph of the log-likelihood function for the lognormal distribution

Summary

This article has shown two simple ways to define a log-likelihood function in SAS. You can sum the values of the LOGPDF function evaluated at the observations, or you can manually apply the LOG function to the formula for the PDF function. The log likelihood is regarded as a function of the parameters of the distribution, even though it also depends on the data. For distributions that have one or two parameters, you can graph the log-likelihood function and visually estimate the value of the parameters that maximize the log likelihood.

Of course, SAS enables you to numerically optimize the log-likelihood function, thereby obtaining the maximum likelihood estimates. My next blog post shows two ways to obtain maximum likelihood estimates in SAS.

The post Two simple ways to construct a log-likelihood function in SAS appeared first on The DO Loop.
