PROC REGRESSION: A simple explanation of options and results

From sasCommunity

Introduction to PROC REG

PROC REG is one of many statistical procedures in SAS that can be used to fit linear regression models. It is the first of several SAS procedures for linear models and is one of the most comprehensive procedures for linear regression (or, more precisely, ordinary least squares regression).

PROC REG is related to several other procedures which deal with regression models.

  • CATMOD
  • GENMOD
  • GLM
  • LOGISTIC
  • MIXED
  • NLIN
  • ORTHOREG
  • PROBIT
  • RSREG
  • TRANSREG
PROC REG supports the following functionality (a minimal example follows these lists):

- multiple MODEL statements in a single PROC REG step
- nine model-selection methods, including stepwise selection and Mallows' Cp
- interactive use, so the model and data can be changed without reinvoking the procedure
- linear equality restrictions on model parameters
- tests of linear and multivariate hypotheses
- collinearity diagnostics
- predicted values, residuals, studentized residuals, confidence limits, and influence statistics
- correlation or crossproduct input
- requested statistics available as output data sets

PROC REG also produces the following plots:

- plots of summary statistics and diagnostic statistics
- normal quantile-quantile (Q-Q) and probability-probability (P-P) plots
- ridge traces and confidence intervals via special shorthand options
- the fitted model equation, summary statistics, and reference lines displayed on the plot
- graphics appearance controlled with PLOT statement options and with global graphics statements
- "painted" (highlighted) line-printer scatter plots
- partial regression leverage line-printer plots
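As a minimal sketch of the first two items (using the SASHELP.CLASS sample data set shipped with SAS; the variable names Weight, Height, and Age come from that data set):

  /* Two MODEL statements in one PROC REG step, with ODS diagnostic plots */
  ods graphics on;
  proc reg data=sashelp.class plots=diagnostics;
     simple: model Weight = Height;        /* one-regressor model  */
     full:   model Weight = Height Age;    /* two-regressor model  */
  run;
  quit;             /* PROC REG is interactive, so end it explicitly */
  ods graphics off;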

Introduction to Linear Regression

(Extracted from SAS) Regression analysis is the analysis of the relationship between a response or outcome variable and another set of variables. The relationship is expressed through a statistical model equation that predicts a response variable (also called a dependent variable or criterion) from a function of regressor variables (also called independent variables, predictors, explanatory variables, factors, or carriers) and parameters. In a linear regression model the predictor function is linear in the parameters (but not necessarily linear in the regressor variables). The parameters are estimated so that a measure of fit is optimized.

[1]

Theory of Linear Regression

(Extracted from wikipedia)

In statistics, linear regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In linear regression, data are modeled using linear functions, and unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications of linear regression fall into one of the following two broad categories:

  • If the goal is prediction, or forecasting, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.
  • Given a variable y and a number of variables X1, ..., Xp that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y, so that once one of them is known, the others are no longer informative.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression. Conversely, the least squares approach can be used to fit models that are not linear models. Thus, while the terms “least squares” and linear model are closely linked, they are not synonymous.
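In symbols (a standard formulation, not drawn from the article itself), the linear model and its ordinary least squares estimator can be written in LaTeX notation as

  y = X\beta + \varepsilon, \qquad \operatorname{E}[\varepsilon] = 0,

  \hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^{\top}X)^{-1}X^{\top}y,

where the closed form requires X to have full column rank. Ridge regression, mentioned above, instead minimizes the penalized criterion \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2.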

[2]

Regression with the REG and GLM Procedures

(Extracted from SAS) In terms of the assumptions about the basic model and the estimation principles, the REG and GLM procedures are very closely related. Both procedures estimate parameters by ordinary or weighted least squares and assume homoscedastic, uncorrelated model errors with zero mean. An assumption of normality of the model errors is not necessary for parameter estimation, but it is implied in confirmatory inference based on the parameter estimates—that is, the computation of tests, p-values, and confidence and prediction intervals.

The GLM procedure supports a CLASS statement for the levelization of classification variables and for their parameterization in statistical models. Classification variables are accommodated in the REG procedure by including the necessary dummy regressor variables yourself.
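As an illustrative sketch (the data set WORK.SALES and the variables Revenue, Adspend, and Region are hypothetical), the same classification effect can be handled either with a CLASS statement in PROC GLM or with dummy variables created in a DATA step for PROC REG:

  /* PROC GLM: the CLASS statement builds the design columns for Region */
  proc glm data=sales;
     class Region;
     model Revenue = Adspend Region;
  run;
  quit;

  /* PROC REG: create the dummy regressors yourself */
  data sales2;
     set sales;
     region_east = (Region = 'East');   /* 1/0 indicator variables;   */
     region_west = (Region = 'West');   /* the remaining level is the */
  run;                                  /* reference category         */

  proc reg data=sales2;
     model Revenue = Adspend region_east region_west;
  run;
  quit;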

Most of the statistics based on predicted and residual values that are available in PROC REG are also available in PROC GLM. However, PROC GLM does not produce collinearity diagnostics, influence diagnostics, or scatter plots. In addition, PROC GLM allows only one model and fits the full model.

Both procedures are interactive, in that they do not stop after processing a RUN statement. The procedures accept statements until a QUIT statement is submitted.

[3]

Missing Value Management

PROC REG constructs only one crossproducts matrix for the variables in all regressions. If any variable needed for any regression is missing, the observation is excluded from all estimates. If you include variables with missing values in the VAR statement, the corresponding observations are excluded from all analyses, even if you never include the variables in a model. PROC REG assumes that you might want to include these variables after the first RUN statement and deletes observations with missing values.

One very common error encountered by users is the "all observations have missing values" error. It usually occurs when two or more variables, each with a large number of missing values, combine so that every observation has at least one missing value.
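A minimal sketch of how the VAR statement can trigger this behavior (the data set MYDATA and the variables y, x1, and x2 are hypothetical):

  /* x2 is assumed to be missing for most observations. Because it is   */
  /* listed in the VAR statement, those observations are excluded from  */
  /* the model y = x1 as well, even though x2 is never used; if every   */
  /* observation then has a missing value, PROC REG stops with the      */
  /* "all observations have missing values" message.                    */
  proc reg data=mydata;
     var x2;              /* reserved for possible use after RUN        */
     model y = x1;
  run;
  quit;

Dropping x2 from the VAR statement (or analyzing it in a separate step) keeps its missing values from affecting the model.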

[4]

Interactive Approach for Regression Modeling

PROC REG enables you to change interactively both the model and the data used to compute the model, and to produce and highlight scatter plots. See the section Using PROC REG Interactively for an overview of interactive analysis that uses PROC REG. The following statements can be used interactively (without reinvoking PROC REG): ADD, DELETE, MODEL, MTEST, OUTPUT, PAINT, PLOT, PRINT, REFIT, RESTRICT, REWEIGHT, and TEST. All interactive features are disabled if there is a BY statement.

The ADD, DELETE, and REWEIGHT statements can be used to modify the current MODEL. Every use of an ADD, DELETE, or REWEIGHT statement causes the model label to be modified by attaching an additional number to it. This number is the cumulative total of the number of ADD, DELETE, or REWEIGHT statements following the current MODEL statement.
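A minimal sketch of this interactive style (the data set MYDATA and the variables y, x1, x2, and x3 are hypothetical):

  proc reg data=mydata;
     model y = x1 x2;      /* model label MODEL1                       */
  run;

     add x3;               /* label becomes MODEL1.1                   */
     print;                /* refit and print y = x1 x2 x3             */
  run;

     delete x1;            /* label becomes MODEL1.2                   */
     print;
  run;
  quit;                    /* ends the interactive session             */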

[5]

Options

Statement Options

Data Set Options   Description

DATA=              names a data set to use for the regression
OUTEST=            outputs a data set that contains parameter estimates and other model fit summary statistics
OUTSSCP=           outputs a data set that contains sums of squares and crossproducts
COVOUT             outputs the covariance matrix for parameter estimates to the OUTEST= data set
EDF                outputs the number of regressors, the error degrees of freedom, and the model R-square to the OUTEST= data set
OUTSEB             outputs standard errors of the parameter estimates to the OUTEST= data set
OUTSTB             outputs standardized parameter estimates to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
OUTVIF             outputs the variance inflation factors to the OUTEST= data set; use only with the RIDGE= or PCOMIT= option
PCOMIT=            performs incomplete principal component analysis and outputs estimates to the OUTEST= data set
PRESS              outputs the PRESS statistic to the OUTEST= data set
RIDGE=             performs ridge regression analysis and outputs estimates to the OUTEST= data set
RSQUARE            same effect as the EDF option
TABLEOUT           outputs standard errors, confidence limits, and associated test statistics of the parameter estimates to the OUTEST= data set
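A brief sketch combining several of these options (SASHELP.CLASS is a sample data set shipped with SAS):

  /* Write parameter estimates, standard errors, confidence limits,    */
  /* error degrees of freedom, and the PRESS statistic to WORK.EST     */
  proc reg data=sashelp.class outest=est edf tableout press;
     model Weight = Height Age;
  run;
  quit;

  proc print data=est;
  run;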

ODS Graphics Options

ODS Graphics Options   Description

PLOTS                  produces ODS graphical displays


Traditional Graphics Options

Graphics Options   Description

ANNOTATE           specifies an annotation data set
GOUT               specifies the graphics catalog in which graphics output is saved

Display Options

Display Options   Description

CORR              displays the correlation matrix for variables listed in the MODEL and VAR statements
SIMPLE            displays simple statistics for each variable listed in the MODEL and VAR statements
USSCP             displays the uncorrected sums-of-squares and crossproducts matrix
ALL               displays all statistics (CORR, SIMPLE, and USSCP)
NOPRINT           suppresses output
LINEPRINTER       creates requested plots as line-printer plots

Other Options

Other Options   Description

ALPHA=          sets the significance level for confidence and prediction intervals and tests
SINGULAR=       sets the criterion for checking for singularity
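For example (again using the SASHELP.CLASS sample data set), the following step requests simple statistics, the correlation matrix, and 90% confidence limits for the parameter estimates:

  proc reg data=sashelp.class simple corr alpha=0.10;
     model Weight = Height Age / clb;   /* CLB prints confidence limits */
  run;                                  /* for the parameter estimates  */
  quit;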

Model Selection Choices

None

This method is the default and provides no model selection capability. The complete model specified in the MODEL statement is used to fit the model. For many regression analyses, this might be the only method you need.

Forward

The forward-selection technique begins with no variables in the model. For each of the independent variables, the FORWARD method calculates F statistics that reflect the variable’s contribution to the model if it is included. The p-values for these F statistics are compared to the SLENTRY= value that is specified in the MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If no F statistic has a significance level greater than the SLENTRY= value, the FORWARD selection stops. Otherwise, the FORWARD method adds the variable that has the largest F statistic to the model. The FORWARD method then calculates F statistics again for the variables still remaining outside the model, and the evaluation process is repeated. Thus, variables are added one by one to the model until no remaining variable produces a significant F statistic. Once a variable is in the model, it stays.
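A minimal sketch (assuming the SASHELP.BASEBALL sample data set that ships with recent SAS releases; the variable names below come from that data set):

  /* Forward selection with a 0.15 entry significance level */
  proc reg data=sashelp.baseball;
     model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor CrHits
           / selection=forward slentry=0.15;
  run;
  quit;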

Backward

The backward elimination technique begins by calculating F statistics for a model, including all of the independent variables. Then the variables are deleted from the model one by one until all the variables remaining in the model produce F statistics significant at the SLSTAY= level specified in the MODEL statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the variable showing the smallest contribution to the model is deleted.

Stepwise

The stepwise method is a modification of the forward-selection technique and differs in that variables already in the model do not necessarily stay there. As in the forward-selection method, variables are added one by one to the model, and the F statistic for a variable to be added must be significant at the SLENTRY= level. After a variable is added, however, the stepwise method looks at all the variables already included in the model and deletes any variable that does not produce an F statistic significant at the SLSTAY= level. Only after this check is made and the necessary deletions are accomplished can another variable be added to the model. The stepwise process ends when none of the variables outside the model has an F statistic significant at the SLENTRY= level and every variable in the model is significant at the SLSTAY= level, or when the variable to be added to the model is the one just deleted from it.
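A corresponding sketch (same hedged assumptions about SASHELP.BASEBALL as above); SELECTION=BACKWARD with SLSTAY= follows the same pattern for backward elimination:

  /* Stepwise selection: SLENTRY= controls entry, SLSTAY= controls removal */
  proc reg data=sashelp.baseball;
     model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor CrHits
           / selection=stepwise slentry=0.15 slstay=0.05;
  run;
  quit;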

Maximum R2 Improvement

The maximum R2 improvement technique does not settle on a single model. Instead, it tries to find the "best" one-variable model, the "best" two-variable model, and so forth, although it is not guaranteed to find the model with the largest R2 for each size.

The MAXR method begins by finding the one-variable model producing the highest R2. Then another variable, the one that yields the greatest increase in R2, is added. Once the two-variable model is obtained, each of the variables in the model is compared to each variable not in the model. For each comparison, the MAXR method determines if removing one variable and replacing it with the other variable increases R2. After comparing all possible switches, the MAXR method makes the switch that produces the largest increase in R2. Comparisons begin again, and the process continues until the MAXR method finds that no switch could increase R2. Thus, the two-variable model achieved is considered the "best" two-variable model the technique can find. Another variable is then added to the model, and the comparing-and-switching process is repeated to find the "best" three-variable model, and so forth.

The difference between the STEPWISE method and the MAXR method is that all switches are evaluated before any switch is made in the MAXR method. In the STEPWISE method, the "worst" variable might be removed without considering what adding the "best" remaining variable might accomplish. The MAXR method might require much more computer time than the STEPWISE method.

Minimum R2 Improvement

The MINR method closely resembles the MAXR method, but the switch chosen is the one that produces the smallest increase in R2. For a given number of variables in the model, the MAXR and MINR methods usually produce the same "best" model, but the MINR method considers more models of each size.
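A sketch of both methods in one step (same hedged assumptions about SASHELP.BASEBALL as above):

  /* MAXR evaluates all swaps and makes the one giving the largest R2   */
  /* increase; MINR makes the swap giving the smallest increase instead */
  proc reg data=sashelp.baseball;
     maxr: model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor CrHits
                 / selection=maxr;
     minr: model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor CrHits
                 / selection=minr;
  run;
  quit;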

R2 Selection

The RSQUARE method finds subsets of independent variables that best predict a dependent variable by linear regression in the given sample. You can specify the largest and smallest number of independent variables to appear in a subset and the number of subsets of each size to be selected. The RSQUARE method can efficiently perform all possible subset regressions and display the models in decreasing order of R2 magnitude within each subset size. Other statistics are available for comparing subsets of different sizes. These statistics, as well as estimated regression coefficients, can be displayed or output to a SAS data set.

The subset models selected by the RSQUARE method are optimal in terms of R2 for the given sample, but they are not necessarily optimal for the population from which the sample is drawn or for any other sample for which you might want to make predictions. If a subset model is selected on the basis of a large R2 value or any other criterion commonly used for model selection, then all regression statistics computed for that model under the assumption that the model is given a priori, including all statistics computed by PROC REG, are biased.

While the RSQUARE method is a useful tool for exploratory model building, no statistical method can be relied on to identify the "true" model. Effective model building requires substantive theory to suggest relevant predictors and plausible functional forms for the model.

The RSQUARE method differs from the other selection methods in that RSQUARE always identifies the model with the largest R2 for each number of variables considered. The other selection methods are not guaranteed to find the model with the largest R2. The RSQUARE method requires much more computer time than the other selection methods, so a different selection method such as the STEPWISE method is a good choice when there are many independent variables to consider.
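A sketch of an all-subsets search (same hedged assumptions about SASHELP.BASEBALL as above); SELECTION=ADJRSQ and SELECTION=CP, described next, are requested the same way:

  /* Report the two best models of each size, with adjusted R-square and */
  /* Mallows' Cp shown alongside R-square for comparison                 */
  proc reg data=sashelp.baseball;
     model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor CrHits
           / selection=rsquare best=2 adjrsq cp;
  run;
  quit;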

Adjusted R2 Selection

This method is similar to the RSQUARE method, except that the adjusted R2 statistic is used as the criterion for selecting models, and the method finds the models with the highest adjusted R2 within the range of sizes.

Mallows' Cp Selection

This method is similar to the ADJRSQ method, except that Mallows’ Cp statistic is used as the criterion for model selection. Models are listed in ascending order of Cp.

Notes

Two earlier procedures, PROC RSQUARE and PROC STEPWISE, have been merged into PROC REG as the corresponding model-selection methods.

[6]


Criteria Used in Model-Selection Methods

When many significance tests are performed, each at a level of, for example, 5%, the overall probability of rejecting at least one true null hypothesis is much larger than 5%. If you want to guard against including any variables that do not contribute to the predictive power of the model in the population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE methods.

In most applications, many of the variables considered have some predictive power, however small. If you want to choose the model that provides the best prediction computed using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated with the given sample size, so you should use a moderate significance level, perhaps in the range of 10% to 25%.

[7]

Problems and limitations of Model Selection

The use of model-selection methods can be time-consuming in some cases because there is no built-in limit on the number of independent variables, and the calculations for a large number of independent variables can be lengthy. The SAS documentation gives a recommended limit on the number of independent variables for the MINR method that depends on the value of the INCLUDE= option.

For the RSQUARE, ADJRSQ, or CP method, with a large value of the BEST= option, adding one more variable to the list from which regressors are selected might significantly increase the CPU time. Also, the time required for the analysis is highly dependent on the data and on the values of the BEST=, START=, and STOP= options.

There are many problems with the use of the STEPWISE, FORWARD, and BACKWARD selection techniques. Many of them stem from the theoretical assumptions underlying the selection tests, which are severely violated when the tests are applied repeatedly during selection. Given such shortcomings, more advanced techniques such as the LASSO and elastic nets are often preferred (see the sketch below).
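As a sketch of one such alternative (this uses PROC GLMSELECT, a separate SAS/STAT procedure, not PROC REG, with the same hedged assumptions about SASHELP.BASEBALL as above):

  /* LASSO selection, choosing the step of the LASSO path by cross validation */
  proc glmselect data=sashelp.baseball;
     model logSalary = nAtBat nHits nHome nRuns nRBI nBB YrMajor CrHits
           / selection=lasso(choose=cv stop=none);
  run;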

[8] [9]

Testing for Heteroscedasticity

Heteroscedasticity is one of the major violations of the assumptions of linear regression models. In PROC REG it can be tested easily: the SPEC option on the MODEL statement performs White's test of the hypothesis that the errors are homoscedastic (and the model correctly specified), and the ACOV option prints a heteroscedasticity-consistent covariance matrix for the estimates.

[10]
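A minimal sketch (SASHELP.CLASS again, purely for illustration):

  /* White's test for heteroscedasticity (SPEC) and a       */
  /* heteroscedasticity-consistent covariance matrix (ACOV) */
  proc reg data=sashelp.class;
     model Weight = Height Age / spec acov;
  run;
  quit;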


--Murphy Choy From Singapore 16:37, 13 April 2011 (UTC)