23 Mar: SAS introduces the blended classroom

We all have different learning styles. Some learn best by seeing and doing; others by listening to lectures in a traditional classroom; still others simply by diving in and asking questions along the way. Traditional face-to-face classroom instruction, real-time classes over the Internet, or self-paced instruction with exercises, SAS Education [...]

The post SAS introduces the blended classroom appeared first on SAS Learning Post.

24 Aug: Flexibility of SAS Enterprise Miner

Do you use an array of tools to perform predictive analytics on your data? Is your current tool not flexible enough to accommodate some of your requirements? SAS Enterprise Miner may be your solution. With a growing number of data mining applications, having a tool which can do a variety of analyses […]

The post Flexibility of SAS Enterprise Miner appeared first on The SAS Training Post.

17 Aug: R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$y_i=f_i(\boldsymbol{x}|\boldsymbol{\beta})+\varepsilon_i,$$ where $y_i$ is the response or dependent variable for the $i$th case, $i=1,\cdots,N$, and the predictor or independent variable is the $\boldsymbol{x}$ term defined in the mean function $f_i(\boldsymbol{x}|\boldsymbol{\beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$y_i=\beta_0+\beta_1x_i+\varepsilon_i.$$ To obtain the (best) estimates of $\beta_0$ and $\beta_1$, we minimize the residual sum of squares (RSS), given by $$S=\sum_{i=1}^{N}\varepsilon_i^2=\sum_{i=1}^{N}(y_i-\beta_0-\beta_1x_i)^2.$$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.

The following is the plot of the residual sum of squares of the data, based on the SLR model, over $\beta_0$ and $\beta_1$; note that we standardized the variables before plotting.
If you are interested in the code for the above figure, please click here. To minimize this elliptic paraboloid, differentiate with respect to the parameters, set the derivatives to zero to obtain the stationary point, and finally solve for $\beta_0$ and $\beta_1$. For more on the derivation of the parameter estimates, see reference 1.
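As a quick check on this derivation, here is a minimal pure-Python sketch (my own, not code from the post) of the resulting closed-form estimates, $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$, applied to the same women data:

```python
# Closed-form OLS estimates for the SLR model:
#   b1 = S_xy / S_xx,  b0 = ybar - b1 * xbar
# Data: R's built-in "women" data set (heights/weights), typed in by hand here.
height = list(range(58, 73))                       # 58, 59, ..., 72 inches
weight = [115, 117, 120, 123, 126, 129, 132, 135,
          139, 142, 146, 150, 154, 159, 164]       # pounds

n = len(height)
xbar = sum(height) / n
ybar = sum(weight) / n
s_xx = sum((x - xbar) ** 2 for x in height)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(height, weight))

b1 = s_xy / s_xx           # slope
b0 = ybar - b1 * xbar      # intercept
print(round(b0, 5), round(b1, 5))   # -87.51667 3.45
```

These match the coefficients reported by all three tools below.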

Simple Linear Regression in R

In R, we can fit the model using the lm function, which stands for linear models, i.e.

The formula, defined above as {response ~ predictor}, is a handy method for fitting a model to data in R. Mathematically, our model is $$weight = \beta_0 + \beta_1 (height) + \varepsilon.$$ The summary is obtained by running model %>% summary or, for non-magrittr users, summary(model), given the model object defined in the previous code,

The Coefficients section above returns the estimated coefficients of the model: $\beta_0 = -87.51667$ and $\beta_1=3.45000$ (it should be clear that we used the unstandardized variables to obtain these estimates). Both estimates are significant based on the p-value at the .05, and even the .01, level of the test. Using the estimated coefficients along with the residual standard error, we can now construct the fitted line and its confidence interval, as shown below.
 Fig 1. Plot of the Data and the Predicted Values in R.

Simple Linear Regression in Python

In Python, two modules provide implementations of linear regression modelling: scikit-learn (sklearn) and Statsmodels (statsmodels). For example, we can model the above data using sklearn as follows:

The above output is the estimate of the parameters. To obtain the predicted values and plot them along with the data points, as we did in R, we can wrap the functions above into a class, say linear_regression, which requires the Seaborn package for neat plotting; see the code below:
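A minimal sketch of what such a wrapper class might look like (the class and method names here are mine; closed-form OLS stands in for scikit-learn, and the Seaborn plotting is omitted):

```python
# A minimal stand-in for the post's "linear_regression" wrapper class:
# fit y = b0 + b1 * x by ordinary least squares, then predict.
class LinearRegressionSketch:
    """Fit y = b0 + b1 * x by ordinary least squares."""

    def fit(self, x, y):
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        s_xx = sum((xi - xbar) ** 2 for xi in x)
        s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        self.b1 = s_xy / s_xx
        self.b0 = ybar - self.b1 * xbar
        return self

    def predict(self, x):
        return [self.b0 + self.b1 * xi for xi in x]

# The women data again: heights 58-72 inches, weights in pounds.
model = LinearRegressionSketch().fit(
    list(range(58, 73)),
    [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164])
print(round(model.predict([65])[0], 2))   # 136.73 (mean height predicts mean weight)
```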

Using this class and its methods, fitting the model to the data is coded as follows:

The predicted values of the data points are obtained using the predict method,

And Figure 2 below shows the plot of the predicted values along with its confidence interval and data points.
 Fig 2. Plot of the Data and the Predicted Values in Python.
If one is interested only in the estimates of the model, then LinearRegression from scikit-learn is sufficient; if other statistics, such as those returned in the R model summary, are needed, the same module can still do the job, but you might need to program the extra routines yourself. statsmodels, on the other hand, returns a complete summary of the fitted model, comparable to the R output above, which is useful for studies with particular interest in these statistics. Modelling the data with simple linear regression is then done as follows:

Clearly, statsmodels spares us time, especially in diagnostic checking involving test statistics such as the Durbin-Watson and Jarque-Bera tests. We could of course add some diagnostic plots as well, but I prefer to discuss that in a separate entry.

Simple Linear Regression in SAS

Now let's consider running the data in SAS. I am using SAS Studio, and in order to import the data, I first saved it as a CSV file with columns height and weight, then uploaded it to SAS Studio. The following code imports the data.

Next we fit the model to the data using the REG procedure,

Number of Observations Read    15
Number of Observations Used    15

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       3332.70000    3332.70000   1433.02   <.0001
Error             13         30.23333       2.32564
Corrected Total   14       3362.93333

Root MSE          1.52501    R-Square   0.9910
Dependent Mean  136.733      Adj R-Sq   0.9903
Coeff Var         1.11531

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            -87.51667          5.93694    -14.74     <.0001
height       1              3.45000          0.09114     37.86     <.0001
Now that's a lot of output, probably the complete one. But as I said, I am not going to discuss each of these values and plots, since some of them are used for diagnostic checking (you can read more on that in reference 1 and in other applied linear regression books). For now, let's just confirm the coefficients obtained: both estimates are the same as those in R and Python.

Multiple Linear Regression (MLR)

To extend SLR to MLR, we'll demonstrate by simulation. Using the formula-based lm function of R, and assuming we have $x_1$ and $x_2$ as our predictors, the following is how we do MLR in R:

Although we did not use an intercept in simulating the data, the obtained estimates for $\beta_1$ and $\beta_2$ are close to the true parameters (.35 and .56). The intercept, however, helps us capture the noise term we added in the simulation.

Next we'll try MLR in Python using statsmodels; consider the following:
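As a library-free illustration of the same simulation (a sketch of the idea rather than the statsmodels call itself), one can simulate the data and solve the normal equations directly:

```python
# Pure-Python sketch of the MLR simulation: y = 0.35*x1 + 0.56*x2 + noise
# (no intercept in the simulation, as in the post), then fit
# y ~ b0 + b1*x1 + b2*x2 by solving the normal equations X'X b = X'y.
import random

random.seed(1)
n = 500
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y  = [0.35 * a + 0.56 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]          # design matrix with intercept
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    A = [row[:] + [bi] for row, bi in zip(A, b)]
    m = len(A)
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (A[r][m] - sum(A[r][c] * x[c] for c in range(r + 1, m))) / A[r][r]
    return x

b0, b1, b2 = solve(XtX, Xty)
print(round(b1, 2), round(b2, 2))   # close to the true 0.35 and 0.56
```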

It should be noted that the estimates in R and in Python need not be the same, since these are simulated values generated by different software. Finally, we can perform MLR in SAS as follows:

Number of Observations Read    100
Number of Observations Used    100

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        610.86535     305.43268    303.88   <.0001
Error             97         97.49521       1.00511
Corrected Total   99        708.36056

Root MSE         1.00255    R-Square   0.8624
Dependent Mean 244.073      Adj R-Sq   0.8595
Coeff Var        0.41076

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             18.01299         11.10116      1.62     0.1079
X1           1              0.31770          0.01818     17.47     <.0001
X2           1              0.58276          0.03358     17.35     <.0001

Conclusion

In conclusion, SAS saves a lot of work, since it returns a complete summary of the model; no doubt this is why companies prefer it, besides its active customer support. R and Python, on the other hand, despite being open-source, can compete well with SAS, although reproducing all of the SAS output requires programming skill. I think that's the exciting part, though: it makes you think and manage your time, and achieving the same results in R and Python is fulfilling. I hope you've learned something; feel free to share your thoughts in the comments below.

Reference

1. Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley & Sons, Inc. United States of America.
2. Scikit-learn Documentation
3. Statsmodels Documentation
4. SAS Documentation
5. Delwiche, Lora D., and Susan J. Slaughter. 2012. The Little SAS® Book: A Primer, Fifth Edition. Cary, NC: SAS Institute Inc.
6. Regression with SAS. Institute for Digital Research and Education. UCLA. Retrieved August 13, 2015.
7. Python Plotly Documentation

18 Dec: %SVD macro with BY-Processing

For the Regularized Discriminant Analysis cross validation, we need to compute the SVD for each pair $$(\lambda, \gamma)$$, and the factorization result will be fed to the downdating algorithm to obtain the leave-one-out variance-covariance matrix $$\hat{\Sigma}_{k\backslash v}(\lambda, \gamma)$$. The code is a direct modification of my original SVD macro @here. It takes more than simply putting a BY statement in PROC PRINCOMP, but it is still easy to do. I used the within-class CSSCP from the Iris data shipped with SAS and cross-verified with R. The %CallR macro is used here.

The screenshot below shows the SAS macro generates the same result as from R. Here is the test code.
dm log 'clear';

proc discrim data=sashelp.iris noclassify noprint
             outstat=iris_stat(WHERE=(_TYPE_ = "CSSCP" & Species ^= " "));
   class Species;
   var   SepalLength SepalWidth PetalLength PetalWidth;
run;

%svd_bystmt(sigma, output_v, output_s, output_u, &covars, Species, , nfac=0);

data _null_;
   file "d:\data\sigma.csv" dlm=",";
   set iris_stat;
   if _n_=1 then do;
      put 'Species' ',' '_NAME_' ',' 'SepalLength' ',' 'SepalWidth' ','
          'PetalLength' ',' 'PetalWidth';
   end;
   put Species _NAME_ SepalLength SepalWidth PetalLength PetalWidth;
run;

%RScript(d:\data\rscript.r)
cards4;
dat=read.csv("d:\\data\\sigma.csv", header=1, as.is=1)
head(dat)
sigma=dat[, c(-1, -2)]
sigmasvd=by(sigma, dat$Species, svd)
sigmasvd$Setosa$d
sigmasvd$Virginica$d
sigmasvd$Setosa$v
;;;;
run;
%CallR(d:/data/rscript.r, d:/data/rlog1.txt);

data _null_;
   infile "d:\data\rlog1.txt";
   input;
   put _infile_;
run;

data _null_;
   do until (eof1);
      set output_s end=eof1;
      put @1 Species= @24 Number= @40 Singuval= 8.2;
   end;
   put " ";
   do until (eof2);
      set output_v end=eof2;
      where Species="Setosa";
      put Prin1 8.5 Prin2 8.5 Prin3 8.5 Prin4 8.5;
   end;
   stop;
run;

%macro SVD_bystmt(input_dsn, output_V, output_S, output_U, input_vars, by_var, ID_var, nfac=0);
%local blank para EV USCORE n pos dsid nobs nstmt shownote showsource nbylevel bystmt;
%let shownote=%sysfunc(getoption(NOTES));
%let showsource=%sysfunc(getoption(SOURCE));
options nonotes nosource;
%let blank=%str( );
%let EV=EIGENVAL;
%let USCORE=USCORE;
%if &by_var eq &blank %then %do;
    %let bystmt = &blank;
    %let nbylevel = 1;
%end;
%else %do;
    %let bystmt = by &by_var;
%end;
%let n=%sysfunc(countW(&input_vars));
%let dsid=%sysfunc(open(&input_dsn));
%let nobs=%sysfunc(attrn(&dsid, NOBS));
%let dsid=%sysfunc(close(&dsid));
%if &nfac eq 0 %then %do;
    %let nstmt=&blank;
    %let nfac=&n;
%end;
%else %do;
    %let x=%sysfunc(notdigit(&nfac, 1));
    %if &x eq 0 %then %do;
        %let nfac=%sysfunc(min(&nfac, &n));
        %let nstmt=%str(n=&nfac);
    %end;
    %else %do;
        %put ERROR: Only accept non-negative integer.;
        %goto exit;
    %end;
%end;

/* calculate U=XV/S */
%if &output_U ne %str() %then %do;
    %let outstmt= out=&output_U.(keep=&by_var &ID_var Prin:);
%end;
%else %do;
    %let outstmt=&blank;
%end;

/* build index for input data set */
proc datasets library=work nolist;
    modify &input_dsn;
    index create &by_var;
run;quit;

%let options=noint cov noprint &nstmt;
proc princomp data=&input_dsn /* out=&input_dsn._score */ &outstmt
              outstat=&input_dsn._stat(where=(_type_ in ("&USCORE", "&EV")))
              &options;
    &bystmt;
    var &input_vars;
run;

/* need to check if the by_var is Char or Numeric, then set the format accordingly */
data _null_;
    set &input_dsn;
    type=vtype(&by_var);
    if type="C" then call symput("bylevelfmtvar", "$bylevelfmt");
    else call symput("bylevelfmtvar", "bylevelfmt");
run;

data __bylevelfmt;
    set &input_dsn._stat end=eof;
    &bystmt;
    retain _nbylevel 0;
    retain fmtname "&bylevelfmtvar";
    if first.&by_var then do;
       _nbylevel+1;
       start=&by_var;
       label=_nbylevel;
       output;
       keep label start fmtname;
    end;
    if eof then call symput("_nbylevel", _nbylevel);
run;
proc format cntlin=__bylevelfmt;run;
%put &_nbylevel;

%if &output_V ne &blank %then %do;
    proc transpose data=&input_dsn._stat(where=(_TYPE_="&USCORE"))
                   out=&output_V.(rename=(_NAME_=variable))
                   name=_NAME_;
        &bystmt;
        var &input_vars;
        id _NAME_;
        format &input_vars 8.6;
    run;
%end;

/* recompute Proportion */
%if &output_S ne %str() %then %do;
    data &output_S;
        set &input_dsn._stat;
        &bystmt;
        where _TYPE_="EIGENVAL";
        array _s{*} &input_vars;
        array _x{&nfac, 3} _temporary_;
        Total=sum(of &input_vars, 0);
        _t=0;
        do _i=1 to &nfac;
           _x[_i, 1]=_s[_i];
           _x[_i, 2]=_s[_i]/Total;
           if _i=1 then _x[_i, 3]=_x[_i, 2];
           else _x[_i, 3]=_x[_i-1, 3]+_x[_i, 2];
           _t+sqrt(_x[_i, 2]);
        end;
        do _i=1 to &nfac;
           Number=_i;
           EigenValue=_x[_i, 1];
           Proportion=_x[_i, 2];
           Cumulative=_x[_i, 3];
           S=sqrt(_x[_i, 2])/_t;
           SinguVal=sqrt(_x[_i, 1] * &nobs/&_nbylevel);
           keep &by_var Number EigenValue Proportion Cumulative S SinguVal;
           output;
        end;
    run;
%end;

%if &output_U ne %str() %then %do;
    data &output_U;
        array _S{&_nbylevel, &nfac} _temporary_;
        if _n_=1 then do;
           do until (eof);
              set &output_S(keep=&by_var Number SinguVal) end=eof;
              where Number<=&nfac;
              %if &by_var eq &blank %then %do;
                  _row=1;
              %end;
              %else %do;
                  _row = input(put(&by_var, &bylevelfmtvar..), best.);
              %end;
              _S[_row, number]=SinguVal;
              if abs(_S[_row, number]) < CONSTANT('MACEPS') then _S[_row, j]=CONSTANT('BIG');
           end;
        end;
        set &output_U;
        array _A{*} Prin1-Prin&nfac;
        %if &by_var eq &blank %then %do;
            _row=1;
        %end;
        %else %do;
            _row = input(put(&by_var, &bylevelfmtvar..), best.);
        %end;
        do _j=1 to dim(_A);
           _A[_j]=_A[_j]/_S[_row, _j];
        end;
        keep &by_var &ID_var Prin1-Prin&nfac;
    run;
%end;

proc datasets library=work nolist;
    modify &input_dsn;
    index delete &by_var;
run;quit;

%exit:
options &shownote &showsource;
%mend;
15 Dec: Experimental downdating algorithm for Leave-One-Out CV in RDA

In this post, I want to demonstrate a piece of experimental code for the downdating algorithm used in Leave-One-Out (LOO) Cross Validation for Regularized Discriminant Analysis [1].

In LOO CV, the program needs to calculate the inverse of $$\hat{\Sigma}_{k\backslash v}(\lambda, \gamma)$$ for each observation $$v$$ to obtain its new scoring function at a given pair of regularization parameters $$(\lambda, \gamma)$$, where $$k\backslash v$$ indicates class $$k$$ excluding observation $$v$$.

As mentioned in [1], to obtain the inverse of this variance-covariance matrix for class $$k$$, it is not sufficient to simply remove observation $$v$$ and recalculate; instead, we calculate $$W_{k\backslash v}(\lambda, \gamma)\hat{\Sigma}^{-1}_{k\backslash v}(\lambda, \gamma)$$, which is equivalent to computing the inverse of:

$$W_k(\lambda)\hat{\Sigma}^{-1}_k(\lambda) - \gamma \|Z_v\|^2 \cdot I - (1-\gamma)Z_v Z_v^{t}$$

In this formula, the first element $$W_k(\lambda)\hat{\Sigma}^{-1}_k(\lambda)$$ does not involve quantities varying along with observation $$v$$ and can be pre-computed. The latter two can be easily updated on the fly as we scan through each individual observation.
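The on-the-fly update of an inverse after subtracting a rank-one term is closely related to the classic Sherman-Morrison identity; a toy pure-Python check (my own example, not code from the paper) illustrates why pre-computing the inverse suffices:

```python
# Rank-one downdating via the Sherman-Morrison identity:
#   (A - u u')^{-1} = A^{-1} + A^{-1} u u' A^{-1} / (1 - u' A^{-1} u)
# Toy 2x2 example in pure Python.
def inv2(a, b, c, d):
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[4.0, 1.0], [1.0, 3.0]]
u = [1.0, 2.0]

Ainv = inv2(A[0][0], A[0][1], A[1][0], A[1][1])
Au = [Ainv[0][0] * u[0] + Ainv[0][1] * u[1],
      Ainv[1][0] * u[0] + Ainv[1][1] * u[1]]          # A^{-1} u
denom = 1.0 - (u[0] * Au[0] + u[1] * Au[1])           # 1 - u' A^{-1} u
down = [[Ainv[i][j] + Au[i] * Au[j] / denom for j in range(2)] for i in range(2)]

# Check against inverting (A - u u') directly.
B = [[A[i][j] - u[i] * u[j] for j in range(2)] for i in range(2)]
Binv = inv2(B[0][0], B[0][1], B[1][0], B[1][1])
print(all(abs(down[i][j] - Binv[i][j]) < 1e-9 for i in range(2) for j in range(2)))  # True
```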

With this in mind, we use the iris data for demonstration. Note that the SAS code here is for illustrating the concept of the downdating algorithm above for a given pair $$(\lambda, \gamma)$$. The computing is divided into several steps. (1) We first use PROC DISCRIM to calculate the CSSCP matrix, total weight, and variable means by class. These quantities correspond to $$\Sigma_k$$, $$\bar{X_k}$$ and $$W_k$$ in the original paper. (2) We then use the %SVD macro @here to obtain key elements for later updating.

Note that in order to obtain the CSSCP, means, and total weights, we can also use PROC CORR instead of PROC DISCRIM. The advantage is that all computing stays within the SAS/Base framework, which benefits budget-sensitive businesses and maximizes compatibility. The downside is that the data set needs to be sorted or indexed to obtain the CSSCP by class, and the data has to be passed twice: once for the pooled CSSCP and once for the class-specific CSSCP.

(to be completed...)
/* study downdating algorithm */
%let lambda = 0.3;
%let gamma = 0.6;
%let target = Species;
%let covars = SepalLength SepalWidth PetalLength PetalWidth;

/* step 0. Obtain key statistics from DISCRIM */
proc discrim data=iris outstat=stat noprint noclassify;
     class &target;
     var   &covars;
run;

/* step 1. Obtain \Sigma_k(\lambda) & \Sigma_k(\lambda, \gamma)
           \S_k = (_TYPE_="CSSCP" & &target = "something")
           \S   = (_TYPE_="CSSCP" & &target = " ")
           \W_k = (_TYPE_="N" & &target = "something") - 1
           \W   = (_TYPE_="N" & &target = " ") - &nClass + 1
           \Sigma_k(\lambda) = \S_k(\lambda) / \W_k(\lambda)
              where:
                    \S_k(\lambda) = (1-\lambda)*S_k + \lambda*S
                    \W_k(\lambda) = (1-\lambda)*W_k + \lambda*W
           \Sigma_k(\lambda, \gamma) = (1-\lambda)*\Sigma_k(\lambda) + \gamma* trace[\Sigma_k(\lambda)]/p*I
*/
proc sql noprint;
     select distinct &target into :targetvalues separated by " "
     from   stat
     where  _TYPE_="CSSCP";
     select distinct _NAME_ into :covars separated by " "
     from   stat
     where  _TYPE_="CSSCP";
quit;
%put &targetvalues;
%put &covars;
%let nClass=%sysfunc(countw("&targetvalues"));
%put &nClass;
%let nVars=%sysfunc(countw("&covars"));
%put &nVars;

data weights;
     set stat;
     where _TYPE_="N";
     array _nv{*} _numeric_;
     if &target="" then _total=_nv[1]-&nClass+1;
     retain lambda &lambda;
     retain _total;
     weight=(1-&lambda) * (_nv[1]-1) + &lambda*_total;
     keep &target weight lambda;
     if &target ^= "";
run;

proc sort data=stat out=sigma;
     by _TYPE_ &target;
     WHERE _TYPE_ = "CSSCP" & &target ^= " ";
run;
proc sort data=weights;
     by &target;
run;

/*- step 1.1 Update \Sigma_k(\lambda) = ((1-\lambda)*S_k + \lambda*S) / ( (1-\lambda)*W_k + \lambda*W ) -*/
data sigma;
     array _sigma{&nVars, &nVars} _temporary_;
     array _x{&nVars} &covars;
     array _y{&nClass} _temporary_;
     if _n_=1 then do;
        _iter=1;
        do until (eof1);
           set stat(where=(_TYPE_="CSSCP" & &target = " ")) end=eof1;
           do _j=1 to &nVars;
              _sigma[_iter, _j]=_x[_j];
           end;
           _iter+1;
        end;
        _iter=1;
        do until (eof2);
           set weights end=eof2;
           _y[_iter]=weight;
           _iter+1;
        end;
        put _y[3]=;
        _target=&target;
        _iter=0;
     end;
     modify sigma;
     retain weight;
     retain _target;
     _TYPE_="COV";
     if &target^=_target then do;
        _iter+1;
        _i=0;
        weight=_y[_iter];
        _target=&target;
     end;
     _i+1;
     put "&covars";
     put &covars;
     do _j=1 to &nVars;
        _x[_j]= ((1-&lambda)*_x[_j] + &lambda*_sigma[_i, _j]);
     end;
     put &covars;
     put "-----------------------------------------------------------------";
     replace;
run;

/*- trace is accumulated from the diagonal elements of the matrix -*/
data _trace;
     set sigma;
     by &target;
     array _x{*} &covars;
     retain trace _iter;
     if first.&target then do;
        _iter = 1;
        trace=0;
     end;
     trace+_x[_iter];
     _iter+1;
     if last.&target then output;
     keep &target trace;
run;

/*- step 1.2 Update \Sigma_k(\lambda, \gamma) = (1-\gamma)*\Sigma_k(\lambda) + \gamma*c*I -*/
/*- where c = trace(\Sigma_k(\lambda)) / p, where p=&nVars                               -*/
data sigma;
     if _n_=1 then do;
        declare hash h(dataset:'_trace');
        h.defineKey("&target");
        h.defineData("trace");
        h.defineDone();
        call missing(trace);
     end;
     set sigma;
     by &target;
     array _x{*} &covars;
     retain _adj trace _iter;
     if first.&target then do;
        _iter=1;
     end;
     if _iter=1 then do;
        _rc=h.find();
        _adj= &gamma/&nVars * trace;
     end;
     put _iter= &target=;
     do _j=1 to &nVars;
        if _j=_iter then do;
           _x[_iter] = (1-&gamma)*_x[_iter] + _adj;
        end;
        else do;
           _x[_iter] = (1-&gamma)*_x[_iter];
        end;
     end;
     _iter=_iter+1;
     drop _type_ _adj _iter _j _rc trace;
run;

/*- step 1.3 Obtain the inverse of W_k \Sigma_k(\lambda, \gamma) - c*I using SVD.
    This inverse is the same as first inverting \Sigma_k(\lambda, \gamma), then with the
    S matrix deflated by W_k(\lambda) and differenced by c, where c=\gamma*|Z_v|^2/p,
    Z_v = sqrt(bk(v))*(X_v - mean(X_k\v)), bk(v)=sk(v)*Wc(v)*wv/(Wc(v)-wv),
    and v is the current observation.
    1. apply SVD to obtain eigenvalues: %SVD(sigma, out_u, out_S, out_V)
    2. The whole CV will be done in one DATA step, because all related quantities,
       such as |Z_v|, sk(v), Wc(v), will require at least one pass through the
       original data; therefore it is best to incorporate all computing in this step. -*/
data _iris;
     array _m{*} &covars;
     array _mt{&nVars} _temporary_;
     if _n_=1 then do;
        _class=1;
        do until (eof);
           set stat(where=(_TYPE_='MEAN' & &target^="")) end=eof;
           do _j=1 to dim(_m);
              _mt[_class, _j]=_m[_j];
           end;
           _class+1;
        end;
     end;
     array _rank1{&nVars, &nVars} _temporary_;
     set iris point=1;
     do _j=1 to dim(_m);
        _m[_j]=_m[_j]-_mt[1, _j];
     end;
     do _j=1 to &nVars;
        do _k=1 to &nVars;
           _rank[_j, _k]=_m[_j] * _[_k];
           /* MORE CODE TO COME */
        end;
     end;
run;
[1] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.
21 May: Use recursion and gradient ascent to solve logistic regression in Python

In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to realize a recursive alternative. Compared with the iterative method, the recursion costs more space but may bring a performance improvement.
# -*- coding: utf-8 -*-
"""
Use recursion and gradient ascent to solve logistic regression in Python
"""
import pandas as pd
from numpy import *
from ggplot import *


def sigmoid(inX):
    return 1.0/(1 + exp(-inX))


def grad_ascent(dataMatrix, labelMat, cycle):
    """
    A function to use gradient ascent to calculate the coefficients
    """
    if isinstance(cycle, int) == False or cycle < 0:
        raise ValueError("Must be a valid value for the number of iterations")
    m, n = shape(dataMatrix)
    alpha = 0.001
    if cycle == 0:
        return ones((n, 1))
    else:
        weights = grad_ascent(dataMatrix, labelMat, cycle-1)
        h = sigmoid(dataMatrix * weights)
        errors = (labelMat - h)
        return weights + alpha * dataMatrix.transpose() * errors


def plot(vector):
    """
    A function to use ggplot to visualize the result
    """
    x = arange(-3, 3, 0.1)
    y = (-vector[0]-vector[1]*x) / vector[2]
    new = pd.DataFrame()
    new['x'] = x
    new['y'] = array(y).flatten()
    infile.classlab = infile.classlab.astype(str)
    p = ggplot(aes(x='x', y='y', colour='classlab'), data=infile) + geom_point()
    return p + geom_line()


# Use pandas to manipulate data
if __name__ == '__main__':
    infile = pd.read_csv("https://raw.githubusercontent.com/pbharrin/machinelearninginaction/master/Ch05/testSet.txt",
                         sep='\t', header=None, names=['x', 'y', 'classlab'])
    infile['one'] = 1
    mat1 = mat(infile[['one', 'x', 'y']])
    mat2 = mat(infile['classlab']).transpose()
    result1 = grad_ascent(mat1, mat2, 500)
    print plot(result1)
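For comparison, here is an iterative version of the same gradient-ascent update rule, in pure Python on a tiny made-up data set (not the book's testSet.txt):

```python
# Iterative gradient ascent for logistic regression on a toy data set.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two well-separated classes in one dimension, with an intercept column.
data = [(1.0, -2.0, 0), (1.0, -1.5, 0), (1.0, -1.0, 0),
        (1.0,  1.0, 1), (1.0,  1.5, 1), (1.0,  2.0, 1)]

w = [1.0, 1.0]                      # initial weights, as in the recursive version
alpha = 0.1
for _ in range(500):                # same cycle count, but a loop instead of recursion
    grad = [0.0, 0.0]
    for x0, x1, label in data:
        err = label - sigmoid(w[0] * x0 + w[1] * x1)
        grad[0] += err * x0
        grad[1] += err * x1
    w = [w[0] + alpha * grad[0], w[1] + alpha * grad[1]]

# The fitted boundary should classify the toy points correctly.
preds = [int(sigmoid(w[0] * x0 + w[1] * x1) > 0.5) for x0, x1, _ in data]
print(preds)   # [0, 0, 0, 1, 1, 1]
```

The loop trades the recursion's call-stack depth for constant space, at the cost of the same number of passes over the data.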
8 May: Douze points: Data science and the Eurovision song contest

An epic clash across the continent of Europe. Every year since 1956, the nations of Europe (and now beyond) have come together to decide who has the best popular song. The contest has launched the careers of many global pop stars, most notably the unforgettable Swedish foursome, ABBA. No one […]
17 Dec: Why take SAS Training in San Francisco?

Why San Francisco? The Golden Gate Bridge. The Aquarium of the Bay. The best ramen noodles. Need more? Our Business Knowledge Series offers you four more great reasons to visit San Francisco soon: Data Mining Techniques: Theory and Practice; Customer Segmentation Using SAS Enterprise Miner; SAS Functions by Example; Survival [...]
10 Dec: PROC PLS and multicollinearity

Multicollinearity and its consequences

Multicollinearity usually brings significant challenges to a regression model, whether it is fit by the normal equations or by gradient descent.

1. Non-invertible SSCP in the normal equations

According to the normal equations, the coefficients can be obtained by $\hat{\beta}=(X^{T}X)^{-1}X^{T}y$. If the SSCP matrix $X^{T}X$ turns out to be singular and non-invertible due to multicollinearity, then the coefficients are theoretically not solvable.
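A tiny pure-Python illustration (my own example): with a perfectly collinear pair of regressors, the determinant of $X^{T}X$ is zero, so the inverse does not exist.

```python
# If one regressor is a linear combination of another, X'X (the SSCP
# matrix) is singular: its determinant is zero, so it cannot be inverted.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]          # perfectly collinear with x1 (x2 = 2 * x1)
X = list(zip(x1, x2))

sscp = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
det = sscp[0][0] * sscp[1][1] - sscp[0][1] * sscp[1][0]
print(det)   # 0.0 -- the normal equations have no unique solution
```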

2. Unstable solution for gradient descent

The gradient descent algorithm uses iterative methods to minimize the residual sum of squares (RSS). For example, as the plot above shows, if there is a strong relationship between two regressors in a regression, many possible combinations of $\beta_1$ and $\beta_2$ lie along a narrow valley, all of which correspond to nearly the minimal RSS. Thus $\beta_1$ can be negative, positive, or even zero, which becomes a confounding effect with respect to a stable model.

Partial Least Squares vs. Principal Components Regression

The most direct way to deal with multicollinearity is to break down the regressors and construct new orthogonal variables. PLS and PCR are both dimension reduction methods that eliminate multicollinearity. The difference is that PLS also uses the response variable to select the new components, which makes it particularly useful for questions with multiple response variables. The PLS procedure in SAS is a powerful and flexible tool that implements both PLS and PCR. One book, An Introduction to Statistical Learning, suggests PCR over PLS:
While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.
In the example using the baseball data set below, with 10-fold cross-validation, PLS chooses 9 components, while PCR picks out 5.
filename myfile url 'https://svn.r-project.org/ESS/trunk/fontlock-test/baseball.sas';
%include myfile;

proc contents data=baseball position;
   ods output position = pos;
run;

proc sql;
   select variable into: regressors separated by ' '
   from pos
   where num between 5 and 20;
quit;
%put &regressors;

data baseball_t;
   set baseball;
   logsalary = log10(salary);
run;

proc pls data=baseball_t censcale nfac=10 cv=split(10);
   title 'partial least squares';
   model logsalary = &regressors;
run;

proc pls data=baseball_t censcale method=pcr nfac=10 cv=split(10);
   title 'principal components regression';
   model logsalary = &regressors;
run;