24 Aug

Flexibility of SAS Enterprise Miner

Do you use an array of tools to perform predictive analytics on your data? Is your current tool not flexible enough to accommodate some of your requirements? SAS Enterprise Miner may be your solution. With a growing number of data mining applications, having a tool that can perform a variety of analyses […]


17 Aug

R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$ y_i=f_i(\boldsymbol{x}|\boldsymbol{\beta})+\varepsilon_i, $$ where $y_i$ is the response or dependent variable for the $i$th case, $i=1,\cdots,n$, and the predictor or independent variable is the $\boldsymbol{x}$ term defined in the mean function $f_i(\boldsymbol{x}|\boldsymbol{\beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$ y_i=\beta_0+\beta_1x_i+\varepsilon_i. $$ To obtain the (best) estimates of $\beta_0$ and $\beta_1$, we minimize the residual sum of squares (RSS), given by $$ S=\sum_{i=1}^{n}\varepsilon_i^2=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2. $$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.

The following is the plot of the residual sum of squares of the data, based on the SLR model, over $\beta_0$ and $\beta_1$; note that we standardized the variables before plotting.
Error Surface
If you are interested in the code for the above figure, please click here. To minimize this elliptic paraboloid, we differentiate with respect to the parameters, set the derivatives to zero to obtain the stationary point, and finally solve for $\beta_0$ and $\beta_1$. For more on the derivation of the parameter estimates, see reference 1.
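As a quick numerical check of this closed-form solution (this sketch is not part of the original post, and the variable names are illustrative), the least-squares normal equations for $\beta_0$ and $\beta_1$ can be solved directly with NumPy on the height-weight data:

import numpy as np

# Average Heights and Weights for American Women (R's built-in data)
height = np.arange(58, 73)                            # 58, 59, ..., 72 inches
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])   # pounds

X = np.column_stack([np.ones_like(height), height])   # design matrix [1, x]
beta = np.linalg.solve(X.T @ X, X.T @ weight)         # solves (X'X) beta = X'y
print(beta)                                           # approx. [-87.51667, 3.45]

The two printed values agree with the estimates reported in the R, Python and SAS fits below.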

Simple Linear Regression in R

In R, we can fit the model using the lm function, which stands for linear models, passing it a formula and the data and saving the result in an object called model.

The formula, written as {response ~ predictor}, is a handy way of specifying the model to be fit in R. Mathematically, our model is $$ weight = \beta_0 + \beta_1 (height) + \varepsilon. $$ The summary is obtained by running model %>% summary or, for non-magrittr users, summary(model), given the model object defined in the previous code.

The Coefficients section above returns the estimated coefficients of the model, $\beta_0 = -87.51667$ and $\beta_1 = 3.45000$ (it should be clear that we used the unstandardized variables to obtain these estimates). Both estimates are significant, with p-values below the .05 and even the .01 level of the test. Using the estimated coefficients along with the residual standard error, we can now construct the fitted line and its confidence interval, as shown below.
Fig 1. Plot of the Data and the Predicted Values in R.

Simple Linear Regression in Python

In Python, two modules provide implementations of linear regression modelling: one is scikit-learn (sklearn) and the other is Statsmodels (statsmodels). For example, we can model the above data using sklearn as follows:
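The original code block did not survive aggregation; a minimal sketch of the sklearn fit it describes (the array names are illustrative) is:

import numpy as np
from sklearn.linear_model import LinearRegression

height = np.arange(58, 73).reshape(-1, 1)   # sklearn expects a 2-D predictor array
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])

model = LinearRegression()                  # fits an intercept by default
model.fit(height, weight)
print(model.intercept_, model.coef_)        # approx. -87.51667, [3.45]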

The output above gives the estimates of the parameters. To obtain the predicted values and plot them along with the data points, as we did in R, we can wrap the functions above into a class, say linear_regression, which uses the Seaborn package for neat plotting.

Using this class and its methods, fitting the model to the data reduces to a call to its fit method,

and the predicted values of the data points are obtained using the predict method; a combined sketch of the class and its usage is given below.
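The original class definition was not carried over by the aggregator; a minimal version consistent with the description above (the names linear_regression, fit and predict come from the post, while the internals are illustrative and the confidence band is omitted for brevity) might look like this:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

class linear_regression(object):
    """Minimal SLR wrapper: closed-form fit, predictions and a simple plot."""

    def __init__(self, x, y):
        self.x = np.asarray(x, dtype=float)
        self.y = np.asarray(y, dtype=float)
        self.beta = None

    def fit(self):
        # solve the normal equations for [intercept, slope]
        X = np.column_stack([np.ones_like(self.x), self.x])
        self.beta = np.linalg.solve(X.T @ X, X.T @ self.y)
        return self.beta

    def predict(self, x_new=None):
        x_new = self.x if x_new is None else np.asarray(x_new, dtype=float)
        return self.beta[0] + self.beta[1] * x_new

    def plot(self):
        sns.set()                                     # Seaborn styling
        plt.scatter(self.x, self.y, label="data")
        plt.plot(self.x, self.predict(), label="fitted line")
        plt.legend()
        plt.show()

# usage: fit the model, obtain predictions, and plot
height = np.arange(58, 73)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])
model = linear_regression(height, weight)
model.fit()              # approx. [-87.51667, 3.45]
predicted = model.predict()
model.plot()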

Figure 2 below shows the plot of the predicted values along with their confidence interval and the data points.
Fig 2. Plot of the Data and the Predicted Values in Python.
If one is only interested in the estimates of the model, then scikit-learn's LinearRegression is sufficient; but if other statistics, such as those returned in the R model summary, are needed, that module can still do the job, although you may need to program the additional routines yourself. statsmodels, on the other hand, returns a complete summary of the fitted model, comparable to the R output above, which is useful for studies with a particular interest in this information. Modelling the data using simple linear regression in statsmodels is done as follows:
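The statsmodels block is likewise not reproduced here; a minimal sketch for the same data (array names are illustrative) would be:

import numpy as np
import statsmodels.api as sm

height = np.arange(58, 73)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])

X = sm.add_constant(height)          # add the intercept column
result = sm.OLS(weight, X).fit()     # ordinary least squares fit
print(result.summary())              # full summary, including Durbin-Watson and Jarque-Bera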

Clearly, statsmodels can save us time, especially in diagnostic checking involving test statistics such as the Durbin-Watson and Jarque-Bera tests. We can of course add some diagnostic plots, but I prefer to discuss that in a separate entry.

Simple Linear Regression in SAS

Now let's consider running the analysis in SAS. I am using SAS Studio, so to import the data I first saved it as a CSV file with columns height and weight, uploaded it to SAS Studio, and then used the code below to import it.

Next we fit the model to the data using the REG procedure,

Number of Observations Read    15
Number of Observations Used    15

Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1        3332.70000     3332.70000    1433.02    <.0001
Error             13          30.23333        2.32564
Corrected Total   14        3362.93333

Root MSE          1.52501    R-Square    0.9910
Dependent Mean  136.73333    Adj R-Sq    0.9903
Coeff Var         1.11531

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             -87.51667           5.93694     -14.74      <.0001
height        1               3.45000           0.09114      37.86      <.0001
Now that's a lot of output, probably the most complete of the three. But as I said, I am not going to discuss each of these values and plots, as some of them are used for diagnostic checking (you can read more on that in reference 1, and in other applied linear regression books). For now, let's just confirm the coefficients: both estimates are the same as those obtained in R and Python.

Multiple Linear Regression (MLR)

To extend SLR to MLR, we'll demonstrate this by simulation. Using the formula-based lm function of R, and assuming we have $x_1$ and $x_2$ as our predictors, the following is how we do MLR in R:

Although we did not include an intercept when simulating the data, the estimates obtained for $\beta_1$ and $\beta_2$ are close to the true parameters (.35 and .56). The intercept, however, helps capture the noise term we added in the simulation.

Next, we'll try MLR in Python using statsmodels; consider the following:
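The code block is missing here as well, so the following is a minimal statsmodels sketch of such a simulation (the slopes .35 and .56 and the absence of a true intercept follow the description above; the sample size, seed and predictor distributions are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)                  # arbitrary seed
n = 100
x1 = rng.normal(loc=50, scale=10, size=n)         # illustrative predictors
x2 = rng.normal(loc=30, scale=5, size=n)
y = 0.35 * x1 + 0.56 * x2 + rng.normal(size=n)    # no intercept in the simulation

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.OLS(y, X).fit()
print(result.params)      # slope estimates should land close to 0.35 and 0.56
print(result.summary())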

It should be noted that the estimates in R and in Python need not be the same, since the data are simulated separately in each language. Finally, we can perform MLR in SAS as follows:

Number of Observations Read    100
Number of Observations Used    100

Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              2         610.86535      305.43268     303.88    <.0001
Error             97          97.49521        1.00511
Corrected Total   99         708.36056

Root MSE          1.00255    R-Square    0.8624
Dependent Mean  244.07327    Adj R-Sq    0.8595
Coeff Var         0.41076

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1              18.01299          11.10116       1.62      0.1079
X1            1               0.31770           0.01818      17.47      <.0001
X2            1               0.58276           0.03358      17.35      <.0001

Conclusion

In conclusion, SAS saves a lot of work, since it returns a complete summary of the model; no doubt this is why companies prefer it, along with its active customer support. R and Python, on the other hand, despite being open source, compete well with SAS, although it takes programming skill to reproduce all of the SAS output. I think that's the exciting part, though: it makes you think and manage your time, and getting there in R and Python is certainly fulfilling. I hope you've learned something; feel free to share your thoughts in the comments below.

Reference

  1. Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley & Sons, Inc. United States of America.
  2. Scikit-learn Documentation
  3. Statsmodels Documentation
  4. SAS Documentation
  5. Delwiche, Lora D., and Susan J. Slaughter. 2012. The Little SAS® Book: A Primer, Fifth Edition. Cary, NC: SAS Institute Inc.
  6. Regression with SAS. Institute for Digital Research and Education. UCLA. Retrieved August 13, 2015.
  7. Python Plotly Documentation
26 Apr

sklearn DecisionTree plot example needs pydotplus

18 Dec

%SVD macro with BY-Processing

For the Regularized Discriminant Analysis cross-validation, we need to compute the SVD for each pair \((\lambda, \gamma)\), and the factorization result will be fed to the downdating algorithm to obtain the leave-one-out variance-covariance matrix \(\hat{\Sigma}_{k\setminus v}(\lambda, \gamma)\). The code is a direct modification of my original SVD macro @here. It takes more than simply putting a BY statement in PROC PRINCOMP, but it is still easy to do. I used the within-class CSSCP from the Iris data shipped with SAS and cross-verified the result with R. The %CallR macro is used here.

The screenshot below shows that the SAS macro generates the same result as R. Here is the test code.

dm log 'clear';
proc discrim data=sashelp.iris noclassify noprint
outstat=iris_stat(WHERE=(_TYPE_ = "CSSCP" & Species ^= " "))
;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
run;

%svd_bystmt(sigma, output_v, output_s, output_u, &covars, Species, , nfac=0);

data _null_;
file "d:\data\sigma.csv" dlm=",";
set iris_stat;
if _n_=1 then do;
put 'Species' ','
'_NAME_' ','
'SepalLength' ','
'SepalWidth' ','
'PetalLength' ','
'PetalWidth';
end;
put Species _NAME_ SepalLength SepalWidth PetalLength PetalWidth;
run;

%RScript(d:\data\rscript.r)
cards4;
dat=read.csv("d:\\data\\sigma.csv", header=1, as.is=1)
head(dat)
sigma=dat[, c(-1, -2)]
sigmasvd=by(sigma, dat$Species, svd)
sigmasvd$Setosa$d
sigmasvd$Virginica$d
sigmasvd$Setosa$v
;;;;
run;

%CallR(d:/data/rscript.r, d:/data/rlog1.txt);

data _null_;
infile "d:\data\rlog1.txt";
input;
put _infile_;
run;
data _null_;
do until (eof1);
set output_s end=eof1;
put @1 Species= @24 Number= @40 Singuval= 8.2;
end;
put " ";
do until (eof2);
set output_v end=eof2;
where Species="Setosa";
put Prin1 8.5 Prin2 8.5 Prin3 8.5 Prin4 8.5;
end;
stop;
run;









%macro SVD_bystmt(
input_dsn,
output_V,
output_S,
output_U,
input_vars,
by_var,
ID_var,
nfac=0
);

%local blank para EV USCORE n pos dsid nobs nstmt
shownote showsource nbylevel bystmt;

%let shownote=%sysfunc(getoption(NOTES));
%let showsource=%sysfunc(getoption(SOURCE));
options nonotes nosource;

%let blank=%str( );
%let EV=EIGENVAL;
%let USCORE=USCORE;

%if &by_var eq &blank %then %do;
%let bystmt = &blank;
%let nbylevel = 1;
%end;
%else %do;
%let bystmt = by &by_var;
%end;

%let n=%sysfunc(countW(&input_vars));

%let dsid=%sysfunc(open(&input_dsn));
%let nobs=%sysfunc(attrn(&dsid, NOBS));
%let dsid=%sysfunc(close(&dsid));
%if &nfac eq 0 %then %do;
%let nstmt=&blank; %let nfac=&n;
%end;
%else %do;
%let x=%sysfunc(notdigit(&nfac, 1));
%if &x eq 0 %then %do;
%let nfac=%sysfunc(min(&nfac, &n));
%let nstmt=%str(n=&nfac);
%end;
%else %do;
%put ERROR: Only accept non-negative integer.;
%goto exit;
%end;
%end;

/* calculate U=XV/S */
%if &output_U ne %str() %then %do;
%let outstmt= out=&output_U.(keep=&by_var &ID_var Prin:);
%end;
%else %do;
%let outstmt=&blank;
%end;

/* build index for input data set */
proc datasets library=work nolist;
modify &input_dsn;
index create &by_var;
run;quit;

%let options=noint cov noprint &nstmt;

proc princomp data=&input_dsn
/* out=&input_dsn._score */
&outstmt
outstat=&input_dsn._stat(where=(_type_ in ("&USCORE", "&EV"))) &options;
&bystmt;
var &input_vars;
run;

/*
need to check if the by_var is Char or Numeric, then set the
format accordingly
*/
data _null_;
set &input_dsn;
type=vtype(&by_var);
if type="C" then
call symput("bylevelfmtvar", "$bylevelfmt");
else
call symput("bylevelfmtvar", "bylevelfmt");
run;


data __bylevelfmt;
set &input_dsn._stat end=eof;
&bystmt;
retain _nbylevel 0;
retain fmtname "&bylevelfmtvar";
if first.&by_var then do;
_nbylevel+1;
start=&by_var;
label=_nbylevel;
output;
keep label start fmtname ;
end;
if eof then call symput("_nbylevel", _nbylevel);
run;
proc format cntlin=__bylevelfmt;
run;
%put &_nbylevel;


%if &output_V ne &blank %then %do;
proc transpose data=&input_dsn._stat(where=(_TYPE_="&USCORE"))
out=&output_V.(rename=(_NAME_=variable))
name=_NAME_;
&bystmt;
var &input_vars;
id _NAME_;
format &input_vars 8.6;
run;
%end;

/* recompute Proportion */
%if &output_S ne %str() %then %do;
data &output_S;
set &input_dsn._stat ;
&bystmt;
where _TYPE_="EIGENVAL";
array _s{*} &input_vars;
array _x{&nfac, 3} _temporary_;
Total=sum(of &input_vars, 0);
_t=0;
do _i=1 to &nfac;
_x[_i, 1]=_s[_i];
_x[_i, 2]=_s[_i]/Total;
if _i=1 then
_x[_i, 3]=_x[_i, 2];
else
_x[_i, 3]=_x[_i-1, 3]+_x[_i, 2];
_t+sqrt(_x[_i, 2]);
end;
do _i=1 to &nfac;
Number=_i;
EigenValue=_x[_i, 1]; Proportion=_x[_i, 2]; Cumulative=_x[_i, 3];
S=sqrt(_x[_i, 2])/_t; SinguVal=sqrt(_x[_i, 1] * &nobs/&_nbylevel);
keep &by_var Number EigenValue Proportion Cumulative S SinguVal;
output;
end;
run;
%end;

%if &output_U ne %str() %then %do;
data &output_U;
array _S{&_nbylevel, &nfac} _temporary_;
if _n_=1 then do;
do until (eof);
set &output_S(keep=&by_var Number SinguVal) end=eof;
where Number<=&nfac;
%if &by_var eq &blank %then %do;
_row=1;
%end;
%else %do;
_row = input(put(&by_var, &bylevelfmtvar..), best.);
%end;
_S[_row, number]=SinguVal;
if abs(_S[_row, number]) < CONSTANT('MACEPS') then _S[_row, number]=CONSTANT('BIG');
end;
end;
set &output_U;
array _A{*} Prin1-Prin&nfac;
%if &by_var eq &blank %then %do;
_row=1;
%end;
%else %do;
_row = input(put(&by_var, &bylevelfmtvar..), best.);
%end;
do _j=1 to dim(_A);
_A[_j]=_A[_j]/_S[_row, _j];
end;
keep &by_var &ID_var Prin1-Prin&nfac ;
run;
%end;

proc datasets library=work nolist;
modify &input_dsn;
index delete &by_var;
run;quit;


%exit:
options &shownote &showsource;
%mend;
15 Dec

Experimental downdating algorithm for Leave-One-Out CV in RDA

In this post, I want to demonstrate a piece of experimental code for the downdating algorithm for Leave-One-Out (LOO) Cross Validation in Regularized Discriminant Analysis [1].

In LOO CV, the program needs to calculate the inverse of \(\hat{\Sigma}_{k\setminus v}(\lambda, \gamma)\) for each observation \(v\) to obtain its new scoring function at a given pair of regularization parameters \((\lambda, \gamma)\), where \(k\setminus v\) indicates class \(k\) excluding observation \(v\).

As mentioned in [1], to obtain the inverse of this variance-covariance matrix for class \(k\), it is not sufficient to just remove observation \(v\) and recalculate it; instead, we calculate $$W_{k\setminus v}(\lambda, \gamma)\hat{\Sigma}^{-1}_{k\setminus v}(\lambda, \gamma),$$ which is equivalent to computing the inverse of:

\[
  W_k(\lambda)\hat{\Sigma}^{-1}_k (\lambda) - \gamma \|Z_v\|^2  \cdot  I - (1-\gamma)Z_v Z_v^{t} \]

In this formula, the first term, \(W_k(\lambda)\hat{\Sigma}^{-1}_k(\lambda)\), does not involve quantities that vary with observation \(v\) and can be pre-computed. The latter two terms can easily be updated on the fly as we scan through each individual observation.

With this in mind, we use the iris data for demonstration. Note that the SAS code here is meant to illustrate the concept of the downdating algorithm above for a given pair \((\lambda, \gamma)\). The computation is divided into several steps. (1) We first use PROC DISCRIM to calculate the CSSCP matrix, total weight and variable means by class. These quantities correspond to \(\Sigma_k\), \(\bar{X}_k\) and \(W_k\) in the original paper. (2) We then use the %SVD macro @here to obtain the key elements for the later updating.

Note that in order to obtain the CSSCP, means and total weights, we can use PROC CORR in addition to PROC DISCRIM. The advantage is that we can keep all of the computation within the SAS/Base framework, which benefits budget-sensitive businesses and maximizes compatibility. The downside is that the data set needs to be sorted or indexed to obtain the CSSCP by class, and the data has to be passed twice: once for the pooled CSSCP and once for the class-specific CSSCP.

(to be completed...)
 

/* study downdating algorithm */
%let lambda = 0.3;
%let gamma = 0.6;
%let target = Species;
%let covars = SepalLength SepalWidth PetalLength PetalWidth;

/* step 0. Obtain key statistics from DISCRIM */
proc discrim data=iris outstat=stat noprint noclassify;
class &target;
var &covars;
run;

/* step 1. Obtain \Sigma_k(\lambda) & \Sigma_k(\lambda, \gamma)
\S_k = (_TYPE_="CSSCP" & &target = "something")
\S = (_TYPE_="CSSCP" & &target = " ")
\W_k = (_TYPE_="N" & &target = "something") -1
\W = (_TYPE_="N" & &target = " ") - &nClass + 1
\Sigma_k(\lambda) = \S_k(\lambda) / \W_k(\lambda)
where:
\S_k(\lambda) = (1-\lambda)*S_k + \lambda*S
\W_k(\lambda) = (1-\lambda)*W_k + \lambda*W

\Sigma_k(\lambda, \gamma) = (1-\gamma)*\Sigma_k(\lambda) + \gamma* trace[\Sigma_k(\lambda)]/p*I
*/
proc sql noprint;
select distinct &target into :targetvalues separated by " "
from stat
where _TYPE_="CSSCP"
;
select distinct _NAME_ into :covars separated by " "
from stat
where _TYPE_="CSSCP"
;
quit;
%put &targetvalues;
%put &covars;

%let nClass=%sysfunc(countw("&targetvalues"));
%put &nClass;
%let nVars=%sysfunc(countw("&covars"));
%put &nVars;

data weights;
set stat;
where _TYPE_="N";
array _nv{*} _numeric_;
if &target="" then _total=_nv[1]-&nClass+1;
retain lambda &lambda;
retain _total;
weight=(1-&lambda) * (_nv[1]-1) + &lambda*_total;
keep &target weight lambda;
if &target ^= "";
run;

proc sort data=stat out=sigma;
by _TYPE_ &target;
WHERE _TYPE_ = "CSSCP" & &target ^= " ";
run;
proc sort data=weights;
by &target;
run;

/*- step 1.1 Update \Sigma_k(\lambda) = ((1-\lambda)*S_k + \lambda*S) / ( (1-\lambda)*W_k + \lambda*W ) -*/

data sigma;
array _sigma{&nVars, &nVars} _temporary_;
array _x{&nVars} &covars;
array _y{&nClass} _temporary_;
if _n_=1 then do;
_iter=1;
do until (eof1);
set stat(where=(_TYPE_="CSSCP" & &target = " ")) end=eof1;
do _j=1 to &nVars;
_sigma[_iter, _j]=_x[_j];
end;
_iter+1;
end;

_iter=1;
do until (eof2);
set weights end=eof2;
_y[_iter]=weight;
_iter+1;
end;
put _y[3]=;

_target=&target;
_iter=0;
end;
modify sigma ;
retain weight;
retain _target;
_TYPE_="COV";

if &target^=_target then do;
_iter+1;
_i=0;
weight=_y[_iter];
_target=&target;
end;
_i+1;
put "&covars";
put &covars;
do _j=1 to &nVars;
_x[_j]= ((1-&lambda)*_x[_j] + &lambda*_sigma[_i, _j]) ;
end;
put &covars;
put "-----------------------------------------------------------------";
replace;
run;

/*- the trace is the sum of the diagonal elements of the matrix -*/
data _trace;
set sigma;
by &target;
array _x{*} &covars;
retain trace _iter;
if first.&target then do;
_iter = 1;
trace=0;
end;
trace+_x[_iter];
_iter+1;
if last.&target then output;
keep &target trace;
run;


/*- step 1.2 Update \Sigma_k (\lambda, \gamma) = (1-\gamma)*\Sigma_k(\lambda) + \gamma *c*I -*/
/*- where c = trace(\Sigma_k(\lambda)) / p, where p=&nVars -*/
data sigma;
if _n_=1 then do;
declare hash h(dataset:'_trace');
h.defineKey("&target");
h.defineData("trace");
h.defineDone();
call missing(trace);
end;
set sigma; by &target;
array _x{*} &covars;
retain _adj trace _iter;
if first.&target then do;
_iter=1;
end;
if _iter=1 then do;
_rc=h.find();
_adj= &gamma/&nVars * trace;
end;
put _iter= &target=;
do _j=1 to &nVars;
if _j=_iter then do;
_x[_j] = (1-&gamma)*_x[_j] + _adj;
end;
else do;
_x[_j] = (1-&gamma)*_x[_j];
end;
end;
_iter=_iter+1;
drop _type_ _adj _iter _j _rc trace;
run;



/*-step 1.3 Obtain the inverse of W_k \Sigma_k(\lambda, \gamma) - c*I using SVD -*/
/*-This inverse is the same as first inverse \Sigma_k(\lambda, \gamma), then with S matrix deflated by W_k(\lambda) -*/
/*-and differenced by c, where c=\gamma* |Z_v|^2/p, and Z_v = sqrt(bk(v))*(X_v- mean(X_k\v), -*/
/*-bk(v)=sk(v)*Wc(v) wv/(Wc(v)-wv) -*/
/*-v is the current observation -*/

/*- 1. apply SVD to obtain eigenvalues
%SVD(sigma, out_u, out_S, out_V) -*/


/*- 2. The whole CV will be done in one Data Step.
Because the remaining related quantities, such as |Z_v|, sk(v) and Wc(v),
require at least one pass through the original data, it is
best to incorporate all computing in this step.
-*/
data _iris;
array _m{*} &covars;
array _mt{&nClass, &nVars} _temporary_;
if _n_=1 then do;
_class=1;
do until (eof);
set stat(where=(_TYPE_='MEAN'& &target^="")) end=eof;
do _j=1 to dim(_m);
_mt[_class, _j]=_m[_j];
end;
_class+1;
end;
end;
array _rank1{&nVars, &nVars} _temporary_;
set iris point=1;
do _j=1 to dim(_m);
_m[_j]=_m[_j]-_mt[1, _j];
end;
do _j=1 to &nVars;
do _k=1 to &nVars;
_rank1[_j, _k]=_m[_j] * _m[_k];

/* MORE CODE TO COME*/

end;
end;
run;

[1] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.
21 May

Use recursion and gradient ascent to solve logistic regression in Python

In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to realize a recursive alternative. Compared with the iterative method, the recursion costs more space but may bring a performance improvement.
# -*- coding: utf-8 -*-
"""
Use recursion and gradient ascent to solve logistic regression in Python
"""

import pandas as pd
from numpy import *          # exp, shape, ones, arange, array, mat
from ggplot import *

def sigmoid(inX):
    return 1.0/(1 + exp(-inX))

def grad_ascent(dataMatrix, labelMat, cycle):
    """
    A function to use gradient ascent recursively to calculate the coefficients
    """
    if not isinstance(cycle, int) or cycle < 0:
        raise ValueError("Must be a valid value for the number of iterations")
    m, n = shape(dataMatrix)
    alpha = 0.001
    if cycle == 0:
        return ones((n, 1))
    else:
        weights = grad_ascent(dataMatrix, labelMat, cycle - 1)
        h = sigmoid(dataMatrix * weights)
        errors = labelMat - h
        return weights + alpha * dataMatrix.transpose() * errors

def plot(vector):
    """
    A function to use ggplot to visualize the result
    """
    x = arange(-3, 3, 0.1)
    y = (-vector[0] - vector[1]*x) / vector[2]   # decision boundary
    new = pd.DataFrame()
    new['x'] = x
    new['y'] = array(y).flatten()
    infile.classlab = infile.classlab.astype(str)
    p = ggplot(aes(x='x', y='y', colour='classlab'), data=infile) + geom_point()
    return p + geom_line(aes(x='x', y='y'), data=new)

# Use pandas to manipulate data
if __name__ == '__main__':
    infile = pd.read_csv("https://raw.githubusercontent.com/pbharrin/machinelearninginaction/master/Ch05/testSet.txt",
                         sep='\t', header=None, names=['x', 'y', 'classlab'])
    infile['one'] = 1
    mat1 = mat(infile[['one', 'x', 'y']])
    mat2 = mat(infile['classlab']).transpose()
    result1 = grad_ascent(mat1, mat2, 500)
    print(plot(result1))
8 May

Douze points: Data science and the Eurovision song contest

An epic clash across the continent of Europe. Every year since 1956 the nations of Europe (and now beyond) have come together to decide who has the best popular song. The contest has launched the careers of many global pop stars, most notably the unforgettable Swedish foursome, ABBA. No one […]
17 Dec

Why take SAS Training in San Francisco?

Why San Francisco? The Golden Gate Bridge. The Aquarium of the Bay. The best ramen noodles. Need more? Our Business Knowledge Series offers you four more great reasons to visit San Francisco soon: Data Mining Techniques: Theory and Practice; Customer Segmentation Using SAS Enterprise Miner; SAS Functions by Example; Survival [...]
10 Dec

PROC PLS and multicollinearity

Multicollinearity and its consequences

Multicollinearity usually brings significant challenges to a regression model, whether it is estimated by the normal equation or by gradient descent.

1. The normal equation needs an invertible SSCP

According to the normal equation, the coefficients can be obtained as \(\hat{\beta}=(X'X)^{-1}X'y\). If the SSCP matrix \(X'X\) turns out to be singular, and hence non-invertible, due to multicollinearity, then the coefficients cannot be solved for.

2. Unstable solution for gradient descent

The gradient descent algorithm uses iterative methods to minimize the residual sum of squares (RSS). For example, as the plot above shows, if there is a strong relationship between two regressors in a regression, many possible combinations of \(\beta_1\) and \(\beta_2\) lie along a narrow valley, all of which correspond to the minimal RSS. Thus \(\beta_1\) can be negative, positive or even zero, which confounds any attempt at a stable model.
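To make this concrete, here is a small NumPy sketch (not from the original post) in which two nearly collinear regressors produce an ill-conditioned SSCP and slope estimates that swing widely from one noise draw to the next, even though their sum stays stable:

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 is almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.cond(X.T @ X))                # huge condition number: the SSCP is nearly singular

for trial in range(3):                        # refit under different noise realizations
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)      # the two slopes vary wildly, while their sum stays near 2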

Partial Least Squares vs. Principal Components Regression

The most direct way to deal with multicollinearity is to break down the regressors and construct new, orthogonal variables. PLS and PCR are both dimension-reduction methods that eliminate multicollinearity. The difference is that PLS also uses the response variable to construct the new components, which makes it particularly useful for questions with multiple response variables. The PLS procedure in SAS is a powerful and flexible tool that applies either PLS or PCR. One book, An Introduction to Statistical Learning, is cautious about the benefit of PLS relative to PCR:
While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.
In the example using the baseball data set below, with 10-fold cross-validation, PLS chooses 9 components, while PCR picks out 5.
filename myfile url 'https://svn.r-project.org/ESS/trunk/fontlock-test/baseball.sas';
%include myfile;
proc contents data=baseball position;
ods output position = pos;
run;

proc sql;
select variable into: regressors separated by ' '
from pos
where num between 5 and 20;
quit;
%put &regressors;

data baseball_t;
set baseball;
logsalary = log10(salary);
run;

proc pls data=baseball_t censcale nfac=10 cv=split(10);
title 'partial least squares';
model logsalary=&regressors;
run;

proc pls data=baseball_t censcale method = pcr nfac=10 cv=split(10);
title 'principal components regression';
model logsalary=&regressors;
run;
9 Dec

Use R in Hadoop by streaming

It seems that the combination of R and Hadoop is a must-have toolkit for people working with both statistics and large data sets.

An aggregation example

The Hadoop version used here is Cloudera’s CDH4, and the underlying Linux OS is CentOS 6. The data used is a simulated sales data set from a training course by Udacity. Each line of the data set has the format: date, time, store name, item description, cost and method of payment. The six fields are separated by tabs. Only two fields, store and cost, are used here, to aggregate the cost by store.
A typical MapReduce job contains two R scripts: mapper.R and reducer.R.
Mapper.R
#! /usr/bin/env Rscript
# Run in batch mode under Rscript (don't use a path like /usr/bin/R)

options(warn=-1)

# Read tab-separated lines from stdin and write tab-separated output

input = file("stdin", "r")
while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
  fields = unlist(strsplit(currentLine, "\t"))
  # Make sure the line has six fields
  if (length(fields) == 6) {
    # emit: store name (field 3) and cost (field 5)
    cat(fields[3], fields[5], "\n", sep="\t")
  }
}
close(input)
Reducer.R
#! /usr/bin/env Rscript

options(warn=-1)
salesTotal = 0
oldKey = ""

# Loop over the mapper output as key-value pairs
input = file("stdin", "r")
while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
  data_mapped = unlist(strsplit(currentLine, "\t"))
  if (length(data_mapped) != 2) {
    # Something has gone wrong with this line; skip it
    next
  }

  thisKey = data_mapped[1]
  thisSale = as.double(data_mapped[2])

  # A new key means the previous store is complete: emit its total and reset
  if (!identical(oldKey, "") && !identical(oldKey, thisKey)) {
    cat(oldKey, salesTotal, "\n", sep="\t")
    salesTotal = 0
  }

  oldKey = thisKey
  salesTotal = salesTotal + thisSale
}

# Emit the total for the last key
if (!identical(oldKey, "")) {
  cat(oldKey, salesTotal, "\n", sep="\t")
}

close(input)

Testing

Before running MapReduce, it is better to test the codes by some linux commands.
# Make R scripts executable   
chmod +x mapper.R
chmod +x reducer.R
ls -l

# Strip out a small file to test
head -500 purchases.txt > test1.txt
cat test1.txt | ./mapper.R | sort | ./reducer.R

Execution

One way is to specify all the paths and therefore start the expected MapReduce job.
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
-mapper mapper.R -reducer reducer.R \
-file mapper.R -file reducer.R \
-input myinput \
-output joboutput
Or we can use the alias under CDH4, which saves a lot of typing.
hs mapper.R reducer.R myinput joboutput
Overall, the MapReduce job driven by R runs smoothly. The Hadoop JobTracker can be used to monitor or diagnose the overall process.

RHadoop or streaming?

RHadoop is a package developed by Revolution Analytics that allows users to write MapReduce jobs directly in R, and it is surely a much more popular way to integrate R and Hadoop. However, the package is currently evolving quickly and has complicated dependencies. As an alternative, streaming functionality is built into Hadoop itself and supports any programming language, including R. If the proper installation of RHadoop poses a challenge, then streaming is a good starting point.