5 Feb

Coding in Python with SAS University Edition

Good news learners! SAS University Edition has gone back to school and learned some new tricks.

With the December 2017 update, SAS University Edition now includes the SASPy package, available in its Jupyter Notebook interface. If you're keeping track, you know that SAS University Edition has long had support for Jupyter Notebook. With that, you can write and run SAS programs in a notebook-style environment. But until now, you could not use that Jupyter Notebook to run Python programs. With the latest update, you can -- and you can use the SASPy library to drive SAS features like a Python coder.

Oh, and there's another new trick that you'll find in this version: you can now use SAS (and Python) to access data from HTTPS websites -- that is, sites that use SSL encryption. Previous releases of SAS University Edition did not include the components that are needed to support these encrypted connections. That's going to make downloading web data much easier, not to mention using REST APIs. I'll show one HTTPS-enabled example in this post.

How to create a Python notebook in SAS University Edition

When you first access SAS University Edition in your web browser, you'll see a colorful "Welcome" window. From here, you can (A) start SAS Studio or (B) start Jupyter Notebook. For this article, I'll assume that you select choice (B). However, if you want to learn to use SAS and all of its capabilities, SAS Studio remains the best method for doing that in SAS University Edition.

When you start the notebook interface, you're brought into the Jupyter Home page. To get started with Python, select New->Python 3 from the menu on the right. You'll get a new empty Untitled notebook. I'm going to assume that you know how to work with the notebook interface and that you want to use those skills in a new way...with SAS. That is why you're reading this, right?

Move data from a pandas data frame to SAS

pandas is the standard for Python programmers who work with data. The pandas module is included in SAS University Edition -- you can use it to read and manipulate data frames (which you can think of like a table). Here's an example of retrieving a data file from GitHub and loading it into a data frame. (Read more about this particular file in this article. Note that GitHub uses HTTPS -- now possible to access in SAS University Edition!)

import saspy
import pandas as pd
 
df = pd.read_csv('https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv')
df.describe()

Here's the result. This is all straight Python stuff; we haven't started using any SAS yet.

Before we can use SAS features with this data, we need to move the data into a SAS data set. SASPy provides a dataframe2sasdata() method (shorter alias: df2sd) that can import your Python pandas data frame into a SAS library and data set. The method returns a SASdata object. This example copies the data into WORK.PROBLY in the SAS session:

sas = saspy.SASsession()
probly = sas.df2sd(df,'PROBLY')
probly.describe()

The SASdata object also includes a describe() method that yields a result that's similar to what you get from pandas:

Drive SAS procedures with Python

SASPy includes a collection of built-in objects and methods that provide APIs to the most commonly used SAS procedures. The APIs present a simple "Python-ic" style approach to the work you're trying to accomplish. For example, to create a SAS-based histogram for a variable in a data set, simply use the hist() method.
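
For instance, after moving the PROBLY data into SAS, a single call produces a SAS-generated histogram. This is a minimal sketch; the variable name is an assumption, since SAS normalizes the CSV column headings on import (check the actual names with probly.columnInfo()):

# plot a SAS histogram for one of the survey phrases (variable name assumed)
probly.hist('Almost_Certainly', title='Probability assigned to "Almost Certainly"')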

SASPy offers dozens of simple API methods that represent statistics, machine learning, time series, and more. You can find them documented on the GitHub project page. Note that since SAS University Edition does not include all SAS products, some of these API methods might not work for you. For example, the SASml.forest() method (which relies on a procedure from SAS Enterprise Miner) will not work in SAS University Edition.

In SASPy, all methods generate SAS program code behind the scenes. If you like the results you see and want to learn the SAS code that was used, you can flip on the "teach me SAS" mode in SASPy.

sas.teach_me_SAS(True)

Here's what SASPy reveals about the describe() and hist() methods we've already seen:

Interesting code, right? Does it make you want to learn more about the SCALE= option in PROC SGPLOT?

If you want to experiment with SAS statements that you've learned, you don't need to leave the current notebook and start over. There's also a built-in %%SAS "magic command" that you can use to try out a few of these SAS statements.

%%SAS
proc means data=sashelp.cars stackodsoutput n nmiss median mean std min p25 p50 p75 max;run;

Python limitations in SAS University Edition

SAS University Edition includes over 300 Python modules to support your work in Jupyter Notebook. To see a complete list, run the help('modules') command from within a Python notebook. This list includes the common Python packages required to work with data, such as pandas and NumPy. However, it does not include any of the popular Python-based machine learning modules, nor any modules to support data visualization. Of course, SASPy has support for most of this within its APIs, so why would you need anything else...right?

Because SAS University Edition is packaged in a virtual machine that you cannot alter, you don't have the option of installing additional Python modules. You also don't have access to the Jupyter terminal, which would allow you to control the system from a shell-like interface. All of this is possible (and encouraged) when you have your own SAS installation with your own instance of SASPy. It's all waiting for you when you've outgrown the learning environment of SAS University Edition and you're ready to apply your SAS skills and tech to your official work!


The post Coding in Python with SAS University Edition appeared first on The SAS Dummy.

10 Jan

Come on in, we're open: The openness of SAS® 9.4

The SAS® platform is now open to be accessed from open-source clients such as Python, Lua, Java, the R language, and REST APIs to leverage the capabilities of SAS® Viya® products and solutions. You can analyze your data in a cloud-enabled environment that handles large amounts of data in a variety of different formats. To find out more about SAS Viya, see the “SAS Viya: What's in it for me? The user.” article.

This blog post focuses on the openness of SAS® 9.4 and discusses features such as the SASPy package and the SAS kernel for Jupyter Notebook and more as clients to SAS. Note: This blog post is relevant for all maintenance releases of SAS 9.4.

SASPy

The SASPy package enables you to connect to and run your analysis from SAS 9.4 using the object-oriented methods and objects from the Python language as well as the Python magic methods. SASPy translates the objects and methods added into the SAS code before executing the code. To use SASPy, you must have SAS 9.4 and Python 3.x or later.
Note: SASPy is an open-source project that encourages your contributions.

After you have completed the installation and configuration of SASPy, you can import the SASPy package as demonstrated below:
Note: I used Jupyter Notebook to run the examples in this blog post.

1.   Import the SASPy package:

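In a notebook cell, this is simply:

import saspy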

2.   Start a new session. The sas object is created as a result of starting a SAS session using a locally installed version of SAS under Microsoft Windows. After this session is successfully established, the following note is generated:
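
For a locally installed SAS on Windows, the call looks something like this (the configuration name is an assumption; yours may differ):

# start a SAS session; a note such as "SAS Connection established" is printed on success
sas = saspy.SASsession(cfgname='winlocal')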

Adding Data

Now that the SAS session is started, you need to add some data to analyze. This example uses SASPy to read a CSV file that provides census data based on the ZIP Codes in Los Angeles County and create a SASdata object named tabl:
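
A sketch of this step, assuming the census CSV has been downloaded locally (the file name here is hypothetical):

# read the CSV into the WORK library and return a SASdata object named tabl
tabl = sas.read_csv('./census_la_zipcodes.csv', table='tabl')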

To view the attributes of this SASdata object named tabl, use the print() function as shown below. It displays the libref and the SAS data set name, and shows that the results format is Pandas, which is the default result output for tables.
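
For example:

print(tabl)   # displays the libref, the table name and the results format (Pandas by default)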

Using Methods to Display and Analyze Data

This section provides some examples of how to use different methods to interact with SAS data via SASPy.

Head() Method

After loading the data, you can look at the first few records of the ZIP Code data, which is easy using the familiar head() method in Python. This example uses the head() method on the SASdata object tabl to display the first five records. The output is shown below:
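
A minimal sketch of that call:

tabl.head()   # first 5 records of the SAS data set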

Describe() Method

After verifying that the data is what you expected, you can now analyze the data. To generate a simple summary of the data, use the Python describe() method in conjunction with the index [1:3]. This combination generates a summary of all the numeric fields within the table and displays only the second and third records. The subscript works only when the result is set to Pandas and does not work if set to HTML or Text, which are also valid options.
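
A sketch of that combination (the slice works because the Pandas results format returns a DataFrame):

tabl.describe()[1:3]   # summary of the numeric fields, keeping only the second and third rows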

Teach_me_SAS() Method

The SAS code generated from the object-oriented Python syntax can also be displayed using SASPy with the teach_me_SAS() method. When you set the argument of this method to True (a Boolean value), the SAS code is displayed without being executed:
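
For example:

sas.teach_me_SAS(True)    # show the generated SAS code instead of running it
# sas.teach_me_SAS(False) # turn code submission back on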

ColumnInfo() Method

In the next cell, use the columnInfo() method to display the information about each variable in the SAS data set. Note: The SAS code is generated as a result of adding the teach_me_SAS() method in the last section:
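
A minimal sketch:

tabl.columnInfo()   # with teach_me_SAS on, this displays the PROC CONTENTS code instead of running it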

Submit() Method

Then, use the submit() method to execute the PROC CONTENTS code displayed in the cell above directly from Python. The submit() method returns a dictionary with two keys, LST and LOG. The LST key contains the results and the LOG key contains the SAS log. The results are displayed as HTML, so the HTML class is imported to display them.
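
A sketch of this step, using the tabl data set created above:

from IPython.display import HTML

ll = sas.submit("proc contents data=work.tabl; run;")
HTML(ll['LST'])   # render the results; ll['LOG'] holds the SAS log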

The SAS Kernel Using Jupyter Notebook

Jupyter Notebook can run programs in various programming languages, including SAS when you install and configure the SAS kernel. Using the SAS kernel is another way to run SAS interactively in a web-based interface, which also enables you to save the analysis in a notebook. See the links above for details about installation and configuration of the SAS kernel. To verify that the SAS kernel installed successfully, run the following command: jupyter kernelspec list

From the command line, use the following command to start Jupyter Notebook: jupyter notebook. The screenshot below shows the Jupyter Notebook session that starts when you run the command. To execute SAS syntax from Jupyter Notebook, select SAS from the New drop-down list as shown below:

You can add SAS code to a cell in Jupyter Notebook and execute it. The following code adds a PRINT procedure and a SGPLOT procedure. The output is in HTML5 by default. However, you can specify a different output format if needed.

You can also use magics in the cell such as the %%python magic even though you are using the SAS kernel. You can do this for any kernel that you have installed.

Other SAS Goodness

There are more ways of interacting with other languages from SAS as well. For example, you can use the GROOVY procedure to run Groovy statements on the Java Virtual Machine (JVM). You can also use the LUA procedure to run Lua code from SAS, along with the ability to call most SAS functions from Lua. For more information, see “Using Lua within your SAS programs.” Another very powerful feature is the DATA step JavaObject, which provides the ability to instantiate Java classes and access their fields and methods. The DATA step JavaObject has been available since SAS® 9.2.

Resources

SASPy Documentation

Introducing SASPy: Use Python code to access SAS

Come on in, we're open: The openness of SAS® 9.4 was published on SAS Users.

10 Jan

Linear Regression in Python

Linear Regression is a supervised statistical technique where we try to estimate the dependent variable with a given set of independent variables. We assume the relationship to be linear and our dependent variable must be continuous in nature.
In the following diagram we can see that as horsepower increases, mileage decreases; thus we can think of fitting a linear regression. The red line is the fitted line of regression and the points denote the actual observations.

The vertical distance between the points and the fitted line (line of best fit) are called errors. The main idea is to fit this line of regression by minimizing the sum of squares of these errors. This is also known as principle of least squares.

Examples:
  • Estimating the price (Y) of a house on the basis of its Area (X1), Number of bedrooms (X2), proximity to market (X3) etc. 
  • Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower(X2), number of cylinders(X3), whether it is automatic or manual (X4) etc. 
  • Estimating the treatment cost of a patient on the basis of factors like age, weight, past medical history and, if available, the results of blood reports.
Simple Linear Regression Model: In this we try to predict the value of dependent variable (Y) with only one regressor or independent variable(X).

Multiple Linear Regression Model: Here we try to predict the value of dependent variable (Y) with more than one regressor or independent variables.

The linear regression model:
y = β0 + β1*X1 + β2*X2 + ... + βk*Xk + ε
Here 'y' is the dependent variable to be estimated, the X's are the independent variables and ε is the error term.

Assumptions of linear regression:
  • There must be a linear relationship between the dependent and independent variables.
  • Sample observations are independent.
  • Error terms are normally distributed with mean 0. 
  • No multicollinearity -  When the independent variables in my model are highly linearly related then such a situation is called multicollinearity.
  • Error terms are identically and independently distributed. (Independence means absence of autocorrelation).
  • Error terms have constant variance i.e. there is no heteroscedasticity.
  • No outliers are present in the data.


Important Model Performance Metrics

Coefficient of Determination (R square)
It indicates the proportion of variation in Y which can be explained by the independent variables. Mathematically, it is one minus the ratio of the residual sum of squares to the total sum of squares:
R square = 1 - RSS/TSS, where RSS = sum((y - y_predicted)^2) and TSS = sum((y - y_mean)^2).

If our fit is perfect then RSS = 0 and R square = 1, while R square = 0 indicates a poor fit. Thus R square lies between 0 and 1.
If the value of R2 is 0.912 then this suggests that 91.2% of the variation in Y can be explained with the help of given explanatory variables in that model. In other words, it explains the proportion of variation in the dependent variable that is explained by the independent variables.
R square alone is not such a good measure:
On addition of a new variable the error is sure to decrease, thus R square always increases whenever a new variable is added to our model. This may not reflect the importance of the variable.
For example, in a model determining the price of a house, suppose we had the variables GDP, inflation rate and area. If we add a new, irrelevant variable such as the number of plane crashes, R square will still increase.

Adjusted R square:

Adjusted R square is given by:
Adjusted R square = 1 - (1 - R square) * (n - 1)/(n - k - 1)
where n is the number of observations and k is the no. of regressors or predictors.

Hence adjusted R square will always be less than or equal to R square.

On addition of a variable, R square in the numerator and 'k' in the denominator both increase.
If the variable is actually useful then R square will increase by a large amount, which compensates for the increase in 'k', so adjusted R square goes up. On the other hand, if the variable is irrelevant then R square will not increase much and adjusted R square will eventually decrease.
Thus, as a general rule of thumb, if adjusted R square increases when a new variable is added to the model, the variable should remain in the model. If the adjusted R square decreases when the new variable is added then the variable should not remain in the model.
Why error terms should be normally distributed?
For parameter estimation (i.e. estimating the βi's) we don't need that assumption. But if the errors are not normally distributed, some of the hypothesis tests that we do as part of diagnostics may not be valid.
For example, to check whether a beta (regression coefficient) is significant or not, we do a t-test. If the errors are not normally distributed, then the statistic we derive may not follow a t-distribution, so the diagnostic or hypothesis test is not valid. Similarly, the F-test for linear regression, which checks whether any of the independent variables in a multiple linear regression model are significant, will not be valid.
Why is expectation of error always zero?

The error term is the deviation between observed points and the fitted line. The observed points will be above and below the fitted line, so if I took the average of all the deviations, it should be 0 or near 0. Zero conditional mean is there which says that there are both negative and positive errors which cancel out on an average. This helps us to estimate dependent variable precisely.

Why multicollinearity is a problem? 

If the Xi’s are highly correlated then |X’X| will be close to 0 and hence the inverse of (X’X) will not exist or will be indefinitely large. Mathematically, the variance of the coefficient estimates is Var(beta_hat) = sigma^2 * (X’X)^-1, which will be indefinitely large in the presence of multicollinearity. Long story short, multicollinearity inflates the estimated standard errors of the regression coefficients, which makes some variables statistically insignificant when they should be significant.

How can you detect multicollinearity?
1. Bunch Map Analysis: By plotting scatter plots between the various Xi’s we can have a visual description of how the variables are related.

2. Correlation Method: By calculating the correlation coefficients between the variables we can get to know about the extent of multicollinearity in the data.

3.  VIF (Variance Inflation Factor) Method: Firstly we fit a model with all the variables and then calculate the variance inflation factor (VIF) for each variable. VIF measures how much the variance of an estimated regression coefficient increases when your predictors are correlated. The higher the value of VIF for the ith regressor, the more highly it is correlated with the other variables.

So what is Variance Inflation Factor?
The variance inflation factor (VIF) for an explanatory variable is given by 1/(1 - R^2). Here, R^2 is obtained by taking that particular X as the response variable and all the other explanatory variables as predictors; in other words, we run a regression of one explanatory variable on the remaining explanatory variables.
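
To make the definition concrete, here is a small self-contained sketch (the data and column names are purely illustrative):

# VIF of X1 = 1 / (1 - R^2), where R^2 comes from regressing X1 on the other predictors
import pandas as pd
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({'X1': [1, 2, 3, 4, 5],
                  'X2': [2, 1, 4, 3, 6],
                  'X3': [5, 4, 4, 2, 1]})
others = X.drop('X1', axis=1)
r2 = LinearRegression().fit(others, X['X1']).score(others, X['X1'])
vif_X1 = 1 / (1 - r2)
print(vif_X1)
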
Detecting heteroscedasticity!
  1. Graphical Method: First do the regression analysis and then plot the error terms against the predicted values (Ŷi). If there is a definite pattern (linear, quadratic or funnel shaped) in the scatter plot then heteroscedasticity is present.
  2. Goldfeld Quandt (GQ) Test: It assumes that the heteroscedastic variance σi² is positively related to one of the explanatory variables, and the errors are assumed to be normal. Thus if heteroscedasticity is present then the variance would be high for large values of X.

Steps for GQ test:
  1. Order/ rank (ascending) the observations according to the value of Xi beginning with the lowest X value.
  2. Omit ‘c’ central observations and divide the remaining (n-c) observations into 2 groups of (n-c)/2 observations each.
  3. Fit separate OLS regression to both the groups and obtain residual sum of squares (RSS1 and RSS2) for both the groups.
  4. Obtain F = RSS2/ RSS1 
It follows an F distribution with ((n-c)/2 - k) d.f. for both the numerator and the denominator,
where k is the no. of parameters to be estimated including the intercept.
If the errors are homoscedastic then the two residual sums of squares RSS2 and RSS1 turn out to be roughly equal, i.e. F will tend to 1.
Dataset used:
We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (the CMS variable in the data) using:

  1. Cement - kg in a m3 mixture
  2. Blast Furnace Slag - kg in a m3 mixture
  3. Fly Ash - kg in a m3 mixture
  4. Water - kg in a m3 mixture
  5. Superplasticizer - kg in a m3 mixture
  6. Coarse Aggregate - kg in a m3 mixture
  7. Fine Aggregate - kg in a m3 mixture
  8. Age - Day (1-365)

Dataset - Download Data 

Importing the libraries:

Numpy, pandas and matplotlib.pyplot are imported with aliases np, pd and plt respectively.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading the data
We load our data using pd.read_csv( )
data = pd.read_csv("Concrete_Data.csv")
Now the data is divided into independent (x) and dependent variables (y)
x = data.iloc[:,0:8]
y = data.iloc[:,8:]

Splitting the data into training and test sets

Using sklearn we split 80% of our data into the training set and the rest into the test set. Setting random_state will give the same training and test split every time the code is run.
from sklearn.model_selection import train_test_split   # in older sklearn versions: sklearn.cross_validation
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 100) 

Running linear regression using sklearn
Using sklearn, linear regression can be carried out with the LinearRegression( ) class. sklearn automatically adds an intercept term to our model.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm = lm.fit(x_train,y_train)   #lm.fit(input,output)
The coefficients are given by:
lm.coef_
array([[ 0.12415357,  0.10366839,  0.093371  , -0.13429401,  0.28804259,
0.02065756, 0.02563037, 0.11461733]])

To store coefficients in a data frame along with their respective independent variables -
coefficients = pd.concat([pd.DataFrame(x_train.columns),pd.DataFrame(np.transpose(lm.coef_))], axis = 1)
0            Cement  0.124154
1             Blast  0.103668
2           Fly Ash  0.093371
3             Water -0.134294
4  Superplasticizer  0.288043
5                CA  0.020658
6                FA  0.025630
7               Age  0.114617
The intercept is:
lm.intercept_
array([-34.273527])
To predict the values of y on the test set we use lm.predict( )
y_pred = lm.predict(x_test)
Errors are the difference between observed and predicted values.
y_error = y_test - y_pred
R square can be obtained using sklearn.metrics:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
0.62252008774048395

Running linear regression using statsmodels:

It is to be noted that statsmodels does not add an intercept term automatically, thus we need to add an intercept to our model.
import statsmodels.api as sma
X_train = sma.add_constant(x_train) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test) 
Linear regression can be run by using OLS from statsmodels (imported above as sma):
lm2 = sma.OLS(y_train,X_train).fit()
The summary of our model can be obtained via:
lm2.summary()
"""
OLS Regression Results
==============================================================================
Dep. Variable: CMS R-squared: 0.613
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 161.0
Date: Wed, 03 Jan 2018 Prob (F-statistic): 4.37e-162
Time: 21:29:10 Log-Likelihood: -3090.4
No. Observations: 824 AIC: 6199.
Df Residuals: 815 BIC: 6241.
Df Model: 8
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -34.2735 29.931 -1.145 0.253 -93.025 24.478
Cement 0.1242 0.010 13.054 0.000 0.105 0.143
Blast 0.1037 0.011 9.229 0.000 0.082 0.126
Fly Ash 0.0934 0.014 6.687 0.000 0.066 0.121
Water -0.1343 0.046 -2.947 0.003 -0.224 -0.045
Superplasticizer 0.2880 0.102 2.810 0.005 0.087 0.489
CA 0.0207 0.011 1.966 0.050 2.79e-05 0.041
FA 0.0256 0.012 2.131 0.033 0.002 0.049
Age 0.1146 0.006 19.064 0.000 0.103 0.126
==============================================================================
Omnibus: 3.757 Durbin-Watson: 2.033
Prob(Omnibus): 0.153 Jarque-Bera (JB): 3.762
Skew: -0.165 Prob(JB): 0.152
Kurtosis: 2.974 Cond. No. 1.07e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
The predicted values for test set are given by:
y_pred2 = lm2.predict(X_test) 
Note that y_pred and y_pred2 are the same; they are just calculated via different packages.

Calculate R-Squared and Adjusted R-Squared Manually on Test data

We can also calculate r-squared and adjusted r-squared via formula without using any package.
import numpy as np
y_test = pd.to_numeric(y_test.CMS, errors='coerce')
RSS = np.sum((y_pred2 - y_test)**2)
y_mean = np.mean(y_test)
TSS = np.sum((y_test - y_mean)**2)
R2 = 1 - RSS/TSS
R2

n=X_test.shape[0]
p=X_test.shape[1] - 1

adj_rsquared = 1 - (1 - R2) * ((n - 1)/(n-p-1))
adj_rsquared

R-Squared : 0.6225
Adjusted RSquared : 0.60719

Detecting Outliers:
Firstly we try to get the studentized residuals using get_influence( ). The studentized residuals are saved in resid_student.
influence = lm2.get_influence() 
resid_student = influence.resid_studentized_external
Combining the training set and the residuals we have:
resid = pd.concat([x_train,pd.Series(resid_student,name = "Studentized Residuals")],axis = 1)
resid.head()
   Cement  Blast  Fly Ash  Water  Superplasticizer      CA     FA    Age  Studentized Residuals
0   540.0    0.0      0.0  162.0               2.5  1040.0  676.0   28.0               1.559672
1   540.0    0.0      0.0  162.0               2.5  1055.0  676.0   28.0              -0.917354
2   332.5  142.5      0.0  228.0               0.0   932.0  594.0  270.0               1.057443
3   332.5  142.5      0.0  228.0               0.0   932.0  594.0  365.0               0.637504
4   198.6  132.4      0.0  192.0               0.0   978.4  825.5  360.0              -1.170290
If the absolute value of a studentized residual is more than 3 then that observation is considered an outlier and hence should be removed. We create a logical filter for absolute studentized residuals greater than 3:
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]
     Cement  Blast  Fly Ash  Water  Superplasticizer     CA     FA  Age  Studentized Residuals
649   166.8  250.2      0.0  203.5               0.0  975.6  692.6  3.0               3.161183
The index of the outlier is given by ind:
ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:].index
ind
Int64Index([649], dtype='int64')

Dropping Outlier 
Using the drop( ) function we remove the outlier from our training sets!
y_train.drop(ind,axis = 0,inplace = True)
x_train.drop(ind,axis = 0,inplace = True)  #Intercept column is not there
X_train.drop(ind,axis = 0,inplace = True)  #Intercept column is there

Detecting and Removing Multicollinearity 
We use the statsmodels library to calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]
[15.477582601956859,
3.2696650121931814,
4.1293255012993439,
82.210084751631086,
5.21853674386234,
85.866945489015535,
71.816336942930675,
1.6861600968467656]

We create a function to remove the collinear variables. We choose a threshold of 5 which means if VIF is more than 5 for a particular variable then that variable will be removed.
def calculate_vif(x):
    thresh = 5.0
    output = pd.DataFrame()
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1,k):
        print("Iteration no.")
        print(i)
        print(vif)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:")
        print(a)
        if vif[a] <= thresh :
            break
        if i == 1 :         
            output = x.drop(x.columns[a], axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
        elif i > 1 :
            output = output.drop(output.columns[a],axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
    return(output)
train_out = calculate_vif(x_train) 
Now we view the training set
train_out.head()
     Cement  Blast  Fly Ash  Superplasticizer  Age
337   275.1    0.0    121.4               9.9   56
384   516.0    0.0      0.0               8.2   28
805   393.0    0.0      0.0               0.0   90
682   183.9  122.6      0.0               0.0   28
329   246.8    0.0    125.1              12.0    3

Removing the variables from the test set.

x_test.head()
x_test.drop(["Water","CA","FA"],axis = 1,inplace = True)
x_test.head()
     Cement  Blast  Fly Ash  Superplasticizer  Age
173   318.8  212.5      0.0              14.3   91
134   362.6  189.0      0.0              11.6   28
822   322.0    0.0      0.0               0.0   28
264   212.0    0.0    124.8               7.8    3
479   446.0   24.0     79.0              11.6    7

Running linear regression again on our new training set (without multicollinearity)
import statsmodels.api as sma
train_out = sma.add_constant(train_out) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)       ## Water, CA and FA were already dropped from x_test above
lm2 = sma.OLS(y_train,train_out).fit()
lm2.summary()
"""
OLS Regression Results
==============================================================================
Dep. Variable: CMS R-squared: 0.570
Model: OLS Adj. R-squared: 0.567
Method: Least Squares F-statistic: 216.3
Date: Wed, 10 Jan 2018 Prob (F-statistic): 6.88e-147
Time: 15:14:59 Log-Likelihood: -3128.8
No. Observations: 823 AIC: 6270.
Df Residuals: 817 BIC: 6298.
Df Model: 5
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -11.1119 1.915 -5.803 0.000 -14.871 -7.353
Cement 0.1031 0.005 20.941 0.000 0.093 0.113
Blast 0.0721 0.006 12.622 0.000 0.061 0.083
Fly Ash 0.0614 0.009 6.749 0.000 0.044 0.079
Superplasticizer 0.7519 0.077 9.739 0.000 0.600 0.903
Age 0.1021 0.006 16.582 0.000 0.090 0.114
==============================================================================
Omnibus: 0.870 Durbin-Watson: 2.090
Prob(Omnibus): 0.647 Jarque-Bera (JB): 0.945
Skew: 0.039 Prob(JB): 0.623
Kurtosis: 2.853 Cond. No. 1.59e+03
==============================================================================
"""

Checking normality of residuals
We use the Shapiro-Wilk test from the scipy library to check the normality of residuals.
  1. Null Hypothesis: The residuals are normally distributed.
  2. Alternative Hypothesis: The residuals are not normally distributed.
from scipy import stats
stats.shapiro(lm2.resid)
(0.9983407258987427, 0.6269884705543518)

Since the p-value is 0.6269, at the 5% level of significance we fail to reject the null hypothesis, so we can say that the residuals are normally distributed.

Checking for autocorrelation
To check for the absence of autocorrelation we use the Ljung-Box test.
  1. Null Hypothesis: Autocorrelation is absent.
  2. Alternative Hypothesis: Autocorrelation is present.
from statsmodels.stats import diagnostic as diag
diag.acorr_ljungbox(lm2.resid , lags = 1) 
(array([ 1.97177212]), array([ 0.16025989]))

Since the p-value is 0.1602, we fail to reject the null hypothesis and can say that autocorrelation is absent.

Checking heteroscedasticity
We use the Goldfeld-Quandt test to check for heteroscedasticity.
  1. Null Hypothesis: Error terms are homoscedastic
  2. Alternative Hypothesis: Error terms are heteroscedastic.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
lzip(name, test)
[('F statistic', 0.9903), ('p-value', 0.539)]

The p-value is 0.539 hence we can say that the residuals have constant variance. Hence we can say that all the assumptions of our linear regression model are satisfied.
31 Dec

NumPy Tutorial : Step by Step Guide

NumPy (short for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for speedy mathematical computation on arrays and matrices in Python. It is also quite useful while dealing with multi-dimensional data. It is a blessing for integrating C, C++ and FORTRAN tools. It also provides numerous functions for Fourier transforms (FT) and linear algebra.


Why NumPy instead of lists?
One might wonder why one should prefer NumPy arrays when we can create lists having the same data type. If that question rings a bell, then the following reasons may convince you:
  1. Numpy arrays have contiguous memory allocation, so the same data stored as a list will require more space than it does as an array.
  2. They are faster to work with and hence more efficient than lists.
  3. They are more convenient to deal with.

    NumPy vs. Pandas
    Pandas is built on top of NumPy. In other words, Numpy is required by pandas to make it work. So Pandas is not an alternative to Numpy. Instead, pandas offers additional methods and provides a more streamlined way of working with numerical and tabular data in Python.

    Importing numpy
    Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
    import numpy as np
    It is common practice to import numpy with the alias 'np'. If the alias is not provided then to access the functions from numpy we would have to write numpy.function every time. To make it easier the alias 'np' is introduced so we can write np.function. Some of the common functions of numpy are listed below -

    Function   Task
    array      Create a numpy array
    ndim       Dimension of the array
    shape      Size of the array (number of rows and columns)
    size       Total number of elements in the array
    dtype      Type of the elements in the array, e.g. int64, character
    reshape    Reshapes the array without changing the original array
    resize     Reshapes the array and also changes the original array
    arange     Create a sequence of numbers in an array
    itemsize   Size in bytes of each item
    diag       Create a diagonal matrix
    vstack     Stacking vertically
    hstack     Stacking horizontally
    1D array
    Using numpy an array is created by using np.array:
    a = np.array([15,25,14,78,96])
    a
    print(a)
    a
    Output: array([15, 25, 14, 78, 96])

    print(a)
    Output: [15 25 14 78 96]
    Notice that in np.array square brackets are present. Absence of square bracket introduces an error. To print the array we can use print(a).

    Changing the datatype
    np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
    a.dtype
    a = np.array([15,25,14,78,96],dtype = "float")
    a
    a.dtype
    Initially datatype of 'a' was 'int32' which on modifying becomes 'float64'.

    Creating the sequence of numbers
    If you want to create a sequence of numbers then using np.arange we can get our sequence. To get the sequence of numbers from 20 to 29 we run the following command.
    b = np.arange(start = 20,stop = 30) 
    b
    array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
     In np.arange the end point is always excluded.

    Create an Arithmetic Progression 
    np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

    Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, upto 30; 30 being excluded.
    c = np.arange(20,30,2)   #30 is excluded.
    array([20, 22, 24, 26, 28])
    It is to be taken care that in np.arange( ) the stop argument is always excluded.

    Reshaping the arrays
    To reshape the array we can use reshape( ). 
    f = np.arange(101,113)
    f.reshape(3,4)
    f
     array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

    Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( ) 
    f.resize(3,4)
    f
    array([[101, 102, 103, 104],
    [105, 106, 107, 108],
    [109, 110, 111, 112]])

    If a dimension is given as -1 in a reshaping, the other dimension is calculated automatically, provided that the total number of elements in the array is divisible by the given dimension.
    f.reshape(3,-1)
    array([[101, 102, 103, 104],
    [105, 106, 107, 108],
    [109, 110, 111, 112]])

    In the above code we only directed that we will have 3 rows. Python automatically calculates the number of elements in other dimension i.e. 4 columns.

    2D arrays
    A 2D array in numpy can be created in the following manner:
    g = np.array([(10,20,30),(40,50,60)])
    #Alternatively
    g = np.array([[10,20,30],[40,50,60]])
    g
    The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
    g.ndim
    g.size
    g.shape
    g.ndim
    Output: 2

    g.size
    Output: 6

    g.shape
    Output: (2, 3)

    Creating some usual matrices.
    numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
    To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
    np.zeros( (2,4) )
    array([[ 0.,  0.,  0.,  0.],
    [ 0., 0., 0., 0.]])
    Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
    np.zeros([2,4],dtype=np.int16)       
    array([[0, 0, 0, 0],
    [0, 0, 0, 0]], dtype=int16)
    To create a matrix without initializing its entries we write np.empty( ). The values it displays are whatever happened to be in memory (not random numbers between 0 and 1), so they are not meaningful.
    np.empty( (2,3) )       
    array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
    [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
    Note: The results may vary every time you run np.empty.
    To create a matrix of all ones we write np.ones( ). We can create a 3 * 3 matrix of all ones by:
    np.ones([3,3])
    array([[ 1.,  1.,  1.],
    [ 1., 1., 1.],
    [ 1., 1., 1.]])
    To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
    np.diag([14,15,16,17])
    array([[14,  0,  0,  0],
    [ 0, 15, 0, 0],
    [ 0, 0, 16, 0],
    [ 0, 0, 0, 17]])
    To create an identity matrix we can use np.eye( ) .
    np.eye(5,dtype = "int")
    array([[1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1]])
    By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

    Reshaping 2D arrays
    To get a flattened 1D array we can use ravel( ) 
    g = np.array([(10,20,30),(40,50,60)])
    g.ravel()  
     array([10, 20, 30, 40, 50, 60])
    To change the shape of 2D array we can use reshape. Writing -1 will calculate the other dimension automatically and does not modify the original array.
    g.reshape(3,-1)  # returns the array with a modified shape
    #It does not modify the original array
    g.shape
     (2, 3)
    Similar to 1D arrays, using resize( ) will modify the shape in the original array.
    g.resize((3,2))
    g   #resize modifies the original array
    array([[10, 20],
    [30, 40],
    [50, 60]])

    Time for some matrix algebra
    Let us create some arrays A,b and B and they will be used for this section:
    A = np.array([[2,0,1],[4,3,8],[7,6,9]])
    b = np.array([1,101,14])
    B = np.array([[10,20,30],[40,50,60],[70,80,90]])
     In order to get the transpose, trace and inverse we use A.transpose( ) , np.trace( ) and np.linalg.inv( ) respectively.
    A.T             #transpose
    A.transpose()  #transpose
    np.trace(A)  # trace
    np.linalg.inv(A)  #Inverse
    A.transpose()  #transpose
    Output:
    array([[2, 4, 7],
    [0, 3, 6],
    [1, 8, 9]])

    np.trace(A) # trace
    Output: 14

    np.linalg.inv(A) #Inverse
    Output:
    array([[ 0.53846154, -0.15384615, 0.07692308],
    [-0.51282051, -0.28205128, 0.30769231],
    [-0.07692308, 0.30769231, -0.15384615]])
    Note that transpose does not modify the original array.

    Matrix addition and subtraction can be done in the usual way:
    A+B
    A-B
    A+B
    Output:
    array([[12, 20, 31],
    [44, 53, 68],
    [77, 86, 99]])

    A-B
    Output:
    array([[ -8, -20, -29],
    [-36, -47, -52],
    [-63, -74, -81]])
    Matrix multiplication of A and B can be accomplished by A.dot(B). Where A will be the 1st matrix on the left hand side and B will be the second matrix on the right side.
    A.dot(B)
    array([[  90,  120,  150],
    [ 720, 870, 1020],
    [ 940, 1160, 1380]])
    To solve the system of linear equations: Ax = b we use np.linalg.solve( )
    np.linalg.solve(A,b)
    array([-13.92307692, -24.69230769,  28.84615385])
    The eigen values and eigen vectors can be calculated using np.linalg.eig( ) 
    np.linalg.eig(A)
    (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
    array([[-0.06599631, -0.78226966, -0.14996331],
    [-0.59939873, 0.54774477, -0.81748379],
    [-0.7977253 , 0.29669824, 0.55608566]]))
    The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the eigenvalue in the same position.

    Some Mathematics functions

    We can have various trigonometric functions like sin, cosine etc. using numpy:
    B = np.array([[0,-20,36],[40,50,1]])
    np.sin(B)
    array([[ 0.        , -0.91294525, -0.99177885],
    [ 0.74511316, -0.26237485, 0.84147098]])
    The resultant is the matrix of all sin( ) elements.
    In order to raise each element to a power we use **
    B**2
    array([[   0,  400, 1296],
    [1600, 2500, 1]], dtype=int32)
    We get the matrix of the square of all elements of B.
    In order to obtain if a condition is satisfied by the elements of a matrix we need to write the criteria. For instance, to check if the elements of B are more than 25 we write:
    B>25
    array([[False, False,  True],
    [ True, True, False]], dtype=bool)
    We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
    In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.
    np.absolute(B)
    np.sqrt(B)
    np.exp(B)
    Now we consider a matrix A of shape 3*3:
    A = np.arange(1,10).reshape(3,3)
    A
    array([[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]])
    To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
    A.sum()
    A.min()
    A.max()
    A.mean()
    A.std()   #Standard deviation
    A.var()  #Variance
    A.sum()
    Output: 45

    A.min()
    Output: 1

    A.max()
    Output: 9

    A.mean()
    Output: 5.0

    A.std() #Standard deviation
    Output: 2.5819888974716112

    A.var()
    Output: 6.666666666666667
    In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
    A.argmin()
    A.argmax()
    A.argmin()
    Output: 0

    A.argmax()
    Output: 8
    If we wish to find the above statistics for each row or column then we need to specify the axis:
    A.sum(axis=0)             
    A.mean(axis = 0)
    A.std(axis = 0)
    A.argmin(axis = 0)
    A.sum(axis=0)                 # sum of each column, it will move in downward direction
    Output: array([12, 15, 18])

    A.mean(axis = 0)
    Output: array([ 4., 5., 6.])

    A.std(axis = 0)
    Output: array([ 2.44948974, 2.44948974, 2.44948974])

    A.argmin(axis = 0)
    Output: array([0, 0, 0], dtype=int64)
    By defining axis = 0, calculations will move in downward direction i.e. it will give the statistics for each column.
    To find the min of each row and the index of the maximum element for each row, we need to move in the rightwise direction, so we write axis = 1:
    A.min(axis=1)
    A.argmax(axis = 1)
    A.min(axis=1)                  # min of each row, it will move in rightwise direction
    Output: array([1, 4, 7])

    A.argmax(axis = 1)
    Output: array([2, 2, 2], dtype=int64)
       To find the cumulative sum along each row we use cumsum( )
    A.cumsum(axis=1)
    array([[ 1,  3,  6],
    [ 4, 9, 15],
    [ 7, 15, 24]], dtype=int32)

    Creating 3D arrays
    Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
    X = np.array( [[[  1, 2,3],             
                    [ 4, 5, 6]],
                   [[7,8,9],
                    [10,11,12]]])
    X.shape
    X.ndim
    X.size
    X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
    To calculate the sum along a particular axis we use the axis parameter as follows:
    X.sum(axis = 0)
    X.sum(axis = 1)
    X.sum(axis = 2)
    X.sum(axis = 0)
    Output:
    array([[ 8, 10, 12],
    [14, 16, 18]])

    X.sum(axis = 1)
    Output:
    array([[ 5, 7, 9],
    [17, 19, 21]])

    X.sum(axis = 2)
    Output:
    array([[ 6, 15],
    [24, 33]])
    axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
    X.ravel()
     array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
    ravel( ) writes all the elements in a single array.

    Indexing in arrays
    It is important to note that Python indexing starts from 0. The syntax of indexing is as follows -
    1. x[start:end] : Elements in array x start through the end (but the end is excluded)
    2. x[start:]       : Elements start through the end
    3. x[:end]        : Elements from the beginning through the end (but the end is excluded)

    If we want to extract 3rd element we write the index as 2 as it starts from 0.
    x = np.arange(10)
    x[2]
    x[2:5]
     x[2]
    Output: 2

    x[2:5]
    Output: array([2, 3, 4])
    Note that in x[2:5] elements starting from 2nd index upto 5th index(exclusive) are selected.
    If we want to change the value of every third element from the start up to index 7 (7 excluded) to 123, we write:
    x[:7:3] = 123
    x
     array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
    To reverse a given array we write:
    x = np.arange(10)
    x[ : :-1]                                 # reversed x
    array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
    Note that the above command does not modify the original array.
    Consider a 3D array:
    X = np.array( [[[  1, 2,3],           
                    [ 4, 5, 6]],
                   [[7,8,9],
                    [10,11,12]]])
    To extract the 2nd matrix we write:
    X[1,...]                                   # same as X[1,:,:] or X[1]
    array([[ 7,  8,  9],
    [10, 11, 12]])
    Remember python indexing starts from 0, which is why we wrote 1 to extract the 2nd 2D array.
    To extract the first element from all the rows we write:
    X[...,0]                                   # same as X[:,:,0] 
    array([[ 1,  4],
    [ 7, 10]])

    Indexing with Arrays of Indices
    Consider a 1D array.
    x = np.arange(11,35,2)                   
    x
    array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
    We form a 1D array i which subsets the elements of x as follows:
    i = np.array( [0,1,5,3,7,9 ] )
    x[i]                                    
    array([11, 13, 21, 17, 25, 29])
    In a similar manner we create a 2D array j of indices to subset x.
    j = np.array( [ [ 0, 1], [ 6, 2 ] ] )   
    x[j]                                      
    array([[11, 13],
    [23, 15]])
    Similarly we can create both  i and j as 2D arrays of indices for x
    x = np.arange(15).reshape(3,5)
    x
    i = np.array( [ [0,1],                        # indices for the first dim
                    [2,0] ] )
    j = np.array( [ [1,1],                        # indices for the second dim
                    [2,0] ] )
    To get the ith index in row and jth index for columns we write:
    x[i,j]                                     # i and j must have equal shape
    array([[ 1,  6],
    [12, 0]])
    To extract ith index from 3rd column we write:
    x[i,2]
    array([[ 2,  7],
    [12, 2]])
    For each row if we want to find the jth index we write:
    x[:,j]                                    
    array([[[ 1,  1],
    [ 2, 0]],

    [[ 6, 6],
    [ 7, 5]],

    [[11, 11],
    [12, 10]]])
    That is, we fix the 1st row and take the elements at the j indices, then the 2nd row and the j indices, then the 3rd row and the j indices.

    You can also use indexing with arrays to assign the values:
    x = np.arange(10)
    x
    x[[4,5,8,1,2]] = 0
    x
    array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
    0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
    When the list of indices contains repetitions then it assigns the last value to that index:
    x = np.arange(10)
    x
    x[[4,4,2,3]] = [100,200,300,400]
    x
    array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
    Notice that for the 5th element(i.e. 4th index) the value assigned is 200, not 100.
    Caution: If one is using += operator on repeated indices then it carries out the operator only once on repeated indices.
    x = np.arange(10)
    x[[1,1,1,7,7]]+=1
    x
     array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
    Although indices 1 and 7 are repeated, they are incremented only once.

    Indexing with Boolean arrays 
    We create a 2D array and store our condition in b. Where the condition is true it results in True, otherwise False.
    a = np.arange(12).reshape(3,4)
    b = a > 4
    b                                    
    array([[False, False, False, False],
    [False, True, True, True],
    [ True, True, True, True]], dtype=bool)
    Note that 'b' is a Boolean with same shape as that of 'a'.
    To select the elements from 'a' which adhere to condition 'b' we write:
    a[b]                                       
    array([ 5,  6,  7,  8,  9, 10, 11])
    The result of a[b] is a 1D array containing the selected elements.
    This property can be very useful in assignments:
    a[b] = 0                                 
    a
    array([[0, 1, 2, 3],
    [4, 0, 0, 0],
    [0, 0, 0, 0]])
    All elements of 'a' higher than 4 become 0
    As done in integer indexing we can use indexing via Booleans:
    Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
    x = np.arange(15).reshape(3,5)
    y = np.array([True,True,False])             # first dim selection
    z = np.array([True,True,False,True,False])       # second dim selection
    We write the x[y,:] which will select only those rows where y is True.
    x[y,:]                                   # selecting rows
    x[y]                                     # same thing
    Writing x[:,z] will select only those columns where z is True.
    x[:,z]                                   # selecting columns
    x[y,:]                                   # selecting rows
    Output:
    array([[0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9]])

    x[y] # same thing
    Output:
    array([[0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9]])

    x[:,z] # selecting columns
    Output:
    array([[ 0, 1, 3],
    [ 5, 6, 8],
    [10, 11, 13]])

    Stacking various arrays
    Let us consider 2 arrays A and B:
    A = np.array([[10,20,30],[40,50,60]])
    B = np.array([[100,200,300],[400,500,600]])
    To join them vertically we use np.vstack( ).
    np.vstack((A,B))   #Stacking vertically
    array([[ 10,  20,  30],
    [ 40, 50, 60],
    [100, 200, 300],
    [400, 500, 600]])
    To join them horizontally we use np.hstack( ).
    np.hstack((A,B))   #Stacking horizontally
    array([[ 10,  20,  30, 100, 200, 300],
    [ 40, 50, 60, 400, 500, 600]])
    newaxis helps in transforming a 1D array into a 2D column vector.
    from numpy import newaxis
    a = np.array([4.,1.])
    b = np.array([2.,8.])
    a[:,newaxis]  
    array([[ 4.],
    [ 1.]])
    #The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. It is equivalent to hstack only for 1D arrays:
    np.column_stack((a[:,newaxis],b[:,newaxis]))
    np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
    np.column_stack((a[:,newaxis],b[:,newaxis]))
    Output:
    array([[ 4., 2.],
    [ 1., 8.]])

    np.hstack((a[:,newaxis],b[:,newaxis]))
    Output:
    array([[ 4., 2.],
    [ 1., 8.]])

    Splitting the arrays
    Consider an array 'z' of 15 elements:
    z = np.arange(1,16)
    Using np.hsplit( ) one can split the arrays
    np.hsplit(z,5)   # Split a into 5 arrays
    [array([1, 2, 3]),
    array([4, 5, 6]),
    array([7, 8, 9]),
    array([10, 11, 12]),
    array([13, 14, 15])]
    It splits 'z' into 5 arrays of equal length.
    On passing 2 elements we get:
    np.hsplit(z,(3,5))   
    [array([1, 2, 3]),
    array([4, 5]),
    array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
    It splits 'z' after the third and the fifth element.
    For 2D arrays np.hsplit( ) works as follows:
    A = np.arange(1,31).reshape(3,10)
    A
    np.hsplit(A,5)   # Split a into 5 arrays
    [array([[ 1,  2],
    [11, 12],
    [21, 22]]), array([[ 3, 4],
    [13, 14],
    [23, 24]]), array([[ 5, 6],
    [15, 16],
    [25, 26]]), array([[ 7, 8],
    [17, 18],
    [27, 28]]), array([[ 9, 10],
    [19, 20],
    [29, 30]])]
    In the above command A gets split into 5 arrays of same shape.
    To split after the third and the fifth column we write:
    np.hsplit(A,(3,5))   
    [array([[ 1,  2,  3],
    [11, 12, 13],
    [21, 22, 23]]), array([[ 4, 5],
    [14, 15],
    [24, 25]]), array([[ 6, 7, 8, 9, 10],
    [16, 17, 18, 19, 20],
    [26, 27, 28, 29, 30]])]

    Copying
    Consider an array x
    x = np.arange(1,16)
    We assign y as x and then check whether y is x:
    y = x
    y is x
    True
    Let us change the shape of y
    y.shape = 3,5  
    Note that it alters the shape of x
    x.shape
    (3, 5)

    Creating a view of the data
    Let us store z as a view of x by:
    z = x.view()
    z is x   
    False
    Thus z is not x.
    Changing the shape of z
    z.shape = 5,3                   
    Creating a view does not alter the shape of x
    x.shape
    (3, 5)
    Changing an element in z
    z[0,0] = 1234                      
    Note that the value in x also gets altered:
    x
    array([[1234,    2,    3,    4,    5],
    [ 6, 7, 8, 9, 10],
    [ 11, 12, 13, 14, 15]])
    Thus changing the shape of a view does not alter the original array, but changing values through the view does affect the original data.


    Creating a copy of the data:
    Now let us create z as a copy of x:
    z = x.copy()                        
    Note that z is not x:
    z is x
    False
    Changing the value in z
    z[0,0] = 9999
    No alterations are made in x.
    x
    array([[1234,    2,    3,    4,    5],
    [ 6, 7, 8, 9, 10],
    [ 11, 12, 13, 14, 15]])
    Pandas sometimes gives a 'setting with copy' warning because it is unable to recognize whether the new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations the user needs to make the copy (or view) explicit, otherwise the results may not be what was intended.
    26 Dec

    Pandas Python Tutorial - Learn by Example

    Pandas, being one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

    The Pandas library has made a great contribution to the Python community and it makes Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.


    Why pandas?
    It has many functions which are the essence for data handling. In short, it can perform the following tasks for you -
    1. Create a structured data set similar to R's data frame and Excel spreadsheet.
    2. Reading data from various sources such as CSV, TXT, XLSX, SQL database, R etc.
    3. Selecting particular rows or columns from data set
    4. Arranging data in ascending or descending order
    5. Filtering data based on some conditions
    6. Summarizing data by classification variable
    7. Reshape data into wide or long format
    8. Time series analysis
    9. Merging and concatenating two datasets
    10. Iterate over the rows of dataset
    11. Writing or Exporting data in CSV or Excel format

    Datasets:

    In this tutorial we will use two datasets: 'income' and 'iris'.
    1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
    2. 'iris' data : It comprises 150 observations with 5 variables. We have 3 species of flowers (50 flowers for each species) and for all of them the sepal length, sepal width, petal length and petal width are given. Download link


    Important pandas functions to remember

    The following is a list of common tasks along with pandas functions.
    Utility Functions
    To select a random number of rows : df.sample(n = 10)
    To select a fraction of random rows : df.sample(frac = 0.2)
    To rename the variables : df.rename( )
    Selecting a column as index : df.set_index( )
    Removing rows or columns : df.drop( )
    Sorting values : df.sort_values( )
    Grouping variables : df.groupby( )
    Filtering : df.query( )
    Finding the missing values : df.isnull( )
    Dropping the missing values : df.dropna( )
    Removing the duplicates : df.drop_duplicates( )
    Creating dummies : pd.get_dummies( )
    Ranking : df.rank( )
    Cumulative sum : df.cumsum( )
    Quantiles : df.quantile( )
    Selecting numeric variables : df.select_dtypes( )
    Concatenating two dataframes : pd.concat()
    Merging on the basis of a common variable : pd.merge( )

    Importing pandas library

    You need to import or load the Pandas library first in order to use it. 'Importing a library' means loading it into memory so that its functions become available. Run the following code to import the pandas library:
    import pandas as pd
    The "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from pandas library, you just need to type pd.function instead of  pandas.function every time you need to apply it.

    Importing Dataset

    To read or import data from CSV file, you can use read_csv() function. In the function, you need to specify the file location of your CSV file.
    income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
     Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
    4 C California 1685349 1675807 1889570 1480280 1735069 1812546

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
    0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607
    Knowing the Variable types

    You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
    income.dtypes 
    Index    object
    State object
    Y2002 int64
    Y2003 int64
    Y2004 int64
    Y2005 int64
    Y2006 int64
    Y2007 int64
    Y2008 int64
    Y2009 int64
    Y2010 int64
    Y2011 int64
    Y2012 int64
    Y2013 int64
    Y2014 int64
    Y2015 int64
    dtype: object

    Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

    To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
    income['State'].dtypes
    It returns dtype('O'). In this case, 'O' refers to object i.e. type of variable as character.

    Changing the data types

    Y2008 is an integer. Suppose we want to convert it to float (numeric variable with decimals) we can write:
    income.Y2008 = income.Y2008.astype(float)
    income.dtypes
    Index     object
    State object
    Y2002 int64
    Y2003 int64
    Y2004 int64
    Y2005 int64
    Y2006 int64
    Y2007 int64
    Y2008 float64
    Y2009 int64
    Y2010 int64
    Y2011 int64
    Y2012 int64
    Y2013 int64
    Y2014 int64
    Y2015 int64
    dtype: object

    To view the dimensions or shape of the data
    income.shape
     (51, 16)

    51 is the number of rows and 16 is the number of columns.

    You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 
    income.shape[0]
    income.shape[1]

    To view only some of the rows

    By default head( ) shows the first 5 rows. If we want to see a specific number of rows, we can mention it in the parentheses. Similarly, the tail( ) function shows the last 5 rows by default.
    income.head()
    income.head(2)
    income.tail()
    income.tail(2)

    Alternatively, the command income[0:5] can be used to fetch first five rows.

    Extract Unique Values

    The unique() function shows the unique levels or categories of a column.
    income.Index.unique()
    array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)


    The nunique( ) shows the number of unique values.
    income.Index.nunique()
    It returns 19, as the Index column contains 19 distinct values.

    Generate Cross Tab

    pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.
    pd.crosstab(income.Index,income.State)

    Creating a frequency distribution

    income.Index selects the 'Index' column of the 'income' dataset, and value_counts( ) creates a frequency distribution. By default ascending = False, i.e. the 'Index' value with the maximum frequency is shown at the top.
    income.Index.value_counts(ascending = True)
    F    1
    G 1
    U 1
    L 1
    H 1
    P 1
    R 1
    D 2
    T 2
    S 2
    V 2
    K 2
    O 3
    C 3
    I 4
    W 4
    A 4
    M 8
    N 8
    Name: Index, dtype: int64

    To draw the samples
    income.sample( ) is used to draw random samples (rows) from the dataset, keeping all the columns. Here n = 5 means we want 5 rows, and frac = 0.1 means we want 10 percent of the rows as the sample.
    income.sample(n = 5)
    income.sample(frac = 0.1)
    Selecting only a few of the columns
    To select only specific columns we use either the loc[ ] or iloc[ ] commands. The rows or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.
    income.loc[:,["Index","State","Y2008"]]
    income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
    #In the above command both Index and Y2008 are included.
    income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included
    The difference between loc and iloc is that loc selects rows and columns by their labels (names) and includes the end of a slice, while iloc selects them by their integer positions and excludes the end of a slice.
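    A small illustration of that difference on the same income data (assuming the default integer index is still in place): both selections below return the columns Y2002 to Y2004, but loc keeps one extra row because its end label is included.
    income.loc[0:2, "Y2002":"Y2004"]   #rows labelled 0, 1, 2 and columns Y2002 to Y2004 (end label included)
    income.iloc[0:2, 2:5]              #rows at positions 0, 1 and columns at positions 2, 3, 4 (end position excluded)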

    Renaming the variables
    We create a dataframe 'data' for information of people and their respective zodiac signs.
    data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
    data 
           A          B
    0 John Libra
    1 Mary Capricorn
    2 Julia Aries
    3 Kenny Scorpio
    4 Henry Aquarius
    If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
    #Renaming all the variables.
    data.columns = ['Names','Zodiac Signs']

       Names Zodiac Signs
    0 John Libra
    1 Mary Capricorn
    2 Julia Aries
    3 Kenny Scorpio
    4 Henry Aquarius
    If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
    #Renaming only some of the variables.
    data.rename(columns = {"Names":"Cust_Name"},inplace = True)
      Cust_Name Zodiac Signs
    0 John Libra
    1 Mary Capricorn
    2 Julia Aries
    3 Kenny Scorpio
    4 Henry Aquarius
    By default in pandas inplace = False, which means that no changes are made to the original dataset; a modified copy is returned instead. Thus if we wish to alter the original dataset we need to set inplace = True.
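    A short sketch of that difference using the same 'data' frame ("Customer" is just a made-up new name):
    renamed = data.rename(columns = {"Cust_Name" : "Customer"})          #original 'data' is untouched; the renamed copy is returned
    data.rename(columns = {"Cust_Name" : "Customer"}, inplace = True)    #original 'data' itself is modified; nothing is returned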

    Suppose we want to replace a particular character in the column names; then we can use the str.replace( ) function. For example, replacing the "Y" in the variable names with "Year ":
    income.columns = income.columns.str.replace('Y' , 'Year ')
    income.columns
    Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
    'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
    'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
    dtype='object')

    Setting one column in the data frame as the index
    Using set_index("column name") we can set the indices as that column and that column gets removed.
    income.set_index("Index",inplace = True)
    income.head()
    #Note that the indices have changed and Index column is now no more a column
    income.columns
    income.reset_index(inplace = True)
    income.head()
    reset_index( ) restores the default integer index (and turns the old index back into a regular column).

    Removing the columns and rows
    To drop rows or columns we use drop( ), where the first argument is the label (or list of labels) to be removed. By default axis = 0, which means the labels are matched against the row index, i.e. rows are removed. To remove a column we need to set axis = 1.
    income.drop('Index',axis = 1)

    #Alternatively
    income.drop("Index",axis = "columns")
    income.drop(['Index','State'],axis = 1)
    income.drop(0,axis = 0)
    income.drop(0,axis = "index")
    income.drop([0,1,2,3],axis = 0)
    Also, inplace = False by default, so no alterations are made to the original dataset. axis = "columns" and axis = "index" mean that a column or a row (index label), respectively, should be removed.

    Sorting the data
    To sort the data, the sort_values( ) function is used. By default inplace = False and ascending = True.
    income.sort_values("State",ascending = False)
    income.sort_values("State",ascending = False,inplace = True)
    income.Y2006.sort_values() 
    The Index column contains duplicate values, so we first sort the dataframe by Index and then, within each Index value, sort by Y2002:
    income.sort_values(["Index","Y2002"]) 
    Some arithmetic operations
    Using eval( ), arithmetic operations on various columns of a dataset can be carried out.
    income["difference"] = income.Y2008-income.Y2009

    #Alternatively
    income["difference2"] = income.eval("Y2008 - Y2009")
    income.head()
      Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
    4 C California 1685349 1675807 1889570 1480280 1735069 1812546

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
    0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607

    difference difference2
    0 1056.0 1056.0
    1 115285.0 115285.0
    2 198556.0 198556.0
    3 -440876.0 -440876.0
    4 -176494.0 -176494.0

    income.ratio = income.Y2008/income.Y2009
    The above command does not create a new column; it merely attaches an attribute to the dataframe object. To create new columns we need to use square brackets.
    We can also use the assign( ) function, but it does not change the original data since there is no inplace parameter; it returns a new dataframe, which we need to save.
    data = income.assign(ratio = (income.Y2008 / income.Y2009))
    data.head()

    Finding Descriptive Statistics
    describe( ) is used to find summary statistics such as the mean, minimum, quartiles, etc. for numeric variables.
    income.describe() #for numeric variables
    To find the total count, the most frequently occurring string and its frequency, we write include = ['object']:
    income.describe(include = ['object'])  #Only for strings / objects
    Mean, median, maximum and minimum can be obtained for a particular column(s) as:
    income.Y2008.mean()
    income.Y2008.median()
    income.Y2008.min()
    income.loc[:,["Y2002","Y2008"]].max()

    Groupby function
    To group the data by a categorical variable we use the groupby( ) function, and we can then perform operations on each category.
    income.groupby("Index").Y2008.min()
    income.groupby("Index")["Y2008","Y2010"].max()
    The agg( ) function is used to apply several summary functions at once to a given variable.
    income.groupby("Index").Y2002.agg(["count","min","max","mean"])
    income.groupby("Index")["Y2002","Y2003"].agg(["count","min","max","mean"])
    The following command finds the minimum and maximum for Y2002 and only the mean for Y2003:
    income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
    Index    Y2002 min    Y2002 max    Y2003 mean
    A 1170302 1742027 1810289.000
    C 1343824 1685349 1595708.000
    D 1111437 1330403 1631207.000
    F 1964626 1964626 1468852.000
    G 1929009 1929009 1541565.000
    H 1461570 1461570 1200280.000
    I 1353210 1776918 1536164.500
    K 1509054 1813878 1369773.000
    L 1584734 1584734 1110625.000
    M 1221316 1983285 1535717.625
    N 1395149 1885081 1382499.625
    O 1173918 1802132 1569934.000
    P 1320191 1320191 1446723.000
    R 1501744 1501744 1942942.000
    S 1159037 1631522 1477072.000
    T 1520591 1811867 1398343.000
    U 1771096 1771096 1195861.000
    V 1134317 1146902 1498122.500
    W 1677347 1977749 1521118.500

    Filtering
    To filter only those rows which have Index as "A" we write:
    income[income.Index == "A"]

    #Alternatively
    income.loc[income.Index == "A",:]
      Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
    0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    To select the States having Index as "A":
    income.loc[income.Index == "A","State"]
    income.loc[income.Index == "A",:].State
    To filter the rows with Index "A" and income for 2002 greater than 1500000:
    income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
    To filter the rows with index either "A" or "W", we can use isin( ) function:
    income.loc[(income.Index == "A") | (income.Index == "W"),:]

    #Alternatively.
    income.loc[income.Index.isin(["A","W"]),:]
       Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
    47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
    48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
    49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
    50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
    0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
    48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
    49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
    50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
    Alternatively we can use query( ) function and write our filtering criteria:
    income.query('Y2002>1700000 & Y2003 > 1500000')

    Dealing with missing values
    We create a new dataframe named 'crops'; to create a NaN value we use np.nan after importing numpy.
    import numpy as np
    mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
            'Yield': [1010, 1025.2, 1404.2, 1251.7],
            'cost' : [102, np.nan, 20, 68]}
    crops = pd.DataFrame(mydata)
    crops
    isnull( ) returns True and notnull( ) returns False if the value is NaN.
    crops.isnull()  #same as is.na in R
    crops.notnull()  #opposite of previous command.
    crops.isnull().sum()  #No. of missing values.
    crops.cost.isnull() first selects the 'cost' column from the dataframe and then isnull( ) returns a logical vector:

    crops[crops.cost.isnull()] #shows the rows with NAs.
    crops[crops.cost.isnull()].Crop #shows the rows with NAs in crops.Crop
    crops[crops.cost.notnull()].Crop #shows the rows without NAs in crops.Crop
    To drop all the rows that have a missing value in any column we use dropna(how = "any"). By default inplace = False. how = "all" means a row is dropped only if all the elements in that row are missing.

    crops.dropna(how = "any").shape
    crops.dropna(how = "all").shape  
    To drop rows where any of 'Yield' or 'cost' is missing, we use the subset parameter and pass a list (with how = 'all', both would have to be missing):
    crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
    crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
    Replacing the missing values in the 'cost' column with "UNKNOWN":
    crops['cost'].fillna(value = "UNKNOWN",inplace = True)
    crops

    Dealing with duplicates
    We create a new dataframe comprising of items and their respective prices.
    data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
    data
                 Items  Price
    0 TV 10000
    1 Washing Machine 50000
    2 Mobile 20000
    3 TV 10000
    4 TV 10000
    5 Washing Machine 40000
    duplicated() returns a logical vector that is True whenever a duplicated row is encountered.
    data.loc[data.duplicated(),:]
    data.loc[data.duplicated(keep = "first"),:]
    By default keep = 'first', i.e. the first occurrence is considered unique and its repetitions are considered duplicates.
    If keep = "last", the last occurrence is considered unique and all of its other repetitions are considered duplicates.
    data.loc[data.duplicated(keep = "last"),:] #the last occurrences are not flagged, so different indices appear than with keep = 'first'
    If keep = False, all occurrences of the repeated observations are considered duplicates.
    data.loc[data.duplicated(keep = False),:]  #all occurrences of the duplicated rows, including the first, are shown
    To drop the duplicates, drop_duplicates( ) is used with inplace = False by default; keep = 'first', 'last' or False has the same meaning as in duplicated( ).
    data.drop_duplicates(keep = "first")
    data.drop_duplicates(keep = "last")
    data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False
    data

    Creating dummies
    Now we will consider the iris dataset
    iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
    iris.head()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0 5.1 3.5 1.4 0.2 setosa
    1 4.9 3.0 1.4 0.2 setosa
    2 4.7 3.2 1.3 0.2 setosa
    3 4.6 3.1 1.5 0.2 setosa
    4 5.0 3.6 1.4 0.2 setosa
    The map( ) function matches the values and replaces them in a new series that is created automatically.
    iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
    iris.head()
    To create dummies, get_dummies( ) is used. The argument prefix = "Species" adds the prefix 'Species' to the names of the new columns created.
    pd.get_dummies(iris.Species,prefix = "Species")
    pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included
    species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
    With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
    iris = pd.concat([iris,species_dummies],axis = 1)
    iris.head()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
    0 5.1 3.5 1.4 0.2 setosa
    1 4.9 3.0 1.4 0.2 setosa
    2 4.7 3.2 1.3 0.2 setosa
    3 4.6 3.1 1.5 0.2 setosa
    4 5.0 3.6 1.4 0.2 setosa

    Species_setosa Species_versicolor Species_virginica
    0 1 0 0
    1 1 0 0
    2 1 0 0
    3 1 0 0
    4 1 0 0
    It is usual that for a variable with 'n' categories we create 'n-1' dummies; to drop the first dummy column we write drop_first = True:
    pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

    Ranking
     To create a dataframe of all the ranks we use rank( )
    iris.rank() 
    Ranking by a specific variable
    Suppose we want to rank the Sepal.Length for different species in ascending order:
    iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
    iris.head( )

    #Alternatively
    iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
    iris.head()

    Calculating the Cumulative sum
    Using cumsum( ) function we can obtain the cumulative sum
    iris['cum_sum'] = iris["Sepal.Length"].cumsum()
    iris.head()
    Cumulative sum by a variable
    To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
    iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
    iris.head()

    Calculating the percentiles.
    Various quantiles can be obtained by using quantile( )
    iris.quantile(0.5)
    iris.quantile([0.1,0.2,0.5])
    iris.quantile(0.55)

    if else
    We create a new dataframe of students' name and their respective zodiac signs.
    students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                             'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
    After importing numpy we can use np.where. The first argument is the condition to be evaluated, the second argument is the value if the condition is True, and the last argument is the value if the condition is False.
    import numpy as np
    students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
    students
          Names Zodiac Signs flag
    0 John Aquarius yes
    1 Mary Libra no
    2 Henry Gemini yes
    3 Augustus Pisces no
    4 Kenny Virgo no

    Multiple conditions for if else
    We create a list of conditions and a corresponding list of values to use when each condition is True, and apply np.select, whose default argument gives the value used when none of the conditions holds.
    conditions = [
        (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
        (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
        (students['Zodiac Signs'] == 'Pisces')]
    choices = ['yellow', 'blue', 'purple']
    students['color'] = np.select(conditions, choices, default='black')
    students
          Names Zodiac Signs flag   color
    0 John Aquarius yes yellow
    1 Mary Libra no blue
    2 Henry Gemini yes black
    3 Augustus Pisces no purple
    4 Kenny Virgo no black
    When using a simple 'if else' we need to take care of the indentation; Python does not use curly braces for loops or if-else blocks.
    x = 1001   #Take care of the indentation.
    if x%10 == 0:
        print("x is a multiple of 10")
    else:
        print("It is not a multiple of 10")

    Select numeric or categorical columns only
    To select only the numeric columns we use select_dtypes( ):
    data1 = iris.select_dtypes(include=[np.number])
    data1.head()
    _get_numeric_data is another (internal) utility that selects only the numeric columns.
    data3 = iris._get_numeric_data()
    data3.head(3)
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
    0 5.1 3.5 1.4 0.2 5.1 5.1
    1 4.9 3.0 1.4 0.2 10.0 10.0
    2 4.7 3.2 1.3 0.2 14.7 14.7
    For selecting categorical variables
    data4 = iris.select_dtypes(include = ['object'])
    data4.head(2)
     Species
    0 setosa
    1 setosa

    Concatenating
    We create 2 dataframes containing the details of the students:
    students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                             'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
    students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                              'Marks' : [50,81,98,25,35]})
    Using the pd.concat( ) function we can join the 2 dataframes:
    data = pd.concat([students,students2])  #by default axis = 0
       Marks     Names Zodiac Signs
    0 NaN John Aquarius
    1 NaN Mary Libra
    2 NaN Henry Gemini
    3 NaN Augustus Pisces
    4 NaN Kenny Virgo
    0 50.0 John NaN
    1 81.0 Mary NaN
    2 98.0 Henry NaN
    3 25.0 Augustus NaN
    4 35.0 Kenny NaN
    By default axis = 0, so the new dataframe is added row-wise. If a column is not present in one of the dataframes, NaNs are created for it. To join column-wise we set axis = 1:
    data = pd.concat([students,students2],axis = 1)
    data
          Names Zodiac Signs  Marks     Names
    0 John Aquarius 50 John
    1 Mary Libra 81 Mary
    2 Henry Gemini 98 Henry
    3 Augustus Pisces 25 Augustus
    4 Kenny Virgo 35 Kenny
    Using the append function we can join the dataframes row-wise:
    students.append(students2)  #for rows
    Alternatively, we can create a dictionary of the two dataframes and use pd.concat to join them row-wise; the dictionary keys become an extra index level:
    classes = {'x': students, 'y': students2}
     result = pd.concat(classes)
    result 
         Marks     Names Zodiac Signs
    x 0 NaN John Aquarius
    1 NaN Mary Libra
    2 NaN Henry Gemini
    3 NaN Augustus Pisces
    4 NaN Kenny Virgo
    y 0 50.0 John NaN
    1 81.0 Mary NaN
    2 98.0 Henry NaN
    3 25.0 Augustus NaN
    4 35.0 Kenny NaN

    Merging or joining on the basis of a common variable
    We take 2 dataframes with different numbers of observations:
    students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                             'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
    students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                              'Marks' : [50,81,98,25,35]})
    Using pd.merge we can join the two dataframes. on = 'Names' specifies that the common variable on which the dataframes are to be combined is 'Names'.
    result = pd.merge(students, students2, on='Names')  #it only takes intersections
    result
       Names Zodiac Signs  Marks
    0 John Aquarius 50
    1 Mary Libra 81
    2 Henry Gemini 98
     By default how = "inner" thus it takes only the common elements in both the dataframes. If you want all the elements in both the dataframes set how = "outer"
     result = pd.merge(students, students2, on='Names',how = "outer")  #it only takes unions
    result
          Names Zodiac Signs  Marks
    0 John Aquarius 50.0
    1 Mary Libra 81.0
    2 Henry Gemini 98.0
    3 Maria Capricorn NaN
    4 Augustus NaN 25.0
    5 Kenny NaN 35.0
    To keep the matching rows plus all the rows of the left dataframe, set how = 'left':
    result = pd.merge(students, students2, on='Names',how = "left")
    result
       Names Zodiac Signs  Marks
    0 John Aquarius 50.0
    1 Mary Libra 81.0
    2 Henry Gemini 98.0
    3 Maria Capricorn NaN
    Similarly, how = 'right' keeps the matching rows plus all the rows of the right dataframe.
    result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
    result
          Names Zodiac Signs  Marks      _merge
    0 John Aquarius 50 both
    1 Mary Libra 81 both
    2 Henry Gemini 98 both
    3 Augustus NaN 25 right_only
    4 Kenny NaN 35 right_only
    indicator = True creates a column (_merge) indicating whether each row is present in both dataframes or only in the left or right one.
    22
    Dec

    Visualize activations functions using keras

    In Keras, we can visualize the geometric properties of activation functions by applying backend functions to the layers of a model.

    We all know the exact functional form of popular activation functions such as 'sigmoid', 'tanh', 'relu', etc., and we could feed data to these functions directly to obtain their output. But how can we do that via Keras without explicitly specifying their functional forms?

    This can be done following the four steps below:

    1. Define a simple MLP model with one-dimensional input data, a one-neuron Dense hidden layer, and a one-neuron output layer with a 'linear' activation function.
    2. Extract the layers' outputs of the model (fitted or not) by iterating through model.layers.
    3. Use the backend function K.function() to obtain the calculated output for given input data.
    4. Feed the desired data to the above functions to obtain the output of the corresponding activation function.

    The code below is a demo:




    from keras.layers import Dense, Activation
    from keras.models import Sequential
    import keras.backend as K
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt


    # The settings below enable Chinese text in plots, following http://blog.csdn.net/rumswell/article/details/6544377
    plt.rcParams['font.sans-serif'] = ['SimHei']  # specify the default font
    plt.rcParams['axes.unicode_minus'] = False    # prevent minus signs and non-ASCII characters from rendering as boxes

    def NNmodel(activationFunc='linear'):
        '''
        Define a simple neural network model. To define a different model, modify this function directly.
        '''
        # Pick a weight initializer that works reasonably with the chosen activation
        if (activationFunc == 'softplus') | (activationFunc == 'sigmoid'):
            winit = 'lecun_uniform'
        elif activationFunc == 'hard_sigmoid':
            winit = 'lecun_normal'
        else:
            winit = 'he_uniform'
        model = Sequential()
        model.add(Dense(1, input_shape=(1,), activation=activationFunc,
                        kernel_initializer=winit,
                        name='Hidden'))
        model.add(Dense(1, activation='linear', name='Output'))
        model.compile(loss='mse', optimizer='sgd')
        return model

    def VisualActivation(activationFunc='relu', plot=True):
        # Input values spanning -5.0 to 4.9 in steps of 0.1
        x = (np.arange(100) - 50) / 10

        model = NNmodel(activationFunc=activationFunc)

        # Build a backend function that maps the model input to the hidden layer's output
        inX = model.input
        outputs = [layer.output for layer in model.layers if layer.name == 'Hidden']
        functions = [K.function([inX], [out]) for out in outputs]

        layer_outs = [func([x.reshape(-1, 1)]) for func in functions]
        activationLayer = layer_outs[0][0]

        activationDf = pd.DataFrame(activationLayer)
        result = pd.concat([pd.DataFrame(x), activationDf], axis=1)
        result.columns = ['X', 'Activated']
        result.set_index('X', inplace=True)
        if plot:
            result.plot(title=activationFunc)

        return result


    # Now we can visualize them (assuming default settings):
    actFuncs = ['linear', 'softmax', 'sigmoid', 'tanh', 'softsign', 'hard_sigmoid', 'softplus', 'selu', 'elu']

    figure = plt.figure()
    for i, f in enumerate(actFuncs):
        # plot each activation function in its own subplot
        figure.add_subplot(3, 3, i + 1)
        out = VisualActivation(activationFunc=f, plot=False)
        plt.plot(out.index, out.Activated)
        plt.title('Activation function: ' + f)

    This figure is the output of the above code. As we can see, the geometric properties of each activation function are well captured.

    28
    Jun

    Using Python to work with SAS Viya and CAS

    One of the big benefits of the SAS Viya platform is how approachable it is for programmers of other languages. You don't have to learn SAS in order to become productive quickly. We've seen a lot of interest from people who code in Python, maybe because that language has become known for its application in machine learning. SAS has a new product called SAS Visual Data Mining and Machine Learning. And these days, you can't offer such a product without also offering something special to those Python enthusiasts.

    Introducing Python SWAT

    And so, SAS has published the Python SWAT project (where "SWAT" stands for the SAS Scripting Wrapper for Analytics Transfer). The project is a Python code library that SAS released using an open source model. That means that you can download it for free, make changes locally, and even contribute those changes back to the community (as some developers have already done!). You'll find it at github.com/sassoftware/python-swat.

    SAS developer Kevin Smith is the main contributor on Python SWAT, and he's a big fan of Python. He's also an expert in SAS and in many programming languages. If you're a SAS user, you probably run Kevin's code every day; he was an original developer on the SAS Output Delivery System (ODS). Now he's a member of the cloud analytics team in SAS R&D. (He's also the author of more than a few conference papers and SAS books.)

    Kevin enjoys the dynamic, fluid style that a scripting language like Python affords - versus the more formal "code-compile-build-execute" model of a compiled language. Watch this video (about 14 minutes) in which Kevin talks about what he likes in Python, and shows off how Python SWAT can drive SAS' machine learning capabilities.

    New -- but familiar -- syntax for Python coders

    The analytics engine behind the SAS Viya platform is called CAS, or SAS Cloud Analytic Services. You'll want to learn that term, because "CAS" is used throughout the SAS documentation and APIs. And while CAS might be new to you, the Python approach to CAS should feel very familiar for users of Python libraries, especially users of pandas, the Python Data Analysis Library.

    CAS and SAS' Python SWAT extend these concepts to provide intuitive, high-performance analytics from SAS Viya in your favorite Python environment, whether that's a Jupyter notebook or a simple console. Watch the video to see Kevin's demo and discussion about how to get started; a minimal connection sketch also follows the list below. You'll learn:

    • How to connect your Python session to the CAS server
    • How to upload data from your client to the CAS server
    • How SWAT extends the concept of the DataFrame API in pandas to leverage CAS capabilities
    • How to coax CAS to provide descriptive statistics about your data, and then go beyond what's built into the traditional DataFrame methods.
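
    As a rough sketch of the first two of those steps (the host name, port, credentials, and file name below are placeholders; you need a running CAS server and the swat package installed):
    import swat

    # Connect to the CAS server (placeholder connection details)
    conn = swat.CAS('cas-server.example.com', 5570, 'username', 'password')

    # Upload a client-side CSV file into an in-memory CAS table;
    # the returned CASTable object behaves much like a pandas DataFrame
    tbl = conn.read_csv('local_data.csv')

    # Descriptive statistics are computed on the CAS server, not on the client
    print(tbl.describe())

    conn.close()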

    Learn more about SAS Viya and Python

    There are plenty of helpful resources to help you learn about using Python with SAS Viya:

    And finally, what if you don't have SAS Viya yet, but you're interested in using Python with SAS 9.4? Check out the SASPy project, which allows you to access your traditional SAS features from a Jupyter notebook or Python console. It's another popular open source project from SAS R&D.

    The post Using Python to work with SAS Viya and CAS appeared first on The SAS Dummy.

    25
    Jun

    Python Data Structures

    This post explains the data structures used in Python. It is essential to understand the data structures of a programming language. In Python, there are several data structures available. They are as follows:
    1. strings
    2. lists
    3. tuples
    4. dictionaries
    5. sets

    Python Data Structures

    1. Strings

    Python String is a sequence of characters.

    How to create a string in Python

    You can create Python string using a single or double quote.
    mystring = "Hello Python3.6"
    print(mystring)
    Output:
    Hello Python3.6

    Can I use triple single or double quotes to define a string?

    Answer is Yes. See examples below -

    Triple Single Quotes
    mystring = '''Hello Python3.6'''
    print(mystring)
    Output:
    Hello Python3.6
    Triple Double Quotes
    mystring = """Hello Python3.6"""
    print(mystring)
    Output:
    Hello Python3.6

    How to include quotes within a string?
    Enclose the string in the other type of quote; here single quotes let the double quotes appear literally (the r prefix creates a raw string and is not strictly needed for this).
    mystring = r'Hello"Python"'
    print(mystring)
    Output:
    Hello"Python"

    How to extract Nth letter or word?

    You can use the syntax below to get first letter.
    mystring = 'Hi How are you?'
    mystring[0]
    Output
    'H'
    mystring[0] refers to first letter as indexing in python starts from 0. Similarly, mystring[1] refers to second letter.

    To pull last letter, you can use -1 as index.
    mystring[-1]

    To get first word
    mystring.split(' ')[0]
    Output : Hi

    How it works -

    1. mystring.split(' ') tells Python to use space as a delimiter.

    Output : ['Hi', 'How', 'are', 'you?']

    2. mystring.split(' ')[0] tells Python to pick first word of a string.


    2. List

    Unlike a string, a list can contain different types of objects such as integer, float, string etc.
    1. x = [142, 124, 234, 345, 465]
    2. y = ['A', 'C', 'E', 'M']
    3. z = ['AA', 44, 5.1, 'KK']

    Get List Item

    We can extract list items using indexes. The index starts from 0 and ends at (number of elements - 1).
    k = [124, 225, 305, 246, 259]
    k[0]
    k[1]
    k[-1]
    k[0]
    124

    k[1]
    225

    k[-1]
    259

    Explanation :
    k[0] picks the first element from the list. A negative index tells Python to count from the right; k[-1] selects the last element of the list.

    To select multiple elements from a list, you can use the following method :
    k[:3] returns [124, 225, 305]

    Add 5 to each element of a list

    In the program below, the len() function is used to count the number of elements in the list; in this case it returns 5. With the help of the range() function, range(5) returns 0, 1, 2, 3, 4.
    x = [1, 2, 3, 4, 5]
    for i in range(len(x)):
        x[i] = x[i] + 5
    print(x)
    [6, 7, 8, 9, 10]

    It can also be written like this -
    for i in range(len(x)):
       x[i] += 5
    print(x)
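
    A more idiomatic alternative is a list comprehension, which builds a new list rather than changing the old one element by element:
    x = [1, 2, 3, 4, 5]
    x = [i + 5 for i in x]
    print(x)
    [6, 7, 8, 9, 10]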

    Combine / Join two lists

    The '+' operator concatenates two lists.
    X = [1, 2, 3]
    Y = [4, 5, 6]
    Z = X + Y
    print(Z)
    [1, 2, 3, 4, 5, 6]

    Sum of values of two list
    X = [1, 2, 3]
    Y = [4, 5, 6]
    import numpy as np
    Z = np.add(X, Y)
    print(Z)
    print(Z)
    [5 7 9]
    Similarly, you can use np.multiply(X, Y) to multiply the values of two lists element-wise.
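    For example, with the same X and Y as above:
    Z = np.multiply(X, Y)
    print(Z)
    [ 4 10 18]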

    Repeat List N times

    The '*' operator repeats a list N times.
    X = [1, 2, 3]
    Z = X * 3
    print(Z)
    [1, 2, 3, 1, 2, 3, 1, 2, 3]

    Note : The above two methods also work for string list.

    Modify / Replace a list item

    Suppose you need to replace the third value with a different value.
    X = [1, 2, 3]
    X[2]=5
    print(X)
    print(X)
    [1, 2, 5]

    Add / Remove a list item

    We can add a list item by using append method.
    X = ['AA', 'BB', 'CC']
    X.append('DD')
    print(X)
    Result : ['AA', 'BB', 'CC', 'DD']

    Similarly, we can remove a list item by using remove method.
    X = ['AA', 'BB', 'CC']
    X.remove('BB')
    print(X)
    Result : ['AA', 'CC']

    Sort list
    k = [124, 225, 305, 246, 259]
    k.sort()
    print(k)
    Output : [124, 225, 246, 259, 305]


    3. Tuple

    Like a list, a tuple can also contain mixed data. But a tuple cannot be changed or altered once created, whereas a list can be modified. Another difference is that a tuple is created inside parentheses ( ), whereas a list is created inside square brackets [ ].

    Examples
    mytuple = (123,223,323)
    City = ('Delhi','Mumbai','Bangalore')
    Perform for loop on Tuple
    for i in City:
        print(i)
    Delhi
    Mumbai
    Bangalore

    Tuple cannot be altered

    Run the following command and check error
    X = (1, 2, 3)
    X[2]=5
    TypeError: 'tuple' object does not support item assignment

    4. Dictionary

    It works like an address book in which you can find a person's address by searching for the name. In this example, the name of a person is the key and the address is the value. It is important to note that keys must be unique while values need not be. Keys should not be duplicated, because with a duplicate key you cannot find the exact value associated with it. Keys can be of any immutable data type such as strings, numbers, or tuples.

    Create a dictionary

    It is defined in curly braces {}. Each key is followed by a colon (:) and then values.
    teams = {'Dave' : 'team A',
             'Tim' : 'team B',
             'Babita' : 'team C',
             'Sam' : 'team B',
             'Ravi' : 'team C'
            }

    Find Values
    teams['Sam']
    Output : 'team B'

    Delete an item
    del teams['Ravi']

    Add an item
    teams['Deep'] = 'team B'
    Output :
    {'Babita': 'team C',
     'Dave': 'team A',
     'Deep': 'team B',
     'Sam': 'team B',
     'Tim': 'team B'}

    5. Sets

    Sets are unordered collections of unique objects.
    X = set(['A', 'B', 'C'])

    Q. Does 'A' exist in set X?
    'A' in X
    Result : True

    Q. Does 'D' exist in set X?
    'D' in X
    Result : False

    Q. How to add 'D' in set X?
    X.add('D')
    Q. How to remove 'C' from set X?
    X.remove('C')
    Q. How to create a copy of set X?
    Y = X.copy()
    Q. Which items are common in both sets X and Y?
    Y & X
    19
    Jun

    Data Science Tool Market Share Leading Indicator: Scholarly Articles

    Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

    New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, IBM Watson, Pentaho, and Google’s TensorFlow.

    Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

    Scholarly Articles

    Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

    Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

    Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 6/8/2017.

    SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

    Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

    Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

    The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

    Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

    From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they're a drop in the bucket.

    Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s Tensorflow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

    The last two packages in Fig 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

    Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the "bars" to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

    Figure 2b. The number of scholarly articles for the less popular data science software (those used in fewer than 750 scholarly articles in 2016).

    While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

    Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 500% growth, but is still of little interest.

     

    Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

    Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

    IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

    While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

    In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

    From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

    In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

    The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

    I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

    Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

    As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS' level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

    Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

    Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.

     

    Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

    Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

    The rapid growth in Stata use seems to be finally slowing down.  Minitab’s growth has also seemed to stall in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

    The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.

    21
    May

    Python for Data Science : Learn in 3 Days

    This tutorial helps you learn Data Science with Python through examples. Python is an open source language and is widely used as a high-level, general-purpose programming language. It has gained high popularity in the data science world. As the data science domain is growing, IBM recently predicted that demand for data science professionals would rise by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python holds second rank with a 14 percent share. In the advanced analytics and predictive analytics market, it is ranked among the top 3 programming languages for advanced analytics.
    Data Science with Python Tutorial

    Table of Contents
    1. Getting Started with Python
    2. Data Structures and Conditional Statements
    3. Python Libraries
    4. Data Manipulation using Pandas
    5. Data Science with Python

    Python 2.7 vs 3.6

    Google yields thousands of articles on this topic. Some bloggers are opposed to 2.7 and some in favor of it. If you narrow your search to recent articles (late 2016 onwards), you will see that the majority of bloggers favor Python 3.6. See the following reasons to support Python 3.6.

    1. The official end-of-support date for Python 2.7 is the year 2020; afterward there will be no support from the community. It makes little sense to start learning 2.7 today.

    2. Python 3.6 supports 95% of the top 360 Python packages and almost 100% of the top packages for data science.

    What's new in Python 3.6

    It is cleaner and faster. It is a language for the future. It fixed major issues of the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 9 years since.

    Key Takeaway
    You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6, and moving between them requires only a few adjustments. Your focus should be on learning Python as a language.

    Python for Data Science : Introduction

    Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development, etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. In some areas, such as the number of libraries for statistical analysis, R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of many data scientists.

    There are several reasons to learn Python. Some of them are as follows -
    1. Python works well for automating the various steps of a predictive model.
    2. Python has robust libraries for machine learning, natural language processing, deep learning, big data and artificial intelligence.
    3. Python wins over R when it comes to deploying machine learning models in production.
    4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
    5. Python has great online community support.
    Do you know these sites are developed in Python?
    1. YouTube
    2. Instagram
    3. Reddit
    4. Dropbox
    5. Disqus

    How to Install Python

    There are two ways to download and install Python
    1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
    2. Download Python from its official website. You have to manually install libraries.

    Recommended : Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

    Coding Environments

    Anaconda comes with two popular IDEs :
    1. Jupyter (Ipython) Notebook
    2. Spyder
    Spyder. It is like RStudio for Python. It provides a user-friendly environment for writing Python code. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs and a console to check each and every line of code. Under the 'Variable explorer', you can access the data files and functions you have created. I highly recommend Spyder!
    Spyder - Python Coding Environment
    Jupyter (Ipython) Notebook

    Jupyter is the equivalent of R Markdown in R. It is useful when you need to present your work to others or when you need to create a step-by-step project report, as it can combine code, output, text, and graphics.

    Spyder Shortcut Keys

    The following is a list of some useful Spyder shortcut keys which make you more productive.
    1. Press F5 to run the entire script
    2. Press F9 to run the selection or current line
    3. Press Ctrl + 1 to comment / uncomment
    4. Place the cursor on a function name and press Ctrl + I to see the documentation of the function
    5. Run %reset -f to clean the workspace
    6. Ctrl + Left click on an object to see its source code
    7. Ctrl + Enter executes the current cell
    8. Shift + Enter executes the current cell and advances the cursor to the next cell

    List of arithmetic operators with examples

    Arithmetic Operators    Operation              Example
    +                       Addition               10 + 2 = 12
    -                       Subtraction            10 - 2 = 8
    *                       Multiplication         10 * 2 = 20
    /                       Division               10 / 2 = 5.0
    %                       Modulus (remainder)    10 % 3 = 1
    **                      Power                  10 ** 2 = 100
    //                      Floor division         17 // 3 = 5
    (x + (d-1)) // d        Ceiling                (17 + (3-1)) // 3 = 6
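
    As a quick check of the last two rows (using the same numbers as in the table), the ceiling trick pads the numerator so that any nonzero remainder pushes the result up by one:
    x = 17
    d = 3
    print(x // d)               # floor division : 5
    print((x + (d - 1)) // d)   # ceiling : 6, since 17/3 is about 5.67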

    Basic Programs

    Example 1
    #Basics
    x = 10
    y = 3
    print("10 divided by 3 is", x/y)
    print("remainder after 10 divided by 3 is", x%y)
    Result :
    10 divided by 3 is 3.3333333333333335
    remainder after 10 divided by 3 is 1

    Example 2
    x = 100

    x > 80 and x <= 95
    Out[45]: False

    x > 35 or x < 60
    Out[46]: True

    Comparison & Logical Operators Description Example
    > Greater than 5 > 3 returns True
    < Less than 5 < 3 returns False
    >= Greater than or equal to 5 >= 3 returns True
    <= Less than or equal to 5 <= 3 returns False
    == Equal to 5 == 3 returns False
    != Not equal to 5 != 3 returns True
    and True if both conditions hold x > 18 and x <= 35
    or True if at least one condition holds x > 35 or x < 60
    not Opposite of the condition not(x > 7)

    Assignment Operators

    It is used to assign a value to a variable. For example, x += 25 means x = x + 25.
    x = 100
    y = 10
    x += y
    print(x)
    110
    In this case, x += y implies x = x + y, which is x = 100 + 10.
    Similarly, you can use x -= y, x *= y and x /= y.
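    As a quick sketch of the other augmented assignment operators (the values here are just illustrative):
    x = 100
    y = 10
    x -= y      # x is now 90
    x *= 2      # x is now 180
    x /= 3      # x is now 60.0 (division always returns a float)
    print(x)
    60.0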

    Python Data Structure

    In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

    1. List

    It is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first one is an integer list containing only integers. The second one is a string list containing only string values. The third one is a mixed list containing integer, string and float values.
    1. x = [1, 2, 3, 4, 5]
    2. y = ['A', 'O', 'G', 'M']
    3. z = ['A', 4, 5.1, 'M']
    Get List Item

    We can extract list items using indexes. The index starts from 0 and ends with (number of elements - 1).
    x = [1, 2, 3, 4, 5]

    x[0]
    Out[68]: 1

    x[1]
    Out[69]: 2

    x[4]
    Out[70]: 5

    x[-1]
    Out[71]: 5

    x[-2]
    Out[72]: 4

    x[0] picks the first element from the list. A negative index tells Python to count list items from right to left, so x[-1] selects the last element of the list.

    You can select multiple elements from a list using slicing -
    x[:3] returns [1, 2, 3]
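    A few more slicing patterns, shown as a small sketch on the same list x as above:
    x = [1, 2, 3, 4, 5]
    x[1:4]    # returns [2, 3, 4] - from index 1 up to (but not including) index 4
    x[2:]     # returns [3, 4, 5] - from index 2 to the end
    x[-2:]    # returns [4, 5] - the last two elements
    x[::2]    # returns [1, 3, 5] - every second element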

    2. Tuple

    A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
    1. A tuple cannot be changed once constructed, whereas a list can be modified (see the sketch after the loop example below).
    2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
    Examples
    K = (1,2,3)
    State = ('Delhi','Maharashtra','Karnataka')
    Perform a for loop on a tuple
    for i in State:
        print(i)
    Delhi
    Maharashtra
    Karnataka
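    As a small sketch of the immutability rule mentioned above, changing an element works for a list but raises an error for a tuple:
    L = [1, 2, 3]
    L[0] = 10      # works, L is now [10, 2, 3]
    K = (1, 2, 3)
    K[0] = 10      # TypeError: 'tuple' object does not support item assignment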
    Functions

    Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and call reusable code in an easier way.

    Rules to define a function
    1. A function starts with the def keyword, followed by the function name and ( )
    2. The function body starts after a colon (:) and is indented
    3. The return keyword ends a function and returns the value of the expression that follows it.
    def sum_fun(a, b):
        result = a + b
        return result 
    z = sum_fun(10, 15)
    Result : z = 25

    Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.
    def sum_fun(a, b=0):
        result = a + b
        return result
    z = sum_fun(10)
    In the above function, b is set to 0 if no value is provided for parameter b. That does not mean other values cannot be passed; you can still call it as z = sum_fun(10, 15).

    Conditional Statements (if else)

    Conditional statements are commonly used in coding. They are IF ELSE statements and can be read like : "if a condition holds true, then execute something; else execute something else".

    Note : The if and else statements end with a colon :

    Example
    k = 27
    if k%5 == 0:
      print('Multiple of 5')
    else:
      print('Not a Multiple of 5')
    Result : Not a Multiple of 5
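    When you need more than two branches, you can chain conditions with elif. A minimal sketch following the same example:
    k = 27
    if k % 5 == 0:
        print('Multiple of 5')
    elif k % 3 == 0:
        print('Multiple of 3')
    else:
        print('Neither a multiple of 3 nor 5')
    Result : Multiple of 3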

    Popular python packages for Data Analysis & Visualization

    Some of the leading packages in Python, along with the equivalent libraries in R, are as follows -
    1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
    2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly comparable to the Rcpp package in R, which addresses the slow speed of plain R code. Numpy Tutorial
    3. SciPy. For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
    4. Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform the techniques that are done in R using the glm, knn, randomForest, rpart and e1071 packages.
    5. Matplotlib. For data visualization. It's a leading package for graphics in Python. It is equivalent to the ggplot2 package in R.
    6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML format.
    7. pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
    Most of the above packages come preinstalled with Anaconda.
      Comparison of Python and R Packages by Data Mining Task

      Task Python Package R Package
      IDE Rodeo / Spyder Rstudio
      Data Manipulation pandas dplyr and reshape2
      Machine Learning Scikit-learn glm, knn, randomForest, rpart, e1071
      Data Visualization ggplot + seaborn + bokeh ggplot2
      Character Functions Built-In Functions stringr
      Reproducibility Jupyter Knitr
      SQL Queries pandasql sqldf
      Working with Dates datetime lubridate
      Web Scraping beautifulsoup rvest

      Popular Python Commands

      The commands below help you install and update new and existing packages. Let's say you want to install / uninstall the pandas package.

      Install Package
      !pip install pandas

      Uninstall Package
      !pip uninstall pandas

      Show Information about Installed Package
      !pip show pandas

      List of Installed Packages
      !pip list

      Upgrade a package
      !pip install --upgrade pandas

        How to import a package

        There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

        1. import pandas as pd
        It imports the package pandas under the alias pd. The DataFrame function in the pandas package is then called as pd.DataFrame.

        2. import pandas
        It imports the package without an alias; here the DataFrame function is called with the full package name, pandas.DataFrame.

        3. from pandas import *
        It imports the whole package and the DataFrame function can be run simply by typing DataFrame. This style can create confusion when the same function name exists in more than one package.

        Pandas Data Structures : Series and DataFrame

        In the pandas package, there are two data structures - Series and DataFrame. These structures are explained below in detail -
        1. Series is a one-dimensional array. You can access individual elements of a Series using their position. It's similar to a vector in R.
        In the example below, we are generating 5 random values.
        import numpy as np
        import pandas as pd
        s1 = pd.Series(np.random.randn(5))
        s1
        0   -2.412015
        1 -0.451752
        2 1.174207
        3 0.766348
        4 -0.361815
        dtype: float64

        Extract first and second value

        You can get a particular element of a series using index value. See the examples below -

        s1[0]
        -2.412015
        s1[1]
        -0.451752
        s1[:3]
        0   -2.412015
        1 -0.451752
        2 1.174207

        2. DataFrame

        It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-conversant with MS Excel can think of a data frame as an Excel spreadsheet.

        Comparison of Data Type in Python and Pandas

        The following table shows how Python and pandas package stores data.

        Data Type Pandas Standard Python
        For character variables object string
        For categorical variables category -
        For numeric variables without decimals int64 int
        For numeric variables with decimals float64 float
        For date time variables datetime64 -
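        To check how pandas has stored each column, you can inspect the dtypes attribute. A minimal sketch with a small made-up DataFrame (the name df_types is just for illustration):
        import pandas as pd
        df_types = pd.DataFrame({'name': ['A', 'B'],
                                 'units': [10, 20],
                                 'price': [10.5, 20.1]})
        df_types.dtypes
        name      object
        units      int64
        price    float64
        dtype: object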

        Important Pandas Functions

        The table below shows a comparison of pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorize the pandas functions. It's very handy information for programmers who are new to Python, and it includes solutions for most of the frequently used data exploration tasks.

        Functions R Python (pandas package)
        Installing a package install.packages('name') !pip install name
        Loading a package library(name) import name as other_name
        Checking working directory getwd() import os; os.getcwd()
        Setting working directory setwd() os.chdir()
        List files in a directory dir() os.listdir()
        Remove an object rm('name') del object
        Select Variables select(df, x1, x2) df[['x1', 'x2']]
        Drop Variables select(df, -(x1:x2)) df.drop(['x1', 'x2'], axis = 1)
        Filter Data filter(df, x1 >= 100) df.query('x1 >= 100')
        Structure of a DataFrame str(df) df.info()
        Summarize dataframe summary(df) df.describe()
        Get row names of dataframe "df" rownames(df) df.index
        Get column names colnames(df) df.columns
        View Top N rows head(df,N) df.head(N)
        View Bottom N rows tail(df,N) df.tail(N)
        Get dimension of data frame dim(df) df.shape
        Get number of rows nrow(df) df.shape[0]
        Get number of columns ncol(df) df.shape[1]
        Length of data frame length(df) len(df)
        Get random 3 rows from dataframe sample_n(df, 3) df.sample(n=3)
        Get random 10% rows sample_frac(df, 0.1) df.sample(frac=0.1)
        Check Missing Values is.na(df$x) pd.isnull(df.x)
        Sorting arrange(df, x1, x2) df.sort_values(['x1', 'x2'])
        Rename Variables rename(df, newvar = x1) df.rename(columns={'x1': 'newvar'})


        Data Manipulation with pandas - Examples

        1. Import Required Packages

        You can import required packages using the import statement. In the syntax below, we are asking Python to import the numpy and pandas packages. The 'as' keyword is used to alias the package name.
        import numpy as np
        import pandas as pd

        2. Build DataFrame

        We can build dataframe using DataFrame() function of pandas package.
        mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
        df = pd.DataFrame(mydata)
         In this dataframe, we have three variables - productcode, sales, cost.
        Sample DataFrame

        To import data from CSV file


        You can use the read_csv() function from the pandas package to get data from a CSV file into Python.
        mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
        Make sure you use double backslashes when specifying a Windows path to the CSV file. Alternatively, you can use forward slashes in the file path inside read_csv().

        Detailed Tutorial : Import Data in Python

        3. To see number of rows and columns

        You can run the command below to find out number of rows and columns.
        df.shape
         Result : (6, 3). It means 6 rows and 3 columns.

        4. To view first 3 rows

        The df.head(N) function can be used to view the first N rows.
        df.head(3)
             cost productcode   sales
        0 1020.0 AA 1010.0
        1 1625.2 AA 1025.2
        2 1204.0 AA 1404.2

        5. Select or Drop Variables

        To keep a single variable, you can use any of the following three methods -
        df.productcode
        df["productcode"]
        df.loc[: , "productcode"]
        To select a variable by column position, you can use df.iloc. In the example below, we are selecting the second column. The column index starts from 0, hence 1 refers to the second column.
        df.iloc[: , 1]
        We can keep multiple variables by specifying the desired variables inside [ ]. We can also make use of df.loc.
        df[["productcode", "cost"]]
        df.loc[ : , ["productcode", "cost"]]

        Drop Variable

        We can remove variables by using df.drop() function. See the example below -
        df2 = df.drop(['sales'], axis = 1)

        6. To summarize data frame

        To summarize or explore data, you can submit the command below.
        df.describe()
                      cost       sales
        count 6.000000 6.00000
        mean 1166.150000 1242.65000
        std 237.926793 230.46669
        min 1003.700000 1010.00000
        25% 1020.000000 1058.90000
        50% 1072.000000 1205.85000
        75% 1184.000000 1366.07500
        max 1625.200000 1604.80000

        To summarise all the character variables, you can use the following script.
        df.describe(include=['O'])
        Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

        To select only a particular variable, you can write the following code -
        df.productcode.describe()
        OR
        df["productcode"].describe()
        count      6
        unique 2
        top BB
        freq 3
        Name: productcode, dtype: object

        7. To calculate summary statistics

        We can manually find out summary statistics such as count, mean, median by using commands below
        df.sales.mean()
        df.sales.median()
        df.sales.count()
        df.sales.min()
        df.sales.max()

        8. Filter Data

        Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
        df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
        It can also be written like :
        df1 = df.query('(productcode == "AA") & (sales >= 1250)')
        In the second query, we do not need to specify DataFrame along with variable name.

        9. Sort Data

        In the code below, we are arranging the data in ascending order by sales.
        df.sort_values(['sales'])
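        To sort in descending order, pass ascending=False; you can also sort by several columns at once. A small sketch on the same df:
        df.sort_values(['sales'], ascending=False)
        df.sort_values(['productcode', 'sales'], ascending=[True, False])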

        10.  Group By : Summary by Grouping Variable

        Like SQL GROUP BY, you may want to summarize a continuous variable by a classification variable. In this case, we are calculating average sales and cost by product code.
        df.groupby(df.productcode).mean()
                            cost        sales
        productcode
        AA 1283.066667 1146.466667
        BB 1049.233333 1338.833333
        Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
        df["sales"].groupby(df.productcode).mean()

        11. Define Categorical Variable

        Let's create a classification variable - id - which contains only 3 unique values: 1/2/3.
        df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
        Let's define it as a categorical variable. We can use the astype() function to make id a categorical variable.
        df0.id = df0["id"].astype('category')
        Summarize this classification variable to check descriptive statistics.
        df0.describe()
               id
        count 7
        unique 3
        top 2
        freq 3

        Frequency Distribution

        You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
        df['productcode'].value_counts()
        BB    3
        AA 3
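        To see the same distribution as proportions rather than counts, pass normalize=True:
        df['productcode'].value_counts(normalize=True)
        BB    0.5
        AA    0.5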

        12. Generate Histogram

        A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values of the variable 'sales' in the range 1000-1100, and only a single value in each of the remaining intervals; the dataset here has only 6 values. On a larger dataset, you can plot a histogram to identify outliers in a continuous variable.
        df['sales'].hist()
        Histogram

        13. BoxPlot

        A boxplot is a method to visualize a continuous or numeric variable. It shows the minimum, Q1, median (Q2), Q3 and maximum values in a single graph, with the box itself spanning the interquartile range (IQR).
        df.boxplot(column='sales')
        BoxPlot

        Detailed Tutorial : Data Analysis with Pandas Tutorial

        Data Science using Python - Examples

        In this section, we cover how to run data mining and machine learning algorithms with Python. sklearn is the most frequently used library for data mining and machine learning algorithms. We will also cover the statsmodels library for regression techniques; statsmodels generates nicely formatted output which can be used in project reports and presentations.

        1. Import the required libraries

        Import the following libraries before reading or exploring the data
        #Import required libraries
        import pandas as pd
        import statsmodels.api as sm
        import numpy as np

        2. Download and import data into Python

        Using a Python library, we can easily get data from the web into Python.
        # Read data from web
        df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
        Variables Type Description
        gre Continuous Graduate Record Exam score
        gpa Continuous Grade Point Average
        rank Categorical Prestige of the undergraduate institution
        admit Binary Admission in graduate school

        The binary variable admit is a target variable.

        3. Explore Data

        Let's explore the data. We'll answer the following questions -
        1. How many rows and columns are in the data file?
        2. What are the distributions of the variables?
        3. Check for outliers
        4. If there are outliers, treat them
        5. Check for missing values
        6. Impute missing values (if any)
        # See no. of rows and columns
        df.shape
        Result : 400 rows and 4 columns

        In the code below, we rename the variable rank to 'position', because rank clashes with the pandas DataFrame method of the same name.
        # rename rank column
        df = df.rename(columns={'rank': 'position'}) 
        Summarize and plot all the columns.
        # Summarize
        df.describe()
        # plot all of the columns
        df.hist()
        Categorical variable Analysis

        It is important to check the frequency distribution of a categorical variable. It helps to answer the question of whether the data is skewed.
        # Summarize
        df.position.value_counts(ascending=True)
        1     61
        4 67
        3 121
        2 151

        Generating Crosstab 

        By looking at the cross tabulation report, we can check whether we have enough events against each unique value of the categorical variable.
        pd.crosstab(df['admit'], df['position'])
        position   1   2   3   4
        admit
        0 28 97 93 55
        1 33 54 28 12

        Number of Missing Values

        We can write a simple loop to figure out the number of missing (blank) values in each variable of the dataset (a one-line pandas alternative is shown after the loop).
        for i in list(df.columns) :
            k = sum(pd.isnull(df[i]))
            print(i, k)
        In this case, there are no missing values in the dataset.
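        Pandas can also count missing values without an explicit loop; a one-line alternative on the same dataframe:
        # number of missing values in each column
        df.isnull().sum()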

        4. Logistic Regression Model

        Logistic regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we demonstrate only binary logistic regression, where the target variable takes only binary values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

        In Python, we can write an R-style model formula y ~ x1 + x2 + x3 using the patsy and statsmodels libraries. In the formula, we define the variable 'position' as a categorical variable by wrapping it inside C(). You can also define the reference category using the reference= option.
        #Reference Category
        from patsy import dmatrices, Treatment
        y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
        It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for position, and gre and gpa. Since 4 is set as the reference category, all three dummy variables are 0 when position is 4. See the sample below -
        P  P_1 P_2 P_3
        3 0 0 1
        3 0 0 1
        1 1 0 0
        4 0 0 0
        4 0 0 0
        2 0 1 0


        Split Data into two parts

        80% of the data goes to the training dataset, which is used for building the model, and 20% goes to the test dataset, which is used for validating the model.
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

        Build Logistic Regression Model

        By default, the non-formula style of regression does not include an intercept. To include it, we rely on the Intercept column that dmatrices already added to X_train, which is used as a constant predictor.
        #Fit Logit model
        logit = sm.Logit(y_train, X_train)
        result = logit.fit()

        #Summary of Logistic regression model
        result.summary()
        result.params
                                  Logit Regression Results                           
        ==============================================================================
        Dep. Variable: admit No. Observations: 320
        Model: Logit Df Residuals: 315
        Method: MLE Df Model: 4
        Date: Sat, 20 May 2017 Pseudo R-squ.: 0.03399
        Time: 19:57:24 Log-Likelihood: -193.49
        converged: True LL-Null: -200.30
        LLR p-value: 0.008627
        =======================================================================================
                             coef    std err          z      P>|z|      [95.0% Conf. Int.]
        ---------------------------------------------------------------------------------------
        C(position)[T.1]   1.4933      0.440      3.392      0.001       0.630     2.356
        C(position)[T.2]   0.6771      0.373      1.813      0.070      -0.055     1.409
        C(position)[T.3]   0.1071      0.410      0.261      0.794      -0.696     0.910
        gre                0.0005      0.001      0.442      0.659      -0.002     0.003
        gpa               -0.4613      0.214     -2.152      0.031      -0.881    -0.041
        ======================================================================================

        Confusion Matrix and Odds Ratio

        The odds ratio is the exponential of the parameter estimates.
        #Confusion Matrix
        result.pred_table()
        #Odd Ratio
        np.exp(result.params)

        Prediction on Test Data
        In this step, we take the estimates of the logit model built on the training data and apply them to the test data.
        #prediction on test data
        y_pred = result.predict(X_test)

        Calculate Area under Curve (ROC)
        # AUC on test data
        from sklearn.metrics import roc_curve, auc
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
        auc(false_positive_rate, true_positive_rate)
        Result : AUC = 0.6763

        Calculate Accuracy Score
        from sklearn.metrics import accuracy_score
        accuracy_score(y_test, [ 1 if p > 0.5 else 0 for p in y_pred ])

        Decision Tree Model

        A decision tree can have a continuous or a categorical target variable. When it is continuous, the tree is called a regression tree; when it is categorical, it is called a classification tree. At each step, the algorithm selects the variable that best splits the set of values. There are several criteria and algorithms for finding the best split, such as Gini, entropy, C4.5 and chi-square. Decision trees have several advantages: they are simple to use and easy to understand, they require very few data preparation steps, they can handle mixed data (both categorical and continuous variables), and they are very fast.

        #Drop Intercept from predictors for tree algorithms
        X_train = X_train.drop(['Intercept'], axis = 1)
        X_test = X_test.drop(['Intercept'], axis = 1)

        #Decision Tree
        from sklearn.tree import DecisionTreeClassifier
        model_tree = DecisionTreeClassifier(max_depth=7)

        #Fit the model:
        model_tree.fit(X_train,y_train)

        #Make predictions on test set
        predictions_tree = model_tree.predict_proba(X_test)

        #AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
        auc(false_positive_rate, true_positive_rate)
        Result : AUC = 0.664

        Important Note
        Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select best parameters by using grid search fine tuning technique.

        Random Forest Model

        A decision tree has the limitation of overfitting, which means it may not generalize well; it is also very sensitive to small changes in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomised samples of the data and selects a random subset of variables to grow each tree. It is a more robust algorithm than a single decision tree and one of the most popular machine learning algorithms: it is commonly used in data science competitions, often ranks among the top-performing algorithms, and has become part of every data science toolkit.

        #Random Forest
        from sklearn.ensemble import RandomForestClassifier
        model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

        #Fit the model:
        target = y_train['admit']
        model_rf.fit(X_train,target)

        #Make predictions on test set
        predictions_rf = model_rf.predict_proba(X_test)

        #AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
        auc(false_positive_rate, true_positive_rate)

        #Variable Importance
        importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
        print(importances)
        importances.plot.bar()

        Result : AUC = 0.6974

        Grid Search - Hyper Parameters Tuning

        The sklearn library makes hyper-parameter tuning very easy. Grid search is a strategy for selecting the best parameters for an algorithm. In scikit-learn, hyper-parameters are passed as arguments to the constructor of the estimator classes, for example max_features in RandomForestClassifier or alpha in Lasso.

        from sklearn.model_selection import GridSearchCV
        rf = RandomForestClassifier()
        target = y_train['admit']

        param_grid = {
        'n_estimators': [100, 200, 300],
        'max_features': ['sqrt', 3, 4]
        }

        CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
        CV_rfc.fit(X_train,target)

        #Parameters with Scores
        CV_rfc.grid_scores_   # removed in newer scikit-learn releases; use CV_rfc.cv_results_ there

        #Best Parameters
        CV_rfc.best_params_
        CV_rfc.best_estimator_

        #Make predictions on test set
        predictions_rf = CV_rfc.predict_proba(X_test)

        #AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
        auc(false_positive_rate, true_positive_rate)

        Cross Validation
        # Cross Validation
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_predict,cross_val_score
        target = y['admit']
        prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
        #AUC
        cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')

        Data Mining : PreProcessing Steps

        1. The machine learning package sklearn requires all categorical variables to be in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished with the following script, which uses the LabelEncoder class that sklearn already provides for this step.

        from sklearn.preprocessing import LabelEncoder
        def ConverttoNumeric(df):
            cols = list(df.select_dtypes(include=['category','object']))
            le = LabelEncoder()
            for i in cols:
                try:
                    df[i] = le.fit_transform(df[i])
                except:
                    print('Error in Variable :'+i)
            return df

        ConverttoNumeric(df)
        Encoding

        2. Create Dummy Variables

        Suppose you want to convert categorical variables into dummy variables. This is different from the previous example, as it creates dummy variables instead of converting the variable to numeric form.
        productcode_dummy = pd.get_dummies(df["productcode"])
        df2 = pd.concat([df, productcode_dummy], axis=1)

        The output looks like below -
           AA  BB
        0 1 0
        1 1 0
        2 1 0
        3 0 1
        4 0 1
        5 0 1

        Create k-1 Categories

        To avoid multicollinearity, you can set one of the categories as the reference category and leave it out while creating dummy variables. In the script below, we are dropping the first category.
        productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
        df2 = pd.concat([df, productcode_dummy], axis=1)

        3. Impute Missing Values

        Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled in, the complete row is removed. If the data contains a lot of missing values, this can lead to a huge data loss. There are multiple ways to impute missing values; some common techniques replace a missing value with the mean, the median or zero. It makes sense to replace a missing value with 0 when 0 is itself meaningful, for example a flag for whether a customer holds a credit card product.

        Fill missing values of a particular variable
        # fill missing values with 0
        df['var1'] = df['var1'].fillna(0)
        # fill missing values with mean
        df['var1'] = df['var1'].fillna(df['var1'].mean())

        Apply imputation to the whole dataset
        from sklearn.preprocessing import Imputer

        # Set an imputer object
        mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

        # Train the imputer
        mean_imputer = mean_imputer.fit(df)

        # Apply imputation
        df_new = mean_imputer.transform(df.values)
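        Note that the Imputer class above was removed from sklearn.preprocessing in later scikit-learn releases. A roughly equivalent sketch using its replacement, SimpleImputer (available since scikit-learn 0.20):
        from sklearn.impute import SimpleImputer
        import numpy as np

        # Impute NaN values with the column mean
        mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
        df_new = mean_imputer.fit_transform(df)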

        4. Outlier Treatment

        There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -
        1. Cap extreme values at the 95th / 99th percentile, depending on the distribution (see the capping sketch after the log example below).
        2. Apply a log transformation to the variable. See below the implementation of a log transformation in Python.
        import numpy as np
        df['var1'] = np.log(df['var1'])
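        And a minimal sketch of the capping approach from point 1, clipping 'var1' at its 99th percentile (the threshold is illustrative):
        # cap extreme values of var1 at the 99th percentile
        upper = df['var1'].quantile(0.99)
        df['var1'] = df['var1'].clip(upper=upper)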

        5. Standardization

        In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of transforming a variable so that it has zero mean and unit variance (a standard deviation of one).

        #load dataset (load_boston ships with older scikit-learn releases; it was removed in recent versions)
        from sklearn.datasets import load_boston
        dataset = load_boston()
        predictors = dataset.data
        target = dataset.target
        df = pd.DataFrame(predictors, columns = dataset.feature_names)

        #Apply Standardization
        from sklearn.preprocessing import StandardScaler
        k = StandardScaler()
        df2 = k.fit_transform(df)


        Next Steps

        Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites, play around with the data, generate insights from it with the pandas package and build statistical models with the sklearn package. I hope you found this tutorial helpful. I tried to cover all the important topics that a beginner must know about Python. Once you complete this tutorial, you can proudly say you know how to program in Python and how to implement machine learning algorithms using sklearn.