Identify Person, Place and Organisation in content using Python

This article outlines the concept and Python implementation of Named Entity Recognition (NER) using the StanfordNERTagger. Technical challenges such as installation issues, version conflicts, and operating-system differences, which are common in this kind of analysis, are out of scope for this article.

NER NLP using Python

Table of contents:

1. Named Entity Recognition defined
2. Business Use cases
3. Installation Pre-requisites
4. Python Code for implementation
5. Additional Reading: CRF model, Multiple models available in the package
6. Disclaimer

1. Named Entity Recognition Defined
The process of detecting and classifying proper names mentioned in a text is called Named Entity Recognition (NER). In simple words, it locates person names, organizations, locations, and similar entities in the content. This is generally the first step in most Information Extraction (IE) tasks in Natural Language Processing.
NER Sample

2. Business Use Cases

There is a need for NER across multiple domains. Below are a few sample business use cases for reference.
  1. Investment research: to identify company announcements, people’s reactions to them, and their impact on stock prices, one needs to recognise person and organisation names in the text.
  2. Chat-bots in multiple domains: to identify places and dates for booking hotel rooms, air tickets, etc.
  3. Insurance domain: to identify and mask people’s names in feedback forms before analysis. This is needed for regulatory compliance (for example, HIPAA).
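For the masking use case above, the tagger's (token, tag) output can be used to redact names before analysis. A minimal sketch (the mask_persons helper and its placeholder string are our own, not part of NLTK):

```python
def mask_persons(tagged, mask='[NAME]'):
    """Rebuild the text with PERSON-tagged tokens replaced by a placeholder."""
    return ' '.join(mask if tag == 'PERSON' else tok for tok, tag in tagged)

# Example input in the same (token, tag) shape that StanfordNERTagger returns
tagged = [('john', 'PERSON'), ('praised', 'O'), ('the', 'O'),
          ('claims', 'O'), ('process', 'O')]
print(mask_persons(tagged))  # [NAME] praised the claims process
```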

3. Installation Prerequisites
1. Download the zipped Stanford NER folder from the Stanford NLP website.
2. Unzip the zipped folder and save it in a drive.
3. Copy “stanford-ner.jar” from the folder and save it just outside the folder, as shown in the image.
4. Download the caseless models from https://stanfordnlp.github.io/CoreNLP/history.html by clicking on “caseless” as shown below. The models in the first link work as well; however, the caseless models help in identifying named entities even when they are not capitalised as required by formal grammar rules.
5. Save the folder in the same location as the Stanford NER folder for ease of access.
Stanford NER Installation - Step1

NER Installation - Step2

4. Python Code for implementation:
#Import all the required libraries.
import os
from nltk.tag import StanfordNERTagger
import pandas as pd

#Set environmental variables programmatically.
#Set the classpath to the path where the jar file is located
os.environ['CLASSPATH'] = "<path to the file>/stanford-ner-2015-04-20/stanford-ner.jar"

#Set the Stanford models to the path where the models are stored
os.environ['STANFORD_MODELS'] = '<path to the file>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner'

#Set the java jdk path
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path

#Set the path to the model that you would like to use
stanford_classifier  =  '<path to the file>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz'

#Build NER tagger object
st = StanfordNERTagger(stanford_classifier)

#A sample text for NER tagging
text = 'srinivas ramanujan went to the united kingdom. There he studied at cambridge university.'

#Tag the sentence and print the output
tagged = st.tag(str(text).split())
print(tagged)

[(u'srinivas', u'PERSON'), 
(u'ramanujan', u'PERSON'),
(u'went', u'O'),
(u'to', u'O'),
(u'the', u'O'),
(u'united', u'LOCATION'),
(u'kingdom.', u'LOCATION'),
(u'There', u'O'),
(u'he', u'O'),
(u'studied', u'O'),
(u'at', u'O'),
(u'cambridge', u'ORGANIZATION'),
(u'university', u'ORGANIZATION')]
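The tagger returns one tag per whitespace token. A small post-processing helper (our own, not part of NLTK) can group consecutive identically tagged tokens into entity spans:

```python
def collapse_entities(tagged):
    """Group consecutive tokens sharing a non-'O' tag into entity spans."""
    entities = []
    current_tokens, current_tag = [], None
    for token, tag in tagged:
        if tag != 'O' and tag == current_tag:
            current_tokens.append(token)           # extend the current entity
        else:
            if current_tag and current_tag != 'O':
                entities.append((' '.join(current_tokens), current_tag))
            current_tokens, current_tag = [token], tag
    if current_tag and current_tag != 'O':         # flush the last entity
        entities.append((' '.join(current_tokens), current_tag))
    return entities

tagged = [('srinivas', 'PERSON'), ('ramanujan', 'PERSON'), ('went', 'O'),
          ('to', 'O'), ('the', 'O'), ('united', 'LOCATION'),
          ('kingdom.', 'LOCATION'), ('cambridge', 'ORGANIZATION'),
          ('university', 'ORGANIZATION')]
print(collapse_entities(tagged))
# [('srinivas ramanujan', 'PERSON'), ('united kingdom.', 'LOCATION'),
#  ('cambridge university', 'ORGANIZATION')]
```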

5. Additional Reading

The StanfordNER algorithm leverages a general implementation of linear-chain Conditional Random Field (CRF) sequence models. CRFs may look similar to Hidden Markov Models (HMMs) but differ in important ways.

Below are some key points to note about the CRFs in general.
  1. A CRF is a discriminative model, unlike the HMM, and therefore models the conditional probability of the label sequence given the observations.
  2. It does not assume independence of features, unlike the HMM. This means the current word, previous word, and next word can all be used as features in the model.
  3. Relative to HMMs or Maximum Entropy Markov Models (MEMMs), CRFs are the slowest.
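Point 2 above can be made concrete: a CRF feature function may look at the current, previous, and next words at once. An illustrative sketch (the feature names and boundary markers are our own, not Stanford's):

```python
def token_features(tokens, i):
    """Features for token i: CRFs can condition on neighbours, unlike HMMs."""
    return {
        'word': tokens[i].lower(),
        'prev': tokens[i - 1].lower() if i > 0 else '<s>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
        'is_title': tokens[i].istitle(),   # capitalisation is a strong NER cue
    }

tokens = 'He studied at Cambridge University'.split()
print(token_features(tokens, 3))
# {'word': 'cambridge', 'prev': 'at', 'next': 'university', 'is_title': True}
```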

6. Disclaimer
This article explains the implementation of the StanfordNER algorithm for research purposes and does not promote it for commercial gain. For any questions on the commercial aspects of implementing this algorithm, please contact Stanford University.

Case Study : Sentiment analysis using Python

In this article, we will walk you through an application of topic modelling and sentiment analysis to solve a real-world business problem. This approach requires a one-time effort of building a robust taxonomy, which can then be regularly updated as new topics emerge. It is widely used in topic-mapping tools. Please note that this is not a replacement for topic-modelling methodologies such as Latent Dirichlet Allocation (LDA); it complements them.
Text Mining Case Study using Python

Case Study : Topic Modeling and Sentiment Analysis

Suppose you are the head of the analytics team at a leading hotel chain, “Tourist Hotel”. Each day, you receive hundreds of reviews of your hotel on the company’s website and on multiple social media pages. The business faces a challenge of scale in analysing such data and identifying areas of improvement. You use a taxonomy-based approach to identify topics and then use the built-in functionality of the Python NLTK package to attribute sentiment to the comments. This will help you identify what customers like or dislike about your hotel.

Data Structure

The customer review data consists of a serial number, an arbitrary identifier to identify each review uniquely and a text field that has the customer review.
Example : Sentiment Analysis

Steps to topic mapping and sentiment analysis

1. Identify Topics and Sub Topics
2. Build Taxonomy
3. Map customer reviews to topics
4. Map customer reviews to sentiment

Step 1 : Identifying Topics
The first step is to identify the different topics in the reviews. You can use simple approaches such as Term Frequency–Inverse Document Frequency, or more popular methodologies such as LDA, to identify the topics. In addition, it is good practice to consult a subject matter expert in the domain to identify the common topics. For example, the topics in the “Tourist Hotel” example could be “Room Booking”, “Room Price”, “Room Cleanliness”, “Staff Courtesy”, “Staff Availability”, etc.

Step 2 : Build Taxonomy

I. Build Topic Hierarchy

Based on the topics from Step 1, build a taxonomy. A taxonomy can be considered a network of topics, sub-topics, and keywords.
Topic Hierarchy
II. Build Keywords
The taxonomy is built in a CSV file. There are three levels of keywords for each sub-topic: primary keywords, additional keywords, and exclude keywords. The keywords for the topics need to be identified manually and added to the taxonomy file. TF-IDF, bigram frequencies, and LDA can help you identify the right set of keywords. Although there is no single best way to build keywords, below is a suggested approach.

i. Primary keywords are keywords that are mostly specific to the topic. They should be mutually exclusive across topics as far as possible.

ii. Additional keywords are specific to the sub-topic. They need not be mutually exclusive between topics, but it is advisable to maintain exclusivity between sub-topics under the same topic. To explain further, if there is a sub-topic “Price” under both “Room” and “Food”, their additional keywords will overlap. This does not create an issue because the primary keywords are mutually exclusive.

iii. Exclude keywords are used relatively less than the other two types. If two sub-topics have some overlap in additional keywords, or if, for example, the sub-topic “booking” incorrectly maps comments about taxi bookings to room booking, such keywords can be added as exclude keywords to solve the problem.
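To make the three keyword levels concrete, one taxonomy row might look like this in code (an illustrative structure and illustrative values, not the article's actual file):

```python
# A hypothetical taxonomy row, mirroring the CSV columns described above
taxonomy_row = {
    'Topic': 'Room',
    'Subtopic': 'Room Cleanliness',
    'PrimaryKeywords': 'clean*, dirty, dust*',   # '*' acts as a wildcard
    'AdditionalKeywords': 'room, bathroom',
    'ExcludeKeywords': 'lobby',
}

# Split a comma-separated keyword field into individual keywords
primary = [w.strip() for w in taxonomy_row['PrimaryKeywords'].split(',')]
print(primary)  # ['clean*', 'dirty', 'dust*']
```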

Snapshot of sample taxonomy:

Sample Taxonomy

Note: while building the keyword list, you can put an “*” at the end of a word; it acts as a wildcard character. For example, all the different inflections of “clean”, such as “cleaned”, “cleanly”, and “cleanliness”, can be handled by the single keyword “clean*”. If you need to add a phrase or any keyword containing a special character, wrap it in double quotes; for example, “online booking” and “Wi-Fi” need to be in double quotes.
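The wildcard convention can be implemented with a small regex check, along the lines of the asterix_handler function used later in the article (the matches_keyword name here is our own):

```python
import re

def matches_keyword(keyword, word):
    """True if word matches keyword; a trailing '*' matches any suffix."""
    if keyword.endswith('*'):
        # Anchor at the start so 'clean*' matches 'cleanliness' but not 'unclean'
        return re.match(re.escape(keyword[:-1]), word) is not None
    return keyword == word

print(matches_keyword('clean*', 'cleanliness'))  # True
print(matches_keyword('clean*', 'unclean'))      # False
print(matches_keyword('dirty', 'dirty'))         # True
```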

Benefits of using the taxonomic approach
Topic-modelling approaches identify topics based on the keywords present in the content. Novel keywords that relate to a topic but only appear in future data are not identified. There are also use cases where businesses want to track specific topics that topic-modelling approaches may not surface as topics at all.

Step 3 : Map customer reviews to topic

Each customer comment is mapped to one or more sub-topics. Some comments may not be mapped to any sub-topic. Such instances need to be inspected manually to check whether we missed any topics in the taxonomy, so that it can be updated. Generally, about 90% of comments have at least one topic; the rest could be vague. For example, “it was a good experience” does not tell us anything specific, and it is fine to leave it unmapped.
Snapshot of how the topics are mapped:
Topic Mapping

Below is the Python code that maps reviews to categories. First, import all the libraries needed for this task, installing them if necessary.
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Download Datafiles
Customer Review

Download Python Code
If you copy and paste the code from the article, some lines might not work because Python enforces indentation strictly, so download the Python code from the link below. The code is built in Python 2.7.

# Import reviews data
df = pd.read_csv("D:/customer_reviews.csv")

# Import taxonomy
df_tx = pd.read_csv("D:/taxonomy.csv")

Build functions to handle the various repetitive tasks during the mapping exercise. The first function treats taxonomy words ending with “*” as wildcards. It takes the keywords and the comment words as input and uses regular expressions to report whether any keyword matches.
def asterix_handler(asterixw, lookupw):
    mtch = "F"
    for word in asterixw:
        for lword in lookupw:
            if(bool(re.search("^" + word[:-1], lword)) == True):
                mtch = "T"
    return mtch
The next function removes all punctuation, which helps with data cleaning. You can edit the list of punctuation characters for your own custom punctuation removal.
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

The following function removes just the double quotes (""). This differs from the function above in that it only handles double quotes; recall that we wrap phrases and keywords containing special characters in double quotes.
def remov_quote(withquote):
    quote = '"'
    without_quote = ""
    for char in withquote:
        if char not in quote:
            without_quote = without_quote + char
    return without_quote

Split each document into sentences and append them one below the other for sentence-level topic mapping.
sentence_data = pd.DataFrame(columns=['slno', 'text'])
for d in range(len(df)):
    doc = df.iloc[d, 1].split('.')
    for s in doc:
        temp = {'slno': [df['slno'][d]], 'text': [s]}
        sentence_data = pd.concat([sentence_data, pd.DataFrame(temp)])

# Drop empty text rows, if any
sentence_data.dropna(subset=['text'], inplace=True)

data = sentence_data
mapped_data = pd.DataFrame(columns=['slno', 'text', 'Category'])
for k in range(len(data)):
    comment = remov_punct(data.iloc[k, 1])
    data_words = [str(x.strip()).lower() for x in str(comment).split()]
    data_words = list(filter(None, data_words))

    for l in range(len(df_tx)):
        key_flag = False
        and_flag = True
        not_flag = False
        kw_clean = aw_clean = nw_clean = 'nan'
        if str(df_tx['PrimaryKeywords'][l]) != 'nan':
            kw_clean = remov_quote(df_tx['PrimaryKeywords'][l])
        if str(df_tx['AdditionalKeywords'][l]) != 'nan':
            aw_clean = remov_quote(df_tx['AdditionalKeywords'][l])
        if str(df_tx['ExcludeKeywords'][l]) != 'nan':
            nw_clean = remov_quote(df_tx['ExcludeKeywords'][l])

        # Primary keywords: at least one must match
        if str(kw_clean) != 'nan':
            key_words2 = set(str(x.strip()).lower() for x in kw_clean.split(','))
            if (set(data_words) & key_words2) or asterix_handler(key_words2, data_words) == 'T':
                key_flag = True
        # Additional keywords: if present, at least one must match
        if str(aw_clean) != 'nan':
            and_words2 = set(str(x.strip()).lower() for x in aw_clean.split(','))
            and_flag = bool(set(data_words) & and_words2) or asterix_handler(and_words2, data_words) == 'T'
        # Exclude keywords: none may match
        if str(nw_clean) != 'nan':
            not_words2 = set(str(x.strip()).lower() for x in nw_clean.split(','))
            if (set(data_words) & not_words2) or asterix_handler(not_words2, data_words) == 'T':
                not_flag = True

        if key_flag and and_flag and not not_flag:
            temp = {'slno': [data.iloc[k, 0]], 'text': [data.iloc[k, 1]], 'Category': [df_tx['Subtopic'][l]]}
            mapped_data = pd.concat([mapped_data, pd.DataFrame(temp)])

# Output mapped data
mapped_data.to_csv("D:/mapped_data.csv", index=False)

Step 4: Map customer reviews to sentiment
# Read the category-mapped data for sentiment mapping
catdata = pd.read_csv("D:/mapped_data.csv")

Build a function that leverages NLTK's built-in sentiment functionality. The output 1 means positive, 0 means neutral, and -1 means negative. You can choose your own thresholds for positive, neutral, and negative sentiment.

def findpolar(test_data):
    sia = SentimentIntensityAnalyzer()
    polarity = sia.polarity_scores(test_data)["compound"]
    if polarity >= 0.1:
        foundpolar = 1
    elif polarity <= -0.1:
        foundpolar = -1
    else:
        foundpolar = 0
    return foundpolar

# Apply the function to each mapped comment and output the sentiment-mapped data
catdata['Sentiment'] = catdata['text'].apply(findpolar)
catdata.to_csv("D:/sentiment_mapped_data.csv", index=False)
Output : Sentiment Analysis

Additional Reading

Polarity Scoring Explained: 

NLTK offers the Valence Aware Dictionary for sEntiment Reasoning (VADER) model, which identifies both the direction (polarity) and the magnitude (intensity) of sentiment in text. Below is a high-level explanation of the methodology.

VADER combines lexical features and rules to identify sentiment and intensity, so it does not need any training data. To explain further, for the sentence “the food is good”, it is easy to identify positive sentiment. VADER goes a step further and estimates intensity from rule-based signals such as punctuation, capitalised words, and degree modifiers.
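To illustrate the lexicon-plus-rules idea, here is a toy scorer in the spirit of VADER (the tiny lexicon, the boost factors, and the scoring rules are all our own inventions, not the real VADER lexicon or rule set):

```python
# Toy lexicon: word -> base valence (illustrative values only)
LEXICON = {'good': 1.9, 'great': 3.1, 'bad': -2.5, 'terrible': -3.4}

def toy_score(sentence):
    """Sum word valences, boosted by capitalisation and exclamation marks."""
    score = 0.0
    for word in sentence.split():
        base = LEXICON.get(word.strip('!').lower(), 0.0)
        if word.isupper() and base:
            base *= 1.5                       # ALL-CAPS boosts intensity
        if base > 0:
            base += 0.3 * word.count('!')     # '!' amplifies in the same direction
        elif base < 0:
            base -= 0.3 * word.count('!')
        score += base
    return score

print(toy_score('the food is good'))     # 1.9
print(toy_score('the food is GOOD!!'))   # higher: caps and punctuation boost
```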

The polarity scores for different variations of similar sentences are as follows:

Polarity Score

Use cases where training your own sentiment model is preferable to the Sentiment Intensity Analyzer:

Although VADER works well across multiple domains, there are domains where it is preferable to build one's own sentiment model. Below are two examples of such use cases.
  1. Customer reviews of alcoholic beverages: it is common to see people use otherwise negative sentiment words to describe a positive experience. For example, the sentence “this sh*t is fu**ing good” means the drink is good, but the VADER approach scores it as strongly negative.
  2. Patient reviews of hospital treatment: a patient's description of their problem is neutral in sentiment, but the VADER approach treats it as negative. For example, the sentence “I had an unbearable back pain and your medication cured me in no time” is given “-0.67”, suggesting negative sentiment.

Run Python from R

This article explains how to call or run Python from R. Both tools have their own advantages and disadvantages, and it is always a good idea to combine the best packages and functions from each. In the data science world, both tools have a good market share: R is mainly known for data analysis, statistical modeling, and visualization, while Python is popular for deep learning and natural language processing.

In a recent KDnuggets analytics software survey poll, Python and R were ranked the top two tools for data science and machine learning. If you really want to boost your career in data science, these are the languages you need to focus on.
Combine Python and R

RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

Install and Load Reticulate Package

Run the commands below to install the package and load it into your session.
# Install reticulate package
install.packages("reticulate")

# Load reticulate package
library(reticulate)

Check whether Python is available on your system
py_available()
It returns TRUE/FALSE. If it is TRUE, Python is installed on your system.

Import a python module within R

You can use the import( ) function to import a particular package or module.
os <- import("os")
os$getcwd()
The above command returns the working directory.
[1] "C:\\Users\\DELL\\Documents"

You can use the listdir( ) function from the os module to see all the files in the working directory.
os$listdir(".")
 [1] ".conda"          ".gitignore"      ".httr-oauth"
 [4] ".matplotlib"     ".RData"          ".RDataTmp"
 [7] ".Rhistory"       "1.pdf"           "12.pdf"
[10] "122.pdf"         "124.pdf"         "13.pdf"
[13] "1403.2805.pdf"   "2.pdf"           "3.pdf"
[16] "AIR.xlsx"        "app.r"           "Apps"
[19] "articles.csv"    "Attrition_Telecom.xlsx"  "AUC.R"

Install Python Package

Step 1 : Create a new conda environment
conda_create("r-reticulate")

Step 2 : Install a package within the conda environment
conda_install("r-reticulate", "numpy")
Since numpy is already installed here, you don't need to install it again; the above example is just for demonstration.

Step 3 : Load the package
numpy <- import("numpy")

Working with numpy array

Let's create a sample numpy array
y <- array(1:4, c(2, 2))
x <- numpy$array(y)
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Transpose the above array
numpy$transpose(x)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Eigenvalues and eigenvectors
numpy$linalg$eig(x)
[1] -0.3722813  5.3722813

           [,1]       [,2]
[1,] -0.9093767 -0.5657675
[2,]  0.4159736 -0.8245648

Mathematical Functions

Working with Python interactively

You can create an interactive Python console within R session. Objects you create within Python are available to your R session (and vice-versa).

By using the repl_python( ) function, you can make the session interactive. Download the dataset used in the program below.

# Load Pandas package
import pandas as pd

# Importing Dataset
travel = pd.read_excel("AIR.xlsx")

# Number of rows and columns
travel.shape

# Select random no. of rows
travel.sample(n = 10)

# Group By

# Filter
t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

# Return to R
exit
Note : You need to enter exit to return to the R environment.
call python from R
Run Python from R

How to access objects created in python from R

You can use the py object to access objects created within Python.
summary(py$t)
Here, R's summary( ) function is applied to the dataframe t, which was created in Python. Similarly, you can create a line plot using the ggplot2 package.
# Line chart using ggplot2
ggplot(py$t, aes(AIR, Year)) + geom_line()

How to access objects created in R from Python

You can use the r object to accomplish this task. 

1. Let's create an object in R
mydata = head(cars, n=15)
2. Use the R-created object within the Python REPL
import pandas as pd
r.mydata.describe()

Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in Python. It supports various statistical and machine learning algorithms.

# Load libraries
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression

# load the iris datasets
iris = datasets.load_iris()

# Developing logit model
model = LogisticRegression()
model.fit(iris.data, iris.target)

# Scoring
actual = iris.target
predicted = model.predict(iris.data)

# Performance Metrics
print(metrics.classification_report(actual, predicted))
print(metrics.confusion_matrix(actual, predicted))
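The confusion matrix printed above can also be computed by hand, which clarifies what the report summarises. A minimal pure-Python sketch (the confusion_counts helper is our own, not part of sklearn):

```python
def confusion_counts(actual, predicted, n_classes):
    """Count (actual, predicted) pairs into an n_classes x n_classes grid;
    rows are true labels, columns are predicted labels."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
for row in confusion_counts(actual, predicted, 3):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 1]
```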

Other Useful Functions

To see configuration of python

Run the py_config( ) command to find the version of Python installed on your system. It also shows details about Anaconda and numpy.
python:         C:\Users\DELL\ANACON~1\python.exe
libpython:      C:/Users/DELL/ANACON~1/python36.dll
pythonhome:     C:\Users\DELL\ANACON~1
version:        3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\DELL\ANACON~1\lib\site-packages\numpy
numpy_version:  1.14.2

To check whether a particular package is installed

In the following program, we check whether the pandas package is installed.
py_module_available("pandas")

Gartner’s 2018 Take on Data Science Tools

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2018 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the new section:

IT Research Firms

IT research firms study software products and corporate strategies, survey customers regarding their satisfaction with the products and services, and then provide their analysis in reports they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis these reports contain extremely interesting reading. While the reports focus on companies, they often also describe how commercial tools integrate open source tools such as R, Python, H2O, TensorFlow, and others.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.

Gartner, Inc. is one of the companies that provides such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue, or lower revenue combined with high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Hereafter, I refer to these as simply vision and ability. Figure 3a shows the resulting “Magic Quadrant” plot for 2018, and 3b shows the plot for the previous year.

The Leader’s Quadrant is the place for companies whose future direction is in line with their customers’ needs and who have the resources to execute that vision. The further toward the upper-right corner, the better the combined score. KNIME is in the prime position, with H2O.ai showing greater vision but lower ability to execute. This year KNIME gained the ability to run H2O.ai algorithms, so the two may be viewed as complementary tools rather than outright competitors.

Alteryx and SAS have nearly the same combined scores, but note that Gartner studied only SAS Enterprise Miner and SAS Visual Analytics. The latter includes Visual Statistics, and Visual Data Mining and Machine Learning. Excluded was the SAS System itself since Gartner focuses on tools that are integrated. This lack of integration may explain SAS’ decline in vision from last year.

KNIME and RapidMiner are quite similar tools, as both are driven by an easy-to-use and reproducible workflow interface. Both offer free and open source versions, but the companies differ considerably in their commitment to the open source concept. KNIME’s desktop version is free and open source, and the company says it always will be. RapidMiner’s free version, on the other hand, is limited by a cap on the amount of data it can analyze (10,000 cases), and new features usually arrive only via a commercial license. In the previous year’s Magic Quadrant, RapidMiner was slightly ahead, but now KNIME is in the lead.

Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Figure 3b. Gartner Magic Quadrant for Data Science Platforms 2017.

The companies in the Visionaries Quadrant are those that have good future plans but may not have the resources to execute that vision. Of these, IBM took a big hit by landing here after being in the Leader’s Quadrant for several years. Now they’re in a near-tie with Microsoft and Domino. Domino shot up from the bottom of that quadrant towards the top. They integrate many different open source and commercial software packages (e.g., SAS, MATLAB) into their Domino Data Science Platform. Databricks and Dataiku offer cloud-based analytics similar to Domino, though lacking access to commercial tools.

Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision. Mathworks, the makers of MATLAB, continues to “stay the course” with its proprietary tools while most of the competition offers much better integration into the ever-expanding universe of open source tools. Tibco replaces Quest in this quadrant due to their purchase of Statistica. Whatever will become of the red-headed stepchild of data science? Statistica has been owned by four companies in four years (Statsoft, Dell, Quest, Tibco)! Users of the software have got to be considering other options. Tibco also purchased Alpine Data in 2017, accounting for its disappearance from Figure 3b to 3a.

Members of the Niche Players quadrant offer tools that are not as broadly applicable. Anaconda is new to Gartner coverage this year. It offers in-depth support for Python. SAP has a toolchain that Gartner calls “fragmented and ambiguous.”  Angoss was recently purchased by Datawatch. Gartner points out that after 20 years in business, Angoss has only 300 loyal customers. With competition fierce in the data science arena, one can’t help but wonder how long they’ll be around. Speaking of deathwatches, once the king of Big Data, Teradata has been hammered by competition from open source tools such as Hadoop and Spark. Teradata’s net income was higher in 2008 than it is today.

As of 2/26/2018, RapidMiner is giving away copies of the Gartner report here.


Coding in Python with SAS University Edition

Good news learners! SAS University Edition has gone back to school and learned some new tricks.

With the December 2017 update, SAS University Edition now includes the SASPy package, available in its Jupyter Notebook interface. If you're keeping track, you know that SAS University Edition has long had support for Jupyter Notebook. With that, you can write and run SAS programs in a notebook-style environment. But until now, you could not use that Jupyter Notebook to run Python programs. With the latest update, you can -- and you can use the SASPy library to drive SAS features like a Python coder.

Oh, and there's another new trick that you'll find in this version: you can now use SAS (and Python) to access data from HTTPS websites -- that is, sites that use SSL encryption. Previous releases of SAS University Edition did not include the components that are needed to support these encrypted connections. That's going to make downloading web data much easier, not to mention using REST APIs. I'll show one HTTPS-enabled example in this post.

How to create a Python notebook in SAS University Edition

When you first access SAS University Edition in your web browser, you'll see a colorful "Welcome" window. From here, you can (A) start SAS Studio or (B) start Jupyter Notebook. For this article, I'll assume that you select choice (B). However, if you want to learn to use SAS and all of its capabilities, SAS Studio remains the best method for doing that in SAS University Edition.

When you start the notebook interface, you're brought into the Jupyter Home page. To get started with Python, select New->Python 3 from the menu on the right. You'll get a new empty Untitled notebook. I'm going to assume that you know how to work with the notebook interface and that you want to use those skills in a new way...with SAS. That is why you're reading this, right?

Move data from a pandas data frame to SAS

pandas is the standard for Python programmers who work with data. The pandas module is included in SAS University Edition; you can use it to read and manipulate data frames (which you can think of as tables). Here's an example of retrieving a data file from GitHub and loading it into a data frame. (Read more about this particular file in this article. Note that GitHub uses HTTPS, which is now possible to access in SAS University Edition!)

import saspy
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv')

Here's the result. This is all straight Python stuff; we haven't started using any SAS yet.

Before we can use SAS features with this data, we need to move the data into a SAS data set. SASPy provides a dataframe2sasdata() method (shorter alias: df2sd) that can import your Python pandas data frame into a SAS library and data set. The method returns a SASdata object. This example copies the data into WORK.PROBLY in the SAS session:

sas = saspy.SASsession()
probly = sas.df2sd(df,'PROBLY')

The SASdata object also includes a describe() method that yields a result that's similar to what you get from pandas:

Drive SAS procedures with Python

SASPy includes a collection of built-in objects and methods that provide APIs to the most commonly used SAS procedures. The APIs present a simple "Python-ic" style approach to the work you're trying to accomplish. For example, to create a SAS-based histogram for a variable in a data set, simply use the hist() method.

SASPy offers dozens of simple API methods that represent statistics, machine learning, time series, and more. You can find them documented on the GitHub project page. Note that since SAS University Edition does not include all SAS products, some of these API methods might not work for you; for example, the SASml.forest() method relies on SAS machine-learning procedures that are not included in SAS University Edition.
In SASPy, all methods generate SAS program code behind the scenes. If you like the results you see and want to learn the SAS code that was used, you can flip on the "teach me SAS" mode in SASPy.


Here's what SASPy reveals about the describe() and hist() methods we've already seen:

Interesting code, right? Does it make you want to learn more about the SCALE= option of PROC SGPLOT?

If you want to experiment with SAS statements that you've learned, you don't need to leave the current notebook and start over. There's also a built-in %%SAS "magic command" that you can use to try out a few of these SAS statements.

proc means data=sashelp.cars stackodsoutput n nmiss median mean std min p25 p50 p75 max;run;

Python limitations in SAS University Edition

SAS University Edition includes over 300 Python modules to support your work in Jupyter Notebook. To see a complete list, run the help('modules') command from within a Python notebook. This list includes the common Python packages required to work with data, such as pandas and NumPy. However, it does not include any of the popular Python-based machine learning modules, nor any modules to support data visualization. Of course, SASPy has support for most of this within its APIs, so why would you need anything else...right?

Because SAS University Edition is packaged in a virtual machine that you cannot alter, you don't have the option of installing additional Python modules. You also don't have access to the Jupyter terminal, which would allow you to control the system from a shell-like interface. All of this is possible (and encouraged) when you have your own SAS installation with your own instance of SASPy. It's all waiting for you when you've outgrown the learning environment of SAS University Edition and you're ready to apply your SAS skills and tech to your official work!

Learn more

The post Coding in Python with SAS University Edition appeared first on The SAS Dummy.


Come on in, we're open: The openness of SAS® 9.4

The SAS® platform is now open to be accessed from open-source clients such as Python, Lua, Java, the R language, and REST APIs to leverage the capabilities of SAS® Viya® products and solutions. You can analyze your data in a cloud-enabled environment that handles large amounts of data in a variety of different formats. To find out more about SAS Viya, see the “SAS Viya: What's in it for me? The user.” article.

This blog post focuses on the openness of SAS® 9.4 and discusses features such as the SASPy package and the SAS kernel for Jupyter Notebook and more as clients to SAS. Note: This blog post is relevant for all maintenance releases of SAS 9.4.


The SASPy package enables you to connect to and run your analysis from SAS 9.4 using the object-oriented methods and objects from the Python language, as well as Python magic methods. SASPy translates the objects and methods you add into SAS code before executing it. To use SASPy, you must have SAS 9.4 and Python 3.x.
Note: SASPy is an open-source project that encourages your contributions.

After you have completed the installation and configuration of SASPy, you can import the SASPy package as demonstrated below:
Note: I used Jupyter Notebook to run the examples in this blog post.

1.   Import the SASPy package:

Openness of SAS® 9.4

2.   Start a new session. The sas object is created as a result of starting a SAS session using a locally installed version of SAS under Microsoft Windows. After this session is successfully established, the following note is generated:

Adding Data

Now that the SAS session is started, you need to add some data to analyze. This example uses SASPy to read a CSV file that provides census data based on the ZIP Codes in Los Angeles County and create a SASdata object named tabl:

To view the attributes of this SASdata object named tabl, use the Python print() function as shown below, which displays the libref and the SAS data set name. It also shows the results format as Pandas, which is the default result output for tables.

Using Methods to Display and Analyze Data

This section provides some examples of how to use different methods to interact with SAS data via SASPy.

Head() Method

After loading the data, you can look at the first few records of the ZIP Code data, which is easy using the familiar head() method in Python. This example uses the head() method on the SASdata object tabl to display the first five records. The output is shown below:

Describe() Method

After verifying that the data is what you expected, you can now analyze the data. To generate a simple summary of the data, use the Python describe() method in conjunction with the index [1:3]. This combination generates a summary of all the numeric fields within the table and displays only the second and third records. The subscript works only when the result is set to Pandas and does not work if set to HTML or Text, which are also valid options.
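Because SASPy returns describe() output as a pandas DataFrame, the [1:3] subscript is ordinary positional row slicing. A pandas-only sketch (with made-up data standing in for the ZIP Code table) shows the idea:

```python
import pandas as pd

# Toy stand-in for the ZIP Code table (hypothetical columns and values)
df = pd.DataFrame({"pop": [100, 250, 300, 400], "income": [50, 60, 55, 70]})

summary = df.describe()    # rows: count, mean, std, min, 25%, 50%, 75%, max
subset = summary[1:3]      # positional row slice -> the 'mean' and 'std' rows
print(list(subset.index))  # ['mean', 'std']
```

This is why the subscript only works when the result format is set to Pandas: HTML and Text results are not DataFrames, so they cannot be sliced this way.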

Teach_me_SAS() Method

The SAS code generated from the object-oriented Python syntax can also be displayed using SASPy with the teach_me_SAS() method. When you set the argument in this method to True, which is done using a Boolean value, the SAS code is displayed without executing the code:

ColumnInfo() Method

In the next cell, use the columnInfo() method to display the information about each variable in the SAS data set. Note: The SAS code is generated as a result of adding the teach_me_SAS() method in the last section:

Submit() Method

Then, use the submit() method to execute the PROC CONTENTS step displayed in the cell above directly from Python. The submit method returns a dictionary with two keys, LST and LOG. The LST key contains the results and the LOG key returns the SAS log. The results are displayed as HTML; the HTML class is imported to render them.

The SAS Kernel Using Jupyter Notebook

Jupyter Notebook can run programs in various programming languages, including SAS when you install and configure the SAS kernel. Using the SAS kernel is another way to run SAS interactively using a web-based program, which also enables you to save the analysis in a notebook. See the links above for details about installation and configuration of the SAS kernel. To verify that the SAS kernel installed successfully, you can run the following command: jupyter kernelspec list

From the command line, use the following command to start Jupyter Notebook: jupyter notebook. The screenshot below shows the Jupyter Notebook session that starts when you run the command. To execute SAS syntax from Jupyter Notebook, select SAS from the New drop-down list as shown below:

You can add SAS code to a cell in Jupyter Notebook and execute it. The following code adds a PRINT procedure and a SGPLOT procedure. The output is in HTML5 by default. However, you can specify a different output format if needed.

You can also use magics in the cell such as the %%python magic even though you are using the SAS kernel. You can do this for any kernel that you have installed.

Other SAS Goodness

There are more ways of interacting with other languages with SAS as well. For example, you can use the Groovy procedure to run Groovy statements on the Java Virtual Machine (JVM). You can also use the LUA procedure to run LUA code from SAS along with the ability to call most SAS functions from Lua. For more information, see “Using Lua within your SAS programs.” Another very powerful feature is the DATA step JavaObject, which provides the ability to instantiate Java classes and access fields and methods. The DATA step JavaObject has been available since SAS® 9.2.


SASPy Documentation

Introducing SASPy: Use Python code to access SAS

Come on in, we're open: The openness of SAS® 9.4 was published on SAS Users.


Linear Regression in Python

Linear Regression is a supervised statistical technique where we try to estimate the dependent variable with a given set of independent variables. We assume the relationship to be linear and our dependent variable must be continuous in nature.
Python : Linear Regression
In the following diagram we can see that as horsepower increases mileage decreases thus we can think to fit linear regression. The red line is the fitted line of regression and the points denote the actual observations.

The vertical distance between the points and the fitted line (line of best fit) are called errors. The main idea is to fit this line of regression by minimizing the sum of squares of these errors. This is also known as principle of least squares.

  • Estimating the price (Y) of a house on the basis of its Area (X1), Number of bedrooms (X2), proximity to market (X3) etc. 
  • Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower(X2), number of cylinders(X3), whether it is automatic or manual (X4) etc. 
  • To find the treatment cost or to predict the treatment cost on the basis of factors like age, weight, past medical history, or even if there are blood reports, we can use the information from the blood report.
Simple Linear Regression Model: In this we try to predict the value of dependent variable (Y) with only one regressor or independent variable(X).

Multiple Linear Regression Model: Here we try to predict the value of dependent variable (Y) with more than one regressor or independent variables.

The linear regression model:
Here 'y' is the dependent variable to be estimated, and X are the independent variables and ε is the error term.
Multiple Regression Equation

Assumptions of linear regression:
  • There must be a linear relationship between the dependent and independent variables.
  • Sample observations are independent.
  • Error terms are normally distributed with mean 0. 
  • No multicollinearity -  When the independent variables in my model are highly linearly related then such a situation is called multicollinearity.
  • Error terms are identically and independently distributed. (Independence means absence of autocorrelation).
  • Error terms have constant variance i.e. there is no heteroscedasticity.
  • No outliers are present in the data.

Important Model Performance Metrics

Coefficient of Determination (R square)
It gives the proportion of variation in Y that can be explained by the independent variables. Mathematically, it is one minus the ratio of the residual sum of squares to the total sum of squares: R² = 1 − RSS/TSS.

If our fit is perfect then R² = 1, while R² = 0 indicates a poor fit (the model explains none of the variation). Thus R² lies between 0 and 1.
If the value of R² is 0.912, this suggests that 91.2% of the variation in Y can be explained by the explanatory variables in the model.
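As a concrete illustration of R² = 1 − RSS/TSS, here is a small numpy sketch with made-up observed and fitted values:

```python
import numpy as np

y_obs = np.array([3.0, 5.0, 7.0, 9.0])     # observed values (toy data)
y_fit = np.array([2.8, 5.1, 7.2, 8.9])     # fitted values (toy data)

rss = np.sum((y_obs - y_fit) ** 2)         # residual sum of squares = 0.10
tss = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares = 20.0
r2 = 1 - rss / tss
print(round(r2, 3))                        # 0.995
```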
Why R square alone is not a good measure:
On addition of a new variable the error is sure to decrease, thus R square always increases whenever a new variable is added to the model. This may not reflect the importance of the variable.
For example, in a model determining the price of a house, suppose we had the variables GDP, inflation rate, and area. If we add a new, irrelevant variable such as the number of plane crashes, R square will still increase.

Adjusted R square:

Adjusted R square is given by:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

where n is the number of observations and k is the number of regressors or predictors.

Hence adjusted R square will always be less than or equal to R square.

On addition of a variable, R square increases and so does k in the denominator.
If the variable is actually useful then R square increases by a large amount, which compensates for the increase in k. On the other hand, if the variable is irrelevant then R square hardly increases on its addition, and hence the adjusted R square will eventually decrease.
Thus as a general thumb rule if adjusted R square increases when a new variable is added to the model, the variable should remain in the model. If the adjusted R square decreases when the new variable is added then the variable should not remain in the model.
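The thumb rule can be checked numerically. A small sketch (with illustrative R² and k values, not taken from any model in this post):

```python
def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A useless regressor barely moves R^2 but raises k, so adjusted R^2 falls:
before = adjusted_r2(0.912, 100, 3)   # ~0.9093
after = adjusted_r2(0.9125, 100, 4)   # R^2 up only 0.0005 -> ~0.9088
print(before > after)                 # True: the new variable should be dropped
```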
Why error terms should be normally distributed?
For parameter estimate (i.e. estimating the βi’s) we don't need that assumption. But, if it is not a normal distribution, some of those hypotheses tests which we will be doing as part of diagnostics may not be valid. 
For example: to check whether a beta (regression coefficient) is significant or not, we do a t-test. If the errors are not normally distributed, the statistic we derive may not follow a t-distribution, so the diagnostic or hypothesis test is not valid. Similarly, the F-test for linear regression, which checks whether any of the independent variables in a multiple linear regression model are significant, will not be valid.
Why is expectation of error always zero?

The error term is the deviation between the observed points and the fitted line. The observed points lie both above and below the fitted line, so if we take the average of all the deviations it should be 0 or near 0. The zero conditional mean assumption formalizes this: negative and positive errors cancel out on average, which lets us estimate the dependent variable without systematic bias.
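This property can be verified directly: whenever an intercept is included, OLS residuals average to (numerically) zero. A numpy-only sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(size=50)        # toy linear data with noise

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

print(abs(resid.mean()))                   # ~0: positive and negative errors cancel
```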

Why multicollinearity is a problem? 

If the Xi’s are highly correlated then |X’X| will be close to 0, and hence the inverse of (X’X) will not exist or will have indefinitely large entries. Mathematically, Var(β̂) = σ²(X’X)⁻¹, which becomes indefinitely large in the presence of multicollinearity. Long story short, multicollinearity inflates the estimated standard errors of the regression coefficients, which makes some variables statistically insignificant when they should be significant.

How can you detect multicollinearity?
1. Bunch Map Analysis: By plotting scatter plots between the various Xi’s we can get a visual description of how the variables are related.

2. Correlation Method: By calculating the correlation coefficients between the variables we can get to know about the extent of multicollinearity in the data.

3.  VIF (Variance Inflation Factor) Method: Firstly we fit a model with all the variables and then calculate the variance inflation factor (VIF) for each variable. VIF measures how much the variance of an estimated regression coefficient increases when predictors are correlated. The higher the VIF for the ith regressor, the more strongly it is correlated with the other variables.

So what is the Variance Inflation Factor?
The variance inflation factor (VIF) for an explanatory variable is given by 1/(1 − R²). Here, we take that particular X as the response variable and all the other explanatory variables as independent variables; that is, we run a regression of one explanatory variable on the remaining explanatory variables, and R² comes from that auxiliary regression.
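That auxiliary-regression definition can be implemented directly with numpy; the helper below is a sketch with names of our own choosing, not a library API:

```python
import numpy as np

def vif(X, i):
    # Regress column i on the remaining columns (plus an intercept),
    # then return 1 / (1 - R^2) of that auxiliary regression.
    y = X[:, i]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated regressor
X = np.column_stack([x1, x2, x3])

print(vif(X, 0) > vif(X, 2))           # the collinear column has the larger VIF
```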
Detecting heteroscedasticity!
  1. Graphical Method: First run the regression analysis and then plot the error terms against the predicted values (Ŷi). If a definite pattern (linear, quadratic, or funnel shaped) appears in the scatter plot, heteroscedasticity is present.
  2. Goldfeld-Quandt (GQ) Test: It assumes that the heteroscedastic variance σᵢ² is positively related to one of the explanatory variables, and that the errors are normally distributed. Thus if heteroscedasticity is present, the variance will be high for large values of X.

Steps for GQ test:
  1. Order/ rank (ascending) the observations according to the value of Xi beginning with the lowest X value.
  2. Omit ‘c’ central observations and divide the remaining (n-c) observations into 2 groups of (n-c)/2 observations each.
  3. Fit separate OLS regression to both the groups and obtain residual sum of squares (RSS1 and RSS2) for both the groups.
  4. Obtain F = RSS2/RSS1.
It follows an F distribution with (n−c)/2 − k degrees of freedom in both the numerator and the denominator,
where k is the number of parameters to be estimated, including the intercept.
If the errors are homoscedastic then the two residual sums of squares RSS2 and RSS1 turn out to be roughly equal, i.e. F will tend to 1.
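The steps above can be sketched on synthetic data whose error spread grows with X (the sample size, the value of c, and the helper name are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(2)
n, c = 100, 20                          # c central observations are omitted
x = np.sort(rng.uniform(1, 10, n))      # regressor, ranked ascending
y = 2 + 3 * x + x * rng.normal(size=n)  # error spread grows with x (toy data)

def group_rss(xg, yg):
    # OLS with intercept on one group; return its residual sum of squares
    Z = np.column_stack([np.ones_like(xg), xg])
    beta, *_ = np.linalg.lstsq(Z, yg, rcond=None)
    r = yg - Z @ beta
    return r @ r

h = (n - c) // 2                        # (n - c)/2 = 40 observations per group
rss1 = group_rss(x[:h], y[:h])          # group with the smallest x values
rss2 = group_rss(x[-h:], y[-h:])        # group with the largest x values
F = rss2 / rss1                         # well above 1 under heteroscedasticity
print(round(F, 2))
```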
Dataset used:
We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (CMS) using:

  1. Cement - kg in a m3 mixture
  2. Blast Furnace Slag - kg in a m3 mixture
  3. Fly Ash - kg in a m3 mixture
  4. Water - kg in a m3 mixture
  5. Superplasticizer - kg in a m3 mixture
  6. Coarse Aggregate - kg in a m3 mixture
  7. Fine Aggregate - kg in a m3 mixture
  8. Age - Day (1-365)

Dataset - Download Data 

Importing the libraries:

Numpy, pandas and matplotlib.pyplot are imported with aliases np, pd and plt respectively.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading the data
We load our data using pd.read_csv( )
data = pd.read_csv("Concrete_Data.csv")
Now the data is divided into independent (x) and dependent variables (y)
x = data.iloc[:,0:8]
y = data.iloc[:,8:]

Splitting the data into training and test sets

Using sklearn we split 80% of our data into the training set and the rest into the test set. Setting random_state gives the same train/test split every time the code is run. Note that train_test_split lives in sklearn.model_selection (the older sklearn.cross_validation module has been removed).
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 100) 

Running linear regression using sklearn
Using sklearn linear regression can be carried out using LinearRegression( ) class. sklearn automatically adds an intercept term to our model.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm = lm.fit(x_train,y_train)   #lm.fit(input,output)
The coefficients are given by:
array([[ 0.12415357,  0.10366839,  0.093371  , -0.13429401,  0.28804259,
0.02065756, 0.02563037, 0.11461733]])

To store coefficients in a data frame along with their respective independent variables -
coefficients = pd.concat([pd.DataFrame(x_train.columns),pd.DataFrame(np.transpose(lm.coef_))], axis = 1)
0            Cement  0.124154
1             Blast  0.103668
2           Fly Ash  0.093371
3             Water -0.134294
4  Superplasticizer  0.288043
5                CA  0.020658
6                FA  0.025630
7               Age  0.114617
The intercept is given by lm.intercept_:
To predict the values of y on the test set we use lm.predict( )
y_pred = lm.predict(x_test)
Errors are the difference between observed and predicted values.
y_error = y_test - y_pred
R square can be obtained using sklearn.metrics:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

Running linear regression using statsmodels:

Note that statsmodels does not add an intercept term automatically, so we need to add a constant to our model.
import statsmodels.api as sma
X_train = sma.add_constant(x_train) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test) 
Linear regression can be run by using sm.OLS:
import statsmodels.formula.api as sm
lm2 = sm.OLS(y_train,X_train).fit()
The summary of our model can be obtained via:
OLS Regression Results
Dep. Variable: CMS R-squared: 0.613
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 161.0
Date: Wed, 03 Jan 2018 Prob (F-statistic): 4.37e-162
Time: 21:29:10 Log-Likelihood: -3090.4
No. Observations: 824 AIC: 6199.
Df Residuals: 815 BIC: 6241.
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -34.2735 29.931 -1.145 0.253 -93.025 24.478
Cement 0.1242 0.010 13.054 0.000 0.105 0.143
Blast 0.1037 0.011 9.229 0.000 0.082 0.126
Fly Ash 0.0934 0.014 6.687 0.000 0.066 0.121
Water -0.1343 0.046 -2.947 0.003 -0.224 -0.045
Superplasticizer 0.2880 0.102 2.810 0.005 0.087 0.489
CA 0.0207 0.011 1.966 0.050 2.79e-05 0.041
FA 0.0256 0.012 2.131 0.033 0.002 0.049
Age 0.1146 0.006 19.064 0.000 0.103 0.126
Omnibus: 3.757 Durbin-Watson: 2.033
Prob(Omnibus): 0.153 Jarque-Bera (JB): 3.762
Skew: -0.165 Prob(JB): 0.152
Kurtosis: 2.974 Cond. No. 1.07e+05

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The predicted values for test set are given by:
y_pred2 = lm2.predict(X_test) 
Note that y_pred and y_pred2 are the same; they are just calculated via different packages.

Calculate R-Squared and Adjusted R-Squared Manually on Test data

We can also calculate r-squared and adjusted r-squared via formula without using any package.
import numpy as np
y_test = pd.to_numeric(y_test.CMS, errors='coerce')
RSS = np.sum((y_pred2 - y_test)**2)
y_mean = np.mean(y_test)
TSS = np.sum((y_test - y_mean)**2)
R2 = 1 - RSS/TSS

n = X_test.shape[0]
p = X_test.shape[1] - 1

adj_rsquared = 1 - (1 - R2) * ((n - 1)/(n - p - 1))

R-Squared : 0.6225
Adjusted RSquared : 0.60719

Detecting Outliers:
Firstly we try to get the studentized residuals using get_influence( ). The studentized residuals are saved in resid_student.
influence = lm2.get_influence() 
resid_student = influence.resid_studentized_external
Combining the training set and the residuals we have:
resid = pd.concat([x_train,pd.Series(resid_student,name = "Studentized Residuals")],axis = 1)
   Cement  Blast  Fly Ash  Water  Superplasticizer      CA     FA    Age  \
0   540.0    0.0      0.0  162.0               2.5  1040.0  676.0   28.0
1   540.0    0.0      0.0  162.0               2.5  1055.0  676.0   28.0
2   332.5  142.5      0.0  228.0               0.0   932.0  594.0  270.0
3   332.5  142.5      0.0  228.0               0.0   932.0  594.0  365.0
4   198.6  132.4      0.0  192.0               0.0   978.4  825.5  360.0

   Studentized Residuals
0               1.559672
1              -0.917354
2               1.057443
3               0.637504
4              -1.170290
If the absolute value of a studentized residual is more than 3, that observation is considered an outlier and hence should be removed. We create a logical vector for absolute studentized residuals greater than 3:
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]
     Cement  Blast  Fly Ash  Water  Superplasticizer     CA     FA  Age  \
649   166.8  250.2      0.0  203.5               0.0  975.6  692.6  3.0

     Studentized Residuals
649              3.161183
The index of the outliers is given by ind:
ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:].index
Int64Index([649], dtype='int64')
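The filtering step is an ordinary boolean mask; a self-contained sketch with made-up residual values:

```python
import numpy as np
import pandas as pd

# Made-up studentized residuals attached to a few training rows
resid = pd.DataFrame({
    "Cement": [540.0, 332.5, 166.8],
    "Studentized Residuals": [1.56, -0.92, 3.16],
})

# Boolean mask: keep rows whose absolute residual exceeds 3
outliers = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3, :]
print(list(outliers.index))   # [2]
```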

Dropping Outlier 
Using the drop( ) function we remove the outlier from our training sets!
y_train.drop(ind,axis = 0,inplace = True)
x_train.drop(ind,axis = 0,inplace = True)  #Interept column is not there
X_train.drop(ind,axis = 0,inplace = True)  #Intercept column is there

Detecting and Removing Multicollinearity 
We use the statsmodels library to calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]

We create a function to remove the collinear variables. We choose a threshold of 5 which means if VIF is more than 5 for a particular variable then that variable will be removed.
def calculate_vif(x):
    thresh = 5.0
    output = pd.DataFrame()
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1,k):
        print("Iteration no.", i)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:", a)
        if vif[a] <= thresh:   # stop once every remaining VIF is below the threshold
            break
        if i == 1:
            output = x.drop(x.columns[a], axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
        elif i > 1:
            output = output.drop(output.columns[a], axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
    return output
train_out = calculate_vif(x_train) 
Now we view the training set
     Cement  Blast  Fly Ash  Superplasticizer  Age
337   275.1    0.0    121.4               9.9   56
384   516.0    0.0      0.0               8.2   28
805   393.0    0.0      0.0               0.0   90
682   183.9  122.6      0.0               0.0   28
329   246.8    0.0    125.1              12.0    3

Removing the variables from the test set.

x_test.drop(["Water","CA","FA"],axis = 1,inplace = True)
     Cement  Blast  Fly Ash  Superplasticizer  Age
173   318.8  212.5      0.0              14.3   91
134   362.6  189.0      0.0              11.6   28
822   322.0    0.0      0.0               0.0   28
264   212.0    0.0    124.8               7.8    3
479   446.0   24.0     79.0              11.6    7

Running linear regression again on our new training set (without multicollinearity)
import statsmodels.api as sma
import statsmodels.formula.api as sm
train_out = sma.add_constant(train_out) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)
lm2 = sm.OLS(y_train,train_out).fit()
OLS Regression Results
Dep. Variable: CMS R-squared: 0.570
Model: OLS Adj. R-squared: 0.567
Method: Least Squares F-statistic: 216.3
Date: Wed, 10 Jan 2018 Prob (F-statistic): 6.88e-147
Time: 15:14:59 Log-Likelihood: -3128.8
No. Observations: 823 AIC: 6270.
Df Residuals: 817 BIC: 6298.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -11.1119 1.915 -5.803 0.000 -14.871 -7.353
Cement 0.1031 0.005 20.941 0.000 0.093 0.113
Blast 0.0721 0.006 12.622 0.000 0.061 0.083
Fly Ash 0.0614 0.009 6.749 0.000 0.044 0.079
Superplasticizer 0.7519 0.077 9.739 0.000 0.600 0.903
Age 0.1021 0.006 16.582 0.000 0.090 0.114
Omnibus: 0.870 Durbin-Watson: 2.090
Prob(Omnibus): 0.647 Jarque-Bera (JB): 0.945
Skew: 0.039 Prob(JB): 0.623
Kurtosis: 2.853 Cond. No. 1.59e+03

Checking normality of residuals
We use the Shapiro-Wilk test from the scipy library to check the normality of residuals.
  1. Null Hypothesis: The residuals are normally distributed.
  2. Alternative Hypothesis: The residuals are not normally distributed.
from scipy import stats
stats.shapiro(lm2.resid)
(0.9983407258987427, 0.6269884705543518)

Since the p-value is 0.6269, at the 5% level of significance we cannot reject the null hypothesis, so we can say that the residuals are normally distributed.

Checking for autocorrelation
To ensure the absence of autocorrelation we use the Ljung-Box test.
  1. Null Hypothesis: Autocorrelation is absent.
  2. Alternative Hypothesis: Autocorrelation is present.
from statsmodels.stats import diagnostic as diag
diag.acorr_ljungbox(lm2.resid , lags = 1) 
(array([ 1.97177212]), array([ 0.16025989]))

Since the p-value is 0.1602, we fail to reject the null hypothesis and can say that autocorrelation is absent.

Checking heteroscedasticity
Using the Goldfeld-Quandt test we check for heteroscedasticity.
  1. Null Hypothesis: Error terms are homoscedastic
  2. Alternative Hypothesis: Error terms are heteroscedastic.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
lzip(name, test)
[('F statistic', 0.9903), ('p-value', 0.539)]

The p-value is 0.539, hence we cannot reject the null hypothesis of homoscedasticity and can say that the residuals have constant variance. With this, all the assumptions of our linear regression model are satisfied.

NumPy Tutorial with Exercises

NumPy (acronym for 'Numerical Python' or 'Numeric Python') is one of the most essential package for speedy mathematical computation on arrays and matrices in Python. It is also quite useful while dealing with multi-dimensional data. It is a blessing for integrating C, C++ and FORTRAN tools. It also provides numerous functions for Fourier transform (FT) and linear algebra.

Python : Numpy Tutorial

Why NumPy instead of lists?

One might wonder why to prefer NumPy arrays when lists can hold the same data. If that question rings a bell, the following reasons may convince you:
  1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than as an array.
  2. Operations on arrays are faster, and hence more efficient, than on lists.
  3. They are more convenient to deal with.

    NumPy vs. Pandas

    Pandas is built on top of NumPy. In other words, NumPy is required for pandas to work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and a more streamlined way of working with numerical and tabular data in Python.

    Importing numpy
    Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
    import numpy as np
    It is common practice to import numpy with the alias 'np'. If no alias is provided, then to access functions from numpy we would write numpy.function; the alias lets us write np.function instead. Some of the common functions of numpy are listed below -

    Functions Tasks
    array Create numpy array
    ndim Dimension of the array
    shape Size of the array (Number of rows and Columns)
    size Total number of elements in the array
    dtype Type of elements in the array, i.e., int64, character
    reshape Returns the array with a new shape without changing the original array
    resize Changes the shape of the array in place, modifying the original array
    arange Create sequence of numbers in array
    itemsize Size in bytes of each item
    diag Create a diagonal matrix
    vstack Stacking vertically
    hstack Stacking horizontally
    1D array
    Using numpy an array is created by using np.array:
    a = np.array([15,25,14,78,96])
    Output: array([15, 25, 14, 78, 96])

    To print the array we can use print(a):
    Output: [15 25 14 78 96]
    Notice that in np.array square brackets are present. Absence of square brackets introduces an error.

    Changing the datatype
    np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
    a = np.array([15,25,14,78,96],dtype = "float")
    Initially datatype of 'a' was 'int32' which on modifying becomes 'float64'.

    1. int32 refers to an integer (no decimal point). '32' means the number can be between -2147483648 and 2147483647. Similarly, int16 implies the number can be in the range -32768 to 32767.
    2. float64 refers to a number with a decimal place.

    Creating the sequence of numbers
    If you want to create a sequence of numbers then using np.arange we can get our sequence. To get the sequence of numbers from 20 to 29 we run the following command.
    b = np.arange(start = 20,stop = 30) 
    array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
     In np.arange the end point is always excluded.

    Create an Arithmetic Progression 
    np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

    Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, upto 30; 30 being excluded.
    c = np.arange(20,30,2)   #30 is excluded.
    array([20, 22, 24, 26, 28])
    It is to be taken care that in np.arange( ) the stop argument is always excluded.
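A quick check of the step behaviour, including a negative step, which counts down:

```python
import numpy as np

c = np.arange(20, 30, 2)      # start, stop (excluded), step
print(c.tolist())             # [20, 22, 24, 26, 28]

d = np.arange(30, 20, -2)     # a negative step counts downwards
print(d.tolist())             # [30, 28, 26, 24, 22]
```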

    Reshaping the arrays
    To reshape the array we can use reshape( ). 
    f = np.arange(101,113)
     array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

    Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( ) 
    array([[101, 102, 103, 104],
    [105, 106, 107, 108],
    [109, 110, 111, 112]])

    If a dimension is given as -1 in a reshape, the other dimension is calculated automatically, provided that the total number of elements in the array is divisible by the given dimension.
    array([[101, 102, 103, 104],
    [105, 106, 107, 108],
    [109, 110, 111, 112]])

    In the above code we only directed that we will have 3 rows. Python automatically calculates the number of elements in other dimension i.e. 4 columns.

    Missing Data
    The missing data is represented by NaN (acronym for Not a Number). You can use the command np.nan
    val = np.array([15,10, np.nan, 3, 2, 5, 6, 4])
    np.sum(val)
    Out: nan
    To ignore missing values, you can use np.nansum(val) which returns 45.0

    To check whether array contains missing value, you can use the function isnan( )
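Putting the missing-value helpers together:

```python
import numpy as np

val = np.array([15, 10, np.nan, 3, 2, 5, 6, 4])

print(np.isnan(val).any())        # True: the array contains a missing value
print(int(np.isnan(val).sum()))   # 1: exactly one NaN entry
print(np.nansum(val))             # 45.0: sum ignoring the NaN
```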

    2D arrays
    A 2D array in numpy can be created in the following manner:
    g = np.array([(10,20,30),(40,50,60)])
    g = np.array([[10,20,30],[40,50,60]])
    The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
    Output: 2

    Output: 6

    Output: (2, 3)

    Creating some usual matrices
    numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
    To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
    np.zeros( (2,4) )
    array([[ 0.,  0.,  0.,  0.],
    [ 0., 0., 0., 0.]])
    Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
    array([[0, 0, 0, 0],
    [0, 0, 0, 0]], dtype=int16)
    np.empty( ) creates an array without initializing its entries; the values are whatever happens to be in memory at the time, so they should not be relied upon.
    np.empty( (2,3) )       
    array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
    [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
    Note: The results may vary everytime you run np.empty.
    To create a matrix of unity we write np.ones( ). We can create a 3 * 3 matrix of all ones by:
    array([[ 1.,  1.,  1.],
    [ 1., 1., 1.],
    [ 1., 1., 1.]])
    To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
    array([[14,  0,  0,  0],
    [ 0, 15, 0, 0],
    [ 0, 0, 16, 0],
    [ 0, 0, 0, 17]])
    To create an identity matrix we can use np.eye( ) .
    np.eye(5,dtype = "int")
    array([[1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1]])
    By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

    Reshaping 2D arrays
    To get a flattened 1D array we can use ravel( ):
    g = np.array([(10,20,30),(40,50,60)])
    g.ravel()
    array([10, 20, 30, 40, 50, 60])
    To change the shape of a 2D array we can use reshape. Writing -1 makes numpy calculate the other dimension automatically; reshape does not modify the original array.
    g.reshape(3,-1)  # returns the array with a modified shape
    g.shape          # the original array is unchanged
     (2, 3)
    Similar to 1D arrays, using resize( ) will modify the shape of the original array.
    g.resize(3,2)
    g   #resize modifies the original array
    array([[10, 20],
    [30, 40],
    [50, 60]])
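    A short sketch contrasting the two calls: reshape returns a new array and leaves the original alone, while resize changes the array in place (a second array k is used for resize, since resizing an array that another view references raises an error).

```python
import numpy as np

g = np.array([(10, 20, 30), (40, 50, 60)])
h = g.reshape(3, -1)            # -1 lets numpy infer the second dimension
shape_after_reshape = g.shape   # g itself is untouched: still (2, 3)

k = np.array([(10, 20, 30), (40, 50, 60)])
k.resize(3, 2)                  # resize changes k in place
shape_after_resize = k.shape    # now (3, 2)
```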

    Time for some matrix algebra
    Let us create some arrays A,b and B and they will be used for this section:
    A = np.array([[2,0,1],[4,3,8],[7,6,9]])
    b = np.array([1,101,14])
    B = np.array([[10,20,30],[40,50,60],[70,80,90]])
    In order to get the transpose, trace and inverse we use A.transpose( ) (or the shorthand A.T), np.trace( ) and np.linalg.inv( ) respectively.
    A.transpose()  #transpose
    array([[2, 4, 7],
    [0, 3, 6],
    [1, 8, 9]])

    np.trace(A) # trace
    Output: 14

    np.linalg.inv(A) #Inverse
    array([[ 0.53846154, -0.15384615, 0.07692308],
    [-0.51282051, -0.28205128, 0.30769231],
    [-0.07692308, 0.30769231, -0.15384615]])
    Note that transpose does not modify the original array.

    Matrix addition and subtraction can be done in the usual way:
    A + B
    array([[12, 20, 31],
    [44, 53, 68],
    [77, 86, 99]])

    A - B
    array([[ -8, -20, -29],
    [-36, -47, -52],
    [-63, -74, -81]])
    Matrix multiplication of A and B can be accomplished by A.dot(B), where A is the matrix on the left and B the matrix on the right:
    A.dot(B)
    array([[  90,  120,  150],
    [ 720, 870, 1020],
    [ 940, 1160, 1380]])
    To solve the system of linear equations Ax = b we use np.linalg.solve( ):
    np.linalg.solve(A,b)
    array([-13.92307692, -24.69230769,  28.84615385])
    The eigenvalues and eigenvectors can be calculated using np.linalg.eig( ):
    np.linalg.eig(A)
    (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
    array([[-0.06599631, -0.78226966, -0.14996331],
    [-0.59939873, 0.54774477, -0.81748379],
    [-0.7977253 , 0.29669824, 0.55608566]]))
    The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the eigenvalue in the same position.
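    A quick numerical check of the linear-algebra calls above, using the same A and b; np.allclose is used because the inverse and eigendecomposition are computed in floating point.

```python
import numpy as np

A = np.array([[2, 0, 1], [4, 3, 8], [7, 6, 9]])
b = np.array([1, 101, 14])

A_inv = np.linalg.inv(A)
x = np.linalg.solve(A, b)
vals, vecs = np.linalg.eig(A)

inverse_ok = np.allclose(A.dot(A_inv), np.eye(3))              # A times its inverse is I
solution_ok = np.allclose(A.dot(x), b)                         # x really solves Ax = b
eig_ok = np.allclose(A.dot(vecs[:, 0]), vals[0] * vecs[:, 0])  # A v = lambda v
```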

    Some Mathematics functions

    We can have various trigonometric functions like sin, cosine etc. using numpy:
    B = np.array([[0,-20,36],[40,50,1]])
    np.sin(B)
    array([[ 0.        , -0.91294525, -0.99177885],
    [ 0.74511316, -0.26237485, 0.84147098]])
    The resultant is the matrix of all sin( ) elements.
    In order to raise the elements to a power we use **:
    B**2
    array([[   0,  400, 1296],
    [1600, 2500, 1]], dtype=int32)
    We get the matrix of the square of all elements of B.
    To check whether a condition is satisfied by the elements of a matrix we write the criterion directly. For instance, to check if the elements of B are more than 25 we write:
    B > 25
    array([[False, False,  True],
    [ True, True, False]], dtype=bool)
    We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
    In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.
    Now we consider a matrix A of shape 3*3:
    A = np.arange(1,10).reshape(3,3)
    array([[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]])
    To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
    A.sum()   #Sum
    Output: 45

    A.min()   #Minimum
    Output: 1

    A.max()   #Maximum
    Output: 9

    A.mean()  #Mean
    Output: 5.0

    A.std()   #Standard deviation
    Output: 2.5819888974716112

    A.var()   #Variance
    Output: 6.666666666666667
    In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
    A.argmin()
    Output: 0

    A.argmax()
    Output: 8
    If we wish to find the above statistics for each row or column then we need to specify the axis:
    A.sum(axis=0)                 # sum of each column, it will move in downward direction
    Output: array([12, 15, 18])

    A.mean(axis = 0)
    Output: array([ 4., 5., 6.])

    A.std(axis = 0)
    Output: array([ 2.44948974, 2.44948974, 2.44948974])

    A.argmin(axis = 0)
    Output: array([0, 0, 0], dtype=int64)
    By defining axis = 0, calculations will move in downward direction i.e. it will give the statistics for each column.
    To find the minimum of each row or the index of the maximum element in each row, we move in the rightward direction, so we write axis = 1:
    A.min(axis=1)                  # min of each row, it will move in rightwise direction
    Output: array([1, 4, 7])

    A.argmax(axis = 1)
    Output: array([2, 2, 2], dtype=int64)
    To find the cumulative sum along each row we use cumsum( ):
    A.cumsum(axis = 1)
    array([[ 1,  3,  6],
    [ 4, 9, 15],
    [ 7, 15, 24]], dtype=int32)
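    The axis convention in this section can be summarised in one sketch: axis = 0 aggregates down the columns, axis = 1 across the rows.

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)

col_sums = A.sum(axis=0)     # one value per column, moving downward
row_mins = A.min(axis=1)     # one value per row, moving rightward
row_cums = A.cumsum(axis=1)  # running totals along each row
```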

    Creating 3D arrays
    Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
    X = np.array( [[[  1,  2,  3],
                    [  4,  5,  6]],
                   [[  7,  8,  9],
                    [ 10, 11, 12]]] )
    X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
    To calculate the sum along a particular axis we use the axis parameter as follows:
    X.sum(axis = 0)
    array([[ 8, 10, 12],
    [14, 16, 18]])

    X.sum(axis = 1)
    array([[ 5, 7, 9],
    [17, 19, 21]])

    X.sum(axis = 2)
    array([[ 6, 15],
    [24, 33]])
    axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
    X.ravel()
    array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
    ravel( ) writes all the elements into a single 1D array.
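    The three axis sums for X can be checked directly:

```python
import numpy as np

X = np.array([[[1, 2, 3], [4, 5, 6]],
              [[7, 8, 9], [10, 11, 12]]])

s0 = X.sum(axis=0)  # element-wise sum of the two 2D arrays
s1 = X.sum(axis=1)  # column sums within each 2D array
s2 = X.sum(axis=2)  # row sums within each 2D array
```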

    Indexing in arrays
    It is important to note that Python indexing starts from 0. The syntax of indexing is as follows -
    1. x[start:end] : Elements in array x start through the end (but the end is excluded)
    2. x[start:]       : Elements start through the end
    3. x[:end]        : Elements from the beginning through the end (but the end is excluded)

    If we want to extract the 3rd element we write the index 2, as indexing starts from 0.
    x = np.arange(10)
    x[2]
    Output: 2

    x[2:5]
    Output: array([2, 3, 4])
    Note that in x[2:5] elements from the 2nd index up to the 5th index (exclusive) are selected.
    If we want to change the value of all the elements from the start up to index 7 (excluding 7), with a step of 3, to 123 we write:
    x[:7:3] = 123
    x
     array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
    To reverse a given array we write:
    x = np.arange(10)
    x[ : :-1]                                 # reversed x
    array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
    Note that the above command does not modify the original array.
    Consider a 3D array:
    X = np.array( [[[  1,  2,  3],
                    [  4,  5,  6]],
                   [[  7,  8,  9],
                    [ 10, 11, 12]]] )
    To extract the 2nd matrix we write:
    X[1,...]                                   # same as X[1,:,:] or X[1]
    array([[ 7,  8,  9],
    [10, 11, 12]])
    Remember python indexing starts from 0; that is why we wrote 1 to extract the 2nd 2D array.
    To extract the first element from all the rows we write:
    X[...,0]                                   # same as X[:,:,0] 
    array([[ 1,  4],
    [ 7, 10]])

    Find out position of elements that satisfy a given condition
    a = np.array([8, 3, 7, 0, 4, 2, 5, 2])
    np.where(a > 4)
    (array([0, 2, 6]),)
    np.where locates the positions in the array where the element is greater than 4.
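    Besides locating positions, np.where also has a three-argument form that picks values elementwise based on the condition; a small sketch with the same array:

```python
import numpy as np

a = np.array([8, 3, 7, 0, 4, 2, 5, 2])

positions = np.where(a > 4)[0]  # indices where the condition holds
capped = np.where(a > 4, 4, a)  # keep a where condition is False, else use 4
```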

    Indexing with Arrays of Indices
    Consider a 1D array.
    x = np.arange(11,35,2)                   
    array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
    We form a 1D array i which subsets the elements of x as follows:
    i = np.array( [0,1,5,3,7,9 ] )
    x[i]
    array([11, 13, 21, 17, 25, 29])
    In a similar manner we create a 2D array j of indices to subset x.
    j = np.array( [ [ 0, 1], [ 6, 2 ] ] )
    x[j]
    array([[11, 13],
    [23, 15]])
    Similarly we can create both  i and j as 2D arrays of indices for x
    x = np.arange(15).reshape(3,5)
    i = np.array( [ [0,1],                        # indices for the first dim
                    [2,0] ] )
    j = np.array( [ [1,1],                        # indices for the second dim
                    [2,0] ] )
    To pick, for each pair of indices, the element at row index i and column index j we write:
    x[i,j]                                     # i and j must have equal shape
    array([[ 1,  6],
    [12, 0]])
    To extract the ith indices from the 3rd column we write:
    x[i,2]
    array([[ 2,  7],
    [12, 2]])
    For each row, if we want the jth indices we write:
    x[:,j]
    array([[[ 1,  1],
    [ 2, 0]],

    [[ 6, 6],
    [ 7, 5]],

    [[11, 11],
    [12, 10]]])
    Each row is fixed in turn and the indices in j are applied to it: the first 2 x 2 block comes from row 1, the second from row 2 and the third from row 3.

    You can also use indexing with arrays to assign the values:
    x = np.arange(10)
    x[[4,5,8,1,2]] = 0
    x
    array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
    0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
    When the list of indices contains repetitions then it assigns the last value to that index:
    x = np.arange(10)
    x[[4,4,2,3]] = [100,200,300,400]
    x
    array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
    Notice that for the 5th element (i.e. the 4th index) the value assigned is 200, not 100.
    Caution: If one uses the += operator on repeated indices, the operation is carried out only once per repeated index.
    x = np.arange(10)
    x[[1,1,7,7]] += 1
    x
     array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
    Although indices 1 and 7 are repeated, they are incremented only once.
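    When the repeated increments should all be applied, numpy provides the unbuffered np.add.at; a sketch contrasting it with +=:

```python
import numpy as np

x = np.arange(10)
x[[1, 1, 7, 7]] += 1           # buffered: each repeated index is incremented once
buffered = x.copy()

y = np.arange(10)
np.add.at(y, [1, 1, 7, 7], 1)  # unbuffered: repeated indices accumulate
```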

    Indexing with Boolean arrays 
    We create a 2D array and store our condition in b. Where the condition is true the result is True, otherwise False.
    a = np.arange(12).reshape(3,4)
    b = a > 4
    b
    array([[False, False, False, False],
    [False, True, True, True],
    [ True, True, True, True]], dtype=bool)
    Note that 'b' is a Boolean with same shape as that of 'a'.
    To select the elements from 'a' which satisfy condition 'b' we write:
    a[b]
    array([ 5,  6,  7,  8,  9, 10, 11])
    The result is a 1D array of the selected elements.
    This property can be very useful in assignments:
    a[b] = 0
    a
    array([[0, 1, 2, 3],
    [4, 0, 0, 0],
    [0, 0, 0, 0]])
    All elements of 'a' higher than 4 become 0
    As done in integer indexing we can use indexing via Booleans:
    Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
    x = np.arange(15).reshape(3,5)
    y = np.array([True,True,False])             # first dim selection
    z = np.array([True,True,False,True,False])       # second dim selection
    Writing x[y,:] (or simply x[y]) selects only those rows where y is True, and x[:,z] selects only those columns where z is True:
    x[y,:]                                   # selecting rows
    array([[0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9]])

    x[y] # same thing
    array([[0, 1, 2, 3, 4],
    [5, 6, 7, 8, 9]])

    x[:,z] # selecting columns
    array([[ 0, 1, 3],
    [ 5, 6, 8],
    [10, 11, 13]])

    Stacking various arrays
    Let us consider 2 arrays A and B:
    A = np.array([[10,20,30],[40,50,60]])
    B = np.array([[100,200,300],[400,500,600]])
    To join them vertically we use np.vstack( ).
    np.vstack((A,B))   #Stacking vertically
    array([[ 10,  20,  30],
    [ 40, 50, 60],
    [100, 200, 300],
    [400, 500, 600]])
    To join them horizontally we use np.hstack( ).
    np.hstack((A,B))   #Stacking horizontally
    array([[ 10,  20,  30, 100, 200, 300],
    [ 40, 50, 60, 400, 500, 600]])
    newaxis helps in transforming a 1D row vector into a 2D column vector.
    from numpy import newaxis
    a = np.array([4.,1.])
    b = np.array([2.,8.])
    a[:,newaxis]
    array([[ 4.],
    [ 1.]])
    The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. For 1D arrays it is equivalent to hstack on the column vectors:
    np.column_stack((a,b))
    array([[ 4., 2.],
    [ 1., 8.]])

    np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
    array([[ 4., 2.],
    [ 1., 8.]])

    Splitting the arrays
    Consider an array 'z' of 15 elements:
    z = np.arange(1,16)
    Using np.hsplit( ) one can split the arrays
    np.hsplit(z,5)   # Split a into 5 arrays
    [array([1, 2, 3]),
    array([4, 5, 6]),
    array([7, 8, 9]),
    array([10, 11, 12]),
    array([13, 14, 15])]
    It splits 'z' into 5 arrays of equal length.
    On passing a tuple of split points (3, 5) we get:
    np.hsplit(z,(3,5))
    [array([1, 2, 3]),
    array([4, 5]),
    array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
    It splits 'z' after the third and the fifth element.
    For 2D arrays np.hsplit( ) works as follows:
    A = np.arange(1,31).reshape(3,10)
    np.hsplit(A,5)   # Split a into 5 arrays
    [array([[ 1,  2],
    [11, 12],
    [21, 22]]), array([[ 3, 4],
    [13, 14],
    [23, 24]]), array([[ 5, 6],
    [15, 16],
    [25, 26]]), array([[ 7, 8],
    [17, 18],
    [27, 28]]), array([[ 9, 10],
    [19, 20],
    [29, 30]])]
    In the above command A gets split into 5 arrays of same shape.
    To split after the third and the fifth column we write:
    np.hsplit(A,(3,5))
    [array([[ 1,  2,  3],
    [11, 12, 13],
    [21, 22, 23]]), array([[ 4, 5],
    [14, 15],
    [24, 25]]), array([[ 6, 7, 8, 9, 10],
    [16, 17, 18, 19, 20],
    [26, 27, 28, 29, 30]])]

    Consider an array x
    x = np.arange(1,16)
    We assign y to x and then test 'y is x':
    y = x
    y is x
    Output: True
    Let us change the shape of y:
    y.shape = 3,5
    Since y and x are the same object, this alters the shape of x as well:
    x.shape
    Output: (3, 5)

    Creating a view of the data
    Let us store z as a view of x by:
    z = x.view()
    z is x
    Output: False
    Thus z is not x, but it looks at the same data.
    Changing the shape of z
    z.shape = 5,3                   
    Reshaping the view does not alter the shape of x:
    x.shape
    Output: (3, 5)
    Changing an element in z
    z[0,0] = 1234                      
    Note that the value in x also gets altered:
    array([[1234,    2,    3,    4,    5],
    [ 6, 7, 8, 9, 10],
    [ 11, 12, 13, 14, 15]])
    Thus changing the shape of a view does not affect the original data, but changing the values of a view does.

    Creating a copy of the data:
    Now let us create z as a copy of x:
    z = x.copy()                        
    Note that z is not x:
    z is x
    Output: False
    Changing the value in z
    z[0,0] = 9999
    No alterations are made in x:
    x
    array([[1234,    2,    3,    4,    5],
    [ 6, 7, 8, 9, 10],
    [ 11, 12, 13, 14, 15]])
    pandas may sometimes give a 'SettingWithCopy' warning because it cannot always tell whether a new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations the user should make the intent explicit, for example by calling .copy( ), otherwise the results may not be what is expected.
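    np.shares_memory makes the view-versus-copy distinction above explicit; a minimal sketch:

```python
import numpy as np

x = np.arange(1, 16)
v = x.view()  # new array object, same underlying buffer
c = x.copy()  # new array object, independent buffer

view_shares = np.shares_memory(x, v)  # True: writing through v changes x
copy_shares = np.shares_memory(x, c)  # False: writing to c leaves x alone

v[0] = 999
c[1] = -1
```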

    Exercises : Numpy

    1. How to extract even numbers from array?

    arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    Desired Output : array([0, 2, 4, 6, 8])

    Show Solution
    arr[arr % 2 == 0]

    2. How to find out the position where elements of x and y are same

    x = np.array([5,6,7,8,3,4])
    y = np.array([5,3,4,5,2,4])
    Desired Output : (array([0, 5]),)

    Show Solution
    np.where(x == y)

    3. How to standardize values so that it lies between 0 and 1

    k = np.array([5,3,4,5,2,4])
    Hint : (k-min(k))/(max(k)-min(k))

    Show Solution
    kmax, kmin = k.max(), k.min()
    k_new = (k - kmin)/(kmax - kmin)

    4. How to calculate the percentile scores of an array

    p = np.array([15,10, 3,2,5,6,4])

    Show Solution
    np.percentile(p, q=[5, 95])

    5. Print the number of missing values in an array

    p = np.array([5,10, np.nan, 3, 2, 5, 6, np.nan])

    Show Solution
    print("Number of missing values =", np.isnan(p).sum())

    Pandas Python Tutorial - Learn by Examples

    Pandas, one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

    The Pandas library is a great contribution to the python community and has helped make Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.

    Data Analysis with Python : Pandas Step by Step Guide

    Why pandas?
    It has many functions which are the essence for data handling. In short, it can perform the following tasks for you -
    1. Create a structured data set similar to R's data frame and Excel spreadsheet.
    2. Reading data from various sources such as CSV, TXT, XLSX, SQL database, R etc.
    3. Selecting particular rows or columns from data set
    4. Arranging data in ascending or descending order
    5. Filtering data based on some conditions
    6. Summarizing data by classification variable
    7. Reshape data into wide or long format
    8. Time series analysis
    9. Merging and concatenating two datasets
    10. Iterate over the rows of dataset
    11. Writing or Exporting data in CSV or Excel format


    In this tutorial we will use two datasets: 'income' and 'iris'.
    1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
    2. 'iris' data: It comprises of 150 observations with 5 variables. We have 3 species of flowers(50 flowers for each specie) and for all of them the sepal length and width and petal length and width are given. Download link 

    Important pandas functions to remember

    The following is a list of common tasks along with pandas functions.
    Utility Functions
    Extract Column Names df.columns
    Select first 2 rows df.iloc[:2]
    Select first 2 columns df.iloc[:,:2]
    Select columns by name df.loc[:,["col1","col2"]]
    Select random no. of rows df.sample(n = 10)
    Select fraction of random rows df.sample(frac = 0.2)
    Rename the variables df.rename( )
    Selecting a column as index df.set_index( )
    Removing rows or columns df.drop( )
    Sorting values df.sort_values( )
    Grouping variables df.groupby( )
    Filtering df.query( )
    Finding the missing values df.isnull( )
    Dropping the missing values df.dropna( )
    Removing the duplicates df.drop_duplicates( )
    Creating dummies pd.get_dummies( )
    Ranking df.rank( )
    Cumulative sum df.cumsum( )
    Quantiles df.quantile( )
    Selecting numeric variables df.select_dtypes( )
    Concatenating two dataframes pd.concat()
    Merging on basis of common variable pd.merge( )

    Importing pandas library

    You need to import or load the Pandas library first in order to use it. By "Importing a library", it means loading it into the memory and then you can use it. Run the following code to import pandas library:
    import pandas as pd
    The "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from pandas library, you just need to type pd.function instead of  pandas.function every time you need to apply it.

    Importing Dataset

    To read or import data from CSV file, you can use read_csv() function. In the function, you need to specify the file location of your CSV file.
    income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
     Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
    4 C California 1685349 1675807 1889570 1480280 1735069 1812546

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
    0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607

    Get Variable Names

    By using the income.columns command, you can fetch the names of the variables of a data frame.
    Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
    'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
    income.columns[0:2] returns first two column names 'Index', 'State'. In python, indexing starts from 0.

    Knowing the Variable types

    You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
    Index    object
    State object
    Y2002 int64
    Y2003 int64
    Y2004 int64
    Y2005 int64
    Y2006 int64
    Y2007 int64
    Y2008 int64
    Y2009 int64
    Y2010 int64
    Y2011 int64
    Y2012 int64
    Y2013 int64
    Y2014 int64
    Y2015 int64
    dtype: object

    Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

    To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
    income['State'].dtypes
    It returns dtype('O'). In this case, 'O' refers to object i.e. a character variable.

    Changing the data types

    Y2008 is an integer. Suppose we want to convert it to float (numeric variable with decimals) we can write:
    income.Y2008 = income.Y2008.astype(float)
    Index     object
    State object
    Y2002 int64
    Y2003 int64
    Y2004 int64
    Y2005 int64
    Y2006 int64
    Y2007 int64
    Y2008 float64
    Y2009 int64
    Y2010 int64
    Y2011 int64
    Y2012 int64
    Y2013 int64
    Y2014 int64
    Y2015 int64
    dtype: object

    To view the dimensions or shape of the data we use the shape attribute:
    income.shape
     (51, 16)

    51 is the number of rows and 16 is the number of columns.

    You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 

    To view only some of the rows

    By default head( ) shows first 5 rows. If we want to see a specific number of rows we can mention it in the parenthesis. Similarly tail( ) function shows last 5 rows by default.
    income.head(2) #shows first 2 rows.
    income.tail(2) #shows last 2 rows

    Alternatively, income.head() or income.iloc[:5,:] can be used to fetch the first five rows.

    Extract Unique Values

    The unique( ) function shows the unique levels or categories in a column.
    income.Index.unique()
    array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)

    The nunique( ) function shows the number of unique values.
    income.Index.nunique()
    It returns 19 as the Index column contains 19 distinct values.

    Generate Cross Tab

    pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between the Index and State columns.
    pd.crosstab(income.Index, income.State)

    Creating a frequency distribution

    income.Index selects the 'Index' column of 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False i.e. it will show the 'Index' having the maximum frequency on the top.
    income.Index.value_counts(ascending = True)
    F    1
    G 1
    U 1
    L 1
    H 1
    P 1
    R 1
    D 2
    T 2
    S 2
    V 2
    K 2
    O 3
    C 3
    I 4
    W 4
    A 4
    M 8
    N 8
    Name: Index, dtype: int64

    To draw the samples
    income.sample( ) is used to draw random samples from the dataset containing all the columns. Here n = 5 means we need 5 rows and frac = 0.1 means we need 10 percent of the rows as our sample.
    income.sample(n = 5)
    income.sample(frac = 0.1)
    Selecting only a few of the columns
    To select only specific columns we use either the loc[ ] or iloc[ ] commands. The indices or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.
    income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
    #In the above command both Index and Y2008 are included.
    income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included
    The difference between loc and iloc is that loc requires the column(rows) names to be selected while iloc requires the column(rows) indices (position).
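    The loc/iloc distinction can be seen on a tiny frame modelled on the income data (the values here are only for illustration): loc includes the end label, while iloc excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({"Index": ["A", "A", "C"],
                   "State": ["Alabama", "Alaska", "California"],
                   "Y2002": [1296530, 1170302, 1685349]})

by_label = df.loc[:, "Index":"State"]  # label slice: "State" is included
by_position = df.iloc[:, 0:2]          # position slice: column 2 is excluded
```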

    You can also use the following syntax to select specific variables.

    Renaming the variables
    We create a dataframe 'data' for information of people and their respective zodiac signs.
    data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
           A          B
    0 John Libra
    1 Mary Capricorn
    2 Julia Aries
    3 Kenny Scorpio
    4 Henry Aquarius
    If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
    #Renaming all the variables.
    data.columns = ['Names','Zodiac Signs']

       Names Zodiac Signs
    0 John Libra
    1 Mary Capricorn
    2 Julia Aries
    3 Kenny Scorpio
    4 Henry Aquarius
    If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
    #Renaming only some of the variables.
    data.rename(columns = {"Names":"Cust_Name"},inplace = True)
      Cust_Name Zodiac Signs
    0 John Libra
    1 Mary Capricorn
    2 Julia Aries
    3 Kenny Scorpio
    4 Henry Aquarius
    By default in pandas inplace = False which means that no changes are made in the original dataset. Thus if we wish to alter the original dataset we need to define inplace = True.

    Suppose we want to replace only a particular character in the list of the column names then we can use str.replace( ) function. For example, renaming the variables which contain "Y" as "Year"
    income.columns = income.columns.str.replace('Y' , 'Year ')
    Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
    'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
    'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],

    Setting one column in the data frame as the index
    Using set_index("column name") we can set the indices as that column and that column gets removed.
    income.set_index("Index",inplace = True)
    #Note that the indices have changed and Index column is now no more a column
    income.reset_index(inplace = True)
    reset_index( ) restores the default integer index.

    Removing the columns and rows
    To drop a column we use drop( ) where the first argument is a list of columns to be removed.

    By default axis = 0, which means the operation applies to rows (the index). To remove a column we need to set axis = 1.
    income.drop('Index',axis = 1)

    income.drop("Index",axis = "columns")
    income.drop(['Index','State'],axis = 1)
    income.drop(0,axis = 0)
    income.drop(0,axis = "index")
    income.drop([0,1,2,3],axis = 0)
     Also inplace = False by default thus no alterations are made in the original dataset.  axis = "columns"  and axis = "index" means the column and row(index) should be removed respectively.

    Sorting the data
    To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.
    income.sort_values("State",ascending = False)
    income.sort_values("State",ascending = False,inplace = True)
    We have duplicate values in Index, so we first sort the dataframe by Index and then, within each Index, sort by Y2002:
    income.sort_values(["Index","Y2002"])

    Create new variables
    Using eval( ) arithmetic operations on various columns can be carried out in a dataset.
    income["difference"] = income.Y2008-income.Y2009

    income["difference2"] = income.eval("Y2008 - Y2009")
      Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
    4 C California 1685349 1675807 1889570 1480280 1735069 1812546

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
    0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607

    difference difference2
    0 1056.0 1056.0
    1 115285.0 115285.0
    2 198556.0 198556.0
    3 -440876.0 -440876.0
    4 -176494.0 -176494.0

    income.ratio = income.Y2008/income.Y2009
    The above command does not create a new column; dot-assignment merely sets an attribute on the dataframe object. To create new columns we need to use square brackets.
    We can also use assign( ) function but this command does not make changes in the original data as there is no inplace parameter. Hence we need to save it in a new dataset.
    data = income.assign(ratio = (income.Y2008 / income.Y2009))

    Finding Descriptive Statistics
    describe( ) is used to find statistics like mean, minimum, quartiles etc. for numeric variables.
    income.describe() #for numeric variables
    To find the total count, the most frequently occurring string and its frequency we write include = ['object']:
    income.describe(include = ['object'])  #Only for strings / objects
    Mean, median, maximum and minimum can be obtained for a particular column(s) as:

    Groupby function
    To group the data by a categorical variable we use groupby( ) function and hence we can do the operations on each category.
    The agg( ) function is used to apply one or more aggregation functions to given variables.
    The following command finds minimum and maximum values for Y2002 and only mean for Y2003
    income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
              Y2002                 Y2003
                  min      max        mean
    A 1170302 1742027 1810289.000
    C 1343824 1685349 1595708.000
    D 1111437 1330403 1631207.000
    F 1964626 1964626 1468852.000
    G 1929009 1929009 1541565.000
    H 1461570 1461570 1200280.000
    I 1353210 1776918 1536164.500
    K 1509054 1813878 1369773.000
    L 1584734 1584734 1110625.000
    M 1221316 1983285 1535717.625
    N 1395149 1885081 1382499.625
    O 1173918 1802132 1569934.000
    P 1320191 1320191 1446723.000
    R 1501744 1501744 1942942.000
    S 1159037 1631522 1477072.000
    T 1520591 1811867 1398343.000
    U 1771096 1771096 1195861.000
    V 1134317 1146902 1498122.500
    W 1677347 1977749 1521118.500
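    The agg pattern above can be reproduced on a small made-up frame (the values here are illustrative only, not from the income data):

```python
import pandas as pd

df = pd.DataFrame({"Index": ["A", "A", "C", "C"],
                   "Y2002": [10, 30, 20, 40],
                   "Y2003": [1.0, 3.0, 5.0, 7.0]})

# min and max of Y2002, but only the mean of Y2003, per group
stats = df.groupby("Index").agg({"Y2002": ["min", "max"], "Y2003": "mean"})
```

The result has a MultiIndex on the columns, so individual cells are addressed with (variable, statistic) pairs.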

    To filter only those rows which have Index as "A" we write:
    income[income.Index == "A"]

    income.loc[income.Index == "A",:]
      Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
    0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    To select the States having Index as "A":
    income.loc[income.Index == "A","State"]
    income.loc[income.Index == "A",:].State
    To filter the rows with Index as "A" and income for 2002 > 1500000:
    income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
    To filter the rows with Index either "A" or "W", we can use the isin( ) function:
    income.loc[income.Index.isin(["A","W"]),:]
    The same result can be obtained with the | (or) operator:
    income.loc[(income.Index == "A") | (income.Index == "W"),:]

       Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
    0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
    1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
    2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
    3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
    47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
    48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
    49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
    50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124

    Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
    0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
    1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
    2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
    3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
    47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
    48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
    49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
    50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
    Alternatively we can use the query( ) function and write our filtering criteria as a string:
    income.query('Y2002>1700000 & Y2003 > 1500000')

    Dealing with missing values
    We create a new dataframe named 'crops'; to create a NaN value we use np.nan from numpy.
    import numpy as np
    mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
            'Yield': [1010, 1025.2, 1404.2, 1251.7],
            'cost' : [102, np.nan, 20, 68]}
    crops = pd.DataFrame(mydata)
    isnull( ) returns True where a value is NaN; notnull( ) returns the opposite.
    crops.isnull()  #same as is.na in R
    crops.notnull()  #opposite of previous command
    crops.isnull().sum()  #no. of missing values per column
    crops.cost.isnull() first selects the 'cost' column from the dataframe and then returns a boolean Series via isnull( )

    crops[crops.cost.isnull()] #shows the rows where cost is NaN
    crops[crops.cost.isnull()].Crop #Crop values of the rows where cost is NaN
    crops[crops.cost.notnull()].Crop #Crop values of the rows where cost is not NaN
    To drop all the rows which have a missing value in any column we use dropna(how = "any"). By default inplace = False. With how = "all" a row is dropped only if all the elements in that row are missing.

    crops.dropna(how = "any").shape
    crops.dropna(how = "all").shape  
    To remove NaNs if any of 'Yield' or'cost' are missing we use the subset parameter and pass a list:
    crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
    crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
    To replace the missing values in the 'cost' column with "UNKNOWN":
    crops['cost'].fillna(value = "UNKNOWN",inplace = True)

    Dealing with duplicates
    We create a new dataframe comprising items and their respective prices.
    data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
                 Items  Price
    0               TV  10000
    1  Washing Machine  50000
    2           Mobile  20000
    3               TV  10000
    4               TV  10000
    5  Washing Machine  40000
    duplicated( ) returns a boolean Series which is True for every row that is a duplicate.
    data.loc[data.duplicated(keep = "first"),:]
    By default keep = 'first' i.e. the first occurrence is considered unique and its repetitions are considered duplicates.
    If keep = "last" the last occurrence is considered unique and all the earlier repetitions are considered duplicates.
    data.loc[data.duplicated(keep = "last"),:] #last entries are not there, indices have changed
    If keep = False then all occurrences of a repeated observation are considered duplicates.
    data.loc[data.duplicated(keep = False),:]  #shows every occurrence of the duplicated rows
    To drop the duplicates drop_duplicates( ) is used, with default inplace = False; keep = 'first', 'last' or False has the same meaning as in duplicated( )
    data.drop_duplicates(keep = "first")
    data.drop_duplicates(keep = "last")
    data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False

    Creating dummies
    Now we will consider the iris dataset
    iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa
    The map( ) function matches the values and replaces them in the new Series that is automatically created.
    iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
    To create dummies, get_dummies( ) is used. prefix = "Species" adds the prefix 'Species' to the names of the new columns.
    pd.get_dummies(iris.Species,prefix = "Species")
    pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #selects only the first dummy column; position 1 is not included
    species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]  #keeps all the dummy columns
    With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
    iris = pd.concat([iris,species_dummies],axis = 1)
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa

       Species_setosa  Species_versicolor  Species_virginica
    0               1                   0                  0
    1               1                   0                  0
    2               1                   0                  0
    3               1                   0                  0
    4               1                   0                  0
    It is usual that for a variable with 'n' categories we create 'n-1' dummies; to drop the first dummy column we pass drop_first = True
    pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

    To create a dataframe of all the ranks we use rank( ).
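    A minimal sketch of rank( ) on a toy dataframe (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [30, 20, 10]})

# rank( ) ranks each column independently, ascending by default;
# tied values receive the average of their ranks
ranks = df.rank()
print(ranks)
```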
    Ranking by a specific variable
    Suppose we want to rank Sepal.Length within each species in ascending order:
    iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
    iris.head( )

    iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)

    Calculating the Cumulative sum
    Using cumsum( ) function we can obtain the cumulative sum
    iris['cum_sum'] = iris["Sepal.Length"].cumsum()
    Cumulative sum by a variable
    To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
    iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()

    Calculating the percentiles
    Various quantiles can be obtained by using quantile( ).
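    A minimal sketch with toy data in place of income (by default quantile( ) interpolates linearly between observations):

```python
import pandas as pd

df = pd.DataFrame({"Y2002": [10, 20, 30, 40, 50]})

median = df["Y2002"].quantile(0.5)           # a single quantile returns a scalar
q = df["Y2002"].quantile([0.25, 0.5, 0.75])  # several at once return a Series indexed by quantile
print(median)
print(q)
```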

    if else in Python
    We create a new dataframe of students' name and their respective zodiac signs.
    students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                             'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
    def name(row):
        if row["Names"] in ["John","Henry"]:
            return "yes"
        return "no"

    students['flag'] = students.apply(name, axis=1)
    Functions in python are defined using the def keyword, followed by the function's name. The apply( ) function applies a function along rows or columns of a dataframe; axis = 1 applies it row-wise.

    Note: when using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops and if-else blocks.

          Names Zodiac Signs flag
    0      John     Aquarius  yes
    1      Mary        Libra   no
    2     Henry       Gemini  yes
    3  Augustus       Pisces   no
    4     Kenny        Virgo   no

    Alternatively, by importing numpy we can use np.where. The first argument is the condition to be evaluated, the second argument is the value if the condition is True, and the last argument is the value if the condition is False.
    import numpy as np
    students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')

    Multiple Conditions : If Else-if Else
    def mname(row):
        if row["Names"] == "John" and row["Zodiac Signs"] == "Aquarius":
            return "yellow"
        elif row["Names"] == "Mary" and row["Zodiac Signs"] == "Libra":
            return "blue"
        elif row["Zodiac Signs"] == "Pisces":
            return "blue"
        return "black"

    students['color'] = students.apply(mname, axis=1)

    We create a list of conditions and their respective values if evaluated True, and use np.select, where default is the value used when none of the conditions is True.
    conditions = [
        (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
        (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
        (students['Zodiac Signs'] == 'Pisces')]
    choices = ['yellow', 'blue', 'purple']
    students['color'] = np.select(conditions, choices, default='black')
          Names Zodiac Signs flag   color
    0      John     Aquarius  yes  yellow
    1      Mary        Libra   no    blue
    2     Henry       Gemini  yes   black
    3  Augustus       Pisces   no  purple
    4     Kenny        Virgo   no   black

    Select numeric or categorical columns only
    To include numeric columns we use select_dtypes( ) 
    data1 = iris.select_dtypes(include=[np.number])
    The private method _get_numeric_data( ) also selects only the numeric columns.
    data3 = iris._get_numeric_data()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
    0           5.1          3.5           1.4          0.2      5.1      5.1
    1           4.9          3.0           1.4          0.2     10.0     10.0
    2           4.7          3.2           1.3          0.2     14.7     14.7
    For selecting categorical variables:
    data4 = iris.select_dtypes(include = ['object'])
    0    setosa
    1    setosa

    We create 2 dataframes containing the details of the students:
    students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                             'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
    students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                              'Marks' : [50,81,98,25,35]})
    Using the pd.concat( ) function we can join the 2 dataframes:
    data = pd.concat([students,students2])  #by default axis = 0
       Marks     Names Zodiac Signs
    0    NaN      John     Aquarius
    1    NaN      Mary        Libra
    2    NaN     Henry       Gemini
    3    NaN  Augustus       Pisces
    4    NaN     Kenny        Virgo
    0   50.0      John          NaN
    1   81.0      Mary          NaN
    2   98.0     Henry          NaN
    3   25.0  Augustus          NaN
    4   35.0     Kenny          NaN
    By default axis = 0, thus the new dataframe is added row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1
    data = pd.concat([students,students2],axis = 1)
          Names Zodiac Signs  Marks     Names
    0      John     Aquarius     50      John
    1      Mary        Libra     81      Mary
    2     Henry       Gemini     98     Henry
    3  Augustus       Pisces     25  Augustus
    4     Kenny        Virgo     35     Kenny
    Using the append( ) function we can join the dataframes row-wise:
    students.append(students2)  #for rows
    Alternatively we can create a dictionary of the two dataframes and use pd.concat( ) to join them row-wise; each dataframe is then labelled with its dictionary key:
    classes = {'x': students, 'y': students2}
     result = pd.concat(classes)
         Marks     Names Zodiac Signs
    x 0    NaN      John     Aquarius
      1    NaN      Mary        Libra
      2    NaN     Henry       Gemini
      3    NaN  Augustus       Pisces
      4    NaN     Kenny        Virgo
    y 0   50.0      John          NaN
      1   81.0      Mary          NaN
      2   98.0     Henry          NaN
      3   25.0  Augustus          NaN
      4   35.0     Kenny          NaN

    Merging or joining on the basis of a common variable
    We take 2 dataframes with a different number of observations:
    students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                             'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
    students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                              'Marks' : [50,81,98,25,35]})
    Using pd.merge we can join the two dataframes. on = 'Names' denotes that 'Names' is the common variable on the basis of which the dataframes are to be combined.
    result = pd.merge(students, students2, on='Names')  #by default only the intersection is kept
       Names Zodiac Signs  Marks
    0   John     Aquarius     50
    1   Mary        Libra     81
    2  Henry       Gemini     98
    By default how = "inner", thus it takes only the elements common to both dataframes. If you want all the elements of both dataframes, set how = "outer"
    result = pd.merge(students, students2, on='Names',how = "outer")  #takes the union
          Names Zodiac Signs  Marks
    0      John     Aquarius   50.0
    1      Mary        Libra   81.0
    2     Henry       Gemini   98.0
    3     Maria    Capricorn    NaN
    4  Augustus          NaN   25.0
    5     Kenny          NaN   35.0
    To keep all the rows of the left dataframe, and only the matching rows of the right one, set how = 'left'
    result = pd.merge(students, students2, on='Names',how = "left")
       Names Zodiac Signs  Marks
    0   John     Aquarius   50.0
    1   Mary        Libra   81.0
    2  Henry       Gemini   98.0
    3  Maria    Capricorn    NaN
    Similarly how = 'right' keeps all the rows of the right dataframe and only the matching rows of the left one.
    result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
          Names Zodiac Signs  Marks      _merge
    0      John     Aquarius     50        both
    1      Mary        Libra     81        both
    2     Henry       Gemini     98        both
    3  Augustus          NaN     25  right_only
    4     Kenny          NaN     35  right_only
    indicator = True creates a column showing whether each row is present in both dataframes or only in the left or the right one.

    Visualize activations functions using keras

    In keras, we can visualize the geometric properties of activation functions by applying backend functions to the layers of a model.

    We all know the exact functional form of popular activation functions such as 'sigmoid', 'tanh', 'relu', etc., and we can feed data to these functions to directly obtain their output. But how can we do that via keras without explicitly specifying their functional forms?

    This can be done following the four steps below:

    1. Define a simple MLP model with one-dimensional input data, a dense hidden layer with a single neuron, and an output layer of one neuron with a 'linear' activation function.
    2. Extract the layers' outputs of the model (fitted or not) by iterating through model.layers.
    3. Use the backend function K.function() to obtain the calculated output for a given input.
    4. Feed the desired data to the above functions to obtain the output of the chosen activation function.

    The code below is a demo:

    from keras.layers import Dense, Activation
    from keras.models import Sequential
    import keras.backend as K
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # The settings below enable Chinese characters in matplotlib plots, following http://blog.csdn.net/rumswell/article/details/6544377
    plt.rcParams['font.sans-serif'] = ['SimHei'] #set the default font
    plt.rcParams['axes.unicode_minus'] = False #avoid characters rendering as boxes in the figure

    def NNmodel(activationFunc='linear'):
        # A single generic model is used for every activation function;
        # the hidden layer is named 'Hidden' so it can be looked up below.
        model = Sequential()
        model.add(Dense(1, input_shape=(1,), activation=activationFunc,
                        name='Hidden'))
        model.add(Dense(1, activation='linear', name='Output'))
        model.compile(loss='mse', optimizer='sgd')
        return model

    def VisualActivation(activationFunc='relu', plot=True):
        x = (np.arange(100)-50)/10

        model = NNmodel(activationFunc = activationFunc)

        inX = model.input
        outputs = [layer.output for layer in model.layers if layer.name=='Hidden']
        functions = [K.function([inX], [out]) for out in outputs]

        layer_outs = [func([x.reshape(-1, 1)]) for func in functions]
        activationLayer = layer_outs[0][0]

        activationDf = pd.DataFrame(activationLayer)
        result = pd.concat([pd.DataFrame(x), activationDf], axis=1)
        result.columns = ['X', 'Activated']
        result.set_index('X', inplace=True)
        if plot:
            result.plot(title=activationFunc)  #a simple pandas line plot is assumed here
        return result

    # Now we can visualize them (assuming default settings) :
    actFuncs = ['linear', 'softmax', 'sigmoid', 'tanh', 'softsign', 'hard_sigmoid', 'softplus', 'selu', 'elu']

    figure = plt.figure()
    for i, f in enumerate(actFuncs):
        # plot each activation function in turn
        figure.add_subplot(3, 3, i+1)
        out = VisualActivation(activationFunc=f, plot=False)
        plt.plot(out.index, out.Activated)

    The resulting figure is the output of the above code. As we can see, the geometric shape of each activation function is well captured.
