Using Python to work with SAS Viya and CAS

One of the big benefits of the SAS Viya platform is how approachable it is for programmers of other languages. You don't have to learn SAS in order to become productive quickly. We've seen a lot of interest from people who code in Python, maybe because that language has become known for its application in machine learning. SAS has a new product called SAS Visual Data Mining and Machine Learning. And these days, you can't offer such a product without also offering something special to those Python enthusiasts.

Introducing Python SWAT

And so, SAS has published the Python SWAT project (where "SWAT" stands for the SAS Scripting Wrapper for Analytics Transfer). The project is a Python code library that SAS released using an open source model. That means that you can download it for free, make changes locally, and even contribute those changes back to the community (as some developers have already done!). You'll find it at github.com/sassoftware/python-swat.

SAS developer Kevin Smith is the main contributor on Python SWAT, and he's a big fan of Python. He's also an expert in SAS and in many programming languages. If you're a SAS user, you probably run Kevin's code every day; he was an original developer on the SAS Output Delivery System (ODS). Now he's a member of the cloud analytics team in SAS R&D. (He's also the author of more than a few conference papers and SAS books.)

Kevin enjoys the dynamic, fluid style that a scripting language like Python affords - versus the more formal "code-compile-build-execute" model of a compiled language. Watch this video (about 14 minutes) in which Kevin talks about what he likes in Python, and shows off how Python SWAT can drive SAS' machine learning capabilities.

New -- but familiar -- syntax for Python coders

The analytics engine behind the SAS Viya platform is called CAS, or SAS Cloud Analytic Services. You'll want to learn that term, because "CAS" is used throughout the SAS documentation and APIs. And while CAS might be new to you, the Python approach to CAS should feel very familiar for users of Python libraries, especially users of pandas, the Python Data Analysis Library.

CAS and SAS' Python SWAT extend these concepts to provide intuitive, high-performance analytics from SAS Viya in your favorite Python environment, whether that's a Jupyter notebook or a simple console. Watch the video to see Kevin's demo and discussion about how to get started. You'll learn:

  • How to connect your Python session to the CAS server
  • How to upload data from your client to the CAS server
  • How SWAT extends the concept of the DataFrame API in pandas to leverage CAS capabilities
  • How to coax CAS to provide descriptive statistics about your data, and then go beyond what's built into the traditional DataFrame methods.

Learn more about SAS Viya and Python

There are plenty of helpful resources to help you learn about using Python with SAS Viya:

And finally, what if you don't have SAS Viya yet, but you're interested in using Python with SAS 9.4? Check out the SASPy project, which allows you to access your traditional SAS features from a Jupyter notebook or Python console. It's another popular open source project from SAS R&D.

The post Using Python to work with SAS Viya and CAS appeared first on The SAS Dummy.


Python Data Structures

This post explains the data structures used in Python. Understanding them is essential in any programming language. Python offers several built-in data structures, including the following:
  1. strings
  2. lists
  3. tuples
  4. dictionaries
  5. sets


1. Strings

Python String is a sequence of characters.

How to create a string in Python

You can create a Python string using single or double quotes.
mystring = "Hello Python3.6"
print(mystring)
Hello Python3.6

Can I use multiple single or double quotes to define string?

Yes. Triple quotes also let a string span several lines. See the examples below -

Multiple Single Quotes
mystring = '''Hello Python3.6'''
print(mystring)
Hello Python3.6
Multiple Double Quotes
mystring = """Hello Python3.6"""
print(mystring)
Hello Python3.6

How to include quotes within a string?
Enclose the string in single quotes when it contains double quotes (or the other way round).
mystring = 'Hello"Python"'

How to extract Nth letter or word?

You can use the syntax below to get the first letter.
mystring = 'Hi How are you?'
mystring[0] refers to the first letter, since indexing in Python starts from 0. Similarly, mystring[1] refers to the second letter.

To pull the last letter, you can use -1 as the index: mystring[-1] returns '?'.

To get first word
mystring.split(' ')[0]
Output : Hi

How it works -

1. mystring.split(' ') tells Python to use space as a delimiter.

Output : ['Hi', 'How', 'are', 'you?']

2. mystring.split(' ')[0] tells Python to pick first word of a string.
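
The indexing and splitting steps above can be put together in one small sketch. The variable names are only illustrative. Note that split() without an argument splits on any run of whitespace, which is usually more forgiving than split(' ').

```python
# Index and split the example sentence used above.
mystring = 'Hi How are you?'

first_letter = mystring[0]            # indexing starts at 0
last_letter = mystring[-1]            # -1 counts from the right
words = mystring.split()              # no argument: split on any whitespace
first_word = words[0]
third_word = mystring.split(' ')[2]   # explicit single-space delimiter

print(first_letter, last_letter, first_word, third_word)
```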

2. List

Unlike a string, a list can contain different types of objects such as integers, floats, strings etc.
  1. x = [142, 124, 234, 345, 465]
  2. y = ['A', 'C', 'E', 'M']
  3. z = ['AA', 44, 5.1, 'KK']

Get List Item

We can extract a list item using its index. Indexes start at 0 and end at (number of elements - 1).
k = [124, 225, 305, 246, 259]
k[0] returns 124
k[-1] returns 259

Explanation :
k[0] picks the first element of the list. A negative sign tells Python to count from right to left, so k[-1] selects the last element of the list.

To select multiple elements from a list, you can use the following method :
k[:3] returns [124, 225, 305]
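
A few more slicing patterns may help here. This is a minimal sketch on the same list k; the variable names are made up for illustration. A slice has the form start:stop:step, and the stop position is always excluded.

```python
# Slicing sketch on the list k from the text above.
k = [124, 225, 305, 246, 259]

first_three = k[:3]    # positions 0, 1, 2
last_two = k[-2:]      # the final two items
middle = k[1:4]        # positions 1 through 3 (stop index is excluded)
every_other = k[::2]   # every second item, starting at position 0
reversed_k = k[::-1]   # a reversed copy; k itself is unchanged

print(first_three, last_two, middle, every_other, reversed_k)
```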

Add 5 to each element of a list

In the program below, the len() function counts the number of elements in the list. In this case, it returns 5. With the help of the range() function, range(5) yields 0, 1, 2, 3, 4.
x = [1, 2, 3, 4, 5]
for i in range(len(x)):
    x[i] = x[i] + 5
print(x)
[6, 7, 8, 9, 10]

It can also be written like this -
for i in range(len(x)):
   x[i] += 5
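
The same result is often written as a list comprehension. The sketch below is one possible alternative, not a replacement for the loop: it builds a new list and leaves the original untouched, whereas the loop above modifies x in place.

```python
# Add 5 to each element with a list comprehension.
x = [1, 2, 3, 4, 5]
x_plus_5 = [value + 5 for value in x]   # new list; x is not modified

print(x_plus_5)
```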

Combine / Join two lists

The '+' operator concatenates two lists.
X = [1, 2, 3]
Y = [4, 5, 6]
Z = X + Y
print(Z)
[1, 2, 3, 4, 5, 6]

Sum of values of two lists
import numpy as np
X = [1, 2, 3]
Y = [4, 5, 6]
Z = np.add(X, Y)
print(Z)
[5 7 9]
Similarly, you can use np.multiply(X, Y) to multiply the values of two lists.
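
If NumPy is not available, the element-wise sum can also be sketched with the built-in zip() function. This is an alternative to np.add, not what the example above uses, and it returns a plain list rather than a NumPy array. Keep in mind that zip() stops at the shorter list.

```python
# Element-wise sum and product of two lists without NumPy.
X = [1, 2, 3]
Y = [4, 5, 6]

sums = [a + b for a, b in zip(X, Y)]       # pairs (1, 4), (2, 5), (3, 6)
products = [a * b for a, b in zip(X, Y)]

print(sums, products)
```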

Repeat List N times

The '*' operator repeats a list N times.
X = [1, 2, 3]
Z = X * 3
print(Z)
[1, 2, 3, 1, 2, 3, 1, 2, 3]

Note : The '+' and '*' operators also work on lists of strings.

Modify / Replace a list item

Suppose you need to replace the third value with a different value.
X = [1, 2, 3]
X[2] = 5
print(X)
[1, 2, 5]

Add / Remove a list item

We can add a list item by using the append method.
X = ['AA', 'BB', 'CC']
X.append('DD')
Result : ['AA', 'BB', 'CC', 'DD']

Similarly, we can remove a list item by using the remove method.
X = ['AA', 'BB', 'CC']
X.remove('BB')
Result : ['AA', 'CC']

Sort list
k = [124, 225, 305, 246, 259]
k.sort()
Output : [124, 225, 246, 259, 305]

3. Tuple

Like a list, a tuple can contain mixed data. But a tuple cannot be changed or altered once created, whereas a list can be modified. Another difference is that a tuple is created inside parentheses ( ), whereas a list is created inside square brackets [ ].

mytuple = (123,223,323)
City = ('Delhi','Mumbai','Bangalore')
Perform for loop on Tuple
for i in City:
    print(i)

Tuple cannot be altered

Run the following commands and check the error
X = (1, 2, 3)
X[0] = 5
TypeError: 'tuple' object does not support item assignment
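
A short sketch of what that error means in practice: the assignment is caught with try/except, and a "changed" tuple is obtained by building a new one. The names message and Y are only illustrative.

```python
# Tuples are immutable: item assignment raises TypeError.
X = (1, 2, 3)

try:
    X[0] = 5
except TypeError as err:
    message = str(err)

# To "change" a tuple, build a new one from the pieces you want to keep.
Y = (5,) + X[1:]

print(message)
print(Y)
```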

4. Dictionary

A dictionary works like an address book in which you can find a person's address by searching for the name. In this example, the name of a person is the key and the address is the value. It is important to note that keys must be unique while values need not be. If a key is repeated, Python keeps only the last value assigned to it, so you could no longer look up the earlier one. Keys can be of any immutable (hashable) data type, such as strings, numbers, or tuples.

Create a dictionary

It is defined in curly braces {}. Each key is followed by a colon (:) and then its value.
teams = {'Dave' : 'team A',
         'Tim' : 'team B',
         'Babita' : 'team C',
         'Sam' : 'team B',
         'Ravi' : 'team C'
        }

Find Values
teams['Sam']
Output : 'team B'

Delete an item
del teams['Ravi']

Add an item
teams['Deep'] = 'team B'
Output :
{'Babita': 'team C',
 'Dave': 'team A',
 'Deep': 'team B',
 'Sam': 'team B',
 'Tim': 'team B'}
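
The dictionary steps above can be replayed in one runnable sketch. It also shows two things the text does not cover, added here as suggestions: get() for a lookup that should not fail on a missing key, and items() for looping over key/value pairs.

```python
# Build, modify and query the teams dictionary from the text.
teams = {'Dave': 'team A', 'Tim': 'team B', 'Babita': 'team C',
         'Sam': 'team B', 'Ravi': 'team C'}

del teams['Ravi']           # delete an item
teams['Deep'] = 'team B'    # add an item

has_ravi = 'Ravi' in teams              # membership test on keys
fallback = teams.get('Ravi', 'unknown') # no KeyError for a missing key

# Collect everyone on team B by looping over key/value pairs.
team_b = sorted(name for name, team in teams.items() if team == 'team B')

print(has_ravi, fallback, team_b)
```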

5. Sets

Sets are unordered collections of simple objects.
X = set(['A', 'B', 'C'])

Q. Does 'A' exist in set X?
'A' in X
Result : True

Q. Does 'D' exist in set X?
'D' in X
Result : False

Q. How to add 'D' in set X?
X.add('D')
Q. How to remove 'C' from set X?
X.remove('C')
Q. How to create a copy of set X?
Y = X.copy()
Q. Which items are common in both sets X and Y?
Y & X
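
Besides the intersection (&), sets support a few other operators. The sketch below uses two small made-up sets to show them, along with the common trick of removing duplicates from a list by converting it to a set.

```python
# Set operators on two illustrative sets.
X = {'A', 'B', 'C'}
Y = {'B', 'C', 'D'}

common = X & Y       # in both sets
either = X | Y       # in at least one set
only_x = X - Y       # in X but not in Y
one_side = X ^ Y     # in exactly one of the two sets

# A set keeps each value once, so it removes duplicates from a list.
deduped = sorted(set(['A', 'B', 'A', 'C', 'B']))

print(common, either, only_x, one_side, deduped)
```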

Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s TensorFlow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig. 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science software (those used in fewer than 750 scholarly articles in 2016).

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth, but is still of little interest.
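
To make the arithmetic concrete, here is a minimal sketch of the year-over-year calculation, using the usual definition of percent change, (new - old) / old. By that formula a jump from 1 article to 5 is a fivefold increase, or 400% growth. The tool names and counts in the dictionary are hypothetical and are not taken from the article's data; only the 500-article threshold comes from the text.

```python
# Percent change between two yearly article counts, plus the
# "at least 500 articles in 2015" filter described above.
def percent_change(old, new):
    """Change from old to new, as a percentage of old."""
    return (new - old) * 100 / old

MIN_ARTICLES_2015 = 500  # threshold stated in the text

# Hypothetical (2015, 2016) counts, for illustration only.
counts = {'tiny_tool': (1, 5), 'big_tool': (2000, 2300)}

tiny_growth = percent_change(1, 5)

kept = {
    name: percent_change(old, new)
    for name, (old, new) in counts.items()
    if old >= MIN_ARTICLES_2015
}

print(tiny_growth, kept)
```

The filter drops the tiny tool despite its large percentage, which is the point the paragraph makes: huge relative growth from a negligible base says little about market share.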


Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.


Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down.  Minitab’s growth has also seemed to stall in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.


Python Tutorial for Beginners: Learn in 3 Days

This tutorial helps you get started with Python. It's a step-by-step practical guide to learning Python by example. Python is an open source, high-level language that is widely used for general-purpose programming, and it has become very popular in the data science world. With the data science field growing, IBM recently predicted that demand for data science professionals would rise by more than 25% by 2020. In the PYPL PopularitY of Programming Language index, Python ranked second with a 14 percent share. In the advanced analytics and predictive analytics market, it is ranked among the top 3 programming languages.

Table of Contents
  1. Getting Started with Python
    • Python 2.7 vs. 3.6
    • Python for Data Science
    • How to install Python?
    • Spyder Shortcut keys
    • Basic programs in Python
    • Comparison, Logical and Assignment Operators
  2. Data Structures and Conditional Statements
    • Python Data Structures
    • Python Conditional Statements
  3. Python Libraries
    • List of popular packages (comparison with R)
    • Popular python commands
    • How to import a package
  4. Data Manipulation using Pandas
    • Pandas Data Structures - Series and DataFrame
    • Important Pandas Functions (vs. R functions)
    • Examples - Data analysis with Pandas
  5. Data Science with Python
    • Logistic Regression
    • Decision Tree
    • Random Forest
    • Grid Search - Hyper Parameter Tuning
    • Cross Validation
    • Preprocessing Steps

Python 2.7 vs 3.6

Google yields thousands of articles on this topic. Some bloggers are opposed to Python 2.7 and some are in favor of it. If you filter your search and look only at recent articles (late 2016 onwards), you will see that the majority of bloggers favor Python 3.6. The following reasons support choosing Python 3.6.

1. The official end-of-life date for Python 2.7 is 2020. After that there will be no support from the community. It does not make sense to pick up 2.7 if you are starting to learn Python today.

2. About 95% of the top 360 Python packages, and almost 100% of the top data science packages, support Python 3.6.

What's new in Python 3.6

It is cleaner and faster. It is a language for the future. It fixed major issues in the Python 2 series. Python 3 was first released in 2008, so the Python 3 series has had 9 years of increasingly robust releases.

Key Takeaway
You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6. It is not too difficult to move from Python 3 to Python 2 with a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first-choice programming language of data scientists.

There are several reasons to learn Python. Some of them are as follows -
  1. Python works well for automating the various steps of a predictive model.
  2. Python has robust libraries for machine learning, natural language processing, deep learning, big data and artificial intelligence.
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has great online community support.
Did you know these sites are developed in Python?
  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

How to Install Python

There are two ways to download and install Python
  1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
  2. Download Python from its official website. You have to manually install libraries.

Recommended : Go for the first option and download Anaconda. It saves a lot of time when learning and coding in Python.

Coding Environments

Anaconda comes with two popular coding environments :
  1. Jupyter (IPython) Notebook
  2. Spyder
Spyder. It is like RStudio for Python. It provides a user-friendly environment for writing Python code. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs. It has a console where you can check each line of code. Under the 'Variable explorer', you can access the data objects and functions you have created. I highly recommend Spyder!

Jupyter (IPython) Notebook

Jupyter Notebook is the equivalent of R Markdown. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, words, and graphics.

Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys that make you more productive.
  1. Press F5 to run the entire script
  2. Press F9 to run selection or line 
  3. Press Ctrl + 1 to comment / uncomment
  4. Go to front of function and then press Ctrl + I to see documentation of the function
  5. Run %reset -f to clean workspace
  6. Ctrl + Left click on object to see source code 
  7. Ctrl+Enter executes the current cell.
  8. Shift+Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples

Arithmetic Operators   Operation             Example
+                      Addition              10 + 2 = 12
-                      Subtraction           10 - 2 = 8
*                      Multiplication        10 * 2 = 20
/                      Division              10 / 2 = 5.0
%                      Modulus (Remainder)   10 % 3 = 1
**                     Power                 10 ** 2 = 100
//                     Floor                 17 // 3 = 5
(x + (d-1)) // d       Ceiling               (17 + (3-1)) // 3 = 6
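
The operators in the table can be checked with a few lines of code. This sketch also compares the integer ceiling trick with math.ceil() and shows that // rounds toward negative infinity. The ceiling trick is assumed here to be used with positive integers, as in the table.

```python
import math

x, d = 17, 3

quotient = x // d              # floor division
remainder = x % d              # modulus
ceiling = (x + (d - 1)) // d   # ceiling using integer arithmetic only
true_division = 10 / 2         # "/" always returns a float in Python 3
power = 10 ** 2

# The integer trick agrees with math.ceil for positive integers.
same_as_math_ceil = ceiling == math.ceil(x / d)

# Floor division rounds toward negative infinity, not toward zero.
negative_floor = -17 // 3

print(quotient, remainder, ceiling, true_division, power, negative_floor)
```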

Basic Programs

Example 1
x = 10
y = 3
print("10 divided by 3 is", round(x/y, 2))
print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.33
remainder after 10 divided by 3 is 1

Example 2
x = 100
x > 80 and x <=95
Out[45]: False
x > 35 or x < 60
Out[46]: True

Comparison & Logical Operators   Description                            Example
>                                Greater than                           5 > 3 returns True
<                                Less than                              5 < 3 returns False
>=                               Greater than or equal to               5 >= 3 returns True
<=                               Less than or equal to                  5 <= 3 returns False
==                               Equal to                               5 == 3 returns False
!=                               Not equal to                           5 != 3 returns True
and                              True if both conditions hold           x > 18 and x <= 35
or                               True if at least one condition holds   x > 35 or x < 60
not                              Opposite of the condition              not(x > 7)
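
A short sketch of these operators in action, using x = 100 as in Example 2. It also shows Python's chained comparison, which is not in the table above but expresses the same range test more compactly.

```python
x = 100

in_range = x > 80 and x <= 95   # both conditions must hold
chained = 80 < x <= 95          # same test written as a chained comparison
either = x > 35 or x < 60       # one true condition is enough
negated = not (x > 7)           # flips True to False

print(in_range, chained, either, negated)
```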

Assignment Operators

An assignment operator assigns a value to a variable. For example, x += 25 means x = x + 25.
x = 100
y = 10
x += y
In this case, x += y implies x = x + y, which is x = 100 + 10.
Similarly, you can use x -= y, x *= y and x /= y.
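
Here is a small sketch that applies the four operators in sequence and records x after each step. The history list is only there to make the intermediate values visible. Note that /= produces a float even when the division is exact.

```python
x = 100
y = 10
history = []

x += y              # x = x + y
history.append(x)
x -= y              # x = x - y
history.append(x)
x *= y              # x = x * y
history.append(x)
x /= y              # x = x / y, the result is a float
history.append(x)

print(history)
```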

Python Data Structure

In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

1. List

A list is a sequence of multiple values. It allows us to store different types of data such as integers, floats, strings etc. See the examples of lists below. The first is an integer list containing only integers. The second is a string list containing only string values. The third is a mixed list containing integer, string and float values.
  1. x = [1, 2, 3, 4, 5]
  2. y = ['A', 'O', 'G', 'M']
  3. z = ['A', 4, 5.1, 'M']
Get List Item

We can extract a list item using its index. Indexes start at 0 and end at (number of elements - 1).
x = [1, 2, 3, 4, 5]
x[0]
Out[68]: 1

x[1]
Out[69]: 2

x[4]
Out[70]: 5

x[-1]
Out[71]: 5

x[-2]
Out[72]: 4

x[0] picks the first element of the list. A negative sign tells Python to count from right to left, so x[-1] selects the last element of the list.

You can select multiple elements from a list using the following method
x[:3] returns [1, 2, 3]

2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
  1. A tuple cannot be changed once created, whereas a list can be modified.
  2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
K = (1,2,3)
City = ('Delhi','Mumbai','Bangalore')
Perform for loop on Tuple
for i in City:
    print(i)

Functions

Like print(), you can create your own custom functions. These are called user-defined functions. They help you automate repetitive tasks and reuse code more easily.

Rules to define a function
  1. A function definition starts with the def keyword followed by the function name and ( )
  2. The function body starts after a colon (:) and is indented
  3. The return keyword ends the function and gives back the value of the expression that follows it.
def sum_fun(a, b):
    result = a + b
    return result 
z = sum_fun(10, 15)
Result : z = 25

Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result
z = sum_fun(10)
In the above function, b is set to 0 if no value is provided for parameter b. That does not mean b can only be 0. The function can still be called as z = sum_fun(10, 15)
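
The three ways of calling such a function can be compared side by side. This sketch reuses sum_fun from above and adds a call with keyword arguments, which lets you pass values by name in any order.

```python
def sum_fun(a, b=0):
    """Return a + b; b falls back to 0 when it is not supplied."""
    return a + b

only_a = sum_fun(10)            # b uses its default value
both = sum_fun(10, 15)          # positional arguments
by_name = sum_fun(b=15, a=10)   # keyword arguments, order does not matter

print(only_a, both, by_name)
```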

Conditional Statements (if else)

Conditional statements are commonly used in coding. They are written as if-else statements and can be read like : "if a condition holds true, then execute something. Else execute something else"

Note : The if and else statements end with a colon :

k = 27
if k%5 == 0:
  print('Multiple of 5')
else:
  print('Not a Multiple of 5')
Result : Not a Multiple of 5
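
When there are more than two cases, elif adds further branches between if and else. The sketch below extends the example above. The function name describe and the extra checks for multiples of 3 are made up for illustration. Python tests the conditions from top to bottom and runs only the first branch that matches.

```python
def describe(k):
    """Classify k by divisibility; the first matching branch wins."""
    if k % 15 == 0:
        return 'Multiple of 3 and 5'
    elif k % 5 == 0:
        return 'Multiple of 5'
    elif k % 3 == 0:
        return 'Multiple of 3'
    else:
        return 'Not a multiple of 3 or 5'

for value in (27, 30, 10, 7):
    print(value, describe(value))
```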

Popular python packages for Data Analysis & Visualization

Some of the leading packages in Python, along with their equivalent libraries in R, are as follows -
  1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
  2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to perform operations on an entire column or table in one line. It is roughly comparable to the Rcpp package in R, which addresses R's slow speed.
  3. SciPy.  For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
  4. Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It covers the techniques available in R through the glm, knn, randomForest, rpart and e1071 packages.
  5. Matplotlib. For data visualization. It's a leading package for graphics in Python. It is comparable to the ggplot2 package in R.
  6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML format.
  7. pandasql.  It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda, so they are available in Spyder.
    Comparison of Python and R Packages by Data Mining Task

    Task                  Python Package             R Package
    IDE                   Rodeo / Spyder             RStudio
    Data Manipulation     pandas                     dplyr and reshape2
    Machine Learning      Scikit-learn               glm, knn, randomForest, rpart, e1071
    Data Visualization    ggplot + seaborn + bokeh   ggplot2
    Character Functions   Built-In Functions         stringr
    Reproducibility       Jupyter                    Knitr
    SQL Queries           pandasql                   sqldf
    Working with Dates    datetime                   lubridate
    Web Scraping          beautifulsoup              rvest

    Popular Python Commands

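    The commands below help you install, update and remove packages. The leading ! runs a shell command from an IPython console or Jupyter notebook; drop it when typing in a terminal. Let's say you want to install / uninstall the pandas package.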

    Install Package
    !pip install pandas

    Uninstall Package
    !pip uninstall pandas

    Show Information about Installed Package
    !pip show pandas

    List of Installed Packages
    !pip list

    Upgrade a package
    !pip install --upgrade pandas

      How to import a package

      There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

      1. import pandas as pd
      It imports the package pandas under the alias pd. A function DataFrame in package pandas is then submitted with pd.DataFrame.

      2. import pandas
      It imports the package without using alias but here the function DataFrame is submitted with full package name pandas.DataFrame

      3. from pandas import *
      It imports the whole package and the function DataFrame is executed simply by typing DataFrame. It sometimes creates confusion when same function name exists in more than one package.
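
The name clash mentioned in the third style can be sketched without installing anything, using the standard-library modules math and cmath instead of pandas (a substitution made here purely for illustration). Both modules define a function called sqrt. The example imports the name explicitly; `from module import *` behaves the same way but brings in every public name at once, which makes such clashes harder to spot.

```python
import math as m         # style 1: alias
import math              # style 2: full package name
from math import sqrt    # direct import of one name

a = m.sqrt(16)
b = math.sqrt(16)
c = sqrt(16)             # at this point sqrt is math.sqrt

# Importing the same name from another module silently rebinds it.
from cmath import sqrt
d = sqrt(16)             # now cmath.sqrt, which returns a complex number

print(a, b, c, d)
```

With the alias or the full package name it is always clear which sqrt is being called, which is why those two styles are generally the safer default.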

      Pandas Data Structures : Series and DataFrame

      In the pandas package, there are two data structures - Series and DataFrame. These structures are explained below in detail -
      1. Series is a one-dimensional array. You can access individual elements of a series using position. It's similar to a vector in R.
      In the example below, we are generating 5 random values.
      import numpy as np
      import pandas as pd
      s1 = pd.Series(np.random.randn(5))
      print(s1)
      0   -2.412015
      1   -0.451752
      2    1.174207
      3    0.766348
      4   -0.361815
      dtype: float64

      Extract first and second value

      You can get a particular element of a series using index value. See the examples below -

      0   -2.412015
      1 -0.451752
      2 1.174207
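      A minimal sketch of element access, assuming the Series s1 from above. The values are hard-coded from the printed output so the example is reproducible, and the names first, second and first_three are illustrative only.

```python
import pandas as pd

# Values copied from the output shown above (random in the original)
s1 = pd.Series([-2.412015, -0.451752, 1.174207, 0.766348, -0.361815])

first = s1.iloc[0]          # first value, selected by position
second = s1.iloc[1]         # second value, selected by position
first_three = s1.iloc[:3]   # slice covering positions 0, 1 and 2
by_label = s1[0]            # lookup by index label; labels here are 0..4

print(first_three)
```

      Because the default index is 0, 1, 2, ..., position and label coincide here; with a custom index, .iloc (position) and .loc (label) give different answers.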

      2. DataFrame

      It is equivalent to data.frame in R. It is a 2-dimensional data structure whose columns can hold different data types such as characters, integers, floating point values and categories (the pandas counterpart of R factors). If you are comfortable with MS Excel, you can think of a DataFrame as a spreadsheet.

      Comparison of Data Type in Python and Pandas

      The following table shows how standard Python and the pandas package store data.

      Data Type                                Pandas        Standard Python
      For character variable                   object        str
      For categorical variable                 category      -
      For numeric variable without decimals    int64         int
      For numeric variable with decimals       float64       float
      For date time variables                  datetime64    datetime (from the datetime module)
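      To see these types in practice, here is a small sketch that builds one column of each kind and prints the dtypes. The frame demo and its column names are made up for illustration. Recent pandas releases may report a dedicated string dtype instead of object for text columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["a", "b", "c"],                                    # character
    "grade": pd.Series(["low", "high", "low"], dtype="category"),  # categorical
    "count": [1, 2, 3],                                         # integers
    "price": [1.5, 2.5, 3.5],                                   # decimals
    "when": pd.to_datetime(["2017-05-20", "2017-05-21", "2017-05-22"]),
})

# One dtype per column
print(demo.dtypes)
```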

      Important Pandas Functions

      The table below compares pandas functions with their R counterparts for common data wrangling and manipulation tasks. It should help you memorise the pandas functions, and it is handy for programmers who are new to Python. It covers most of the frequently used data exploration tasks.

      Functions                           R                            Python (pandas package)
      Installing a package                install.packages('name')     !pip install name
      Loading a package                   library(name)                import name as other_name
      Checking working directory          getwd()                      import os; os.getcwd()
      Setting working directory           setwd()                      os.chdir()
      List files in a directory           dir()                        os.listdir()
      Remove an object                    rm('name')                   del object
      Select Variables                    select(df, x1, x2)           df[['x1', 'x2']]
      Drop Variables                      select(df, -(x1:x2))         df.drop(['x1', 'x2'], axis = 1)
      Filter Data                         filter(df, x1 >= 100)        df.query('x1 >= 100')
      Structure of a DataFrame            str(df)                      df.info()
      Summarize dataframe                 summary(df)                  df.describe()
      Get row names of dataframe "df"     rownames(df)                 df.index
      Get column names                    colnames(df)                 df.columns
      View Top N rows                     head(df,N)                   df.head(N)
      View Bottom N rows                  tail(df,N)                   df.tail(N)
      Get dimension of data frame         dim(df)                      df.shape
      Get number of rows                  nrow(df)                     df.shape[0]
      Get number of columns               ncol(df)                     df.shape[1]
      Length of data frame                length(df)                   len(df)
      Get random 3 rows from dataframe    sample_n(df, 3)              df.sample(n=3)
      Get random 10% rows                 sample_frac(df, 0.1)         df.sample(frac=0.1)
      Check Missing Values                is.na(df$x)                  pd.isnull(df.x)
      Sorting                             arrange(df, x1, x2)          df.sort_values(['x1', 'x2'])
      Rename Variables                    rename(df, newvar = x1)      df.rename(columns={'x1': 'newvar'})

      Note that length(df) in R returns the number of columns, while len(df) in pandas returns the number of rows.

      Data Manipulation with pandas - Examples

      1. Import Required Packages

      You can import required packages using the import statement. In the syntax below, we ask Python to import the numpy and pandas packages. The 'as' keyword gives each package a shorter alias.
      import numpy as np
      import pandas as pd

      2. Build DataFrame

      We can build dataframe using DataFrame() function of pandas package.
      mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
              'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
              'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
      df = pd.DataFrame(mydata)
       In this dataframe, we have three variables - productcode, sales, cost.
      Sample DataFrame

      To import data from CSV file

      You can use the read_csv() function from the pandas package to load data from a CSV file into Python.
      mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
      Make sure you use a double backslash when specifying a Windows path, because a single backslash starts an escape sequence in a Python string. Alternatively, you can use forward slashes in the path passed to read_csv().

      Detailed Tutorial : Import Data in Python

      3. To see number of rows and columns

      You can run the command below to find out number of rows and columns.
       Result : (6, 3). It means 6 rows and 3 columns.
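      A minimal sketch of this step, assuming the df built in step 2. It uses the shape attribute listed in the functions table earlier.

```python
import pandas as pd

mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
          'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
          'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
df = pd.DataFrame(mydata)

# shape is a (rows, columns) tuple
print(df.shape)
n_rows, n_cols = df.shape
```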

      4. To view first 3 rows

      The df.head(N) function can be used to check out first some N rows.
           cost productcode   sales
      0 1020.0 AA 1010.0
      1 1625.2 AA 1025.2
      2 1204.0 AA 1404.2
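      As a sketch, assuming the df from step 2, the call looks like the block below. Recent pandas versions keep the dictionary's insertion order, so the columns may print as productcode, sales, cost rather than alphabetically as in the output above.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

# First 3 rows; df.tail(3) would give the last 3
top3 = df.head(3)
print(top3)
```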

      5. Select or Drop Variables

      To keep a single variable, you can use any of the following three methods -
      df.productcode
      df["productcode"]
      df.loc[: , "productcode"]
      To select a variable by column position, use the df.iloc indexer. In the example below, we select the second column. The column index starts from 0, so 1 refers to the second column.
      df.iloc[: , 1]
      We can keep multiple variables by listing them inside [ ]. The df.loc indexer does the same.
      df[["productcode", "cost"]]
      df.loc[ : , ["productcode", "cost"]]

      Drop Variable

      We can remove variables by using df.drop() function. See the example below -
      df2 = df.drop(['sales'], axis = 1)

      6. To summarize data frame

      To summarize or explore data, you can submit the command below.
                    cost       sales
      count 6.000000 6.00000
      mean 1166.150000 1242.65000
      std 237.926793 230.46669
      min 1003.700000 1010.00000
      25% 1020.000000 1058.90000
      50% 1072.000000 1205.85000
      75% 1184.000000 1366.07500
      max 1625.200000 1604.80000
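      A sketch of the call that produces a summary like the one above, assuming the df from step 2. By default describe() covers only the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()
print(summary)
```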

      To summarise all the character variables, you can use the following script.
      Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

      To select only a particular variable, you can write the following code -
      count      6
      unique 2
      top BB
      freq 3
      Name: productcode, dtype: object
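      One way to get both summaries, assuming the df from step 2. The select_dtypes(exclude="number") filter is a stand-in chosen here because it behaves the same across pandas versions; the include= form mentioned above is an alternative.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

# Summary of every non-numeric (character) column
char_summary = df.select_dtypes(exclude="number").describe()

# Summary of a single variable
one_var = df["productcode"].describe()
print(one_var)
```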

      7. To calculate summary statistics

      We can manually find out summary statistics such as count, mean, median by using commands below
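      A short sketch of such commands, assuming the df from step 2. The variable names n_sales, mean_sales and median_sales are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

n_sales = df["sales"].count()        # number of non-missing values
mean_sales = df["sales"].mean()      # arithmetic mean
median_sales = df["sales"].median()  # middle value
print(n_sales, mean_sales, median_sales)
```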

      8. Filter Data

      Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
      df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
      It can also be written like :
      df1 = df.query('(productcode == "AA") & (sales >= 1250)')
      In the second query, we do not need to specify DataFrame along with variable name.

      9. Sort Data

      In the code below, we arrange the data in ascending order by sales.
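      A minimal sketch of the sort, assuming the df from step 2 and the sort_values() method listed in the functions table. The descending variant is added for comparison.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

by_sales = df.sort_values("sales")                        # ascending (default)
by_sales_desc = df.sort_values("sales", ascending=False)  # descending
print(by_sales)
```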

      10.  Group By : Summary by Grouping Variable

      As with SQL GROUP BY, you often want to summarize a continuous variable by a classification variable. In this case, we calculate the average sales and cost by product code.
                         cost        sales
      AA          1283.066667  1146.466667
      BB          1049.233333  1338.833333
      Instead of summarising multiple variables, you can run it for a single variable, i.e. sales.
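      Both summaries can be sketched with groupby(), assuming the df from step 2. The names by_code and sales_only are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

# Mean of cost and sales within each product code
by_code = df.groupby("productcode")[["cost", "sales"]].mean()

# Same summary for a single variable
sales_only = df.groupby("productcode")["sales"].mean()
print(by_code)
```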

      11. Define Categorical Variable

      Let's create a classification variable - id which contains only 3 unique values - 1/2/3.
      df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
      Let's define id as a categorical variable. We can use the astype() function for this.
      df0.id = df0["id"].astype('category')
      Summarize this classification variable to check descriptive statistics.
      count 7
      unique 3
      top 2
      freq 3
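      A sketch of the summary step, assuming the df0 frame defined above. For a categorical column, describe() reports count, unique, top and freq instead of numeric statistics.

```python
import pandas as pd

df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
df0["id"] = df0["id"].astype("category")

# Descriptive statistics of the categorical variable
id_summary = df0["id"].describe()
print(id_summary)
```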

      Frequency Distribution

      You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
      BB    3
      AA    3
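      A minimal sketch using value_counts(), assuming the df from step 2. The order of tied categories in the printed result can differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

# Number of rows per category
freq = df["productcode"].value_counts()
print(freq)
```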

      12. Generate Histogram

      A histogram is one of the methods to check the distribution of a continuous variable. In the figure, there are two values of the variable 'sales' in the range 1000-1100. Each of the remaining non-empty intervals holds a single value. In this case, there are only 6 values. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.
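      The bin counts behind such a histogram can be checked without drawing anything. The sketch below assumes the sales values from step 2 and ten equal-width bins; df.sales.hist() draws the corresponding figure when matplotlib is installed.

```python
import numpy as np

sales = [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8]

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(sales, bins=10)
print(counts)
print(edges.round(2))
```

      The first bin holds two values and four other bins hold one value each, which matches the description above.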

      13. BoxPlot

      A boxplot is a method to visualize a continuous or numeric variable. It shows the minimum, Q1, Q2 (median), Q3, IQR and maximum in a single graph.
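      The numbers a boxplot draws can be computed directly. This sketch assumes the df from step 2; df.boxplot(column="sales") draws the plot itself when matplotlib is installed.

```python
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
                   'cost': [1020, 1625.2, 1204, 1003.7, 1020, 1124]})

# Quartiles of sales; Q2 is the median
q1, q2, q3 = df["sales"].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1   # interquartile range: the height of the box
print(df["sales"].min(), q1, q2, q3, df["sales"].max(), iqr)
```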

      Data Science using Python - Examples

      In this section, we cover how to run data mining and machine learning algorithms with Python. sklearn is the most frequently used library for this purpose. We will also cover the statsmodels library for regression techniques. statsmodels produces formatted output that can be reused in project reports and presentations.

      1. Install the required libraries

      Import the following libraries before reading or exploring data
      #Import required libraries
      import pandas as pd
      import statsmodels.api as sm
      import numpy as np

      2. Download and import data into Python

      With pandas, we can easily read data from the web into Python.
      # Read data from web
      df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

      Variables    Type           Description
      gre          Continuous     Graduate Record Exam score
      gpa          Continuous     Grade Point Average
      rank         Categorical    Prestige of the undergraduate institution
      admit        Binary         Admission in graduate school

      The binary variable admit is a target variable.

      3. Explore Data

      Let's explore the data. We'll answer the following questions -
      1. How many rows and columns are in the data file?
      2. What is the distribution of each variable?
      3. Are there any outliers?
      4. If there are outliers, how should we treat them?
      5. Are there any missing values?
      6. If there are missing values, how should we impute them?
      # See no. of rows and columns
      df.shape
      Result : 400 rows and 4 columns

      In the code below, we rename the variable rank to 'position' because rank is already the name of a pandas DataFrame method, so df.rank would return the method rather than the column.
      # rename rank column
      df = df.rename(columns={'rank': 'position'})
      Summarize and plot all the columns.
      # Summarize
      df.describe()
      # plot all of the columns
      df.hist()
      Categorical variable Analysis

      It is important to check the frequency distribution of a categorical variable. It helps to answer the question of whether the data is skewed.
      # Frequency of each position value
      df.position.value_counts(ascending=True)
      1     61
      4     67
      3    121
      2    151

      Generating Crosstab 

      By looking at the cross tabulation report, we can check whether we have enough events for each unique value of the categorical variable.
      pd.crosstab(df['admit'], df['position'])
      position   1   2   3   4
      admit
      0         28  97  93  55
      1         33  54  28  12

      Number of Missing Values

      We can write a simple loop to figure out the number of blank values in all variables in a dataset.
      for i in list(df.columns) :
          k = sum(pd.isnull(df[i]))
          print(i, k)
      In this case, there are no missing values in the dataset.
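      The same count can be obtained without an explicit loop. The sketch below uses a small made-up frame (demo, with hypothetical values) so that some cells are actually missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing gre and one missing gpa
demo = pd.DataFrame({"gre": [380, np.nan, 800],
                     "gpa": [3.61, 3.67, np.nan],
                     "admit": [0, 1, 1]})

# isnull() marks missing cells; sum() counts them per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```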

      4. Logistic Regression Model

      Logistic Regression is a special type of regression where the target variable is categorical and the independent variables can be discrete or continuous. In this post, we demonstrate only binary logistic regression, where the target variable takes two values. Unlike linear regression, a logistic regression model returns the probability of the target event. It assumes a binomial distribution of the dependent variable. In other words, it belongs to the binomial family.

      In Python, we can write an R-style model formula y ~ x1 + x2 + x3 using the patsy and statsmodels libraries. In the formula, we define the variable 'position' as categorical by wrapping it in capital C(). You can also set the reference category with the reference= option.
      #Reference Category
      from patsy import dmatrices, Treatment
      y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
      It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for position, gre and gpa. Since 4 is set as the reference category, it is 0 in all three dummy variables. See the sample below, where P is position and P_1 to P_3 are its dummy variables -
      P  P_1  P_2  P_3
      3    0    0    1
      3    0    0    1
      1    1    0    0
      4    0    0    0
      4    0    0    0
      2    0    1    0

      Split Data into two parts

      80% of data goes to training dataset which is used for building model and 20% goes to test dataset which would be used for validating the model.
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      Build Logistic Regression Model

      By default, the regression without formula style does not include intercept. To include it, we already have added intercept in X_train which would be used as a predictor.
      #Fit Logit model
      logit = sm.Logit(y_train, X_train)
      result = logit.fit()

      #Summary of Logistic regression model
      result.summary()
                                Logit Regression Results
      Dep. Variable:    admit               No. Observations:    320
      Model:            Logit               Df Residuals:        315
      Method:           MLE                 Df Model:            4
      Date:             Sat, 20 May 2017    Pseudo R-squ.:       0.03399
      Time:             19:57:24            Log-Likelihood:      -193.49
      converged:        True                LL-Null:             -200.30
                                            LLR p-value:         0.008627

                             coef    std err         z     P>|z|    [95.0% Conf. Int.]
      C(position)[T.1]     1.4933      0.440     3.392     0.001      0.630     2.356
      C(position)[T.2]     0.6771      0.373     1.813     0.070     -0.055     1.409
      C(position)[T.3]     0.1071      0.410     0.261     0.794     -0.696     0.910
      gre                  0.0005      0.001     0.442     0.659     -0.002     0.003
      gpa                 -0.4613      0.214    -2.152     0.031     -0.881    -0.041

      Confusion Matrix and Odds Ratio

      The odds ratio is the exponential of a parameter estimate.
      #Confusion Matrix
      result.pred_table()
      #Odds Ratio
      np.exp(result.params)

      Prediction on Test Data
      In this step, we take the estimates of the logit model built on the training data and apply them to the test data.
      #prediction on test data
      y_pred = result.predict(X_test)

      Calculate Area under Curve (ROC)
      # AUC on test data
      from sklearn.metrics import roc_curve, auc, accuracy_score
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.6763

      Calculate Accuracy Score
      # classify as 1 when the predicted probability exceeds 0.5
      accuracy_score(y_test, [ 1 if p > 0.5 else 0 for p in y_pred ])

      Decision Tree Model

      A decision tree can have a continuous or a categorical target variable. When the target is continuous, it is called a regression tree. When it is categorical, it is called a classification tree. At each step, the tree selects the variable that best splits the set of values. There are several criteria and algorithms for finding the best split, such as Gini, Entropy, C4.5 and Chi-Square. Decision trees have several advantages. They are simple to use and easy to understand. They require very few data preparation steps. They can handle mixed data, both categorical and continuous variables. They are also fast to train.

      #Drop Intercept from predictors for tree algorithms
      X_train = X_train.drop(['Intercept'], axis = 1)
      X_test = X_test.drop(['Intercept'], axis = 1)

      #Decision Tree
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.metrics import roc_curve, auc
      model_tree = DecisionTreeClassifier(max_depth=7)

      #Fit the model:
      model_tree.fit(X_train, y_train['admit'])

      #Make predictions on test set
      predictions_tree = model_tree.predict_proba(X_test)

      #AUC on test data (column 1 holds the probability of admit = 1)
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.664

      Important Note
      Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select the best parameters by tuning them with a grid search.

      Random Forest Model

      A decision tree tends to overfit, which means it does not generalize the pattern well. It is very sensitive to small changes in the training data. Random forest addresses this problem. It grows a large number of trees on randomised samples of the data and considers a random subset of variables when growing each tree. It is a more robust algorithm than a single decision tree. It is one of the most popular machine learning algorithms and is commonly used in data science competitions.

      #Random Forest
      from sklearn.ensemble import RandomForestClassifier
      model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

      #Fit the model:
      target = y_train['admit']
      model_rf.fit(X_train, target)

      #Make predictions on test set
      predictions_rf = model_rf.predict_proba(X_test)

      #AUC on test data
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.6974

      #Variable Importance
      importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
      print(importances)

      Grid Search - Hyper Parameters Tuning

      The sklearn library makes hyper-parameter tuning very easy. Tuning is a strategy to select the best parameters for an algorithm. In scikit-learn, hyper-parameters are passed as arguments to the constructor of the estimator classes. For example, max_features for a random forest and alpha for lasso.

      from sklearn.model_selection import GridSearchCV
      rf = RandomForestClassifier()
      target = y_train['admit']

      param_grid = {
          'n_estimators': [100, 200, 300],
          'max_features': ['sqrt', 3, 4]
      }

      CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
      CV_rfc.fit(X_train, target)

      #Parameters with Scores
      CV_rfc.cv_results_

      #Best Parameters
      CV_rfc.best_params_

      #Make predictions on test set
      predictions_rf = CV_rfc.predict_proba(X_test)

      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      Cross Validation
      # Cross Validation
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_predict,cross_val_score
      target = y['admit']
      # X already contains an Intercept column, so the estimator's own intercept is switched off in both calls
      prediction_logit = cross_val_predict(LogisticRegression(fit_intercept = False), X, target, cv=10, method='predict_proba')
      cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')

      Data Mining : PreProcessing Steps

      1. The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to numeric. sklearn already provides LabelEncoder for this step, and the following script applies it to every such column.

      from sklearn.preprocessing import LabelEncoder
      def ConverttoNumeric(df):
          # columns stored as category or object (text)
          cols = list(df.select_dtypes(include=['category','object']))
          le = LabelEncoder()
          for i in cols:
              try:
                  # replace each label with an integer code
                  df[i] = le.fit_transform(df[i])
              except Exception:
                  print('Error in Variable :'+i)
          return df


      2. Create Dummy Variables

      Suppose you want to convert categorical variables into dummy variables. This differs from the previous example: it creates one indicator column per category instead of converting the categories to numeric codes.
      productcode_dummy = pd.get_dummies(df["productcode"])
      df2 = pd.concat([df, productcode_dummy], axis=1)

      The output looks like below -
         AA  BB
      0 1 0
      1 1 0
      2 1 0
      3 0 1
      4 0 1
      5 0 1

      Create k-1 Categories

      To avoid multi-collinearity, you can set one of the categories as the reference category and leave it out while creating dummy variables. In the script below, we leave out the first category.
      productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
      df2 = pd.concat([df, productcode_dummy], axis=1)

      3. Impute Missing Values

      Imputing missing values is an important step of predictive modeling. Many algorithms drop the complete row when a value is missing. If the data contains a lot of missing values, this can lead to a huge loss of data. There are multiple ways to impute missing values. Common techniques replace a missing value with the mean, the median or zero. Replacing a missing value with 0 makes sense when 0 is meaningful, for example in a flag for whether a customer holds a credit card product.

      Fill missing values of a particular variable
      # fill missing values with 0
      df['var1'] = df['var1'].fillna(0)
      # fill missing values with mean
      df['var1'] = df['var1'].fillna(df['var1'].mean())

      Apply imputation to the whole dataset
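      import numpy as np
      from sklearn.impute import SimpleImputer

      # Set an imputer object (SimpleImputer replaces the Imputer class, which was removed from scikit-learn)
      mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

      # Train the imputer (computes the mean of every column)
      mean_imputer = mean_imputer.fit(df)

      # Apply imputation
      df_new = mean_imputer.transform(df.values)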
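      A pandas-only sketch of the same idea, filling every numeric column with its own mean. The frame demo and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each numeric column
demo = pd.DataFrame({"var1": [1.0, np.nan, 3.0],
                     "var2": [10.0, 20.0, np.nan],
                     "label": ["a", "b", "c"]})

# mean(numeric_only=True) returns one mean per numeric column;
# fillna matches those means to columns by name
filled = demo.fillna(demo.mean(numeric_only=True))
print(filled)
```

      Unlike the imputer's transform(), which returns a NumPy array, this keeps the result as a DataFrame with its column names.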

      4. Outlier Treatment

      There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -
      1. Cap extreme values at 95th / 99th percentile depending on distribution
      2. Apply log transformation of variables. See below the implementation of log transformation in Python.
      import numpy as np
      df['var1'] = np.log(df['var1'])
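      The first method, capping at percentiles, can be sketched as below. The values and the 5th/95th percentile cut-offs are assumptions chosen for illustration; pick the percentiles that suit your distribution.

```python
import pandas as pd

# Hypothetical variable with one extreme value (500)
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 500], name="var1")

# Percentile cut-offs used as caps
lower, upper = values.quantile([0.05, 0.95])

# Values below lower are raised to lower; values above upper are reduced to upper
capped = values.clip(lower=lower, upper=upper)
print(capped)
```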

      5. Standardization

      Some algorithms require variables to be standardized before the actual algorithm is run. Standardization rescales a variable so that it has zero mean and unit variance (standard deviation of 1).

      #load dataset
      # load_boston was removed in scikit-learn 1.2; load_wine is used here as a stand-in dataset
      from sklearn.datasets import load_wine
      dataset = load_wine()
      predictors = dataset.data
      target = dataset.target
      df = pd.DataFrame(predictors, columns = dataset.feature_names)

      #Apply Standardization
      from sklearn.preprocessing import StandardScaler
      k = StandardScaler()
      df2 = k.fit_transform(df)
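      To see what standardization computes, the sketch below applies z = (x - mean) / std by hand to a small made-up frame. It uses the population standard deviation (ddof=0), which is the convention StandardScaler follows.

```python
import pandas as pd

# Hypothetical data on two very different scales
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                     "y": [10.0, 20.0, 30.0, 40.0]})

# Subtract each column's mean and divide by its population standard deviation
standardized = (demo - demo.mean()) / demo.std(ddof=0)
print(standardized)
```

      After this step both columns hold identical values, which shows that the original difference in scale is gone.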

      Next Steps

      Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites, explore them with the pandas package, and build statistical models using the sklearn package. I hope you find this tutorial helpful. I tried to cover all the important topics that a beginner must know about Python. After completing this tutorial, you should be able to program in Python and implement machine learning algorithms using the sklearn package.

      Dueling Data Science Surveys: KDnuggets & Rexer Go Live

      What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.

      For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll.  Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.

      Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use.  It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors.  You can take the Rexer Analytics survey here (use Access Code M7UY4).  Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.

      As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).




      Introducing SASPy: Use Python code to access SAS

      Thanks to a new open source project from SAS, Python coders can now bring the power of SAS into their Python scripts. The project is SASPy, and it's available on the SAS Software GitHub. It works with SAS 9.4 and higher, and requires Python 3.x.

      I spoke with Jared Dean about the SASPy project. Jared is a Principal Data Scientist at SAS and one of the lead developers on SASPy and a related project called Pipefitter. Here's a video of our conversation, which includes an interactive demo. Jared is obviously pretty excited about the whole thing.

      Use SAS like a Python coder

      SASPy brings a "Python-ic" sensibility to this approach for using SAS. That means that all of your access to SAS data and methods are surfaced using objects and syntax that are familiar to Python users. This includes the ability to exchange data via pandas, the ubiquitous Python data analysis framework. And even the native SAS objects are accessed in a very "pandas-like" way.

      import saspy
      import pandas as pd
      sas = saspy.SASsession(cfgname='winlocal')
      cars = sas.sasdata("CARS","SASHELP")

      The output is what you expect from pandas...but with statistics that SAS users are accustomed to. PROC MEANS anyone?

      In[3]: cars.describe()
             Variable Label    N  NMiss   Median          Mean        StdDev  
      0         MSRP     .   428      0  27635.0  32774.855140  19431.716674   
      1      Invoice     .   428      0  25294.5  30014.700935  17642.117750   
      2   EngineSize     .   428      0      3.0      3.196729      1.108595   
      3    Cylinders     .   426      2      6.0      5.807512      1.558443   
      4   Horsepower     .   428      0    210.0    215.885514     71.836032   
      5     MPG_City     .   428      0     19.0     20.060748      5.238218   
      6  MPG_Highway     .   428      0     26.0     26.843458      5.741201   
      7       Weight     .   428      0   3474.5   3577.953271    758.983215   
      8    Wheelbase     .   428      0    107.0    108.154206      8.311813   
      9       Length     .   428      0    187.0    186.362150     14.357991   
             Min       P25      P50      P75       Max  
      0  10280.0  20329.50  27635.0  39215.0  192465.0  
      1   9875.0  18851.00  25294.5  35732.5  173560.0  
      2      1.3      2.35      3.0      3.9       8.3  
      3      3.0      4.00      6.0      6.0      12.0  
      4     73.0    165.00    210.0    255.0     500.0  
      5     10.0     17.00     19.0     21.5      60.0  
      6     12.0     24.00     26.0     29.0      66.0  
      7   1850.0   3103.00   3474.5   3978.5    7190.0  
      8     89.0    103.00    107.0    112.0     144.0  
      9    143.0    178.00    187.0    194.0     238.0  

      SASPy also provides high-level Python objects for the most popular and powerful SAS procedures. These are organized by SAS product, such as SAS/STAT, SAS/ETS and so on. To explore, issue a dir() command on your SAS session object. In this example, I've created a sasstat object and I used dot<TAB> to list the available SAS analyses:

      SAS/STAT object in SASPy

      The SAS Pipefitter project extends the SASPy project by providing access to advanced analytics and machine learning algorithms. In our video interview, Jared presents a cool example of a decision tree applied to the passenger survival factors on the Titanic. It's powered by PROC HPSPLIT behind the scenes, but Python users don't need to know all of that "inside baseball."

      Installing SASPy and getting started

      Like most things Python, installing the SASPy package is simple. You can use the pip installation manager to fetch the latest version:

      pip install saspy

      However, since you need to connect to a SAS session to get to the SAS goodness, you will need some additional files to broker that connection. Most notably, you need a few Java jar files that SAS provides. You can find these in the SAS Deployment Manager folder for your SAS installation:

      The jar files are compatible between Windows and Unix, so if you find them in a Unix SAS install you can still copy them to your Python Windows client. You'll need to modify the sascfg.py file (installed with the SASPy package) to point to where you've stashed these. If using local SAS on Windows, you also need to make sure that the sspiauth.dll is in your Windows system PATH. The easiest method is to add SASHOME\SASFoundation\9.4\core\sasexe to your system PATH variable.

      All of this is documented in the "Installation and Configuration" section of the project documentation. The connectivity options support an impressively diverse set of SAS configs: Windows, Unix, SAS Grid Computing, and even SAS on the mainframe!

      Download, comment, contribute

      SASPy is an open source project, and all of the Python code is available for your inspection and improvement. The developers at SAS welcome you to give it a try and enter issues when you see something that needs to be improved. And if you're a hotshot Python coder, feel free to fork the project and issue a pull request with your suggested changes!

      The post Introducing SASPy: Use Python code to access SAS appeared first on The SAS Dummy.


      Importing Data into Python

      This tutorial explains various methods to read data into Python. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Rdata (R) etc.
      Import Data into Python
      While importing external files, we need to check the following points -
      1. Check whether header row exists or not
      2. Treatment of special values as missing values
      3. Consistent data type in a variable (column)
      4. Date Type variable in consistent date format.
      5. No truncation of rows while reading external data

      Install and Load pandas Package

      pandas is a powerful data analysis package. It makes data exploration and manipulation easy. It has several functions to read data from various sources.

      If you are using Anaconda, pandas is normally already installed. You need to load the package by using the following command -
      import pandas as pd
      If the pandas package is not installed, you can install it by running the following code in the IPython console. If you are using Spyder, you can submit it in the IPython console within Spyder.
      !pip install pandas
      If you are using Anaconda, you can try the following line of code to install pandas -
      !conda install pandas
      1. Import CSV files

      It is important to note that a single backslash does not work when specifying the file path. You need to either change it to forward slash or add one more backslash like below
      import pandas as pd
      mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
      If no header (title) in raw data file
      mydata1  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)
      You need to include header = None option to tell Python there is no column name (header) in data.

      Add Column Names

      We can include column names by using names= option.
      mydata2  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
      The variable names can also be added separately by using the following command.
      mydata1.columns = ['ID', 'first_name', 'salary']
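      The header and names options can be tried without a file on disk. The sketch below reads made-up CSV text through io.StringIO, which read_csv() accepts in place of a path.

```python
import io
import pandas as pd

# Hypothetical file content with no header row
raw = "1,Dave,5000\n2,Priya,6200\n"

# Without names, the columns are numbered 0, 1, 2
no_names = pd.read_csv(io.StringIO(raw), header=None)

# names= supplies the column names while reading
named = pd.read_csv(io.StringIO(raw), header=None,
                    names=["ID", "first_name", "salary"])
print(named)
```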

      2. Import File from URL

      You don't need to perform additional steps to fetch data from URL. Simply put URL in read_csv() function (applicable only for CSV files stored in URL).
      mydata  = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

      3. Read Text File 

      We can use read_table() function to pull data from text file. We can also use read_csv() with sep= "\t" to read data from tab-separated file.
      mydata = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
      mydata  = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")

      4. Read Excel File

      The read_excel() function can be used to import Excel data into Python.
      mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)
      If you do not specify the sheet in the sheet_name= option (called sheetname= in old pandas releases), the first sheet is read by default.

      5. Read delimited file

      Suppose you need to import a file whose values are separated by white space. The separator is written as a raw string so that \s reaches pandas as a regular expression.
      mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep=r"\s+", header = None)
      To include variable names, use the names= option as below:
      mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep=r"\s+", names=['a', 'b', 'c', 'd'])
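The whitespace separator can be tried without a network connection. In this sketch the numbers are invented and the columns are padded with uneven runs of spaces, which is the case sep=r"\s+" is meant for.

```python
import io
import pandas as pd

# Hypothetical data: columns separated by one or more spaces.
spaced_text = "0.161   1.0   0.257\n0.172  14.2   0.330\n"

spaced = pd.read_csv(io.StringIO(spaced_text), sep=r"\s+",
                     header=None, names=["a", "b", "c"])
print(spaced)
```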
      6. Read SAS File

      We can import a SAS data file by using the read_sas() function.
      mydata4 = pd.read_sas('cars.sas7bdat')
      7. Read SQL Table

      We can extract a table from a SQL database (such as Teradata or SQL Server). The program below uses a local SQLite database file:
      import sqlite3
      import pandas as pd
      conn = sqlite3.connect('C:/Users/Deepanshu/Downloads/flight.db')
      query = "SELECT * FROM flight;"
      results = pd.read_sql(query, con=conn)
      print(results.head())
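To try read_sql() without a database file, an in-memory SQLite database works as a stand-in. This is a sketch only: the flight table and its two columns are invented for the example and are not the contents of flight.db.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database with a hypothetical flight table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight (carrier TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flight VALUES (?, ?)",
                 [("AA", 5), ("DL", 12), ("UA", 0)])
conn.commit()

# Any SELECT statement can be passed; the result comes back as a DataFrame.
delayed = pd.read_sql(
    "SELECT * FROM flight WHERE delay > 0 ORDER BY delay;", con=conn)
conn.close()
print(delayed)
```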

      8. Read sample of rows and columns

      By specifying nrows= and usecols=, you can fetch a specified number of rows and a chosen set of columns.
      mydata7  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))
      nrows = 5 means that only the first 5 rows are imported, and usecols= lists the columns to import. Column positions are counted from zero, so (1,5,7) selects the 2nd, 6th and 8th columns.
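The zero-based counting in usecols= is easy to get wrong, so here is a small sketch on made-up data (the rows below are illustrative, not taken from medals.csv):

```python
import io
import pandas as pd

# Hypothetical four-column file.
csv_text = ("Year,City,Sport,Medal\n"
            "1924,Chamonix,Skating,Gold\n"
            "1924,Chamonix,Skating,Silver\n"
            "1928,St. Moritz,Skiing,Gold\n")

# Keep the first two data rows and the 1st and 4th columns (positions 0 and 3).
sample = pd.read_csv(io.StringIO(csv_text), nrows=2, usecols=[0, 3])
print(sample)
```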

      9. Skip rows while importing

      Suppose you want to skip the first 5 rows and read data from the 6th row (the 6th row then becomes the header row).
      mydata8  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
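A sketch of the same idea on a made-up file that starts with two lines of notes before the real header:

```python
import io
import pandas as pd

# Hypothetical file: two note lines, then the header, then the data.
noted_text = "exported by tool\nversion 2\nID,salary\n1,5000\n2,6200\n"

# Skipping 2 lines makes the 3rd line the header row.
skipped = pd.read_csv(io.StringIO(noted_text), skiprows=2)
print(skipped)
```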
      10. Specify values as missing values

      By including the na_values= option, you can specify which values count as missing. Here we tell pandas to treat a dot (.) as a missing value.
      mydata9  = pd.read_csv("workingfile.csv", na_values=['.'])
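The effect of na_values= shows up in the column type. In this sketch (with invented data), the salary column is read as text when the dot is left alone and as numbers once the dot is declared missing:

```python
import io
import pandas as pd

# Hypothetical file where a dot marks a missing salary.
dotted_text = "ID,salary\n1,5000\n2,.\n3,6200\n"

as_text = pd.read_csv(io.StringIO(dotted_text))
with_na = pd.read_csv(io.StringIO(dotted_text), na_values=["."])

missing_count = int(with_na["salary"].isna().sum())
print(as_text["salary"].tolist(), with_na["salary"].tolist())
```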

      The Monty Hall Paradox - SAS vs. Python

      Recently, one of my sons came to me and asked about something called “The Monty Hall Paradox.” They had discussed it in school and he was having a hard time understanding it (as you often do with paradoxes).

      For those of you who may not be familiar with the Monty Hall Paradox, it is named for the host of a popular TV game show called “Let’s Make a Deal.” On the show, a contestant would be selected and shown a valuable prize.  Monty Hall would then explain that the prize was located just behind one of three doors and ask the contestant to pick a door.  Once a door was selected, Monty would then tease the contestant with cash to get him/her to either abandon the game or switch to another door.  Invariably, the contestant would stand firm and then Monty would proceed to show the contestant what was behind one of the other doors.  Of course, it wouldn’t be any fun if the prize was behind the revealed door, so after showing the contestant an empty door Monty would then ply them with even more cash, in the hopes that they would abandon the game or switch to the remaining door.

      Almost without fail, the contestant would stand firm in their belief that their chosen door was the winner and would not switch to the other door.

      So where’s the paradox?

      When left with two doors, most people assume that they've got a 50/50 chance at winning. However, the truth is that the contestant will double his/her chance of winning by switching to the other door.

      After explaining this to my son, it occurred to me that this would be an excellent exercise for coding in Python and in SAS to see how the two languages compared. Like many of you reading this blog, I’ve been programming in SAS for years so the struggle for me was coding this in Python.

      I kept it simple. I generated my data randomly and then applied simple logic to each row and compared the results.  The only difference between the two is in how the languages approach it.  Once we look at the two approaches then we can look at the answer.

      First, let's look at SAS:

      data choices (drop=max);
      max = 3;
      do i = 1 to 10000;
      	u = rand('uniform'); u2 = rand('uniform');
      	prize = ceil(max*u);
      	choice = ceil(max*u2);
      	output;
      end; run;

      I started by generating two random numbers for each row in my data. The first random number will be used to randomize the prize door and the second will be used to randomize the choice that the contestant makes. The result is a dataset with 10,000 rows each with columns ‘prize’ and ‘choice’ to represent the doors.  They will be random integers between 1 and 3.  Our next task will be to determine which door will be revealed and determine a winner.

      If our prize and choice are two different doors, then we must reveal the third door. If the prize and choice are the same, then we must choose a door to reveal. (Note: I realize that my logic in the reveal portion is somewhat flawed, but given that I am using an IF…ELSE IF and the fact that the choices are random and there isn’t any risk of introducing bias, this way of coding it was much simpler.)

      data results;
      set choices;
      by i;
      if prize in (1,2) and choice in (1,2) then reveal=3;
      else if prize in (1,3) and choice in (1,3) then reveal=2;
      else if prize in (2,3) and choice in (2,3) then reveal=1;

      Once we reveal a door, we must now give the contestant the option to switch. Switch means they always switch, neverswitch means they never switch.

      if reveal in (1,3) and choice in (1,3) then do;
      	switch = 2; neverswitch = choice; end;
      else if reveal in (2,3) and choice in (2,3) then do;
      	switch = 1; neverswitch = choice; end;
      else if reveal in (1,2) and choice in (1,2) then do;
      	switch = 3; neverswitch = choice; end;

      Now we create a column for the winner.  1=win 0=loss.

      	switchwin = (switch=prize);
      	neverswitchwin = (neverswitch=prize);
      run;

      Next, let’s start accumulating our results across all of our observations.  We’ll keep a running tally of how many times the contestant who always switches wins, and the same for the contestant who never switches.

      data cumstats;
      set results;
      format cumswitch cumnever comma8.;
      format pctswitch pctnever percent8.2;
      retain cumswitch cumnever;
      if _N_ = 1 then do;
      	cumswitch = 0; cumnever = 0;
      end;
      else do;
      	cumswitch = cumswitch+switchwin;
      	cumnever = cumnever+neverswitchwin;
      end;
      pctswitch = cumswitch/i;
      pctnever = cumnever/i;
      run;

      proc means data=results n mean std;
      var switchwin neverswitchwin;
      run;
      symbol1 interpol=splines;
      pattern1 value=ms;
      title1 "Cumulative chances of winning on Let's Make a Deal";
      proc gplot data=work.cumstats;
      	/* default axes and legend */
      	plot pctnever * i / frame;
      	plot2 pctswitch * i = 2;
      run; quit;


      The output of PROC MEANS shows that people who always switch (switchwin) have a win percentage of nearly 67%, while the people who never switch (neverswitchwin) have a win percentage of only 33%. The area plot makes the point graphically, showing the win percentage of switchers well above that of the non-switchers.

      Now let’s take a look at how I approached the problem in Python (keeping in mind that this language is new to me).

      Now, let’s look at Python:

      Copied from Jupyter Notebook

      import random
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      from itertools import accumulate
      %matplotlib inline

      First let's create a blank dataframe with 10,000 rows and 10 columns, then fill in the blanks with zeros.

      rawdata = {'index': range(10000)}
      df = pd.DataFrame(rawdata,columns=['index','prize','choice','reveal','switch','neverswitch','switchwin','neverswitchwin','cumswitch','cumnvrswt'])
      df = df.fillna(0)

      Now let's populate our columns. The prize column represents the door that contains the new car! The choice column represents the door that the contestant chose. We will populate them both with a random number between 1 and 3.

      for row in df['index']:
          df.loc[row, 'prize'], df.loc[row, 'choice'] = random.randint(1, 3), random.randint(1, 3)

      Now that Monty Hall has given the contestant their choice of door, he reveals the blank door that they did not choose.

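      # Reveal a door that is neither the prize nor the contestant's choice.
      reveal = []
      for i in range(len(df)):
          if (df['prize'][i] in (1,2) and df['choice'][i] in (1,2)):
              reveal.append(3)
          elif (df['prize'][i] in (1,3) and df['choice'][i] in (1,3)):
              reveal.append(2)
          elif (df['prize'][i] in (2,3) and df['choice'][i] in (2,3)):
              reveal.append(1)
      df['reveal']= reveal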

      Here's the rub. The contestant has chosen a door, Monty has revealed a blank door, and now he's given the contestant the option to switch to the other door. Most of the time the contestant will not switch even though they should. To prove this, we create a column called 'switch' that reflects a contestant that ALWAYS switches their choice. And, a column called 'neverswitch' that represents the opposite.

      switch = []
      for i in range(len(df)):
          if (df['reveal'][i] in (1,3) and df['choice'][i] in (1,3)):
              switch.append(2)
          elif (df['reveal'][i] in (1,2) and df['choice'][i] in (1,2)):
              switch.append(3)
          elif (df['reveal'][i] in (2,3) and df['choice'][i] in (2,3)):
              switch.append(1)
      df['switch'] = switch
      df['neverswitch'] = df['choice']

      Now let's create a flag for when the Always Switch contestant wins and a flag for when the Never Switch contestant wins.

      for i in range(len(df)):
          if (df['switch'][i]==df['prize'][i]):
              df.loc[i, 'switchwin'] = 1
          if (df['neverswitch'][i]==df['prize'][i]):
              df.loc[i, 'neverswitchwin'] = 1

      Now we accumulate the total number of wins for each contestant.
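A minimal sketch of this accumulation step, assuming the win flags are 0/1 values as created above. The short list stands in for the switchwin column; accumulate comes from itertools, which is already imported at the top of the notebook.

```python
from itertools import accumulate

# Stand-in for a column of win flags (1 = win, 0 = loss).
win_flags = [1, 0, 1, 1, 0]

# accumulate() yields the running total after each game.
running_wins = list(accumulate(win_flags))
print(running_wins)
```

Applied to the switchwin and neverswitchwin columns, the same call would fill the cumswitch and cumnvrswt columns defined in the dataframe.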


      …and divide by the number of observations for a win percentage.

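      x = df['index'] + 1   # number of games played so far
      y, y2 = (df['cumswitch'] / x).astype(float), (df['cumnvrswt'] / x).astype(float)   # Always Switch, Never Switch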

      Now we are ready to plot the results. Green represents the win percentage of Always Switch, blue represents the win percentage of Never Switch.

      fig, ax = plt.subplots(1, 1, figsize=(12, 9))
      ax.plot(x,y,lw=3, label='Always', color='green')
      ax.plot(x,y2,lw=3, label='Never',color='blue',alpha=0.5)
      ax.fill_between(x,y2,y, facecolor='green',alpha=0.6)
      ax.fill_between(x,0,y2, facecolor='blue',alpha=0.5)
      ax.set_ylabel("Win Pct",size=14)
      plt.title("Cumulative chances of winning on Let's Make a Deal", size=16)


      Why does it work?

      Most people think that because there are two doors left (the door you chose and the door Monty didn’t show you) that there is a fifty-fifty chance that you’ve got the prize.  But we just proved that it’s not, so “what gives”?

      Remember that the door you chose at first has a 1/3 chance of winning.  That means that the other two doors combined have a 2/3 chance of winning.  Even though Monty showed us what’s behind one of those two doors, the two of them together still have a 2/3 chance of winning.  Since you know one of them is empty, that means the door you didn’t pick MUST have a 2/3 chance of winning.  You should switch.  The green line in the Python graph (or the red line in the SAS graph) shows that after 10,000 contestants had run through the game, the people who always switched won 67% of the time while the people who never switched won only 33% of the time.
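The 1/3 versus 2/3 split can also be checked exactly by listing every case instead of simulating. The sketch below assumes the prize door and the first choice are each equally likely to be any of the three doors. Because Monty always opens an empty door, switching wins exactly when the first choice was wrong.

```python
from fractions import Fraction
from itertools import product

doors = (1, 2, 3)
# Every (prize, choice) pair: 9 equally likely cases.
cases = list(product(doors, doors))

# Staying wins only when the first choice was the prize door.
stay_wins = sum(1 for prize, choice in cases if prize == choice)
# Switching wins in every other case.
switch_wins = sum(1 for prize, choice in cases if prize != choice)

p_stay = Fraction(stay_wins, len(cases))
p_switch = Fraction(switch_wins, len(cases))
print(p_stay, p_switch)
```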

      My comparisons and thoughts between SAS and Python.

      In terms of number of lines of code required, SAS wins hands down.  I only needed 57 lines of code to get the result in SAS, compared to 74 lines in Python. I realize that experience has a lot to do with it, but I think there is an inherent verbosity to the Python code that is not necessarily there in SAS.

      In terms of ease of use, I’m going to give the edge to Python.  I really liked how easy it was to generate a random number between two values.  In SAS, you have to actually perform arithmetic functions to do it, whereas in Python it’s a built-in function. It was exactly the same for accumulating totals of numbers.
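To make the comparison concrete, here is a sketch of both styles in Python: the built-in random.randint, and the scale-and-round-up arithmetic that mirrors the ceil(max*u) expression in the SAS step. The seed and the sample size of 1,000 are arbitrary choices for the illustration.

```python
import math
import random

random.seed(0)

# Built-in: an integer from 1 to 3, both ends included.
built_in = [random.randint(1, 3) for _ in range(1000)]

# Arithmetic: 1 - random.random() lies in (0, 1], so ceil(3 * u) is 1, 2 or 3.
arithmetic = [math.ceil(3 * (1 - random.random())) for _ in range(1000)]

print(sorted(set(built_in)), sorted(set(arithmetic)))
```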

      In terms of iterative ability and working “free style,” I give the edge to SAS.  With Python, it is easy to iterate, but I felt myself having to start all over again having to pre-define columns, packages, etc., in order to complete my analysis.  With SAS, I could just code.  I didn’t have to start over because I created a new column.  I didn’t have to start over because I needed to figure out which package I needed, find it on Github, install it and then import it.

      In terms of tabular output, SAS wins.  Easy to read, easy to generate.

      In terms of graphical output, Python edges SAS out.  Both are verbose and tedious to get working. Python wins because the output is cleaner and there are way more options.

      In terms of speed, SAS wins.  On my laptop, I could change the number of rows from 10,000 to 100,000 without noticing much of a difference in speed (0.25 – 0.5 seconds).  In Python, anything over 10,000 got slow.  10,000 rows was 6 seconds, 100,000 rows was 2 minutes 20 seconds.
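Much of the Python time above likely goes into visiting the dataframe one row at a time. As a hedged sketch rather than a rewrite of the notebook, the same simulation can be expressed with NumPy array operations, which avoids the Python-level loops. The seed and variable names here are illustrative, and the shortcut relies on the fact that switching wins exactly when the first choice missed the prize.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Doors 1..3 for the prize and for the contestant's first choice
# (the upper bound of integers() is exclusive).
prize = rng.integers(1, 4, size=n)
choice = rng.integers(1, 4, size=n)

# Never Switch wins when the first choice was right; Always Switch wins otherwise.
never_win = choice == prize
switch_win = ~never_win

# Running win percentage after each game.
games = np.arange(1, n + 1)
pct_switch = np.cumsum(switch_win) / games
pct_never = np.cumsum(never_win) / games
print(round(float(pct_switch[-1]), 3), round(float(pct_never[-1]), 3))
```

The two arrays pct_switch and pct_never play the role of the y and y2 series in the plot.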

      Of course, this speed has a resource cost.  In those terms, Python wins.  My Anaconda installation is under 2GB of disk space, while my particular deployment of SAS requires 50GB of disk space.

      Finally, in terms of mathematics, they tied.  They both produce the same answer as expected.  Of course, I used extremely common packages that are well used and tested.  Newer or more sophisticated packages are often tested against SAS as the standard for accuracy.

      But in the end, comparing the two as languages is limited.  Python is a much more versatile object-oriented language with capabilities that SAS doesn’t have, while SAS’ mature DATA step can do things to data that Python has difficulty with.  Most important, though, is the release of SAS Viya. Through Viya’s open APIs and micro-services, SAS is transforming itself into something more than just a coding language; it aims to be the analytical platform that all data scientists can use to get their work done.

      tags: Python, SAS Programmers

      The Monty Hall Paradox - SAS vs. Python was published on SAS Users.


      Use Slack bot to monitor the server

      I used to install Datadog or another SaaS tool to monitor my Linux boxes in the cloud. Most of the time they are overkill for my tiny servers with only 1GB or 2GB of memory. What I am most interested in is the running processes and the exact memory usage. And I need a mobile solution so I can monitor on the go.
      Now, with the arrival of Slack bots and the real-time Python client, a simple Python script is enough for this purpose.
      from slackclient import SlackClient
      from subprocess import getoutput
      import logging
      import time

      message_channel = '#my-server-001'
      api_key = 'xoxb-slack-token'
      client = SlackClient(api_key)

      if client.rtm_connect():
          while True:
              last_read = client.rtm_read()
              if last_read:
                  try:
                      parsed = last_read[0]['text']
                      if parsed and 'status' in parsed:
                          # Process tree, a blank line, then memory usage.
                          result = getoutput('pstree')
                          result += '\n\n' + getoutput('free -h')
                          client.rtm_send_message(message_channel, str(result))
                  except Exception as e:
                      # Events without a 'text' field end up here.
                      logging.error(e)
              time.sleep(1)
      Then I use systemd or another tool to daemonize it. Wherever and whenever I am, I enter status in the #my-server-001 channel on my phone and instantly get a result like this:
      | `-{gmain}
      | |-{in:imuxsock}
      | `-{rs:main Q:Reg}
      | `-sshd---sshd

      total used free shared buff/cache available
      Mem: 2.0G 207M 527M 26M 1.2G 1.7G
      Swap: 255M 0B 255M
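Not every real-time event carries a text field (the first event after connecting, for example, is a plain hello), so the message check can be pulled into a small helper that never raises. This is a sketch only: wants_status and build_report are hypothetical names, and unlike the script above the helper looks at every event in the batch rather than just the first.

```python
def wants_status(events):
    """Return True if any event in the batch is a message containing 'status'."""
    for event in events:
        # Events such as {'type': 'hello'} have no 'text' key.
        text = event.get('text') if isinstance(event, dict) else None
        if text and 'status' in text:
            return True
    return False


def build_report(outputs):
    """Join command outputs with a blank line between them."""
    return '\n\n'.join(outputs)


print(wants_status([{'type': 'hello'}]),
      wants_status([{'type': 'message', 'text': 'status'}]))
```

In the loop, the batch returned by rtm_read() would be passed to wants_status(), and the outputs of pstree and free -h to build_report().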

      Install Theano under Anaconda3 Python 3.5

      Update on 2017/01/05:

      With the release of Anaconda3-4.2.0 in September 2016, users are able to install mingw and libpython under conda, which makes  using Theano and keras in Python3.5 much easier.
      Here are the simple steps:
      1. Install Anaconda3-4.2.0. I used their Anaconda3-4.2.0-Windows-x86_64.exe installer, MD5=0ca5ef4dcfe84376aad073bbb3f8db00
      2. In your Anaconda Prompt, execute : >>conda install -c anaconda mingw libpython
      3. Install Theano and keras. I used their github repository
      4. Remove the following statements from __init__.py file under Theano's installation folder:

      if sys.platform == 'win32' and sys.version_info[0:2] == (3, 5):
            raise RuntimeError( "Theano do not support Python 3.5 on Windows. Use Python 2.7 or 3.4.")

      5. Add the following environment variable under system settings:

      THEANO_FLAGS=floatX=float32,device=cpu,blas.ldflags=-LC:\\openblas -lopenblas
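As an alternative to the system settings dialog, the same flags can be set from Python for a single session, provided this happens before theano is imported. The sketch below only builds and sets the variable; the flag values are copied from step 5, and the dictionary name is illustrative.

```python
import os

# Flag values copied from step 5; adjust the OpenBLAS path to your system.
theano_flags = {
    "floatX": "float32",
    "device": "cpu",
    "blas.ldflags": r"-LC:\\openblas -lopenblas",
}

# Theano reads THEANO_FLAGS as comma-separated key=value pairs at import time.
os.environ["THEANO_FLAGS"] = ",".join(
    f"{key}={value}" for key, value in theano_flags.items())

print(os.environ["THEANO_FLAGS"])
# "import theano" would follow here.
```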

      Adding GPU capability follows the general GPU installation rule as well.

      Now you are good to go, enjoying Theano-backed deep learning with Python 3.5.

      As of writing, the deep learning package Theano can't be installed under Python 3.5, which causes some problems for my project. ruslanagit provided a viable solution that works for me under Windows 10.

      For convenience, I copied his solution below:

      Well, the main problem with Python 3.5 on Windows, I guess, is that mingw and libpython are not available (not compiled with Python 3.5), so you cannot run the $ conda install mingw libpython step.

      So you either need to downgrade to Python 3.4 (was not an option for me) and then follow standard instructions for installing Theano on Windows, or make a few tricks to make theano work with Python 3.5. For me the following steps worked on Windows 10 with Anaconda3 and Python 3.5:

      • Install mingw from https://sourceforge.net/projects/mingw-w64/ 
      • Add the bin directory of mingw to PATH, and make sure there is no other gcc compiler in PATH (i.e. TDM-GCC is not there). 
      • In .theanorc file add 
      cxxflags = -shared -I"[TDM-GCC path]\include" -I"[TDM-GCC path]\x86_64-w64-mingw32\include"

      You should update the paths to TDM-GCC according to your system. Note that the TDM include directory is required, since the compilation will fail with the mingw include directories for Python 3.5 (I think they would work for Python 2, but I am not 100% sure).
      • Create libpython35.a manually and copy it to appropriate directory. For example: 
        • Create temp directory 
        • Copy python35.dll (I took it from Anaconda3 folder) into created directory 
        • Navigate into created directory 
        • Run: gendef python35.dll 
        • Run: dlltool --dllname python35.dll --def python35.def --output-lib libpython35.a 
        • Copy libpython35.a into Anaconda3\libs 
      • All other installations/configurations were done as described in the guide for installing Theano on Windows.