19 Mar

Data Science Software Reviews: Forrester vs. Gartner

In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I show Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.

Forrester Research, Inc. is another company that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.

Historically, Forrester has conducted its analyses similarly to Gartner’s. That approach compares software that uses point-and-click interfaces, like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports. Figure 3c shows the results of The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018. By “multimodal” they mean controllable by various means such as menus, workflows, wizards, or code. Figure 3d shows the results from The Forrester Wave: Notebook-Based Solutions, Q3, 2018 (notebooks blend programming code and output in the same window). Those are the two most recent Forrester reports on the topic. Forrester plans to cover tools for automated modeling in a separate report. Given that automation is now a widely adopted feature among the companies shown in Figure 3c, that seems like an odd approach.

Both plots use the x-axis to display the strength of each company’s strategy, while the y-axis measures the strength of its current offering. Blue shading divides the vendors into Leaders, Strong Performers, Contenders, and Challengers. The size of the circle around each data point indicates each vendor’s “presence” in the marketplace, weighted 70% by vendor size and 30% by ISV and service partners.

In Figure 3c, we see a perspective that is radically different from the latest Gartner plot, 3a (see previous post). Here IBM is considered a leader, instead of a middle-of-the-pack Visionary. SAS and RapidMiner are both considered leaders by Gartner and Forrester.

In the Strong Performers segment, Datawatch and Tibco are tied, while Gartner had them far apart, with Datawatch in very last place. KNIME and SAP are also next to each other in this segment, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a similar rating from Gartner.

The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. Fico is here too; Gartner did not evaluate them.

Forrester’s Challengers segment contains World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither was considered by Gartner.


Figure 3c. Forrester Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018

The notebook-based vendors shown in Figure 3d also present a picture extremely different from Gartner’s. Here Domino Data Labs is a Leader, while Gartner placed it at the extreme other end of its plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength in this market is minimal.


Figure 3d. Forrester Wave Notebook-Based Predictive Analytics and Machine Learning Solutions.

In the Strong Performers segment are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this segment; neither were reviewed by Gartner. Cloudera is in this segment as well; it was left out by Gartner.

The Contenders segment contains Google, in a position similar to Gartner’s analysis. Anaconda is here too, placed quite a bit higher than in Gartner’s plot.

The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.

As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.

To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!

26 Feb

Gartner’s 2019 Take on Data Science Software

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2019 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the updated section:

IT Research Firms

IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services and provide their analysis in reports that they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. The reports exclude open source software that has no specific company backing, such as R, Python, or jamovi. Even open source projects that do have company backing, such as BlueSky Statistics, are excluded if they have yet to achieve sufficient market adoption. However, they do cover how company products integrate open source software into their proprietary ones.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal companies that are distributing them. On the date of this post, Datarobot is offering free copies.

Gartner, Inc. is one of the research firms that write such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 17 which offered “cohesive software.” That software performs a wide range of tasks including data importation, preparation, exploration, visualization, modeling, and deployment.

Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3a shows the resulting “Magic Quadrant” plot for 2019, and 3b shows the plot for the previous year. Here I provide some commentary on their choices, briefly summarize their take, and compare this year’s report to last year’s. The main reports from both years contain far more detail than I cover here.


Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms from their 2019 report (plot done in November 2018, report released in 2019).

The Leaders quadrant is the place for companies whose vision is aligned with their customer’s needs and who have the resources to execute that vision. The further toward the upper-right corner of the plot, the better the combined score.

  • RapidMiner and KNIME reside in the best part of the Leaders quadrant this year and last. This year RapidMiner has the edge in ability to execute, while KNIME offers more vision. Both offer free and open source versions, but the companies differ quite a lot in how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner’s free version is limited by a cap on the amount of data that it can analyze (10,000 cases), and as new features are added, they usually come only via a commercial license with “difficult-to-navigate pricing conditions.” These two offer very similar workflow-style user interfaces and can integrate many open source tools into their workflows, including R, Python, Spark, and H2O.
  • Tibco moved from the Challengers quadrant last year to the Leaders this year. This is due to a number of factors, including the successful integration of all the tools they’ve purchased over the years, including Jaspersoft, Spotfire, Alpine Data, Streambase Systems, and Statistica.
  • SAS declined from being solidly in the Leaders quadrant last year to barely being in it this year. This is due to a substantial decline in its ability to execute. Given SAS Institute’s billions in revenue, that certainly can’t be a financial limitation. It may be due to SAS’ more limited ability to integrate as wide a range of tools as other vendors have. The SAS language itself continues to be an important research tool among those doing complex mixed-effects linear models. Those models are among the very few that R often fails to solve.

The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision.

  • Mathworks moved forward substantially in this quadrant due to MATLAB’s ability to handle unconventional data sources such as images, video, and the Internet of Things (IoT). It has also opened up more to open source deep learning projects.
  • H2O.ai is also in the Visionaries quadrant. This is the company behind the open source  H2O software, which is callable from many other packages or languages including R, Python, KNIME, and RapidMiner. While its own menu-based interface is primitive, its integration into KNIME and RapidMiner makes it easy to use for non-coders. H2O’s strength is in modeling but it is lacking in data access and preparation, as well as model management.
  • IBM dropped from the top of the Visionaries quadrant last year to the middle. The company has yet to fully integrate SPSS Statistics and SPSS Modeler into its Watson Studio. IBM has also had trouble getting Watson to deliver on its promises.
  • Databricks improved both its vision and its ability to execute, but not enough to move out of the Visionaries quadrant. It has done well with its integration of open-source tools into its Apache Spark-based system. However, it scored poorly in the predictability of costs.
  • Datarobot is new to the Gartner report this year. As its name indicates, its strength is in the automation of machine learning, which broadens its potential user base. The company’s policy of assigning a data scientist to each new client gets them up and running quickly.
  • Google’s position could be clarified by adding more dimensions to the plot. Its complex collection of a dozen products that work together is clearly aimed at software developers rather than data scientists or casual users. Simply figuring out what they all do and how they work together is a non-trivial task. In addition, the complete set runs only on Google’s cloud platform. Performance on big data is its forte, especially problems involving image or speech analysis/translation.
  • Microsoft offers several products, but only its cloud-only Azure Machine Learning (AML) was comprehensive enough to meet Gartner’s inclusion criteria. Gartner gives it high marks for ease-of-use, scalability, and strong partnerships. However, it is weak in automated modeling and AML’s relation to various other Microsoft components is overwhelming (same problem as Google’s toolset).

Figure 3b. Last year’s Gartner Magic Quadrant for Data Science and Machine Learning Platforms (January, 2018)

Those in the Challengers quadrant have ample resources but less customer confidence in their future plans, or vision.

  • Alteryx dropped slightly in vision from last year, just enough to drop it out of the Leaders quadrant. Its workflow-based user interface is very similar to that of KNIME and RapidMiner, and it too gets top marks in ease-of-use. It also offers very strong data management capabilities, especially those that involve geographic data, spatial modeling, and mapping. It comes with geo-coded datasets, saving its customers from having to buy them elsewhere and figure out how to import them. However, it has fallen behind in cutting-edge modeling methods such as deep learning, auto-modeling, and the Internet of Things.
  • Dataiku strengthened its ability to execute significantly from last year. It added better scalability to its ease-of-use and teamwork collaboration. However, it is also perceived as expensive with a “cumbersome pricing structure.”

Members of the Niche Players quadrant offer tools that are not as broadly applicable. These include Anaconda, Datawatch (includes the former Angoss), Domino, and SAP.

  • Anaconda provides a useful distribution of Python and various data science libraries. They provide support and model management tools. The vast army of Python developers is its strength, but lack of stability in such a rapidly improving world can be frustrating to production-oriented organizations. This is a tool exclusively for experts in both programming and data science.
  • Datawatch offers the tools it acquired recently by purchasing Angoss, and its set of “Knowledge” tools continues to get high marks on ease-of-use and customer support. However, it’s weak in advanced methods and has yet to integrate the data management tools that Datawatch had before buying Angoss.
  • Domino Data Labs offers tools aimed only at expert programmers and data scientists. It gets high marks for openness and ability to integrate open source and proprietary tools, but low marks for data access and prep, integrating models into day-to-day operations, and customer support.
  • SAP’s machine learning tools integrate into its main SAP Enterprise Resource Planning system, but its fragmented toolset is weak, and its customer satisfaction ratings are low.

To see many other ways to rate this type of software, see my ongoing article, The Popularity of Data Science Software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!

23 Dec

Add JavaScript and CSS in Shiny

In this tutorial, I will cover how to include your own JavaScript, CSS and HTML code in your R Shiny app. By including them, you can make a very powerful, professional web app using R.

First, let's understand the basics of a webpage

In general, a web page contains the following kinds of content.
  1. Content (Header, Paragraph, Footer, Listing)
  2. Font style, color, background, border
  3. Images and Videos
  4. Popups, widgets, special effects etc.

HTML, CSS and JavaScript

Together, these three web programming languages take care of all the information a webpage contains, from text to special effects.
  1. HTML determines the content and structure of a page (header, paragraph, footer etc.)
  2. CSS controls how the webpage looks (color, font type, border, etc.)
  3. JavaScript decides advanced behaviors such as pop-up, animation etc.
Fundamentals of a Webpage

One of the most common web development terms you should know is rendering: the act of putting together a web page for presentation.
Shiny Dashboard Syntax

In this article, I will use the shinydashboard library, as it gives the app a more professional and elegant look. The structure of shinydashboard syntax is similar to that of the shiny library: both require ui and server components, but the functions are totally different. Refer to the code below, and make sure to install the library before running the following program.
# Load libraries
library(shiny)
library(shinydashboard)

# User interface
ui <- dashboardPage(
  dashboardHeader(title = "Blank Shiny App"),
  dashboardSidebar(),
  dashboardBody()
)

# Server
server <- function(input, output) { }

# Run app
runApp(list(ui = ui, server = server), launch.browser = TRUE)

Example : Create Animation Effect

The program below generates an animation in the web page. When the user clicks the "Click Me" button, it triggers the demojs() JavaScript function, which starts the animation. It's a very basic animation; you can edit the code and make it as complex as you want.

HTML
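A minimal HTML snippet consistent with the CSS and JS below — a button that calls demojs() plus the two divs they refer to — might look like this (treat it as an illustrative sketch):

<!-- Button that triggers the animation -->
<button onclick="demojs()">Click Me</button>

<!-- Black container holding the blue box that moves -->
<div id="myContainer">
  <div id="sampleanimation"></div>
</div>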

CSS

#sampleanimation {
  width: 50px;
  height: 50px;
  position: absolute;
  background-color: blue;
}

#myContainer {
  width: 400px;
  height: 400px;
  position: relative;
  background: black;
}

JS

function demojs() {
  var elem = document.getElementById('sampleanimation');
  var position = 0;
  var id = setInterval(frame, 10);
  function frame() {
    if (position == 350) {
      clearInterval(id);
    } else {
      position++;
      elem.style.top = position + 'px';
      elem.style.left = position + 'px';
    }
  }
}

There are several ways to include custom JavaScript and CSS code in Shiny. Some common ones are explained in detail below.

Method I : Use tags to insert HTML, CSS and JS Code in Shiny


HTML
tags$body(HTML("Your HTML Code"))
CSS
tags$head(HTML("<style type='text/css'>
Your CSS Code
</style>"))
OR

CSS code can also be defined using tags$style. 
tags$head(tags$style(HTML(" Your CSS Code ")))

JS
tags$head(HTML("<script type='text/javascript'>
Your JS Code
</script>"))

OR

JS code can be described with tags$script.
tags$head(tags$script(HTML(" Your JS Code ")))

Code specified in tags$head will be included and executed under <head> </head>. Similarly, tags$body can be used to make Shiny include code within <body> </body>.

tags$head vs. tags$body

In general, JavaScript and CSS files are defined inside <head> </head>. Things which we want to display under body section of the webpage should be defined within <body> </body>.

Animation Code in Shiny
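Here is a minimal sketch of how the pieces could be wired together with Method I (tags$style and tags$script plus HTML()), using the CSS and JS shown earlier; treat it as an illustrative sketch rather than the exact program:

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Animation Demo"),
  dashboardSidebar(),
  dashboardBody(
    # CSS for the black container and the blue box
    tags$head(tags$style(HTML("
      #sampleanimation {width: 50px; height: 50px; position: absolute; background-color: blue;}
      #myContainer {width: 400px; height: 400px; position: relative; background: black;}
    "))),
    # JavaScript that moves the box diagonally when demojs() is called
    tags$head(tags$script(HTML("
      function demojs() {
        var elem = document.getElementById('sampleanimation');
        var position = 0;
        var id = setInterval(frame, 10);
        function frame() {
          if (position == 350) { clearInterval(id); }
          else {
            position++;
            elem.style.top = position + 'px';
            elem.style.left = position + 'px';
          }
        }
      }
    "))),
    # Button and the container/box that the CSS and JS refer to
    HTML("<button onclick='demojs()'>Click Me</button>
          <div id='myContainer'><div id='sampleanimation'></div></div>")
  )
)

server <- function(input, output) { }

runApp(list(ui = ui, server = server), launch.browser = TRUE)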



Important Note
In JS, CSS and HTML code passed to Shiny's HTML(" ") function, make sure to replace double quotation marks with single quotation marks, as a double quotation mark would be interpreted as closing the string.

Method II : Call JavaScript and CSS files in Shiny

You can use the includeScript( ) and includeCSS( ) functions to pull in JS and CSS code from files saved on your local machine. You can save the files anywhere and pass their location to the functions.
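For example, assuming the CSS and JS shown earlier were saved as animation.css and animate.js in the app's directory (the file names are just placeholders), the app might reference them like this:

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Animation Demo"),
  dashboardSidebar(),
  dashboardBody(
    # Pull in the CSS and JS saved as separate files
    includeCSS("animation.css"),
    includeScript("animate.js"),
    # Button and container referenced by the CSS / JS
    HTML("<button onclick='demojs()'>Click Me</button>
          <div id='myContainer'><div id='sampleanimation'></div></div>")
  )
)

server <- function(input, output) { }

runApp(list(ui = ui, server = server), launch.browser = TRUE)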

How to create JS and CSS files manually
Open Notepad, paste the JS code, and save it with a .js file extension and file type "All Files" (not Text Document). Similarly, you can create a CSS file using the .css file extension.


When to use Method 2?
When you want to include lengthy JS / CSS code, use Method 2. Method 1 should be used only for small snippets, as RStudio does not support syntax coloring and error-checking of JS / CSS code embedded in R, and embedding it makes the code unnecessarily long and difficult to maintain.

Method III : Add JS and CSS files under www directory

Step 1 : 
Create an app using the shinyApp( ) function and save it as app.R. Refer to the code below.
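A minimal sketch of what app.R could contain, assuming the file names used in Step 2 (animate.js and animation.css under www):

# app.R
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Animation Demo"),
  dashboardSidebar(),
  dashboardBody(
    # Files placed under www/ are served from the app's root URL
    tags$head(
      tags$link(rel = "stylesheet", type = "text/css", href = "animation.css"),
      tags$script(src = "animate.js")
    ),
    HTML("<button onclick='demojs()'>Click Me</button>
          <div id='myContainer'><div id='sampleanimation'></div></div>")
  )
)

server <- function(input, output) { }

shinyApp(ui = ui, server = server)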



Step 2 :
Create a folder named www in your app directory (where your app.R file is stored) and save the .js and .css files in that folder. Refer to the folder structure below.
├── app.R
└── www
    ├── animate.js
    └── animation.css

Step 3 :
Run the runApp( ) function, specifying the path of the app directory.
runApp(appDir = "C:/Users/DELL/Documents", launch.browser = T)

Method IV : Using Shinyjs R Package

The shinyjs package allows you to perform the most frequently used JavaScript tasks without knowing any JavaScript programming at all. For example, you can hide, show, or toggle an element, and you can enable or disable inputs.

Example : Turn content on and off by pressing the same button

Make sure to install shinyjs package before loading it. You can install it by using install.packages("shinyjs").

Important Point: Use the useShinyjs( ) function inside dashboardBody( ) to initialize the shinyjs library.
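A minimal sketch of such a program, with hypothetical input and output IDs, might look like this:

library(shiny)
library(shinydashboard)
library(shinyjs)

ui <- dashboardPage(
  dashboardHeader(title = "shinyjs Demo"),
  dashboardSidebar(),
  dashboardBody(
    useShinyjs(),                          # initialize shinyjs
    actionButton("button", "Click Me"),
    textOutput("text")
  )
)

server <- function(input, output) {
  output$text <- renderText("This content is turned on and off.")
  # Each click of the button toggles the visibility of the text
  observeEvent(input$button, {
    toggle("text")
  })
}

runApp(list(ui = ui, server = server), launch.browser = TRUE)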



In the above program, we used the toggle( ) function to turn content on and off.


Example : Enable or disable Numeric Input based on checkbox selection
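A minimal sketch, again with hypothetical IDs, using shinyjs's enable( ) and disable( ) functions:

library(shiny)
library(shinydashboard)
library(shinyjs)

ui <- dashboardPage(
  dashboardHeader(title = "shinyjs Demo"),
  dashboardSidebar(),
  dashboardBody(
    useShinyjs(),
    checkboxInput("enable", "Enable numeric input", value = TRUE),
    numericInput("number", "Pick a number", value = 5)
  )
)

server <- function(input, output) {
  # Enable or disable the numeric input whenever the checkbox changes
  observe({
    if (input$enable) enable("number") else disable("number")
  })
}

runApp(list(ui = ui, server = server), launch.browser = TRUE)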



Communication between R and JavaScript

You can also define and call your own JavaScript functions using the shinyjs package, with the help of the extendShinyjs( ) function inside dashboardBody( ).
  1. Make sure to define the custom JavaScript function with a name beginning with shinyjs followed by a period (shinyjs.function-name)
  2. The JS function should be inside quotes
  3. In the server, you can call the function by writing js$function-name
The program below closes the app when the user clicks the action button.
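A minimal sketch of such a program, using a hypothetical custom function named shinyjs.closewindow, is:

library(shiny)
library(shinydashboard)
library(shinyjs)

# Custom JS function; the name must start with 'shinyjs.'
jscode <- "shinyjs.closewindow = function() { window.close(); }"

ui <- dashboardPage(
  dashboardHeader(title = "Close App Demo"),
  dashboardSidebar(),
  dashboardBody(
    useShinyjs(),
    extendShinyjs(text = jscode, functions = c("closewindow")),
    actionButton("close", "Close App")
  )
)

server <- function(input, output) {
  observeEvent(input$close, {
    js$closewindow()   # ask the browser to close the window (some browsers may block this)
    stopApp()          # stop the Shiny app in the R session
  })
}

runApp(list(ui = ui, server = server), launch.browser = TRUE)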



End Notes

With the huge popularity of JavaScript and its many recent advancements, it is worth learning the basics of JavaScript so that you can use them in your R Shiny apps. According to a recent survey, JavaScript is used by 95% of websites. Its popularity stems from its broad, active developer community and its use by big players like Google, Facebook, and Microsoft.
Do comment on how you use Shiny apps in the comment box below. If you are a beginner and want to learn how to build a web app using Shiny, check out this tutorial.
2 Dec

Install and Load Multiple R Packages

In an enterprise environment, we generally need to automate the process of installing multiple R packages so that users do not have to install them separately before running your program.

The function below performs the following operations -
  1. First, it finds all the R packages that are already installed.
  2. It checks whether each package we want to install is already installed.
  3. If a package is already installed, it does not install it again.
  4. If a package is missing (not installed), it installs it.
  5. Steps 2, 3 and 4 are repeated for every package we want to install.
  6. Finally, it loads all the packages (both the already available and the newly installed ones).

Install_And_Load <- function(packages) {
  # Packages that are requested but not yet installed
  k <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(k)) {
    install.packages(k, repos = 'https://cran.rstudio.com/')
  }
  # Load every requested package
  for (package_name in packages) {
    library(package_name, character.only = TRUE, quietly = TRUE)
  }
}
Install_And_Load(c("fuzzyjoin", "quanteda", "stringdist", "stringr", "stringi"))

Explanation

1. installed.packages() returns details of all the already installed packages. installed.packages()[,"Package"] returns names of these packages.

To see version of the packages, submit the following command
installed.packages()[,c("Package","Version")]
2.  You can use any of the following repositories (URL of a CRAN mirror). You can experiment with these 3 repositories if one of them is blocked in your company due to firewall restriction.
https://cloud.r-project.org
https://cran.rstudio.com
http://www.stats.ox.ac.uk/pub/RWin
3. quietly = TRUE suppresses the message confirming that a package was attached and, in most cases, errors/warnings if attaching (loading) fails.

How to check version of R while installation

In the program below, the RDCOMClient package is installed from the repository http://www.omegahat.net/R if the R version is greater than or equal to 3.5; otherwise it is installed from the repository http://www.stats.ox.ac.uk/pub/RWin.
if (length("RDCOMClient"[!("RDCOMClient" %in% installed.packages()[,"Package"])])) {
  # Note: this compares only the minor version number, so it assumes an R 3.x installation
  if (as.numeric(R.Version()$minor) >= 5)
    install.packages("RDCOMClient", repos = "http://www.omegahat.net/R")
  else
    install.packages("RDCOMClient", repos = "http://www.stats.ox.ac.uk/pub/RWin")
}
library("RDCOMClient")
28 May

Take Screenshot of Webpage using R

Programmatically taking screenshots of a web page is essential in a testing environment to verify how the page renders. The same capability can be used for automation, such as delivering a screenshot of a news website to your inbox every morning or generating a report of candidates’ GitHub activity. This wasn’t possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them, and even when such libraries became available, R programmers had no easy way to integrate that functionality into their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with PhantomJS running in the background.


What is PhantomJS?

PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

PhantomJS is an optimal solution for the following:
  • Headless website testing
  • Screen Capture
  • Page Automation
  • Network Monitoring

Webshot : R Package 

The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).

Install and Load Package

The stable version of webshot is available on CRAN and can be installed using the code below:
install.packages('webshot')
library('webshot')

Also, the latest development version of webshot is hosted on github and can be installed using the below code:
#install.packages('devtools')
devtools::install_github('wch/webshot')

Initial Setup

As we saw above, the R package webshot works with PhantomJS behind the scenes, so it is essential to have PhantomJS installed on the local machine where the webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.
webshot::install_phantomjs()
The above function automatically downloads PhantomJS from its website and installs it. Note that this is a one-time setup; once both webshot and PhantomJS are installed, the two steps above can be skipped when using the package as described in the sections below.

Now, webshot package is installed and setup and is ready to use. To start with let us take a PDF copy of a web page.

Screenshot Function

The webshot package provides one simple function, webshot(), that takes a webpage URL as its first argument and a file name as its second argument, where the screenshot is saved. Note that the file name's extension ('.jpg', '.png', '.pdf') determines the format in which the output file is rendered. Below is the basic structure of the function call:
library(webshot)

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understand the basics of the webshot() function, it is time to begin with our cases, starting with downloading/converting a webpage as a PDF copy.

Case #1: PDF Copy of WebPage

Let us assume, we would like to download Bill Gates' notes on Best Books of 2017 as a PDF copy.

#loading the required library
 library(webshot)

#PDF copy of a web page / article
 webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
 "billgates_book.pdf",
 delay = 2)

The above code generates a PDF whose (partial) screenshot is below:
Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot( ) function was supplied with three arguments:
  1. URL from which the screenshot has to be taken. 
  2. Output Filename along with its file extensions. 
  3. Time to wait before taking screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.
Thus, a webpage can be converted/downloaded as a PDF programmatically in R.

Case #2: Webpage Screenshot (Viewport Size)

Now, I'd like an automation script that takes a screenshot of a news website and sends it to my inbox so I can see the headlines without opening a browser. Here we will see how to get a simple screenshot of livemint.com, an Indian news website.
#Screenshot of Viewport
webshot('https://www.livemint.com/','livemint.png', cliprect = 'viewport')
While the first two arguments are the same as in the earlier call, there's a new third argument, cliprect, which specifies the size of the clipping rectangle.

If cliprect is unspecified, a screenshot of the complete web page is taken (as in the earlier case). Since we are interested only in the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips the screenshot to the viewport of the browser, as shown below.

Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots

So far we have taken simple screenshots of whole pages, dealing with one screenshot and one file, but that is not what usually happens in automation. In most cases we end up performing more than one action, so this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of different URLs (which is quite straightforward), we will take screenshots of different sections of the same web page using different CSS selectors and save them in separate files.
#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
 file = c("organizations.png","contributions.png"),
 selector = list("div.border-top.py-3.clearfix","div.js-contribution-graph"))
In the above code, we take screenshots of two CSS selectors from the GitHub profile page of Hadley Wickham and save them in two PNG files: organizations.png and contributions.png.

Contributions.png

Organizations.png
Thus, we have seen how to use the R package webshot to take screenshots programmatically in R. I hope this post helps fuel your automation needs and helps your organisation improve its efficiency.

28 Mar

Using Excel for Data Entry

This article shows you how to enter data so that you can easily open it in statistics packages such as R, SAS, SPSS, or jamovi (code or GUI steps below). Excel has some statistical analysis capabilities, but they often provide incorrect answers. For a comprehensive list of these limitations, see http://www.forecastingprinciples.com/paperpdf/McCullough.pdf and http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction.

Simple Data Sets

Most data sets are easy to enter using the following rules.

  • All your data should be in a single spreadsheet of a single file (for an exception to this rule, see Relational Data Sets below.)
  • Enter variable names in the first row of the spreadsheet.
  • Consider the length of your variable names. If you know for sure what software you will use, follow its rules for how many characters names can contain. When in doubt, use variable names that are no longer than 8 characters, beginning with a letter. Those short names can be used by any software.
  • Variable names should not contain spaces, but may use the underscore character.
  • No other text rows such as titles should be in the spreadsheet.
  • No blank rows should appear in the data.
  • Always include an ID variable on your original data collection form and in the spreadsheet to help you find the case again if you need to correct errors. You may need to sort the data later, after which the row number in Excel would then apply to a different subject or sampling unit, making it hard to find.
  • Position the ID variable in the left-most column for easy reference. 
  • If you have multiple groups, put them in the same spreadsheet along with a variable that indicates group membership (see Gender example below).
  • Many statistics packages don’t work well with alphabetic characters representing categorical values. For example to enter political party, you might enter 1 instead of Democrat, 2 instead of Republican and 3 instead of Other.
  • Avoid the use of special characters in numeric columns. Currency signs ($, €, etc.) can cause trouble in some programs.
  • If your group has only two levels, coding them 0 and 1 makes some analyses (e.g. linear regression) much easier to do. If the data are logical, use 0 for false, and 1 for true.
    If the data represent gender, it’s common to use 0 for female, 1 for male.
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, some software (like R) will read the column as character data so you will not be able to, for example, calculate the mean of a column without taking action to address the situation.
  • You can enter dates with slashes (8/31/2018) and times with colons (12:15 AM). Note that dates are recorded differently across countries, so make sure you are using a format that matches your locale.
  • For text analysis, you can enter up to 32K of text, or about 8 pages, in a single cell. However, if you cut & paste it from elsewhere, remove carriage returns first, as they will cause it to jump to a new cell.

Relational Data Sets

Some data sets contain observations that are related in some way. They may be people who all live in the same home, or samples that all came from the same site. There may be higher levels of relations, such as students within classrooms, then classrooms within schools. Data that contains such relations (a.k.a. nesting) may be stored in a “relational” database, but those are harder to learn than spreadsheet software. Relational data can easily be entered as two or more spreadsheets and combined later during data analysis. This saves quite a lot of data entry as the higher level data (e.g. family house value, socio-economic status, etc.) only needs to be entered once, instead of on several lines (e.g. for each family member).

If you have such data, make sure that each data set contains a “key” variable that acts as a  common ID number for family, site, school, etc. You can later read two files at a time and combine them matching on that key variable. R calls this combination a join or merge; SAS calls it a merge; and SPSS calls it Add Variables.
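For example, here is a hypothetical sketch in R, assuming one spreadsheet of people and one of families that share a FamilyID key variable (the file and variable names are made up):

# Read the two spreadsheets (requires the readxl package; see the R steps later in this post)
library(readxl)
people   <- read_excel("people.xlsx")
families <- read_excel("families.xlsx")

# Combine them so each person's row gains the family-level variables
combined <- merge(people, families, by = "FamilyID")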

Example of a Good Data Structure

This data set follows all the rules for simple data sets above. Any statistics software can read it easily.

ID   Gender   Income
1    0         32000
2    1         23000
3    0        137000
4    1         54000
5    1         48500

Example of a Bad Data Structure

This is the same data shown above, but it violates the rules for simple data sets in several ways: there is no column for gender, the income values contain dollar signs and commas, variable names appear on more than one line, variable names are not even consistent (income vs. salary), and there is a blank line in the middle. This would not be easy to read!

Data for Female Subjects
ID   Income
1    $32,000
3    $137,000

Data for Male Subjects
ID   Salary
2    $23,000
4    $54,000
5    $48,500

Excel Tips for Data Entry

  • You can make sure your variable names are always visible at the top of your Excel spreadsheet by choosing View> Freeze Panes> Freeze Top Row. This helps you enter data in the proper columns.
  • Avoid using Excel to sort your data. It’s too easy to sort one column independent of the others, which destroys your data! Statistics packages can sort data and they understand the importance of keeping all the values in each row locked together.
  • If you need to enter a pattern of consecutive values such as an ID number with values such as 1,2,3 or 1001,1002,1003, enter the first two, select those cells, then drag the tiny square in the lower right corner as far downward as you wish. Excel will see the pattern of the first two entries and extend it as far as you drag your selection. This works for days of the week and dates too. You can create your own lists in Options>Lists, if you use a certain pattern often.
  • To help prevent typos, you can set minimum and maximum values, or create a list of valid values. Select a column or set of similar columns, then go to the Data tab, then the Data Tools group, and choose Validation. To set minimum and maximum values, choose Allow: Whole Number or Decimals and then fill in the values in the Minimum and Maximum boxes. To create a list of valid values, choose Allow: List and then fill in the numeric or character values separated by commas in the Source box. Note that these rules only operate as you enter data, they will not help you find improper values that you have already entered.
  • The gold standard for data accuracy is the dual entry method. With this method you actually enter all the data twice. Only this method can catch errors that are within the normal range of values, but still wrong. Excel can show you where the values differ. Enter the data first in Sheet1. Then enter it again using the exact same layout in Sheet2. Finally, in Sheet1 select all cells using CTRL-A. Then choose Conditional Formatting> New Rule. Choose “Use a formula to determine which cells to format,” enter this formula:
    =A1<>Sheet2!A1
    then click the Format button, make sure the Fill tab is selected, and choose a color. Then click OK twice. The inconsistencies between the two sheets will then be highlighted in Sheet1. You then check to see which entry was wrong and fix it. When you read the data into a statistics package, you will only need to read the data in Sheet1.
  • When looking for data errors, it can be very helpful to display only a subset of values. To do this, select all the columns you wish to scan for errors, then click the Filter icon on the Data tab. A downward-pointing triangle will appear at the top of each column selected. Clicking it displays a list of the values contained in that column. If you have entered values that are supposed to be, for example, between 1 and 5 and you see 6 on this list, choosing it will show you only those rows in which you made that error. Then you can fix them. You can also click on Number Filters to use simple logic to find, for example, all rows with values greater than 5. When you are finished, click on the filter icon again to turn it off.

Backups

Save your data frequently and make backup copies often. Don’t leave all your backup copies connected to a computer which would leave them vulnerable to attack by viruses. Don’t store them all in the same building or you risk losing all your hard work in a fire or theft. Get a free account at http://drive.google.com, http://dropbox.com, or http://onedrive.live.com and save copies there.

 Steps for Reading Excel Data Into R

There are several ways to read an Excel file into R. Perhaps the easiest method uses the following commands. They read an excel file named mydata.xlsx into an R data frame called mydata. For examples on how to read many other file formats into R, see:
http://r4stats.com/examples/data-import/.

# Do this once to install:
install.packages("readxl")

# Each time you read a file, follow these steps
library("readxl")
mydata <- read_excel("mydata.xlsx")
mydata 

Steps for Reading Excel Data Into SPSS

  1. In SPSS, choose File> Open> Data.
  2. Change the “Files of type” box to “Excel (*.xlsx)”
  3. When the Read Excel File box appears, select the Worksheet name and check the box for Read variable names from the first row of data, then click OK.
  4. When the data appears in the SPSS data editor spreadsheet, Choose File: Save as and leave the Save as type box to SPSS (*.sav).
  5. Enter the name of the file without the .sav extension and then click Save to save the file in SPSS format.
  6. Next time open the .sav version, you won’t need to convert the file again.
  7. If you create variable or value labels in the SPSS file and then need to read your data from Excel again you can copy them into the new file. First, make sure you use the same variable names. Next, after opening the file in SPSS, use Copy Data Properties from the Data menu. Simply name the SPSS file that has properties (such as labels) that you want to copy, check off the things you want to copy and click OK. 

Steps for Reading Excel Data Into SAS

The code below will read an excel file called mydata.xlsx and store it as a permanent SAS dataset called sasuser.mydata. If your organization is considering migrating from SAS to R, I offer some tips here: http://r4stats.com/articles/migrate-to-r/

proc import datafile="mydata.xlsx"
dbms=xlsx out=sasuser.mydata replace;
getnames=yes;
run;

Steps for Reading Excel Data into jamovi

At the moment, jamovi can open CSV, JASP, SAS, SPSS, and Stata files, but not Excel. So you must open the data in Excel and Save As a comma separated value (CSV) file. The ability to read Excel files should be added to a release in the near future. For more information about the free and open source jamovi software, see my review here:
http://r4stats.com/2018/02/13/jamovi-for-r-easy-but-controversial/.

More to Come

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter where I announce my blog posts.

27 Mar

Run Python from R

This article explains how to call or run Python from R. Each tool has its own advantages and disadvantages, and it's always a good idea to combine the best packages and functions from both. In the data science world, both tools have a large market share. R is mainly known for data analysis, statistical modeling and visualization, while Python is popular for deep learning and natural language processing.

In a recent KDnuggets Analytics software survey, Python and R were ranked the top two tools for data science and machine learning. If you really want to boost your career in data science, these are the languages you need to focus on.
Combine Python and R

RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

Install and Load Reticulate Package

Run the command below to get this package installed and imported to your system.
# Install reticulate package
install.packages("reticulate")

# Load reticulate package
library(reticulate)

Check whether Python is available on your system
py_available()
It returns TRUE/FALSE. If it is TRUE, it means python is installed on your system.

Import a python module within R

You can use the function import( ) to import a particular package or module.
os <- import("os")
os$getcwd()
The above program returns the working directory.
[1] "C:\\Users\\DELL\\Documents"

You can use the listdir( ) function from the os module to see all the files in the working directory:
os$listdir()
 [1] ".conda"                       ".gitignore"                   ".httr-oauth"                 
[4] ".matplotlib" ".RData" ".RDataTmp"
[7] ".Rhistory" "1.pdf" "12.pdf"
[10] "122.pdf" "124.pdf" "13.pdf"
[13] "1403.2805.pdf" "2.pdf" "3.pdf"
[16] "AIR.xlsx" "app.r" "Apps"
[19] "articles.csv" "Attrition_Telecom.xlsx" "AUC.R"


Install Python Package

Step 1 : Create a new environment 
conda_create("r-reticulate")
Step 2 : Install a package within a conda environment
conda_install("r-reticulate", "numpy")
Since numpy is already installed, you don't need to install it again. The above example is just for demonstration.

Step 3 : Load the package
numpy <- import("numpy")

Working with numpy array

Let's create a sample numpy array
y <- array(1:4, c(2, 2))
x <- numpy$array(y)
     [,1] [,2]
[1,]    1    3
[2,]    2    4


Transpose the above array
numpy$transpose(x)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Eigenvalues and eigenvectors
numpy$linalg$eig(x)
[[1]]
[1] -0.3722813  5.3722813

[[2]]
           [,1]       [,2]
[1,] -0.9093767 -0.5657675
[2,]  0.4159736 -0.8245648

Mathematical Functions
numpy$sqrt(x)
numpy$exp(x)

Working with Python interactively

You can create an interactive Python console within an R session. Objects you create within Python are available to your R session (and vice versa).

The repl_python() function makes the session interactive. Download the dataset used in the program below.
repl_python()

# Load Pandas package
import pandas as pd

# Importing Dataset
travel = pd.read_excel("AIR.xlsx")

# Number of rows and columns
travel.shape

# Select random no. of rows
travel.sample(n = 10)

# Group By
travel.groupby("Year").AIR.mean()

# Filter
t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

# Return to R
exit
Note : You need to enter exit to return to the R environment.

How to access objects created in python from R

You can use the py object to access objects created within python.
summary(py$t)
In this case, I am using R's summary( ) function and accessing the dataframe t which was created in Python. Similarly, you can create a line plot using the ggplot2 package.
# Line chart using ggplot2
library(ggplot2)
ggplot(py$t, aes(AIR, Year)) + geom_line()

How to access objects created in R from Python

You can use the r object to accomplish this task. 

1. Let's create an object in R
mydata = head(cars, n=15)
2. Use the object created in R within the Python REPL
repl_python()
import pandas as pd
r.mydata.describe()
pd.isnull(r.mydata.speed)
exit

Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in Python. It supports various statistical and machine learning algorithms.
repl_python()

# Load libraries
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression

# load the iris datasets
iris = datasets.load_iris()

# Developing logit model
model = LogisticRegression()
model.fit(iris.data, iris.target)

# Scoring
actual = iris.target
predicted = model.predict(iris.data)

# Performance Metrics
print(metrics.classification_report(actual, predicted))
print(metrics.confusion_matrix(actual, predicted))

Other Useful Functions

To see configuration of python

Run the py_config( ) command to find the version of Python installed on your system. It also shows details about Anaconda and NumPy.
py_config()
python:         C:\Users\DELL\ANACON~1\python.exe
libpython:      C:/Users/DELL/ANACON~1/python36.dll
pythonhome:     C:\Users\DELL\ANACON~1
version:        3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\DELL\ANACON~1\lib\site-packages\numpy
numpy_version:  1.14.2


To check whether a particular package is installed

In the following program, we are checking whether pandas package is installed or not.
py_module_available("pandas")
25 Mar

15 Types of Regression you should know

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only two or three types of regression that are commonly used in the real world: linear and logistic regression. But in fact there are more than ten types of regression algorithms, designed for various types of analysis, and each has its own significance. Every analyst should know which form of regression to use depending on the type of data and its distribution.

Table of Contents
  1. What is Regression Analysis?
  2. Terminologies related to Regression
  3. Types of Regressions
    • Linear Regression
    • Polynomial Regression
    • Logistic Regression
    • Quantile Regression
    • Ridge Regression
    • Lasso Regression
    • ElasticNet Regression
    • Principal Component Regression
    • Partial Least Square Regression
    • Support Vector Regression
    • Ordinal Regression
    • Poisson Regression
    • Negative Binomial Regression
    • Quasi-Poisson Regression
    • Cox Regression
  4. How to choose the correct Regression Model?


What is Regression Analysis?

Let's take a simple example: suppose your manager asks you to predict annual sales. There can be a hundred factors (drivers) that affect sales. In this case, sales is your dependent variable and the factors affecting sales are the independent variables. Regression analysis helps you to solve this problem.
In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables.

It helps us to answer the following questions -
  1. Which of the drivers have a significant impact on sales?
  2. Which is the most important driver of sales?
  3. How do the drivers interact with each other?
  4. What will the annual sales be next year?

Terminologies related to regression analysis

1. Outliers
Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations, i.e. it does not seem to belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. Outliers are a problem because they often distort the results we get.

2. Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables by importance and makes it difficult to select the most important independent variable (factor).

3. Heteroscedasticity
When the dependent variable's variability is not equal across values of an independent variable, it is called heteroscedasticity. For example, as income increases, the variability of food consumption increases: a poorer person will spend a rather constant amount by always eating inexpensive food, while a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

4. Underfitting and Overfitting
When we use unnecessary explanatory variables, it can lead to overfitting. Overfitting means that our algorithm works well on the training set but is unable to perform as well on the test set. It is also known as the problem of high variance.

When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. This is also known as the problem of high bias.

In the following diagram we can see that fitting a linear regression (the straight line in fig. 1) would underfit the data, i.e. it leads to large errors even on the training set. The polynomial fit in fig. 2 is balanced, i.e. it works well on both the training and test sets, while the fit in fig. 3 leads to low errors on the training set but will not work well on the test set.
Regression: Underfitting and Overfitting

Types of Regression

Every regression technique has some assumptions attached to it that we need to meet before running the analysis. These techniques differ in the types of dependent and independent variables they handle and in their distributional assumptions.

1. Linear Regression

It is the simplest form of regression, a technique in which the dependent variable is continuous. The relationship between the dependent variable and the independent variables is assumed to be linear. In the plot below we can observe a roughly linear relationship between the mileage and displacement of cars: the green points are the actual observations, while the fitted black line is the regression line.

Regression Analysis

When you have only one independent variable and one dependent variable, it is called simple linear regression.
When you have more than one independent variable and one dependent variable, it is called multiple linear regression.
The equation of multiple linear regression is listed below:

y = β0 + β1X1 + β2X2 + ... + βkXk + ε

Here 'y' is the dependent variable to be estimated, the Xi are the independent variables, ε is the error term, and the βi are the regression coefficients.

Assumptions of linear regression: 
  1. There must be a linear relation between independent and dependent variables. 
  2. There should not be any outliers present. 
  3. No heteroscedasticity 
  4. Sample observations should be independent. 
  5. Error terms should be normally distributed with mean 0 and constant variance. 
  6. Absence of multicollinearity and auto-correlation.

Estimating the parameters

To estimate the regression coefficients βi we use the principle of least squares, which is to minimize the sum of squares due to the error terms, i.e. choose the βi that minimize

Σ εi^2 = Σ (yi - β0 - β1X1i - ... - βkXki)^2

Solving this minimization problem mathematically (in matrix form), we obtain the regression coefficients as:

β = (X'X)^(-1) X'y
Interpretation of regression coefficients
Let us consider an example where the dependent variable is the marks obtained by a student and the explanatory variables are the number of hours studied and the number of classes attended. Suppose on fitting a linear regression we obtained:
Marks obtained = 5 + 2 (no. of hours studied) + 0.5 (no. of classes attended)
Thus the regression coefficients 2 and 0.5 can be interpreted as:
  1. If the number of hours studied and the number of classes attended are both 0, the student will obtain 5 marks.
  2. Keeping the number of classes attended constant, if the student studies for one more hour, he will score 2 more marks in the examination.
  3. Similarly, keeping the number of hours studied constant, if the student attends one more class, he will attain 0.5 more marks.

Linear Regression in R
We use the swiss data set for carrying out linear regression in R, with the built-in lm() function, and try to estimate Fertility with the help of the other variables.
library(datasets)
model = lm(Fertility ~ .,data = swiss)
lm_coeff = model$coefficients
lm_coeff
summary(model)

The output we get is:

> lm_coeff
     (Intercept)      Agriculture      Examination        Education         Catholic Infant.Mortality 
      66.9151817       -0.1721140       -0.2580082       -0.8709401        0.1041153        1.0770481 
> summary(model)

Call:
lm(formula = Fertility ~ ., data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2743  -5.2617   0.5032   4.1198  15.3213 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
Examination      -0.25801    0.25388  -1.016  0.31546    
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 ** 
Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
Hence we can see that about 70% of the variation in the fertility rate can be explained via linear regression.

2. Polynomial Regression

It is a technique to fit a nonlinear relationship by taking polynomial functions of the independent variable.
In the figure given below, you can see that the red curve fits the data better than the green curve. Hence, in situations where the relationship between the dependent and independent variable seems to be non-linear, we can deploy polynomial regression models.
A polynomial of degree k in one variable is written as:

y = β0 + β1x + β2x^2 + ... + βkx^k + ε

Here we can create new features like X1 = x, X2 = x^2, ..., Xk = x^k and fit a linear regression on them in the same manner.

In the case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e. X3 = X1 * X2.
Disclaimer: Keep in mind that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting.

Polynomial regression in R:
We use the poly.csv data for fitting a polynomial regression in which we try to estimate the price of a house given its area.

First we read the data using read.csv( ) and split it into the dependent and independent variables:
data = read.csv("poly.csv")
x = data$Area
y = data$Price
In order to compare the results of linear and polynomial regression, firstly we fit linear regression:
model1 = lm(y ~x)
model1$fit
model1$coeff

The coefficients and predicted values obtained are:
> model1$fit
       1        2        3        4        5        6        7        8        9       10 
169.0995 178.9081 188.7167 218.1424 223.0467 266.6949 291.7068 296.6111 316.2282 335.8454 
> model1$coeff
 (Intercept)            x 
120.05663769   0.09808581 
We create a new matrix whose columns are x and x squared.

new_x = cbind(x,x^2)

new_x
         x        
 [1,]  500  250000
 [2,]  600  360000
 [3,]  700  490000
 [4,] 1000 1000000
 [5,] 1050 1102500
 [6,] 1495 2235025
 [7,] 1750 3062500
 [8,] 1800 3240000
 [9,] 2000 4000000
[10,] 2200 4840000
Now we fit usual OLS to the new data:
model2 = lm(y~new_x)
model2$fit
model2$coeff

The fitted values and regression coefficients of polynomial regression are:
> model2$fit
       1        2        3        4        5        6        7        8        9       10 
122.5388 153.9997 182.6550 251.7872 260.8543 310.6514 314.1467 312.6928 299.8631 275.8110 
> model2$coeff
  (Intercept)        new_xx         new_x 
-7.684980e+01  4.689175e-01 -1.402805e-04 

Using the ggplot2 package, we create a plot to compare the curves from the linear and polynomial regressions.
library(ggplot2)
ggplot(data = data) +
  geom_point(aes(x = Area, y = Price)) +
  geom_line(aes(x = Area, y = model1$fit), color = "red") +
  geom_line(aes(x = Area, y = model2$fit), color = "blue") +
  theme(panel.background = element_blank())



3. Logistic Regression

In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable.

Here the model is the logistic regression equation:
P(Y = 1) = exp(β0 + β1X) / (1 + exp(β0 + β1X))

Why don't we use linear regression in this case?
  • The homoscedasticity assumption is violated.
  • The errors are not normally distributed.
  • y follows a binomial distribution and hence is not normal.

Examples
  • HR Analytics: IT firms recruit a large number of people, but one problem they encounter is that many candidates do not join after accepting the job offer. This results in cost overruns because the entire process has to be repeated. When you receive an application, can you predict whether that applicant is likely to join the organization (binary outcome: Join / Not Join)?

  • Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

Predicting the category of the dependent variable for a given vector X of independent variables
Through logistic regression we have:
P(Y = 1) = exp(β0 + β1X1 + ... + βnXn) / (1 + exp(β0 + β1X1 + ... + βnXn))

Thus we choose a probability cut-off, say p, and if P(Yi = 1) > p we assign Yi to class 1, otherwise to class 0.

Interpreting the logistic regression coefficients (Concept of Odds Ratio)
If we exponentiate a coefficient, we get the odds ratio for that explanatory variable. An odds ratio of 2 means the odds of the event are 2 times the odds of the non-event. For example, suppose the dependent variable is customer attrition (whether the customer closes their relationship with the company) and the independent variable is citizenship status (National / Expat); an odds ratio of 3 for Expat means the odds of an expat attriting are 3 times the odds of a national attriting.

Logistic Regression in R:
In this case, we are trying to estimate whether a person will have cancer depending on whether or not they smoke.


We fit a logistic regression with the glm( ) function, setting family = "binomial":
model <- glm(Lung.Cancer..Y.~Smoking..X.,data = data, family = "binomial")
The predicted probabilities are given by:
#Predicted probabilities

model$fitted.values
        1         2         3         4         5         6         7         8         9 
0.4545455 0.4545455 0.6428571 0.6428571 0.4545455 0.4545455 0.4545455 0.4545455 0.6428571
10 11 12 13 14 15 16 17 18
0.6428571 0.4545455 0.4545455 0.6428571 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
19 20 21 22 23 24 25
0.6428571 0.4545455 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
Predicting whether the person will have cancer or not, using a cut-off probability of 0.5:
data$prediction <- model$fitted.values>0.5
> data$prediction
[1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
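To tie the output back to the odds-ratio interpretation above and to summarise the 0.5-cut-off predictions, here is a minimal sketch (assuming the same data frame, with columns Lung.Cancer..Y. and Smoking..X., used to fit the model):

#Odds ratio for smoking: exponentiate the fitted coefficients
exp(coef(model))

#Confusion matrix of predicted classes versus the observed outcome
table(Predicted = data$prediction, Actual = data$Lung.Cancer..Y.)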

4. Quantile Regression

Quantile regression is an extension of linear regression that we generally use when outliers, high skewness, or heteroscedasticity are present in the data.

In linear regression, we predict the mean of the dependent variable for given values of the independent variables. Since the mean does not describe the whole distribution, modeling the mean alone is not a full description of the relationship between the dependent and independent variables. Quantile regression instead predicts a chosen quantile (percentile) of the dependent variable for given independent variables.
A "quantile" is the same idea as a "percentile": the 0.25 quantile is the 25th percentile.

Basic idea of quantile regression: we estimate a chosen quantile of the dependent variable given the values of the X's. Note that the dependent variable should be continuous.

The quantile regression model:
For the qth quantile we have the following regression model:
y = β0(q) + β1(q)x1 + ... + βp(q)xp + ε
This looks like the ordinary linear regression model, but the objective function we minimize is different:
sum of q|ei| over observations with ei ≥ 0, plus sum of (1 − q)|ei| over observations with ei < 0,
where ei is the residual for observation i and q is the desired quantile.

If q = 0.5, i.e. if we are interested in the median, this becomes median regression (also called least absolute deviation regression); substituting q = 0.5 above, the objective function reduces to minimizing 0.5 × Σ|ei|, i.e. the sum of absolute residuals.
Interpreting the coefficients in quantile regression:
Suppose the estimated regression equation for the 25th quantile is:
y = 5.2333 + 700.823 x

This means that a one-unit increase in x is associated with an estimated increase of 700.823 units in the 25th quantile of y.
Advantages of Quantile over Linear Regression
  • Quite beneficial when heteroscedasticity is present in the data.
  • Robust to outliers
  • Distribution of dependent variable can be described via various quantiles.
  • It is more useful than linear regression when the data is skewed.

Disclaimer on using quantile regression:
Keep in mind that the coefficients obtained from quantile regression for a particular quantile should differ significantly from those obtained from ordinary linear regression; if they do not, the use of quantile regression is hard to justify. This can be checked by comparing the confidence intervals of the coefficient estimates from both regressions.

Quantile Regression in R
We need to install the quantreg package in order to carry out quantile regression.

install.packages("quantreg")
library(quantreg)

Using the rq( ) function we estimate the 25th quantile of Fertility in the swiss data by setting tau = 0.25.

model1 = rq(Fertility~.,data = swiss,tau = 0.25)
summary(model1)
tau: [1] 0.25

Coefficients:
coefficients lower bd upper bd
(Intercept) 76.63132 2.12518 93.99111
Agriculture -0.18242 -0.44407 0.10603
Examination -0.53411 -0.91580 0.63449
Education -0.82689 -1.25865 -0.50734
Catholic 0.06116 0.00420 0.22848
Infant.Mortality 0.69341 -0.10562 2.36095

Setting tau = 0.5 we run the median regression.
model2 = rq(Fertility~.,data = swiss,tau = 0.5)
summary(model2)

tau: [1] 0.5

Coefficients:
coefficients lower bd upper bd
(Intercept) 63.49087 38.04597 87.66320
Agriculture -0.20222 -0.32091 -0.05780
Examination -0.45678 -1.04305 0.34613
Education -0.79138 -1.25182 -0.06436
Catholic 0.10385 0.01947 0.15534
Infant.Mortality 1.45550 0.87146 2.21101
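The disclaimer above asks whether the quantile coefficients really differ; quantreg provides an anova method for rq objects that tests this formally. A minimal sketch, reusing the two fits above:

#Wald test of whether the slope coefficients differ between tau = 0.25 and tau = 0.5
anova(model1, model2)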

We can also fit quantile regressions for a whole sequence of quantiles at once and compare them in a single plot.
model3 = rq(Fertility~.,data = swiss, tau = seq(0.05,0.95,by = 0.05))
quantplot = summary(model3)
quantplot

We can check whether our quantile regression results differ from the OLS results using plots.

plot(quantplot)
We get the following plot:

The x-axis shows the various quantiles. The solid red central line denotes the OLS coefficient estimate and the dotted red lines its confidence interval. The black dotted line shows the quantile regression estimates and the gray area their confidence band across quantiles. We can see that, for every variable, the two sets of estimates coincide over most of the quantiles, so using quantile regression is not justified at those quantiles. In other words, we would want the red lines and the gray band to overlap as little as possible to justify using quantile regression.

5. Ridge Regression

It's important to understand the concept of regularization before jumping to ridge regression.

1. Regularization

Regularization helps to address overfitting, i.e. a model that performs well on training data but poorly on validation (test) data. It does this by adding a penalty term to the objective function and using that penalty to control the model's complexity.

Regularization is generally useful in the following situations:
  1. A large number of variables
  2. A low ratio of observations to variables
  3. High multicollinearity

2. L1 Loss function or L1 Regularization

In L1 regularization we minimize the objective function after adding a penalty term proportional to the sum of the absolute values of the coefficients (the same absolute-value idea used in the least absolute deviations method). Lasso regression makes use of L1 regularization.

3. L2 Loss function or L2 Regularization

In L2 regularization we minimize the objective function after adding a penalty term proportional to the sum of the squares of the coefficients. Ridge regression, also called shrinkage regression, makes use of L2 regularization.

In general, L2 regularization performs better than L1 and is more efficient computationally. One area where L1 is preferred over L2 is sparse feature spaces, because L1 performs built-in feature selection. For example, suppose you are predicting whether a person has a brain tumor using more than 20,000 genetic markers (features); it is known that the vast majority of genes have little or no effect on the presence or severity of most diseases.

In the linear regression objective function we minimize the sum of squared errors. In ridge regression (also known as shrinkage regression) we add a constraint on the sum of squares of the regression coefficients, so the objective function becomes:
minimize  Σ(yi − ŷi)² + λ Σ βj²
Here λ is the regularization parameter, a non-negative number. We do not assume normality of the error terms.

Very Important Note: 
We do not regularize the intercept term. The constraint is just on the sum of squares of regression coefficients of X's.
We can see that ridge regression makes use of L2 regularization.


Solving the above objective function gives the ridge estimates of β:
β̂ = (X'X + λI)⁻¹ X'y

How can we choose the regularization parameter λ?

If we choose λ = 0 we get back the usual OLS estimates. If λ is very large, the model will underfit, so it is important to choose a suitable value. One approach is to plot the parameter estimates against different values of λ and select the smallest λ beyond which the estimates stabilize; another is cross-validation, which is what the R code below uses.

R code for Ridge Regression

Using the swiss dataset, we create two objects: one containing the dependent variable and one containing the independent variables.
X = swiss[,-1]
y = swiss[,1]

We need to load glmnet library to carry out ridge regression.
library(glmnet)
Using the cv.glmnet( ) function we perform cross-validation. Setting alpha = 0 specifies ridge regression (glmnet's default, alpha = 1, corresponds to the lasso). The lambda argument supplies the sequence of λ values used in the cross-validation.
set.seed(123) #Setting the seed to get similar results.
model = cv.glmnet(as.matrix(X),y,alpha = 0,lambda = 10^seq(4,-1,-0.1))

We take the best lambda via lambda.min and then obtain the regression coefficients with the predict( ) function.
best_lambda = model$lambda.min

ridge_coeff = predict(model,s = best_lambda,type = "coefficients")
ridge_coeff

The coefficients obtained using ridge regression are:
6 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 64.92994664
Agriculture -0.13619967
Examination -0.31024840
Education -0.75679979
Catholic 0.08978917
Infant.Mortality 1.09527837
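Two standard glmnet plots can help judge the choice of λ discussed above; a minimal sketch reusing X, y and the fitted cv.glmnet object:

#Cross-validation curve: mean squared error against log(lambda)
plot(model)

#Coefficient paths: how each ridge coefficient shrinks as lambda grows
ridge_path = glmnet(as.matrix(X), y, alpha = 0, lambda = 10^seq(4,-1,-0.1))
plot(ridge_path, xvar = "lambda", label = TRUE)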

6. Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It uses the L1 regularization technique in the objective function, which for lasso regression becomes:
minimize  Σ(yi − ŷi)² + λ Σ |βj|
λ is the regularization parameter and the intercept term is not regularized.
We do not assume that the error terms are normally distributed.
There is no closed-form formula for the lasso estimates, but they can be computed numerically by statistical software.

Note that lasso regression also needs standardization.

Advantage of lasso over ridge regression

Lasso regression performs built-in variable selection as well as parameter shrinkage, whereas with ridge regression you may end up keeping all the variables, only with shrunken coefficients.

R code for Lasso Regression

Considering the swiss dataset from "datasets" package, we have: 
#Creating dependent and independent variables.
X = swiss[,-1]
y = swiss[,1]
Using cv.glmnet from the glmnet package we perform cross-validation. For lasso regression we set alpha = 1. Since standardize = TRUE by default, we do not need to standardize the variables separately.
#Setting the seed for reproducibility
set.seed(123)
model = cv.glmnet(as.matrix(X),y,alpha = 1,lambda = 10^seq(4,-1,-0.1))
#By default standardize = TRUE

We pick the best value of lambda via lambda.min from the model and then obtain the coefficients with the predict( ) function.
#Taking the best lambda
best_lambda = model$lambda.min
lasso_coeff = predict(model,s = best_lambda,type = "coefficients")
lasso_coeff

The lasso coefficients we got are:
6 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 65.46374579
Agriculture -0.14994107
Examination -0.24310141
Education -0.83632674
Catholic 0.09913931
Infant.Mortality 1.07238898


Which one is better - Ridge regression or Lasso regression?

Both ridge regression and lasso regression are designed to deal with multicollinearity.
Ridge regression is computationally more efficient than lasso regression, but either one can perform better on a given problem. The best approach is therefore to select the regression model that fits the test data well.

7. Elastic Net Regression
Elastic Net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables.

It is a combination of both L1 and L2 regularization.

The objective function for elastic net regression combines the two penalties:
minimize  Σ(yi − ŷi)² + λ1 Σ |βj| + λ2 Σ βj²
Like ridge and lasso regression, it does not assume normality of the error terms.

R code for Elastic Net Regression

Setting alpha to a value strictly between 0 and 1 (here 0.5) carries out elastic net regression.
set.seed(123)
model = cv.glmnet(as.matrix(X),y,alpha = 0.5,lambda = 10^seq(4,-1,-0.1))
#Taking the best lambda
best_lambda = model$lambda.min
en_coeff = predict(model,s = best_lambda,type = "coefficients")
en_coeff
The coefficients we obtained are:
6 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 65.9826227
Agriculture -0.1570948
Examination -0.2581747
Education -0.8400929
Catholic 0.0998702
Infant.Mortality 1.0775714
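Fitted values at the chosen λ can be obtained from the same cv.glmnet object with predict( ); a minimal sketch:

#Predicted Fertility values at the best lambda
en_pred = predict(model, newx = as.matrix(X), s = best_lambda)
head(en_pred)
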
8. Principal Components Regression (PCR) 
PCR is a regression technique that is widely used when you have many independent variables or when multicollinearity exists in your data. It is carried out in two steps:
  1. Obtaining the principal components
  2. Running the regression on those principal components
The most common features of PCR are:
  1. Dimensionality Reduction
  2. Removal of multicollinearity

Getting the Principal components

Principal components analysis is a statistical method for extracting new features when the original features are highly correlated. The new features are constructed from the original features in such a way that they are uncorrelated with one another.

Consider the first principal component U1: it is the linear combination of the original features, U1 = w11X1 + w12X2 + ... + w1pXp, whose weights are chosen (subject to normalization) so that U1 has the maximum possible variance.
Similarly we can find a second principal component U2 that is uncorrelated with U1 and has the second-largest variance.
In the same way, for p features we can construct at most p principal components, all mutually uncorrelated, with the first PC having the largest variance, the second PC the next largest, and so on.
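A quick way to see these properties in R is prcomp( ); a minimal sketch on the swiss predictors used earlier (scale. = TRUE standardizes the variables):

#Principal components of the swiss predictors
pc = prcomp(swiss[,-1], scale. = TRUE)
summary(pc)         #proportion of variance explained, largest first
round(cor(pc$x), 3) #the component scores are uncorrelated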

Drawbacks:

Note that PCR is not a feature selection technique but a feature extraction technique: each principal component is a function of all the original features. As a result, when using principal components one cannot say which original variable affects the dependent variable, or to what extent.

Principal Components Regression in R

We use the longley dataset available in R, which is known for high multicollinearity, and exclude the Year column.
data1 = longley[,colnames(longley) != "Year"]

View(data1)

This is how some of the observations in the dataset look:
We use the pls package to run PCR.
install.packages("pls")
library(pls)

In PCR we are trying to estimate the number of Employed people; scale = TRUE standardizes the variables and validation = "CV" requests cross-validation.
pcr_model <- pcr(Employed~., data = data1, scale = TRUE, validation = "CV")
summary(pcr_model)

We get the summary as:
Data:  X dimension: 16 5 
Y dimension: 16 1
Fit method: svdpc
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
CV 3.627 1.194 1.118 0.5555 0.6514 0.5954
adjCV 3.627 1.186 1.111 0.5489 0.6381 0.5819

TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps
X 72.19 95.70 99.68 99.98 100.00
Employed 90.42 91.89 98.32 98.33 98.74

The RMSEP table reports the cross-validated root mean squared errors of prediction, while 'TRAINING: % variance explained' shows the cumulative percentage of variance explained by the principal components. With 3 PCs, more than 99% of the variance in the predictors and over 98% of the variance in Employed is accounted for.
We can also create a plot of the cross-validated mean squared error against the number of PCs.
validationplot(pcr_model,val.type = "MSEP")
Setting val.type = "R2" instead plots the R-squared for various numbers of PCs.
validationplot(pcr_model,val.type = "R2")
If we want to fit PCR with 3 principal components and obtain the predicted values, we can write:
pred = predict(pcr_model,data1,ncomp = 3)
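One simple way to gauge this 3-component fit is its in-sample root mean squared error; a minimal sketch:

#Root mean squared error of the 3-component fit on the training data
sqrt(mean((pred - data1$Employed)^2))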

9. Partial Least Squares (PLS) Regression 

PLS is an alternative to principal components regression when the independent variables are highly correlated. It is also useful when there are a large number of independent variables.

Difference between PLS and PCR
Both techniques create new independent variables, called components, that are linear combinations of the original predictors. PCR creates components that explain the observed variability in the predictors without considering the response variable at all, whereas PLS takes the dependent variable into account and therefore often yields models that fit the response with fewer components.
PLS Regression in R
library(plsdepot)
data(vehicles)
pls.model = plsreg1(vehicles[, c(1:12,14:16)], vehicles[, 13], comps = 3)
# R-Square
pls.model$R2
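If the plsreg1 object exposes the usual plsdepot components (an assumption worth checking against the package documentation), the cross-validated Q2 values and the predicted responses can be inspected directly; a minimal sketch:

#Cross-validated Q2 per component and predicted values (assumed plsreg1 components)
pls.model$Q2
head(pls.model$y.pred)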


10. Support Vector Regression

Support vector regression can handle both linear and non-linear relationships. For non-linear problems it uses non-linear kernel functions (such as the polynomial or radial kernel) to find a good fit.

The main idea of SVR is to find a function (a hyperplane in the kernel feature space) whose predictions stay within a margin of the observed values while remaining as flat as possible.
library(e1071)
#Y, X and data below are placeholders for your own response, predictor and data frame
svr.model <- svm(Y ~ X, data = data)
pred <- predict(svr.model, data)
plot(data$X, data$Y)                       #scatterplot of the raw data
points(data$X, pred, col = "red", pch = 4) #overlay the SVR predictions
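To make the non-linear case concrete, here is a minimal self-contained sketch on simulated data using a radial kernel with default tuning:

#Simulated non-linear relationship
set.seed(123)
sim = data.frame(X = seq(0, 10, length.out = 200))
sim$Y = sin(sim$X) + rnorm(200, sd = 0.2)

library(e1071)
svr.sim <- svm(Y ~ X, data = sim, kernel = "radial")
plot(sim$X, sim$Y, pch = 16, col = "grey")
lines(sim$X, predict(svr.sim, sim), col = "red", lwd = 2)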

11. Ordinal Regression

Ordinal regression is used to predict ranked values. In simple words, this type of regression is suitable when the dependent variable is ordinal in nature. Examples of ordinal variables: survey responses (on a 1 to 6 scale), patient reaction to a drug dose (none, mild, severe).

Why can't we use linear regression when dealing with an ordinal target variable?

Linear regression assumes that changes in the dependent variable are equivalent throughout its range. For example, the 20 kg difference between a person who weighs 100 kg and one who weighs 120 kg has the same meaning as the 20 kg difference between a person who weighs 150 kg and one who weighs 170 kg. These relationships do not necessarily hold for ordinal variables. In R, ordinal models can be fit with clm( ) from the ordinal package; the wine data used below ships with that package and its rating column is an ordered factor.
library(ordinal)
o.model <- clm(rating ~ ., data = wine)
summary(o.model)

12. Poisson Regression

Poisson regression is used when the dependent variable consists of count data.

Application of Poisson Regression -
  1. Predicting the number of calls in customer care related to a particular product
  2. Estimating the number of emergency service calls during an event
The dependent variable must meet the following conditions
  1. The dependent variable has a Poisson distribution.
  2. Counts cannot be negative.
  3. The method is not suitable for non-whole (non-integer) numbers.

In the code below we use the warpbreaks dataset, which records the number of breaks in yarn during weaving. The model includes terms for wool type, tension, and the interaction between the two.
pos.model<-glm(breaks~wool*tension, data = warpbreaks, family=poisson)
summary(pos.model)
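A rough check for overdispersion compares the residual deviance to its degrees of freedom; values well above 1 suggest the Poisson assumption is strained. A minimal sketch:

#Ratio of residual deviance to residual degrees of freedom
deviance(pos.model) / df.residual(pos.model)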

13. Negative Binomial Regression

Like Poisson regression, it deals with count data. How is it different from Poisson regression? Negative binomial regression does not assume that the variance of the counts equals their mean, whereas Poisson regression does.
When the variance of count data is greater than the mean count, we have overdispersion; when it is smaller, we have underdispersion.
library(MASS)
nb.model <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
summary(nb.model)
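The estimated size parameter theta, which controls how far the variance departs from the mean (variance = mu + mu^2/theta), can be read directly off the fitted object; a minimal sketch:

#Estimated theta and its approximate standard error
nb.model$theta
nb.model$SE.theta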

14. Quasi Poisson Regression

It is an alternative to negative binomial regression and can also be used for overdispersed count data. The two approaches give similar results, but they differ in how the effects of covariates are estimated: the variance of a quasi-Poisson model is a linear function of the mean, while the variance of a negative binomial model is a quadratic function of the mean.
qs.pos.model <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine,  family = "quasipoisson")
Quasi-Poisson regression can handle both over-dispersion and under-dispersion.
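The estimated dispersion parameter itself is reported in the model summary; a minimal sketch:

#Estimated dispersion: above 1 indicates over-dispersion, below 1 under-dispersion
summary(qs.pos.model)$dispersion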


15. Cox Regression

Cox Regression is suitable for time-to-event data. See the examples below -
  1. Time from when a customer opens an account until attrition.
  2. Time after cancer treatment until death.
  3. Time from first heart attack to the second.
Logistic regression uses a binary dependent variable but ignores the timing of events. 
As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare time-to-event for multiple groups.

A survival model has two targets:
1. A continuous variable representing the time to the event.
2. A binary variable representing whether the event occurred or not.
library(survival)
# Lung Cancer Data
# status: 2=death
lung$SurvObj <- with(lung, Surv(time, status == 2))
cox.reg <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data =  lung)
cox.reg
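The fitted Cox model can be turned into an adjusted survival curve with survfit( ); a minimal sketch:

#Survival curve implied by the fitted Cox model (time is in days for the lung data)
plot(survfit(cox.reg), xlab = "Days", ylab = "Survival probability")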

How to choose the correct regression model?
  1. If the dependent variable is continuous and the model suffers from collinearity, or there are many independent variables, you can try PCR, PLS, ridge, lasso, and elastic net regression. You can select the final model based on adjusted R-squared, RMSE, AIC, and BIC.
  2. If you are working with count data, try Poisson, quasi-Poisson, and negative binomial regression.
  3. To avoid overfitting, use cross-validation to evaluate models used for prediction (a minimal cross-validation sketch follows this list). Ridge, lasso, and elastic net regression can also help correct overfitting.
  4. Try support vector regression when the relationship appears non-linear.
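As an illustration of point 3, here is a minimal k-fold cross-validation sketch in base R, applied to the full linear model on the swiss data used throughout this post (5 folds chosen arbitrarily):

#5-fold cross-validated RMSE for the full linear model on swiss
set.seed(123)
k = 5
folds = sample(rep(1:k, length.out = nrow(swiss)))
cv_rmse = sapply(1:k, function(i) {
  fit = lm(Fertility ~ ., data = swiss[folds != i, ])
  pred = predict(fit, newdata = swiss[folds == i, ])
  sqrt(mean((swiss$Fertility[folds == i] - pred)^2))
})
mean(cv_rmse)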
6
Mar

Use R to interface with SAS Cloud Analytics Services

The R SWAT package (SAS Wrapper for Analytics Transfer) enables you to upload big data into an in-memory distributed environment to manage data and create predictive models using familiar R syntax. In the SAS Viya Integration with Open Source Languages: R course, you learn the syntax and methodology required to [...]

The post Use R to interface with SAS Cloud Analytics Services appeared first on SAS Learning Post.

26
Feb

Gartner’s 2018 Take on Data Science Tools

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2018 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the new section:

IT Research Firms

IT research firms study software products and corporate strategies, survey customers about their satisfaction with the products and services, and then provide their analysis of each in reports they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis these reports contain extremely interesting reading. While the reports focus on companies, they often also describe how commercial tools integrate open source tools such as R, Python, H2O, TensorFlow, and others.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.

Gartner, Inc. is one of the companies that provides such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue, or lower revenue combined with high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Hereafter, I refer to these as simply vision and ability. Figure 3a shows the resulting “Magic Quadrant” plot for 2018, and 3b shows the plot for the previous year.

The Leader’s Quadrant is the place for companies who have a future direction in line with their customers’ needs and the resources to execute that vision. The further to the upper-right corner, the better the combined score. KNIME is in the prime position, with H2O.ai showing greater vision but lower ability to execute. This year KNIME gained the ability to run H2O.ai algorithms, so these two may be viewed as complementary tools rather than outright competitors.

Alteryx and SAS have nearly the same combined scores, but note that Gartner studied only SAS Enterprise Miner and SAS Visual Analytics. The latter includes Visual Statistics, and Visual Data Mining and Machine Learning. The SAS System itself was excluded, since Gartner focuses on tools that are integrated. This lack of integration may explain SAS’ decline in vision from last year.

KNIME and RapidMiner are quite similar tools, as both are driven by an easy-to-use and reproducible workflow interface. Both offer free and open source versions, but the companies differ considerably in how committed they are to the open source concept. KNIME’s desktop version is free and open source, and the company says it will always be so. RapidMiner’s free version, on the other hand, is limited by a cap on the amount of data it can analyze (10,000 cases), and new features usually arrive only in the commercial license. In the previous year’s Magic Quadrant RapidMiner was slightly ahead, but now KNIME is in the lead.

Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Figure 3b. Gartner Magic Quadrant for Data Science Platforms 2017.

The companies in the Visionaries Quadrant are those that have good future plans but may not have the resources to execute that vision. Of these, IBM took a big hit by landing here after being in the Leader’s Quadrant for several years. Now they’re in a near-tie with Microsoft and Domino. Domino shot up from the bottom of that quadrant toward the top. They integrate many different open source and commercial tools (e.g. SAS, MATLAB) into their Domino Data Science Platform. Databricks and Dataiku offer cloud-based analytics similar to Domino, though without access to commercial tools.

Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision. Mathworks, the makers of MATLAB, continues to “stay the course” with its proprietary tools while most of the competition offers much better integration into the ever-expanding universe of open source tools. Tibco replaces Quest in this quadrant due to its purchase of Statistica. Whatever will become of the red-headed stepchild of data science? Statistica has been owned by four companies in four years (Statsoft, Dell, Quest, Tibco)! Users of the software have got to be considering other options. Tibco also purchased Alpine Data in 2017, accounting for its disappearance between Figures 3b and 3a.

Members of the Niche Players quadrant offer tools that are not as broadly applicable. Anaconda is new to Gartner coverage this year. It offers in-depth support for Python. SAP has a toolchain that Gartner calls “fragmented and ambiguous.”  Angoss was recently purchased by Datawatch. Gartner points out that after 20 years in business, Angoss has only 300 loyal customers. With competition fierce in the data science arena, one can’t help but wonder how long they’ll be around. Speaking of deathwatches, once the king of Big Data, Teradata has been hammered by competition from open source tools such as Hadoop and Spark. Teradata’s net income was higher in 2008 than it is today.

As of 2/26/2018, RapidMiner is giving away copies of the Gartner report here.
