Data Science Software Used in Journals: Stat Packages Declining (including R), AI/ML Software Growing

In my neverending quest to track The Popularity of Data Science Software, it’s time to update the section on Scholarly Articles. The rapid growth of R could not go on forever and, as you’ll see below, its use actually declined over the last year.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant amounts of effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even as an object of study.

Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I collect data again for the previous years (with one exception noted below).

Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 1,700 articles) in the most recent complete year, 2018. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 3/28/2019.

Figure 2a. The number of scholarly articles found on Google Scholar, for data science software. Only those with more than 1,700 citations are shown.

SPSS is by far the most dominant package, as it has been for over 20 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. It offers extreme power, though with less ease of use. SAS is in third place, with a slight lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied.

Note that the general-purpose languages C, C++, C#, FORTRAN, Java, MATLAB, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

The next group of packages goes from Python through C, with usage declining slowly. The next set starts at Caffe, dropping nearly 50%, and continuing to IBM Watson with a slow decline.

The last two packages in Fig 2a are Weka and Theano, which are quite a drop from IBM Watson, though it’s getting harder to see as the lines shrink.

To continue on this scale would make the remaining packages all appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 1,700 rather than the 80,000 used on Figure 2a.

Figure 2b. Number of scholarly articles using each data science software found using Google Scholar. Only those with fewer than 1,700 citations are shown.

I chose to begin Figure 2b with software that has fewer than 1,700 articles because it allows us to see RapidMiner and KNIME on the same scale. They are both workflow-driven tools with very similar capabilities. This plot shows RapidMiner with 49% greater usage than KNIME. RapidMiner uses more marketing, while KNIME depends more on word-of-mouth recommendations and a more open source model. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, IBM’s SPSS and SAS. Given that SPSS has roughly 50 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newer packages is growing, while the use of the older ones is shrinking quite rapidly.

Figure 2b also lets us see IBM’s SPSS Modeler, SAS Enterprise Miner, and Alteryx on the same plot. These three are also workflow-driven tools which are quite expensive. None are doing as well here as RapidMiner or KNIME, tools that are much less expensive – or free – depending on how you use them (KNIME desktop is free but server is not; RapidMiner is free for analyzing fewer than 10,000 cases).

Another interesting comparison on Figure 2b is JASP and jamovi. Both are open-source tools that focus on statistics rather than machine learning or artificial intelligence. They both use graphical user interfaces (GUIs) in a style that is similar to SPSS. Both also use R behind the scenes to do their calculations. JASP emphasizes Bayesian Analysis and hides its R code; jamovi has a more frequentist orientation, it lets you see its R code, and it lets you execute your own R code directly from within it. JASP currently has nine times as many citations here, though jamovi’s use is growing much more rapidly.

Even newer on the GUI for R scene is BlueSky Statistics, which doesn’t appear on the plot at all since it has zero scholarly articles so far. It was created by a new company and only adopted an open source model a few months ago.

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time-consuming. What I’ve done instead is collect data only for the past two complete years, 2017 and 2018. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side); the declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 1,000 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth but is still of little interest.

Figure 2c. Change in Google Scholar citation rate in the most recent complete two years, 2017 and 2018.

The recent changes in data science software can be summarized succinctly: AI/ML up; statistics down. The software that is growing contains none of the packages that are associated more with statistical analysis. The software in decline is dominated by the classic packages of statistics: SPSS Statistics, SAS, GraphPad Prism, Stata, Statgraphics, R, Statistica, Systat, and Minitab. JMP is the only traditional statistics package whose scholarly usage is growing. Of the machine learning software that’s declining in usage, there are rough equivalents that are growing (e.g. Mahout down, Spark up).

Of course another summary is: cheap (or free) up; expensive down. Of the growing packages, 13 out of 17 are available in open source. Of those in decline, only 5 out of 13 are open source.

Statistics software has been around much longer than AI/ML software, started back in the days before open source. Stat vendors have been adding AI/ML methods to their software, making them the more comprehensive solutions. The AI/ML vendors or projects are missing an opportunity to add more comprehensive statistics capabilities. Some, such as RapidMiner and KNIME, are indeed expanding in this direction, but very slowly indeed.

At the top of Figure 2c, we see that the deep learning packages Keras and TensorFlow are the fastest growing at nearly 150%. PyTorch is not shown here because it did not have enough usage in the previous year. However, its citation rate went from 616 to 4,670, a substantial 658% growth rate! There are other packages that are not shown here, including JASP with 223% growth, and jamovi with 720% growth. Despite such high growth, the latter still only has 108 citations in 2018. The rapid growth of JASP and jamovi lends credence to the perspective that the overall pattern of change shown in Figure 2c may be more of a result of free vs. expensive software. Neither of them offers any AI/ML features.
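The growth figures quoted here are simple year-over-year percent changes. As a minimal sketch (the helper name pct_change is made up for illustration and is not part of any package), the PyTorch figure can be reproduced in R from the two counts given in the text:

```r
# Year-over-year percent change: (new - old) / old * 100
# pct_change is a hypothetical helper name used only for this sketch.
pct_change <- function(old, new) {
  (new - old) / old * 100
}

# PyTorch counts quoted in the text: 616 articles in 2017, 4,670 in 2018
pytorch_growth <- pct_change(616, 4670)
round(pytorch_growth)  # 658
```

The same function applied to each package's 2017 and 2018 counts would give the values plotted in Figure 2c.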

Scikit Learn, the Python machine learning library, was a fast grower with a 60% increase.

I was surprised to see IBM Watson growing a healthy 34% as much of the news about it has not been good. It’s awesome at Jeopardy though!

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot, we see that KNIME is growing slightly (5.7%) while RapidMiner is declining slightly (1.8%).

The biggest losers in Figure 2c are SPSS, down 39%, and SAS, Prism, and Mahout, all down 24%. Even R is down 13%. Recall that Figure 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use, and R and SAS are still the #2 and #3 most widely used packages in this arena.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of Google Scholar citations for each classic statistics package per year from 1995 through 2016.

SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.

Figure 2e. The number of Google Scholar citations for each classic statistics package from 1995 through 2016, this time with SPSS removed and SAS included only in 2014 and 2015. Removing SPSS and most of the SAS data expands the scale, making it easier to see the rapid growth of the less popular packages.

Figure 2e makes it easy to see that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 58 out of over 100 data science tools.

While Figures 2d and 2e show the historical trend that ended in 2016, Figure 2f shows a fresh set of data collected in March, 2019. Since Google’s algorithm changes, preventing the new data from matching exactly with the old, this new data starts at 2015 so the two sets overlap. SPSS is not shown on this graph because its dominance would compress the y-axis, making trends in the others harder to see. However, keep in mind that despite SPSS’ 39% drop from 2017 to 2018, its use is still 66% higher than R’s in 2018! Apparently people are willing to pay for ease of use.

Figure 2f. The number of Google Scholar citations for each classic statistics package per year from 2015 through 2018.

In Figure 2f we can see that the downward trends of SAS, Prism, and Statistica are continuing. We also see that the long and rapid growth of R and Stata has come to an end. Growth that rapid can’t go on forever. It will be interesting to see next year whether this is merely a flattening of usage or the beginning of a declining trend. As I pointed out in my book, R for Stata Users, there are many commonalities between R and Stata. As a result of this, and the fact that R is open source, I expect R use to stabilize at this level while use of Stata continues to slowly decline.

SPSS’ long-term rapid decline has to level out at some point. It has been chipped away at by many competitors. However, until recently these competitors have either been free and code-based such as R, or menu-based and proprietary, such as Prism. With the fairly recent arrival of JASP, jamovi, and BlueSky Statistics, SPSS now faces software that is both free and menu-based. Previous projects to add menus to R, such as the R Commander and Deducer, were also free and open source, but they required installing R separately and then using R code to activate the menus.

These results apply to scholarly articles in general. The results in specific fields or journals are very likely to be different.

To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the job advertisements that list data science software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!


Data Science Software Reviews: Forrester vs. Gartner

In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I show Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.

Forrester Research, Inc. is another company that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.

Historically, Forrester has conducted their analyses similarly to Gartner’s. That approach compares software that uses a point-and-click style, like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports. Figure 3c shows the results of The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018. By “multimodal” they mean controllable by various means such as menus, workflows, wizards, or code. Figure 3d shows the results from The Forrester Wave: Notebook-Based Solutions, Q3, 2018 (notebooks blend programming code and output in the same window). Those are the two most recent Forrester reports on the topic. Forrester plans to cover tools for automated modeling in a separate report. Given that automation is now a widely adopted feature of several of the companies shown in Figure 3c, that seems like an odd approach.

Both plots use the x-axis to display the strength of each company’s strategy, while the y-axis measures the strength of each company’s current offering. Blue shading is used to divide the vendors into Leaders, Strong Performers, Contenders, and Challengers. The size of the circle around each data point indicates the “presence” of each vendor in the marketplace, weighted 70% by vendor size and 30% by ISV and service partners.

In Figure 3c, we see a perspective that is radically different from the latest Gartner plot, 3a (see previous post). Here IBM is considered a leader, instead of a middle-of-the-pack Visionary. SAS and RapidMiner are both considered leaders by Gartner and Forrester.

In the Strong Performers segment, we see KNIME, which Gartner considered a Leader. Datawatch and Tibco are tied in this segment while Gartner had them far apart, with Datawatch put in very last place by Gartner. KNIME and SAP are next to each other in this segment, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a similar rating from Gartner.

The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. Fico is here too; Gartner did not evaluate them.

Forrester’s Challengers segment contains World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither was considered by Gartner.

Figure 3c. Forrester Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018

The ratings of the notebook-based vendors shown in Figure 3d are also extremely different from Gartner’s. Here Domino Data Labs is a leader while Gartner had them at the extreme other end of their plot, in the Niche Players quadrant. Oracle is also shown as a leader, though its strength in this market is minimal.

Figure 3d. Forrester Wave Notebook-Based Predictive Analytics and Machine Learning Solutions.

In the Strong Performers segment are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this segment; neither was reviewed by Gartner. Cloudera is in this segment as well; it was left out by Gartner.

The Contenders segment contains Google, in a similar position compared to Gartner’s analysis. Anaconda is here too, in a position quite a bit higher than in Gartner’s plot.

The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.

As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.

To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!


Gartner’s 2019 Take on Data Science Software

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2019 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the updated section:

IT Research Firms

IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services and provide their analysis in reports that they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. The reports exclude open source software that has no specific company backing, such as R, Python, or jamovi. Even open source projects that do have company backing, such as BlueSky Statistics, are excluded if they have yet to achieve sufficient market adoption. However, they do cover how company products integrate open source software into their proprietary ones.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal companies that are distributing them. On the date of this post, Datarobot is offering free copies.

Gartner, Inc. is one of the research firms that write such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 17 which offered “cohesive software.” That software performs a wide range of tasks including data importation, preparation, exploration, visualization, modeling, and deployment.

Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3a shows the resulting “Magic Quadrant” plot for 2019, and 3b shows the plot for the previous year. Here I provide some commentary on their choices, briefly summarize their take, and compare this year’s report to last year’s. The main reports from both years contain far more detail than I cover here.


Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms from their 2019 report (plot done in November 2018, report released in 2019).

The Leaders quadrant is the place for companies whose vision is aligned with their customers’ needs and who have the resources to execute that vision. The further toward the upper-right corner of the plot, the better the combined score.

  • RapidMiner and KNIME reside in the best part of the Leaders quadrant this year and last. This year RapidMiner has the edge in ability to execute, while KNIME offers more vision. Both offer free and open source versions, but the companies differ quite a lot on how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner is limited by a cap on the amount of data that it can analyze (10,000 cases) and as they add new features, they usually come only via a commercial license with “difficult-to-navigate pricing conditions.” These two offer very similar workflow-style user interfaces and have the ability to integrate many open source tools into their workflows, including R, Python, Spark, and H2O.
  • Tibco moved from the Challengers quadrant last year to the Leaders this year. This is due to a number of factors, including the successful integration of all the tools they’ve purchased over the years, including Jaspersoft, Spotfire, Alpine Data, Streambase Systems, and Statistica.
  • SAS declined from being solidly in the Leaders quadrant last year to barely being in it this year. This is due to a substantial decline in its ability to execute. Given SAS Institute’s billions in revenue, that certainly can’t be a financial limitation. It may be due to SAS’ more limited ability to integrate as wide a range of tools as other vendors have. The SAS language itself continues to be an important research tool among those doing complex mixed-effects linear models. Those models are among the very few that R often fails to solve.

The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision.

  • Mathworks moved forward substantially in this quadrant due to MATLAB’s ability to handle unconventional data sources such as images, video, and the Internet of Things (IoT). It has also opened up more to open source deep learning projects.
  • H2O.ai is also in the Visionaries quadrant. This is the company behind the open source H2O software, which is callable from many other packages or languages including R, Python, KNIME, and RapidMiner. While its own menu-based interface is primitive, its integration into KNIME and RapidMiner makes it easy to use for non-coders. H2O’s strength is in modeling but it is lacking in data access and preparation, as well as model management.
  • IBM dropped from the top of the Visionaries quadrant last year to the middle. The company has yet to fully integrate SPSS Statistics and SPSS Modeler into its Watson Studio. IBM has also had trouble getting Watson to deliver on its promises.
  • Databricks improved both its vision and its ability to execute, but not enough to move out of the Visionaries quadrant. It has done well with its integration of open-source tools into its Apache Spark-based system. However, it scored poorly in the predictability of costs.
  • Datarobot is new to the Gartner report this year. As its name indicates, its strength is in the automation of machine learning, which broadens its potential user base. The company’s policy of assigning a data scientist to each new client gets them up and running quickly.
  • Google’s position could be clarified by adding more dimensions to the plot. Its complex collection of a dozen products that work together is clearly aimed at software developers rather than data scientists or casual users. Simply figuring out what they all do and how they work together is a non-trivial task. In addition, the complete set runs only on Google’s cloud platform. Performance on big data is its forte, especially problems involving image or speech analysis/translation.
  • Microsoft offers several products, but only its cloud-only Azure Machine Learning (AML) was comprehensive enough to meet Gartner’s inclusion criteria. Gartner gives it high marks for ease-of-use, scalability, and strong partnerships. However, it is weak in automated modeling and AML’s relation to various other Microsoft components is overwhelming (same problem as Google’s toolset).

Figure 3b. Last year’s Gartner Magic Quadrant for Data Science and Machine Learning Platforms (January, 2018)

Those in the Challengers quadrant have ample resources but less customer confidence in their future plans, or vision.

  • Alteryx dropped slightly in vision from last year, just enough to drop it out of the Leaders quadrant. Its workflow-based user interface is very similar to that of KNIME and RapidMiner, and it too gets top marks in ease-of-use. It also offers very strong data management capabilities, especially those that involve geographic data, spatial modeling, and mapping. It comes with geo-coded datasets, saving its customers from having to buy them elsewhere and figure out how to import them. However, it has fallen behind in cutting-edge modeling methods such as deep learning, auto-modeling, and the Internet of Things.
  • Dataiku strengthened its ability to execute significantly from last year. It added better scalability to its ease-of-use and teamwork collaboration. However, it is also perceived as expensive with a “cumbersome pricing structure.”

Members of the Niche Players quadrant offer tools that are not as broadly applicable. These include Anaconda, Datawatch (includes the former Angoss), Domino, and SAP.

  • Anaconda provides a useful distribution of Python and various data science libraries. They provide support and model management tools. The vast army of Python developers is its strength, but lack of stability in such a rapidly improving world can be frustrating to production-oriented organizations. This is a tool exclusively for experts in both programming and data science.
  • Datawatch offers the tools it acquired recently by purchasing Angoss, and its set of “Knowledge” tools continues to get high marks on ease-of-use and customer support. However, it’s weak in advanced methods and has yet to integrate the data management tools that Datawatch had before buying Angoss.
  • Domino Data Labs offers tools aimed only at expert programmers and data scientists. It gets high marks for openness and ability to integrate open source and proprietary tools, but low marks for data access and prep, integrating models into day-to-day operations, and customer support.
  • SAP’s machine learning tools integrate into its main SAP Enterprise Resource Planning system, but its fragmented toolset is weak, and its customer satisfaction ratings are low.

To see many other ways to rate this type of software, see my ongoing article, The Popularity of Data Science Software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!


Add JavaScript and CSS in Shiny

In this tutorial, I will cover how to include your own JavaScript, CSS and HTML code in your R shiny app. By including them, you can make a very powerful professional web app using R.

First let's understand the basics of a Webpage

In general, a web page contains the following kinds of content.
  1. Content (Header, Paragraph, Footer, Listing)
  2. Font style, color, background, border
  3. Images and Videos
  4. Popups, widgets, special effects etc.

HTML, CSS and JavaScript

These 3 web programming languages together take care of everything a web page contains (from text to special effects).
  1. HTML determines the content and structure of a page (header, paragraph, footer etc.)
  2. CSS controls how the web page looks (color, font type, border etc.)
  3. JavaScript handles advanced behaviors such as pop-ups, animation etc.

One of the most common web development terms you should know is rendering. It is the act of putting together a web page for presentation.

Shiny Dashboard Syntax

In this article, I will use the shinydashboard library as it gives a more professional and elegant look to the app. The structure of shinydashboard syntax is similar to that of the shiny library: both require ui and server components. However, the functions are totally different. Refer to the code below. Make sure to install the library before running the following program.
# Load Library
library(shiny)
library(shinydashboard)

# User Interface: a dashboard page needs a header, a sidebar and a body
ui =
  dashboardPage(
    dashboardHeader(title = "Blank Shiny App"),
    dashboardSidebar(),
    dashboardBody()
  )

# Server
server = function(input, output) { }

# Run App
runApp(list(ui = ui, server = server), launch.browser = TRUE)

Example : Create Animation Effect

The program below generates an animation in the web page. To test it, you can check out this link. When the user hits the "Click Me" button, it triggers the demojs() JavaScript function, which starts the animation. It's a very basic animation. You can edit the code and make it as complex as you want.
/* CSS: the blue square that moves */
#sampleanimation {
  width: 50px;
  height: 50px;
  position: absolute;
  background-color: blue;
}

/* CSS: the black box the square moves inside */
#myContainer {
  width: 400px;
  height: 400px;
  position: relative;
  background: black;
}

/* JavaScript: move the square one pixel every 10 ms until it reaches 350px */
function demojs() {
  var elem = document.getElementById('sampleanimation');
  var position = 0;
  var id = setInterval(frame, 10);
  function frame() {
    if (position == 350) {
      clearInterval(id);
    } else {
      position++;
      elem.style.top = position + 'px';
      elem.style.left = position + 'px';
    }
  }
}

There are several ways to include custom JavaScript and CSS codes in Shiny. Some of the common ones are listed below with detailed explanation -

Method I : Use tags to insert HTML, CSS and JS Code in Shiny

tags$body(HTML("Your HTML Code"))
tags$head(HTML("<style type='text/css'>
Your CSS Code
</style>"))

CSS code can also be defined using tags$style. 
tags$head(tags$style(HTML(" Your CSS Code ")))

tags$head(HTML("<script type='text/javascript'>
Your JS Code
</script>"))


JS code can be described with tags$script.
tags$head(tags$script(HTML(" Your JS Code ")))

Code placed in tags$head is included and executed within <head> </head>. Similarly, tags$body can be used to make Shiny run code within <body> </body>.

tags$head vs. tags$body

In general, JavaScript and CSS files are defined inside <head> </head>. Things which we want to display under body section of the webpage should be defined within <body> </body>.

Animation Code in Shiny
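
The code for this section is not shown here, so what follows is only a minimal sketch of how the animation example might be embedded with Method I. It assumes the shinydashboard skeleton used earlier; the dashboard title, the variable names css_code and js_code, and the plain HTML button are illustrative choices rather than the original program.

```r
library(shiny)
library(shinydashboard)

# CSS for the moving square and its container
css_code <- "
#sampleanimation {
  width: 50px; height: 50px;
  position: absolute; background-color: blue;
}
#myContainer {
  width: 400px; height: 400px;
  position: relative; background: black;
}"

# JavaScript for the animation; only single quotes are used inside the R string
js_code <- "
function demojs() {
  var elem = document.getElementById('sampleanimation');
  var position = 0;
  var id = setInterval(frame, 10);
  function frame() {
    if (position == 350) {
      clearInterval(id);
    } else {
      position++;
      elem.style.top = position + 'px';
      elem.style.left = position + 'px';
    }
  }
}"

ui <- dashboardPage(
  dashboardHeader(title = "Basic Use of JS and CSS"),
  dashboardSidebar(),
  dashboardBody(
    tags$head(tags$style(HTML(css_code))),    # CSS goes into <head>
    tags$head(tags$script(HTML(js_code))),    # JS goes into <head>
    tags$button("Click Me", onclick = "demojs()"),
    tags$div(id = "myContainer", tags$div(id = "sampleanimation"))
  )
)

server <- function(input, output) { }

# Launch only in an interactive session
if (interactive()) runApp(list(ui = ui, server = server), launch.browser = TRUE)
```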

Important Note
In JS, CSS and HTML code placed inside Shiny's HTML(" ") function, make sure to replace double quotation marks with single quotation marks. Otherwise R treats the first double quotation mark as the end of the string.

Method II : Call JavaScript and CSS files in Shiny

You can use the includeScript( ) and includeCSS( ) functions to refer to JS and CSS code in files saved in your local directory. You can save the files anywhere and give their file locations to the functions.
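
As a rough sketch of Method II, the example below writes two small files to a temporary folder and then includes them. The file contents, the temporary paths and the dashboard title are assumptions made so the example is self-contained; in practice the files would already exist at a location of your choice.

```r
library(shiny)
library(shinydashboard)

# Hypothetical file locations: written to a temp folder here for illustration
css_file <- file.path(tempdir(), "animation.css")
js_file  <- file.path(tempdir(), "animate.js")
writeLines("#myContainer { width: 400px; height: 400px; background: black; }", css_file)
writeLines("function demojs() { alert('Button clicked'); }", js_file)

ui <- dashboardPage(
  dashboardHeader(title = "Method II"),
  dashboardSidebar(),
  dashboardBody(
    includeCSS(css_file),      # reads the file and inlines it in a <style> tag
    includeScript(js_file),    # reads the file and inlines it in a <script> tag
    tags$button("Click Me", onclick = "demojs()"),
    tags$div(id = "myContainer")
  )
)

server <- function(input, output) { }

# Launch only in an interactive session
if (interactive()) runApp(list(ui = ui, server = server), launch.browser = TRUE)
```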

How to create JS and CSS files manually
Open notepad and paste JS code and save it with .js file extension and file type "All files" (not text document). Similarly you can create css file using .css file extension.

When to use Method 2?
When you want to include a long piece of JS / CSS code, use Method 2. Method 1 should be used for small code snippets, as RStudio does not support coloring and error-checking of JS / CSS code written this way. Also, it makes the R code unnecessarily long, which makes it difficult to maintain.

Method III : Add JS and CSS files under www directory

Step 1 : 
Create an app using shinyApp( ) function and save it as app.R. Refer the code below.
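
A minimal sketch of what such an app.R might look like is shown here. It assumes the two files named in the folder structure of Step 2 (animate.js and animation.css) and that animate.js defines the demojs() function from the animation example; the dashboard title is made up.

```r
# app.R
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Method III"),
  dashboardSidebar(),
  dashboardBody(
    # Files saved in www/ are served from the app's root URL,
    # so they are referenced without the "www/" prefix
    tags$head(
      tags$link(rel = "stylesheet", type = "text/css", href = "animation.css"),
      tags$script(src = "animate.js")
    ),
    tags$button("Click Me", onclick = "demojs()"),
    tags$div(id = "myContainer", tags$div(id = "sampleanimation"))
  )
)

server <- function(input, output) { }

# Create the app object; it is launched in Step 3 with runApp()
app <- shinyApp(ui = ui, server = server)
```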

Step 2 :
Create a folder named www in your app directory (where your app.R file is stored) and save the .js and .css files in that folder. Refer to the folder structure below.
├── app.R
└── www
    ├── animate.js
    └── animation.css

Step 3 :
Submit runApp( ) function. Specify path of app directory.
runApp(appDir = "C:/Users/DELL/Documents", launch.browser = T)

Method IV : Using Shinyjs R Package

The shinyjs package allows you to perform most frequently used JavaScript tasks without knowing JavaScript programming at all. For example, you can hide, show or toggle element. You can also enable or disable input.

Example : Turn content on and off by pressing the same button

Make sure to install shinyjs package before loading it. You can install it by using install.packages("shinyjs").

Important Point : Use function useShinyjs( ) under dashboardBody( ) to initialize shinyjs library
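A minimal sketch of the toggle example, assuming the shiny and shinyjs packages are installed. To keep it short it uses fluidPage( ) rather than a dashboard layout; in a shinydashboard app, useShinyjs( ) goes under dashboardBody( ) as noted above. The ids btn and content are illustrative:

```r
library(shiny)
library(shinyjs)

ui <- fluidPage(
  useShinyjs(),                        # initialise shinyjs for this page
  actionButton("btn", "Show / hide"),
  div(id = "content", "This text is toggled by the button.")
)

server <- function(input, output, session) {
  # Each click flips the visibility of the element with id "content"
  observeEvent(input$btn, {
    toggle("content")
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```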

In the above program, we have used toggle( ) function to turn content on and off.

Example : Enable or disable Numeric Input based on checkbox selection

Communication between R and JavaScript

You can also define and call your own JavaScript function using the shinyjs package, with the extendShinyjs( ) function inside dashboardBody( ).
  1. The name of the custom JavaScript function must begin with the prefix shinyjs.
  2. The JS function should be passed as a quoted string.
  3. In server, you can call the function by writing js$function-name
The program below closes the app when the user clicks on the action button.
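A minimal sketch of such a program, assuming the shiny and shinyjs packages are installed. The function name closeWindow and the button id close are illustrative, and fluidPage( ) is used instead of a dashboard layout to keep it short. Recent versions of shinyjs expect the functions argument listing the custom function names:

```r
library(shiny)
library(shinyjs)

# Custom JavaScript function; the name must start with "shinyjs."
jscode <- "shinyjs.closeWindow = function() { window.close(); }"

ui <- fluidPage(
  useShinyjs(),
  extendShinyjs(text = jscode, functions = c("closeWindow")),
  actionButton("close", "Close app")
)

server <- function(input, output, session) {
  observeEvent(input$close, {
    js$closeWindow()   # call the JavaScript function from R
    stopApp()          # then stop the R process serving the app
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```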

End Notes

With the huge popularity of JavaScript and many recent advancements, it is recommended to learn the basics of JavaScript so that you can use it in R Shiny apps. According to a recent survey, JavaScript is used by 95% of websites. Its popularity comes from a broad, active developer community and from its use by big players like Google, Facebook, Microsoft, etc.
Do comment on how you use Shiny apps in the comment box below. If you are a beginner and want to learn to build web apps using shiny, check out this tutorial

Install and Load Multiple R Packages

In an enterprise environment, we generally need to automate the process of installing multiple R packages so that users do not have to install them separately before running your program.

The function below performs the following operations -
  1. Find all the R packages that are already installed.
  2. Check whether the packages we want are already installed.
  3. If a package is already installed, do not install it again.
  4. If a package is missing (not installed), install it.
  5. Repeat steps 2, 3 and 4 for each package we want to install.
  6. Load all the packages (both already available and new ones).

Install_And_Load <- function(packages) {
  # Packages that are not installed yet
  k <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(k)) {
    install.packages(k, repos = "https://cran.rstudio.com/")
  }

  # Load every requested package
  for (package_name in packages) {
    library(package_name, character.only = TRUE, quietly = TRUE)
  }
}
Install_And_Load(c("fuzzyjoin", "quanteda", "stringdist", "stringr", "stringi"))


1. installed.packages() returns details of all the already installed packages. installed.packages()[,"Package"] returns names of these packages.

To see version of the packages, submit the following command
2.  You can use any of the following repositories (URL of a CRAN mirror). You can experiment with these 3 repositories if one of them is blocked in your company due to firewall restriction.
3. quietly = TRUE tells R not to print errors/warnings if package attaching (loading) fails.
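As a sketch of how package versions can be inspected in base R: installed.packages( ) also has a "Version" column, and packageVersion( ) returns the version of a single package. The stats package is used here only because it ships with every R installation:

```r
# Name and version of every installed package, as a data frame
pkgs <- as.data.frame(installed.packages()[, c("Package", "Version")],
                      stringsAsFactors = FALSE)
head(pkgs)

# Version of one package (stats ships with R, so this call always works)
stats_version <- packageVersion("stats")
print(stats_version)

# Versions can be compared directly against a string
stats_version >= "3.0.0"
```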

How to check version of R while installation

In the program below, the package RDCOMClient is installed from the repository http://www.omegahat.net/R if the R version is greater than or equal to 3.5; otherwise it is installed from http://www.stats.ox.ac.uk/pub/RWin. getRversion( ) compares the full version number, whereas checking only R.Version()$minor would ignore the major version.
if (!("RDCOMClient" %in% installed.packages()[,"Package"])) {
  if (getRversion() >= "3.5.0") {
    install.packages("RDCOMClient", repos = "http://www.omegahat.net/R")
  } else {
    install.packages("RDCOMClient", repos = "http://www.stats.ox.ac.uk/pub/RWin")
  }
}

Take Screenshot of Webpage using R

Programmatically taking screenshots of a web page is essential in a testing environment to check how the page renders. The same capability can be used for automation, such as getting a screenshot of a news website in your inbox every morning or generating a report of candidates’ GitHub activities. This wasn’t possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them. Even when such JavaScript libraries were made available, R programmers did not have any option to integrate such functionality in their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically with the help of PhantomJS running in the backend.
Take Screenshot from R

What is PhantomJS?

PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

PhantomJS is an optimal solution for the following:
  • Headless website testing
  • Screen Capture
  • Page Automation
  • Network Monitoring

Webshot : R Package 

The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).

Install and Load Package

The stable version of webshot is available on CRAN and can be installed using the code below:
install.packages("webshot")

The latest development version of webshot is hosted on GitHub and can be installed using the code below:
#install.packages("devtools")
devtools::install_github("wch/webshot")

Initial Setup

As we saw above, the R package webshot works with PhantomJS in the backend, hence it is essential to have PhantomJS installed on the local machine where the webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.
webshot::install_phantomjs()
The above function automatically downloads PhantomJS from its website and installs it. Please note that this is only a first-time setup; once both webshot and PhantomJS are installed, these two steps can be skipped when using the package as described in the sections below.

Now the webshot package is installed, set up and ready to use. To start with, let us take a PDF copy of a web page.

Screenshot Function

webshot package provides one simple function webshot() that takes a webpage url as its first argument and saves it in the given file name that is its second argument. It is important to note that the filename includes the file extensions like '.jpg', '.png', '.pdf' based on which the output file is rendered. Below is the basic structure of how the function goes:

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understand the basics of the webshot() function, it is time to begin with our cases, starting with downloading/converting a webpage as a PDF copy.

Case #1: PDF Copy of WebPage

Let us assume, we would like to download Bill Gates' notes on Best Books of 2017 as a PDF copy.

#loading the required library
library(webshot)

#PDF copy of a web page / article
 delay = 2)

The above code generates a PDF whose (partial) screenshot is below:
Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot( ) function has got 3 arguments supplied with it.
  1. URL from which the screenshot has to be taken. 
  2. Output Filename along with its file extensions. 
  3. Time to wait before taking screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.
Thus, a webpage can be converted/downloaded as a PDF programmatically in R.

Case #2: Webpage Screenshot (Viewport Size)

Now, I'd like to get an automation script running to take a screenshot of a news website and probably send it to my inbox, so I can see the headlines without opening the browser. Here we will see how to get a simple screenshot of livemint.com, an Indian news website.
#Screenshot of Viewport
webshot('https://www.livemint.com/','livemint.png', cliprect = 'viewport')
While the first two arguments are similar to the above function, there's a new third argument, cliprect, which specifies the size of the clipping rectangle.

If cliprect is unspecified, a screenshot of the complete web page is taken (like in the above case). Since we are interested only in the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips only the viewport part of the browser, as below.

Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots

So far we have taken simple screenshots of whole pages, with one screenshot and one file at a time, but that is not what usually happens when you automate a task. In most cases we end up performing more than one action, hence this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of different urls (which is quite straightforward), we will take screenshots of different sections of the same web page with different CSS selectors and save them in respective files.
#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
 file = c("organizations.png","contributions.png"),
 selector = list("div.border-top.py-3.clearfix","div.js-contribution-graph"))
In the above code, we take screenshots of two CSS selectors from the GitHub profile page of Hadley Wickham and save them in two PNG files - organizations.png and contributions.png.


Thus, we have seen how to use the R package webshot for taking screenshots programmatically in R. I hope this post helps fuel your automation needs and helps your organisation improve its efficiency.


Using Excel for Data Entry

This article shows you how to enter data so that you can easily open it in statistics packages such as R, SAS, SPSS, or jamovi (code or GUI steps below). Excel has some statistical analysis capabilities, but they often provide incorrect answers. For a comprehensive list of these limitations, see http://www.forecastingprinciples.com/paperpdf/McCullough.pdf and http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction.

Simple Data Sets

Most data sets are easy to enter using the following rules.

  • All your data should be in a single spreadsheet of a single file (for an exception to this rule, see Relational Data Sets below.)
  • Enter variable names in the first row of the spreadsheet.
  • Consider the length of your variable names. If you know for sure what software you will use, follow its rules for how many characters names can contain. When in doubt, use variable names that are no longer than 8 characters, beginning with a letter. Those short names can be used by any software.
  • Variable names should not contain spaces, but may use the underscore character.
  • No other text rows such as titles should be in the spreadsheet.
  • No blank rows should appear in the data.
  • Always include an ID variable on your original data collection form and in the spreadsheet to help you find the case again if you need to correct errors. You may need to sort the data later, after which the row number in Excel would then apply to a different subject or sampling unit, making it hard to find.
  • Position the ID variable in the left-most column for easy reference. 
  • If you have multiple groups, put them in the same spreadsheet along with a variable that indicates group membership (see Gender example below).
  • Many statistics packages don’t work well with alphabetic characters representing categorical values. For example to enter political party, you might enter 1 instead of Democrat, 2 instead of Republican and 3 instead of Other.
  • Avoid the use of special characters in numeric columns. Currency signs ($, €, etc.) can cause trouble in some programs.
  • If your group has only two levels, coding them 0 and 1 makes some analyses (e.g. linear regression) much easier to do. If the data are logical, use 0 for false, and 1 for true.
    If the data represent gender, it’s common to use 0 for female, 1 for male.
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, some software (like R) will read the column as character data so you will not be able to, for example, calculate the mean of a column without taking action to address the situation.
  • You can enter dates with slashes (8/31/2018) and times with colons (12:15 AM). Note that dates are recorded differently across countries, so make sure you are using a format that matches your locale.
  • For text analysis, you can enter up to 32K of text, or about 8 pages, in a single cell. However, if you cut & paste it from elsewhere, remove carriage returns first as they will cause it to jump to a new cell.
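The missing-value rule above can be illustrated with a small sketch in base R. The data are hypothetical, and a CSV string stands in for the spreadsheet so that the example is self-contained:

```r
# Hypothetical data: one income value was typed as a period
csv_text <- "id,income\n1,52000\n2,.\n3,61000"

# Read as-is: the period forces the whole column to character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$income)

# Tell R that "." (and blank cells) mean missing
fixed <- read.csv(text = csv_text, na.strings = c(".", ""))
class(fixed$income)
mean(fixed$income, na.rm = TRUE)
```

Leaving the cell blank in the first place avoids the need for the na.strings workaround.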

Relational Data Sets

Some data sets contain observations that are related in some way. They may be people who all live in the same home, or samples that all came from the same site. There may be higher levels of relations, such as students within classrooms, then classrooms within schools. Data that contains such relations (a.k.a. nesting) may be stored in a “relational” database, but those are harder to learn than spreadsheet software. Relational data can easily be entered as two or more spreadsheets and combined later during data analysis. This saves quite a lot of data entry as the higher level data (e.g. family house value, socio-economic status, etc.) only needs to be entered once, instead of on several lines (e.g. for each family member).

If you have such data, make sure that each data set contains a “key” variable that acts as a  common ID number for family, site, school, etc. You can later read two files at a time and combine them matching on that key variable. R calls this combination a join or merge; SAS calls it a merge; and SPSS calls it Add Variables.
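A minimal sketch of such a combination in base R, using two small hypothetical data sets (all names and values are invented for illustration) that share the key variable family_id:

```r
# Hypothetical higher-level data: one row per family
families <- data.frame(family_id = c(1, 2),
                       house_value = c(250000, 180000))

# Hypothetical lower-level data: one row per family member
members <- data.frame(person_id = c(101, 102, 103),
                      family_id = c(1, 1, 2),
                      age = c(41, 39, 67))

# Join on the key; each member row picks up its family's values
combined <- merge(members, families, by = "family_id")
print(combined)
```

The house value is typed once per family but ends up on every family member's row after the merge.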

Example of a Good Data Structure

This data set follows all the rules for simple data sets above. Any statistics software can read it easily.

Gender Income
















Example of a Bad Data Structure

This is the same data shown above, but it violates the rules for simple data sets in several ways: there is no column for gender, the income values contain dollar signs and commas, variable names appear on more than one line, variable names are not even consistent (income vs. salary), and there is a blank line in the middle. This would not be easy to read!

Data for Female Subjects
ID Income





Data for Male Subjects
ID Salary







Excel Tips for Data Entry

  • You can make sure your variable names are always visible at the top of your Excel spreadsheet by choosing View> Freeze Panes> Freeze Top Row. This helps you enter data in the proper columns.
  • Avoid using Excel to sort your data. It’s too easy to sort one column independent of the others, which destroys your data! Statistics packages can sort data and they understand the importance of keeping all the values in each row locked together.
  • If you need to enter a pattern of consecutive values such as an ID number with values such as 1,2,3 or 1001,1002,1003, enter the first two, select those cells, then drag the tiny square in the lower right corner as far downward as you wish. Excel will see the pattern of the first two entries and extend it as far as you drag your selection. This works for days of the week and dates too. You can create your own lists in Options>Lists, if you use a certain pattern often.
  • To help prevent typos, you can set minimum and maximum values, or create a list of valid values. Select a column or set of similar columns, then go to the Data tab, then the Data Tools group, and choose Validation. To set minimum and maximum values, choose Allow: Whole Number or Decimals and then fill in the values in the Minimum and Maximum boxes. To create a list of valid values, choose Allow: List and then fill in the numeric or character values separated by commas in the Source box. Note that these rules only operate as you enter data, they will not help you find improper values that you have already entered.
  • The gold standard for data accuracy is the dual entry method. With this method you actually enter all the data twice. Only this method can catch errors that are within the normal range of values, but still wrong. Excel can show you where the values differ. Enter the data first in Sheet1. Then enter it again using the exact same layout in Sheet2. Finally, in Sheet1 select all cells using CTRL-A. Then choose Conditional Formatting> New Rule. Choose “Use a formula to determine which cells to format,” enter this formula:
    =A1<>Sheet2!A1
    then click the Format button, make sure the Fill tab is selected, and choose a color. Then click OK twice. The inconsistencies between the two sheets will then be highlighted in Sheet1. You then check to see which entry was wrong and fix it. When you read the data into a statistics package, you will only need to read the data in Sheet1.
  • When looking for data errors, it can be very helpful to display only a subset of values. To do this, select all the columns you wish to scan for errors, then click the Filter icon on the Data tab. A downward-pointing triangle will appear at the top of each column selected. Clicking it displays a list of the values contained in that column. If you have entered values that are supposed to be, for example, between 1 and 5 and you see 6 on this list, choosing it will show you only those rows in which you made that error. Then you can fix them. You can also use click on Number Filters to use simple logic to find, for example, all rows with values greater than 5. When you are finished, click on the filter icon again to turn it off.
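The dual entry check can also be done after reading both entries into R. The sketch below uses two small hypothetical data frames standing in for the two entries, and assumes they share the exact same layout:

```r
# Hypothetical first and second entry of the same three records
entry1 <- data.frame(id = 1:3, income = c(52000, 48000, 61000))
entry2 <- data.frame(id = 1:3, income = c(52000, 84000, 61000))

# TRUE wherever the two entries disagree (same layout assumed)
mismatch <- as.matrix(entry1) != as.matrix(entry2)

# Row and column position of every disagreement
differences <- which(mismatch, arr.ind = TRUE)
print(differences)
```

Here the only disagreement is in row 2, column 2, where two digits were transposed in one of the entries.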


Save your data frequently and make backup copies often. Don’t leave all your backup copies connected to a computer which would leave them vulnerable to attack by viruses. Don’t store them all in the same building or you risk losing all your hard work in a fire or theft. Get a free account at http://drive.google.com, http://dropbox.com, or http://onedrive.live.com and save copies there.

 Steps for Reading Excel Data Into R

There are several ways to read an Excel file into R. Perhaps the easiest method uses the following commands. They read an excel file named mydata.xlsx into an R data frame called mydata. For examples on how to read many other file formats into R, see:

# Do this once to install:
install.packages("readxl")

# Each time you read a file, follow these steps
library("readxl")
mydata <- read_excel("mydata.xlsx")

Steps for Reading Excel Data Into SPSS

  1. In SPSS, choose File> Open> Data.
  2. Change the “Files of type” box to “Excel (*.xlsx)”
  3. When the Read Excel File box appears, select the Worksheet name and check the box for Read variable names from the first row of data, then click OK.
  4. When the data appears in the SPSS data editor spreadsheet, choose File> Save As and leave the Save as type box set to SPSS (*.sav).
  5. Enter the name of the file without the .sav extension and then click Save to save the file in SPSS format.
  6. Next time, open the .sav version; you won’t need to convert the file again.
  7. If you create variable or value labels in the SPSS file and then need to read your data from Excel again you can copy them into the new file. First, make sure you use the same variable names. Next, after opening the file in SPSS, use Copy Data Properties from the Data menu. Simply name the SPSS file that has properties (such as labels) that you want to copy, check off the things you want to copy and click OK. 

Steps for Reading Excel Data Into SAS

The code below will read an excel file called mydata.xlsx and store it as a permanent SAS dataset called sasuser.mydata. If your organization is considering migrating from SAS to R, I offer some tips here: http://r4stats.com/articles/migrate-to-r/

proc import datafile="mydata.xlsx"
dbms=xlsx out=sasuser.mydata replace;
run;

Steps for Reading Excel Data into jamovi

At the moment, jamovi can open CSV, JASP, SAS, SPSS, and Stata files, but not Excel. So you must open the data in Excel and Save As a comma separated value (CSV) file. The ability to read Excel files should be added to a release in the near future. For more information about the free and open source jamovi software, see my review here:

More to Come

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter where I announce my blog posts.


Run Python from R

This article explains how to call or run python from R. Both tools have their own advantages and disadvantages. It's always a good idea to use the best packages and functions from both tools and combine them. In the data science world, these tools have a good market share in terms of usage. R is mainly known for data analysis, statistical modeling and visualization, while python is popular for deep learning and natural language processing.

In a recent KDnuggets Analytics software survey poll, Python and R were ranked the top 2 tools for data science and machine learning. If you really want to boost your career in the data science world, these are the languages you need to focus on.
Combine Python and R

RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

Install and Load Reticulate Package

Run the command below to get this package installed and imported to your system.
# Install reticulate package
install.packages("reticulate")

# Load reticulate package
library(reticulate)

Check whether Python is available on your system
py_available()
It returns TRUE/FALSE. If it is TRUE, it means python is installed on your system.

Import a python module within R

You can use the function import( ) to import a particular package or module.
os <- import("os")
os$getcwd()
The above program returns the working directory.
[1] "C:\\Users\\DELL\\Documents"

You can use the listdir( ) function from the os package to see all the files in the working directory
os$listdir()
 [1] ".conda"                 ".gitignore"             ".httr-oauth"
 [4] ".matplotlib"            ".RData"                 ".RDataTmp"
 [7] ".Rhistory"              "1.pdf"                  "12.pdf"
[10] "122.pdf"                "124.pdf"                "13.pdf"
[13] "1403.2805.pdf"          "2.pdf"                  "3.pdf"
[16] "AIR.xlsx"               "app.r"                  "Apps"
[19] "articles.csv"           "Attrition_Telecom.xlsx" "AUC.R"

Install Python Package

Step 1 : Create a new environment
conda_create("r-reticulate")
Step 2 : Install a package within a conda environment
conda_install("r-reticulate", "numpy")
Since numpy is already installed, you don't need to install it again. The above example is just for demonstration.

Step 3 : Load the package
numpy <- import("numpy")

Working with numpy array

Let's create a sample numpy array
y <- array(1:4, c(2, 2))
x <- numpy$array(y)
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Transpose the above array
numpy$transpose(x)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Eigenvalues and eigen vectors
numpy$linalg$eig(x)
[1] -0.3722813  5.3722813

           [,1]       [,2]
[1,] -0.9093767 -0.5657675
[2,]  0.4159736 -0.8245648

Mathematical Functions

Working with Python interactively

You can create an interactive Python console within R session. Objects you create within Python are available to your R session (and vice-versa).

By using the repl_python() function, you can make it interactive. Download the dataset used in the program below.

# Start the Python console (run this line in R)
repl_python()

# Load Pandas package
import pandas as pd

# Importing Dataset
travel = pd.read_excel("AIR.xlsx")

# Number of rows and columns
travel.shape

# Select random no. of rows
travel.sample(n = 10)

# Group By
travel.groupby("Year").AIR.mean()

# Filter
t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

# Return to R
exit
Note : You need to enter exit to return to the R environment.
Run Python from R

How to access objects created in python from R

You can use the py object to access objects created within python.
summary(py$t)
In this case, I am using R's summary( ) function and accessing the dataframe t which was created in python. Similarly, you can create a line plot using the ggplot2 package.
# Line chart using ggplot2
library(ggplot2)
ggplot(py$t, aes(AIR, Year)) + geom_line()

How to access objects created in R from Python

You can use the r object to accomplish this task.

1. Let's create an object in R
mydata = head(cars, n=15)
2. Use the R created object within Python REPL
import pandas as pd
r.mydata.describe()

Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in python. It supports various statistical and machine learning algorithms.

# Load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# load the iris datasets
iris = datasets.load_iris()

# Developing logit model
model = LogisticRegression()
model.fit(iris.data, iris.target)

# Scoring
actual = iris.target
predicted = model.predict(iris.data)

# Performance Metrics
print(metrics.classification_report(actual, predicted))
print(metrics.confusion_matrix(actual, predicted))

Other Useful Functions

To see configuration of python

Run the py_config( ) command to find the version of python installed on your system. It also shows details about anaconda and numpy.
py_config()
python:         C:\Users\DELL\ANACON~1\python.exe
libpython: C:/Users/DELL/ANACON~1/python36.dll
pythonhome: C:\Users\DELL\ANACON~1
version: 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Architecture: 64bit
numpy: C:\Users\DELL\ANACON~1\lib\site-packages\numpy
numpy_version: 1.14.2

To check whether a particular package is installed

In the following program, we are checking whether the pandas package is installed or not.
py_module_available("pandas")

15 Types of Regression you should know

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only 2-3 types of regression that are commonly used in the real world, typically linear and logistic regression. But in fact there are more than 10 types of regression algorithms designed for various types of analysis. Each type has its own significance. Every analyst must know which form of regression to use depending on the type of data and its distribution.

Table of Contents
  1. What is Regression Analysis?
  2. Terminologies related to Regression
  3. Types of Regressions
    • Linear Regression
    • Polynomial Regression
    • Logistic Regression
    • Quantile Regression
    • Ridge Regression
    • Lasso Regression
    • ElasticNet Regression
    • Principal Component Regression
    • Partial Least Square Regression
    • Support Vector Regression
    • Ordinal Regression
    • Poisson Regression
    • Negative Binomial Regression
    • Quasi-Poisson Regression
    • Cox Regression
  4. How to choose the correct Regression Model?
Regression Analysis Simplified

What is Regression Analysis?

Let's take a simple example: Suppose your manager asked you to predict annual sales. There can be a hundred factors (drivers) that affect sales. In this case, sales is your dependent variable. Factors affecting sales are independent variables. Regression analysis would help you to solve this problem.
In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables.

It helps us to answer the following questions -
  1. Which of the drivers have a significant impact on sales?
  2. Which is the most important driver of sales?
  3. How do the drivers interact with each other?
  4. What would be the annual sales next year?

Terminologies related to regression analysis

1. Outliers
Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations, i.e. it does not appear to belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often distorts the results we get.

2. Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many types of regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables by importance and makes it difficult to select the most important independent variable (factor).

3. Heteroscedasticity
When dependent variable's variability is not equal across values of an independent variable, it is called heteroscedasticity. Example - As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

4. Underfitting and Overfitting
When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set. It is also known as the problem of high variance.

When our algorithm works so poorly that it is unable to fit even training set well then it is said to underfit the data. It is also known as problem of high bias.

In the following diagram we can see that fitting a linear regression (straight line in fig 1) would underfit the data i.e. it will lead to large errors even in the training set. Using a polynomial fit in fig 2 is balanced i.e. such a fit can work on the training and test sets well, while in fig 3 the fit will lead to low errors in training set but it will not work well on the test set.
Regression : Underfitting and Overfitting
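The same idea can be sketched numerically in base R. The data below are simulated (an assumption made for illustration, not the data behind the diagram): a quadratic trend plus noise, fitted with polynomials of degree 1, 2 and 10 and scored on a held-out test set.

```r
set.seed(1)

# Simulated data (an assumption for illustration): a quadratic trend plus noise
n <- 60
x <- runif(n, -3, 3)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n, sd = 2)
dat <- data.frame(x = x, y = y)

# Hold out one third of the rows as a test set
test_rows <- sample(n, n / 3)
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# Mean squared error of a fitted model on a given data set
mse <- function(model, data) mean((data$y - predict(model, data))^2)

# Degree 1 (likely underfits), 2 (matches the simulation), 10 (likely overfits)
degrees <- c(1, 2, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

The training error can only go down as the degree increases, so it is the gap between training and test error that signals overfitting.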

Types of Regression

Every regression technique has some assumptions attached to it which we need to meet before running analysis. These techniques differ in terms of type of dependent and independent variables and distribution.

1. Linear Regression

It is the simplest form of regression. It is a technique in which the dependent variable is continuous in nature. The relationship between the dependent variable and independent variables is assumed to be linear in nature. We can observe that the given plot represents a roughly linear relationship between the mileage and displacement of cars. The green points are the actual observations while the fitted black line is the line of regression.

Regression Analysis

When you have only 1 independent variable and 1 dependent variable, it is called simple linear regression.
When you have more than 1 independent variable and 1 dependent variable, it is called multiple linear regression.
The equation of multiple linear regression is listed below -

y = β0 + β1 X1 + β2 X2 + ... + βk Xk + ε

Here 'y' is the dependent variable to be estimated, X1, ..., Xk are the independent variables and ε is the error term. The βi’s are the regression coefficients.

Assumptions of linear regression: 
  1. There must be a linear relation between independent and dependent variables. 
  2. There should not be any outliers present. 
  3. No heteroscedasticity 
  4. Sample observations should be independent. 
  5. Error terms should be normally distributed with mean 0 and constant variance. 
  6. Absence of multicollinearity and auto-correlation.

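Estimating the parameters
To estimate the regression coefficients βi’s we use the principle of least squares, which is to minimize the sum of squares due to the error terms, i.e.

minimize Σ εi² = Σ (yi − β0 − β1 Xi1 − ... − βk Xik)²

On solving the above equation mathematically we obtain the regression coefficients as:

β̂ = (X'X)^(-1) X'y

where X is the matrix of independent variables (with a leading column of 1s for the intercept) and y is the vector of observed values of the dependent variable.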
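As a sketch, the least squares estimate can be computed directly from the normal equations, (X'X) b = X'y, and compared with lm( ). The built-in mtcars data and the choice of predictors (disp and wt) are used here only for illustration:

```r
# Built-in mtcars data, used here only as an illustration
X <- cbind(Intercept = 1, disp = mtcars$disp, wt = mtcars$wt)  # design matrix
y <- mtcars$mpg

# Closed-form least squares estimate: solve (X'X) b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() gives the same coefficients (it uses a QR decomposition internally)
fit <- lm(mpg ~ disp + wt, data = mtcars)
cbind(normal_equations = beta_hat[, 1], lm = coef(fit))
```

In practice lm( ) is preferable, since solving the normal equations directly can be numerically unstable when predictors are highly correlated.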

Interpretation of regression coefficients
Let us consider an example where the dependent variable is marks obtained by a student and explanatory variables are number of hours studied and no. of classes attended. Suppose on fitting linear regression we got the linear regression as:
Marks obtained = 5 + 2 (no. of hours studied) + 0.5(no. of classes attended)
Thus we have the intercept 5 and the regression coefficients 2 and 0.5, which can be interpreted as:
  1. If no. of hours studied and no. of classes are 0 then the student will obtain 5 marks.
  2. Keeping no. of classes attended constant, if student studies for one hour more then he will score 2 more marks in the examination. 
  3. Similarly keeping no. of hours studied constant, if student attends one more class then he will attain 0.5 marks more.
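The three interpretations above can be checked with a few lines of R. The function name predict_marks is illustrative; the equation is the one from the example:

```r
# The fitted equation from the example above
predict_marks <- function(hours, classes) 5 + 2 * hours + 0.5 * classes

predict_marks(0, 0)                              # intercept: 5
predict_marks(11, 20) - predict_marks(10, 20)    # one more hour: 2
predict_marks(10, 21) - predict_marks(10, 20)    # one more class: 0.5
```

The differences do not depend on the starting values of hours and classes, which is what "keeping the other variable constant" means in a linear model.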

Linear Regression in R
We use the swiss data set to carry out linear regression in R, with the lm() function from the stats package (loaded by default). We try to estimate Fertility from the other variables.
model = lm(Fertility ~ .,data = swiss)
lm_coeff = model$coefficients

The output we get is:

> lm_coeff
     (Intercept)      Agriculture      Examination        Education         Catholic
      66.9151817       -0.1721140       -0.2580082       -0.8709401        0.1041153
> summary(model)

Call:
lm(formula = Fertility ~ ., data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max
-15.2743  -5.2617   0.5032   4.1198  15.3213

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *
Examination      -0.25801    0.25388  -1.016  0.31546
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 **
Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
Hence about 70.7% of the variation in Fertility is explained by the linear regression (multiple R-squared = 0.7067).
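To make the R-squared figure concrete, it can be recomputed by hand from the residual and total sums of squares. The sketch below refits the same model on swiss; the variable names (rss, tss, r2, adj_r2) are illustrative.

```r
fit <- lm(Fertility ~ ., data = swiss)

# Residual sum of squares and total sum of squares
rss <- sum(residuals(fit)^2)
tss <- sum((swiss$Fertility - mean(swiss$Fertility))^2)

# R-squared: share of the total variation explained by the model
r2 <- 1 - rss / tss

# Adjusted R-squared penalizes the number of predictors p
n <- nrow(swiss)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```

Both values should agree with the "Multiple R-squared" and "Adjusted R-squared" lines printed by summary().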

2. Polynomial Regression

It is a technique for fitting a nonlinear relationship by using polynomial functions of the independent variable.
In the figure given below, the red curve fits the data better than the green curve. Hence, where the relationship between the dependent and independent variable appears to be non-linear, we can use polynomial regression models.
A polynomial of degree k in one variable is written as:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + \varepsilon$$

Here we can create new features such as

$$X_1 = x,\quad X_2 = x^2,\quad \dots,\quad X_k = x^k$$

and fit a linear regression in the same manner.

In the case of multiple variables, say X1 and X2, we can create a third feature (say X3) which is the product of X1 and X2, i.e.

$$X_3 = X_1 \cdot X_2$$
Disclaimer: It is to be kept in mind that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting.

Polynomial regression in R:
We are using poly.csv data for fitting polynomial regression where we try to estimate the Prices of the house given their area.

Firstly we read the data using read.csv( ) and divide it into the dependent and independent variable
data = read.csv("poly.csv")
x = data$Area
y = data$Price
In order to compare the results of linear and polynomial regression, firstly we fit linear regression:
model1 = lm(y ~x)

The coefficients and predicted values obtained are:
> model1$fit
1 2 3 4 5 6 7 8 9 10
169.0995 178.9081 188.7167 218.1424 223.0467 266.6949 291.7068 296.6111 316.2282 335.8454
> model1$coeff
(Intercept) x
120.05663769 0.09808581
We create a matrix whose columns are x and x squared.

new_x = cbind(x,x^2)

[1,] 500 250000
[2,] 600 360000
[3,] 700 490000
[4,] 1000 1000000
[5,] 1050 1102500
[6,] 1495 2235025
[7,] 1750 3062500
[8,] 1800 3240000
[9,] 2000 4000000
[10,] 2200 4840000
Now we fit usual OLS to the new data:
model2 = lm(y~new_x)

The fitted values and regression coefficients of polynomial regression are:
> model2$fit
1 2 3 4 5 6 7 8 9 10
122.5388 153.9997 182.6550 251.7872 260.8543 310.6514 314.1467 312.6928 299.8631 275.8110
> model2$coeff
(Intercept) new_xx new_x
-7.684980e+01 4.689175e-01 -1.402805e-04

Using the ggplot2 package we plot the fitted curves from the linear and polynomial regressions for comparison.
library(ggplot2)
ggplot(data = data) + geom_point(aes(x = Area, y = Price)) +
  geom_line(aes(x = Area, y = model1$fit), color = "red") +
  geom_line(aes(x = Area, y = model2$fit), color = "blue") +
  theme(panel.background = element_blank())
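Since poly.csv is not distributed with R, the same comparison can be sketched on a built-in data set. The example below is an assumption-laden stand-in (mtcars, with mpg regressed on disp) rather than the house price data; it also shows that poly(x, 2, raw = TRUE) builds the same x and x squared columns as cbind(x, x^2).

```r
# Built-in mtcars data: mileage (mpg) against displacement (disp)
lin  <- lm(mpg ~ disp, data = mtcars)
quad <- lm(mpg ~ disp + I(disp^2), data = mtcars)

# poly(..., raw = TRUE) creates the raw polynomial terms for us
quad2 <- lm(mpg ~ poly(disp, 2, raw = TRUE), data = mtcars)

# R-squared of the straight line versus the quadratic fit
c(linear = summary(lin)$r.squared, quadratic = summary(quad)$r.squared)

# Both ways of writing the quadratic model give the same fitted values
all.equal(unname(fitted(quad)), unname(fitted(quad2)))

# F-test of whether the squared term improves on the straight line
anova(lin, quad)
```

R-squared can never decrease when a term is added, so the anova() test (or cross-validation) is the more informative check against overfitting.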

3. Logistic Regression

In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable.

Here the model is:

$$\log\left(\frac{p}{1-p}\right) = a + B_n X$$

where p = P(Y = 1) is the probability of the event.

Why don't we use linear regression in this case?
  • The homoscedasticity assumption is violated.
  • Errors are not normally distributed
  • y follows binomial distribution and hence is not normal.

Examples:

  • HR Analytics: IT firms recruit large numbers of people, but one problem they encounter is that many candidates do not join after accepting the job offer. This results in cost overruns because the entire process has to be repeated. When you receive an application, can you predict whether that applicant is likely to join the organization (binary outcome: Join / Not Join)?

  • Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

Predicting the category of dependent variable for a given vector X of independent variables
Through logistic regression we have -
P(Y=1) = exp(a + BₙX)  / (1+ exp(a + BₙX))

Thus we choose a cut-off of probability say 'p'  and if P(Yi = 1) > p then we can say that Yi belongs to class 1 otherwise 0.
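The probability formula and the cut-off rule can be traced through with a few numbers. In this sketch the coefficients a and b and the x values are hypothetical, chosen only for illustration.

```r
# Hypothetical coefficients, for illustration only
a <- -1
b <- 0.5
x <- c(0, 2, 6)

# P(Y = 1) from the logistic formula
p <- exp(a + b * x) / (1 + exp(a + b * x))

# plogis() is R's built-in logistic function and gives the same values
all.equal(p, plogis(a + b * x))

# Classify with a cut-off probability of 0.5
cutoff <- 0.5
predicted_class <- as.integer(p > cutoff)
data.frame(x, p, predicted_class)
```

Note that the middle observation has a probability of exactly 0.5, so with a strict "greater than" rule it is assigned to class 0.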

Interpreting the logistic regression coefficients (Concept of Odds Ratio)
If we take the exponential of a coefficient, we get the odds ratio for the ith explanatory variable. Suppose the odds ratio is equal to two; then a one-unit increase in that variable multiplies the odds of the event by 2, holding the other variables constant. For example, suppose the dependent variable is customer attrition (whether the customer will close the relationship with the company) and the independent variable is citizenship status (National / Expat). An odds ratio of 3 for Expat means that the odds of attrition for an expat are 3 times the odds for a national.
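The link between coefficients and odds ratios can be verified on a small built-in example. The sketch below uses mtcars as a stand-in (it is not the attrition data described above): with a single binary predictor, the exponentiated coefficient should match the odds ratio computed by hand from the 2 x 2 table.

```r
# Built-in mtcars data: am (1 = manual) as outcome, vs (1 = straight engine) as predictor
fit <- glm(am ~ vs, data = mtcars, family = "binomial")
or_glm <- unname(exp(coef(fit)["vs"]))

# The same odds ratio computed by hand from the 2 x 2 table
tab <- table(vs = mtcars$vs, am = mtcars$am)
odds_vs0 <- tab["0", "1"] / tab["0", "0"]
odds_vs1 <- tab["1", "1"] / tab["1", "0"]
or_table <- odds_vs1 / odds_vs0

c(glm = or_glm, table = or_table)
```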

Logistic Regression in R:
In this case, we try to estimate whether a person will have lung cancer depending on whether or not they smoke.

We fit the logistic regression with the glm() function, setting family = "binomial".
model <- glm(Lung.Cancer..Y.~Smoking..X.,data = data, family = "binomial")
The predicted probabilities are given by:
#Predicted Probabilities
model$fitted.values

        1         2         3         4         5         6         7         8         9
0.4545455 0.4545455 0.6428571 0.6428571 0.4545455 0.4545455 0.4545455 0.4545455 0.6428571
       10        11        12        13        14        15        16        17        18
0.6428571 0.4545455 0.4545455 0.6428571 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
       19        20        21        22        23        24        25
0.6428571 0.4545455 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
Predicting whether the person will have cancer or not when we choose the cut off probability to be 0.5
data$prediction <- model$fitted.values>0.5
> data$prediction

4. Quantile Regression

Quantile regression is an extension of linear regression, generally used when outliers, high skewness and heteroscedasticity exist in the data.

In linear regression, we predict the mean of the dependent variable for given independent variables. Since the mean does not describe the whole distribution, modeling the mean is not a full description of the relationship between the dependent and independent variables. Quantile regression instead predicts a quantile (or percentile) for given independent variables.
A quantile is the same concept as a percentile, expressed as a fraction rather than a percentage: the 0.25 quantile is the 25th percentile.

Basic Idea of Quantile Regression: In quantile regression we estimate a quantile of the dependent variable given the values of the X's. Note that the dependent variable should be continuous.

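The quantile regression model:
For the qth quantile we have the following regression model:

$$y_i = x_i^{T}\beta_q + e_i$$

This looks similar to the linear regression model, but here the objective function to be minimized is:

$$\sum_{i:\, y_i \ge x_i^{T}\beta_q} q\,\lvert y_i - x_i^{T}\beta_q \rvert \;+\; \sum_{i:\, y_i < x_i^{T}\beta_q} (1-q)\,\lvert y_i - x_i^{T}\beta_q \rvert$$

where q is the quantile of interest (0 < q < 1).

If q = 0.5, i.e. if we are interested in the median, it becomes median regression (or least absolute deviation regression), and substituting q = 0.5 into the above objective function gives:

$$\sum_{i=1}^{n} 0.5\,\lvert y_i - x_i^{T}\beta_{0.5} \rvert$$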
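The role of this asymmetric objective is easiest to see without any regressors: the constant that minimizes it is simply the sample quantile. The following base R sketch illustrates that on swiss$Fertility; check_loss is an illustrative helper name, not a function from any package.

```r
# Check (pinball) loss for quantile q, evaluated on residuals u:
# positive residuals are weighted by q, negative ones by (1 - q)
check_loss <- function(u, q) sum(ifelse(u >= 0, q * u, (q - 1) * u))

y <- swiss$Fertility
q <- 0.25

# Evaluate the loss at every observed value and keep the minimizer
candidates <- sort(y)
losses <- sapply(candidates, function(c0) check_loss(y - c0, q))
best <- candidates[which.min(losses)]

# The minimizer is a sample quantile (type = 1 is the inverse empirical CDF)
c(minimizer = best, quantile = unname(quantile(y, q, type = 1)))
```

With regressors, rq() from the quantreg package minimizes the same loss over the coefficient vector instead of over a single constant.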
Interpreting the coefficients in quantile regression:
Suppose the regression equation for the 25th percentile (q = 0.25) is:
y = 5.2333 + 700.823 x

It means that a one-unit increase in x is associated with an estimated increase of 700.823 units in the 25th percentile of y.
Advantages of Quantile over Linear Regression
  • Quite beneficial when heteroscedasticity is present in the data.
  • Robust to outliers
  • Distribution of dependent variable can be described via various quantiles.
  • It is more useful than linear regression when the data is skewed.

Disclaimer on using quantile regression!
It should be kept in mind that the coefficients we get from quantile regression for a particular quantile should differ significantly from those we obtain from linear regression. If they do not, the use of quantile regression is not justified. This can be checked by comparing the confidence intervals of the coefficient estimates from the two regressions.

Quantile Regression in R
We need to install and load the quantreg package in order to carry out quantile regression.
install.packages("quantreg")
library(quantreg)

Using the rq() function we estimate the 25th percentile of Fertility in the swiss data. For this we set tau = 0.25.

model1 = rq(Fertility~.,data = swiss,tau = 0.25)
summary(model1)

tau: [1] 0.25

                 coefficients lower bd upper bd
(Intercept)          76.63132  2.12518 93.99111
Agriculture          -0.18242 -0.44407  0.10603
Examination          -0.53411 -0.91580  0.63449
Education            -0.82689 -1.25865 -0.50734
Catholic              0.06116  0.00420  0.22848
Infant.Mortality      0.69341 -0.10562  2.36095

Setting tau = 0.5 we run the median regression.
model2 = rq(Fertility~.,data = swiss,tau = 0.5)
summary(model2)

tau: [1] 0.5

                 coefficients lower bd upper bd
(Intercept)          63.49087 38.04597 87.66320
Agriculture          -0.20222 -0.32091 -0.05780
Examination          -0.45678 -1.04305  0.34613
Education            -0.79138 -1.25182 -0.06436
Catholic              0.10385  0.01947  0.15534
Infant.Mortality      1.45550  0.87146  2.21101

We can run quantile regression for multiple quantiles at once.
model3 = rq(Fertility~.,data = swiss, tau = seq(0.05,0.95,by = 0.05))
quantplot = summary(model3)

We can check whether our quantile regression results differ from the OLS results using plots.
plot(quantplot)

We get the following plot:

The X axis shows the quantiles. The solid red line denotes the OLS coefficient estimate and the dotted red lines are the confidence interval around it. The black dotted line shows the quantile regression estimates, and the gray area is their confidence interval across quantiles. We can see that, for all the variables, the two sets of estimates coincide for most quantiles, so the use of quantile regression is not justified for those quantiles. In other words, the red lines and the gray area should overlap as little as possible to justify the use of quantile regression.

5. Ridge Regression

It's important to understand the concept of regularization before jumping to ridge regression.

1. Regularization

Regularization helps to solve the overfitting problem, in which a model performs well on training data but poorly on validation (test) data. It does so by adding a penalty term to the objective function and controlling model complexity through that penalty.

Regularization is generally useful in the following situations:
  1. Large number of variables
  2. Low ratio of number of observations to number of variables
  3. High multicollinearity

2. L1 Loss function or L1 Regularization

In L1 regularization we add to the objective function a penalty proportional to the sum of the absolute values of the coefficients. (This is distinct from the least absolute deviations method, which applies the L1 norm to the residuals rather than to the coefficients.) Lasso regression makes use of L1 regularization.

3. L2 Loss function or L2 Regularization

In L2 regularization we add to the objective function a penalty proportional to the sum of the squares of the coefficients. Ridge regression (also called shrinkage regression) makes use of L2 regularization.

L2 regularization is often a good default and is computationally efficient, since it has a closed-form solution. There is one area where L1 is preferred over L2: L1 has built-in feature selection for sparse feature spaces. For example, suppose you are predicting whether a person has a brain tumor using more than 20,000 genetic markers (features). It is known that the vast majority of genes have little or no effect on the presence or severity of most diseases, so a method that sets most coefficients exactly to zero is a natural fit.

In the linear regression objective function we minimize the sum of squared errors. In ridge regression (also known as shrinkage regression) we add a penalty on the sum of squares of the regression coefficients. Thus in ridge regression our objective function is:

$$\min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j X_{ij}\right)^2 + \lambda \sum_{j=1}^{k}\beta_j^2$$

Here λ is the regularization parameter, which is a non-negative number. We do not assume normality of the error terms.

Very Important Note:
We do not regularize the intercept term. The penalty is only on the sum of squares of the regression coefficients of the X's.
We can see that ridge regression makes use of L2 regularization.

On solving the above objective function (with centered variables, so that the intercept drops out) we get the estimates of β as:

$$\hat{\beta}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}y$$

How can we choose the regularization parameter λ?

If we choose lambda = 0 then we get back to the usual OLS estimates. If lambda is chosen to be very large then it will lead to underfitting. Thus it is highly important to determine a desirable value of lambda. To tackle this issue, we plot the parameter estimates against different values of lambda and select the minimum value of λ after which the parameters tend to stabilize.
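The effect of λ can be sketched with the closed-form ridge estimate in base R. This is an illustration under stated assumptions: predictors are standardized and the response centered so the intercept drops out, ridge_beta is a hypothetical helper, and the λ values are arbitrary. The λ scale here is not directly comparable to the lambda argument of glmnet, which scales its objective differently.

```r
# Standardize predictors and center the response so the intercept drops out
Xs <- scale(as.matrix(swiss[, -1]))
yc <- swiss$Fertility - mean(swiss$Fertility)

# Closed-form ridge estimate (X'X + lambda I)^-1 X'y for a given lambda
ridge_beta <- function(lambda) {
  p <- ncol(Xs)
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

lambdas <- c(0, 1, 10, 100, 1000)
coefs <- sapply(lambdas, ridge_beta)
colnames(coefs) <- paste0("lambda=", lambdas)
round(coefs, 3)

# The sum of squared coefficients shrinks as lambda grows
colSums(coefs^2)
```

The first column (λ = 0) reproduces the OLS fit on the standardized data, and the later columns show the coefficients being pulled toward zero without reaching it.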

R code for Ridge Regression

Considering the swiss data set, we create two different datasets, one containing dependent variable and other containing independent variables.
X = swiss[,-1]
y = swiss[,1]

We need to load the glmnet library to carry out ridge regression.
library(glmnet)
Using the cv.glmnet() function we can do cross-validation. Setting alpha = 0 gives ridge regression (the glmnet default, alpha = 1, gives lasso). lambda is a sequence of candidate values of λ to be used in cross-validation.
set.seed(123) #Setting the seed to get similar results.
model = cv.glmnet(as.matrix(X),y,alpha = 0,lambda = 10^seq(4,-1,-0.1))

We take the best lambda by using lambda.min and hence get the regression coefficients using predict function.
best_lambda = model$lambda.min

ridge_coeff = predict(model,s = best_lambda,type = "coefficients")
ridge_coeff

The coefficients obtained using ridge regression are:
6 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)      64.92994664
Agriculture      -0.13619967
Examination      -0.31024840
Education        -0.75679979
Catholic          0.08978917
Infant.Mortality  1.09527837

6. Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It uses L1 regularization in the objective function. Thus the objective function in lasso regression becomes:

$$\min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j X_{ij}\right)^2 + \lambda \sum_{j=1}^{k}\lvert\beta_j\rvert$$

λ is the regularization parameter and the intercept term is not regularized.
We do not assume that the error terms are normally distributed.
The estimates have no closed-form formula, but they can be obtained numerically using statistical software.

Note that lasso regression also needs standardization.

Advantage of lasso over ridge regression

Lasso regression can perform built-in variable selection as well as parameter shrinkage. With ridge regression, one ends up keeping all the variables, but with shrunken parameters.
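The difference between selection and shrinkage is easiest to see in the special case of an orthonormal design, where both estimators have simple closed forms in terms of the OLS coefficients. The sketch below assumes that special case and uses made-up OLS values; it is not how glmnet computes its estimates.

```r
# OLS coefficients in a hypothetical orthonormal design (illustrative values)
b_ols <- c(3, -1.5, 0.4)
lambda <- 1

# Lasso: soft-thresholding at lambda / 2 for the objective RSS + lambda * sum(|beta|)
lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)

# Ridge: proportional shrinkage for the objective RSS + lambda * sum(beta^2)
ridge <- b_ols / (1 + lambda)

rbind(ols = b_ols, lasso = lasso, ridge = ridge)
```

The smallest coefficient is set exactly to zero by the lasso rule, while the ridge rule only scales it down.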

R code for Lasso Regression

Considering the swiss dataset from "datasets" package, we have: 
#Creating dependent and independent variables.
X = swiss[,-1]
y = swiss[,1]
Using cv.glmnet in the glmnet package we do cross-validation. For lasso regression we set alpha = 1. By default standardize = TRUE, hence we do not need to standardize the variables separately.
#Setting the seed for reproducibility
set.seed(123)
model = cv.glmnet(as.matrix(X),y,alpha = 1,lambda = 10^seq(4,-1,-0.1))
#By default standardize = TRUE

We take the best value of lambda by extracting lambda.min from the model and then get the coefficients using the predict function.
#Taking the best lambda
best_lambda = model$lambda.min
lasso_coeff = predict(model,s = best_lambda,type = "coefficients")
lasso_coeff

The lasso coefficients we got are:
6 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)      65.46374579
Agriculture      -0.14994107
Examination      -0.24310141
Education        -0.83632674
Catholic          0.09913931
Infant.Mortality  1.07238898

Which one is better - Ridge regression or Lasso regression?

Both ridge regression and lasso regression are designed to deal with multicollinearity.
Ridge regression is computationally more efficient than lasso regression. Either of them can perform better, depending on the data. So the best approach is to select the model that performs best on the test set.

7. Elastic Net Regression
Elastic Net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables.

It is a combination of both L1 and L2 regularization.

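The objective function in case of Elastic Net Regression is:

$$\min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j X_{ij}\right)^2 + \lambda_1 \sum_{j=1}^{k}\lvert\beta_j\rvert + \lambda_2 \sum_{j=1}^{k}\beta_j^2$$

In glmnet the two penalties are controlled by a single λ together with a mixing parameter alpha between 0 and 1.
Like ridge and lasso regression, it does not assume normality.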

R code for Elastic Net Regression

Setting some different value of alpha between 0 and 1 we can carry out elastic net regression.
model = cv.glmnet(as.matrix(X),y,alpha = 0.5,lambda = 10^seq(4,-1,-0.1))
#Taking the best lambda
best_lambda = model$lambda.min
en_coeff = predict(model,s = best_lambda,type = "coefficients")
en_coeff

The coefficients we obtained are:
6 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)      65.9826227
Agriculture      -0.1570948
Examination      -0.2581747
Education        -0.8400929
Catholic          0.0998702
Infant.Mortality  1.0775714

8. Principal Components Regression (PCR)
PCR is a regression technique which is widely used when you have many independent variables or when multicollinearity exists in your data. It is divided into 2 steps:
  1. Getting the Principal components
  2. Run regression analysis on principal components
The most common features of PCR are:
  1. Dimensionality Reduction
  2. Removal of multicollinearity

Getting the Principal components

Principal components analysis is a statistical method to extract new features when the original features are highly correlated. We create new features with the help of original features such that the new features are uncorrelated.

Let us consider the first principal component:

$$U_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$$

The first PC has the maximum variance among all such linear combinations (with the coefficients normalized so that their squares sum to 1).
Similarly we can find the second PC U2 such that it is uncorrelated with U1 and has the second largest variance.
In a similar manner, for 'p' features we can have a maximum of 'p' PCs such that all the PCs are uncorrelated with each other, the first PC has the maximum variance, the second PC has the second largest variance, and so on.

It is to be mentioned that PCR is not a feature selection technique; it is a feature extraction technique. Each principal component we obtain is a function of all the features. Hence, when using principal components, one cannot explain which factor is affecting the dependent variable to what extent.
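These properties of principal components can be checked with prcomp() from base R. The sketch below uses the predictors of the built-in longley data (dropping Year and the response Employed); the object names X and pc are illustrative.

```r
# Predictors from the built-in longley data (Year and the response Employed dropped)
X <- longley[, !(colnames(longley) %in% c("Year", "Employed"))]

# Principal components of the standardized predictors
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Component variances are in decreasing order
round(pc$sdev^2, 4)

# Cumulative proportion of variance explained
round(cumsum(pc$sdev^2) / sum(pc$sdev^2), 4)

# The component scores are mutually uncorrelated (off-diagonals are ~0)
round(cor(pc$x), 6)

# Each component loads on all of the original features
round(pc$rotation, 3)
```

The rotation matrix makes the feature extraction point visible: no component corresponds to a single original variable.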

Principal Components Regression in R

We use the longley data set available in R, which is known for its high multicollinearity. We exclude the Year column.
data1 = longley[,colnames(longley) != "Year"]

View(data1)
This is how some of the observations in our dataset look:
We use the pls package in order to run PCR.
library(pls)

In PCR we are trying to estimate the number of Employed people; scale = TRUE standardizes the variables; validation = "CV" requests cross-validation.
pcr_model <- pcr(Employed~., data = data1, scale = TRUE, validation = "CV")

We get the summary as:
summary(pcr_model)
Data:  X dimension: 16 5
       Y dimension: 16 1
Fit method: svdpc
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           3.627    1.194    1.118   0.5555   0.6514   0.5954
adjCV        3.627    1.186    1.111   0.5489   0.6381   0.5819

TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps
X           72.19    95.70    99.68    99.98   100.00
Employed    90.42    91.89    98.32    98.33    98.74

The RMSEP section reports the cross-validated root mean squared errors of prediction, while 'TRAINING: % variance explained' gives the cumulative percentage of variance explained by the principal components. We can see that 3 PCs explain more than 99% of the variation in the predictors and about 98% of the variation in Employed.
We can also create a plot depicting the mean squared error for different numbers of PCs.
validationplot(pcr_model,val.type = "MSEP")
By writing val.type = "R2" we can plot the R square for various no. of PCs.
validationplot(pcr_model,val.type = "R2")
 If we want to fit pcr for 3 principal components and hence get the predicted values we can write:
pred = predict(pcr_model,data1,ncomp = 3)
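The two steps of PCR can also be reproduced by hand with base R, which makes the mechanics explicit. This is a sketch rather than a replacement for pcr(): it assumes the same standardization of the predictors and skips cross-validation; the names pc, scores and pcr_by_hand are illustrative.

```r
# Step 1: principal components of the standardized longley predictors
X <- longley[, !(colnames(longley) %in% c("Year", "Employed"))]
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Step 2: regress the response on the first three component scores
scores <- as.data.frame(pc$x[, 1:3])
scores$Employed <- longley$Employed
pcr_by_hand <- lm(Employed ~ PC1 + PC2 + PC3, data = scores)

# Share of the variance in Employed explained by three components
summary(pcr_by_hand)$r.squared
```

If the scaling conventions match, this R-squared should be close to the training percentage reported for 3 components in the pcr() summary.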

9. Partial Least Squares (PLS) Regression 

It is an alternative to principal components regression for when the independent variables are highly correlated. It is also useful when there is a large number of independent variables.

Difference between PLS and PCR
Both techniques create new independent variables called components, which are linear combinations of the original predictor variables. PCR creates components to explain the observed variability in the predictor variables, without considering the response variable at all. PLS, in contrast, takes the dependent variable into account, and therefore often leads to models that are able to fit the dependent variable with fewer components.
PLS Regression in R
library(plsdepot)
data(vehicles)
pls.model = plsreg1(vehicles[, c(1:12,14:16)], vehicles[, 13], comps = 3)
# R-Square
pls.model$R2

10. Support Vector Regression

Support vector regression can fit both linear and non-linear models. For non-linear models it uses kernel functions (such as polynomial) to find the optimal solution.

The main idea of SVR is to minimize error by identifying the hyperplane that maximizes the margin.
library(e1071)
# data is assumed to be a data frame with columns X and Y
svr.model <- svm(Y ~ X , data)
pred <- predict(svr.model, data)
# Overlay the predictions on an existing scatter plot of Y against X
points(data$X, pred, col = "red", pch=4)

11. Ordinal Regression

Ordinal Regression is used to predict ranked values. In simple words, this type of regression is suitable when dependent variable is ordinal in nature. Example of ordinal variables - Survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe).

Why can't we use linear regression when dealing with an ordinal target variable?

Linear regression assumes that changes in the level of the dependent variable are equivalent throughout its range. For example, the difference in weight between a person who is 100 kg and a person who is 120 kg is 20 kg, which has the same meaning as the difference in weight between a person who is 150 kg and a person who is 170 kg. These relationships do not necessarily hold for ordinal variables.
library(ordinal)
o.model <- clm(rating ~ ., data = wine)

12. Poisson Regression

Poisson regression is used when the dependent variable is a count.

Application of Poisson Regression -
  1. Predicting the number of calls in customer care related to a particular product
  2. Estimating the number of emergency service calls during an event
The dependent variable must meet the following conditions:
  1. The dependent variable has a Poisson distribution.
  2. Counts cannot be negative.
  3. The method is not suitable for non-whole numbers.

In the code below, we are using dataset named warpbreaks which shows the number of breaks in Yarn during weaving. In this case, the model includes terms for wool type, wool tension and the interaction between the two.
pos.model<-glm(breaks~wool*tension, data = warpbreaks, family=poisson)

13. Negative Binomial Regression

Like Poisson regression, it also deals with count data. How is it different from Poisson regression? Negative binomial regression does not assume that the variance of the count equals its mean, whereas Poisson regression does.
When the variance of count data is greater than the mean count, it is a case of overdispersion. When the variance is less than the mean, it is a case of under-dispersion.
library(MASS)
nb.model <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)

14. Quasi Poisson Regression

It is an alternative to negative binomial regression and can also be used for overdispersed count data. Both approaches give similar results, but they differ in how they estimate the effects of covariates. The variance of a quasi-Poisson model is a linear function of the mean, while the variance of a negative binomial model is a quadratic function of the mean.
library(MASS) # for the quine data
qs.pos.model <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine,  family = "quasipoisson")
Quasi-Poisson regression can handle both over-dispersion and under-dispersion.
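What the quasi-Poisson family changes in practice can be seen by fitting both families to the same data. The sketch below reuses the built-in warpbreaks data from the Poisson section; the object names pois, quasi and dispersion are illustrative.

```r
pois  <- glm(breaks ~ wool * tension, data = warpbreaks, family = poisson)
quasi <- glm(breaks ~ wool * tension, data = warpbreaks, family = quasipoisson)

# Pearson dispersion estimate: values well above 1 suggest overdispersion
dispersion <- sum(residuals(pois, type = "pearson")^2) / df.residual(pois)
dispersion

# Same coefficients, standard errors inflated by sqrt(dispersion)
se_pois  <- summary(pois)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi)$coefficients[, "Std. Error"]
cbind(coef = coef(quasi), se_pois, se_quasi, ratio = se_quasi / se_pois)
```

The point estimates are identical; only the uncertainty around them changes, which is why p-values from a plain Poisson fit can be too optimistic on overdispersed counts.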

15. Cox Regression

Cox Regression is suitable for time-to-event data. See the examples below -
  1. Time from when a customer opened the account until attrition.
  2. Time after cancer treatment until death.
  3. Time from first heart attack to the second.
Logistic regression uses a binary dependent variable but ignores the timing of events. 
As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare time-to-event for multiple groups.

Dual targets are set for the survival model:
1. A continuous variable representing the time to event.
2. A binary variable representing the status, i.e. whether the event occurred or not.
library(survival)
# Lung Cancer Data
# status: 2=death
lung$SurvObj <- with(lung, Surv(time, status == 2))
cox.reg <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data =  lung)
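Cox model coefficients are usually read as hazard ratios, obtained by exponentiating them. The following sketch fits a smaller model (age and sex only, an illustrative choice) on the same lung data and assumes the survival package is installed.

```r
library(survival)

# Fit with Surv() directly in the formula; lung ships with the survival package
fit <- coxph(Surv(time, status == 2) ~ age + sex, data = lung)

# Hazard ratios with 95% confidence intervals
hr <- exp(cbind(HR = coef(fit), confint(fit)))
round(hr, 3)
```

A hazard ratio below 1 means a lower risk of the event per unit increase in the covariate, and a ratio above 1 means a higher risk.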

How to choose the correct regression model?
  1. If the dependent variable is continuous and the model suffers from collinearity, or there are a lot of independent variables, you can try PCR, PLS, ridge, lasso and elastic net regression. You can select the final model based on adjusted R-squared, RMSE, AIC and BIC.
  2. If you are working with count data, you should try Poisson, quasi-Poisson and negative binomial regression.
  3. To avoid overfitting, use cross-validation to evaluate models used for prediction. Ridge, lasso and elastic net regression can also be used to correct overfitting.
  4. Try support vector regression when you have a non-linear model.
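As an illustration of comparing candidate models for count data, the sketch below fits Poisson and negative binomial models to the quine data used earlier and compares them by AIC. It assumes the MASS package is installed; the quasi-Poisson model is left out because quasi families have no likelihood and therefore no AIC.

```r
library(MASS)  # provides glm.nb() and the quine data

pois <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine, family = poisson)
nb   <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)

# Lower AIC indicates the better trade-off between fit and complexity
AIC(pois, nb)
```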

Use R to interface with SAS Cloud Analytics Services

The R SWAT package (SAS Wrapper for Analytics Transfer) enables you to upload big data into an in-memory distributed environment to manage data and create predictive models using familiar R syntax. In the SAS Viya Integration with Open Source Languages: R course, you learn the syntax and methodology required to [...]

The post Use R to interface with SAS Cloud Analytics Services appeared first on SAS Learning Post.
