28 May

Take Screenshot of Webpage using R

Programmatically taking screenshots of a web page is essential in a testing environment to verify how the page renders. The same capability is also useful for automation, such as getting a screenshot of a news website into your inbox every morning or generating a report of candidates' GitHub activity. But this wasn't possible from the command line until the rise of headless browsers and the JavaScript libraries that support them. Even when such JavaScript libraries were made available, R programmers did not have an easy way to integrate that functionality into their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with PhantomJS running in the backend.
Take Screenshot from R


What is PhantomJS?

PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

PhantomJS is an optimal solution for the following:
  • Headless website testing
  • Screen Capture
  • Page Automation
  • Network Monitoring

Webshot : R Package 

The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).

Install and Load Package

The stable version of webshot is available on CRAN, hence it can be installed using the code below:
install.packages('webshot')
library('webshot')

Also, the latest development version of webshot is hosted on GitHub and can be installed using the code below:
#install.packages('devtools')
devtools::install_github('wch/webshot')

Initial Setup

As we saw above, the webshot package works with PhantomJS in the backend, so it is essential to have PhantomJS installed on the local machine where the webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.
webshot::install_phantomjs()
The above function automatically downloads PhantomJS from its website and installs it. Please note this is only a first-time setup: once both webshot and PhantomJS are installed, these two steps can be skipped when using the package as described in the sections below.

Now the webshot package is installed and set up, ready to use. To start with, let us take a PDF copy of a web page.

Screenshot Function

The webshot package provides one simple function, webshot( ), that takes a web page URL as its first argument and saves the screenshot to the file name given as its second argument. It is important to note that the file name includes a file extension such as '.jpg', '.png' or '.pdf', which determines how the output file is rendered. Below is the basic structure of the function:
library(webshot)

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understand the basics of the webshot( ) function, it is time to begin with our cases - starting with downloading/converting a web page as a PDF copy.

Case #1: PDF Copy of WebPage

Let us assume, we would like to download Bill Gates' notes on Best Books of 2017 as a PDF copy.

#loading the required library
library(webshot)

#PDF copy of a web page / article
webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
        "billgates_book.pdf",
        delay = 2)

The above code generates a PDF whose (partial) screenshot is below:
Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot( ) function is supplied with three arguments:
  1. URL from which the screenshot has to be taken. 
  2. Output Filename along with its file extensions. 
  3. Time to wait before taking screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.
Thus, a webpage can be converted/downloaded as a PDF programmatically in R.
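If the default rendering looks cramped, the viewport size can also be set before the capture. The sketch below assumes the vwidth and vheight arguments documented for the CRAN version of webshot( ); the output file name is made up for illustration.
library(webshot)

#same page, but rendered in a 1200 x 900 pixel viewport before capture
webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
        "billgates_book_wide.pdf",
        vwidth = 1200,   #viewport width in pixels
        vheight = 900,   #viewport height in pixels
        delay = 2)       #still wait 2 seconds for assets to load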

Case #2: Webpage Screenshot (Viewport Size)

Now, I'd like to set up an automation script that takes a screenshot of a news website and perhaps sends it to my inbox, so I can see the headlines without opening the browser. Here we will see how to take a simple screenshot of livemint.com, an Indian news website.
#Screenshot of Viewport
webshot('https://www.livemint.com/','livemint.png', cliprect = 'viewport')
While the first two arguments are the same as in the previous call, there is a new third argument, cliprect, which specifies the size of the clipping rectangle.

If cliprect is unspecified, a screenshot of the complete web page is taken (as in the previous case). Since we are interested in only the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips the screenshot to the viewport of the browser, as below.

Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots

So far we have taken simple screenshots of whole pages, dealing with one screenshot and one file, but that is not what usually happens when you are automating things or performing them programmatically. In most cases we end up performing more than one action, hence this case deals with taking multiple screenshots and saving multiple files. Instead of taking multiple screenshots of different URLs (which is quite straightforward), we will take screenshots of different sections of the same web page using different CSS selectors and save them in separate files.
#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
 file = c("organizations.png","contributions.png"),
 selector = list("div.border-top.py-3.clearfix","div.js-contribution-graph"))
In the above code, we take screenshots of two CSS selectors from the GitHub profile page of Hadley Wickham and save them in two PNG files - organizations.png and contributions.png.

Contributions.png

Organizations.png
Thus, we have seen how to use the R package webshot for taking screenshots programmatically in R. Hope this post helps fuel your automation needs and helps your organisation improve its efficiency.

28 Mar

Using Excel for Data Entry

This article shows you how to enter data so that you can easily open it in statistics packages such as R, SAS, SPSS, or jamovi (code or GUI steps below). Excel has some statistical analysis capabilities, but they often provide incorrect answers. For a comprehensive list of these limitations, see http://www.forecastingprinciples.com/paperpdf/McCullough.pdf and http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction.

Simple Data Sets

Most data sets are easy to enter using the following rules.

  • All your data should be in a single spreadsheet of a single file (for an exception to this rule, see Relational Data Sets below.)
  • Enter variable names in the first row of the spreadsheet.
  • Consider the length of your variable names. If you know for sure what software you will use, follow its rules for how many characters names can contain. When in doubt, use variable names that are no longer than 8 characters, beginning with a letter. Those short names can be used by any software.
  • Variable names should not contain spaces, but may use the underscore character.
  • No other text rows such as titles should be in the spreadsheet.
  • No blank rows should appear in the data.
  • Always include an ID variable on your original data collection form and in the spreadsheet to help you find the case again if you need to correct errors. You may need to sort the data later, after which the row number in Excel would then apply to a different subject or sampling unit, making it hard to find.
  • Position the ID variable in the left-most column for easy reference. 
  • If you have multiple groups, put them in the same spreadsheet along with a variable that indicates group membership (see Gender example below).
  • Many statistics packages don’t work well with alphabetic characters representing categorical values. For example to enter political party, you might enter 1 instead of Democrat, 2 instead of Republican and 3 instead of Other.
  • Avoid the use of special characters in numeric columns. Currency signs ($, €, etc.) can cause trouble in some programs.
  • If your group has only two levels, coding them 0 and 1 makes some analyses (e.g. linear regression) much easier to do. If the data are logical, use 0 for false, and 1 for true.
    If the data represent gender, it’s common to use 0 for female, 1 for male.
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, some software (like R) will read the column as character data so you will not be able to, for example, calculate the mean of a column without taking action to address the situation.
  • You can enter dates with slashes (8/31/2018) and times with colons (12:15 AM). Note that dates are recorded differently across countries, so make sure you are using a format that matches your locale.
  • For text analysis, you can enter up to 32K of text, or about 8 pages, in a single cell. However, if you cut & paste it from elsewhere, remove carriage returns first as they will cause it to jump to a new cell.

Relational Data Sets

Some data sets contain observations that are related in some way. They may be people who all live in the same home, or samples that all came from the same site. There may be higher levels of relations, such as students within classrooms, then classrooms within schools. Data that contains such relations (a.k.a. nesting) may be stored in a “relational” database, but those are harder to learn than spreadsheet software. Relational data can easily be entered as two or more spreadsheets and combined later during data analysis. This saves quite a lot of data entry as the higher level data (e.g. family house value, socio-economic status, etc.) only needs to be entered once, instead of on several lines (e.g. for each family member).

If you have such data, make sure that each data set contains a “key” variable that acts as a  common ID number for family, site, school, etc. You can later read two files at a time and combine them matching on that key variable. R calls this combination a join or merge; SAS calls it a merge; and SPSS calls it Add Variables.
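As a rough illustration, here is how that combination might look in R; the file names and the key variable FamilyID below are hypothetical.
#Read the person-level and family-level spreadsheets
library("readxl")
people   <- read_excel("people.xlsx")
families <- read_excel("families.xlsx")

#Join (merge) the family-level variables onto each person's row,
#matching on the key variable FamilyID
combined <- merge(people, families, by = "FamilyID")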

Example of a Good Data Structure

This data set follows all the rules for simple data sets above. Any statistics software can read it easily.

ID   Gender   Income
1    0         32000
2    1         23000
3    0        137000
4    1         54000
5    1         48500

Example of a Bad Data Structure

This is the same data shown above, but it violates the rules for simple data sets in several ways: there is no column for gender, the income values contain dollar signs and commas, variable names appear on more than one line, variable names are not even consistent (income vs. salary), and there is a blank line in the middle. This would not be easy to read!

Data for Female Subjects
ID   Income
1    $32,000
3    $137,000

Data for Male Subjects
ID   Salary
2    $23,000
4    $54,000
5    $48,500

Excel Tips for Data Entry

  • You can make sure your variable names are always visible at the top of your Excel spreadsheet by choosing View> Freeze Panes> Freeze Top Row. This helps you enter data in the proper columns.
  • Avoid using Excel to sort your data. It’s too easy to sort one column independent of the others, which destroys your data! Statistics packages can sort data and they understand the importance of keeping all the values in each row locked together.
  • If you need to enter a pattern of consecutive values such as an ID number with values such as 1,2,3 or 1001,1002,1003, enter the first two, select those cells, then drag the tiny square in the lower right corner as far downward as you wish. Excel will see the pattern of the first two entries and extend it as far as you drag your selection. This works for days of the week and dates too. You can create your own lists in Options>Lists, if you use a certain pattern often.
  • To help prevent typos, you can set minimum and maximum values, or create a list of valid values. Select a column or set of similar columns, then go to the Data tab, then the Data Tools group, and choose Validation. To set minimum and maximum values, choose Allow: Whole Number or Decimals and then fill in the values in the Minimum and Maximum boxes. To create a list of valid values, choose Allow: List and then fill in the numeric or character values separated by commas in the Source box. Note that these rules only operate as you enter data, they will not help you find improper values that you have already entered.
  • The gold standard for data accuracy is the dual entry method. With this method you actually enter all the data twice. Only this method can catch errors that are within the normal range of values, but still wrong. Excel can show you where the values differ. Enter the data first in Sheet1. Then enter it again using the exact same layout in Sheet2. Finally, in Sheet1 select all cells using CTRL-A. Then choose Conditional Formatting> New Rule. Choose “Use a formula to determine which cells to format,” enter this formula:
    =A1<>Sheet2!A1
    then click the Format button, make sure the Fill tab is selected, and choose a color. Then click OK twice. The inconsistencies between the two sheets will then be highlighted in Sheet1. You then check to see which entry was wrong and fix it. When you read the data into a statistics package, you will only need to read the data in Sheet1.
  • When looking for data errors, it can be very helpful to display only a subset of values. To do this, select all the columns you wish to scan for errors, then click the Filter icon on the Data tab. A downward-pointing triangle will appear at the top of each column selected. Clicking it displays a list of the values contained in that column. If you have entered values that are supposed to be, for example, between 1 and 5 and you see 6 on this list, choosing it will show you only those rows in which you made that error. Then you can fix them. You can also click on Number Filters to use simple logic to find, for example, all rows with values greater than 5. When you are finished, click on the filter icon again to turn it off.
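A similar range check is easy once the data has been read into a statistics package. Below is a minimal R sketch, assuming a data frame called mydata with a column named score that should only contain values from 1 to 5 (both names are hypothetical).
# Show any rows whose score falls outside the valid range of 1 to 5
bad_rows <- which(mydata$score < 1 | mydata$score > 5)
mydata[bad_rows, ]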

Backups

Save your data frequently and make backup copies often. Don’t leave all your backup copies connected to a computer which would leave them vulnerable to attack by viruses. Don’t store them all in the same building or you risk losing all your hard work in a fire or theft. Get a free account at http://drive.google.com, http://dropbox.com, or http://onedrive.live.com and save copies there.

 Steps for Reading Excel Data Into R

There are several ways to read an Excel file into R. Perhaps the easiest method uses the following commands. They read an Excel file named mydata.xlsx into an R data frame called mydata. For examples of how to read many other file formats into R, see:
http://r4stats.com/examples/data-import/.

# Do this once to install:
install.packages("readxl")

# Each time you read a file, follow these steps
library("readxl")
mydata <- read_excel("mydata.xlsx")
mydata 

Steps for Reading Excel Data Into SPSS

  1. In SPSS, choose File> Open> Data.
  2. Change the “Files of type” box to “Excel (*.xlsx)”
  3. When the Read Excel File box appears, select the Worksheet name and check the box for Read variable names from the first row of data, then click OK.
  4. When the data appears in the SPSS data editor spreadsheet, choose File> Save As and leave the Save as type box set to SPSS (*.sav).
  5. Enter the name of the file without the .sav extension and then click Save to save the file in SPSS format.
  6. The next time, open the .sav version so you won’t need to convert the file again.
  7. If you create variable or value labels in the SPSS file and then need to read your data from Excel again you can copy them into the new file. First, make sure you use the same variable names. Next, after opening the file in SPSS, use Copy Data Properties from the Data menu. Simply name the SPSS file that has properties (such as labels) that you want to copy, check off the things you want to copy and click OK. 

Steps for Reading Excel Data Into SAS

The code below will read an excel file called mydata.xlsx and store it as a permanent SAS dataset called sasuser.mydata. If your organization is considering migrating from SAS to R, I offer some tips here: http://r4stats.com/articles/migrate-to-r/

proc import datafile="mydata.xlsx"
dbms=xlsx out=sasuser.mydata replace;
getnames=yes;
run;

Steps for Reading Excel Data into jamovi

At the moment, jamovi can open CSV, JASP, SAS, SPSS, and Stata files, but not Excel. So you must open the data in Excel and save it as a comma-separated values (CSV) file. The ability to read Excel files should be added in a future release. For more information about the free and open source jamovi software, see my review here:
http://r4stats.com/2018/02/13/jamovi-for-r-easy-but-controversial/.
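If you already have R installed, one possible way to do that conversion without opening Excel is sketched below; the file names are illustrative.
# Convert an Excel sheet to a CSV file that jamovi can open
library("readxl")
mydata <- read_excel("mydata.xlsx")
write.csv(mydata, "mydata.csv", row.names = FALSE)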

More to Come

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter where I announce my blog posts.

27 Mar

Run Python from R

This article explains how to call or run Python from R. Both tools have their own advantages and disadvantages, and it's always a good idea to combine the best packages and functions from both. In the data science world, these tools have a good market share in terms of usage. R is mainly known for data analysis, statistical modeling and visualization, while Python is popular for deep learning and natural language processing.

In the recent KDnuggets Analytics software survey poll, Python and R were ranked the top 2 tools for data science and machine learning. If you really want to boost your career in the data science world, these are the languages you need to focus on.
Combine Python and R

RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

Install and Load Reticulate Package

Run the commands below to get this package installed and loaded on your system.
# Install reticulate package
install.packages("reticulate")

# Load reticulate package
library(reticulate)

Check whether Python is available on your system
py_available()
It returns TRUE/FALSE. If it is TRUE, it means python is installed on your system.

Import a python module within R

You can use the function import( ) to import a particular package or module.
os <- import("os")
os$getcwd()
The above program returns working directory.
[1] "C:\\Users\\DELL\\Documents"

You can use the listdir( ) function from the os module to see all the files in the working directory.
os$listdir()
 [1] ".conda"                 ".gitignore"             ".httr-oauth"
 [4] ".matplotlib"            ".RData"                 ".RDataTmp"
 [7] ".Rhistory"              "1.pdf"                  "12.pdf"
[10] "122.pdf"                "124.pdf"                "13.pdf"
[13] "1403.2805.pdf"          "2.pdf"                  "3.pdf"
[16] "AIR.xlsx"               "app.r"                  "Apps"
[19] "articles.csv"           "Attrition_Telecom.xlsx" "AUC.R"


Install Python Package

Step 1 : Create a new environment 
conda_create("r-reticulate")
Step 2 : Install a package within a conda environment
conda_install("r-reticulate", "numpy")
Since numpy is already installed, you don't need to install it again. The above example is just for demonstration.

Step 3 : Load the package
numpy <- import("numpy")

Working with numpy array

Let's create a sample numpy array
y <- array(1:4, c(2, 2))
x <- numpy$array(y)
     [,1] [,2]
[1,]    1    3
[2,]    2    4


Transpose the above array
numpy$transpose(x)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Eigenvalues and eigen vectors
numpy$linalg$eig(x)
[[1]]
[1] -0.3722813 5.3722813

[[2]]
           [,1]       [,2]
[1,] -0.9093767 -0.5657675
[2,]  0.4159736 -0.8245648

Mathematical Functions
numpy$sqrt(x)
numpy$exp(x)

Working with Python interactively

You can create an interactive Python console within R session. Objects you create within Python are available to your R session (and vice-versa).

By using repl_python() function, you can make it interactive. Download the dataset used in the program below.
repl_python()

# Load Pandas package
import pandas as pd

# Importing Dataset
travel = pd.read_excel("AIR.xlsx")

# Number of rows and columns
travel.shape

# Select random no. of rows
travel.sample(n = 10)

# Group By
travel.groupby("Year").AIR.mean()

# Filter
t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

# Return to R
exit
Note : You need to enter exit to return to the R environment.

How to access objects created in python from R

You can use the py object to access objects created within python.
summary(py$t)
In this case, I am using R's summary( ) function and accessing dataframe t which was created in python. Similarly, you can create line plot using ggplot2 package.
# Line chart using ggplot2
library(ggplot2)
ggplot(py$t, aes(AIR, Year)) + geom_line()

How to access objects created in R from Python

You can use the r object to accomplish this task. 

1. Let's create a object in R
mydata = head(cars, n=15)
2. Use the R created object within Python REPL
repl_python()
import pandas as pd
r.mydata.describe()
pd.isnull(r.mydata.speed)
exit

Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in Python. It supports various statistical and machine learning algorithms.
repl_python()

# Load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# load the iris datasets
iris = datasets.load_iris()

# Developing logit model
model = LogisticRegression()
model.fit(iris.data, iris.target)

# Scoring
actual = iris.target
predicted = model.predict(iris.data)

# Performance Metrics
print(metrics.classification_report(actual, predicted))
print(metrics.confusion_matrix(actual, predicted))

Other Useful Functions

To see configuration of python

Run the py_config( ) command to find the version of Python installed on your system. It also shows details about Anaconda and numpy.
py_config()
python:         C:\Users\DELL\ANACON~1\python.exe
libpython:      C:/Users/DELL/ANACON~1/python36.dll
pythonhome:     C:\Users\DELL\ANACON~1
version:        3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\DELL\ANACON~1\lib\site-packages\numpy
numpy_version:  1.14.2


To check whether a particular package is installed

In the following program, we are checking whether pandas package is installed or not.
py_module_available("pandas")
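One pattern that builds on this check is to install the module only when it is missing. The sketch below assumes reticulate's py_install( ) helper; the conda_install( ) call shown earlier would work as well.
# Install pandas only if it is not already available to reticulate
if (!py_module_available("pandas")) {
  py_install("pandas")
}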
25 Mar

15 Types of Regression you should know

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only the 2-3 types of regression commonly used in the real world: linear and logistic regression. But the fact is that there are more than 10 types of regression algorithms designed for various types of analysis, and each has its own significance. Every analyst must know which form of regression to use depending on the type of data and its distribution.

Table of Contents
  1. What is Regression Analysis?
  2. Terminologies related to Regression
  3. Types of Regressions
    • Linear Regression
    • Polynomial Regression
    • Logistic Regression
    • Quantile Regression
    • Ridge Regression
    • Lasso Regression
    • ElasticNet Regression
    • Principal Component Regression
    • Partial Least Square Regression
    • Support Vector Regression
    • Ordinal Regression
    • Poisson Regression
    • Negative Binomial Regression
    • Quasi-Poisson Regression
    • Cox Regression
  4. How to choose the correct Regression Model?
Regression Analysis Simplified


What is Regression Analysis?

Let's take a simple example: Suppose your manager asked you to predict annual sales. There can be hundreds of factors (drivers) that affect sales. In this case, sales is your dependent variable and the factors affecting sales are the independent variables. Regression analysis would help you solve this problem.
In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables.

It helps us to answer the following questions -
  1. Which of the drivers have a significant impact on sales?
  2. Which is the most important driver of sales?
  3. How do the drivers interact with each other?
  4. What would the annual sales be next year?

Terminologies related to regression analysis

1. Outliers
An outlier is an observation in the dataset that has a very high or very low value compared to the other observations, i.e. it does not appear to belong to the population. In simple words, it is an extreme value. An outlier is a problem because it often distorts the results we get.

2. Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables based on their importance and makes it difficult to select the most important independent variable (factor).

3. Heteroscedasticity
When dependent variable's variability is not equal across values of an independent variable, it is called heteroscedasticity. Example - As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

4. Underfitting and Overfitting
When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our algorithm works well on the training set but is unable to perform as well on the test set. It is also known as the problem of high variance.

When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. This is also known as the problem of high bias.

In the following diagram we can see that fitting a linear regression (the straight line in fig 1) would underfit the data, i.e. it will lead to large errors even on the training set. The polynomial fit in fig 2 is balanced, i.e. such a fit can work well on both the training and test sets, while the fit in fig 3 will lead to low errors on the training set but will not work well on the test set.
Underfitting vs Overfitting
Regression : Underfitting and Overfitting

Types of Regression

Every regression technique has some assumptions attached to it which we need to meet before running the analysis. These techniques differ in terms of the type of dependent and independent variables and their distribution.

1. Linear Regression

It is the simplest form of regression: a technique in which the dependent variable is continuous in nature and the relationship between the dependent variable and the independent variables is assumed to be linear. In the given plot we can observe a roughly linear relationship between the mileage and displacement of cars. The green points are the actual observations, while the black fitted line is the line of regression.

regression analysis
Regression Analysis

When you have only one independent variable and one dependent variable, it is called simple linear regression.
When you have more than one independent variable and one dependent variable, it is called multiple linear regression.
The equation of multiple linear regression is listed below -

y = β0 + β1X1 + β2X2 + … + βpXp + ε

Here 'y' is the dependent variable to be estimated, the X's are the independent variables, ε is the error term and the βi's are the regression coefficients.

Assumptions of linear regression: 
  1. There must be a linear relation between independent and dependent variables. 
  2. There should not be any outliers present. 
  3. No heteroscedasticity 
  4. Sample observations should be independent. 
  5. Error terms should be normally distributed with mean 0 and constant variance. 
  6. Absence of multicollinearity and auto-correlation.

Estimating the parameters
To estimate the regression coefficients βi's we use the principle of least squares, which is to minimize the sum of squares due to the error terms, i.e.

minimize Σ (yi − β0 − β1Xi1 − … − βpXip)²

On solving the above equation mathematically we obtain the regression coefficients as:

β̂ = (XᵀX)⁻¹Xᵀy
Interpretation of regression coefficients
Let us consider an example where the dependent variable is the marks obtained by a student and the explanatory variables are the number of hours studied and the number of classes attended. Suppose on fitting linear regression we got the fitted equation as:
Marks obtained = 5 + 2*(no. of hours studied) + 0.5*(no. of classes attended)
Thus the regression coefficients 2 and 0.5 can be interpreted as:
  1. If no. of hours studied and no. of classes are 0 then the student will obtain 5 marks.
  2. Keeping the no. of classes attended constant, if the student studies for one more hour then he will score 2 more marks in the examination.
  3. Similarly, keeping the no. of hours studied constant, if the student attends one more class then he will attain 0.5 more marks.
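As a quick worked check of that fitted equation, a (made-up) student who studies 10 hours and attends 12 classes is predicted to score 5 + 2(10) + 0.5(12) = 31 marks:
# Predicted marks from the fitted equation above
# (10 hours studied and 12 classes attended are made-up values)
hours   <- 10
classes <- 12
5 + 2 * hours + 0.5 * classes   # returns 31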

Linear Regression in R
We consider the swiss data set for carrying out linear regression in R. We use the lm() function, which is available in base R (the stats package). We try to estimate Fertility with the help of the other variables.
library(datasets)
model = lm(Fertility ~ .,data = swiss)
lm_coeff = model$coefficients
lm_coeff
summary(model)

The output we get is:

> lm_coeff
     (Intercept)      Agriculture      Examination        Education         Catholic 
      66.9151817       -0.1721140       -0.2580082       -0.8709401        0.1041153 
Infant.Mortality 
       1.0770481 
> summary(model)

Call:
lm(formula = Fertility ~ ., data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2743  -5.2617   0.5032   4.1198  15.3213 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
Examination      -0.25801    0.25388  -1.016  0.31546    
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 ** 
Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared:  0.7067, Adjusted R-squared:  0.671
F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
Hence we can see that 70% of the variation in Fertility rate can be explained via linear regression.

2. Polynomial Regression

It is a technique to fit a non-linear relationship by taking polynomial functions of the independent variable.
In the figure given below, you can see that the red curve fits the data better than the green curve. Hence, in situations where the relationship between the dependent and independent variable seems to be non-linear, we can deploy polynomial regression models.
Thus a polynomial of degree k in one variable is written as:

y = β0 + β1x + β2x² + … + βkx^k + ε

Here we can create new features like

X1 = x,  X2 = x²,  … ,  Xk = x^k

and can fit linear regression in a similar manner.

In case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e.

X3 = X1 · X2

Disclaimer: It is to be kept in mind that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting.

Polynomial regression in R:
We are using the poly.csv data for fitting polynomial regression, where we try to estimate the price of a house given its area.

Firstly we read the data using read.csv( ) and divide it into the dependent and independent variable
data = read.csv("poly.csv")
x = data$Area
y = data$Price
In order to compare the results of linear and polynomial regression, firstly we fit linear regression:
model1 = lm(y ~x)
model1$fit
model1$coeff

The coefficients and predicted values obtained are:
> model1$fit
       1        2        3        4        5        6        7        8        9       10 
169.0995 178.9081 188.7167 218.1424 223.0467 266.6949 291.7068 296.6111 316.2282 335.8454 
> model1$coeff
 (Intercept)            x 
120.05663769   0.09808581 
We create a new set of predictor columns containing x and x squared (cbind returns a matrix here).

new_x = cbind(x,x^2)

new_x
         x        
 [1,]  500  250000
 [2,]  600  360000
 [3,]  700  490000
 [4,] 1000 1000000
 [5,] 1050 1102500
 [6,] 1495 2235025
 [7,] 1750 3062500
 [8,] 1800 3240000
 [9,] 2000 4000000
[10,] 2200 4840000
Now we fit usual OLS to the new data:
model2 = lm(y~new_x)
model2$fit
model2$coeff

The fitted values and regression coefficients of polynomial regression are:
> model2$fit
       1        2        3        4        5        6        7        8        9       10 
122.5388 153.9997 182.6550 251.7872 260.8543 310.6514 314.1467 312.6928 299.8631 275.8110 
> model2$coeff
  (Intercept)        new_xx         new_x 
-7.684980e+01  4.689175e-01 -1.402805e-04 

Using ggplot2 package we try to create a plot to compare the curves by both linear and polynomial regression.
library(ggplot2)
ggplot(data = data) + geom_point(aes(x = Area, y = Price)) +
  geom_line(aes(x = Area, y = model1$fit), color = "red") +
  geom_line(aes(x = Area, y = model2$fit), color = "blue") +
  theme(panel.background = element_blank())



3. Logistic Regression

In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable.

Here my model is:

log( P(Y=1) / (1 − P(Y=1)) ) = β0 + β1X1 + … + βkXk

Why don't we use linear regression in this case?
  • The homoscedasticity assumption is violated.
  • Errors are not normally distributed
  • y follows binomial distribution and hence is not normal.

Examples
  • HR Analytics: IT firms recruit a large number of people, but one of the problems they encounter is that after accepting the job offer many candidates do not join. This results in cost overruns because they have to repeat the entire process again. Now, when you get an application, can you actually predict whether that applicant is likely to join the organization (binary outcome - Join / Not Join)?

  • Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

Predicting the category of dependent variable for a given vector X of independent variables
Through logistic regression we have -
P(Y=1) = exp(β0 + β1X1 + … + βkXk) / (1 + exp(β0 + β1X1 + … + βkXk))

Thus we choose a cut-off probability, say 'p', and if P(Yi = 1) > p then we say that Yi belongs to class 1, otherwise class 0.

Interpreting the logistic regression coefficients (Concept of Odds Ratio)
If we take the exponential of a coefficient, we get the odds ratio for that explanatory variable. Suppose the odds ratio is equal to two; then the odds of the event are 2 times greater than the odds of the non-event. For example, suppose the dependent variable is customer attrition (whether the customer will close their relationship with the company) and the independent variable is citizenship status (National / Expat), with an odds ratio of 3: then the odds of an expat attriting are 3 times greater than the odds of a national attriting.

Logistic Regression in R:
In this case, we are trying to estimate whether a person will have cancer depending on whether or not he smokes.


We fit logistic regression with glm( )  function and we set family = "binomial"
model <- glm(Lung.Cancer..Y.~Smoking..X.,data = data, family = "binomial")
The predicted probabilities are given by:
#Predicted Probabilities

model$fitted.values
        1         2         3         4         5         6         7         8         9 
0.4545455 0.4545455 0.6428571 0.6428571 0.4545455 0.4545455 0.4545455 0.4545455 0.6428571 
       10        11        12        13        14        15        16        17        18 
0.6428571 0.4545455 0.4545455 0.6428571 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571 
       19        20        21        22        23        24        25 
0.6428571 0.4545455 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571 
Predicting whether the person will have cancer or not when we choose the cut off probability to be 0.5
data$prediction <- model$fitted.values>0.5
> data$prediction
[1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
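To obtain the odds ratios discussed above from this fitted model, we can exponentiate its coefficients. A minimal sketch using the model object created above:
# Odds ratios for the logistic regression fitted above
exp(coef(model))

# e.g. an odds ratio of 2 would mean the odds of cancer double
# for a one-unit increase in that predictor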

4. Quantile Regression

Quantile regression is an extension of linear regression, and we generally use it when outliers, high skewness or heteroscedasticity exist in the data.

In linear regression, we predict the mean of the dependent variable for given independent variables. Since the mean does not describe the whole distribution, modeling the mean is not a full description of the relationship between the dependent and independent variables. So we can use quantile regression, which predicts a quantile (or percentile) of the dependent variable for given independent variables.
The term “quantile” is the same as “percentile”

Basic idea of quantile regression: In quantile regression we try to estimate the quantile of the dependent variable given the values of the X's. Note that the dependent variable should be continuous.

The quantile regression model:
For the qth quantile we have the following regression model:

y = β0(q) + β1(q)X1 + … + βp(q)Xp + ε

This seems similar to the linear regression model, but here the objective function we consider to minimize is:

q · Σ |yi − xi'β|  (over observations with yi ≥ xi'β)  +  (1 − q) · Σ |yi − xi'β|  (over observations with yi < xi'β)

where q is the qth quantile.

If q = 0.5, i.e. if we are interested in the median, then it becomes median regression (or least absolute deviation regression), and substituting the value of q = 0.5 in the above equation we get the objective function as:

0.5 · Σ |yi − xi'β|  (over all observations)
Interpreting the coefficients in quantile regression:
Suppose the regression equation for the 25th quantile of y is:
y = 5.2333 + 700.823 x

It means that for a one unit increase in x, the estimated 25th quantile of y increases by 700.823 units.
Advantages of Quantile over Linear Regression
  • Quite beneficial when heteroscedasticity is present in the data.
  • Robust to outliers
  • Distribution of dependent variable can be described via various quantiles.
  • It is more useful than linear regression when the data is skewed.

Disclaimer on using quantile regression!
It is to be kept in mind that the coefficients we get from quantile regression for a particular quantile should differ significantly from those we obtain from linear regression; if they do not, our use of quantile regression isn't justifiable. This can be checked by comparing the confidence intervals of the regression coefficients obtained from both regressions.

Quantile Regression in R
We need to install quantreg package in order to carry out quantile regression.

install.packages("quantreg")
library(quantreg)

Using the rq function we estimate the 25th quantile of Fertility in the swiss data. For this we set tau = 0.25.

model1 = rq(Fertility~.,data = swiss,tau = 0.25)
summary(model1)
tau: [1] 0.25

Coefficients:
                 coefficients lower bd upper bd
(Intercept)          76.63132  2.12518 93.99111
Agriculture          -0.18242 -0.44407  0.10603
Examination          -0.53411 -0.91580  0.63449
Education            -0.82689 -1.25865 -0.50734
Catholic              0.06116  0.00420  0.22848
Infant.Mortality      0.69341 -0.10562  2.36095

Setting tau = 0.5 we run the median regression.
model2 = rq(Fertility~.,data = swiss,tau = 0.5)
summary(model2)

tau: [1] 0.5

Coefficients:
                 coefficients lower bd upper bd
(Intercept)          63.49087 38.04597 87.66320
Agriculture          -0.20222 -0.32091 -0.05780
Examination          -0.45678 -1.04305  0.34613
Education            -0.79138 -1.25182 -0.06436
Catholic              0.10385  0.01947  0.15534
Infant.Mortality      1.45550  0.87146  2.21101

We can run quantile regression for multiple quantiles in a single plot.
model3 = rq(Fertility~.,data = swiss, tau = seq(0.05,0.95,by = 0.05))
quantplot = summary(model3)
quantplot

We can check whether our quantile regression results differ from the OLS results using plots.

plot(quantplot)
We get the following plot:

The various quantiles are depicted on the X axis. The red central line denotes the OLS coefficient estimates and the dotted red lines are the confidence intervals around those OLS coefficients. The black dotted line shows the quantile regression estimates and the gray area is their confidence interval across the quantiles. We can see that for all the variables, both sets of estimates coincide for most of the quantiles; hence our use of quantile regression is not justifiable for those quantiles. In other words, we want the red and the gray regions to overlap as little as possible to justify our use of quantile regression.

5. Ridge Regression

It's important to understand the concept of regularization before jumping to ridge regression.

1. Regularization

Regularization helps to solve the overfitting problem, in which a model performs well on training data but poorly on validation (test) data. Regularization solves this by adding a penalty term to the objective function and controlling the model complexity with that penalty term.

Regularization is generally useful in the following situations:
  1. Large number of variables
  2. Low ratio of number observations to number of variables
  3. High Multi-Collinearity

2. L1 Loss function or L1 Regularization

In L1 regularization we try to minimize the objective function by adding a penalty term equal to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. Lasso regression makes use of L1 regularization.

3. L2 Loss function or L2 Regularization

In L2 regularization we try to minimize the objective function by adding a penalty term equal to the sum of the squares of the coefficients. Ridge regression, or shrinkage regression, makes use of L2 regularization.

In general, L2 regularization performs better than L1 and is more efficient in terms of computation. There is one area where L1 is considered a preferred option over L2: L1 has built-in feature selection for sparse feature spaces. For example, suppose you are predicting whether a person has a brain tumor using more than 20,000 genetic markers (features). It is known that the vast majority of genes have little or no effect on the presence or severity of most diseases.

In the linear regression objective function we try to minimize the sum of squares of the errors. In ridge regression (also known as shrinkage regression) we add a constraint on the sum of squares of the regression coefficients. Thus, in ridge regression our objective function is:

minimize Σ (yi − β0 − Σj βjXij)²  +  λ Σj βj²

Here λ is the regularization parameter, which is a non-negative number. Here we do not assume normality in the error terms.

Very Important Note: 
We do not regularize the intercept term. The constraint is just on the sum of squares of regression coefficients of X's.
We can see that ridge regression makes use of L2 regularization.


On solving the above objective function we can get the estimates of β as:

β̂ = (XᵀX + λI)⁻¹Xᵀy

How can we choose the regularization parameter λ?

If we choose lambda = 0 then we get back to the usual OLS estimates. If lambda is chosen to be very large then it will lead to underfitting. Thus it is highly important to determine a desirable value of lambda. To tackle this issue, we plot the parameter estimates against different values of lambda and select the minimum value of λ after which the parameters tend to stabilize.

R code for Ridge Regression

Considering the swiss data set, we create two different datasets, one containing dependent variable and other containing independent variables.
X = swiss[,-1]
y = swiss[,1]

We need to load glmnet library to carry out ridge regression.
library(glmnet)
Using the cv.glmnet( ) function we can do cross-validation. Setting alpha = 0 means we are carrying out ridge regression. lambda is a sequence of values of lambda which will be used for cross-validation.
set.seed(123) #Setting the seed to get similar results.
model = cv.glmnet(as.matrix(X),y,alpha = 0,lambda = 10^seq(4,-1,-0.1))

We take the best lambda by using lambda.min and hence get the regression coefficients using predict function.
best_lambda = model$lambda.min

ridge_coeff = predict(model,s = best_lambda,type = "coefficients")
ridge_coeff
The coefficients obtained using ridge regression are:
6 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)      64.92994664
Agriculture      -0.13619967
Examination      -0.31024840
Education        -0.75679979
Catholic          0.08978917
Infant.Mortality  1.09527837
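To see how this lambda was chosen, one option is to plot the cross-validation curve stored in the cv.glmnet object fitted above; the dashed vertical lines in that plot should mark lambda.min and lambda.1se. A minimal sketch:
# Cross-validated error across the lambda sequence
plot(model)

# The lambda actually used for the coefficients above
model$lambda.min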

6. Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It makes use of the L1 regularization technique in the objective function. Thus the objective function in lasso regression becomes:

minimize Σ (yi − β0 − Σj βjXij)²  +  λ Σj |βj|

λ is the regularization parameter and the intercept term is not regularized.
We do not assume that the error terms are normally distributed.
For the estimates we don't have a closed-form mathematical formula, but we can obtain them using statistical software.

Note that lasso regression also needs standardization.

Advantage of lasso over ridge regression

Lasso regression can perform built-in variable selection as well as parameter shrinkage, whereas with ridge regression one may end up keeping all the variables, only with shrunken parameters.

R code for Lasso Regression

Considering the swiss dataset from "datasets" package, we have: 
#Creating dependent and independent variables.
X = swiss[,-1]
y = swiss[,1]
Using cv.glmnet from the glmnet package we do cross-validation. For lasso regression we set alpha = 1. By default standardize = TRUE, hence we do not need to standardize the variables separately.
#Setting the seed for reproducibility
set.seed(123)
model = cv.glmnet(as.matrix(X),y,alpha = 1,lambda = 10^seq(4,-1,-0.1))
#By default standardize = TRUE

We consider the best value of lambda by filtering out lambda.min from the model and then get the coefficients using the predict function.
#Taking the best lambda
best_lambda = model$lambda.min
lasso_coeff = predict(model,s = best_lambda,type = "coefficients")
lasso_coeff
The lasso coefficients we got are:
6 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)      65.46374579
Agriculture      -0.14994107
Examination      -0.24310141
Education        -0.83632674
Catholic          0.09913931
Infant.Mortality  1.07238898


Which one is better - Ridge regression or Lasso regression?

Both ridge regression and lasso regression are designed to deal with multicollinearity.
Ridge regression is computationally more efficient than lasso regression, but either one can perform better on a given problem. So the best approach is to select the regression model which fits the test set data well.
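One simple way to make that comparison on the swiss data used above is to look at each model's cross-validated error at its best lambda (cv.glmnet stores these values in cvm). A minimal sketch:
library(glmnet)

# swiss predictors and response, as created earlier in this post
X <- as.matrix(swiss[, -1])
y <- swiss[, 1]

set.seed(123)
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = 10^seq(4, -1, -0.1))
lasso_cv <- cv.glmnet(X, y, alpha = 1, lambda = 10^seq(4, -1, -0.1))

# Cross-validated mean squared error at each model's best lambda
ridge_cv$cvm[ridge_cv$lambda == ridge_cv$lambda.min]
lasso_cv$cvm[lasso_cv$lambda == lasso_cv$lambda.min]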

7. Elastic Net Regression
Elastic Net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables.

It is a combination of both L1 and L2 regularization.

The objective function in the case of elastic net regression is:

minimize Σ (yi − β0 − Σj βjXij)²  +  λ1 Σj |βj|  +  λ2 Σj βj²
Like ridge and lasso regression, it does not assume normality.

R code for Elastic Net Regression

Setting some different value of alpha between 0 and 1 we can carry out elastic net regression.
set.seed(123)
model = cv.glmnet(as.matrix(X),y,alpha = 0.5,lambda = 10^seq(4,-1,-0.1))
#Taking the best lambda
best_lambda = model$lambda.min
en_coeff = predict(model,s = best_lambda,type = "coefficients")
en_coeff
The coefficients we obtained are:
6 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept)      65.9826227
Agriculture      -0.1570948
Examination      -0.2581747
Education        -0.8400929
Catholic          0.0998702
Infant.Mortality  1.0775714
8. Principal Components Regression (PCR) 
PCR is a regression technique which is widely used when you have many independent variables or multicollinearity exists in your data. It is divided into 2 steps:
  1. Getting the Principal components
  2. Run regression analysis on principal components
The most common features of PCR are:
  1. Dimensionality Reduction
  2. Removal of multicollinearity

Getting the Principal components

Principal components analysis is a statistical method to extract new features when the original features are highly correlated. We create new features with the help of original features such that the new features are uncorrelated.

Let us consider the first principal component, U1: it is the linear combination of the original features that has the maximum variance.
Similarly, we can find the second PC, U2, such that it is uncorrelated with U1 and has the second largest variance.
In a similar manner, for 'p' features we can have a maximum of 'p' PCs such that all the PCs are uncorrelated with each other, the first PC has the maximum variance, the second PC has the next largest variance, and so on.
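As a minimal sketch of this step on its own, prcomp( ) can extract the principal components of the swiss predictors used earlier in this post (the PCR function used below does the equivalent internally):
# Principal components of the standardized predictors
pca <- prcomp(swiss[, -1], scale. = TRUE)

# Proportion of variance explained by each PC
summary(pca)

# The uncorrelated new features (component scores) that replace the original X's
head(pca$x)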

Drawbacks:

It is to be mentioned that PCR is not a feature selection technique; instead, it is a feature extraction technique. Each principal component we obtain is a function of all the features. Hence, on using principal components, one would be unable to explain which individual factor is affecting the dependent variable and to what extent.

Principal Components Regression in R

We use the longley data set available in R, which is known for high multicollinearity. We exclude the Year column.
data1 = longley[,colnames(longley) != "Year"]

View(data1)
This is how some of the observations in our dataset look:
We use pls package in order to run PCR.
install.packages("pls")
library(pls)

In PCR we are trying to estimate the number of Employed people; scale  = T denotes that we are standardizing the variables; validation = "CV" denotes applicability of cross-validation.
pcr_model <- pcr(Employed~., data = data1, scale = TRUE, validation = "CV")
summary(pcr_model)

We get the summary as:
Data:   X dimension: 16 5 
        Y dimension: 16 1
Fit method: svdpc
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           3.627    1.194    1.118   0.5555   0.6514   0.5954
adjCV        3.627    1.186    1.111   0.5489   0.6381   0.5819

TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps
X           72.19    95.70    99.68    99.98   100.00
Employed    90.42    91.89    98.32    98.33    98.74

Here, RMSEP denotes the (cross-validated) root mean squared error of prediction, while 'TRAINING: % variance explained' shows the cumulative % of variance explained by the principal components. We can see that with 3 PCs, more than 99% of the variation in the predictors can be attributed to them.
We can also create a plot depicting the mean squares error for the number of various PCs.
validationplot(pcr_model,val.type = "MSEP")
By writing val.type = "R2" we can plot the R square for various no. of PCs.
validationplot(pcr_model,val.type = "R2")
 If we want to fit pcr for 3 principal components and hence get the predicted values we can write:
pred = predict(pcr_model,data1,ncomp = 3)

9. Partial Least Squares (PLS) Regression 

It is an alternative to principal components regression when the independent variables are highly correlated. It is also useful when there are a large number of independent variables.

Difference between PLS and PCR
Both techniques create new independent variables, called components, which are linear combinations of the original predictor variables. However, PCR creates components that explain the observed variability in the predictor variables, without considering the response variable at all, while PLS takes the dependent variable into account and therefore often leads to models that fit the dependent variable with fewer components.
PLS Regression in R
library(plsdepot)
data(vehicles)
pls.model = plsreg1(vehicles[, c(1:12,14:16)], vehicles[, 13], comps = 3)
# R-Square
pls.model$R2


10. Support Vector Regression

Support vector regression can solve both linear and non-linear models. SVM uses non-linear kernel functions (such as polynomial) to find the optimal solution for non-linear models.

The main idea of SVR is to minimize the error while finding a hyperplane that maximizes the margin.
library(e1071)
# data is assumed to be a data frame with a response column Y and a predictor column X
svr.model <- svm(Y ~ X, data = data)
pred <- predict(svr.model, data)
plot(data$X, data$Y)                        # actual observations
points(data$X, pred, col = "red", pch = 4)  # SVR predictions

11. Ordinal Regression

Ordinal Regression is used to predict ranked values. In simple words, this type of regression is suitable when dependent variable is ordinal in nature. Example of ordinal variables - Survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe).

Why we can't use linear regression when dealing with ordinal target variable?

Linear regression assumes that changes in the level of the dependent variable are equivalent throughout its range. For example, the difference in weight between a person who is 100 kg and a person who is 120 kg is 20 kg, which has the same meaning as the difference in weight between a person who is 150 kg and a person who is 170 kg. These relationships do not necessarily hold for ordinal variables.
library(ordinal)
o.model <- clm(rating ~ ., data = wine)
summary(o.model)

12. Poisson Regression

Poisson regression is used when dependent variable has count data.

Application of Poisson Regression -
  1. Predicting the number of calls in customer care related to a particular product
  2. Estimating the number of emergency service calls during an event
The dependent variable must meet the following conditions
  1. The dependent variable has a Poisson distribution.
  2. Counts cannot be negative.
  3. This method is not suitable on non-whole numbers

In the code below, we use a dataset named warpbreaks, which gives the number of breaks in yarn during weaving. In this case, the model includes terms for wool type, tension and the interaction between the two.
pos.model<-glm(breaks~wool*tension, data = warpbreaks, family=poisson)
summary(pos.model)

13. Negative Binomial Regression

Like Poisson regression, it also deals with count data. The question arises: how is it different from Poisson regression? The answer is that negative binomial regression does not assume the variance of the counts is equal to their mean, whereas Poisson regression does.
When the variance of count data is greater than the mean count, it is a case of overdispersion; the opposite is a case of underdispersion.
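Before switching from Poisson to negative binomial regression, it is worth checking for overdispersion. A minimal sketch using the same formula and quine data as the glm.nb( ) call below: the ratio of the residual deviance to its degrees of freedom should be close to 1 for a well-fitting Poisson model, and a value well above 1 suggests overdispersion.
library(MASS)  # for the quine data

# Fit the Poisson model first
pois.model <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine, family = poisson)

# Rough overdispersion check: residual deviance / residual degrees of freedom
deviance(pois.model) / df.residual(pois.model)
The negative binomial fit itself is shown next.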
library(MASS)
nb.model <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
summary(nb.model)

14. Quasi Poisson Regression

It is an alternative to negative binomial regression and can also be used for overdispersed count data. Both algorithms give similar results, but there are differences in how they estimate the effects of covariates: the variance of a quasi-Poisson model is a linear function of the mean, while the variance of a negative binomial model is a quadratic function of the mean.
qs.pos.model <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine,  family = "quasipoisson")
Quasi-Poisson regression can handle both over-dispersion and under-dispersion.


15. Cox Regression

Cox Regression is suitable for time-to-event data. See the examples below -
  1. Time from when a customer opens an account until attrition.
  2. Time after cancer treatment until death.
  3. Time from first heart attack to the second.
Logistic regression uses a binary dependent variable but ignores the timing of events. 
As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare time-to-event for multiple groups.

Dual targets are set for the survival model 
1. A continuous variable representing the time to event.
2. A binary variable representing the status whether event occurred or not.
library(survival)
# Lung Cancer Data
# status: 2=death
lung$SurvObj <- with(lung, Surv(time, status == 2))
cox.reg <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data =  lung)
cox.reg

How to choose the correct regression model?
  1. If the dependent variable is continuous and the model suffers from collinearity, or there are a lot of independent variables, you can try PCR, PLS, ridge, lasso and elastic net regressions. You can select the final model based on adjusted R-squared, RMSE, AIC and BIC.
  2. If you are working on count data, you should try Poisson, quasi-Poisson and negative binomial regression.
  3. To avoid overfitting, we can use the cross-validation method to evaluate models used for prediction. We can also use ridge, lasso and elastic net regression techniques to correct overfitting.
  4. Try support vector regression when you have a non-linear model.
6 Mar

Use R to interface with SAS Cloud Analytics Services

The R SWAT package (SAS Wrapper for Analytics Transfer) enables you to upload big data into an in-memory distributed environment to manage data and create predictive models using familiar R syntax. In the SAS Viya Integration with Open Source Languages: R course, you learn the syntax and methodology required to [...]

The post Use R to interface with SAS Cloud Analytics Services appeared first on SAS Learning Post.

26
Feb

Gartner’s 2018 Take on Data Science Tools

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2018 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the new section:

IT Research Firms

IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services, and then provide their analysis of each in reports they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. While these reports focus on companies, they often also describe how their commercial tools integrate open source tools such as R, Python, H2O, TensorFlow, and others.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.

Gartner, Inc. is one of the companies that provides such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue, or lower revenue combined with high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Hereafter, I refer to these as simply vision and ability. Figure 3a shows the resulting “Magic Quadrant” plot for 2018, and 3b shows the plot for the previous year.

The Leader’s Quadrant is the place for companies whose future direction is in line with their customers’ needs and who have the resources to execute that vision. The further to the upper-right corner, the better the combined score. KNIME is in the prime position, with H2O.ai showing greater vision but lower ability to execute. This year KNIME gained the ability to run H2O.ai algorithms, so these two may be viewed as complementary tools rather than outright competitors.

Alteryx and SAS have nearly the same combined scores, but note that Gartner studied only SAS Enterprise Miner and SAS Visual Analytics. The latter includes Visual Statistics, and Visual Data Mining and Machine Learning. Excluded was the SAS System itself since Gartner focuses on tools that are integrated. This lack of integration may explain SAS’ decline in vision from last year.

KNIME and RapidMiner are quite similar tools as they are both driven by an easy to use and reproducible workflow interface. Both offer free and open source versions, but the companies differ quite a lot in how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner’s free version is limited by a cap on the amount of data that it can analyze (10,000 cases), and as new features are added, they usually come only via a commercial license. In the previous year’s Magic Quadrant, RapidMiner was slightly ahead, but now KNIME is in the lead.

Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Figure 3b. Gartner Magic Quadrant for Data Science Platforms 2017.

The companies in the Visionaries Quadrant are those that have good future plans but may not have the resources to execute that vision. Of these, IBM took a big hit by landing here after being in the Leader’s Quadrant for several years. Now they’re in a near-tie with Microsoft and Domino. Domino shot up from the bottom of that quadrant to near the top. They integrate many different open source and commercial software packages (e.g. SAS, MATLAB) into their Domino Data Science Platform. Databricks and Dataiku offer cloud-based analytics similar to Domino, though lacking in access to commercial tools.

Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision. MathWorks, the makers of MATLAB, continues to “stay the course” with its proprietary tools while most of the competition offers much better integration into the ever-expanding universe of open source tools. Tibco replaces Quest in this quadrant due to their purchase of Statistica. Whatever will become of the red-headed stepchild of data science? Statistica has been owned by four companies in four years (Statsoft, Dell, Quest, Tibco)! Users of the software have got to be considering other options. Tibco also purchased Alpine Data in 2017, accounting for its disappearance from Figure 3b to 3a.

Members of the Niche Players quadrant offer tools that are not as broadly applicable. Anaconda is new to Gartner coverage this year. It offers in-depth support for Python. SAP has a toolchain that Gartner calls “fragmented and ambiguous.”  Angoss was recently purchased by Datawatch. Gartner points out that after 20 years in business, Angoss has only 300 loyal customers. With competition fierce in the data science arena, one can’t help but wonder how long they’ll be around. Speaking of deathwatches, once the king of Big Data, Teradata has been hammered by competition from open source tools such as Hadoop and Spark. Teradata’s net income was higher in 2008 than it is today.

As of 2/26/2018, RapidMiner is giving away copies of the Gartner report here.

26
Feb

Web Scraping Website with R

In this tutorial, we will cover how to extract information from a matrimonial website using R. We will do web scraping, which is the process of converting data available in an unstructured format on a website into a structured format that can be used for further analysis.

We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.
Web Scraping in R

Install the required packages

To download and install the rvest package, run the following commands. We will also use dplyr, which is useful for data manipulation tasks.
install.packages("rvest")
install.packages("dplyr")

Load the required Libraries

To load the libraries, run the code below.
library(rvest)
library(dplyr)

Scrape Information from Matrimonial Website

First we need to understand the structure of the URL. See the URLs below.
https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the webpage where girls' profiles from the Punjabi community are shown, whereas the second URL provides details about boys' profiles from the Punjabi community.

We need to split the main URL into different elements so that we can build it programmatically.
Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
The following R code shows how to prepare the main URL. In the code, you need to provide the following details -
  1. Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
  2. Select Mother Tongue. For example, punjabi, tamil etc.
# Looking for bride/groom
Bride_Groom = "bride"
# Possible Values : bride, groom

# Select Mother Tongue
Mother_Tongue = "punjabi"
# Possible Values
# punjabi
# tamil
# bengali
# telugu
# kannada
# marathi

# URL
if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}
See the output :
[1] "https://www.jeevansathi.com/punjabi-brides-girls"

Extract Profile IDs

First you need to select parts of an HTML document using CSS selectors with html_nodes(). Use SelectorGadget, a free Chrome extension. It is the easiest and quickest way to find out which selector pulls the data that you are interested in.

How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). It will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)
         ID
1 ZARX0345
2 ZZWX5573
3 ZWVT2173
4 ZAYZ6100
5 ZYTS6885
6 ZXYV9849
7 TRZ8475
8 VSA7284
9 ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18 WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. Explanations of these functions are listed below, followed by a small example -
  1. read_html() : creates an html document from a URL
  2. html_nodes() : extracts pieces out of HTML documents
  3. html_nodes(".class") : selects nodes based on CSS class
  4. html_nodes("#class") : selects a node based on its id attribute (e.g. on <div>, <span> or <pre> tags)
  5. html_text() : extracts only the text from an HTML tag
  6. html_attr() : extracts the contents of a single attribute
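For instance, the lines below are a minimal sketch of how html_text() and html_attr() differ, pulling the link text and the href attribute of all anchor tags from the ListenData home page (any page with links would do).
page  = read_html("https://www.listendata.com/")
links = html_nodes(page, "a")          # all anchor (<a>) tags
head(html_text(links))                 # visible link text
head(html_attr(links, "href"))         # the URLs the links point to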

Difference between .class and #class

1. .class targets the following element:
<div class="class"></div>

2. #class targets the following element:
<div id="class"></div>

How to find HTML/ CSS code of website 

Perform the steps below -
  1. In Google Chrome, right click and select the "Inspect" option, or use the shortcut Ctrl + Shift + I
  2. Select a particular section of the website.
  3. Press Ctrl + Shift + C to inspect a particular element.
  4. See the selected code under "Elements" section.

Inspect element

Get Detailed Information of Profiles

The program below performs the following tasks -
  1. Loop through profile IDs
  2. Pull information about Age, Height, Qualification etc.
  3. Extract details about appearance
  4. Fetch 'About Me' section of profiles
# Get Detailed Information
finaldf = data.frame()
for (i in 1:length(profileIDs$ID)) {
  ID = profileIDs[i, 1]
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
  FormattedInfo = data.frame(t(read_html(link) %>% html_nodes(".textTru li") %>%
                                 html_text()))
  # Final Table
  FormattedInfo = data.frame(ProfileID = ID,
                             Description = read_html(link) %>%
                               html_nodes("#myinfoView") %>%
                               html_text(),
                             Appearance = read_html(link) %>%
                               html_nodes("#section-lifestyle #appearanceView") %>%
                               html_text(),
                             FormattedInfo)

  finaldf = bind_rows(finaldf, FormattedInfo)
}

# Assign Variable Names
names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")
Web Scraping Output
Web Scraping Output PartII


Download Display Pic

To download a display pic, you first need to fetch the image URL of the profile and then use the download.file() function to download it. In the script below, you need to provide a profile ID.
# Download Profile Pic of a particular Profile
ID = "XXXXXXX"
text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
# File saved as match.jpg

Disclaimer
We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or copy the content from the website.
Other Functions of rvest
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). Refer to the case study below -

You can collect Google search results by submitting the Google search form with a search term. You need to supply the search term. Here, I entered the search term 'Datascience'.
library(rvest)
url       = "http://www.google.com"
pgsession = html_session(url)           
pgform    = html_form(pgsession)[[1]]

# Set search term
filled_form = set_values(pgform, q="Datascience")
session = submit_form(pgsession,filled_form)

# look for headings of first page
session %>% html_nodes(".g .r a") %>% html_text()
 [1] "Data science - Wikipedia"                                          
[2] "Data Science Courses | Coursera"
[3] "Data Science | edX"
[4] "Data science - Wikipedia"
[5] "DataScience.com | Enterprise Data Science Platform Provider"
[6] "Top Data Science Courses Online - Updated February 2018 - Udemy"
[7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"
[8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
[9] "Online Data Science Courses | Microsoft Professional Program"
[10] "News for Datascience"
[11] "Data Science Course - Cognitive Class"

Important Points related to Web Scraping
Please make sure of the following points -
  1. Use the website's API rather than web scraping wherever an API is available.
  2. Too many requests from a certain IP address might result in the IP address being blocked. Do not send more than 8 keyword requests to Google, and pause between requests (see the sketch after this list).
  3. Do not use web scraping for commercial purposes.
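As a simple courtesy to the server, you can add a pause between successive requests in any scraping loop. The lines below are a minimal sketch; the URLs are placeholders for illustration only.
urls = c("https://example.com/page1", "https://example.com/page2")  # placeholder URLs
for (u in urls) {
  page = read_html(u)
  # ... extract whatever you need from 'page' here ...
  Sys.sleep(2)  # pause 2 seconds before the next request
}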
14
Feb

Tutorial : Build Webapp in R using Shiny

In this tutorial, we will cover how to build a shiny app from scratch in R. It includes various examples which will help you get familiar with the shiny package.

Shiny is an R package developed by RStudio that can be used to create interactive web pages with R. In simple words, you can build a web page (online reporting tool) without knowing any web programming languages such as JavaScript, PHP or CSS.

The best part about the shiny package is that you can easily integrate R with a webpage. Suppose you want your web page to run machine learning algorithms like random forest, SVM etc. and display a summary of the model, with the flexibility of selecting inputs from the user. Shiny can do it very easily.
R : Shiny Tutorial


Shiny's prominent features

  1. Customizable widgets like sliders, drop down lists, numeric inputs and many more.
  2. Downloading datasets, graphs and tables in various formats.
  3. Uploading files.
  4. Provides utilities to create brilliant plots.
  5. In-built functions for viewing data or printing text or summaries.
  6. Reactive programming which makes data handling easier.
  7. Conditional panels that appear only when a particular condition is met.
  8. Works in any R environment (Console R, RGUI for Windows or Mac, RStudio, etc.)
  9. No need to learn another tool for online dashboarding.
  10. Can style your app with CSS / HTML (optional).

Must-have components of a shiny app

  1. User Interface (UI) : It controls the layout and appearance of the various widgets on the web page. Every shiny app requires a user interface, which is controlled by the ui script.
  2. Server : It contains the instructions that your computer needs when the user interacts with the app.
Example - You must have seen or created interactive charts in Excel. To make a chart interactive, we use drop downs, list boxes or some other user controls. When the user changes the values in the drop downs, the chart gets updated.

The UI is responsible for creating these drop downs and list boxes and telling shiny where to place these user controls and where to place the charts, while the server is responsible for creating the chart and the data in the table.

Basic layout of UI

User Interface: A simple shiny UI consists of a fluidPage which contains various panels. We can divide the display into two parts named sidebarPanel( ) and mainPanel( ). Both of the panels can be accessed using sidebarLayout( ).

In the following image you can get an idea of what the title panel, sidebar panel and main panel are in a shiny app.
  1. Title panel is a place where the title of the app is displayed.
  2. Sidebar panel is where special instructions or widgets (drop down / slider/ checkbox) are displayed to the app user. The sidebar panel appears on the left side of your app by default. You can move it to the right side by changing the position argument in the sidebar layout.
  3. Main panel is the area where all the outputs are generally placed.

Shiny Elements

Installing Shiny

First we need to install shiny package by using command install.packages( ).
install.packages("shiny")

Loading Shiny

Shiny package can be loaded by using library( ).
library(shiny)

The first simple shiny app with basic layout

ui = fluidPage(sidebarLayout(sidebarPanel("Welcome to Shiny App"),
                             mainPanel("This is main Panel")))
server = function(input, output) {  }
shinyApp(ui, server)
Sidebar Panel and Main Panel 


Guidelines for beginners to run a shiny app

Step 1 : shinyApp(ui, server) is an in-built function in the shiny package that runs the app with ui and server as the arguments. Select the code and run it. Once you do it successfully, you will find text like Listening on http://127.0.0.1:4692 on the console.

Step 2 : To create your app you need to save the code as an app.R file; a Run App button will then be displayed (in RStudio). Click on it and a new window with your app will appear.
Shiny App

Some more panels...

There are some additional panels which can be added to sidebarPanel and mainPanel depending upon the layout and requirements of the app. Some of them, which will be explained later in this tutorial, are:
Shiny : Panels


Adding a title to your App!

Using titlePanel( ) one can provide an appropriate title for the app. Note that after titlePanel a comma (,) is placed.
ui =  fluidPage(titlePanel("My favorite novels:"),
                sidebarLayout(sidebarPanel(),
                              mainPanel()))
server = function(input, output) {
}
shinyApp(ui, server)
Title : Shiny App

Using HTML tags in Shiny

Content can be added to the various panels. To change the appearance of the text with bold, italics, images, different fonts and colors, headings etc., we can use various HTML helper functions in shiny. Some of the functions that are the same in HTML and shiny are:
HTML Tags

Creating a hyperlink

A hyperlink can be created using a( ), where the first argument is the text to which the link is attached. href contains the link to the website which we want to attach.
ui =  fluidPage(sidebarLayout(sidebarPanel(
  a("Click here!", href = "http://www.listendata.com/")),
  mainPanel()))
server = function(input, output) {}
shinyApp(ui, server)

Modifying the text presentation using HTML tags.

We create an app containing a list of favorite novels. You can refer to the above mentioned table of HTML and shiny functions.
ui =  fluidPage(titlePanel("My favorite novels:"),
                sidebarLayout(sidebarPanel(
                  ("My favorite novels are:"),
                  br(),
                  h4(strong("The Kiterunner"), "a novel by", em("Khaled Hoseinni")),
                  h3(strong("Jane Eyre"), "a novel by", code("Charolette Bronte")),
                  strong(
                    "The diary of a young girl",
                    "by",
                    span("Anne Frank", style = "color:blue")
                  ),
                  div(strong("A thousand splendid suns"), "by Khaled Hoseinni", style = "color: red")
                ),
                mainPanel()))
server = function(input, output) { }
shinyApp(ui, server)

Note that "Charolette Bronte" in the app would be written in a coded style;
Difference between span( ) and div( ) span( ) wrote "Anne Frank" on the same line with "blue" color.  div( ) is similar to span( ), it is just that it creates the text in a different line. 
Shiny : HTML Formating

Introducing widgets

Various widgets are used in shiny to select various outputs. These widgets can be inserted in the ui function (anywhere in the main panel and sidebar panel).
The most commonly used widgets are:
Shiny Widgets

The following image tells how various widgets appear on running an app.

Shiny Widgets
  • 'Buttons' can be created using the actionButton and submitButton widgets
  • Single check box, multiple check box and date inputs are created using checkboxInput, checkboxGroupInput and dateInput respectively.
  • Date range is created using dateRangeInput.

Most commonly used widgets

All the widgets require an inputId which we will use to retrieve their values. This inputId is not visible to the app user. label is the heading for our widget which will be visible when the app is run. To understand this better, we create an app that collects the user's details using the widgets provided by shiny.

HelpText and TextInput

ui =  fluidPage(sidebarLayout(
  sidebarPanel(helpText("This questionnaire is subject to privacy."),
 
    textInput(inputId = "name", label = "Enter your name.")
  ),

  mainPanel()

))
server = function(input, output) { }
shinyApp(ui, server)
helptext() and Text Input


helpText( ) creates a disclaimer which will be displayed on the sidebarPanel.


Adding SliderInput
ui =  fluidPage(sidebarLayout(
  sidebarPanel(
    helpText("This questionnaire is subject to privacy."),
 
    textInput(inputId = "name", label = "Enter your name."),
    sliderInput(
      inputId = "age",
      label = "What is your age?",
      min = 1,
      max = 100,
      value = 25
    )
  ),

  mainPanel()

))
server = function(input, output) { }
shinyApp(ui, server)

In sliderInput we use "age" as the inputId, and the label which will be displayed in our app is "What is your age?". min = 1 and max = 100 depict the minimum and maximum values for our slider, and value = 25 denotes the default selected value.

SliderInput

RadioButtons, NumericInput and CheckBoxInput

ui =  fluidPage(sidebarLayout(
  sidebarPanel(
    radioButtons(
      inputId = "month",
      label = "In which month are you born?",
      choices = list(
        "Jan - March" = 1,
        "April - June" = 2,
        "July - September" = 3,
        "October - November" = 4
      ),
      selected = 1
    ),
 
    numericInput(
      inputId = "children_count",
      label = "How many children do you have?",
      value = 2,
      min = 0,
      max = 15
    ),
 
    selectInput(
      inputId  = "smoker",
      label = "Are you a smoker?",
      choices = c("Yes", "No", "I smoke rarely"),
      selected = "Yes"
    ),
 
    checkboxInput(
      inputId = "checkbox",
      label = "Are you a human?",
      value = FALSE
    ),
 
    checkboxGroupInput(
      inputId = "checkbox2",
      label = "2 + 2 = ?",
      choices = list(4, "Four", "IV", "None of the above")
    )
 
  ),

  mainPanel()

))
server = function(input, output) { }
shinyApp(ui, server)
Other common Widgets

In the radioButtons and selectInput widgets we define the list of options in the choices parameter. The selected parameter specifies the default selected option.

Using fluidRow

The output of our above app is a bit weird, right? Everything appears in the sidebarPanel and nothing in the mainPanel. We can make it a bit more sophisticated by removing the mainPanel and creating the widgets in a row.

We use fluidRow for such things. Keep in mind that the width of a row is 12 units; thus if a row contains widgets that require more than 12 units of width, the last widget will be displayed on the next row.

Let us create the above same app using fluidRow.

Our app creates textInput, sliderInput and radioButtons in one row.

ui = fluidPage(
  helpText(
    "This questionnaire is subject to privacy. All the information obtained will be confidential."
  ),

  fluidRow(
    column(4, textInput(inputId = "name", label = "Enter your name.")),

    column(
      4, sliderInput(
        inputId = "age",
        label = "What is your age?",
        min = 1,
        max = 100,
        value = 25
      )
    ),

    column(
      4, radioButtons(
        inputId = "month",
        label = "In which month are you born?",
        choices = list(
          "Jan - March" = 1,
          "April - June" = 2,
          "July - September" = 3,
          "October - November" = 4
        ),
        selected = 1
      )
    )
  ),

  fluidRow(column(
    6, numericInput(
      inputId = "children_count",
      label = "How many children do you have?",
      value = 2,
      min = 0,
      max = 15
    )
  ))
)

server = function(input, output) { }

shinyApp(ui, server)

fluidrow

In column(6, ...), 6 denotes the width required by the widget. To move to the next row, another fluidRow command is used.

Time to get some output!

So far we have been providing input to our server function, but note that the server function also has output as an argument. Thus we can have various *Output functions in the ui, such as plotOutput, tableOutput, textOutput, verbatimTextOutput and uiOutput.
These functions are defined in the ui and are given a key (the output ID); using that key we refer to them in the server function.

In the server function we use the render* functions to build the outputs. The most common render* commands, such as renderPlot, renderTable, renderText, renderPrint and renderUI, pair with the corresponding *Output functions above. A minimal pairing is sketched below.
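Here is a minimal sketch that pairs textOutput("greeting") in the ui with renderText in the server; the input ID "username" and output ID "greeting" are made up for illustration.
ui = fluidPage(
  textInput("username", "What is your name?"),
  textOutput("greeting")
)
server = function(input, output) {
  output$greeting = renderText({
    paste("Hello,", input$username)  # re-runs whenever input$username changes
  })
}
shinyApp(ui, server)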

Dealing with dates

Using dateInput( ) we can select the dates from our calendar.

The inputID is "bday", and the label which will be displayed in our app is "Enter your Date of Birth" and by default value is 1st January, 1985.

verbatimTextOutput is used in the ui and it will be referred to in the server as "text".

In the server function we use output$text to tell shiny that the following output will be displayed in verbatimTextOutput("text").

renderPrint( ) denotes that our output is to be printed, and we print the date of birth using input$bday (recall that bday was the inputId in our dateInput).
ui = fluidPage(dateInput(
  "bday",
  label = h3("Enter your Date of Birth"),
  value = "1985-01-01"
),
verbatimTextOutput("text"))

server = function(input, output) {
  output$text <- renderPrint({
    paste(input$bday, "was a blessed day!")
 
  })
}
shinyApp(ui, server)

Viewing Data

Here we are using the iris dataset and we want to display only the data for the particular species selected by the user.

Using selectInput( ) we choose the species with the inputId "specie". In the main panel we want our output as a table, thus we use tableOutput( ). In the server( ) function, output$data matches tableOutput("data") and renders a table using renderTable.

ui =  fluidPage(titlePanel("Viewing data"),
             
                sidebarLayout(sidebarPanel(
                  selectInput(
                    inputId  = "specie",
                    label = "Select the flower specie",
                    choices = c("setosa", "versicolor", "virginica"),
                    selected = "setosa"
                  )
                ),
             
                mainPanel(tableOutput("data"))))
server = function(input, output) {
  output$data  = renderTable({
    iris[iris$Species == input$specie, ]
  })
}
shinyApp(ui, server)

Reactivity in Shiny

Shiny apps use a feature called reactivity, which means that a shiny app will respond to changes in its inputs. It's similar to MS Excel, where changing one cell has an effect on the whole workbook.

It is quite useful to define reactive( ) expressions when there are multiple widgets.

Suppose we have two widgets with input IDs 'a' and 'b', and two reactive expressions, say 'X' and 'Y', one for each. If the value of 'a' changes then reactive expression 'X' is updated, and if 'b' changes then 'Y' is updated.

If a change is made only in one of the input values, say 'a', while 'b' stays the same, then reactive expression 'X' will be re-computed but 'Y' will be skipped. This saves a lot of computation time and keeps the app's logic clear. A small sketch of this idea follows.
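Below is a minimal sketch of two independent reactive expressions; the input IDs 'a' and 'b' follow the discussion above, while the output IDs 'outA' and 'outB' are made-up names for illustration.
ui = fluidPage(
  numericInput("a", "Value of a", value = 1),
  numericInput("b", "Value of b", value = 1),
  verbatimTextOutput("outA"),
  verbatimTextOutput("outB")
)
server = function(input, output) {
  X = reactive({ input$a ^ 2 })   # re-computed only when 'a' changes
  Y = reactive({ input$b * 10 })  # re-computed only when 'b' changes
  output$outA = renderPrint({ X() })
  output$outB = renderPrint({ Y() })
}
shinyApp(ui, server)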


Creating Plots

Here we want to display the histogram by selecting any one variable in the iris dataset available in R.

Using plotOutput in main panel we refer to the server function.

In the server function we are using reactive. It means that the value will be recalculated only when input$characterstic changes.

output$myplot matches plotOutput("myplot") and hence draws the histogram using renderPlot( ).
ui =  fluidPage(titlePanel("Creating the plots!"),
                sidebarLayout(sidebarPanel(
                  selectInput(
                    inputId  = "characterstic",
                    label = "Select the characterstic for which you want the histogram",
                    choices = c("Sepal Length", "Sepal Width" ,
                                "Petal Length", "Petal Width"),
                    selected = "Sepal Length"
                  )
                ),
             
                mainPanel(plotOutput("myplot"))))
server = function(input, output) {
  char = reactive({
    switch(
      input$characterstic,
      "Sepal Length" = "Sepal.Length",
      "Sepal Width" = "Sepal.Width",
      "Petal Length" = "Petal.Length",
      "Petal Width" = "Petal.Width"
    )
  })

  output$myplot  = renderPlot({
    hist(
      iris[, char()],
      xlab = input$characterstic,
      main = paste("Histogram of", input$characterstic)
    )
  })

}
shinyApp(ui, server)

Well Panel and Vertical Layout

verticalLayout( ) creates a layout in which each element passed in the UI appears on its own line.
wellPanel( ) creates a panel with a border and a grey background.

In the following example we are trying to create an app where we draw the QQ plot for random sample from normal distribution.

Using the slider we define the size of the sample. By default it is 500.
ui = fluidPage(titlePanel("Vertical layout"),
verticalLayout(wellPanel(
sliderInput("n", "QQ Plot of normal distribution", 100, 1000,
value = 500)
),
plotOutput("plot1")))
server = function(input, output) {
output$plot1 = renderPlot({
qqnorm(rnorm(input$n))
})
}
shinyApp(ui, server)

Creating tabs

We can create various tabs in shiny where some particular output is displayed in a particular tab. This can be done using tabsetPanel.

We are creating an app in which the user selects the column for which he wants the summary and the boxplot.

In the main panel we are creating the tabs. Each tab has a label and the output to be shown.
For instance, the first tab's label is 'Summary' and it shows the verbatimTextOutput, while the other tab has the label 'Boxplot' with the output plotted using plotOutput.
ui =  fluidPage(titlePanel("Creating the tabs!"),
sidebarLayout(sidebarPanel(
radioButtons(
inputId = "characterstic",
label = "Select the characterstic for which you want the summary",
choices = c(
"Mileage" = "mpg",
"Displacement" = "disp",
"Horsepower" = "hp",
"Rear axle ratio" = "drat",
"Weight" = "wt"
),
selected = "mpg"
)
),
mainPanel(tabsetPanel(
tabPanel("Summary", verbatimTextOutput("mysummary")),
tabPanel("Boxplot", plotOutput("myplot"))
))))

server = function(input, output) {
output$mysummary = renderPrint({
summary(mtcars[, input$characterstic])
})

output$myplot = renderPlot({
boxplot(mtcars[, input$characterstic], main = "Boxplot")
})
}
shinyApp(ui, server)
Creating tabs in Shiny

Some more plots...

In this example we are using the VADeaths data. We first select the area (Rural or Urban) and gender (Male or Female) and then plot a barplot showing the death rate for different age groups.
ui = fluidPage(
titlePanel("Death rates by Gender and area"),

sidebarLayout(
sidebarPanel(
selectInput("area", "Choose the area",
choices = c("Rural", "Urban")),
br(),
selectInput("gender", "Choose the gender", choices = c("Male", "Female"))
),

mainPanel(plotOutput("deathrate"))

)
)

server = function(input, output) {
output$deathrate <- renderPlot({
a = barplot(VADeaths[, paste(input$area, input$gender)],
main = "Death Rates in Virginia",
xlab = "Age Groups")
text(a,
y = VADeaths[, paste(input$area, input$gender)] - 2,
labels = VADeaths[, paste(input$area, input$gender)],
col = "black")
})
}

shinyApp(ui, server)

Conditional Panels

Suppose you wish to create outputs only when a particular option is selected or a particular condition is satisfied. For this purpose we can use conditional panels, where we define the condition in JavaScript format and then define the output or widget to appear if the condition is met. A simple example of a conditional panel follows: first we ask for the number of hours one sleeps; if someone sleeps for less than 7 hours then he needs more sleep, and if someone sleeps 9 hours or more then he sleeps a lot.
ui = fluidPage(
  titlePanel("Conditional Panels"),
  sidebarPanel(
    numericInput("num", "How many hours do you sleep?", min = 1, max = 24, value = 6)),
  mainPanel(
    conditionalPanel("input.num < 7", "You need more sleep"),
    conditionalPanel("input.num >= 9", "You sleep a lot")
  )
)
server = function(input, output) {

}
shinyApp(ui, server)

Note: The first argument of conditionalPanel is a JavaScript expression, thus we write input.num and not input$num to access the input value of sleeping hours.


Conditional Panel : Example 2

In the following example we are using the income.csv file. First we ask which variable the user wants to work with and save the data in 'a' using reactive( ). Then, using uiOutput, we insert a widget asking whether the user wants the summary, to view the data or the histogram. Based on the option selected by the user we create conditional panels for the summary, viewing the data and plotting the histogram.
income = read.csv("income.csv", stringsAsFactors = FALSE)

ui = fluidPage(titlePanel(em("Conditional panels")),
sidebarLayout(
sidebarPanel(
selectInput(
"Choice1",
"Select the variable",
choices = colnames(income)[3:16],
selected = "Y2008"
),
uiOutput("Out1")
),
mainPanel(
conditionalPanel("input.Choice2 === 'Summary'", verbatimTextOutput("Out2")),
conditionalPanel("input.Choice2 === 'View data'", tableOutput("Out3")),
conditionalPanel("input.Choice2 === 'Histogram'", plotOutput("Out4"))
)
))

server = function(input, output) {
a = reactive({
income[, colnames(income) == input$Choice1]
})
output$Out1 = renderUI({
radioButtons(
"Choice2",
"What do you want to do?",
choices = c("Summary", "View data", "Histogram"),
selected = "Summary"
)
})
output$Out2 = renderPrint({
summary(a())
})
output$Out3 = renderTable({
return(a())
})
output$Out4 = renderPlot({
return(hist(a(), main = "Histogram", xlab = input$Choice1))
})
}
shinyApp(ui = ui, server = server)

Downloading Data

shiny allows users to download datasets. This can be done by using downloadButton in the UI and downloadHandler in the server. First we select the data using radioButtons and save the selected dataset using reactive( ) in the server. Then in the UI we create a downloadButton where the first argument is the inputId and the other one is the label. downloadHandler needs two arguments: filename and content. In 'filename' we specify the name under which the file should be saved, and in 'content' we write the dataset to a csv file.
ui =  fluidPage(titlePanel("Downloading the data"),
sidebarLayout(sidebarPanel(
radioButtons(
"data",
"Choose a dataset to be downloaded",
choices = list("airquality", "iris", "sleep"),
selected = "airquality"
),
downloadButton("down", label = "Download the data.")
),
mainPanel()))

server = function(input, output) {

# Reactive value for selected dataset ----
datasetInput = reactive({
switch(input$data,
"airquality" = airquality,
"iris" = iris,
"sleep" = sleep)
})

# Downloadable csv of selected dataset ----
output$down = downloadHandler(
filename = function() {
paste(input$data, ".csv", sep = "")
},
content = function(file) {
write.csv(datasetInput(), file, row.names = FALSE)
}
)

}
shinyApp(ui, server)

Uploading a file

So far we have been dealing with in-built datasets in R. To allow users to upload their own datasets and do the analysis on them, the fileInput function in the shiny UI lets users upload their own file. Here we are creating an app to upload files. In fileInput, 'multiple = F' denotes that only one file can be uploaded by the user and 'accept = ".csv"' restricts the type of files which can be uploaded. Then we ask the user whether he wants to view the head of the data or the entire dataset, which is then displayed using renderTable.
library(shiny)
ui = fluidPage(titlePanel("Uploading file in Shiny"),
sidebarLayout(
sidebarPanel(
fileInput(
"myfile",
"Choose CSV File",
multiple = F,
accept = ".csv"
),

checkboxInput("header", "Header", TRUE),

radioButtons(
"choice",
"Display",
choices = c(Head = "head",
All = "all"),
selected = "head"
)
),

mainPanel(tableOutput("contents"))

))
server = function(input, output) {
output$contents = renderTable({
req(input$myfile)

data = read.csv(input$myfile$datapath,
header = input$header)

if (input$choice == "head") {
return(head(data))
}
else {
return(data)
}

})
}
shinyApp(ui, server)

Sharing the app with others

Method I : Sharing the R code: You can share your app with others by sharing your R code. To make it work, users need to have R installed on their system.

Method II : Share your app as a web page: You need to create an account on shinyapps.io and follow the instructions below to share your app.R file.

Deploying shiny app on shinyapps.io

First you need to have an account on shinyapps.io.

Import library rsconnect by using
library(rsconnect) 
Then you need to configure the rsconnect package to your account using the code below -
rsconnect::setAccountInfo(name="<ACCOUNT>", token="<TOKEN>", secret="<SECRET>")
To deploy the app you can write:
rsconnect::deployApp(' Folder path in which your app.R file is saved') 
As a result, a new web page with your app's link will open.

Shiny App for Normality

In this app the user first selects the variable for which he wants to test normality. Then he is asked whether he wants to check normality via plots or statistical tests. If the user selects plots, he is then asked whether he wants a Histogram or a QQ-Plot. The link for the shiny app is: My Shiny App
ui =  fluidPage(titlePanel("My first App"),
sidebarLayout(
sidebarPanel(
selectInput(
"varchoice",
"Choose the variable for which you want to check the normality",
choices = c("mpg", "disp", "drat", "qsec", "hp", "wt")
),
radioButtons(
"normchoice",
"How do you want to check the normality?",
choices = c("Plots", "Tests"),
selected = "Plots"
),
conditionalPanel(
"input.normchoice == 'Plots'",
selectInput(
"plotchoice",
"Choose which plot you want?",
choices = c("Histogram", "QQ-Plot")
)
)


),
mainPanel(
conditionalPanel("input.normchoice == 'Plots'", plotOutput("myplot")),
conditionalPanel("input.normchoice == 'Tests'", verbatimTextOutput("mytest"))
)
))
server = function(input, output) {
var = reactive({
mtcars[, input$varchoice]

})
output$myplot = renderPlot({
if (input$plotchoice == "Histogram")
return(hist(var(), main = "Histogram", xlab = input$varchoice))
else
return(qqnorm(var(), main = paste("QQ plot of", input$varchoice)))
})
output$mytest = renderPrint({
shapiro.test(var())
})
}

shinyApp(ui, server)
Following is a clip of how the app will look when you open the link:
My First Shiny App
14
Feb

jamovi for R: Easy but Controversial

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally,  since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternately, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts it immediately back to numeric. I missed some features such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it automatically offsets the confidence intervals for each workshop to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which, oddly enough, adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections. However, the settings in the dialog boxes are not currently saved from session to session. Since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains: the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer re-usability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, to compare the t-test and non-parametric analyses on two variables using base R function would look like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

While the same comparison using the jamovi GUI, or its jmv package, would look like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

library("jmv")
ttestIS(
 data = mydata100,
 vars = c("pretest", "posttest"),
 group = "gender",
 mann = TRUE,
 meanDiff = TRUE)

In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
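To make the contrast concrete, here is a minimal sketch of the piecemeal base-R workflow described above, using the built-in mtcars data purely for illustration (it is not the mydata100 data used elsewhere in this post).
fit <- lm(mpg ~ wt + hp, data = mtcars)  # fit a linear model
summary(fit)                             # parameter estimates and t-tests
anova(fit)                               # ANOVA table
head(predict(fit))                       # predicted values
plot(fit)                                # diagnostic plots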

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus, and R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), you can install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense to their overall goal.

Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).

Though it is only at version 0.8.1.2.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready” and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy to use graphical user interface plus a set of functions that combines the capabilities of many others. If its developers, Jonathan Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!

Acknowledgements

Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.

28
Jan

Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don't have any new data? You make it up.

This simulation tests the type I error rate of the two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminds the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive: it happens when a hypothesis test rejects the null hypothesis when in fact it is true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If there were no errors in the hypothesis tests in this simulation, we would never reject the null hypothesis, but by design we expect to reject it at a rate given by alpha, the significance level. The de facto standard for alpha is 0.05.

R

First, we run a simulation in R by repeatedly comparing randomly generated sets of normally distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops to clutter the code and slow the simulation down.
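The full code follows in the original post; as a rough illustration of what a vectorized version can look like, here is a minimal sketch (my own, not the author's code) that simulates 10,000 equal-variance two-sample t-tests under a true null hypothesis and estimates the type I error rate.
set.seed(1)
nsim <- 10000                            # number of simulated experiments
n    <- 15                               # observations per group
x <- matrix(rnorm(n * nsim), nrow = n)   # group 1 samples, one column per experiment
y <- matrix(rnorm(n * nsim), nrow = n)   # group 2 samples
sp2   <- (apply(x, 2, var) + apply(y, 2, var)) / 2        # pooled variance (equal n)
tstat <- (colMeans(x) - colMeans(y)) / sqrt(sp2 * 2 / n)  # pooled t statistics
pval  <- 2 * pt(-abs(tstat), df = 2 * n - 2)              # two-sided p-values
mean(pval < 0.05)                        # observed type I error rate, should be near 0.05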

Read more »