Tutorial : Build a Web App in R Using Shiny

In this tutorial, we will cover how to build a Shiny app from scratch in R. It includes various examples that will help you get familiar with the shiny package.

Shiny is an R package developed by RStudio that can be used to create interactive web pages with R. In simple words, you can build a web page (an online reporting tool) without knowing any web programming languages such as JavaScript, PHP, or CSS.

The best part about the shiny package is that you can easily integrate R with a web page. Suppose you want your web page to run machine learning algorithms like random forest or SVM and display a summary of the model, with the flexibility of selecting inputs from the user. Shiny can do this very easily.

Shiny's prominent features

  1. Customizable widgets like sliders, drop down lists, numeric inputs and many more.
  2. Downloading datasets, graphs and tables in various formats.
  3. Uploading files.
  4. Provides utility to create brilliant plots.
  5. In-built functions for viewing data or printing the text or summaries.
  6. Reactive programming, which makes data handling easier.
  7. Conditional panels that appear only when a particular condition is met.
  8. Works in any R environment (Console R, RGUI for Windows or Mac, RStudio, etc.)
  9. No need to learn another software for online dashboarding
  10. Can style your app with CSS / HTML (Optional)

Essential components of a shiny app

  1. User Interface (UI) : It controls the layout and appearance of the various widgets on the web page. Every shiny app requires a user interface, which is controlled by the ui script.
  2. Server : It contains the instructions that your computer needs when the user interacts with the app.
Example - You must have seen or created interactive charts in Excel. To make a chart interactive, we use drop-downs, list boxes or some other user controls. When the user changes a value in a drop-down, the chart gets updated.

The UI is responsible for creating these drop-downs and list boxes and telling shiny where to place these user controls and where to place the charts, while the server is responsible for creating the charts and the data in the tables.

Basic layout of UI

User Interface : A simple shiny UI consists of a fluidPage which contains various panels. We can divide the display into two parts named sidebarPanel( ) and mainPanel( ). Both of the panels can be accessed using sidebarLayout( ).

The following image gives an idea of what the title panel, sidebar panel and main panel are in a shiny app.
  1. Title panel is a place where the title of the app is displayed.
  2. Sidebar panel is where special instructions or widgets (drop down / slider/ checkbox) are displayed to the app user. The sidebar panel appears on the left side of your app by default. You can move it to the right side by changing the position argument in the sidebar layout.
  3. Main panel is the area where all the outputs are generally placed.

Shiny Elements

Installing Shiny

First we need to install the shiny package using install.packages( ) :

install.packages("shiny")

Loading Shiny

The shiny package can be loaded using library( ) :

library(shiny)

The first simple shiny app with basic layout

ui = fluidPage(sidebarLayout(sidebarPanel("Welcome to Shiny App"),
                             mainPanel("This is main Panel")))
server = function(input, output) {  }
shinyApp(ui, server)
Sidebar Panel and Main Panel 

Guidelines for beginners to run a shiny app

Step 1 : shinyApp(ui, server) is an in-built function in the shiny package that runs the app with ui and server as the arguments. Select the code and run it. Once you do it successfully, you will see the text Listening on followed by a local URL in the console.

Step 2 : To create your app, save the code as a file named app.R; a Run App button will then be displayed (in RStudio). Click on it and a new window with your app will open.
Shiny App

Some more panels...

There are some additional panels which can be added to the sidebarPanel and mainPanel depending upon the layout and requirements of the app. Some of them, which will be explained later in this tutorial, are:
Shiny : Panels

Adding a title to your App!

Using titlePanel( ) one can provide an appropriate title for the app. Note that after titlePanel( ) a comma (,) is placed.
ui = fluidPage(titlePanel("My favorite novels:"))
server = function(input, output) { }
shinyApp(ui, server)
Title : Shiny App

Using HTML tags in Shiny

Content can be added to the various panels. To change the appearance of the text with bold, italics, images, different fonts and colors, headings etc., we can use various HTML helper functions in shiny. Some of them, which share their names with the corresponding HTML tags, are h1( ) to h6( ) for headings, p( ) for paragraphs, a( ) for links, br( ) for line breaks, and code( ), em( ), strong( ), span( ) and div( ) for styling text.

Creating a hyperlink

A hyperlink can be created using a( ), where the first argument is the text to which the link is attached and href contains the URL of the website we want to link to.
ui = fluidPage(sidebarLayout(sidebarPanel(
  a("Click here!", href = "http://www.listendata.com/")),
  mainPanel()))
server = function(input, output) {}
shinyApp(ui, server)

Modifying the text presentation using HTML tags.

We create an app containing a list of favorite novels. You can refer to the HTML helper functions mentioned above.
ui = fluidPage(titlePanel("My favorite novels:"),
               p("My favorite novels are:"),
               h4(strong("The Kite Runner"), "a novel by", em("Khaled Hosseini")),
               h3(strong("Jane Eyre"), "a novel by", code("Charlotte Bronte")),
               p("The diary of a young girl", "by",
                 span("Anne Frank", style = "color:blue")),
               div(strong("A thousand splendid suns"), "by Khaled Hosseini",
                   style = "color: red"))
server = function(input, output) { }
shinyApp(ui, server)

Note that "Charlotte Bronte" in the app is written in a code style.
Difference between span( ) and div( ) : span( ) writes "Anne Frank" on the same line in blue color, while div( ) is similar to span( ) except that it places the text on a new line.
Shiny : HTML Formatting

Introducing widgets

Various widgets are used in shiny to take inputs from the user and control the outputs. These widgets can be inserted in the ui function (anywhere in the main panel and sidebar panel).
The most commonly used widgets are:
Shiny Widgets

The following image shows how various widgets appear when an app is run.

Shiny Widgets
  • Buttons can be created using the actionButton and submitButton widgets.
  • Single check box, multiple check box and date inputs are created using checkboxInput, checkboxGroupInput and dateInput respectively.
  • Date range is created using dateRangeInput.

Most commonly used widgets

All widgets require an input ID, which we use to retrieve their values. This input ID is not visible to the app user. The label is the heading for the widget, which is visible when the app is run. To understand this better, we create an app that collects details from the user through the widgets provided by shiny.

HelpText and TextInput

ui = fluidPage(sidebarLayout(
  sidebarPanel(helpText("This questionnaire is subject to privacy."),
               textInput(inputId = "name", label = "Enter your name.")),
  mainPanel()))

server = function(input, output) { }
shinyApp(ui, server)
helptext() and Text Input

helpText( ) creates a disclaimer which will be displayed on the sidebarPanel.

Adding SliderInput
ui = fluidPage(sidebarLayout(
  sidebarPanel(
    helpText("This questionnaire is subject to privacy."),
    textInput(inputId = "name", label = "Enter your name."),
    sliderInput(
      inputId = "age",
      label = "What is your age?",
      min = 1,
      max = 100,
      value = 25
    )
  ),
  mainPanel()))

server = function(input, output) { }
shinyApp(ui, server)

In sliderInput we use "age" as the ID, and the label displayed in the app is "What is your age?". min = 1 and max = 100 give the minimum and maximum values for the slider, and value = 25 denotes the default selected value.


RadioButtons, NumericInput and CheckBoxInput

ui = fluidPage(sidebarLayout(
  sidebarPanel(
    radioButtons(
      inputId = "month",
      label = "In which month are you born?",
      choices = list(
        "Jan - March" = 1,
        "April - June" = 2,
        "July - September" = 3,
        "October - December" = 4
      ),
      selected = 1
    ),
    numericInput(
      inputId = "children_count",
      label = "How many children do you have?",
      value = 2,
      min = 0,
      max = 15
    ),
    selectInput(
      inputId  = "smoker",
      label = "Are you a smoker?",
      choices = c("Yes", "No", "I smoke rarely"),
      selected = "Yes"
    ),
    checkboxInput(
      inputId = "checkbox",
      label = "Are you a human?",
      value = FALSE
    ),
    checkboxGroupInput(
      inputId = "checkbox2",
      label = "2 + 2 = ?",
      choices = list(4, "Four", "IV", "None of the above")
    )
  ),
  mainPanel()))

server = function(input, output) { }
shinyApp(ui, server)
Other common Widgets

In the radioButtons or selectInput widgets we define the list of options in the choices parameter. The selected parameter specifies the default selected option.

Using fluidRow

The output of our above app is a bit weird, right? Everything appears in the sidebar panel and nothing in the mainPanel. We can make it a bit more sophisticated by removing the mainPanel and arranging the widgets in rows.

We use fluidRow for this. Keep in mind that the width of a row is 12 units; if a row contains widgets that require more than 12 units of width, the last widget will be displayed on the next row.

Let us create the above same app using fluidRow.

Our app creates textInput, sliderInput and radioButtons in one row.

ui = fluidPage(
  "This questionnaire is subject to privacy. All the information obtained will be confidential.",

  fluidRow(
    column(4, textInput(inputId = "name", label = "Enter your name.")),

    column(4, sliderInput(
      inputId = "age",
      label = "What is your age?",
      min = 1,
      max = 100,
      value = 25
    )),

    column(4, radioButtons(
      inputId = "month",
      label = "In which month are you born?",
      choices = list(
        "Jan - March" = 1,
        "April - June" = 2,
        "July - September" = 3,
        "October - December" = 4
      ),
      selected = 1
    ))
  ),

  fluidRow(
    column(6, numericInput(
      inputId = "children_count",
      label = "How many children do you have?",
      value = 2,
      min = 0,
      max = 15
    ))
  )
)

server = function(input, output) { }

shinyApp(ui, server)


In column(6, ...) the 6 denotes the width required by one widget. To move to the next row, another fluidRow command is used.

Time to get some output!

So far we have been providing inputs to our server function, but note that the server function also takes output as an argument. Thus we can have various outputs, such as textOutput, verbatimTextOutput, tableOutput, plotOutput and uiOutput.
These *Output functions are defined in the ui and are given a key (the output ID); using that key we refer to them in the server function.

In the server function we use render* functions to build the outputs. Some of the most common render* commands are renderText, renderPrint, renderTable, renderPlot and renderUI.
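As a minimal sketch of this pairing (the IDs "name" and "greeting" are our own illustrative choices, not from the tutorial), a textOutput in the ui is filled by the matching output$ assignment in the server:

```r
library(shiny)

# Sketch: textOutput("greeting") in the ui is filled by output$greeting
# in the server via renderText. IDs here are illustrative.
ui = fluidPage(
  textInput(inputId = "name", label = "Enter your name."),
  textOutput("greeting")
)

server = function(input, output) {
  # output$greeting matches textOutput("greeting") in the ui
  output$greeting = renderText({
    paste("Hello,", input$name)
  })
}

app = shinyApp(ui, server)  # launch with runApp(app)
```

Passing the app object to runApp( ) starts it, exactly as shinyApp(ui, server) does when run at the console.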

Dealing with dates

Using dateInput( ) we can select dates from a calendar.

The inputId is "bday", the label displayed in the app is "Enter your Date of Birth", and the default value is 1st January 1985.

The verbatimTextOutput is used in the ui and is referred to in the server as "text".

In the server function we use output$text to tell shiny that the following output will be displayed in verbatimTextOutput("text").

The renderPrint( ) denotes our output to be printed and we get the date of birth printed using input$bday (Recall bday was the inputID in our dateInput).
ui = fluidPage(dateInput(
  inputId = "bday",
  label = h3("Enter your Date of Birth"),
  value = "1985-01-01"),
  verbatimTextOutput("text"))

server = function(input, output) {
  output$text <- renderPrint({
    paste(input$bday, "was a blessed day!")
  })
}
shinyApp(ui, server)

Viewing Data

Here we are using the iris dataset, and we want to display only the data for the particular species selected by the user.

Using selectInput( ) we choose the species, with inputId "specie". In the main panel we want our output as a table, so we use tableOutput( ). In the server, output$data matches tableOutput("data") and renders a table using renderTable.

ui = fluidPage(titlePanel("Viewing data"),
               sidebarLayout(
                 sidebarPanel(selectInput(
                   inputId  = "specie",
                   label = "Select the flower specie",
                   choices = c("setosa", "versicolor", "virginica"),
                   selected = "setosa"
                 )),
                 mainPanel(tableOutput("data"))))

server = function(input, output) {
  output$data = renderTable({
    iris[iris$Species == input$specie, ]
  })
}
shinyApp(ui, server)

Reactivity in Shiny

Shiny apps use a functionality called reactivity, which means that a shiny app is responsive to changes in its inputs. It's similar to MS Excel, where changing one cell has an effect on the whole workbook.

It is quite useful to define reactive( ) expressions when there are multiple widgets.

Suppose we have two widgets with input IDs 'a' and 'b', and two reactive expressions, say 'X' and 'Y', one for each. Thus if the value of 'a' changes, reactive expression 'X' will be updated, and if 'b' changes, 'Y' will be updated.

If a change is made only to one of the input values, say 'a', while 'b' stays the same, then reactive expression 'X' will be re-evaluated but 'Y' will be skipped. This saves a lot of computation time and avoids unnecessary re-runs.
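This invalidation behavior can be sketched outside a running app using shiny's reactiveConsole (the values 'a' and 'b' below stand in for the two widgets; the numbers are arbitrary):

```r
library(shiny)

# Enable reactive evaluation outside an app (shiny >= 1.6), for illustration.
reactiveConsole(TRUE)

a <- reactiveVal(2)    # stands in for the widget with input ID 'a'
b <- reactiveVal(10)   # stands in for the widget with input ID 'b'

X <- reactive(a() * 2) # depends only on 'a'
Y <- reactive(b() + 1) # depends only on 'b'

X()  # 4
Y()  # 11

a(5) # change 'a': only X is invalidated; Y's cached value is reused
X()  # 10
Y()  # still 11, returned from cache without re-running
```

Inside an app the same bookkeeping happens automatically whenever an input changes.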

Creating Plots

Here we want to display the histogram by selecting any one variable in the iris dataset available in R.

Using plotOutput("myplot") in the main panel, we refer to the plot created in the server function.

In the server function we are using reactive( ). It means the value is recomputed only when input$characterstic changes.

output$myplot matches plotOutput("myplot") and hence draws the histogram using renderPlot( ).
ui = fluidPage(titlePanel("Creating the plots!"),
               sidebarLayout(
                 sidebarPanel(selectInput(
                   inputId  = "characterstic",
                   label = "Select the characterstic for which you want the histogram",
                   choices = c("Sepal Length", "Sepal Width",
                               "Petal Length", "Petal Width"),
                   selected = "Sepal Length"
                 )),
                 mainPanel(plotOutput("myplot"))))

server = function(input, output) {
  char = reactive({
    switch(
      input$characterstic,
      "Sepal Length" = "Sepal.Length",
      "Sepal Width" = "Sepal.Width",
      "Petal Length" = "Petal.Length",
      "Petal Width" = "Petal.Width"
    )
  })

  output$myplot = renderPlot({
    hist(
      iris[, char()],
      xlab = input$characterstic,
      main = paste("Histogram of", input$characterstic)
    )
  })
}
shinyApp(ui, server)

Well Panel and Vertical Layout

verticalLayout( ) creates a layout in which each element passed in the UI appears on its own line.
wellPanel( ) creates a panel with a border and a grey background.

In the following example we are trying to create an app where we draw the QQ plot for random sample from normal distribution.

Using the slider we define the size of the sample; by default it is 500.
ui = fluidPage(titlePanel("Vertical layout"),
               verticalLayout(
                 wellPanel(sliderInput("n", "QQ Plot of normal distribution",
                                       100, 1000, value = 500)),
                 plotOutput("plot1")))

server = function(input, output) {
  output$plot1 = renderPlot({
    qqnorm(rnorm(input$n))
  })
}
shinyApp(ui, server)

Creating tabs

We can create various tabs in shiny where some particular output is displayed in a particular tab. This can be done using tabsetPanel.

We are creating an app in which the user selects the columns for which he wants the summary and the boxplot.

In the main panel we create the tabs; each tab has a label and the output to be shown.
For instance, the first tab's label is 'Summary' and it shows the verbatimTextOutput, while the other tab has the label 'Boxplot' with the output plotted using plotOutput.
ui = fluidPage(titlePanel("Creating the tabs!"),
               sidebarLayout(
                 sidebarPanel(selectInput(
                   inputId = "characterstic",
                   label = "Select the characterstic for which you want the summary",
                   choices = c(
                     "Mileage" = "mpg",
                     "Displacement" = "disp",
                     "Horsepower" = "hp",
                     "Rear axle ratio" = "drat",
                     "Weight" = "wt"
                   ),
                   selected = "mpg"
                 )),
                 mainPanel(tabsetPanel(
                   tabPanel("Summary", verbatimTextOutput("mysummary")),
                   tabPanel("Boxplot", plotOutput("myplot"))
                 ))))

server = function(input, output) {
  output$mysummary = renderPrint({
    summary(mtcars[, input$characterstic])
  })

  output$myplot = renderPlot({
    boxplot(mtcars[, input$characterstic], main = "Boxplot")
  })
}
shinyApp(ui, server)
Creating tabs in Shiny

Some more plots...

In this example we are using the VADeaths data. We first select the area (Rural or Urban) and gender (Male or Female) and then plot a barplot showing the death rates for different age groups.
ui = fluidPage(
  titlePanel("Death rates by Gender and area"),
  sidebarLayout(
    sidebarPanel(
      selectInput("area", "Choose the area",
                  choices = c("Rural", "Urban")),
      selectInput("gender", "Choose the gender", choices = c("Male", "Female"))
    ),
    mainPanel(plotOutput("deathrate"))))

server = function(input, output) {
  output$deathrate <- renderPlot({
    a = barplot(VADeaths[, paste(input$area, input$gender)],
                main = "Death Rates in Virginia",
                xlab = "Age Groups")
    text(a,
         y = VADeaths[, paste(input$area, input$gender)] - 2,
         labels = VADeaths[, paste(input$area, input$gender)],
         col = "black")
  })
}
shinyApp(ui, server)

Conditional Panels

Suppose you wish to create outputs only when a particular option is selected, or only if a particular condition is satisfied. For this purpose we can use conditional panels: we define the condition in JavaScript format and then define the output or the widget that should appear when the condition is met. A simple example of a conditional panel follows: first we ask how many hours a person sleeps; if someone sleeps less than 7 hours, he needs more sleep, and if someone sleeps 9 hours or more, he sleeps a lot.
ui = fluidPage(
  titlePanel("Conditional Panels"),
  numericInput("num", "How many hours do you sleep?", min = 1, max = 24, value = 6),
  conditionalPanel("input.num < 7", "You need more sleep"),
  conditionalPanel("input.num >= 9", "You sleep a lot")
)

server = function(input, output) { }
shinyApp(ui, server)
Note: The first argument of conditionalPanel is a JavaScript expression, thus we write input.num and not input$num to access the input value of sleeping hours.

Conditional Panel : Example 2

In the following example we are using the income.csv file. First we ask which variable the user wants to work with and save the data in 'a' using reactive( ). Then, using uiOutput, we insert a widget asking whether the user wants the summary, to view the data, or the histogram. Based on the option selected by the user, we create conditional panels for the summary, viewing the data and plotting the histogram.
income = read.csv("income.csv", stringsAsFactors = FALSE)

ui = fluidPage(titlePanel(em("Conditional panels")),
               selectInput(
                 "Choice1",
                 "Select the variable",
                 choices = colnames(income)[3:16],
                 selected = "Y2008"
               ),
               uiOutput("Out1"),
               conditionalPanel("input.Choice2 === 'Summary'", verbatimTextOutput("Out2")),
               conditionalPanel("input.Choice2 === 'View data'", tableOutput("Out3")),
               conditionalPanel("input.Choice2 === 'Histogram'", plotOutput("Out4")))

server = function(input, output) {
  a = reactive({
    income[, colnames(income) == input$Choice1]
  })
  output$Out1 = renderUI({
    selectInput(
      "Choice2",
      "What do you want to do?",
      choices = c("Summary", "View data", "Histogram"),
      selected = "Summary"
    )
  })
  output$Out2 = renderPrint({
    summary(a())
  })
  output$Out3 = renderTable({
    a()
  })
  output$Out4 = renderPlot({
    return(hist(a(), main = "Histogram", xlab = input$Choice1))
  })
}
shinyApp(ui = ui, server = server)

Downloading Data

shiny allows users to download datasets. This can be done by using downloadButton in the UI and downloadHandler in the server. First we select the data using radioButtons, and save the chosen dataset using reactive( ) in the server. Then in the UI we create a downloadButton, where the first argument is the inputId and the second one is the label. downloadHandler needs two arguments: filename and content. In 'filename' we specify the name under which the file should be saved, and in 'content' we write the dataset to a csv file.
ui = fluidPage(titlePanel("Downloading the data"),
               radioButtons(
                 "data",
                 "Choose a dataset to be downloaded",
                 choices = list("airquality", "iris", "sleep"),
                 selected = "airquality"
               ),
               downloadButton("down", label = "Download the data."))

server = function(input, output) {

  # Reactive value for selected dataset ----
  datasetInput = reactive({
    switch(input$data,
           "airquality" = airquality,
           "iris" = iris,
           "sleep" = sleep)
  })

  # Downloadable csv of selected dataset ----
  output$down = downloadHandler(
    filename = function() {
      paste(input$data, ".csv", sep = "")
    },
    content = function(file) {
      write.csv(datasetInput(), file, row.names = FALSE)
    }
  )
}
shinyApp(ui, server)

Uploading a file

So far we have been dealing with the datasets built into R. To allow users to upload their own datasets and analyze them, the fileInput function in shiny's UI lets users upload a file. Here we are creating an app to upload files. In fileInput, 'multiple = F' denotes that only one file can be uploaded by the user, and 'accept = ".csv"' restricts the type of files which can be uploaded. Then we ask the user whether he wants to view the head of the data or the entire dataset, which is then displayed using renderTable.
ui = fluidPage(titlePanel("Uploading file in Shiny"),
               sidebarLayout(
                 sidebarPanel(
                   fileInput(
                     "myfile",
                     "Choose CSV File",
                     multiple = F,
                     accept = ".csv"
                   ),

                   checkboxInput("header", "Header", TRUE),

                   radioButtons(
                     "choice",
                     "Display",
                     choices = c(Head = "head",
                                 All = "all"),
                     selected = "head"
                   )
                 ),
                 mainPanel(tableOutput("contents"))))

server = function(input, output) {
  output$contents = renderTable({
    req(input$myfile)

    data = read.csv(input$myfile$datapath,
                    header = input$header)

    if (input$choice == "head") {
      return(head(data))
    }
    else {
      return(data)
    }
  })
}
shinyApp(ui, server)

Sharing the app with others

Method I : Sharing the R code: You can share your app with others by sharing your R code. For this to work, users need to have R installed on their system.

Method II : Share your app as a web page: You need to create an account on shinyapps.io and follow the instructions below to share your app.R file.

Deploying shiny app on shinyapps.io

First you need to have an account on shinyapps.io.

Import the rsconnect package using library(rsconnect). Then you need to configure the rsconnect package with your account using the code below -
rsconnect::setAccountInfo(name="<ACCOUNT>", token="<TOKEN>", secret="<SECRET>")
To deploy the app you can write:
rsconnect::deployApp('folder path in which your app.R file is saved')
As a result, a new web page with your app's link will be opened.

Shiny App for Normality

In this app the user first selects the variable for which he wants to test normality. Then he is asked whether he wants to check normality via plots or via statistical tests. If the user selects plots, he is asked whether he wants a histogram or a QQ-plot. The link for the shiny app is: My Shiny App
ui = fluidPage(titlePanel("My first App"),
               selectInput(
                 "varchoice",
                 "Choose the variable for which you want to check the normality",
                 choices = c("mpg", "disp", "drat", "qsec", "hp", "wt")
               ),
               radioButtons(
                 "normchoice",
                 "How do you want to check the normality?",
                 choices = c("Plots", "Tests"),
                 selected = "Plots"
               ),
               conditionalPanel(
                 "input.normchoice == 'Plots'",
                 selectInput("plotchoice",
                             "Choose which plot you want?",
                             choices = c("Histogram", "QQ-Plot"))
               ),

               conditionalPanel("input.normchoice == 'Plots'", plotOutput("myplot")),
               conditionalPanel("input.normchoice == 'Tests'", verbatimTextOutput("mytest")))

server = function(input, output) {
  var = reactive({
    mtcars[, input$varchoice]
  })

  output$myplot = renderPlot({
    if (input$plotchoice == "Histogram")
      return(hist(var(), main = "Histogram", xlab = input$varchoice))
    return(qqnorm(var(), main = paste("QQ plot of", input$varchoice)))
  })
  output$mytest = renderPrint({
    shapiro.test(var())  # e.g. the Shapiro-Wilk normality test
  })
}
shinyApp(ui, server)
Following is a clip of how the app looks when the link is opened:
My First Shiny App

jamovi for R: Easy but Controversial

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally,  since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternately, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts it immediately back to numeric. I missed some features such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it automatically offsets the confidence intervals for each workshop to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which – oddly enough – adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections, but since point-and-click users are often unable to understand that code, it's not reproducible to them; nor are the settings in the dialog boxes currently saved from session to session. A jamovi file contains: the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you had just performed all the analyses and never left. So if your data collection process comes up with a few more observations, or if you find a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions looks like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

The same comparison using the jamovi GUI, or its jmv package, looks like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

jmv::ttestIS(
 data = mydata100,
 vars = c("pretest", "posttest"),
 group = "gender",
 mann = TRUE,
 meanDiff = TRUE)

In jamovi (and in SAS/SPSS), one command performs an entire analysis. For example, you can use a single function to get the equation parameters, t-tests on the parameters, an ANOVA table, predicted values, and diagnostic plots. In R, those are usually produced by five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
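That piecemeal style can be sketched in a few lines; the following is a generic illustration using R's built-in mtcars data rather than the article's mydata100:

```r
# R's piecemeal approach: one function call per step of the analysis
fit <- lm(mpg ~ wt, data = mtcars)  # fit a linear model
summary(fit)                        # parameter estimates and t-tests
anova(fit)                          # ANOVA table
head(predict(fit))                  # predicted values
plot(fit)                           # diagnostic plots
```

In jmv's all-in-one style, a single linReg call with the appropriate arguments requests all of these pieces of output at once.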

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus and R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), it lets you install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given their overall goal.

Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).

Though it is still an early version, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready,” and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combines the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!


Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.


Type I error rates in two-sample t-test by simulation

What do you do when analyzing data is fun, but you don't have any new data? You make it up.

This simulation tests the type I error rates of the two-sample t-test in R and SAS. It demonstrates efficient methods for simulation, and it reminds the reader not to take the result of any single hypothesis test as gospel truth. That is, there is always a risk of a false positive (or false negative), so determining truth requires more than one research study.

A type I error is a false positive: it happens when a hypothesis test rejects the null hypothesis when in fact the null hypothesis is true. In this simulation the null hypothesis is true by design, though in the real world we cannot be sure the null hypothesis is true. This is why we write that we "fail to reject the null hypothesis" rather than "we accept it." If the hypothesis tests in this simulation made no errors, we would never reject the null hypothesis; but by design we expect to reject it at a rate given by alpha, the significance level. The de facto standard for alpha is 0.05.


First, we run a simulation in R by repeatedly comparing randomly-generated sets of normally-distributed values using the two-sample t-test. Notice the simulation is vectorized: there are no "for" loops that clutter the code and slow the simulation.
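The following is a minimal sketch of such a vectorized simulation (my own reconstruction, not the post's original code): all simulated experiments are generated in one matrix, and the Welch t statistics and p-values are computed column-wise without any "for" loops.

```r
set.seed(42)
n.sims <- 10000   # number of simulated experiments
n      <- 15      # observations per group
alpha  <- 0.05    # significance level

# One simulation per column; both groups come from the same normal
# distribution, so the null hypothesis is true and every rejection
# is a false positive.
x <- matrix(rnorm(n * n.sims), nrow = n)
y <- matrix(rnorm(n * n.sims), nrow = n)

# Column-wise means and variances
xbar <- colMeans(x); ybar <- colMeans(y)
vx <- colSums((x - rep(xbar, each = n))^2) / (n - 1)
vy <- colSums((y - rep(ybar, each = n))^2) / (n - 1)

# Welch t statistic, degrees of freedom, and two-sided p-values
se    <- sqrt(vx / n + vy / n)
tstat <- (xbar - ybar) / se
df    <- se^4 / ((vx / n)^2 / (n - 1) + (vy / n)^2 / (n - 1))
p     <- 2 * pt(-abs(tstat), df)

mean(p < alpha)   # observed type I error rate, close to alpha
```

The column-wise p-values match what t.test( ) would report for each pair of columns, but computing them in bulk avoids calling t.test( ) ten thousand times.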


Data Visualization in R using ggplot2

For the purpose of data visualization, R offers various methods through inbuilt graphics and powerful packages such as ggplot2. The former helps in creating simple graphs while the latter assists in creating customized professional graphs. In this article we will learn how various graphs can be made and altered using the ggplot2 package.
Data Visualization with R

What is ggplot2?

ggplot2 is a robust and versatile R package, developed by Hadley Wickham, for generating aesthetic plots and charts.

The name ggplot2 refers to the "Grammar of Graphics", which is built on the principle that a plot can be split into the following basic parts -
Plot = data + Aesthetics + Geometry
  1. data refers to a data frame (dataset).
  2. Aesthetics indicates x and y variables. It is also used to tell R how data are displayed in a plot, e.g. color, size and shape of points etc.
  3. Geometry refers to the type of graphics (bar chart, histogram, box plot, line plot, density plot, dot plot etc.)

Apart from the above three parts, there are other important parts of plot -
  1. Faceting means drawing the same type of graph for each subset of the data. For example, for the variable gender, creating 2 graphs, one for males and one for females.
  2. Annotation lets you add text to the plot.
  3. Summary Statistics allows you to add descriptive statistics on a plot.
  4. Scales are used to control x and y axis limits

Why is ggplot2 better?
  • Excellent themes can be created with a single command.
  • Its colors are nicer and more attractive than those of the base graphics.
  • It makes it easy to visualize data with multiple variables.
  • It provides a way to create simple graphs that convey a plethora of information.

The table below shows common charts along with various important functions used in these charts.
Important Plots : Important Functions
  Scatter Plot : geom_point(), geom_smooth(), stat_smooth()
  Bar Chart : geom_bar(), geom_errorbar()
  Histogram : geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge()
  Box Plot : geom_boxplot(), stat_boxplot(), stat_summary()
  Line Plot : geom_line(), geom_step(), geom_path(), geom_errorbar()
  Pie Chart : coord_polar()


In this article, we will use three datasets - 'iris' , 'mpg' and 'mtcars' datasets available in R.

1. The 'iris' data comprises 150 observations of 5 variables. There are 3 species of flowers: Setosa, Versicolor and Virginica, and for each of them the sepal length, sepal width, petal length and petal width are provided.

2. The 'mtcars' data consists of fuel consumption (mpg) and 10 aspects of automobile design and performance for 32 automobiles. In other words, we have 32 observations and 11 different variables:
  1. mpg Miles/(US) gallon
  2. cyl Number of cylinders
  3. disp Displacement (cu.in.)
  4. hp Gross horsepower
  5. drat Rear axle ratio
  6. wt Weight (1000 lbs)
  7. qsec 1/4 mile time
  8. vs V/S
  9. am Transmission (0 = automatic, 1 = manual)
  10. gear Number of forward gears
  11. carb Number of carburetors

3. The 'mpg' data consists of 234 observations and 11 variables.

Install and Load Package

First we need to install the package using the command install.packages( ).
# installing the package
install.packages("ggplot2")
Once installation is complete, we need to load the package so that we can use its functions. To load it, use the command library( ).
# loading the package
library(ggplot2)

Histogram, Density plots and Box plots are used for visualizing a continuous variable.

Creating Histogram: 

Firstly we consider the iris data to create histogram and scatter plot.
# Considering the iris data.
# Creating a histogram
ggplot(data  = iris, aes( x = Sepal.Length)) + geom_histogram( )
Here we call the ggplot( ) function, the first argument being the dataset to be used.

  1. In aes( ), i.e. aesthetics, we define which variable will be represented on the x-axis; here we consider 'Sepal.Length'.
  2. geom_histogram( ) denotes that we want to plot a histogram.

Histogram in R

To change the width of the bins in the histogram we can use binwidth in geom_histogram( )
ggplot(data  = iris, aes(x = Sepal.Length)) + geom_histogram(binwidth=1)

One can also define the desired number of bins; the binwidth will then be adjusted automatically.
ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(color = "black", fill = "white", bins = 10)

Using color = "black" and fill = "white" we set the boundary color and the fill color of the bins respectively.

How to visualize various groups in histogram

ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_histogram(fill="white", binwidth = 1)
Histogram depicting various species

Creating Density Plot

Density plot is also used to present the distribution of a continuous variable.
ggplot(iris, aes( x = Sepal.Length)) + geom_density( )
geom_density( ) function is for displaying density plot.

Density Plot

How to show various groups in density plot

ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_density( )
Density Plot by group

Creating Bar and Column Charts :
Bar and column charts are probably the most common chart types. They are best used to compare different values.

Now mpg data will be used for creating the following graphics.
ggplot(mpg, aes(x= class)) + geom_bar() 
Here we are trying to create a bar plot for number of cars in each class using geom_bar( ).

Column Chart using ggplot2

Using coord_flip( ) one can interchange the x and y axes.
ggplot(mpg, aes(x= class)) + geom_bar() + coord_flip()
Bar Chart

How to add or modify Main Title and Axis Labels

The following functions can be used to add or alter main title and axis labels.
  1. ggtitle("Main title"): Adds a main title above the plot
  2. xlab("X axis label"): Changes the X axis label
  3. ylab("Y axis label"): Changes the Y axis label
  4. labs(title = "Main title", x = "X axis label", y = "Y axis label"): Changes main title and axis labels
p = ggplot(mpg, aes(x= class)) + geom_bar()
p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
Title and Axis Labels
How to add data labels
p = ggplot(mpg, aes(x= class)) + geom_bar()
p = p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
p + geom_text(stat='count', aes(label=..count..), vjust=-0.25)
geom_text() is used to add text directly to the plot. vjust is to adjust the position of data labels in bar.

Add Data Labels in Bar

How to reorder Bars
Using stat="identity" we can plot derived values instead of counts.
library(dplyr)
count(mpg, class) %>% arrange(-n) %>%
  mutate(class = factor(class, levels = class)) %>%
  ggplot(aes(x = class, y = n)) + geom_bar(stat = "identity")
The above command first creates a frequency distribution for the type of car, then arranges it in descending order using arrange(-n). Using mutate( ) we convert the 'class' column to a factor with levels in that order, and then plot the bar chart using geom_bar( ).

Change order of bars

Here, the bar for SUV appears first as it has the maximum number of cars; the bars are now ordered by frequency count.

Showing Mean of Continuous Variable by Categorical Variable
library(dplyr)
df = mpg %>% group_by(class) %>% summarise(mean = mean(displ)) %>%
  arrange(-mean) %>% mutate(class = factor(class, levels = class))

p = ggplot(df, aes(x=class, y=mean)) + geom_bar(stat="identity")
p + geom_text(aes(label = sprintf("%0.2f", round(mean, digits = 2))),
              vjust=1.6, color="white", fontface = "bold", size=4)

Now, using the dplyr library, we create a new data frame 'df' and plot it.
Using group_by we group the data by type of car, and summarise computes the statistic (here the mean of the 'displ' variable) for each group. To add data labels (with 2 decimal places) we use geom_text( ).

Customized BarPlot
Creating Stacked Bar Chart
p <- ggplot(data=mpg, aes(x=class, y=displ, fill=drv))
p + geom_bar(stat = "identity")

Stacked BarPlot
p + geom_bar(stat="identity", position=position_dodge())
Stacked - Position_dodge

Creating BoxPlot 

Using geom_boxplot( ) one can create a boxplot.

To create different boxplots for 'disp' for different levels of x we can define aes(x = cyl, y = disp)
mtcars$cyl = factor(mtcars$cyl)
ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot()
We can see one outlier for 6 cylinders.

To create a notched boxplot we write notch = TRUE
ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot(notch = TRUE)

Notched Boxplot

Scatter Plot
A scatterplot is used to graphically represent the relationship between two continuous variables.
# Creating a scatter plot denoting various species.
ggplot(data = iris, aes( x = Sepal.Length, y = Sepal.Width,shape = Species, color = Species)) + geom_point()
We plot the points using geom_point( ). In the aesthetics we define that the x axis denotes sepal length and the y axis denotes sepal width; shape = Species and color = Species denote that a different shape and a different color should be used for each particular species of flower.
Scatter Plot
Scatter plots are constructed using geom_point( ) 
# Creating scatter plot for automatic cars denoting different cylinders.
ggplot(data = subset(mtcars,am == 0),aes(x = mpg,y = disp,colour = factor(cyl))) + geom_point()
Scatter plot denoting various levels of cyl
We use the subset( ) function to select only those cars which have am = 0; in other words, we are considering only automatic cars. We plot displacement against mileage, using a different color for each number of cylinders. Also, factor(cyl) transforms the continuous variable cyl into a factor.
# Seeing the patterns with the help of geom_smooth.
ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp))  + geom_point() + geom_smooth()
In the above command we plot mileage (mpg) against displacement (disp), with the variation in color denoting the varying horsepower (hp). geom_smooth( ) is used to reveal the pattern exhibited by the points.
In a similar way we can use geom_line( ) to plot another line on the graph:
# Plotting the horsepower using geom_line
ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp))  + geom_point(size = 2.5) + geom_line(aes(y = hp))
Here in geom_point we have added an optional argument size = 2.5 denoting the size of the points. geom_line( ) creates a line. Note that we have not provided any x aesthetic in geom_line; it inherits the one from ggplot( ), so the line plots horsepower (hp) against mileage (mpg).

Modifying the axis labels and appending the title and subtitle
#Adding title or changing the labels
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + labs(title = "Scatter plot") 
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot")
ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot",
                                                              subtitle = "mtcars data in R")
Adding title and subtitle to plots
Here, using labs( ) or ggtitle( ) we can assign the graph a title. To add both a title and a subtitle we use ggtitle( ), where the first argument is the main title and the second is the subtitle.
a <- ggplot(mtcars,aes(x = mpg, y = disp, color = factor(cyl))) + geom_point()
#Changing the axis labels.
a + labs(color = "Cylinders")
a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement")
We first save our plot to 'a' and then make the alterations.
Note that in the labs command we use color = "Cylinders", which changes the title of the legend.
Using the xlab and ylab commands we can change the x and y axis labels respectively; here the x axis label is 'Mileage' and the y axis label is 'Displacement'.
#Combining it all
a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement") + ggtitle(label = "Scatter plot", subtitle = "mtcars data in R")
In the above plot we can see that the labels on the x axis, y axis and legend have changed, the title and subtitle have been added, and the points are colored to distinguish the number of cylinders.

Playing with themes
Themes can be used in ggplot2 to change the background, text colors, legend colors and axis text.
First we save our plot to 'b' and then create the visualizations by manipulating 'b'. Note that in the aesthetics we have written mpg, disp, which automatically plots mpg on the x axis and disp on the y axis.
#Changing the themes.
b <- ggplot(mtcars,aes(mpg,disp)) + geom_point()  + labs(title = "Scatter Plot") 
#Changing the size and color of the Title and the background color.
b + theme(plot.title = element_text(color = "blue",size = 17),plot.background = element_rect("orange"))
Plot background color changed.
We use theme( ) to modify the plot title and background. plot.title is an element_text( ) object in which we specify the color and size of the title. With plot.background, an element_rect( ) object, we specify the background color.
ggplot2 offers built-in themes in which the background panel design and colors change automatically. Some of them are theme_gray, theme_minimal, theme_dark, etc.
b + theme_minimal( )
We can observe horizontal and vertical lines behind the points. What if we don't need them? This can be achieved via: 
#Removing the lines from the background.
b + theme(panel.background = element_blank())

Setting panel.background = element_blank( ) removes the grid lines and background color from the panel.
#Removing the text from x and y axis.
b + theme(axis.text = element_blank())
b + theme(axis.text.x = element_blank())
b + theme(axis.text.y = element_blank())
To remove the text from both the axis we can use axis.text = element_blank( ). If we want to remove the text only from particular axis then we need to specify it.
Now we save our plot to c and then make the changes.
#Changing the legend position
c <- ggplot(mtcars,aes(x = mpg, y = disp, color = hp)) +labs(title = "Scatter Plot") + geom_point()
c +  theme(legend.position = "top")
If we want to move the legend then we can specify legend.position as "top" or "bottom" or "left" or "right".
Finally, combining all that we have learnt about themes, we create the plot below, where the legend is placed at the bottom, the plot title is forest green, the background is yellow, and no text is displayed on either axis.
#Combining everything.
c + theme(legend.position = "bottom", axis.text = element_blank()) +
  theme(plot.title = element_text(color = "forestgreen", size = 17), plot.background = element_rect("yellow"))
Scatter Plot

Changing the color scales in the legend 
In ggplot2, the default continuous color scale runs from dark blue to light blue. We may wish to change the scale by choosing different colors or adding new ones. This can be done successfully with the scale_color_gradient function.
c + scale_color_gradient(low = "yellow",high = "red") 
Suppose we want the colors to vary from yellow to red, with yellow denoting the lowest value and red the highest; we set low = "yellow" and high = "red". Note that the legend scale starts from 0 rather than the minimum value of the series.
What if we want 3 colors?
c + scale_color_gradient2(low = "red",mid = "green",high = "blue")
To have 3 colors in the legend we use scale_color_gradient2 with low = "red", mid = "green" and high = "blue". It divides the entire range (starting from 0) up to the maximum observation into 3 equal parts: the first part is shaded red, the central part green and the highest part blue.
c + theme(legend.position = "bottom") + scale_color_gradientn(colours = c("red","forest green","white","blue"))
If we want more than 3 colors in the legend we can use the scale_color_gradientn( ) function, whose colours argument takes a vector: the 1st element denotes the color of the 1st part, the 2nd element the color of the 2nd part, and so on.

Changing the breaks in the legend.
It can be seen that the legend for a continuous variable starts from 0.
Suppose we want the breaks to be 50, 125, 200, 275 and 350; we use seq(50,350,75), where 50 is the smallest value, 350 the largest, and 75 the difference between two consecutive values.
#Changing the breaks in the legend
c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), labels = paste(seq(50,350,75),"hp"))
In scale_color_continuous we set the breaks to our desired sequence and can change the labels if we want. Using the paste function each value in the sequence is followed by the word "hp", and name = "horsepower" changes the name of the legend.

Changing the break points and color scale of the legend together.
Let us try changing the break points and the colors in the legend together by trial and error.
#Trial 1 : This one is wrong
c + scale_color_continuous( breaks = seq(50,350,75)) +
  scale_color_gradient(low = "blue",high = "red") 
Refer to the trial 1 image for the above code. Notice that the color scale is blue to red as desired, but the breaks have not changed.
#Trial 2: This one is also wrong.
c  +  scale_color_gradient(low = "blue",high = "red") +
  scale_color_continuous( breaks = seq(50,350,75))
The trial 2 image shows the output for the above code. Here the color scale has not changed, but the breaks have been created.

What is happening? The reason is that we cannot have 2 scale_color_ functions for a single graph. If there are multiple scale_color_ functions, R keeps only the last scale_color_ command it received, overwriting the others.
In trial 1, scale_color_gradient overwrites the earlier scale_color_continuous command. Similarly, in trial 2, scale_color_continuous overwrites the earlier scale_color_gradient command.

The correct way is to define all the arguments in one function:
c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), low = "red", high = "black") + theme(panel.background = element_rect("green"),
 plot.background = element_rect("orange"))
Here low = "red" and high = "black" are defined in scale_color_continuous function along with the breaks.

Changing the axis cut points

We save our initial plot to 'd'. 
d <- ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point(aes(color = factor(am)))  +
  xlab("Mileage") + ylab("Displacement") +
  theme(panel.background = element_rect("black") , plot.background = element_rect("pink"))
To change the axis cut points we use scale_(axisname)_continuous.
d +  scale_x_continuous(limits = c(2,4)) + scale_y_continuous(limits = c(15,30))
To change the x axis limits to 2 to 4 we use scale_x_continuous, where 'limits' is a vector defining the lower and upper limits of the axis. Likewise, scale_y_continuous sets the lower cut-off of the y axis to 15 and the upper cut-off to 30.

d + scale_x_continuous(limits = c(2,4),breaks = seq(2,4,0.25)) +
  scale_y_continuous(limits = c(15,30),breaks = seq(15,30,3))
We can also add another parameter, 'breaks', which takes a vector specifying all the cut points of the axis. Here we create the sequence 2, 2.5, 3, 3.5, 4 for the x axis; for the y axis the sequence is 15, 18, 21, ..., 30.

Faceting is a technique which is used to plot the graphs for the data corresponding to various categories of a particular variable. Let us try to understand it via an illustration:

The facet_wrap function is used for faceting; after the tilde (~) sign we define the variable(s) on which we want the classification.
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb)
Faceting for carb
We see that there are 6 categories of 'carb'. Faceting creates 6 plots of disp against mpg, one for each category.
We can mention the number of rows we need for faceting.
# Control the number of rows and columns with nrow and ncol
ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb,nrow = 3)
Here the additional parameter nrow = 3 specifies that all the graphs should be arranged in 3 rows.

Faceting using multiple variables.
Faceting can be done for various combinations of carb and am.  
# You can facet by multiple variables
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb + am)
ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(c("carb","am"))
There are 6 unique 'carb' values and 2 unique 'am' values, so there could be 12 possible combinations, but we get only 9 graphs because the remaining 3 combinations have no observations.
It might be puzzling to grasp which level of am and carb a panel represents, especially when the labels are not provided. Accordingly, we can label the variables.
# Use the `labeller` option to control how labels are printed:
ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb  + am, labeller = "label_both")
facet_wrap in multiple variables.
R provides the facet_grid( ) function, which can be used to facet in two dimensions.
z <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
We store our basic plot in 'z' and thus we can make the additions:
z + facet_grid(. ~ cyl)   #col
z + facet_grid(cyl ~ .)   #row
z + facet_grid(gear ~ cyl,labeller = "label_both")  #row and col
using facet_grid( )
In facet_grid(. ~ cyl), the data are faceted by 'cyl' with the cylinders represented in columns. If we want 'cyl' in rows, we write facet_grid(cyl ~ .). To facet by 2 variables we write facet_grid(gear ~ cyl), where gears are represented in rows and 'cyl' in columns.

Adding text to the points.
Using ggplot2 we can display the values / labels for all the points. This is accomplished using geom_text( ).
#Adding texts to the points
ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
  geom_text(aes(label = am))
In geom_text we provide aes(label = am), which displays the corresponding level of 'am' for every point.
In the graph it can be seen that the labels of 'am' overlap the points, and it may become difficult to read the labels when there are many points. To avoid this we use the geom_text_repel function in the 'ggrepel' library.
library(ggrepel)
ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
  geom_text_repel(aes(label = am))
We load the ggrepel library with library( ). If we don't want the text to overlap we use geom_text_repel( ) instead of ggplot2's geom_text( ), keeping the argument aes(label = am).
Special thanks to  Ekta Aggarwal for her contribution in this article. She is a co-author of this article. She is a Data Science enthusiast, currently in the final year of her post graduation in statistics from Delhi University.

K Nearest Neighbor : Step by Step Tutorial

In this article, we will cover how the K-nearest neighbor (KNN) algorithm works and how to run it in R. It is one of the most widely used algorithms for classification problems.

K-Nearest Neighbor Simplified

Introduction to K-Nearest Neighbor (KNN)

KNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set. In simple words, it captures information from all training cases and classifies new cases based on similarity.
Predictions are made for a new instance (x) by searching the entire training set for the K most similar cases (neighbors) and summarizing the output variable of those K cases. In classification this is the mode (most common) class value.

How KNN algorithm works

Suppose we have height, weight and T-shirt size of some customers and we need to predict the T-shirt size of a new customer given only height and weight information we have. Data including height, weight and T-shirt size information is shown below -

Height (in cms) Weight (in kgs) T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L

Step 1 : Calculate Similarity based on distance function

There are many distance functions but Euclidean is the most commonly used measure. It is mainly used when data is continuous. Manhattan distance is also very common for continuous variables.

Distance Functions
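For two points measured on height and weight, both distances can be computed directly in R (a small sketch with illustrative values):

```r
# Euclidean vs. Manhattan distance between two (height, weight) points
p1 <- c(158, 58)
p2 <- c(161, 61)
euclidean <- sqrt(sum((p1 - p2)^2))  # sqrt(3^2 + 3^2), about 4.24
manhattan <- sum(abs(p1 - p2))       # 3 + 3 = 6
```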

The idea behind using a distance measure is to find the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.

New customer named 'Monica' has height 161cm and weight 61kg.

The Euclidean distance between the first observation (height 158, weight 58) and the new observation (Monica) is sqrt((161-158)^2 + (61-58)^2) = sqrt(18), which is about 4.24.
Similarly, we calculate the distance of all the training cases from the new case and rank them by distance. The smallest distance is ranked 1 and considered the nearest neighbor.

Step 2 : Find K-Nearest Neighbors

Let k be 5. Then the algorithm searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers are in. If 4 of them have a ‘Medium’ T-shirt size and 1 has a ‘Large’ T-shirt size, then your best guess for Monica is ‘Medium’. See the calculation shown in the snapshot below -

Calculate KNN manually
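Both steps can be reproduced in a few lines of base R; a minimal sketch using the T-shirt data from the table above:

```r
# Training data from the T-shirt table
train <- data.frame(
  height = c(158, 158, 158, 160, 160, 163, 163, 160, 163,
             165, 165, 165, 168, 168, 168, 170, 170, 170),
  weight = c(58, 59, 63, 59, 60, 60, 61, 64, 64,
             61, 62, 65, 62, 63, 66, 63, 64, 68),
  size   = c("M", "M", "M", "M", "M", "M", "M", "L", "L",
             "L", "L", "L", "L", "L", "L", "L", "L", "L")
)
monica <- c(161, 61)  # new customer: height, weight

# Step 1: Euclidean distance from Monica to every training case
d <- sqrt((train$height - monica[1])^2 + (train$weight - monica[2])^2)

# Step 2: take the k = 5 nearest neighbors and let them vote
k <- 5
nearest <- train$size[order(d)][1:k]
table(nearest)                     # 4 "M" and 1 "L"
names(which.max(table(nearest)))   # predicted size: "M"
```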

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' in blue and 'Large T-shirt size' in orange. The new customer is shown as a yellow circle. Four blue data points and one orange data point are closest to the yellow circle, so the prediction for the new case is the majority class among them, Medium T-shirt size.
KNN: Visual Representation

Assumptions of KNN

1. Standardization

When independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distance. For example, if one variable is based on height in cms and the other on weight in kgs, then height will influence the distance calculation more. In order to make them comparable we need to standardize them, which can be done with methods such as z-score standardization ((x - mean) / sd) or min-max normalization ((x - min) / (max - min)).
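As an example, z-score standardization can be done with R's built-in scale( ) function (a sketch with a small hypothetical data frame):

```r
# z-score standardization: (x - mean(x)) / sd(x) for each column
df <- data.frame(height = c(158, 160, 163, 170),
                 weight = c(58, 60, 61, 68))
df_std <- as.data.frame(scale(df))
colMeans(df_std)         # each column now has mean 0
apply(df_std, 2, sd)     # and standard deviation 1
```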

After standardization, 5th closest value got changed as height was dominating earlier before standardization. Hence, it is important to standardize predictors before running K-nearest neighbor algorithm.
Knn after standardization
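A minimal sketch using base R's scale(), which performs the z-score standardization described above (the customer values are illustrative):

```r
# Standardize height and weight so neither dominates the distance
customers <- data.frame(height = c(158, 160, 163, 165, 170),
                        weight = c(59, 60, 61, 62, 68))
std <- scale(customers)   # subtract column mean, divide by column sd

# Each standardized column now has mean 0 and standard deviation 1
colMeans(std)
apply(std, 2, sd)
```

Distances computed on `std` instead of `customers` then weight both predictors equally.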

2. Outlier

A low k value is sensitive to outliers, while a higher k value is more resilient to them, as it considers more voters when deciding the prediction.

Why KNN is non-parametric?

Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the number of model parameters effectively grows with the training data set - you can imagine each training case as a "parameter" in the model.

KNN vs. K-means

Many people confuse these two statistical techniques, K-means and K-nearest neighbors. Some of the differences are below -
  1. K-means is an unsupervised learning technique (no dependent variable) whereas KNN is a supervised learning algorithm (a dependent variable exists).
  2. K-means is a clustering technique which tries to split data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbors determines the classification of a new point by combining the classifications of the K nearest points.

Can KNN be used for regression?
Yes, K-nearest neighbors can be used for regression. In other words, the algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors.
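A minimal sketch of KNN regression with one predictor (the toy heights and weights are made up): instead of voting, the k nearest neighbors' values are averaged.

```r
# KNN regression by hand: predict weight (kg) from height (cm) on toy data
train_height <- c(158, 160, 163, 165, 170)
train_weight <- c(59, 60, 61, 62, 68)
new_height   <- 161

k <- 3
d <- abs(train_height - new_height)   # distance on a single predictor
nearest <- order(d)[1:k]

# Predicted value = average of the k nearest neighbors' weights
pred <- mean(train_weight[nearest])
pred
# mean(60, 61, 59) = 60
```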

Pros and Cons of KNN


  1. Easy to understand
  2. No assumptions about data
  3. Can be applied to both classification and regression
  4. Works easily on multi-class problems


  1. Memory Intensive / Computationally expensive
  2. Sensitive to scale of data
  3. Not work well on rare event (skewed) target variable
  4. Struggle when high number of independent variables
For any given problem, a small value of k will lead to a large variance in predictions. Alternatively, setting k to a large value may lead to a large model bias.

How to handle categorical variables in KNN?

Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike regression, create k dummies instead of (k-1). For example, a categorical variable named "Department" has 5 unique levels / categories, so we create 5 dummy variables. Each dummy variable is 1 for its department and 0 otherwise.
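A short sketch with a hypothetical 3-level "Department" (the 5-level case works the same way): model.matrix() with the intercept dropped creates one dummy per level.

```r
# Turn a categorical variable into k dummy variables (one per level)
df <- data.frame(Department = c("HR", "IT", "Sales", "IT", "HR"))

# "~ Department - 1" drops the intercept, so all k levels get a column
dummies <- model.matrix(~ Department - 1, data = df)
dummies
```

Each row has exactly one 1 (its own department) and 0s elsewhere, so the dummies can be fed to the distance calculation directly.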

How to find best K value?

Cross-validation is a smart way to find out the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model building process. 

Cross-validation (let's say 10-fold) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure repeats 10 times, with a different group of observations treated as the validation set each time. It results in 10 estimates of the validation error, which are then averaged.
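A minimal sketch of the fold assignment in base R (the sample size of 100 is illustrative):

```r
# Assign 100 observations to 10 cross-validation folds at random
n <- 100
folds <- sample(rep(1:10, length.out = n))   # each fold gets n/10 observations

# Every fold serves once as the 10% validation set
table(folds)
```

Looping over the 10 fold labels, training on the other 9 folds and scoring the held-out fold, then averaging the 10 error estimates gives the cross-validated error for a candidate k.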

K Nearest Neighbor in R

We are going to use historical data of past win/loss statistics and the corresponding speeches. This dataset comprises 1524 observations on 14 variables. The dependent variable is win/loss, where 1 indicates a win and 0 indicates a loss. The independent variables are:

1. Proportion of words in the speech showing 
a. Optimism
b. Pessimism
c. the use of Past
d. the use of Present
e. the use of Future

2. Number of times he/she mentions his/her own party

3. Number of times he/she mentions the opposing parties.

4. Some measure indicating the content of speech showing
a. Openness
b. Conscientiousness
c. Extraversion
d. Agreeableness
e. Neuroticism
f. Emotionality

Download Link : Data File

Read Data
# Read data
data1 = read.csv("US Presidential Data.csv")

We read the CSV file with the read.csv() function. The first argument is the name of the file. The second argument, header = TRUE (or T), implies that the first row of the CSV file contains the column headings, while header = FALSE (or F) indicates that the data should be read from the first line and has no headings.
# Load library
library(caret)

# Transforming the dependent variable to a factor
data1$Win.Loss = as.factor(data1$Win.Loss)
Here we use the caret package to run KNN. Since the dependent variable is numeric, we need to transform it to a factor using as.factor().
# Partitioning the data into training and validation data
set.seed(101)
index = createDataPartition(data1$Win.Loss, p = 0.7, list = F)
train = data1[index,]
validation = data1[-index,]
To partition the data into training and validation sets we use the createDataPartition() function from caret.

First we set the seed to 101 so that the same results can be reproduced. In createDataPartition() the first argument is the dependent variable; p denotes how much data we want in the training set - here we take 70% of the data for training and the rest for validation; list = F means the indices we obtain should be in the form of a vector.
# Explore data
dim(train)
dim(validation)
head(train)

The dimensions of the training and validation sets are checked via dim(). The first 6 rows of the training dataset are shown below -

   Win.Loss   Optimism  Pessimism  PastUsed FutureUsed PresentUsed OwnPartyCount
1        X1 0.10450450 0.05045045 0.4381443  0.4948454  0.06701031             2
3        X1 0.11257190 0.04930156 0.4159664  0.5168067  0.06722689             1
5        X1 0.10582640 0.05172414 0.3342618  0.5821727  0.08356546             3
7        X1 0.09838275 0.06401617 0.3240741  0.6018519  0.07407407             6
9        X1 0.10610734 0.04688464 0.3633540  0.5372671  0.09937888             2
10       X1 0.10066128 0.05951506 0.3554817  0.5382060  0.10631229             1
   OppPartyCount NumericContent Extra Emoti Agree Consc Openn
1              2    0.001877543 4.041 4.049 3.469 2.450 2.548
3              1    0.002131163 3.463 4.039 3.284 2.159 2.465
5              4    0.002229220 4.658 4.023 3.283 2.415 2.836
7              4    0.002251985 3.727 4.108 3.357 2.128 2.231
9              5    0.002446440 4.119 4.396 3.661 2.572 2.599
10             2    0.002107436 3.800 4.501 3.624 2.117 2.154

By default, the levels of the dependent variable in this dataset are "0" and "1". Later, when we do prediction, these levels will be used as variable names, so we need to make them valid variable names.
# Setting levels for both training and validation data
levels(train$Win.Loss) <- make.names(levels(factor(train$Win.Loss)))
levels(validation$Win.Loss) <- make.names(levels(factor(validation$Win.Loss)))
Here we are using the repeated cross-validation method via trainControl(). 'number' denotes the number of folds and 'repeats' denotes how many times the r-fold cross-validation is repeated. In this case, 3 separate 10-fold validations are used.
# Setting up train controls
repeats = 3
numbers = 10
tunel = 10

x = trainControl(method = "repeatedcv",
                 number = numbers,
                 repeats = repeats,
                 classProbs = TRUE,
                 summaryFunction = twoClassSummary)

Using the train() function we run our KNN. Win.Loss is the dependent variable; the full stop after the tilde denotes that all the other variables are used as independent variables. In 'data =' we pass our training set, 'method =' denotes which technique we want to deploy, and setting preProcess to center and scale means we are standardizing our independent variables:

center : subtract the mean from values.
scale : divide values by the standard deviation.

trControl takes our 'x', which was obtained via trainControl(), and tuneLength is an integer which controls how many candidate values of k are tried.
model1 <- train(Win.Loss~. , data = train, method = "knn",
               preProcess = c("center","scale"),
               trControl = x,
               metric = "ROC",
               tuneLength = tunel)

# Summary of model
model1
k-Nearest Neighbors 

1068 samples
13 predictor
2 classes: 'X0', 'X1'

Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 961, 962, 961, 962, 961, 962, ...
Resampling results across tuning parameters:

   k  ROC        Sens       Spec
   5  0.8440407  0.6910182  0.8382051
   7  0.8537506  0.6847658  0.8520513
   9  0.8575183  0.6712350  0.8525796
  11  0.8588422  0.6545296  0.8592152
  13  0.8585478  0.6560976  0.8556333
  15  0.8570397  0.6432249  0.8648329
  17  0.8547545  0.6448509  0.8627894
  19  0.8520574  0.6336043  0.8632867
  21  0.8484632  0.6215447  0.8627894
  23  0.8453320  0.6071622  0.8658664

ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 11.

Cross Validation : Fine Tuning

Finally, to make predictions on our validation set, we use the predict() function, in which the first argument is the fitted model and the second argument is the new data on which we want predictions. Setting type = "prob" returns class probabilities.

# Validation
valid_pred <- predict(model1, validation, type = "prob")

# Storing Model Performance Scores (requires the ROCR package)
library(ROCR)
pred_val <- prediction(valid_pred[, 2], validation$Win.Loss)

# Calculating Area under Curve (AUC)
perf_auc <- performance(pred_val, "auc")

# Plot ROC curve
perf_val <- performance(pred_val, "tpr", "fpr")
plot(perf_val, col = "green", lwd = 1.5)

# Calculating KS statistic
ks <- max(attr(perf_val, "y.values")[[1]] - attr(perf_val, "x.values")[[1]])

The Area under curve (AUC) on validation dataset is 0.8642.

Special thanks to Ekta Aggarwal for her contribution to this article. She is a co-author of this article and a Data Science enthusiast, currently in the final year of her postgraduate degree in statistics at Delhi University.

Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they're a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s Tensorflow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the "bars" to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science software (those used by fewer than 750 scholarly articles in 2016).

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 500% growth, but is still of little interest.


Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPAD Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.


Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down.  Minitab’s growth has also seemed to stall in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.


Feature Selection : Select Important Variables with Boruta Package

This article explains how to select important variables using the boruta package in R. Variable selection, also called 'feature selection', is an important step in a predictive modeling project. Every private and public agency has started tracking data and collecting information on various attributes, which gives a modeler access to a very large number of candidate predictors. But not every variable is important for a particular prediction task, so it is essential to identify important variables and remove redundant ones. Before building a predictive model, the exact list of important variables that yields an accurate and robust model is generally not known.

Why Variable Selection is important?
  1. Removing a redundant variable helps to improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
  2. Too many variables might result in overfitting, which means the model is not able to generalize patterns.
  3. Too many variables lead to slow computation, which in turn requires more memory and hardware.

Why Boruta Package?

There are a lot of packages for feature selection in R, so the question arises: "What makes the boruta package so special?" See the following reasons to use the boruta package for feature selection.
  1. It works well for both classification and regression problem.
  2. It takes into account multi-variable relationships.
  3. It is an improvement on random forest variable importance measure which is a very popular method for variable selection.
  4. It follows an all-relevant variable selection method in which it considers all features which are relevant to the outcome variable. Whereas, most of the other variable selection algorithms follow a minimal optimal method where they rely on a small subset of features which yields a minimal error on a chosen classifier.
  5. It can handle interactions between variables
  6. It can deal with the fluctuating nature of a random forest importance measure.
Boruta Package

Basic Idea of Boruta Algorithm
Shuffle the values of the predictors, join these shuffled copies to the original predictors, and then build a random forest on the merged dataset. Then compare the original variables with the randomised ones to measure variable importance. Only variables having higher importance than the randomised variables are considered important.

How Boruta Algorithm Works

Follow the steps below to understand the algorithm -
  1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.
  2. Shuffle the values of the added duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.
  3. Combine the original variables with the shuffled copies.
  4. Run a random forest classifier on the combined dataset and apply a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable, where higher means more important.
  5. Compute the Z score: the mean of accuracy loss divided by the standard deviation of accuracy loss.
  6. Find the maximum Z score among the shadow attributes (MZSA).
  7. Tag a variable as 'unimportant' when its importance is significantly lower than MZSA, and permanently remove it from the process.
  8. Tag a variable as 'important' when its importance is significantly higher than MZSA.
  9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.
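Steps 1-3 (building the shadow features) can be sketched in base R with made-up predictors; the random forest and Z-score steps (4-9) then run on the combined data:

```r
# Sketch of steps 1-3: build shadow (permuted) copies of the predictors
set.seed(1)
predictors <- data.frame(x1 = rnorm(10), x2 = runif(10), x3 = rpois(10, 3))

# Shuffle each column independently to break its link with the target
shadow <- as.data.frame(lapply(predictors, sample))
names(shadow) <- paste0("shadow_", names(predictors))

# Combine originals with shadow copies; a random forest importance
# measure is then computed on this merged dataset (steps 4-9)
combined <- cbind(predictors, shadow)
ncol(combined)   # twice the original number of predictors
```

Each shadow column keeps the marginal distribution of its original but loses any real relationship with the outcome, which is what makes it a useful importance baseline.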

Difference between Boruta and Random Forest Importance Measure

When I first learnt this algorithm, the question 'RF importance measure vs. Boruta' puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection approaches.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all the variables. But we cannot use this Z score directly as a measure of variable importance, because it is not directly related to the statistical significance of the variable importance. To work around this problem, the boruta package runs random forest on both original and random attributes and computes the importance of all variables. Since the whole process depends on permuted copies, the random permutation procedure is repeated to get statistically robust results.

Is Boruta a solution for all?
The answer is NO. You need to test other algorithms as well; it is not possible to judge the best algorithm without knowing the data and assumptions. Since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.

What is shuffled feature or permuted copies?

It simply means changing the order of the values of a variable. See the practical example below (the seed makes the output reproducible) -
set.seed(123)
mydata = data.frame(var1 = 1:6, var2 = runif(6))
shuffle = data.frame(apply(mydata, 2, sample))
head(cbind(mydata, shuffle))
  Original             Shuffled
  var1      var2  var1      var2
1    1 0.2875775     4 0.9404673
2    2 0.7883051     5 0.4089769
3    3 0.4089769     3 0.2875775
4    4 0.8830174     2 0.0455565
5    5 0.9404673     6 0.8830174
6    6 0.0455565     1 0.7883051

R : Feature Selection with Boruta Package

1. Get Data into R

The read.csv() function is used to read data from CSV and import it into R environment.
#Read data
df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
2. List of variables
#Column Names
names(df)
Result : "admit" "gre"   "gpa"   "rank"

3. Define categorical variables
df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)

4. Explore Data
#Summarize Data
summary(df)

#Check number of missing values
sapply(df, function(y) sum(is.na(y)))
 admit        gre             gpa        rank  
 0:273   Min.   :220.0   Min.   :2.260   1: 61 
 1:127   1st Qu.:520.0   1st Qu.:3.130   2:151 
         Median :580.0   Median :3.395   3:121 
         Mean   :587.7   Mean   :3.390   4: 67 
         3rd Qu.:660.0   3rd Qu.:3.670         
         Max.   :800.0   Max.   :4.000         

No missing values in the dataframe df.

Handle Missing Values
In this dataset, we have no missing values. If it exists in your dataset, you need to impute them before implementing boruta package.
5. Run Boruta Algorithm
#Install and load Boruta package
install.packages("Boruta")
library(Boruta)

# Run Boruta Algorithm
boruta <- Boruta(admit~., data = df, doTrace = 2)

Boruta performed 9 iterations in 4.870027 secs.
3 attributes confirmed important: gpa, gre, rank;
No attributes deemed unimportant.

It shows that all three variables are considered important and none is tagged 'unimportant'. The plot() function shows box plots of all the attributes plus the minimum, average and maximum shadow scores. Variables with green box plots are important, variables with red box plots are rejected, and yellow box plots indicate tentative attributes.
Tentative attributes have importance scores so close to their best shadow attribute that Boruta is unable to make a decision within the default number of random forest runs.
Box Plot - Variable Selection
As you can see above, the label of shadowMean is not displayed, as it got truncated due to insufficient space. To fix this problem, run the following program.
plot(boruta, xlab = "", xaxt = "n")
k <- lapply(1:ncol(boruta$ImpHistory), function(i)
  boruta$ImpHistory[is.finite(boruta$ImpHistory[, i]), i])
names(k) <- colnames(boruta$ImpHistory)
Labels <- sort(sapply(k, median))
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta$ImpHistory), cex.axis = 0.7)

Let's add some irrelevant data to our original dataset

This is to check whether the boruta package will be able to find unimportant variables. In the following program, we create duplicate copies of the original 3 variables and then randomise the order of values in these variables.
#Add some random permuted data
df.new = df
df.new$Random1 = sample(df$gre)
df.new$Random2 = sample(df$gpa)
df.new$Random3 = sample(df$rank)
df.new$Random1 = as.numeric(as.character(df.new$Random1))
df.new$Random2 = as.numeric(as.character(df.new$Random2))
> head(df.new)
  admit gre  gpa rank Random1 Random2 Random3
1     0 380 3.61    3     600    3.76       4
2     1 660 3.67    3     660    3.30       4
3     1 800 4.00    1     700    3.37       2
4     1 640 3.19    4     620    3.33       3
5     0 520 2.93    4     600    3.04       2
6     1 760 3.00    2     520    3.64       4

Run Boruta Algorithm
boruta2 <- Boruta(admit~., data = df.new, doTrace = 1)
Boruta performed 55 iterations in 21.79995 secs.
3 attributes confirmed important: gpa, gre, rank;
3 attributes confirmed unimportant: Random1, Random2, Random3;

The irrelevant variables we added to the dataset came out as unimportant as per the boruta algorithm.
> attStats(boruta2)
            meanImp   medianImp    minImp    maxImp   normHits  decision
gre      5.56458881  5.80124786  2.347609  8.410490 0.90909091 Confirmed
gpa      9.66289180  9.37140347  6.818527 13.405592 1.00000000 Confirmed
rank    10.16762154 10.22875211  6.173894 15.235444 1.00000000 Confirmed
Random1  0.05986751  0.18360283 -1.281078  2.219137 0.00000000  Rejected
Random2  1.15927054  1.35728128 -2.779228  3.816915 0.29090909  Rejected
Random3  0.05281551 -0.02874847 -3.126645  3.219810 0.05454545  Rejected

To save the final list of important variables in a vector, use the getSelectedAttributes() function.
#See list of finalvars
finalvars = getSelectedAttributes(boruta2, withTentative = F)
[1] "gre"  "gpa"  "rank"

In case you get tentative attributes in your dataset, you need to treat them; in this dataset we did not get any. The following function compares the median Z score of each tentative variable with the median Z score of the best shadow attribute and then decides whether the attribute should be confirmed or rejected.
Tentative.boruta <- TentativeRoughFix(boruta2)
List of parameters used in Boruta
  1. maxRuns: maximal number of random forest runs. Default is 100.
  2. doTrace: It refers to verbosity level. 0 means no tracing. 1 means reporting attribute decision as soon as it is cleared. 2 means all of 1 plus reporting each iteration. Default is 0.
  3. getImp : function used to obtain attribute importance. The default is getImpRfZ, which runs random forest from the ranger package and gathers Z-scores of mean decrease accuracy measure.
  4. holdHistory: The full history of importance runs is stored if set to TRUE (Default).

Compare Boruta with the RFE Algorithm

In caret, there is a variable selection algorithm called recursive feature elimination (RFE), also called backward selection. A brief explanation of the algorithm is given below -
1. Fit the model using all independent variables.
2. Calculate the variable importance of all the variables.
3. Rank each independent variable by its importance to the model.
4. Drop the weakest variable (worst ranked), build a model using the remaining variables and calculate model accuracy.
5. Repeat step 4 until all variables are used.
6. Variables are then ranked according to when they were dropped.
7. For regression, RMSE and R-Squared are used as metrics. For classification, it is 'Accuracy' and 'Kappa'.
In the code below, we build a random forest model inside the RFE algorithm. The function set 'rfFuncs' denotes random forest.
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
rfe <- rfe(df.new[,2:7], df.new[,1], rfeControl=control)
print(rfe, top=10)
plot(rfe, type=c("g", "o"), cex = 1.0)
head(rfe$resample, 10)
Outer resampling method: Cross-Validated (10 fold)

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         4   0.6477 0.1053    0.07009  0.1665
         6   0.7076 0.2301    0.06285  0.1580        *

The top 6 variables (out of 6):
   gpa, rank, gre, Random2, Random3, Random1

    RFE - Variable Selection
    In this case, RFE algorithm returned all the variables based on model accuracy. As compared to RFE, boruta final variables make more sense in terms of interpretation. It all depends on data and its variables' distribution. As an analyst, we should explore both the techniques and see which one works better for the dataset. There are many packages in R for variable selection. Every technique has pros and cons.

    The following functions can be used for model fitting in RFE selection:
    1. linear regression (lmFuncs)
    2. random forests (rfFuncs)
    3. naive Bayes (nbFuncs)
    4. bagged trees (treebagFuncs)

    Does Boruta handle multicollinearity?

    Multicollinearity means high correlation between independent variables. Its absence is an important assumption in linear and logistic regression models, because high collinearity inflates the standard errors of the coefficients (or estimates) and makes them unstable. Let's check whether the Boruta algorithm takes care of it. Let's create some sample data: in this case, three predictors x1-x3 and a target variable y.
    x1 <- runif(500)
    x2 <- rnorm(500)
    x3 <- x2 + rnorm(500, sd=0.5)   # x3 is built from x2, so they are highly correlated
    y <- x3 + runif(500)
    cor(x2, x3)
    [1] 0.8981247

    The correlation of variables x2 and x3 is very high (close to 0.9), which means they are highly correlated.
    library(Boruta)
    mydata = data.frame(x1,x2,x3)
    Boruta(mydata, y)
    Boruta performed 9 iterations in 7.088029 secs.
     2 attributes confirmed important: x2, x3;
     1 attributes confirmed unimportant: x1;

    Boruta considered both highly correlated variables to be important, which implies it does not treat collinearity while selecting important variables. This is because of the way the algorithm works.

    Important points related to Boruta
    1. Impute missing values - Make sure missing or blank values are filled in before running the Boruta algorithm.
    2. Collinearity - It is important to handle collinearity after getting the important variables from Boruta.
    3. Slow speed - It is slow compared to other traditional feature selection algorithms.
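    As a minimal sketch of point 1, missing values in a numeric data frame can be filled with column medians before calling Boruta (the data frame and column names here are hypothetical):

```r
# Hypothetical data frame with missing values
df <- data.frame(gre = c(520, NA, 700, 640), gpa = c(3.1, 3.5, NA, 3.9))

# Replace each NA with the median of the non-missing values in its column
for (col in names(df)) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
df  # no NA values remain; ready for Boruta
```

More sophisticated imputation (e.g. model-based) may be preferable; the point is only that Boruta requires a complete data set.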

    Sample quantiles: A comparison of 9 definitions

    According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows how to use SAS to visualize and compare the nine common definitions of sample quantiles. It also compares the default definitions of sample quantiles in SAS and R.

    Definitions of sample quantiles

    Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ ... ≤ x[N], and suppose that you are interested in estimating the p_th quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np), are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.

    Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common:

    • The value p and the sample size N are used to determine two adjacent data values, x[j] and x[j+1]. The quantile estimate will be in the closed interval between those data points. For the previous example, the quantile estimate would be in the closed interval between x[6] and x[7].
    • For many methods, a fractional quantity is used to determine an interpolation parameter, λ. For the previous example, the fractional quantity is (Np - j) = (6.4 - 6) = 0.4. If you use λ = 0.4, then an estimate of the 64th percentile would be the value 40% of the way between x[6] and x[7].
    • Each definition has a parameter m, 0 ≤ m ≤ 1, which determines how the method interpolates between adjacent data points. In general, the methods define the index j by using j = floor(Np + m). The previous example used m=0, but other choices include m=0.5 or values of m that depend on p.

    Thus a general formula for quantile estimates is q = (1 - λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.
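    To make the formula concrete, here is the running example (N = 10, p = 0.64, m = 0, so j = 6 and λ = 0.4) computed by hand in R and checked against quantile(). The data values are arbitrary; type = 4 is used for the check because that definition also uses m = 0.

```r
x <- c(1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 13.5, 15)  # N = 10 ordered values
N <- length(x); p <- 0.64

j      <- floor(N * p)        # j = 6
lambda <- N * p - j           # lambda = 0.4
q      <- (1 - lambda) * x[j] + lambda * x[j + 1]

# Definition 4 in quantile() also uses m = 0, so it matches the hand computation
all.equal(q, unname(quantile(x, probs = p, type = 4)))
```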

    You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary. The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1 and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.

    Compute all nine sample quantile definitions in SAS

    SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option to choose from the five definitions. The following table associates the five QNTLDEF= definitions in SAS to the corresponding definitions from H&F, which are also used by R. In R you choose the definition by using the type parameter in the quantile function.
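    In R, all nine H&F definitions are available through the type argument of quantile(), so you can see how much the definitions disagree on a small sample (the tiny data set used later in this post):

```r
x <- c(0, 1, 1, 1, 2, 2, 2, 4, 5, 8)  # tiny data set, N = 10

# The 85th percentile under each of the nine H&F definitions
sapply(1:9, function(t) quantile(x, probs = 0.85, type = t, names = FALSE))
```

The nine estimates are not all equal on a sample this small, which is exactly the point the next sections illustrate graphically.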

    SAS definitions of sample quantiles

    It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 - c.

    The following table presents the parameters for computing the four sample quantile definitions that are not natively supported in SAS:

    Definitions of sample quantiles that are not natively supported in SAS

    Visualizing the definitions of sample quantiles

    Visualization of nine definitions of sample quantiles, from Hyndman and Fan (1996)

    You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F. The differences between the definitions are most evident for small data sets and when there is a large "gap" between one or more adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, ..., 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.

    In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.

    You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.

    You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method, whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.
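    On the same tiny data set, the two defaults can be compared directly in R (type = 2 corresponds to the SAS default definition; type = 7 is the R default):

```r
x <- c(0, 1, 1, 1, 2, 2, 2, 4, 5, 8)

# Discrete method: always returns a data value or the average of two data values
quantile(x, probs = 0.85, type = 2, names = FALSE)  # returns 5, a data value

# Interpolation method: can return any value in the range of the data
quantile(x, probs = 0.85, type = 7, names = FALSE)  # returns 4.65, between x[8]=4 and x[9]=5
```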

    Comparison of the default quantile estimates in SAS and R on a tiny data set

    Does the definition of sample quantiles matter?

    As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.

    However, in practice analysts don't worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.

    The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantiles are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.

    Comparison of the default quantile estimates in SAS and R on a larger data set

    So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.

    In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.

    SAS program to compute nine sample quantiles.

    The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.


    Dueling Data Science Surveys: KDnuggets & Rexer Go Live

    What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.

    For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll.  Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.

    Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use.  It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors.  You can take the Rexer Analytics survey here (use Access Code M7UY4).  Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.

    As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).




    How to Integrate R with PHP

    This tutorial explains how to integrate R with PHP.

    Online reporting tools have gained popularity in recent years. There is a growing demand to implement advanced analytics in these tools. The use of advanced analytics helps solve various organizational problems such as retaining existing customers, acquiring new customers, and increasing customer satisfaction.

    PHP is one of the most popular programming languages for developing websites and online reporting tools. It has rich functionality for writing business logic; however, it is not effective when it comes to data science and machine learning. In the field of data science, R dominates in popularity among statisticians and data scientists, with over 10,000 packages.

    How to make PHP communicate with R

    There are times when you want to showcase the output of an R program, such as charts created from data a user submits on a web page. In that case, you might want your PHP-based web application to communicate with the R script.

    When it comes to PHP, it has a very useful function called exec(). It lets you execute an external program that you specify. We will use this very function to execute the R script we created. The script then generates the graph, and we will show the graph on our web page.

    The exec function can be used on both the Linux and Windows environments.

    1. On Linux, it runs the command and arguments you specify through the system shell.
    2. On Windows, it runs the command and arguments you specify through CMD.

    I will walk you through the process of integrating R code with a PHP web page, with code and explanation.

    Let’s first create a PHP based web form:

        <title>PHP and R Integration Sample</title>
        <div id="r-output" style="width: 100%; padding: 25px;">
          <?php
            // Execute the R script within the PHP code.
            // It generates the output as a test.png image.
            exec("Rscript sample.R");
          ?>
          <img src="test.png?ver=1.1" alt="R Graph" />
        </div>
    Now save the file as index.php under your /htdocs/PROJECT-NAME/index.php.

    Let’s create a sample chart using R code.

    Write the following code and save it as sample.R file.
    x <- rnorm(6, 0, 1)
    png(filename="test.png", width=500, height=500)
    hist(x, col="red")
    dev.off()
    rnorm(6, 0, 1) generates 6 random values with mean 0 and standard deviation 1. The dev.off() command closes the graphics device. Once the chart is created, it is saved as the test.png file.
    The only downside of this code is that it creates the same test.png file for every incoming request. If you are creating charts based on user-specified inputs, there will always be a single test.png file shared across different requests.
    Let’s understand the code

    As specified earlier, exec("Rscript sample.R"); will execute the R script. It in turn generates the test.png graph image.

    In the very next line, we used the HTML <img /> tag to display the image generated by the R program. We used src="test.png?ver=1.1", where ver=1.1 is a version string used to invalidate the browser cache and download the new image from the server.

    All modern browsers support caching. You might have noticed that some websites load much faster on repeat visits. That is because browsers cache images and other static resources for a brief period of time.

    How to serve concurrent requests?

    To avoid every request overwriting the same file, pass the sample size and the output file name to the R script as command-line arguments. Save the following as sample2.R:

    # Read the command-line arguments: sample size and output file name
    args <- commandArgs(TRUE)
    cols <- as.numeric(args[1])
    fname <- args[2]
    x <- rnorm(cols, 0, 1)
    fname <- paste(fname, "png", sep = ".")
    png(filename=fname, width=500, height=500)
    hist(x, col="red")
    dev.off()
        <title>PHP and R Integration Sample</title>
        <div id="r-output" style="width: 100%; padding: 25px;">
          <?php
            // Generate a random file name so each request gets its own image.
            $filename = "samplefile".rand(1,100);
            // Execute the R script, passing the sample size and file name as arguments.
            exec("Rscript sample2.R 6 ".$filename);
          ?>
          <img src="<?php echo $filename; ?>.png?ver=1.1" alt="R Graph" />
        </div>
    This eliminates the need to use the same test.png file name for every request. I have used $filename = "samplefile" with a random number appended to the end; rand(min, max) generates a random number in that range.

    It fixes the file overwriting issue, and you will be able to handle concurrent requests and serve each with a unique set of image(s).

    You might also need to take care of removing old files. If you are on a Linux machine, you can set up a cron job that finds and deletes chart image files that are older than 24 hours.

    Here is the code to find and remove files:

    // set the path to your chart image directory
    $dir = "images/temp/";
    // loop through all the chart png files inside the directory
    foreach (glob($dir."*.png") as $file) {
        // if the file is more than 24 hours old, delete it
        if (filemtime($file) < time() - 86400) {
            unlink($file);
        }
    }


    Making PHP communicate with R and showcase the result is very simple. You just need to understand the exec() function and a little PHP code in case you want to delete the residual files/images generated by your R program.

    Author Bio
    This article was originally written by Darshan Joshi; Deepanshu later gave the post a final touch. Darshan is a programming enthusiast who loves to help developers in every possible way. He is the founder of AlphansoTech, a web application development company. You can connect with him on Twitter and LinkedIn.