Sasensei (さ先生) Beta Launch

Think you know SAS?  Take the Sasensei challenge! We've developed a question based learning system that will put your SAS knowledge to the test.

Sasensei is a kick ass ninja training tool for an expanding range of topics including SAS Programming, Administration, Viya, Risk Dimensions, DI Studio, AF / SCL, Stored Process Web Apps and more.

You begin with 8 tokens, and it costs 2 tokens to play.  Consecutive correct answers unlock hidden levels!

Once you earn the right to submit new questions (cards), your token earning ability gets a dragon-kick.  Not to mention your SAS skills - the best way to learn, is to teach.  Take on the ultimate challenge in SAS - become a Sasensei!

We're releasing new features every week so check in regularly for updates.  And most importantly - good or bad - we need your feedback.

See you at the dojo - www.sasensei.com.


WUSS 2017: The Papers

Western Users of SAS Software 2017 logoThe Western Users of SAS Software 2017 conference is coming to Long Beach, CA, September 20-22.  I have been to a lot of SAS conferences, but WUSS is always my favorite because it is big enough for me to learn a lot, but small enough to be really friendly.

If you come I hope you will catch my presentations.  If you want a preview or if you can’t come, click the links below to download the papers.

On Wednesday, I will once again present SAS Essentials, a whirlwind introduction to SAS programming in just three hours specially designed for people who are new to SAS.

How SAS Thinks: SAS Basics I

Introduction to DATA Step Programming: SAS Basics II

Introduction to SAS Procedures: SAS Basics III

Then on Friday Lora Delwiche will present a Hands-On Workshop about SAS Studio, a new SAS interface that runs in a web browser.

SAS Studio: A New Way to Program in SAS

I hope to see you there!




Choose Your SAS Interface

A while back, I wrote about the proliferation of interfaces for writing SAS programs.  I am reposting that blog here (with a few changes) because a lot of SAS users still don’t understand that they have a choice.

These days SAS programmers have more choices than ever before about how to run SAS.  They can use the old SAS windowing enviroment (often called Display Manager because, face it, SAS windowing environment is way too vague), or SAS Enterprise Guide, or the new kid on the block: SAS StudioAll of these are included with Base SAS.


Display Manager / SAS Windowing Environment


SAS Enterprise Guide


SAS Studio

I recently asked a SAS user, “Which interface do you use for SAS programming?”

She replied, “Interface?  I just install SAS and use it.”

“You’re using Display Manager,” I explained, but she had no idea what I was talking about.

Trust me.  This person is an extremely sophisticated SAS user who does a lot of leading-edge mathematical programming, but she didn’t realize that Display Manager is not SAS.  It is just an interface to SAS.

This is where old timers like me have an advantage.  If you can remember running SAS in batch, then you know that Display Manager, SAS Enterprise Guide, and SAS Studio are just interfaces to SAS–wonderful, manna from heaven–but still just interfaces.  They are optional.  It is possible to write SAS programs in an editor such as Word or Notepad++, and copy-and-paste into one of the interfaces or submit them in batch.  In fact, here is a great blog by Leonid Batkhan describing how to use your web browser as a SAS code editor.

Each of these interfaces has advantages and disadvantages.  I’m not going to list them all here, because this is a blog not an encyclopedia, but the tweet would be

“DM is the simplest, EG has projects, SS runs in browsers.”

I have heard rumors that SAS Institute is trying to develop an interface that combines the best features of all three.  So someday maybe one of these will displace the others, but at least for the near future, all three of these interfaces will continue to be used.

So what’s your SAS interface?


Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the was data collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each are used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s Tensorflow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers are growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that is either has either declined from one-time greatness, such as BMDP, or which is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science (those used by fewer than 750 scholarly articles in 2016.

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 500% growth, but is still of little interest.


Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPAD Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.


Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down.  Minitab’s growth has also seemed to stall in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.


Dueling Data Science Surveys: KDnuggets & Rexer Go Live

What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.

For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll.  Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.

Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use.  It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors.  You can take the Rexer Analytics survey here (use Access Code M7UY4).  Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.

As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).




Check number of observations in SAS dataset

This post explains how to determine the number of observations in a SAS dataset. Most of the times we need to check whether a SAS dataset is empty or not. In macro, we generally tell SAS to go to the next iteration only when SAS dataset is non-empty. In this post, we will see various methods to count number of  rows (records) in SAS table.

Method I : Proc SQL Count (Not Efficient)

In the example below, we will use CARS dataset from SASHELP library. This dataset contains 428 observations and 15 columns.

The easiest method is to use count(*) in Proc SQL. It returns all rows (missing plus non-missing rows) in a dataset.
proc sql;
 select count(*) as N from sashelp.cars;
Result : 428

In case you want to store it in a macro variable, you can use INTO : keyword.
proc sql noprint;
 select count(*) into :N from sashelp.cars;

%put &N;
This will print the number of records in SAS log. Check log after running the above program.

Is it an efficient method?
No, it is not efficient at all. It does not use metadata information of SAS dataset. Instead it reads through each record (row) of your SAS dataset. It takes a long time to do it in big SAS tables. However, it is a simple and handy trick to calculate the number of rows in a SAS dataset.

Method 2 : Descriptor Portion (Efficient)

Before getting into detail, we need to understand the descriptor portion and how it works -

SAS dataset consists of the following two portion -
  1. Descriptor portion. It constitutes information about name of dataset, number of observations and variables, creation date, engine type.
  2. Data portion. It stores values of data.
This method is one of the most efficient way to count observations in a SAS table as it uses metadata information and does not search in dataset.
data _NULL_;
 if 0 then set sashelp.cars nobs=n;
put "no. of observations =" n;
  1. The 'if 0' statement  does not process at execution time because IF statement does not hold TRUE. The whole IF THEN statement is used to pull the header information of the data set and later hand over to the compiler to adjust it to the PDV.
  2. NOBS is a SAS automatic variable which contains the number of rows in a dataset i.e. SASHELP.CARS dataset.
  3. NOBS = N puts the returns count of records in the variable n.
  4. The STOP statement is used to stop an endless loop.
Like the first method, we can keep it in a macro variable. See the implementation below -
data _NULL_;
 if 0 then set sashelp.cars nobs=n;
 call symputx('totobs',n);
%put no. of observations = &totobs;
SAS Output

CALL SYMPUT is one of the method to create a SAS macro variable in data step. In this case, we have used a newer function i.e. CALL SYMPUTX which left justifies and trims trailing blanks from a numeric value. If you want to stick to the old style CALL SYMPUT, you can write like below -
call symput('totobs',left(n));

3. Proc SQL Dictionary Method (Efficient)

Like second method, we can use metadata information of a dataset with PROC SQL Dictionary.Tables.
proc sql noprint;
 select nobs into :totobs separated by ' ' from dictionary.tables
 where libname='SASHELP' and memname='CARS';
%put total records = &totobs.;
Proc SQL Dictionary.Tables

It is an efficient method as it does not look into each values of  a dataset to determine the count. The LIBNAME= refers to the name of the library in which data is stored. The MEMNAME= refers to SAS table (dataset). The separated by ' ' is used in this case to left align the numeric value.

4. Macro Language Method (Efficient)

This method also uses metadata information but it is via the macro language using DATA step functions. The OPEN function is used to open a data. The ATTRN function returns the value of a numeric attribute for a SAS data set. When it is used with the NOBS argument, it returns the number of observations. Later we are closing the opened dataset using CLOSE function.
%macro totobs(mydata);
    %let mydataID=%sysfunc(OPEN(&mydata.,IN));
    %let NOBS=%sysfunc(ATTRN(&mydataID,NOBS));
    %let RC=%sysfunc(CLOSE(&mydataID));
%put %totobs(sashelp.cars);

SAS : Check if it is empty table

Suppose you only need to check whether a table is empty or not. You can use the same logic as explained above. And if the returned value is 0, write 'Empty Data' in log. Otherwise, count the number of records.
data _NULL_;
if 0 then set sashelp.cars nobs=n;
if n = 0 then put 'empty dataset';
else put 'Not empty. Total records=' n;
Result : Not Empty. Total records = 428

Let's create a blank dataset to check the above code. The following program returns empty dataset as 1=2 condition does not meet.
proc sql noprint;
create table temp as
select * from sashelp.cars
where 1 = 2;
Try it yourself!

Let's wrap the above code in a SAS macro
%macro emptydataset (inputdata=);
data _NULL_;
 if 0 then set &inputdata. nobs=n;
 call symputx('totobs',n);
%if &totobs. = 0 %then %put Empty dataset;
%else %do;
%put TotalObs=&totobs;
Result : TotalObs=428

Result : Empty dataset

If you think it's difficult to memorize sas code of descriptor portion method, you can use the code below.
data _NULL_;
set sashelp.cars nobs=N;
if _N_ = 2 then stop;
put N;
SAS log
It reads only first two observations from the dataset. See log above.

Introducing The Little SAS Enterprise Guide Book

Image of The Little SAS Enterprise Guide BookThere is a new member of The Little SAS Book family: The Little SAS Enterprise Guide Book.  This book is for people who use SAS Enterprise Guide as their interface.  EG is a point-and-click interface for SAS, but it’s great for programmers too.  And this book includes a new chapter covering the wonderful features for SAS programmers along with discussion of special issues that they face.

If you are familiar with our other EG books, you may be wondering why this one isn’t called the “Fourth Edition.”  That is because we changed the title slightly.  Our previous EG books were each written for a specific version of EG, and consequently had the version number right in the title.  This book was written using EG 7.1, but it also applies to some earlier versions (5.1 and 6.1).  With a little luck, this book will also apply to future versions.  So it’s a keeper.

I’m very pleased with how this book has turned out.  Lora Delwiche and I updated it so that all the windows and icons match the current EG, and we also added some great new sections.  Even with the new topics, this book is 60 pages shorter than our previous EG book!  It is shorter because we replaced some chapters on specific types of tasks, with a new chapter that explains how tasks work in general.  The result is a book that is easier to read and more useful.

For more information about this book including the table of contents, an excerpt, and reviews, click here.


Forrester’s 2017 Take on Tools for Data Science

In my ongoing quest to track The Popularity of Data Science Software, I’ve updated the discussion of the annual report from Forrester, which I repeat here to save you from having to read through the entire document. If your organization is looking for training in the R language, you might consider my books, R for SAS and SPSS Users or R for Stata Users, or my on-site workshops.

Forrester Research, Inc. is a company that provides reports which analyze the competitive position of tools for data science. The conclusions from their 2017 report, Forrester Wave: Predictive Analytics and Machine Learning Solutions, are summarized in Figure 3b. On the x-axis they list the strength of each company’s strategy, while the y-axis measures the strength of their current offering. The size and shading of the circles around each data point indicate the strength of each vendor in the marketplace (70% vendor size, 30% ISV and service partners).

As with Gartner 2017 report discussed above, IBM, SAS, KNIME, and RapidMiner are considered leaders. However, Forrester sees several more companies in this category: Angoss, FICO, and SAP. This is quite different from the Gartner analysis, which places Angoss and SAP in the middle of the pack, while FICO is considered a niche player.

Figure 3b. Forrester Wave plot of predictive analytics and machine learning software.

In their Strong Performers category, they have H2O.ai, Microsoft, Statistica, Alpine Data, Dataiku, and, just barely, Domino Data Labs. Gartner rates Dataiku quite a bit higher, but they generally agree on the others. The exception is that Gartner dropped coverage of Alpine Data in 2017. Finally, Salford Systems is in the Contenders section. Salford was recently purchased by Minitab, a company that has never been rated by either Gartner or Forrester before as they focused on being a statistics package rather than expanding into machine learning or artificial intelligence tools as most other statistics packages have (another notable exception: Stata). It will be interesting to see how they’re covered in future reports.

Compared to last year’s Forrester report, KNIME shot up from barely being a Strong Performer into the Leader’s segment. RapidMiner and FICO moved from the middle of the Strong Performers segment to join the Leaders. The only other major move was a lateral one for Statistica, whose score on Strategy went down while its score on Current Offering went up (last year Statistica belonged to Dell, this year it’s part of Quest Software.)

The size of the “market presence” circle for RapidMiner indicates that Forrester views its position in the marketplace to be as strong as that of IBM and SAS. I find that perspective quite a stretch indeed!

Alteryx, Oracle, and Predixion were all dropped from this year’s Forrester report. They mention Alteryx and Oracle as having “capabilities embedded in other tools” implying that that is not the focus of this report. No mention was made of why Predixion was dropped, but considering that Gartner also dropped coverage of then in 2017, it doesn’t bode well for the company.


Jobs for “Data Science” Up 7-fold, for “Statistician” Down by Half

The Bureau of Labor Statistics projects that jobs for statisticians will grow by 34% between 2014 and 2024. However, according to the nation’s largest job web site, the number of companies looking for “statisticians” is actually in sharp decline. Those jobs are likely being replaced by postings for “data scientists.”

I regularly monitor the Popularity of Data Science Software, and as an offshoot of that project, I collected data that helps us understand how the term “data science” is defined. I began by finding jobs that required expertise in software used for data science such as R or SPSS. I then examined the tasks that the jobs entailed, such as “analyze data,” and looked up jobs based only on one task at a time. I switched back and forth between searching for software and for the terms used to describe the jobs, until I had a comprehensive list of both.  In the end, I had searched for over 50 software packages and over 40 descriptive terms or tasks. I had also skimmed thousands of job advertisements. (Additional details are here).


Search Terms 2/26/2017 2/17/2014 Ratio
Big Data 20,646            10,378 1.99
Data analytics 15,774              6,209 2.54
Machine learning 12,499              3,658 3.42
Statistical analysis 11,397              9,719 1.17
Data mining   9,757              7,776 1.25
Data Science 6,873                  973 7.06
Quantitative analysis 4,095              3,365 1.22
Business analytics  4,043              2,867 1.41
Advanced Analytics 3,479              1,497 2.32
Data Scientist 3,272                 974 3.36
Statistical software 2,835              2,102 1.35
Predictive analytics                 2,411              1,497 1.61
Artificial intelligence  2,404                 794 3.03
Predictive modeling 2,264              1,804 1.25
Statistical modeling 2,040              1,462 1.40
Quantitative research                 1,837              1,380 1.33
Research analyst                 1,756              1,722 1.02
Statistical tools                 1,414              1,121 1.26
Statistician 904              1,711 0.53
Statistical packages                    784                 559 1.40
Survey research 440                 559 0.79
Quantitative modeling                    352                  322 1.09
Statistical research 208                  174 1.20
Statistical computing                    153                  108 1.42
Research computing                    133                    97 1.37
Statistical analyst                    125                  141 0.89
Data miner 34                    19 1.79

Many terms were used outside the realm of data science. Other terms were used both in data science jobs and in jobs that require little analytic skill. Terms that could not be used to specifically find data science jobs were: analytics, data visualization, graphics, data graphics, statistics, statistical, survey, research associate, and business intelligence. One term, econometric(s), required deep analytical skills, but was too focused on one field.

The search terms that were well-focused on data science, but not overly focused in a single field are listed in the following table. The table is sorted by the number of jobs found on Indeed.com on February 26, 2017. While each column displays counts taken on a single day, the large size of Indeed.com’s database of jobs keeps its counts stable. The correlation between the logs of the two counts is quite strong, r=.95, p= 4.7e-14.

During this three-year period, the overall unemployment rate dropped from 6.7% to 4.7%, indicating a period of job growth for most fields. Three terms grew very rapidly indeed with “data science” growing 7-fold, and both “data scientist” and “artificial intelligence” tripling in size. The biggest surprise was that the use of the term “statistician” took a huge hit, dropping to only 53% of its former value.

That table covers a wide range of terms, but only on two dates. What does the long-term trend look like? Indeed.com has a trend-tracking page that lets us answer that question. The figure below shows solid the growth in the percentage of advertisements that used the term “data scientist” (blue, top right), while those using the term “statistician” (yellow, lower right) are steadily declining.

The plot on the company’s site is interactive (the one shown here is not) allowing me to see that the most recent data points were recorded on December 27, 2016. On that date, the percentage of jobs for data scientist were 474% of those for statistician.

As an accredited professional statistician, am I worried about this trend? Not at all. Statistical analysis software has broadened its scope to include many new capabilities including: machine learning, artificial intelligence, Structured Query Language, advanced visualization techniques, interfaces to Python, R, and Apache Spark. The software has changed because the job known as “statistician” has changed. Statisticians aren’t going away, their jobs are evolving into what we now know as data science. And that field is growing quite nicely!


Predictive Modeling Interview Questions and Answers

Predictive modeling knowledge is one of the most sought-after skill today. It is in demand these days. It is being used in almost every domain ranging from finance, retail to manufacturing. It is being looked as a method of solving complex business problems. It helps to grow businesses e.g. predictive acquisition model, optimization engine to solve network problem etc.

It is not easy to get into these roles as it requires technical understanding of various statistical techniques and machine learning algorithms with tools like SAS/R/Python. Hence, it is important to prepare well before going for interview. To help you in interview preparation, I’ve jot down most frequently asked interview questions on logistic regression, linear regression and predictive modeling concepts. In general, an analytics interview process includes multiple rounds of discussion. Possible rounds are as follows -
  1. Technical Round on Statistical Techniques and Machine Learning Concepts
  2. Technical Round on Programming Languages such as SAS/R/Python/SQL
  3. Managerial Round on Business/Domain Knowledge
During these multiple rounds of interviews, they also check your communication skill and logical/ problem solving skill.
Predictive Modeling Interview Questions

Let's start with a list of some basic and tricky predictive modeling interview questions with answers.

1. What are the essential steps in a predictive modeling project?

It consists of the following steps -
  1. Establish business objective of a predictive model
  2. Pull Historical Data - Internal and External
  3. Select Observation and Performance Window
  4. Create newly derived variables
  5. Split Data into Training, Validation and Test Samples
  6. Clean Data - Treatment of Missing Values and Outliers
  7. Variable Reduction / Selection
  8. Variable Transformation
  9. Develop Model
  10. Validate Model
  11. Check Model Performance
  12. Deploy Model
  13. Monitor Model

2. What are the applications of predictive modeling?

Predictive modeling is mostly used in the following areas -
  1. Acquisition - Cross Sell / Up Sell
  2. Retention - Predictive Attrition Model
  3. Customer Lifetime Value Model
  4. Next Best Offer
  5. Market Mix Model
  6. Pricing Model
  7. Campaign Response Model
  8. Probability of Customers defaulting on loan
  9. Segment customers based on their homogenous attributes
  10. Demand Forecasting
  11. Usage Simulation
  12. Underwriting
  13. Optimization - Optimize Network

3. Explain the problem statement of your project. What are the financial impacts of it?

Cover the objective or main goal of your predictive model. Compare monetary benefits of the predictive model vs. No-model. Also highlights the non-monetary benefits (if any).

4. Define observation and performance window?
Tutorial : Observation and Performance Window

5. Difference between Linear and Logistic Regression?

Two main difference are as follows -
  1.  Linear regression requires the dependent variable to be continuous i.e. numeric values (no categories or groups). While Binary logistic regression requires the dependent variable to be binary - two categories only (0/1). Multinomial or ordinary logistic regression can have dependent variable with more than two categories.
  2. Linear regression is based on least square estimation which says regression coefficients should be chosen in such a way that it minimizes the sum of the squared distances of each observed response to its fitted value. While logistic regression is based on Maximum Likelihood Estimation which says coefficients should be chosen in such a way that it maximizes the Probability of Y given X (likelihood)
Please note there are more than 10 difference between these two techniques, refer the link below -

6. How to handle missing values?

We fill/impute missing values using the following methods. Or make missing values as a separate category.
  1. Mean Imputation for Continuous Variables (No Outlier)
  2. Median Imputation for Continuous Variables (If Outlier)
  3. Cluster Imputation for Continuous Variables
  4. Imputation with a random value that is drawn between the minimum and maximum of the variable [Random value = min(x) + (max(x) - min(x)) * ranuni(SEED)]
  5. Impute Continuous Variables with Zero (Require business knowledge)
  6. Conditional Mean Imputation for Continuous Variables
  7. Other Imputation Methods for Continuous  - Predictive mean matching, Bayesian linear regression, Linear regression ignoring model error etc.
  8. WOE for missing values in categorical variables
  9. Decision Tree, Random Forest, Logistic Regression for Categorical Variables
  10. Decision Tree, Random Forest works for both Continuous and Categorical Variables
  11. Multiple Imputation Method

7. How to treat outliers?

There are several methods to treat outliers -
  1. Percentile Capping
  2. Box-Plot Method
  3. Mean plus minus 3 Standard Deviation
  4. Weight of Evidence

8. Explain Dimensionality / Variable Reduction Techniques

Unsupervised Method (No Dependent Variable)
  1. Principal Component Analysis (PCA)
  2. Hierarchical Variable Clustering (Proc Varclus in SAS)
  3. Variance Inflation Factor (VIF)
  4. Remove zero and near-zero variance predictors
  5. Mean absolute correlation. Removes the variable with the largest mean absolute correlation. See the detailed explanation of mean absolute correlation
Supervised Method (In respect to Dependent Variable)

For Binary / Categorical Dependent Variable
  1. Information Value
  2. Wald Chi-Square
  3. Random Forest Variable Importance
  4. Gradient Boosting Variable Importance
  5. Forward/Backward/Stepwise - Variable Significance (p-value)
  6. AIC / BIC score
For Continuous Dependent Variable
  1. Adjusted R-Square
  2. Mallows' Cp Statistic
  3. Random Forest Variable Importance
  4. AIC / BIC score
  5. Forward / Backward / Stepwise - Variable Significance

9. Explain equation of logistic regression model

10. What is multicollinearity and how to deal it?

Multicollinearity implies high correlation between independent variables. It is one of the assumptions in linear and logistic regression. It can be identified by looking at VIF score of variables. VIF > 2.5 implies moderate collinearity issue. VIF >5 is considered as high collinearity.

It can be handled by iterative process : first step - remove variable having highest VIF and then check VIF of remaining variables. If VIF of remaining variables > 2.5, then follow the same first step until VIF < =2.5

11. How VIF is calculated and interpretation of it?

VIF measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. If the VIF of a predictor variable were 9 (√9 = 3) this means that the standard error for the coefficient of that predictor variable is 3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.

Steps of calculating VIF
  1. Run linear regression in which one of the independent variable is considered as target variable and all the other independent variables considered as independent variables
  2. Calculate VIF of the variable. VIF = 1/(1-RSquared)

12. Do we remove intercepts while calculating VIF?

No. VIF depends on the intercept because there is an intercept in the regression used to determine VIF. If the intercept is removed, R-square is not meaningful because it may be negative in which case one can get VIF < 1, implying that the standard error of a variable would go up if that independent variable were uncorrelated with the other predictors.

13. What is p-value and how it is used for variable selection?

The p-value is lowest level of significance at which you can reject null hypothesis. In the case of independent variables, it implies whether coefficient of a variable is significantly different from zero,

14. How AUC, Concordance and Discordance are calculated?

15. Explain important model performance statistics
  1. AUC > 0.7. No significant difference between AUC score of training vs validation.
  2. KS should be in top 3 deciles and it should be more than 30
  3. Rank Ordering. No break in rank ordering.
  4. Same signs of parameter estimates in both training and validation

16. Explain Gain and Lift Charts
Check out this tutorial : Understanding Gain and Lift Charts

17. Explain collinearity between continuous and categorical variables. Is VIF a correct method to compute collinearity in this case?

Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity. It means changing the reference category of dummy variables can avoid collinearity. Pick a reference category with highest proportion of cases.

VIF is not a correct method in this case. VIFs should only be run for continuous variables. The t-test method can be used to check collinearity between continuous and dummy variable.
We can also safely ignore collinearity between dummy variables. To avoid high VIFs in this case, just choose a reference category with a larger fraction of the cases

18. Assumptions of Linear Regression Model
Linear Regression Explained

19. How WOE and Information Value are calculated?
WOE and Information Value Explained

20. Difference between Factor Analysis and PCA?

The main 3 difference between these two techniques are as follows -
  1. In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
  2. Principal Components Analysis is used as a variable reduction technique whereas Factor Analysis is used to understand what constructs underlie the data.
  3. In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.

21. What would happen if you define event incorrectly while building a model?

Suppose your target variable is attrition. It's a binary variable - 1 refers to customer attrited and 0 refers to active customer. In this case, your desired outcome is 1 in attrition since you need to identify customers who are likely to leave.

Let's say you set 0 as event in the logistic regression.
Logistic Regression Output.

  1. The sign of estimates would be opposite which imply opposite behavior of variables towards target variable (as shown in the image above).
  2. Area under curve (AUC), Concordance and Discordance scores would be exactly same. No change.
  3. Sensitivity and Specificity score would be swapped (see the image below).
  4. No change in Information Value (IV) of variables.

Sensitivity and Specificity

22. What is Fisher Scoring in Logistic Regression?

Logistic regression estimates are calculated by maximizing the likelihood function. The maximization of the likelihood is obtained by an iterative method called Fisher's scoring. It's an optimization technique. In general, there are two popular iterative methods for estimating the parameters of a non-linear equations. They are as follows -
  1. Fisher's Scoring
  2. Newton-Raphson
Both are similar except that the Newton-Raphson uses matrix of second order derivatives of log-likelihood function and Fisher uses Information Matrix. In SAS, the default optimization method in PROC LOGISTIC is Fisher's Scoring.

The algorithm completes when the convergence criterion is satisfied or when the maximum number of iterations has been reached. Convergence is obtained when the difference between the log-likelihood function from one iteration to the next is small.

Technical Interview Questions on SAS and R

The following is a list of SAS/R technical interview questions that are generally asked. It includes some tricky questions which requires hands-on experience.

  1. Difference between INPUT and PUT Functions
  2. How to generate serial numbers with SAS
  3. Difference between WHERE and IF statements
  4. Difference between '+' operator and SUM Function
  5. Use of COALESCE Function
  6. Difference between FLOOR and CEIL functions
  7. How to use arrays to recode all the numeric variables
  8. Number of ways you can create macro variables
  9. Difference between MERGE and SQL Joins
  10. How to calculate cumulative sum in SAS
You would find answers of the above questions in the links below -

  1. Difference between sort() and order() functions
  2. Popular R packages for decision tree
  3. How to transpose data in R
  4. How to remove duplicates in R
  5. Popular packages to handle big data
  6. How to perform LEFT join in R
  7. How R handles missing values
  8. How to join vertically two data frames
  9. Use of with() and by() functions
  10. Use of which() function
Check out the link below for solutions of the above questions plus other interview questions on R

SQL and Excel

Prior to interview, you can also look at questions on SQL concepts and Advanced Excel. SQL and Excel are still the most widely used tools for basic and intermediate analytics.
Back to Top