Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however, only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data were collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages (C, C++, C#, FORTRAN, MATLAB, Java, and Python) are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s TensorFlow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig. 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science software (those used in fewer than 750 scholarly articles in 2016).

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth, but is still of little interest.
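The calculation behind a plot like Figure 2c can be sketched in a few lines of SAS. This is only an illustration: the data set name, the variable names, and the two rows of counts are invented placeholders, not figures from this analysis.

```sas
/* Hypothetical counts: one row per package, articles found in each year */
data growth ;
input software :$12. n2015 n2016 ;
/* Year-over-year percent change */
pct_change = 100 * (n2016 - n2015) / n2015 ;
/* Subsetting IF: keep only packages with at least 500 articles in 2015 */
if n2015 >= 500 ;
datalines ;
ToolA 2 12
ToolB 1000 1150
;
run ;

proc print data=growth ;
run ;
```

The subsetting IF drops the tiny package even though its percent change is large, which is the point of the cutoff.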


Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.


Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note, however, that due to changes in Google’s search algorithm, the trend lines have shifted before, as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down. Minitab’s growth also seems to have stalled in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.


MANOVA from beginning to end: Reliability

Where is the Multivariate Analysis of Variance?

You promised there would be MANOVA! Now we’re in the third post!

First there was recoding of variables.

Then, there was creating scales. 

Now, we’re looking at reliability.

Patience is a virtue.

Before we get to doing a MANOVA we want to be sure that our dependent and independent variables are reliable and valid. Let’s move on to reliability.

I’m going to do a correlation matrix and a Cronbach alpha, which is a measure of internal consistency. The rationale is that if items all measure the same construct – say, knowledge of health practices, or autonomy, or acceptance of wife beating – then those items should be related to one another. An alpha of 0 would indicate that the covariances among the items in the scale are zero, so your scale sucks. An alpha of .95 would mean your scale is amazingly consistent.
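For reference, the standard formula for the raw Cronbach alpha of a scale with k items can be written as follows, where the symbols are my notation rather than anything from the SAS output: σ_i² is the variance of item i, σ_ij is the covariance between items i and j, and σ_X² is the variance of the summed scale.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right),
\qquad
\sigma_X^{2} = \sum_{i=1}^{k}\sigma_i^{2} + 2\sum_{i<j}\sigma_{ij}
```

If every covariance σ_ij is zero, the variance of the sum equals the sum of the item variances, the ratio is 1, and alpha comes out to 0. The larger the covariances are relative to the item variances, the closer alpha gets to 1.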

So, I ran three analyses, one for each of my three scales.

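Title "Health Variables" ;
proc corr data=example alpha ;
var hbs1 hbs3-hbs7 ;
run ;

Title "Wife beating variables" ;
proc corr data=example alpha ;
var GR34 - GR39 ;
run ;

Title "Decision Variables" ;
proc corr data=example alpha ;
* VAR list assumed to match the decision items summed later in this post ;
var D_GR1A GR2A D_GR3A D_GR4A GR5A GR6A D_GR7A GR8A
D_GR9A GR9F D_GR10A D_GR12A GR10F GR12F ;
run ;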

Let’s skip the simple statistics (means, etc.) you get from these analyses and go to the alpha.


The alpha for the health scale is pretty bad. The value for the raw scores is .31 and, for standardized items, still really bad at .32. When we look at how deleting a variable would improve the alpha, if we dropped the first variable, the alpha would go up to .34 – but that is still awful.

For the wife-beating scale the alpha was .81 for both the raw and the standardized values. So, that one was pretty good as far as reliability.

I put all of the decision variables together, the ones on whether the woman was involved in making decisions, could go places on her own, needed to ask permission to go places. The Cronbach alpha for the raw variables was .65, for standardized variables .81. Note that standardized variables are placed on the same metric, so my idea of some variables being much more important than others did not pan out.

So … I standardized the variables, then I read in that data set and created two scales, one that was a sum of the decision  variables and the other that was the mean of the 6 wife-beating variables. There was no particular reason for using the mean of the six variables as opposed to just adding them up. I did both methods to show it was an option.

BEWARE THE SUM FUNCTION – Note, I did not use the sum function. If you add up the values, as shown below, and one of the variables has a missing value then the value of the sum is going to be missing. If you used the SUM function, the variables that have non-missing values would be added up, so the missing value would be treated as a zero. There are times where that is acceptable. This is not one of those times.
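Here is a tiny sketch of the difference between the plus sign and the SUM function when a value is missing. The variable names a, b and c are made up for the illustration.

```sas
data _null_ ;
a = 1 ;
b = . ;   /* missing value */
c = 2 ;
plus_total = a + b + c ;      /* any missing operand makes the result missing */
sum_total  = sum(a, b, c) ;   /* SUM skips missing values and adds the rest */
put plus_total= sum_total= ;
run ;
```

The log shows plus_total as missing and sum_total as 3, along with a note that missing values were generated.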

While I’m at it, I want to check whether the scales have approximately normal distributions. A perfectly normal distribution would have skewness and kurtosis values of 0 (SAS reports excess kurtosis, so 0 is the value for a normal curve).

proc standard data=example mean=0 std=1 out=MAN_data;
run ;

Data create_manova ;
set man_data ;
* I could have used the mean function here, but I did not ;
decision = D_GR1A + GR2A + D_GR3A + D_GR4A + GR5a + GR6A + D_GR7A + GR8A +
D_GR9A + GR9F + D_GR10A + D_GR12A + GR10F + GR12F ;
beating = mean(of gr34-gr39);
run ;

proc univariate data=create_manova ;
var decision beating ;
run ;
The skewness values were relatively low, -1.3 and 0.2 for the two scales, and the kurtosis values were 2.0 and -1.2. Since my scales aren’t a radical departure from normality, I’m now going on to MANOVA – finally!

When I am not writing about statistics, I’m making games that teach math, social studies and language.

Check them out.



MANOVA from beginning to end : Creating the scales

Last time, we saw how to recode variables to score answers correct or incorrect, on a rating scale and weighted by importance. Today, we’re going to look at creating some scales from those variables because for reasons I’m sure I have written about at some point in the past, single items are usually not very reliable. Whether you use SAS, SPSS, R or any other statistical package, you are still going  to need to follow the steps of recoding your variables and creating and validating your scales before you get into MANOVA. Or, at least, you will if you are smart.

First, I want to check that there are no obvious errors or other problems in my data.
PROC MEANS DATA=example ;
VAR gr2A -- gr39 hbs1 --d_gr12a ;
RUN ;

You could type in the variable names but that is a lot of typing. The double dash means: include all variables in the data set, in order, from the variable named before the dashes through the one named after them. How do you know what order the variables are in? Click on the OUTPUT DATA tab at the top and look to the left under COLUMNS.


If you didn’t just run a program creating your data and hence don’t have an OUTPUT DATA tab, you can find your data file by clicking the MY LIBRARIES tab and then clicking on the library (directory) where your data are kept and clicking on the dataset to open it. You can also use the PROC CONTENTS procedure but today we are being all pointy and clicky with SAS Studio.

Sometimes you will see something like:

VAR item1 - item12 ;

The single dash is used for variables that end in a number and if you don’t have item1, item2 all the way through item12, it will give you an error and not run. Then you will be sad.
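To see the two kinds of variable lists side by side, here is a small self-contained sketch. The data set DEMO and its variables are invented for the illustration.

```sas
/* Made-up data: three numbered items followed by two other variables */
data demo ;
input item1 item2 item3 age score ;
datalines ;
1 2 3 30 10
2 3 4 40 20
;
run ;

/* Single dash: numbered range, needs item1, item2 and item3 to all exist */
proc means data=demo ;
var item1 - item3 ;
run ;

/* Double dash: positional range, everything from item1 through age */
proc means data=demo ;
var item1 -- age ;
run ;

/* If you prefer code to clicking, VARNUM lists variables in data set order */
proc contents data=demo varnum ;
run ;
```

The second PROC MEANS picks up item1, item2, item3 and age because that is their order in the data set, not because of their names.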

PROC MEANS will give you the N, mean, standard deviation, minimum and maximum.

Here are a few things to consider.

  • Is the N substantially less than you had expected? If so, you have a lot of missing data and you should investigate that. The lowest N I have is 37,814 out of 39,430 people, so not bad, but I might want to look at that one item, since most of the items have close to 39,000 for an N.
  • Is your standard deviation zero? STOP RIGHT THERE!  On just what variable could 39,000 people give the same response? This likely shows a big problem with your data. I did not have that problem, so I continued.
  • Are your minimum and maximum the minimum and maximum possible scores for the item? Now, this may not always be the case. On a scale of 1 to 10, say, with a sample of 50 people, maybe no one will say 1. However, I have over 39,000 people and the items are scored 0 or 1, 0-2, or 1-3, so I should have people from the minimum to the maximum or something is wrong. Nothing is wrong, and I continue.
  • Are the means about what you expect? Well, I’m not really an expert on social structure and family relations in India, so I can’t say. About a third of the women said it was usual for a husband to beat his wife if her dowry was not what was expected. About three-fourths said they would be allowed to visit a family or friend’s home alone.
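If you want the missing-data count right next to the other numbers in this checklist, one option is to ask PROC MEANS for it explicitly. This is a sketch rather than what was run for this post; the NMISS and MAXDEC= choices are my additions.

```sas
/* N and NMISS side by side make the missing-data check easy;
   STD, MIN and MAX cover the other items on the list */
proc means data=example n nmiss mean std min max maxdec=2 ;
var gr2A -- gr39 hbs1 -- d_gr12a ;
run ;
```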

Okay, so my results from the means procedure look okay. Now what?

Next, I’m going to do a factor analysis to see if my supposition is supported of three scales related to health, beating your wife and autonomy.

Here is the code for my factor analysis.

VAR gr2A -- gr39 hbs1 --d_gr12a ;

This is actually the second one I ran. In inspecting the results for the first, between the eigenvalues and scree plot, I decided that at most I should retain five factors. I’ve written a lot about factor analysis on this blog previously, so I’m not going to go into detail here.  In short, the decision-making variables mostly loaded on the first factor with factor loadings of .70 and higher. The median communality estimate for those items was about .67.  In short, considerable evidence for a decision-making factor. The wife-beating variables loaded on the second factor. All but one loaded above .67, and even that variable (Beating your wife if she had an extramarital affair – which 84% of the women said was accepted in their communities) loaded at .40. The variables regarding needing permission to go places loaded on the third factor and also had high communality estimates. The variables regarding going places by yourself loaded on the fourth factor and also had high communality estimates.

The health variables were a different story. Four out of six loaded between .47 and .67 on the fifth factor. The other two did not load on any factor.
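The PROC FACTOR statement itself does not appear above, so here is a hedged sketch of what a five-factor run could look like. NFACTORS=5 follows the decision described in the text; the varimax rotation and the SCREE option are assumptions on my part, not the settings actually used for these results.

```sas
/* Sketch only: rotation method and plot option are assumed */
proc factor data=example nfactors=5 rotate=varimax scree ;
var gr2A -- gr39 hbs1 -- d_gr12a ;
run ;
```

The output includes the eigenvalues, the rotated factor loadings and the final communality estimates, which are the pieces discussed above.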

At this point, it is starting to look like it is okay to retain the wife-beating items as a scale. The various measures of autonomy – decision-making, going places on your own and needing permission – seem to hang together within factors. I think it would be reasonable to put all three of these together in one scale. I talked about parceling in the past, and I could have done that as a step here, and then re-run the factor analysis to support (or not) my supposed autonomy factor. Since I have limited time and am simply doing this analysis for educational and illustrative purposes, I skipped over this to the next procedure, which is reliability analysis.

Since this post is pretty long already, I’ll save that for the next post.



MANOVA beginning to end: Recoding Data is Part of the Process

Other people want to go see the new Wonder Woman movie. I’ve been wanting to talk about MANOVA, but first, we need some decent dependent and independent measures.

I have the India Human Development Survey data on over 39,000 women and my hypothesis is that education is related to women’s rights issues, especially autonomy, health practices knowledge and domestic violence. I also think that mobility might be related, as women who get out of their native village might be exposed to new ideas.

Before I can test out my (supposedly) brilliant hypotheses, I need to create some variables because it turns out when they were collecting data in India in 2011 they were not thinking about my convenience. (Yes, I, too, am appalled by this lack of consideration.)

Independent Variables

First, I will need to create my independent variables from

EW11 Differences in family by mobility

1= same village/ town

2= another village

3 = another town

4 = metro (since only 1% fall in here, I’m going to delete this category)

and education (see below)

Items that will go into dependent variables (maybe)


HB1 Milk harmful

HB3. 1st milk good for baby 

Hb4 chulha smoke good

Hb5 child diarrhea drink more

Hb6 illness spread through water

Hb7 malaria spread


The items below are scored 1 if the respondent decides, 0 if the respondent does not decide. (More than 1 person can decide, so if both husband and wife decide, the answer will be 1 for both. In this case, I just looked at if the wife had a say in the decision.)

  • GR1a Cooking
  • GR2A Expensive purchases.      
  • GR3A Decides number of children
  • GR4A Decides what to do if sick
  • GR5A Decides whether to buy land  
  • GR6A Decides wedding expense
  • GR7A Decides if child is sick
  • GR8A Decides who your children should marry

The items below are scored 1 if the woman is allowed to do these things alone and 0 if she is not.

  • GR9F Can visit health center alone
  • GR10F Can visit relative/ friend alone
  • GR12F. Can go short distance alone

These items relate to whether the woman needs to ask permission for activities, with  0 = no, 1 = must inform someone and 2 = yes

  • GR9A Ask permission to visit health center
  • GR10A Ask permission to visit relative
  • GR12A. Ask permission to travel by bus/train



GR34 – GR39 – All of these relate to the circumstances under which it is acceptable for a husband to beat his wife, coded 1 = yes or 0 = no.

As you can see, well, I hope you can see, each of these presents a different data recoding problem.

  • Mobility and education need to be coded into categories (there is a minor reason I will explain in a later post why this is not necessary but convenient), with the fourth category deleted.
  • Health questions need to be scored as correct or incorrect.
  • Decision questions are all scored equally – so deciding what food  to cook and how many children you have are each scored a 1. I think that’s not right and I want to weight some decisions more than others.
  • Independence questions need to be reverse coded, so not asking permission is a 2 and asking permission is a 0
  • Wife-beating questions need no recoding

So … here we go. The first thing we’re going to do is create categories. Notice I don’t do anything with the category 4 for mobility, so those people will just have a missing value for MOBILITY and be dropped from the analysis.

Also, a note on ELSE as opposed to just IF statements.

I could just use all IF statements but that would be inefficient. It doesn’t really matter here with 39,000 records but if I had millions it would slow down processing. The ELSE statement is only processed if the preceding IF statement is false.

NOTE!!!  In the second set of IF- ELSE statements, I have

else if ew8 < 9 and ew8 ne . then education = "ELEM";

This statement is only executed IF the preceding IF statement was false.  Without the ELSE, everything less than 9, including those who had 0 years of education, would be set to ELEM.  Without the and ew8 ne .  in this statement, anyone that had missing data would be set to ELEM along with anyone who had 1-8 years of education.
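The reason the missing-value check matters is that SAS treats a missing numeric value as smaller than any number, so a missing ew8 passes the test ew8 < 9. A small sketch, using a handful of made-up values of ew8 and a hypothetical variable named unguarded, shows both points:

```sas
data _null_ ;
length education $4 unguarded $13 ;
do ew8 = ., 0, 5, 12 ;
   education = " " ;
   if ew8 = 0 then education = "NONE" ;
   else if ew8 < 9 and ew8 ne . then education = "ELEM" ;
   else if ew8 > 8 then education = "HS +" ;
   /* Same comparison with no ELSE and no missing-value guard */
   if ew8 < 9 then unguarded = "would be ELEM" ;
   else unguarded = "not ELEM" ;
   put ew8= education= unguarded= ;
end ;
run ;
```

In the log, the missing value and the zero both come out as "would be ELEM" in the unguarded version, while the guarded IF-ELSE chain leaves education blank for the missing case and assigns NONE to zero.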

data example ;
set mydata.india ;
If EW11 = 1  then Mobility = "None" ;
else if EW11 = 2 then mobility = "Vill" ;
else if EW11 = 3 then mobility = "TOWN";

if ew8 = 0 then education = "NONE" ;
else if ew8 < 9 and ew8 ne . then education = "ELEM";
else if ew8 > 8 then education = "HS +";

*** The statements below recode the health items ;

*** For hb1 the correct answer is 0, so  1-hb1   will score respondents who said 0 as correct (= 1) and those who said 1 as incorrect (=0);

*** For hb3 the correct answer is 1, so respondents who said 1 are scored as correct (= 1) and those who said any number higher than 1 as incorrect (=0);

*** For hb4 – hb7, the correct answer is scored as correct (=1) and any numbers in the incorrect set scored as incorrect (=0);
hbs1 = 1- hb1 ;

If hb3 = 1 then hbs3 = 1 ;
Else if hb3 > 1 then hbs3 = 0 ;
If hb4 = 2 then hbs4 = 1 ;
Else if hb4 in (1,3) then hbs4 = 0 ;
If hb5 = 2 then hbs5 = 1 ;
Else if hb5 in (1,3,4) then hbs5 = 0 ;
If hb6 = 2 then hbs6 = 1 ;
Else if hb6 in (1,3,4) then hbs6 = 0 ;

If hb7 = 3 then hbs7 = 1 ;
Else if hb7 in (1,2,4) then hbs7 = 0 ;



**** Here, I multiplied items by a factor based on my estimation of importance ;
D_GR1A = GR1A* 0.5 ;
D_GR4A = GR4A *2 ;
D_GR7A = GR7A *2 ;

**** These items are coded 0 to 2, so they are subtracted from 2 and does not have to tell anyone = 2 ;

****  Needs to inform someone = 1 and needs to ask permission = 0 ;
D_GR9A = 2 - GR9A ;
D_GR10A = 2 - GR10A ;
D_GR12A = 2 - GR12A ;

Keep EW8 EW5  Ew6 EW10  EW14a   EW12a EW12b
D_GR9A GR9F D_GR10A D_GR12A GR10F GR12F GR34 - GR39 mobility education;
run ;

So, there we go. You might think I would dive into a Multivariate Analysis of Variance now but you would be wrong. The next thing I am going to do is check the validity of my scales through a combination of factor analysis, univariate statistics and reliability analysis. Only after  that step will I do the MANOVA.


An Introduction to Repeated Measures ANOVA

I’m teaching a course on multivariate statistics and for some of the students it’s been a minute since their last inferential statistics course.

So, I have been doing a few videos here and there to refresh, for example, what is a repeated measures ANOVA and why you might want to do it.


Sometimes I use repeated measures ANOVA to test whether our games are effective in improving math scores (they are!). You can check out the games here.


If you are interested in being a beta tester for our first bilingual game that teaches statistics, please email info@7generationgames.com


Using Characterize Data Task to Inspect Data Quality

Since I had done a few YouTube videos on using SAS Studio, I thought I would add them to my blog. This one uses the Characterize Data task to take a quick look at the data, but I suppose you could have guessed that from the title.


Support my day job AND get smarter. Buy Fish Lake for Mac or Windows. Brush up on math skills and canoe the rapids.


For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV


Dueling Data Science Surveys: KDnuggets & Rexer Go Live

What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.

For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll.  Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.

Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use.  It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors.  You can take the Rexer Analytics survey here (use Access Code M7UY4).  Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.

As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).




Pointy, Clicky Propensity Score Matching With SAS

Hopefully, you have read my Beginner’s Guide to Propensity Score matching or through some other means become aware of what the hell propensity score matching is. Okay, fine, how do you get those propensity scores?

Think about this carefully for a moment. If you are using quintiles, you are matching people by which group they fit into as far as probability of being in the treatment group. So, if your friend, Bob, has a predicted probability of 15% of being in the treatment group, and that puts him in the lowest 20% of predicted probabilities in your sample, his quintile would be 1, that is, the bottom fifth. If your other friend, Luella, has a predicted probability of 57%, and that falls between the 40th and 60th percentiles, then she is in the third quintile.

Oh, if only there were a means of getting the predicted probability of being in a certain category – oh, wait, there is!

Let’s do binary logistic regression with SAS Studio

First, log into your SAS Studio account.

Second, you probably need to run a program with a LIBNAME statement to make your data available. In this example I’m going to use one of the SASHELP data sets and create a data set in my WORK library, so I don’t need a LIBNAME for that but, as you will see, I do need it later. Here is the program I ran.

data psm_ex ;
set sashelp.heart ;
if smoking = 0 then smoker = 0 ;
else if smoking > 0 then smoker = 1;
WHERE weight_status ne "Underweight" ;
run ;

libname mydata "/courses/blahblah/c_123/" ;


My question is if I had people who had the same propensity to smoke, based on age, gender, etc. would smoking still be a factor in the outcome (in this case, death). To answer that, I need propensity scores.

Third, in the window on the left, click on TASKS AND UTILITIES, then STATISTICS and select BINARY LOGISTIC REGRESSION, as shown below.


Next,  choose the data set you want by clicking on the thing under the word DATA that looks like a table of data and selecting the library and data set in that library. Next, under RESPONSE, click the + sign and select the dependent variable for which you want to predict the probability. In this case, it’s whether the person is a smoker or not. Click the arrow next to EVENT OF INTEREST and pick which you want to predict, in this case, your choices are 0 or 1. I selected 1 because I want to predict if the person is  a smoker.

Below that, select your classification variables.



There is also a choice for continuous variables (not shown) on the same screen.  I selected AGEATSTART.

I’m going to select the defaults for everything but OUTPUT. Click the arrow at the top of the screen next to MODEL and keep clicking until you see the OUTPUT tab. Click on the box next to CREATE OUTPUT DATASET. Browse for a directory where you want to save it.  I had set that directory in my LIBNAME statement (remember the LIBNAME statement) so it would be available to save the data. Select that directory and give the data set a name.

Click the arrow next to PREDICTED VALUES and in the 3 boxes that appear below it, click the box next to predicted values.



After this, you are ready to run your analysis. Click the image of the little running guy above.  When your analysis runs you will have a data set with all of your original data plus your predicted scores.
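If you would rather type than click, roughly the same analysis can be sketched with PROC LOGISTIC. Treat this as an approximation of what the task generates, not a copy of it: SEX as the classification variable is my assumption, while the output data set name (mydata.statspsm) and the predicted-probability variable (pred_) are the names used in the code further down.

```sas
/* Assumes the PSM_EX data set and the MYDATA libname created earlier */
proc logistic data=work.psm_ex ;
class sex / param=glm ;                        /* classification variable: an assumption */
model smoker(event='1') = sex ageatstart ;     /* predict the probability of being a smoker */
output out=mydata.statspsm predicted=pred_ ;   /* original data plus predicted probability */
run ;
```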



Now, we just need to compute quintiles. You could find the quintiles by doing this:

proc freq data=mydata.statspsm ;
tables pred_ ;
run ;

and looking in the cumulative percent column for the 20th, 40th, etc. percentiles.

However, an easier way if you have thousands of records is

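proc univariate data=mydata.statspsm ;
var pred_ ;
* Naming the output data set avoids guessing which DATAn name SAS assigns ;
output out=quintiles pctlpre=P_ pctlpts= 20 to 80 by 20;
run ;
proc print data=quintiles ;
run ;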

Which will give you the percentiles.
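Another option, sketched here as an alternative rather than as part of the steps above, is to let PROC RANK assign each record to a quintile directly instead of reading cut points off a table. The output data set name psm_quintiles and the variable name quintile are hypothetical.

```sas
proc rank data=mydata.statspsm out=psm_quintiles groups=5 ;
var pred_ ;
ranks quintile ;   /* 0 = lowest fifth of pred_ through 4 = highest fifth */
run ;

/* Quick check that the five groups are about the same size */
proc freq data=psm_quintiles ;
tables quintile ;
run ;
```

Note that GROUPS=5 numbers the groups from 0 to 4, so add 1 if you want quintiles labeled 1 through 5.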

Support my day job AND get smarter. Buy Fish Lake for Mac or Windows. Brush up on math skills and canoe the rapids.

girl in canoe

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV


A Beginner’s Guide to Propensity Score Matching

One advantage of writing this blog for almost a decade is that there are a lot of topics I have already covered. However, software moving at the speed that it does, there are always updates.

So, today I’m going to recycle a couple of older posts that introduce you to propensity score matching. Then, tomorrow, I will show you how to get your propensity scores with just pointing and clicking with a FREE (as in free beer) version of SAS.


Before you even THINK about doing propensity score matching …

Propensity score matching has had a huge rise in popularity over the past few years. That isn’t a terrible thing, but in my not so humble opinion, many people are jumping on the bandwagon without thinking through if this is what they really need to do.

The idea is quite simple – you have two groups which are non-equivalent, say, people who attend a support group to quit being douchebags and people who don’t. At the end of the group term, you want to test for a decline in douchebaggery.

However, you believe that people who don’t attend the groups are likely different from those who do in the first place: bigger douchebags, younger, and, it goes without saying, more likely to be male.

The very, very important key phrase in that sentence is YOU BELIEVE.

Before you ever do a propensity score matching program you should test that belief and see if your groups really ARE different. If not, you can stop right now. You’d think doing a few ANOVAs, t-tests or cross-tabs in advance would be common sense. Let me tell you something, common sense suffers from false advertising. It’s not common at all.
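A quick check along those lines might look like the sketch below. Everything in it is hypothetical: the data set (mydata.groups), the group indicator (attended) and the covariates (age, male) are stand-ins for whatever you actually collected.

```sas
/* Hypothetical names: mydata.groups, attended, age, male */

/* Do the two groups differ on a continuous covariate? */
proc ttest data=mydata.groups ;
class attended ;
var age ;
run ;

/* Do the two groups differ on a categorical covariate? */
proc freq data=mydata.groups ;
tables attended*male / chisq ;
run ;
```

If neither test shows a difference worth worrying about, that is a hint you may not need propensity scores at all.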

Even if there are differences between the groups, it may not matter unless it is related to your dependent variable, in this case, the Unreliable Measure of Douchebaggedness.

For more information, you can read the whole post here. Also read the comments, because they make some good points.

What type of Propensity Score Matching is for you? A statistics fable

Once upon a time there were statisticians who thought the answer to everything was to be as precise, correct and “bleeding edge” as possible. If their analyses were precise to 12 decimal places instead of 5, of course they were better because, as everyone knows, 12 is more than 5 (and statisticians knew it better, being better at math than most people).

Occasionally, people came along who suggested that newer was not always better, that perhaps sentences with the word “bleeding” in them were not always reflective of best practices, as in,

“I stuck my hand in the piranha tank and now I am bleeding.”

Such people had their American Statistical Association membership cards torn up by a pack of wolves and were banished to the dungeon where they were forced to memorize regular expressions in Perl until their heads exploded. Either that, or they were eaten by piranhas.

Perhaps I am exaggerating a tad bit, but it is true that there has been an over-emphasis on whatever is the shiniest new technique on the block. Before my time, factor analysis was the answer to everything. I remember when Structural Equation Modeling was the answer to everything (yes, I am old). After that, Item Response Theory (IRT) was the answer to everything. Multiple imputation and mixed models both had their brief flings at being the answer to everything. Now it is propensity scores.

A study by Sturmer et al. (2006) is just one example of a few recent analyses that have shown an almost logarithmic growth in the popularity of propensity score matching, from a handful of studies in the late nineties to everybody and their brother.

You can read the rest of the post about choosing a method of propensity score matching here. If your clicking finger is tired, the takeaway message is this: quintiles, which are much simpler, faster to compute and easier to explain, are generally just as effective as more complex methods.

Now that we are all excited about quintiles, the next couple of posts will show you how to compute those in a mostly pointy-clicky manner.



SAS vs SPSS for Teaching Multivariate Analysis in Social Sciences

I have to choose between SAS and SPSS for a new course in multivariate statistics. You can take it up with the university if you like, but these are my only two options, in part because the course is starting soon.

I need to decide in a few days which way to go. Here are my very idiosyncratic reasons for one versus the other:

  • SPSS
      • There is a really good textbook on multivariate statistics that I think would be perfect for these students and it uses SPSS. The book is Advanced and Multivariate Statistics by Mertler & Vannatta, in case you were wondering.
      • SPSS can be installed pretty easily on the desktop and these are pretty non-technical students, so that’s a plus.
      • The point and click interface for SPSS is pretty easy and similar to Excel, which most people have used.
      • Personally, I haven’t used SPSS in a while so it would be nice to use something different.

  • SAS
      • Students can just register and go to the website to use SAS Studio.
      • Structural equation modeling and other advanced statistics procedures are built in, not an add-on.
      • SAS Studio is free vs $80 or so for students and $260 for the professor (i.e., me) to buy SPSS academic versions, including the add-ons needed.
      • I’m more familiar with SAS and find it easier to code than SPSS syntax.

I’ve toyed with the idea of showing both options, but that uses up class time better spent on teaching, for example, how to interpret a factor loading or AIC.

My big objection to SAS is I can’t find a recent textbook that is good for a multivariate analysis course that is in a social sciences department. The best one is by Cody and that is from 2005. I also use a couple of chapters from the Hosmer & Lemeshow book on Applied Logistic Regression, but I need something that covers factor analysis, repeated measures ANOVA and, hopefully, MANOVA and discriminant function analysis, too.

I think most of these students have careers in non-profits and they are not going to be creating new APIs to analyze tweets or anything using enormous databases, so the ability to analyze terabytes is moot. This will probably be their second course in statistics and maybe their first introduction to statistical software.

Suggestions are more than welcome.


P.S. You can skip the hateful comments on why SAS and SPSS both suck and I should be using R, Python or whatever your favorite thing is. Universities don’t usually give carte blanche. These are my two choices.

P.P.S. You can also skip the snarky comments on how doctoral students should have a lot more statistics courses, all take at least a year of Calculus, etc. Even if I might agree with you, they don’t and I need tools that work for the students in my classes, not some hypothetical ideal student.
