24 Feb

Whipping your data into shape with SAS: Part 1 for today

I’m sure I’ve written about this before – after all, I’ve been writing this blog for 10 years – but here’s something I’ve been thinking about:

Most students don’t graduate with nearly enough experience with real data.

You can use government websites with de-identified data from surveys, and I do, but I teach primarily engineering and business students, so it would be helpful to have some business data, too. Unfortunately, businesses aren’t lining up to hand me their financial, inventory and manufacturing data (bunch of jerks!).

So, I downloaded a free app, Medica Scientific, from the app store and ran a simulation of data for a medical device company. Some friends did the same, which gave me 4 data sets, as if from 4 different companies.

Now that I have 4 Excel files with the data, before you get to uploading a file, I’m going to give you a tip. By default, SAS imports the first worksheet. So, move the worksheet you want into first position. In this case, it’s a worksheet named “Financials”. Since SAS will use the first worksheet regardless, it could just as well be named “A whale ate my sandwich”, but it wouldn’t be as obvious.

While you are at it, take a look at the data; the variable names should be in the first row. ALWAYS give your data at least a cursory glance. If it is millions of records, opening the file isn’t feasible, and we cover other ‘quick looks’ in class.

1. Upload the file into the desired directory
2. Under Tasks and Utilities select Utilities and then Import Data
3. Click select file and then navigate to the folder where your file is and click open
4. You’ll see a bunch of code but nothing actually happens until you click on the little running guy.

[Image: menus to select data to import]

First, select the data set.

[Image: the Import Data window]

Have you clicked the running guy? Good!


Okay, now you have your code. Not only has SAS imported your data file into SAS, it’s also written the code for you.

FILENAME REFFILE '/home/annmaria/examples/simulation/Tech2Demo.xlsx';
PROC IMPORT DATAFILE=REFFILE DBMS=XLSX OUT=WORK.IMPORT1;
GETNAMES=YES; /* read the variable names from the first row */
RUN;
PROC CONTENTS DATA=WORK.IMPORT1; /* a quick look at what was imported */
RUN;

Now, if you had a nice professor who only gave you one data set, you would be done, which is why I showed you the easy way to do it.

However, very often, we want to compare several factories or departments or whatever it is.

Also, life comes with problems. Sigh.

One of your problems, which you’d notice if you opened the data set, is that the variables have names like “Simulation Day”. I don’t want spaces in my variable names.

My second problem is that I need to upload all of my files and concatenate them so I have one long file.

Let’s attack both of these at once. First, upload the rest of your files.

Now, open a new SAS program and, at the top of your file, put this:

OPTIONS VALIDVARNAME=V7 ;

It will make life easier in general if your variable names don’t have spaces in them. The option above automatically recodes the variable names to valid SAS names, replacing the spaces with underscores, so “Simulation Day” becomes Simulation_Day.

Now, to import the next 3 files, just create a new SAS program and copy and paste the code created by your IMPORT procedure FOUR TIMES (yes, four – once for each file, including the first).

From Captain Obvious:

[Image: Captain Obvious wearing her obvious hat]

Although you’d think this would be obvious, experience has shown that I need to say it.

  • Do NOT copy the code in this blog post. Copy the code produced by your own IMPORT procedure; it will have your own directory name.
  • Do NOT name every output data set IMPORT1 because if you do, each step will replace the data set and you will end up with one dataset and be sad.

Since I want to replace the first file, I’m going to need to add the REPLACE option in the first PROC IMPORT statement.

OPTIONS VALIDVARNAME=V7 ;

FILENAME REFFILE '/home/annmaria/examples/simulation/Tech2Demo.xlsx';
PROC IMPORT DATAFILE=REFFILE DBMS=XLSX
REPLACE
OUT=WORK.IMPORT1;
GETNAMES=YES;
RUN;
PROC CONTENTS DATA=WORK.IMPORT1;
RUN;

FILENAME REFFILE '/home/annmaria/examples/simulation/Tech2Demo2.xlsx';
PROC IMPORT DATAFILE=REFFILE DBMS=XLSX
REPLACE OUT=WORK.IMPORT2;
GETNAMES=YES;
RUN;
PROC CONTENTS DATA=WORK.IMPORT2;
RUN;

Do that two more times for the last two datasets.
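
Incidentally, once this pattern is comfortable, a small macro can do the copying and pasting for you. This is just a sketch, and it assumes the third and fourth files follow the same naming pattern as the second – Tech2Demo3.xlsx and Tech2Demo4.xlsx are my guesses, so check your actual file names:

%MACRO importall ;
%DO i = 2 %TO 4 ;
FILENAME REFFILE "/home/annmaria/examples/simulation/Tech2Demo&i..xlsx" ; /* double quotes, not single, so &i resolves */
PROC IMPORT DATAFILE=REFFILE DBMS=XLSX REPLACE OUT=WORK.IMPORT&i ;
GETNAMES=YES ;
RUN ;
%END ;
%MEND importall ;
%importall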

Did you need to use the utility? Couldn’t you just have written the code from the beginning? Yes. I just wanted to show you that the utility exists. If you only had one file and it had valid variable names, which is a very common situation, you would be done at that point.

In a real-life scenario, you would want to merge all of these into one file so you could compare clinics, plants, whatever. Super easy.

[IF you have write access to a directory, you could create a permanent dataset here using a LIBNAME statement, but I’m going to assume that you are a student and you do not. The default is to write to the temporary WORK library.]

DATA allplants ;
set import1 - import4 ; /* stacks the four data sets, one after another */
run ;
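
One more hedged tip: if you will want to know which company each record came from after they are stacked together, the INDSNAME option on the SET statement can capture the source data set name. A sketch – the variable name company is my own invention:

DATA allplants ;
LENGTH company $ 41 ;
set import1 - import4 INDSNAME = dsn ;
company = dsn ; /* e.g. WORK.IMPORT1, the data set this row came from */
run ;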

IF you get an error at this point, what should you do?

There are a few different answers to that question and I will answer them in my next post.

SUPPORT MY DAY JOB. IT’S FUN AND FREE!
YOU CAN DOWNLOAD A SPIRIT LAKE DEMO FOR YOUR WINDOWS COMPUTER FROM THE MICROSOFT STORE

14 Feb

jamovi for R: Easy but Controversial

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally, since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternatively, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts the column immediately back to numeric. I missed some features, such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it automatically offsets the confidence intervals for each workshop to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which – oddly enough – adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections, but since point-and-click users are often unable to understand that code, it’s not reproducible to them; the settings in their dialog boxes are also not saved from session to session. A jamovi file contains the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you had just performed all the analyses and never left. So if your data collection process comes up with a few more observations, or if you find a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be swapped out. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data), it would be a powerful and nearly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions would look like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

The same comparison using the jamovi GUI, or its jmv package, would look like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

library("jmv")
ttestIS(
 data = mydata100,
 vars = c("pretest", "posttest"),
 group = "gender",
 mann = TRUE,
 meanDiff = TRUE)

In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an ANOVA table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus and its R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), it lets you install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given their overall goal.

Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc., cannot yet be done on a whole set of variables at once. This is on the developers’ to-do list.

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).

Though jamovi is only at version 0.8.1.2.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug is the one I described above when discussing the export of graphics. The developers consider jamovi to be “production ready”, and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combine the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!

Acknowledgements

Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.

29 Jan

Maybe you *can* use SAS to teach art majors

I was supposed to be teaching statistics to undergraduate Fine Arts majors this semester but I’m going to Santiago to open a Latin American office for 7 Generation Games instead.

I’m a bit disappointed because even though when I was younger and got asked at cocktail parties what I did for a living, I would say,

I teach statistics to people who don’t want to learn it.

teaching Fine Arts majors would probably be a new experience.

I was planning on using Excel to teach that course. However, as I take a closer look at SAS Studio I think it might be feasible to use SAS.

First of all, it’s free for academics and you can use it on any device, including an iPad. I know because I’ve tested it.

Second, and more important for this group, you can use the tasks and do some real-life analyses with almost no coding.

For example, I want to know whether the sample of students we tested on American Indian reservations who had a family member addicted to methamphetamine were, on average, over the cutoff for depressive symptoms. On the scale we used, the CESD-C, the cutoff score is 15.

Step 1: Run the code to assign the directory with the data I made available for the course, for example,

libname in "/home/annmaria.demars/data_analysis_examples";
run;

Step 2: Under the TASKS menu on the left select STATISTICS and then t TESTS

[Image: selecting t-tests]


Step 3: Next to the DATA field you’ll see a thing that looks kind of like a spreadsheet. It’s supposed to symbolize a data file. Click on that and a box will come up that lets you pick the directory (library) and the file within it. In my case, it is the CESD_score file.

[Image: selecting the data]

Step 4: Now that I have my dataset selected, from the ROLES menu I select one-sample t-test.

Step 5: Click the + next to Analysis Variable and select the dependent variable; in my case, this is CESDTotal.

[Image: data selected for one-sample t-test]

Step 6: Now click on the OPTIONS tab. A two-tailed test is selected as the default. That’s good; leave it. The null hypothesis value defaults to a mean equal to 0, but I want to change that to 15. Just click the little running guy at the top to get results.

[Image: options for t-test]


I showed the results in a previous post; the mean for my sample of 18 youth was 21 (p < .05).

What if we did an UPPER one-tailed t-test? Then my p-value is .015 instead of .03.

What if we did a LOWER one-tailed test? Then my p-value is 1.0.

To get these latter 2 tests takes about 5 seconds. All I need to do is change the option for tails and click on the running man again.
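
If you are curious what those clicks generate, the task is just changing the SIDES option on PROC TTEST. A sketch, assuming the CESD_score file is in the library assigned in Step 1:

/* Upper one-tailed test: is the mean GREATER than 15? */
PROC TTEST DATA=in.cesd_score SIDES=U H0=15 ;
VAR CESDTotal ;
RUN ;
/* SIDES=L tests the lower tail; SIDES=2, the default, is two-tailed */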

Now, in just a few minutes, I have data under three different assumptions, from an actual study. My students and I can start discussing what that means.

Bottom line, check out SAS Studio. It may be more of an option for your students than you think.


Meet the howler monkey in Aztech Games


Speaking of baby steps for learning statistics, check out Aztech Games. You can play them in English or Spanish on your iPad. Learn statistics and Latin American history at the same time.

13 Jan

What I Do When One Person Might Change My Results

In a previous post, I asked what you would do if one person’s score changed your results?

  • Would you throw them out?
  • Leave them in?
  • Does it depend on whether they support your hypothesis or not?

A few people suggested collecting more data, and I completely agree with their very valid point: if one person can change your results from significant to non-significant, you probably have a small sample size, which we did, and that is a problem for a number of reasons that warrant their own posts. It’s not always possible to collect more data, though, due to time, money or other constraints (only so many people are considerate enough to die from rabies bites in a given year). In our case, we have a grant under review to follow up on this pilot study with a much larger sample, so if you are on the review committee, let me just take this opportunity to say that you are good-looking and your mother doesn’t dress you funny at all.

A couple of other people commented on not getting too tied up with significance vs. non-significance, especially since a confidence interval with a sample size this small tends to be awfully wide. I agree with that also, but that, too, is a post in itself.

So, what would I do?

[Image: students at desk]

First of all, I would check if there were any problems in data entry. You’d laugh if you knew how often I have heard people trying to explain results due to an outlier, and that outlier turns out to be a data entry person who typed 00 instead of 20, or a student who just went down the column circling “Always” for everything.

For example, on this particular screening measure for depression, some of the items are reverse coded. If you did not pay attention to that and just answered “A lot” to every item, you would get an artificially depressed score (no pun intended). That was not the case here. I looked at the individual responses and, for example, the subject answered “Not at all” to “I felt down and unhappy” and “A lot” to “I felt happy”.
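
For anyone who has not run into reverse coding: it just means flipping the scale on the positively worded items before summing. A sketch of the idea in SAS – the item names and the 0 to 3 scoring here are assumptions for illustration, not the actual CESD-C scoring code:

/* Flip the positively worded items, then sum all 20 */
DATA cesd_scored ;
SET cesd_raw ;
ARRAY happy {4} item4 item8 item12 item16 ; /* assumed positively worded items */
DO i = 1 TO 4 ;
happy{i} = 3 - happy{i} ; /* 0 becomes 3, 3 becomes 0 */
END ;
CESDTotal = SUM(of item1 - item20) ;
DROP i ;
RUN ;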

I checked to see that the measure was scored properly. Yes, the answers were consistent, with “Not at all” to all of the depressed items and “A lot” to all of the reverse-coded items. This was just a happy kid.

So, that wasn’t it.

Second, I checked to see if there was a problem with the subject. Occasionally, we will get a perfect score on the pre- or post-tests for our math games and, upon closer examination, it turns out that the prodigy is actually a teacher who wanted to see what our test was like for him/herself. Either that, or it was a really dumb kid who failed fifth grade 37 times.

That wasn’t it, either. This student was in the same target age group from one of the same two American Indian reservations as the rest of the students.

After ruling out both non-sampling error and sampling error, I then went and did what most people recommended: I analyzed the data both ways. Now, in my case, the one student did not change the results, so when I reported the results to staff from the cooperating reservations, I mentioned that there was one outlier, but 2/3 of the youth tested were above the screening cutoff for symptoms of depression; the cutoff score is 15, while the mean for the young people assessed on their reservation was 21. I should note that this was not a random sample but rather a sample of young people who had a family member addicted to alcohol or drugs, mostly methamphetamine.

Since in this case the results did not change substantively, I just reported the results including the outlier.

If there HAD been a major difference, I would have reported both results, starting with the results without the outlier, stating that this was without one subject included and that, with that outlier, the results were X.

I think the results without the outlier are more reliable, because if your finding of significance (or not) depends on that one person, it’s not a very robust finding.

Here is my general philosophy of statistics and it has served me well in terms of preventing retracted results and looking like an idiot.

Look for convergence.

What I mean by that is to analyze your data multiple ways and, if possible, over multiple years with multiple samples. That’s one reason I’m really grateful we’ve received USDA Small Business Innovation Research funding over multiple years. While university tenure committees are fond of seeing people crank out articles, the truth is, at least in education, psychology and most fields dealing with actual humans, it often takes quite some time for an intervention to see a response. Not only that, but there is a lot of variation in the human population. So, you are going to have a lot more confidence in your results if you have been able to replicate them with different samples, in different places, at different times.

If your significant finding only occurs with a specific group of 19 people tested on January 2, 2018 in De Soto, Missouri, and only when you don’t include the responses from Betty Ann McAfferty, then it’s probably not that significant, now is it?


What I do when I’m not blogging — make educational video games.  

[Image: girl in jungle]

Please check our latest series in the app store for your iPad, Aztech Games, which teaches Latin American history and (what else) statistics. The first game in the series is free.

30 Dec

What would you do if one person changed your results?

This is a hypothetical question, but it could easily happen. Let me give you a real example.

Using a mobile phone game, we administered a standard depression screening measure (CESD-C) to 18 children living on or near an American Indian reservation. All children had a family member who was an alcoholic or addicted to drugs. I decided to do a one-sample t-test of the hypothesis that the mean for this population = 15, which is the cutoff value for symptoms of depression. Here is the code, but I didn’t code it (more about that later).

PROC TTEST DATA=cesd_score SIDES=2 H0=15 PLOTS(SHOWH0); /* test against the cutoff of 15 */
VAR CESDTotal;
RUN;

The results are shown below, with a mean of 21 and a range from 3 to 38.

[Image: t-test results]

You can see that the t-value of 2.34 is significant at p < .05; that is, the mean for this sample is significantly different from the cutoff score of 15. You can see more results here. What if it hadn’t been, though? What if, instead of .0317, the probability was .0517?

What if dropping out this one person with a score of 3 changed the result? In fact, it did change the mean to 22, and the p-value to .0115. You can see all of those results here.
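
Rerunning without the outlier takes one extra statement. A sketch, using the same code as above; the WHERE statement below is just one way to drop that single score of 3:

/* Same test, excluding the one subject with a total score of 3 */
PROC TTEST DATA=cesd_score SIDES=2 H0=15 PLOTS(SHOWH0);
WHERE CESDTotal > 3;
VAR CESDTotal;
RUN;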

So, let’s say that, hypothetically, dropping this outlier WOULD change your results. Would you do it? Would you report it?

Think about it. In a couple of days, I will give you my answer and my justification.

As to not having coded it – I used the tasks in SAS Studio, which I found to be pretty fun, but more on that in my next post.


Play Aztech: Meet the Maya – for your iPad in the app store, in Spanish and English.  The second in our series of bilingual games teaching basic statistics and Latin American history. Only $1.99 

[Image: girl in jungle]

P.S. There is a third possibility here, which is changing the test from a two-tailed test to a one-tailed test. Surely an argument can be made that we don’t expect children with a family member who is addicted to alcohol or drugs to be less depressed than the cutoff score? They would either be equally or more depressed. Personally, I don’t buy that argument. I could accept that the sample might be more depressed than the average, but I’m not sure one could justify that the mean necessarily MUST be more than the cutoff for depressive symptoms.


27 Dec

DO statistics and you can go almost anywhere

Let me say right off the bat that the number of contracts I’ve had where people wanted me to tell them what to do, I can count on one hand – and I’ve been in business 30 years. Generally, whether it is an executive in an organization where I’m an employee or a client for my consulting services, people don’t want me to tell them what to do:

Hey, you should do a repeated measures ANOVA.

Nope, they want me to DO it. It’s funny how often I find myself doing the same procedures for vastly different organizations, everywhere from the middle of Missouri to downtown Los Angeles to American Indian reservations in North Dakota to (soon) Santiago, Chile.

[Image: view over the top of my iPad]

There are also those procedures I only use once in a great while, but that’s the topic of another post. Here are a couple of my go-to procedures.

Fisher’s Exact Test

Earlier this year I wrote about Fisher’s Exact Test and how I had used this teeny bit of code:

PROC FREQ DATA = install ;
TABLES rural*install / CHISQ ; /* for a 2 x 2 table, CHISQ also produces Fisher's exact test */
RUN ;

It’s an example of how you do it in SAS for everything from testing whether urban school districts have significantly more bureaucratic barriers to using educational technology than rural districts (they do) to whether mortality rates are lower in a specialized unit in a hospital than for patients with the same diagnosis in a standard unit.

Confidence Limits for the Mean

Working with small samples in rural communities, I often don’t have the luxury of a control group. I know this makes me sound like a terrible researcher and that I never read a quantitative methods or experimental design textbook. However, let me give you an example of the types of conversations I have all of the time.

Me:  I’d like to use your program as a control group. I’ll come in and test all of your students and then two months later, I’ll test them all again.

Principal/ Superintendent/ Program Director:  You mean you want me to take up two periods of class / counseling time for your tests?

Me: Yes.

Them: You wouldn’t actually be giving our students any services or educational program, you’d just be taking two hours from all of our students.

Me: Yes, and then I’ll compare their results to those of the students who do get services.

Them: What do our students get out of it?

You can see where this conversation is going. One solution might be to pay all of the students some amount to stay after school or come in for an extra counseling period or whatever is being compared, so they aren’t missing out on services to take the test. However, Institutional Review Boards are cautious about substantial incentives, because they feel very low-income people might be coerced into participating – for some of the people in our research, $10 is a lot of money.

The result is that I don’t always have a control group, but all is not lost. Being smarter than I look (yes, really), I often use standardized measures for which there is a lot of research documenting the mean, and then I can do a one-sample test.

proc means data=cesd_score alpha=.05 clm mean std ;
var cesdtotal ; /* clm requests the 95% confidence limits for the mean */
run ;

This will give me the 95% confidence interval for the mean, and I can see whether my sample is significantly different from the documented mean. For example, with a sample of 18 children from an American Indian reservation, the mean score on the CESD-C, a measure of depression, was 21. The cutoff for considering the respondent as showing depressive symptoms is 15. With a confidence interval from 15.6 to 26.4, I can say that there is a greater than 95% probability that the population mean meets the cutoff for depressive symptoms. Notice that the lower confidence limit is still above the screening cutoff point of 15.

There is an interesting question related to this specific study, but it will have to wait for tomorrow since I have to head to the airport in a few hours. This week, I’m heading to Missouri. If you want to meet up and talk statistics, video games or just drink beer, let me know.


Play Aztech: The Story Begins – free for your iPad in the app store, in Spanish and English.  The first in our series of bilingual games teaching math and history.

[Image: girl in jungle]


5 Dec

Teaching statistics tip: Know your students

Almost always when I get asked to teach anything my answer is:

No. 

I don’t even think about it. Just, no. I’m too busy. Usually, I’ll teach one graduate class a year and that’s it. However, recently I had the opportunity to teach an introduction to statistics course and design the whole course from the ground up, which sounded like my idea of fun. The college is predominantly an arts school, with students majoring in screenwriting, dance, drama and a smattering of entertainment business majors.

Normally, when I teach graduate statistics courses I use SAS, require students to learn at least a minimal amount of programming, and expect them to be able to do things like partition the sums of squares.

[Image: Julia in Trinidad]

The Spoiled One NOT computing the area under the curve

It just so happens that The Spoiled One, who is a Creative Writing major (what does she want to be when she graduates? Unemployed, apparently), took statistics last year, which resulted in many 11 pm (2 am Eastern time, where she attends school) phone calls to me on things like how to compute the area under the curve between two z-scores.

Despite my best efforts, I believe she left the class with zero conviction that she would ever use statistics, and I really don’t blame her. There is not a lot of call in one’s daily life for looking up values in a z table, it being the 21st century and all and us having computers.

Here is my honest appraisal of my soon-to-be students – nearly 100% of them will be able to use skills such as creating graphs with Excel, computing averages, understanding the difference between the median and the mean and when which measure is appropriate. I can tell them truly how they could use this information in deciding which contract to accept, in which film to invest and whether a particular dance studio is preferable to another in terms of business viability. There is less than a 10% chance that as juniors and seniors in an arts college they are going to change their minds and decide they want to go into a research career. If they do make that choice, everything they learn in this course will apply.  What I did not do was include a lot of proofs and matrix algebra or computation.

I gave some thought to using JMP, because of the graphics, and to SAS Studio, because it is available free and we could use the tasks menu, which is pretty cool, but the fact is these students are most likely familiar with Excel, and the campus already has a license. It’s installed on every computer in the lab. Installing the Analysis ToolPak is super easy, whether you are using Office 365 or the regular Office (I hear some people calling that the productivity suite).

So, if I am not having students use SAS or calculate the area under a curve, what am I doing?

One thing I am requiring is that every student create their own livebinder. You’re welcome to take a look at it in the livebinder I’m preparing for my own purposes for the course. Just look under the livebinder assignment tab.

I have a lot more to write about this later. Right now,  I have guests on the way so I’ll try to post more tomorrow.


Want to learn statistics in a game?  Play Aztech: The Story Begins – free for your iPad in the app store, in Spanish and English.  The first in our series of bilingual games teaching math and history.

20 Nov

The statistical knowledge you need the most – almost everywhere

[Image: deer in back yard]

What do a herd of deer and a sea lion have to do with statistics?

Friday, I was on the Spirit Lake Dakota Nation in North Dakota. Most of my time there was spent at the Spirit Lake Vocational Rehabilitation Project, an impressively effective group of people who help tribal members with disabilities get and keep jobs. A few years back, I wrote a system to track their data using PHP and MySQL. It is deliberately simple, because they wanted a basic database that would give reports on the number of people served, how many had jobs, and some demographic information. A research project used SAS to analyze the data to try to identify predictors of employment.

Due to a delayed flight, I spent the night with my friend in Minot, discussing, among other things, the decline in native speakers of Cree – and not the herd of deer in her backyard, which was commonplace enough to pass without comment.

[Image: harbor at night]

Saturday, I was back home in California, on a dinner cruise in Marina del Rey. We were discussing how to analyze the data on persistence in our games, to show that the re-design, with a longer lead-in story line and a higher proportion of game play early on, was effective. I suggested maybe we could use survival analysis. Really, it’s the same scenario as how many people are alive after 2, 3 or 4 months, or how many people kept playing the game after the 2nd, 3rd or 4th problem.

[Image: sea lion on dock]
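
If survival analysis is new to you, here is a minimal sketch of what that might look like in SAS. Every data set and variable name below is made up for illustration:

/* Kaplan-Meier curves of how long players keep playing, by game design */
PROC LIFETEST DATA=gameplays PLOTS=SURVIVAL ;
TIME problems_completed * quit(0) ; /* quit=0 means still playing at the end (censored) */
STRATA design_version ;
RUN ;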

The deer, the large loud sea lion on the dock and I spent the exact same amount of time discussing the probability mass function for a Poisson distribution and proving the Central Limit Theorem.

My point is that everywhere I go – and that is a REALLY broad range of places – people are interested in the application of statistics, but SO much of school is focused on teaching how to compute the area under the normal curve, how to prove some theorem, or how to compute coefficients by plugging numbers into a formula with a calculator and inverting matrices. I’m not sure how helpful that ever was to students, and I can guarantee you that the last time I computed sums of squares without using a computer was about 35 years ago.

Whether you are using SAS, SPSS, Excel, R, JMP or any one of a dozen other statistical packages, it lets you focus on what’s really important. Does age actually predict whether or not someone is employed (in this case, no)? Do rural school districts have fewer bureaucratic barriers? Is this a reliable test? Did students who played these games improve their math scores?

When I was young, and many of the current statistical packages were either very new and limited or didn’t yet exist, someone asked me if I was worried that I would be out of a job. I laughed and said no, because what computers were replacing was the computational part of statistics, and except for the tiny proportion of people who were going to be developing new statistics, the jobs were all going to be in applying formulas, not proving them, and sure as hell not computing them with a pencil and a piece of paper. A computer allows you to focus on what’s important.

What IS important? That’s a good question and another post.


Having trouble teaching basic statistics to students? Start with Aztech: The Story Begins  — free from the app store (and it’s bilingual)

[Image: girl in jungle]

5 Nov

Forget your dream, kid. Follow statistics

[Image: basketball]

I saw this poster in a high school, supposedly said by a basketball coach:

People say, “Follow your dreams. ” I say, “Forget your dreams, kid, follow math.”

He goes on to give the percentage of high school athletes who compete in college – 3.4% for men’s basketball and, by the way, only 1% of high school athletes make it in Division I. Even if you make it to the college level, your odds of becoming a professional athlete are dismal – 1.1% of college basketball players make it to the major professional teams. Yes, that is roughly 1% of 1%, so you have about a .01% chance of making it onto the Lakers even if you are playing in high school.

If you are that 1 in 10,000 who makes it on the roster, your median salary will be $3.7 million and you will play for around 4.8 years, giving you a career salary of around $18.5 million.

Let’s say you are a statistician with a Ph.D. With 5-9 years of experience, your median salary is around $130,000. In my experience, it is going to be considerably less your first year but go up fairly rapidly. Let’s say you have the sense to get some scholarship and grant funds to pay for your tuition – my total student loan debt was $900 – and that you graduate in your 30s. I was 31, and that was with taking a few years out to work as an engineer. There isn’t any particular reason you have to retire before 65 or 70. It’s not like your knees go out and they fire you from your statistician job. I’m going to give a ballpark figure of $150,000 a year, averaged over those 36 years, which turns out to be about the median salary for a statistician who doesn’t work in academia, according to the American Statistical Association. You’re at $5.4 million. That’s not counting 36 years of health insurance, a 401(k) and other benefits, like not having a boss who is referred to as your “owner”, which I personally find kind of creepy. But you also have to consider that you don’t get the $5.4 million all at once, either.

So, let’s present this to you:

  • You have a 1 in 10,000 chance of making $18.5 million
  • You have a 55 out of 100 chance of making $5.4 million.

You can only buy one ticket. Which lottery ticket do you buy?

Oh, by the way, did I mention you have a 90 out of 100 chance of making over $3 million?
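
To make the comparison concrete, here is a quick back-of-the-envelope expected-value calculation using only the figures above – a sketch that ignores taxes, discounting and everything else an economist would bring up:

/* Expected career earnings for each "lottery ticket" */
DATA _null_ ;
nba = (1/10000) * 18500000 ; /* 1 in 10,000 chance of $18.5 million */
phd = 0.55 * 5400000 ; /* 55 in 100 chance of $5.4 million */
PUT nba= dollar12. phd= dollar12. ;
RUN ;
/* Prints nba=$1,850 and phd=$2,970,000 - not a close call */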

The coach’s point was that you may be dreaming about a spot in the NBA but you have a much greater chance of success in life if you spend your time in the math class instead of on the court. As a good friend of mine often says, “Too many people confuse wishes with plans.”

So, you may dream of slam dunks in the NBA but you would be a lot better off planning to take Calculus, several statistics courses and study a field like business, psychology, political science or epidemiology where you can apply those statistics.

You might think I don’t have any heart, that I have no idea what it means to dream of being a successful athlete. Actually, you’d be wrong. I ran track in college. I won the world championships in judo. Then, the next year, I went into a Ph.D. program and specialized in statistics because, well, I’m good enough at math to see what had the better probability of paying off in the future.

There are SO many ways to learn and use statistics. That’s another post, though. I’d best toddle off to bed since I need to catch a plane tomorrow after I go do a charity walk in the morning.

Early morning and snow, two things I hate the most. Well, life can’t be perfect all the time. I think I can prove that statistically.

Get started learning statistics with the Aztech Games series for iPad. The first game is available now and it’s free!

[Image: pyramid]


14 Sep

It only seems like this has nothing to do with statistics

Last post, I talked about bricolage, the fine art of throwing random stuff together to make something useful. This is something of a philosophy of life for me.

Seems rambling but it’s not …

Over 30 years ago, I was the first American to win the world judo championships. A few years ago, I co-authored a book on judo, called Winning on the Ground. 

[Image: Winning on the Ground book cover]

When it came to judo, although I was better than the average person, I was not the best at the fancy throws – not by a long shot. I didn’t invent any new judo techniques. I wanted to call our book The Lego Theory of Judo, but my co-author said, “That’s stupid,” and the editor, more tactfully, said, “Nobody will know what you are talking about unless they read the book, and you want a title that will get them to buy the book.” So, I lost that argument.

What I was really good at was putting techniques together. I could go from a throw to a pin to an armbar and voila – world champion! Well, it took a long time and a lot of work, too.

How does this apply to statistics?

Let’s start with Fisher’s exact test. Last year, I wrote about using this test to compare the bureaucratic barriers to new educational technology in rural versus urban school districts. Just in case you have not memorized my blog posts, Fisher’s exact test can be used when you have a 2 x 2 table that fails to meet the chi-square minimum of five expected observations per cell. In that instance, with only 17 districts, chi-square would not be appropriate. If you have a 2 x 2 table, SAS automatically computes the Fisher exact test, as well as several others. Here is the code:

PROC FREQ DATA = install ;
TABLES rural*install / CHISQ ; /* Fisher's exact test comes along automatically for a 2 x 2 table */
RUN ;

Ten years ago, I was using this exact test in a very different context, as a statistical consultant working with a group of physicians who wanted to compare the mortality rates between a department that had staff with a specific training program and a similar department where physicians were equally qualified except for participation in the specialized program. Fortunately for the patients, but unfortunately for statistical analysis purposes, people didn’t die that often in either department. Exact same problem. Exact same code, except for changing the variable names and data set name.

In 35 years, I have gone from using SAS to predict which missiles will fail at launch, to which families will place their child with a disability in a residential facility, to which patient in a hospital will die, to which person in a vocational program will get employed, to which student will quit playing an educational game. ALL of these applications have used statistics, and in some cases, like the examples above, the identical statistics applied in very diverse fields.

Where do the Legos come in?

In pretty much every field, you need four building blocks: statistics, foundational programming concepts, an understanding of data management and subject-specific knowledge. SAS can help you with three of these, and if you acquire the fourth, you can build just about anything.

More on those building blocks next post.

Support my day job AND get smarter. Buy Fish Lake for Mac or Windows. Brush up on math skills and canoe the rapids.

[Image: girl in canoe]

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV

