sasCommunity.org Planet

May 24, 2013

In my article "Simulation in SAS: The slow way or the BY way," I showed how to use BY-group processing rather than a macro loop in order to efficiently analyze simulated data with SAS. In the example, I analyzed the simulated data by using PROC MEANS, and I used the NOPRINT option to suppress the ODS output that the procedure would normally produce.

About 50 SAS/STAT procedures support the NOPRINT option in the PROC statement. When you specify the NOPRINT option, ODS is temporarily disabled while the procedure runs. This prevents SAS from displaying tables and graphs that would otherwise be produced for each BY group. For a simulation that computes statistics for thousands of BY groups, suppressing the display of tables results in a substantial savings of time.

Newer SAS procedures do not always support the NOPRINT option. However, you can still suppress the ODS output. The following macros encapsulate statements that turn ODS off and on. I call the %ODSOff macro before I start the BY-group analysis; I call the %ODSOn macro after the analysis completes.

%macro ODSOff(); /* Call prior to BY-group processing */
ods graphics off;
ods exclude all;
ods noresults;
%mend;
 
%macro ODSOn(); /* Call after BY-group processing */
ods graphics on;
ods exclude none;
ods results;
%mend;

For example, if I were using PROC ROBUSTREG to analyze many samples of simulated data, I might use the following pseudo-code:

%ODSOff
proc robustreg data=MySimData;
   BY SampleID;
   model y = x;
   ods output ParameterEstimates = OutputStats;  /* <== insert name of ODS table */
run;
%ODSOn

Even though ODS is suppressed to the display destinations (such as LISTING and HTML), you can capture the statistics that result from each analysis by using an ODS OUTPUT statement, which saves an ODS table to a SAS data set. Other ways to save statistics include using an OUTPUT statement, an OUT= or OUTEST= data set, and so forth.
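As a minimal end-to-end sketch of the pattern described above, the following program simulates 100 small regression samples, suppresses ODS while PROC ROBUSTREG runs, and captures the parameter estimates with ODS OUTPUT. The simulated model (y = 1 + 2x + error), the seed, and the sample sizes are illustrative assumptions, not part of the original example. The two macros are repeated so that the block runs on its own.

```sas
%macro ODSOff(); ods graphics off; ods exclude all;  ods noresults; %mend;
%macro ODSOn();  ods graphics on;  ods exclude none; ods results;   %mend;

/* Hypothetical simulated data: 100 samples of size 50 */
data MySimData;
   call streaminit(12345);
   do SampleID = 1 to 100;
      do i = 1 to 50;
         x = rand("Normal");
         y = 1 + 2*x + rand("Normal");
         output;
      end;
   end;
   drop i;
run;

%ODSOff
proc robustreg data=MySimData;
   by SampleID;
   model y = x;
   ods output ParameterEstimates = OutputStats;  /* one set of rows per BY group */
run;
%ODSOn

/* Summarize the sampling distribution of each estimate */
proc means data=OutputStats mean std;
   class Parameter;
   var Estimate;
run;
```

Nothing is displayed while PROC ROBUSTREG runs, yet the OutputStats data set still contains the estimates for every value of SampleID.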

Be aware that some SAS procedures (such as PROC MIXED) write a NOTE to the SAS log as part of their normal operation. The NOTE might say something like "NOTE: Convergence criteria met." For these procedures, you will also want to turn off notes, lest they fill the SAS log:

%ODSOff
options nonotes;  /* use NONOTES to suppress notes to the log */
proc mixed ...;
model y = ...;
run;
options notes;   /* turn NOTES back on */
%ODSOn
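The pseudo-code above can be filled in as a runnable sketch. The simulated random-intercept data, the data set names, and the macro variable savedNotes are hypothetical choices made for illustration. Rather than unconditionally turning notes back on, this version saves the current NOTES setting with the GETOPTION function and restores it afterward, which is one possible refinement and not something the original requires.

```sas
/* Hypothetical data: 20 samples, 10 subjects each, 5 repeated measures */
data SimMixed;
   call streaminit(1);
   do SampleID = 1 to 20;
      do Subject = 1 to 10;
         u = rand("Normal");                 /* random intercept for the subject */
         do rep = 1 to 5;
            x = rand("Normal");
            y = 1 + 0.5*x + u + rand("Normal");
            output;
         end;
      end;
   end;
   drop u rep;
run;

%let savedNotes = %sysfunc(getoption(notes));  /* NOTES or NONOTES */
ods exclude all;
options nonotes;               /* suppress the convergence notes in the log */
proc mixed data=SimMixed;
   by SampleID;
   class Subject;
   model y = x / solution;
   random intercept / subject=Subject;
   ods output SolutionF = FixedEst;          /* fixed-effect estimates per sample */
run;
options &savedNotes;           /* restore whatever setting was in effect */
ods exclude none;
```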

The material in this blog post is taken from my book Simulating Data with SAS, which contains many more tips and techniques for the efficient simulation of data.

tags: Sampling and Simulation, Tips and Techniques

It must be that time of year, because I was asked to speak at two different schools in downtown Los Angeles this week, one elementary school and one middle school.  The Perfect Jennifer probably won the coolest teacher award for getting her younger sister, a world champion in mixed martial arts and the subject of a made-for-TV movie this summer, to come talk for career day.


 

However, after the mobs of autograph seekers had departed, there were still plenty of questions for the old mom, just as there were at the elementary school in MacArthur Park (yes the same of disco song and gang fame).

Here are some of my favorite questions and the answers that I gave.

Q. Were you always a math genius?

I was not a particularly good student. I got in trouble a lot for fighting and I wasn’t all THAT interested in school. I think I started being interested in math when I was in the sixth grade just because the math teacher (Sister Marion) was really nice and some of my other teachers were really mean. I mean, really mean, like throwing stuff at me. It’s true, I was an annoying child, but still. Since I liked her, I liked her class, so I studied harder for it and did better.

Q. Is your mother proud of you?

Yes, I believe she is. I’ve gotten a lot of education, started a company that does good work, been a teacher and been able to take care of my children well, so I would say, yes, she is proud of me.

Q. What do you dislike about your job?

I really had to think about this one and for a long time I could not think of anything. Then, The Perfect Jennifer reminded me that sometimes I have to go to North Dakota in the winter. That is the one thing I don’t like about my job, when I have to go somewhere it is really cold because I hate cold weather.

Q. What was your Plan B?

I had to think about that, too, for a while. I finally said that I really like being a statistician and the work that I do and if it doesn’t work out, if the grant that I’m working on now doesn’t get funded, if my game I’m working on now doesn’t sell then I think I will just try again. It’s like my daughter Ronda (who spoke earlier in the morning) said. Someone asked her in an interview once,

“You’ve won every match so far in your career with the arm bar in the first round. What are you going to do if you try the arm bar on someone one day and it doesn’t work?”

She replied,

“Well, I guess in that case, I’d probably try again.”

(In fact, if you saw her last match, that is exactly what she did.) So, I said, I think my Plan B would be to try again to succeed as a statistician.

Q. What do you like about your job?

Everything. I like traveling. I like working with really smart, nice people which is all I work with any more, because if they are jerks, I just turn down the contract and don’t work with them. I like the fact that every project is something new, sometimes it’s seeing if a program works, some days it’s trying  to catch fraud, other days it is teaching a class. I like the fact that I don’t have to get up before 10 o’clock in the morning.

Finally I told them,

If you don’t remember anything else I said or that anyone else said today, remember this, because it took me a long time to figure it out. Don’t EVER believe that other people are smarter than you, that they have some special kind of math brain that they can get it and you can’t, that everyone knows more than you. If they do know more than you it is just because they worked at it longer and harder and if you work long enough and hard enough you will get to the same place. Don’t believe you need  to  be a certain race or age or look a certain way to start a technology company and be successful. It just is not true. I used to think that way, that people who are really good at math were not people like me, certainly none of the math professors I had in college or people I saw on television talking about starting companies looked like me. None of that matters. Now I write the sort of things that I could not imagine even understanding when I was young and I toss it off like it’s nothing and it IS nothing because I’ve been doing it for twenty years. Math, martial arts, programming – anything – you just bang away at and you get it eventually. Why do you think they call it hacking?

May 23, 2013

 
 
UPDATE: Rick Wicklin kindly shared code that visualizes the output, which makes the results easier to interpret. Thanks! Here is his code, which should be run after my code below. Note that it is designed for K=2.
 

proc iml;
use out;      read all var {ID x y}; close out;
use neighbor; read all var {Nbor}; close neighbor;
NBor = num(Nbor);         /* convert to numeric */
xEnd = x[Nbor];
yEnd = y[Nbor];
create Viz var {ID x y xEnd yEnd}; append; close Viz;
quit;
proc sgplot data=Viz noautolegend;
        scatter y=y x=x / datalabel=id;
        vector x=xEnd y=yEnd / xorigin=x yOrigin=y;
run;

***************************************************************************



An analyst often wants to find the pair of observations that are closest in terms of some metric, such as Euclidean distance. For example, on the popular SAS-L archive group, Randall Powers posted such a question, but he was using PROC DISTANCE.

His idea is straightforward: first, calculate the pairwise distances among all observations; second, find the pair that has the shortest distance. For the first step, he was trying to use PROC DISTANCE, and then he planned to use a DATA step to search through the results. This is not workable even for a modest data size, say 100K observations, because PROC DISTANCE will generate a dense double-precision 100K-by-100K matrix (roughly 80 GB as a full square matrix) and will blow up your workstation. That is the reason Randall turned to SAS-L for help.

Well, SAS actually has an old PROC that can readily solve his problem. Of course, before a solution can be crafted, there are several issues that have to be clarified first.

First, what if several observations share the same closest observation? Do you allow a shared match, or do you need a mutually exclusive solution in which each observation has its own unique, non-shared counterpart as the closest point? If so, how would you determine which observation gets the shared one: first come, first served, or some other rule?

Second, when we talk about pairs, N should be an even number. But what if the data have an odd number of observations? How do we deal with the leftover?

That being said, let's consider the simplest and most tolerant case: each observation is allowed to serve as the closest point to multiple other observations, and N is an even number.

The solution uses PROC MODECLUS. This is an old procedure for density-based cluster analysis, but its built-in nearest-neighbor search algorithm and output capability make it the perfect candidate for this type of job. We use the example data in the "Getting Started" section.

The ODS OUTPUT statement directly outputs the list of nearest neighbors (closest points) for each observation, and you have to specify either the ALL or the NEIGHBOR option in the PROC MODECLUS statement in order to use this functionality. In the same statement, we also specify K=2. K=2 means 2 nearest neighbors, but because the count includes the observation itself, each observation gets K-1=1 nearest neighbor. So if you specify K=3, you actually ask the procedure to find the 2 closest points for each observation.

The good thing about the nearest-neighbor list data set is that it also contains the calculated distance. Therefore, if you need to deal with a more complex situation such as those listed above (non-shared nearest neighbors, for example), you have some room for post-processing. Here you can specify K=3, and PROC MODECLUS will output 2 neighbors for each observation.


You can see that observation 3 can be matched to either observation 1 or 2, but with 2, it yields the shortest distance, and if this is the rule you are going to apply, your post process can work on this data set to implement the rule.

Please feel free to verify the result. If you find an error, let me know or post it in the comments.

 
data example;
      input x y @@;
   ID=_n_;
      datalines;
   
   18 18  20 22  21 20  12 23  17 12  23 25  25 20  16 27
   20 13  28 22  80 20  75 19  77 23  81 26  55 21  64 24
   72 26  70 35  75 30  78 42  18 52  27 57  41 61  48 64
   59 72  69 72  80 80  31 53  51 69  72 81
   ;
run;

ods select none;
ods output neighbor=neighbor; /* ODS output dataset */
proc modeclus data=example method=0  k=2    /*Find a pair of nearest neighbor*/
                 out=out  
                 all  /* required option in order to output Nearest Neighbors*/;
     var x y;
  id  id;
run;
ods select all;
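One way to verify the PROC MODECLUS result on a small data set is a brute-force search in SAS/IML. The sketch below is a check, not a replacement for the procedure: it builds the full N-by-N distance matrix, which is exactly the approach that does not scale to large N. It assumes a release of SAS/IML that provides the DISTANCE function, and it uses a hypothetical four-point data set named pts (the first four points of the example data) so that it runs on its own.

```sas
data pts;
   input x y @@;
   ID = _n_;
   datalines;
18 18  20 22  21 20  12 23
;
run;

proc iml;
use pts;  read all var {x y} into P;  close pts;
D = distance(P);              /* N x N matrix of Euclidean distances */
n = nrow(D);
diagIdx = do(1, n*n, n+1);    /* positions of the diagonal elements */
D[diagIdx] = max(D) + 1;      /* exclude each point from matching itself */
Nbor = D[ , >:<];             /* column index of the minimum in each row */
Dist = D[ , ><];              /* the corresponding minimum distance */
ID = T(1:n);
print ID Nbor Dist;
quit;
```

For these four points, a hand calculation gives nearest neighbors 3, 3, 2, and 1 for observations 1 through 4, which also illustrates the shared-neighbor situation discussed earlier: observations 1 and 2 both have observation 3 as their closest point.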


May 22, 2013

I recently added a new widget to the right margin of the NOTE: blog - "Recent Topics". It's a form of word cloud, but it's far more dynamic and interactive than a traditional word cloud. Hover over a word (or click on it) to see a list of NOTE: articles featuring the specified word; click Drill-Down to get a sub-cloud of associated words.

If you don't subscribe to NOTE: (through RSS or email) then it can be especially difficult to make the best use of the blog's content. The new widget shows words from the most recent 25 NOTE: posts and I think it will be of benefit for catching-up on recent content that is of interest to you.


The word cloud is supplied by Infomous. Hover your mouse near the bottom of the diagram for a menu of options. From the Infomous FAQ:

  • The size of each word reflects the frequency with which it appears in the source

  • If you click on a word, a drop down list appears with links to articles that are related to the specific word. The drop-down will also appear if your mouse lingers over a word. By clicking on a link in the list, you will navigate to that specific article

  • Topics become linked when they are mentioned in the same context or discussed together multiple times. Related terms and concepts are linked together with lines so you can grasp the context of any relevant topic

  • The words in the Infomous cloud are organized in groups of related words. This provides you with a quick glimpse of which topics belong together in conceptual clusters
Plus, it looks cool, and it's fun. Try it, and drop me a comment!

I probably shouldn’t admit this…  When I first started learning the SAS programming language 15 years ago, I just couldn’t figure out how to easily create pretty, customized graphs and reports.  So, I would prepare my data in SAS, then export to some other application to produce my results.  I [...]

I recently read an interesting article in The Economist, where they describe "The Big Mac index."  This is an index they invented as a lighthearted guide to compare currencies in different countries. In their article they create a multi-panel display (similar to a dashboard) where they compare the index for several countries using [...]

A good chunk of the SAS year revolves around SAS Global Forum. Pre-conference, everyone is busy polishing presentations and planning meetings. Post-conference is the best—attendees come back to Cary with heads full of customer ideas to implement and notebooks full of contacts to follow up on. One user's request found its way to my Inbox last week when a coworker asked me to review a list of SAS administration resources. 

My coworker's note reminded me that new SAS administrators join the ranks every month, so it seems like a good time to compile a list of online resources.  If I've overlooked your favorite, please share with everyone by commenting on this post. 

For connecting with other SAS administrators, there are several good options:

If you're responsible for installing and maintaining a SAS installation, you will find most of the information necessary to accomplish your tasks on these support.sas.com sites:

 Prefer videos?  SAS Talks On Demand offers several videos for the SAS administrator. 

 And, of course, don't forget training and documentation:

 Please let us know how SAS can continue to support this important role.

tags: SAS Administrators

Ever had the need to write SQL create statements for existing tables but felt too lazy to write them by hand?  Ever wanted to reverse engineer tables into SQL code? Have no fear, PROC SQL is here.  Use the DESCRIBE statement to get the full-blown SQL code to create the table.  This is particularly good for generating empty table structures into which ETL code can insert data.  In the data warehousing world, the SQL code that creates empty tables is referred to as Data Definition Language (DDL).

Example: Describe Statement Using PROC SQL

Running the following describe statement produces the SQL create statement to define the table and columns.

proc sql;
  describe table SASHELP.CLASS;
quit;

The SAS log provides the create statement:

[Image: SQL Describe Code]

SAS Support provides additional information on the DESCRIBE statement.
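If the goal is only an empty copy of an existing table, rather than the DDL text itself, PROC SQL also accepts the LIKE clause on CREATE TABLE. The following is a small sketch; the table name work.class_empty is a hypothetical choice for illustration.

```sas
proc sql;
   /* Create an empty table with the same columns as SASHELP.CLASS */
   create table work.class_empty like sashelp.class;

   /* Write the CREATE TABLE statement for the new table to the log */
   describe table work.class_empty;
quit;
```

The DESCRIBE output in the log can then be copied into ETL code, while the empty table is immediately available as an insert target.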

The post Describe Your Table in SAS to Write the SQL Code appeared first on Business Intelligence Notes for SAS® BI Users.

Last week I discussed a program that had three nested loops that used scalar operations in the innermost loop. I mentioned that this program was not vectorized, and would therefore be slow in a matrix language such as SAS/IML, MATLAB, or R. I then went through a series of steps in which I rewrote the program to be more efficient by using basic linear algebra operations such as dot products (level-1 BLAS), matrix-vector multiplication (level-2 BLAS), and matrix-matrix multiplication (level-3 BLAS). At each step, the number of loops decreases, and the efficiency increases.

Someone asked me what kind of speed-ups can be expected by vectorizing a SAS/IML program. In general, the answer depends on the size of your data and the particular operations that you are performing, but it is straightforward to run a series of examples that compare the performance at each step of the vectorization of the previous example.

In this post, I generate a square matrix of size N for N=100, 200, 400, 600, 800, and 1000. For each size, I time how long it takes to compute an operation that is equivalent to X`*X. The operations are as follows:

  • Level-0: The original program, which consists of three nested loops and scalar operations.
  • Level-1: A program that contains two nested loops and dot products of vectors.
  • Level-2: A program that contains one loop and matrix-vector multiplication.
  • Level-3: A program that contains one matrix-matrix multiplication.
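The four levels listed above can be sketched as follows for the product X`*X. This is an illustration written for this summary, not the downloadable program used for the timings; the matrix size, the seed, and the variable names are assumptions, and N is kept small so that the triple loop finishes quickly.

```sas
proc iml;
N = 50;
call randseed(1);
X = j(N, N);
call randgen(X, "Normal");

/* Level-0: three nested loops, scalar operations only */
G0 = j(N, N, 0);
do r = 1 to N;
   do c = 1 to N;
      s = 0;
      do k = 1 to N;
         s = s + X[k,r] * X[k,c];
      end;
      G0[r,c] = s;
   end;
end;

/* Level-1: two nested loops, dot products of columns */
G1 = j(N, N, 0);
do r = 1 to N;
   do c = 1 to N;
      G1[r,c] = T(X[,r]) * X[,c];
   end;
end;

/* Level-2: one loop, matrix-vector products */
G2 = j(N, N, 0);
Xt = T(X);
do c = 1 to N;
   G2[,c] = Xt * X[,c];
end;

/* Level-3: a single matrix-matrix product */
G3 = Xt * X;

/* All four methods should agree up to floating-point rounding */
maxDiff = max(abs(G0-G3), abs(G1-G3), abs(G2-G3));
print maxDiff;
quit;
```

To reproduce a timing comparison, each section could be bracketed with calls to the TIME function and run for increasing values of N.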

You can download the program that performs the experiment.

On my desktop computer, the results are summarized by the following plots. The first plot shows all of the times. However, the graph is dominated by the slow times that are associated with the original program that had three nested loops and only scalar operations. The middle plot omits the times for the original program. In this view you can see that the level-1 operations are about 25 times faster than the level-0 operations for N=1000. Again, however, the graph is dominated by the slowest method, and so the last plot shows only the timings for the level-2 and level-3 operations. The level-2 operations are about 16 times faster than the level-1 operations for N=1000. The level-3 operations are almost three times faster than the level-2 operations.

In summary, the slowest method (scalar operations in three nested loops) is about 1,000 times slower than the equivalent level-3 operation, which does not require any loops. This is good motivation: the time you invest to fully vectorize your code can pay dividends by running 1,000 times faster.

If you are adept at interpreting logarithmic scales, you can also see the results plotted on a Log10 axis.

No matter how you visualize the results, the conclusion is the same: if you want your program to scale to handle large data, vectorize the program to take advantage of the SAS/IML matrix operations.

tags: vectorization

Last week I wrote a bit about how to get an exploratory factor analysis using Mplus. The question now is, what does that output MEAN?

First, you just get some information on the programming statements or defaults that produced your output:

INPUT READING TERMINATED NORMALLY

Exploratory Factor Analysis ;

SUMMARY OF ANALYSIS
Number of groups                                                 1
Number of observations                                         730

Number of dependent variables                                    6
Number of independent variables                                  0
Number of continuous latent variables                          0

Observed dependent variables

Continuous
Q1F1        Q2F1        Q3F1        Q1F2        Q2F2        Q3F2

Estimator                                                       ML
Rotation                                                    GEOMIN
Row standardization                                    CORRELATION
Type of rotation                                           OBLIQUE

This tells us we are analyzing all of the data as one group and not, for example, running separate analyses for males and females. We have 730 records and six variables, all of which are continuous and listed above. The maximum likelihood (ML) method of estimation is used, along with the default rotation, GEOMIN, which is an oblique method; that is, it allows the factors to be correlated.

Here we have a list of our eigenvalues

RESULTS FOR EXPLORATORY FACTOR ANALYSIS

EIGENVALUES FOR SAMPLE CORRELATION MATRIX
        1             2             3             4             5
    ________      ________      ________      ________      ________
      1.866         1.262         0.866         0.750         0.716

EIGENVALUES FOR SAMPLE CORRELATION MATRIX
        6
    ________
      0.539

In this case, you could go ahead with the eigenvalue greater than one rule, but let’s take a look at a couple of other statistics. First, we have the results from the one factor solution.  Here we have the chi-square testing the goodness of fit of the model

Chi-Square Test of Model Fit

Value                             96.228
Degrees of Freedom                     9
P-Value                           0.0000

We want this test to be non-significant because our null hypothesis is there is no difference between the observed data and our hypothesized one-factor model. This null is soundly rejected.

Let’s take a look at the chi-square for our two-factor solution:

Chi-Square Test of Model Fit

Value                              3.016
Degrees of Freedom                     4
P-Value                           0.5552

You can clearly see that the chi-square is much smaller and non-significant.

Let’s take a look at two other tests. The Root Mean Square Error of Approximation (RMSEA) for the one-factor solution is .115, as shown below. We would like to see an RMSEA less than .05 which is clearly not the case here.

RMSEA (Root Mean Square Error Of Approximation)

Estimate                           0.115
90 Percent C.I.                    0.095  0.137
Probability RMSEA <= .05           0.000

For the two factor solution, our RMSEA rounds to zero, as shown below

RMSEA (Root Mean Square Error Of Approximation)

Estimate                           0.000
90 Percent C.I.                    0.000  0.049
Probability RMSEA <= .05           0.954

Clearly, we are liking the two-factor solution here, yes? The eigenvalue > 1 rule (which should not be TOO emphasized) points there, as does the model fit chi-square and the RMSEA.
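As a rough cross-check of the fit statistics quoted above, the p-values and RMSEA estimates can be recomputed from the chi-square values, the degrees of freedom, and N = 730. The sketch below uses a SAS DATA step and one common definition of the RMSEA point estimate, sqrt(max((chi-square - df) / (df * N), 0)). That formula is an assumption here: some software divides by N - 1 instead, and the exact formula Mplus uses is not stated in the output, although either choice gives the same values after rounding for these data. The data set and variable names are hypothetical.

```sas
data FitCheck;
   length Model $ 10;
   input Model $ ChiSq DF;
   N = 730;                                    /* number of observations */
   PValue = sdf("CHISQ", ChiSq, DF);           /* upper-tail chi-square probability */
   RMSEA  = sqrt( max( (ChiSq - DF) / (DF * N), 0 ) );
   datalines;
OneFactor 96.228 9
TwoFactor 3.016 4
;
run;

proc print data=FitCheck noobs;
   format PValue 6.4 RMSEA 5.3;
run;
```

For the two-factor model the chi-square is smaller than its degrees of freedom, so the quantity under the square root is truncated at zero, which is why the reported RMSEA is 0.000.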

In their course on factor analysis, Muthen & Muthen give this very nice example of a table comparing different factor solutions using the data

[Image: Mplus EFA model selection table]

They also like the scree plot, which I do, too. I also agree with them that one should never blindly follow some rule but rather have some theory or expectation about how the factors should fall out. I also agree with them in looking at multiple indicators, for example, scree plot, chi-square, RMSEA and eigenvalues.

May 21, 2013

I saw a nice post by Rob Allison last month on creating infographics with SAS. Whilst we mostly endeavour to create hi-fidelity graphics in SAS that show a relatively high volume of detailed graphical information, there are a wide variety of uses for graphical presentation. Infographics should not be overlooked.

As Rob says in his post, there's no firm definition of the term "infographic", but I think Rob's description sums it up nicely: something half way between data visualisation & artwork. SAS graphics are typically created straight from the data - rightly so - but infographics then apply some analysis and some presentational elements in order to enrich the result.

In his post, and in links to his site, Rob describes how he created the half dozen samples that he includes in the post.

Whilst there's no specific mention of infographics, there is a rich store of information about creating SAS graphics in this year's SAS Global Forum proceedings. See the Reporting and Information Visualisation stream, and the Posters stream.

To experiment with infographics and try ideas and styles, there are some useful online resources, such as Infogr.am, which provide a set of tools intended specifically for creating infographics.

It's important to produce accurate graphics, but making them attractive and approachable will mean more people get to see the fruits of your labours. And if you're in the right position to apply some interpretation to the material then so much the better. And it can be fun letting your artistic side have a little space to express itself!

May 20, 2013

We are a month out from Analytics 2013 in London! I am already getting excited about the trip and am starting my list of what to pack (and where to go and what foods to eat and what pubs to visit --- the list goes on!). As host of the [...]

Last week I alluded to some very useful applications of the Copy Files task. This is one of them.

Using the SAS programming language, you can manipulate data and create files of just about any size, shape, and format: Excel, PDF, CSV, RTF, and more. A challenge for SAS Enterprise Guide users has been: how to capture those files and bring them back to your local PC, when the SAS Workspace is running on a remote machine?

Example: Export to a CSV file and download the result

Here's a typical scenario: You have a simple SAS program that produces one or more CSV files that you will ultimately use in another program. How can you get the CSV files to your PC automatically?

STEP 1: Build a program step to create the CSV file
This program is easy to adapt for any data set and environment. It works on Windows and UNIX. All you need to know is the library and member name of the data that you want to export, and then the destination folder for your local PC. The program will perform the export operation, stage the CSV file in a temp location, and define the macro variables that the next step will use.

/* Data to export */
%let lib  =         sashelp;
%let datafile =     class;
 
/* Local folder to download to */ 
%let download_to =  c:\projects\data\results;
 
/* detect proper delim for UNIX vs. Windows */
%let delim=%sysfunc(ifc(%eval(&sysscp. = WIN),\,/));
 
%let download_from =
  %sysfunc(getoption(work))&delim.&datafile..csv;
 
filename src "&download_from.";
 
proc export data=&lib..&datafile.
  dbms=csv 
  file=src
  replace;
run;
 
filename src clear;

STEP 2: Use Copy Files task to download the result
The Copy Files task accepts SAS macro expressions. That's a key feature, as the macro variables we need are defined in the previous program step. Here's a screen shot of the task settings:

This makes the use of the Copy Files task very "generic". In fact, you can create a Task Template that defines these exact task settings, and thus always have it available on your Tasks menu directly.

STEP 3: Link these steps together in a process flow
Create a user-defined link between the program and the task, ensuring that they will run in the correct sequence.

THAT'S IT!
The power of SAS and the flexibility of the Copy Files task really make this a simple operation. However, you might want to consider a few variations:

  • Export and download a collection of files in one step. With minor mods to the SAS program, you can loop through a collection of SAS data sets and export multiple CSV files. Instead of defining a single file to download, set the &DOWNLOAD_FROM variable to a file spec with a wildcard. The Copy Files task can handle wildcard notation -- no problem. (Well, no problem anymore, as long as you grab this update.)
/* specify a wildcard */
%let download_from =
  %sysfunc(getoption(work))&delim.%str(*).csv;
 
/* file to create in step */
filename src "%sysfunc(getoption(work))&delim.&datafile..csv";
  • Add a date stamp to your results file. You might have a requirement to keep older versions of your results. With a simple adjustment to the macro expression, you can append a date stamp to the files you create. This will ensure that even if you download the results to the same location each day, the previous results will not be replaced. When you download the file, the name with the date stamp will be intact.
    filename src 
     "%sysfunc(getoption(work))&delim.&datafile._%trim(%sysfunc(today(),date9.)).csv";

    Sample result from this step:

    NOTE: The file SRC is:
          Filename=/sas/work/class_19MAY2013.csv,
    

This is just one example of the useful things you can do with the Copy Files task. SAS users are a creative bunch. What other uses can you think of for this task?
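The first variation above (exporting a collection of data sets and downloading them with a wildcard) can be sketched with a small macro loop. The macro name export_all and the member list are hypothetical; the staging location and the delimiter logic follow the program in STEP 1.

```sas
/* Data sets to export (hypothetical list) */
%let lib     = sashelp;
%let members = class cars;

/* detect proper delim for UNIX vs. Windows, as in STEP 1 */
%let delim = %sysfunc(ifc(%eval(&sysscp. = WIN),\,/));

%macro export_all;
   %local i member;
   %do i = 1 %to %sysfunc(countw(&members.));
      %let member = %scan(&members., &i.);
      /* stage each CSV file in the WORK folder */
      filename src "%sysfunc(getoption(work))&delim.&member..csv";
      proc export data=&lib..&member. dbms=csv outfile=src replace;
      run;
      filename src clear;
   %end;
%mend export_all;
%export_all

/* wildcard file spec for the Copy Files task to download */
%let download_from = %sysfunc(getoption(work))&delim.%str(*).csv;
```

Because the wildcard matches every CSV file in the WORK folder, any other CSV files staged there during the session would be downloaded as well.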

    Related articles

    Copying files in SAS Enterprise Guide
    Fixes for the Copy Files task in SAS Enterprise Guide

    tags: FTP, SAS custom tasks

    Every time you see a rainbow, do you look to see where it begins and where it ends? Legend has it that there is a pot of gold at the end of each rainbow with leprechauns guarding it. While this might be popular Irish folklore, and you may not find gold at the end of the rainbow, you will find there is a treasure trove at the end of every SAS Global Forum.

    SAS Global Forum is one of the largest user-based conferences, and it has amassed a lot of treasure over the past three decades. Each year, the locale and the conference theme change to meet the current needs of SAS users. While the 80’s "large volumes" may have been associated more with the "big hair" trend, the 2013 conference was fashionably bigger and interested in large volumes of another kind--Big Data and Big Analytics!

    To reflect the title of Dr. Dave Dickey’s paper, SAS conferences are all about finding the gold in your data. And like the information in Dr. Dickey’s paper, there are numerous conference treasures in the form of Paper Presentations, trend-setting interviews with industry leaders on Live Reports, geek wisdom on Tech Talks, on-site interviews with SAS attendees and many other cool videos captured on demand for later viewing.

    SAS Global Forum Take-Out is another treasure that offers a selection of some of the best audio presentations from the conference. For those of you primarily interested in reading technical papers and posters, we have it all! Proceedings for 38 conferences from 1976 to 2013 have been archived for your reading pleasure.

    The meetups and networking events at the conference inspire each one of us differently. If you are inspired and interested in learning more about your local users groups and participating at similar engagements, take a look at the Happenings page. It has all the current information on customer events, SAS Talks, users groups, newsletters, blogs and much more. It’s a goldmine. The more you dig, the more gold you find!

    We hope you find these SAS resources helpful and find creative ways to reap the benefits of this "pot of gold" by reading, commenting and sharing with your network.

    Image provided by Sin Amigos // attribution by Creative Commons

    tags: SAS Global Forum

    For programmers who are learning the SAS/IML language, it is sometimes confusing that there are two kinds of multiplication operators, whereas in the SAS DATA step there is only scalar multiplication. This article describes the multiplication operators in the SAS/IML language and how to use them to perform common tasks such as the elementwise product, the dot product, and the outer product of vectors.

    Elementwise multiplication (#)

    The elementwise multiplication operator (#) is used to perform element-by-element scalar multiplication. This operator is not part of the DATA step syntax. If you have two matrices of the same dimension, then u#v is the matrix whose ith element is the product of the ith elements of u and v. (This product is also known as the Hadamard product.) This is shown in the following PROC IML example:

    proc iml;
    u = { 1, 2, 3};         /* 3x1 column vector */
    v = {-1, 0, 2};         /* 3x1 column vector */
     
    elemProd = u#v;         /* elementwise product (Hadamard product) */
    print elemProd;

    The elementwise multiplication operator can also be used in some situations in which u is a vector that has the same row or column dimension as v. See my article on how the SAS/IML language "knows what you want."

    True matrix multiplication (*)

    The matrix multiplication operator (*) performs true matrix multiplication. Whereas the * operator is used for scalar multiplication in the DATA step, the operator is used for matrix multiplication in PROC IML. If u and v are any two matrices where the number of columns of u matches the number of rows of v, then the matrix product u*v is defined.

    When u and v are vectors, matrix multiplication gets a special name. When a row vector is multiplied with a column vector, the result is a scalar and the operation is called the dot product (or inner product or scalar product). The following example uses the transpose operator (`) to create a row vector:

    dotProd = u`*v;         /* dot product (scalar product, inner product) */
    print dotProd;

    There is an interesting connection between the elementwise product and the dot product of two vectors. The dot product of u and v is the same as the sum of the elements of the elementwise product: u`*v = sum(u#v).

    Matrix multiplication is not commutative, so you get a different result if you multiply a column vector with a row vector. The result is a rank-1 matrix. This is called the outer product of two vectors. An example follows:

    outerProd = u*v`;       /* outer product: column vec times row vec */
    print outerProd;

    Other matrix products

    The SAS/IML language supports other kinds of multiplication, including the direct product (or Kronecker product) and the horizontal direct product of matrices:

    dirProd = u`@v;        /* direct product (Kronecker product) */
    hdirProd = hdir(u,v);  /* horizontal direct product; arguments need the same number of rows */

    There are many special-purpose products that are not covered in this short article, but remember that you can always define your own SAS/IML function that computes any conceivable product. For example, in physics classes students use the "cross product" (also called the skew-symmetric product) to compute quantities that arise in electromagnetism. The following SAS/IML function implements the cross product computation:

    /* cross product (3D vectors only) */
    start CrossProd(u, v);
       i231 = {2 3 1};
       i312 = {3 1 2};
       return( u[i231]#v[i312] - v[i231]#u[i312] );
    finish;
     
    uxv = CrossProd(u,v);

    Be aware that in statistics, the "cross product" often refers to the multiplication X`*X, where X is a data matrix. In this matrix product, the (i,j)th element of X`*X is the dot product of the ith and jth columns of X.
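
    As a quick numerical check of the identities mentioned in this article, the following self-contained program computes the dot product two ways and then forms a crossproducts matrix. The matrix X is a made-up example, not data from the article, and the program starts its own PROC IML session.

    ```sas
    proc iml;
    u = { 1, 2, 3};               /* same vectors as in the examples above */
    v = {-1, 0, 2};

    /* dot product two ways: matrix product vs. sum of elementwise product */
    dot1 = u`*v;
    dot2 = sum(u#v);
    print dot1 dot2;              /* both equal 5 */

    /* hypothetical 3x2 data matrix */
    X = {1 2,
         3 4,
         5 6};
    XtX = X`*X;                   /* 2x2 crossproducts matrix */
    d12 = X[,1]` * X[,2];         /* dot product of columns 1 and 2 */
    print XtX, d12;               /* d12 = 44, the same as XtX[1,2] */
    quit;
    ```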

    tags: Getting Started, Matrix Computations

    May 19, 2013

    Recent versions of SAS Enterprise Guide (version 5.1 and later) use Microsoft .NET 4.0, which enforces additional security requirements before running custom task DLLs that you download from the Web, including those that you download from support.sas.com. Because these task DLLs are downloaded from the (big and scary) Internet, the Microsoft .NET runtime does not automatically "trust" them as it would trust a properly installed application. To enable the task to run, you must first "unblock" the file using Windows Explorer.

    1. Using Windows Explorer, browse to the assembly (DLL) that you downloaded from the samples and extracted from the Zip file.
    2. Right-click on the DLL file, and from the shortcut menu, select Properties.
      The Properties dialog box opens.
    3. On the General tab, click Unblock to indicate that this DLL is trusted.
      (Note: the Unblock button will not appear if the assembly is already unblocked and available.)
    4. Click OK to close the Properties dialog box.

    If the DLL is blocked when you try to add it in SAS Enterprise Guide, you might see a message such as the following, and the task will not appear in the Tools->Add-Ins menu:

    Unable to load program .... 
    Could not load file or assembly 'SAS.Tasks.Examples' or 
      one of its dependencies
    

    You can read more about this security feature and behavior in this Microsoft Knowledge Base article. If you build your own custom tasks (for example, by using the example projects and source code), you will not need to unblock the DLLs as you build them.

    Related articles

    Custom tasks for SAS Enterprise Guide: Q & A
    Introduction to SAS Custom Tasks [SAS Talks webinar]
    Custom Tasks for SAS Enterprise Guide using Microsoft .NET

    tags: .net, SAS custom tasks

    May 17, 2013

    SAS Global Forum 2013 is a couple weeks in the past, but the feedback and anticipation shared by customers as they heard about SAS 9.4 are still fresh in our minds here at SAS.  As we put the final touches on the June release, the excitement we felt in San Francisco is still in the air.

    To borrow a quote from Ron Burgundy in the movie Anchorman -  "I don’t know how to put this, but SAS 9.4 is kind of a big deal."   So why is SAS 9.4 a big deal?  

    Presenters at Global Forum shared some of the SAS 9.4 story and there are many, many new features I could include here, but these are some of my favorites so far: 

    • Clustering  support for the metadata and middle-tier servers enabling scalability and higher availability.  Through server clustering, where multiple servers manage copies of the same data, the threat of data loss or whole-system downtime is drastically reduced. Clustering also allows users and tasks to be spread across the clustered environment to distribute the workload and improve performance.
    • Introduction of SAS Environment Manager, which provides SAS administrators with a deeper understanding of their SAS deployments  and increased capabilities to monitor and manage their SAS servers. Through a plug-in interface, SAS Environment Manager deploys software agents on each managed SAS server and interacts with these agents to gather health and availability information, perform resource control actions, collect server resource usage, and more.
    • For mobile delivery of SAS reports, the Base SAS Output Delivery System (ODS) adds EPUB support. With ODS EPUB you can output SAS reports as e-books that can be read with iBooks on the iPad and iPhone.  Additionally, support is provided for output to HTML5 and Microsoft PowerPoint files.
    • DS2 sessions and demos were highly attended as SAS programmers were interested in hearing about the new SAS programming language that allows code to be submitted from Base SAS sessions to run in-database to perform advanced data manipulation without moving the data out of the database.
    • In the Opening Session, SAS CEO Jim Goodnight explained, “SAS 9.4 allows your IT team to deploy SAS with confidence that it meets the requirements for security, authentication, scale and resiliency.”  Attendees were interested to learn more about middle-tier and metadata server clustering, scalability support and increased authentication support in SAS 9.4.
    • Goodnight’s statement during the Opening Session, “SAS 9.4 delivers cloud-friendly architecture,” added to considerable interest in the new SAS Cloud tools and technologies, including SAS App Central, SAS App Engine, SAS vApp technology and SAS Web Editor.  Cloud-friendly SAS enables rapid deployment of SAS in cloud environments, provides simple management tools and processes, and promotes innovation through a SAS cloud development platform.
    • The SAS Web Application Server embedded in SAS 9.4 reduces overall cost and complexity of SAS deployments. By eliminating the cost to acquire, integrate,  maintain and support third-party software, the SAS Web Application Server saves you money and time.  By reducing integration complexity and embedding software optimized for SAS, the SAS Web Application Server provides right-sized, integrated technology that simplifies IT management and makes your SAS environment cloud-ready.
    • Availability of High-Performance Analytics procedures (HP PROCs) to customers running on a single server. Customers will be able to leverage the performance benefits of multi-core computing as they take advantage of these new HP PROCs. An added benefit of the HP PROCs is that when you grow your SAS deployment from a single server to multiple servers, your high-performance code automatically scales to run in your multi-server, distributed environment.
    • A Decision Management user experience that combines data management, business rules management, data lineage, orchestration and more – all from within one unified user interface.

    Which SAS 9.4 topic is the "biggest deal" to you? Which topics would you like to hear more about prior to the June release?

    There is plenty to say about SAS 9.4.  Be sure to check back for more information!

    tags: SAS 9.4, SAS Global Forum

    Many people have commented how ironic it is that I’m writing computer games these days because I’m one of the least playful people you’ll meet.

    I have a confession to make, although confession is perhaps the wrong word because I don’t feel the least bit bad about it.

    Playing with small children bores me.

    Don’t get me wrong – I love my children and grandchildren and I would do anything for them. I taught my children to read, took them to soccer/judo/track/swim practice, to piano/bassoon/guitar/drum lessons and ballet/tap/hip-hop classes. I worked thousands of hours of overtime to pay for camps in Europe, in marine biology, private universities.

    And yes, I went to the park, played with my little ponies, pushed children on swings, threw them up in the air (and caught them – any problems they have are NOT because they were dropped on their heads at a young age no matter how much their behavior during adolescence might lead you to believe otherwise). I read The Perfect Jennifer her favorite book – Where the Wild Things Are – so many times that I still have it memorized years after she finished graduate school.

    AND YET …. when I hear those women rave that sitting down with their children and eating carrot sticks while they played with my little ponies together were the most fulfilling moments of their lives, I think to myself,

    What? Are you fucking kidding me?

    And apologies to the nice man at SAS Global Forum who reminded me that some people read my blog at work and asked me if I could not swear quite so much. I did post four days in a row on factor analysis and no swearing was involved, so I made a good faith effort, I really did.

    Seriously, though, that’s what fulfills you? My little ponies?

    Because as I was listening to my granddaughter talk about my little ponies what was going through my head was how I could use a statistical test for the difference in sample proportions to prove that a set of data I was asked to analyze was fraudulent. I’ll probably post about that next week. I was also intrigued by the very simple way the Muthéns had demonstrated comparison of competing factor solutions by using a table showing the chi-square, RMSEA and presence/absence of Heywood cases.

    When my four-year-old granddaughter told me she wanted to be a princess when she grew up I told her,

    Princesses suck and I hate princesses. They’re useless and they don’t DO anything.

    To which my darling daughter number one responded that “we” don’t say “hate” and “we” don’t say “suck” and I believe she muttered under her breath something about it being a wonder that she turned out normal with a mother like me. Obviously, this is a new meaning of the word “we” that doesn’t include the other person.

    I am certain that I muttered under my breath, “Well, it’s true. They DON’T do anything useful.”

    As penance I was forced to go to Disneyland and visit the Pavilion of Princesses. My granddaughter ADORED it. I was bored out of my mind by the princesses but the radiant look on her face DID make it worth taking a day away from work and paying Disneyland the equivalent of the median annual income in many countries for seven of us to eat churros and buy random pink crap bearing the stamp of useless women a.k.a. princesses.

    The truth is, as much as I truly loved my children – and I had three under age five while working on my PhD – at the end of each day, when they were all asleep, I sighed deeply, sat down and read books on multivariate statistics and matrix algebra and was satisfied with life. I did NOT wish they would wake up so we could dress up like princesses.

    There you have yet another of the 55 things I have learned in (almost) 55 years – you can be bored to death by Curious George, Strawberry Shortcake and every other thing designed to appeal to people with the mind of a three-year-old and still be a good mother.

    It reminds me of a story I heard about someone who had a son who was crazy about baseball. The father bought season tickets, attended every home game and when the team made the World Series he flew to whatever city it was being held in to attend the games. When someone said to him,

    I never knew you loved baseball so much.

    He replied,

    I don’t. I think baseball is the most boring game ever invented. But I love MY SON that much.


    There are times when you need to account for holidays when performing date calculations. The holidays macro below creates a data set of holidays for the specified year range. It adjusts holidays that fall on a weekend; for example, Independence Day in 2015 (Saturday, July 4) is moved to Friday, July 3, 2015.



    /**************************************************************************
    * Program: holidays.sas
    * Author: Tom Bellmer
    * Created: 17MAY2013
    * Purpose: create a list of holidays for the specified years
    * Usage: %holidays( startyear = 2013, stopyear = 2014 )
    * Notes: use with a hash object to determine holiday adjustments
    **************************************************************************/

    %macro holidays
       (
         startyear = 2010
       , stopyear  = 2050
       , outdsn    = work.holidays
       , view      = y
       ) ;

       data &outdsn.
          %if %lowcase( %substr( &view, 1, 1 ) ) = y %then / view = &outdsn. ;
          ;
          attrib
             year      length = 3   label = 'Year'
             date      length = 4   label = 'Date' format = date9.
             dayofweek length = $16 label = 'Day of Week'
             holiday   length = $32 label = 'Holiday'
          ;

          /* column 1: name passed to the HOLIDAY function; column 2: label */
          array aholidays[ 10, 2 ] $32 _temporary_
          (
            "NewYear",        "New Year's Day"
          , "MLK",            "Martin L King, Jr. birthday"
          , "USPresidents",   "President's birthdays"
          , "Memorial",       "Memorial Day"
          , "USIndependence", "Independence Day"
          , "Labor",          "Labor Day"
          , "Columbus",       "Columbus Day"
          , "VeteransUSG",    "Veterans Day"
          , "Thanksgiving",   "Thanksgiving Day"
          , "Christmas",      "Christmas"
          )
          ;

          do year = &startyear. to &stopyear. ;

             do _n_ = 1 to dim( aholidays, 1 ) ;
                date = holiday( aholidays[ _n_, 1 ], year ) ;
                /* adjust date forward on Sunday or back on Saturday */
                date = intnx( 'day', date
                   , choosen( weekday( date ), 1, 0, 0, 0, 0, 0, -1 ) ) ;
                dayofweek = put( date, downame.-l ) ;
                holiday = aholidays[ _n_, 2 ] ;
                output ;
             end ;

          end ;

       run ;

    %mend ;

    /* EOF: holidays.sas */
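
    The header comment suggests using the output with a hash object. The following is only a sketch of that idea, assuming the %holidays macro above has already been compiled; the data set name work.firstweek and the one-week date range are made up for illustration. It loads the 2015 holidays into a hash keyed by date and flags each day in the first week of July.

    ```sas
    /* materialize the 2015 holidays as a data set rather than a view */
    %holidays( startyear = 2015, stopyear = 2015, outdsn = work.holidays, view = n )

    data work.firstweek ;                     /* hypothetical output name */
       if 0 then set work.holidays ;          /* adds DATE and HOLIDAY to the PDV */
       declare hash h( dataset: 'work.holidays' ) ;
       h.defineKey( 'date' ) ;
       h.defineData( 'holiday' ) ;
       h.defineDone() ;

       do date = '01JUL2015'd to '07JUL2015'd ;
          /* find() returns 0 when DATE is a key in the hash */
          if h.find() ne 0 then call missing( holiday ) ;
          output ;
       end ;
       stop ;
       keep date holiday ;
    run ;

    proc print data = work.firstweek noobs ;
    run ;
    ```

    In this sketch only 03JUL2015 should carry a holiday label (Independence Day, observed), because the macro moves the Saturday holiday back one day.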

    May 16, 2013

    A few months ago I released the Copy Files task for use with SAS Enterprise Guide. The task allows you to transfer any files between your PC and a SAS Workspace session, much like an FTP process. It doesn't rely on FTP though; it uses a combination of SAS code, Windows APIs, and SAS Integration Technologies to get the job done.

    It's proven to be a very popular task, because it can be useful in so many situations. It even earned a mention in a SAS Global Forum paper this year (and no, it wasn't a paper that I wrote).

    Today I'm going to point out the things that the task doesn't do so well. Or at least, that it didn't do well until I made some updates. My changes were based on two "complaints" from several SAS users.

    Read on for the details. But if you don't care and you just want the latest version of the task, you can download it from here.

    Complaint #1: Wildcards that are a little too "wild"

    The task allows you to use wildcard characters in your file specifications so that you can match multiple files to transfer. A problem occurs though, when your file specification looks like this:

    /usr/local/data/*.xls
    

    Can you guess the problem? What if I told you that the task stores your file specification in a SAS macro variable? Yep, it's that "/*" sequence in the value that trips things up, because SAS interprets it as the start of a comment. Left unchecked, this sabotages the remainder of the SAS code that is included in the process.

    The SAS macro experts are already shouting out the answer to fix this: use %STR to wrap the slash and "hide" the token from the SAS parser. That's a great idea! Except that the task relies on the "internal" value of the macro variable, not the displayed value, when it comes time to process. These values are different when %STR wraps a special character like the forward slash. The macro facility replaces this character with a nonprinting hexadecimal code called a delta character.

    To illustrate, I used another popular custom task -- the SAS Macro Variable Viewer -- to show the inner value of a SAS macro variable:

    Notice the funky arrow characters. Is that what you were expecting?

    Now the task detects the presence of a forward slash (and some other special characters) and will automatically add the %STR so you don't have to. (But you can still use %STR if you want to.) And it correctly detects the delta characters, if present, to convert them back to their correct form before trying to use the value.
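
    The quoting problem is easy to reproduce outside the task. The sketch below uses a hypothetical macro variable name (filespec) and illustrates only the general macro-quoting idea, not the task's internal code.

    ```sas
    /* Problem (shown inside a comment so that it is not executed):        */
    /*    %let filespec = /usr/local/data/*.xls;                           */
    /* The slash followed by the asterisk opens a comment, so the rest of  */
    /* the statement, including its semicolon, is swallowed.               */

    /* Fix: mask the slash with %STR so the two characters are never       */
    /* scanned together as the start of a comment.                         */
    %let filespec = /usr/local/data%str(/)*.xls;
    %put filespec=&filespec;
    ```

    The %PUT statement should display the path the way you typed it, even though the stored value contains the delta characters described above.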

    Complaint #2: Fixing line-ending characters but breaking other stuff

    Users of FTP might be familiar with binary versus ASCII mode for file transfers. Because UNIX line-endings are different than Windows line-endings for text files, transferring a file in ASCII mode helps to ensure proper line-ending behavior for the target host.

    The Copy Files task transfers ALL files using a binary mode. Why? Because in today's global workplace even text-based files often don't adhere to the limited English-centric ASCII standard. Attempting a text-based file transfer could result in encoding mismatches, so it's much safer to transfer content as "binary blobs".

    But you still want your text files to have the proper line endings for the target host. To answer that, the Copy Files task offers a "Fix line-ending characters" option that does the following:

    • Scans the file to determine whether it's a text file. (This relies on the file content and not on special file extensions such as .TXT or .CSV.)
    • Rewrites the file and replaces the line-ending characters as needed for the target file system (Windows or UNIX).

    The problem was that in rewriting the file (using Windows-based StreamReader and StreamWriter functions), the Copy Files task was changing the file encoding to UTF-8. That encoding works fine on Windows and most users didn't even notice. But some users sent me output from file dump tools and comparisons that showed the byte-order mark characters that were added to the file. (SAS users: I knew I could count on you!)

    To address this, I changed the "fix line endings" process to use lower level I/O functions that simply scan through the text files as a binary stream, byte-for-byte, and change the line endings as needed. Trying to decide on proper encoding is risky business, so I decided to leave the character encoding untouched.
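
    The task itself is written for Microsoft .NET, but the byte-for-byte idea can be sketched in SAS as well. The following DATA step is an illustration only, not the task's code: the filerefs and paths are hypothetical, and it handles just the Windows-to-UNIX direction by dropping every carriage-return byte.

    ```sas
    filename winfile  'C:\temp\input_windows.txt';   /* hypothetical path */
    filename unixfile 'C:\temp\output_unix.txt';     /* hypothetical path */

    data _null_;
       infile winfile recfm=n;       /* read the file as a raw byte stream  */
       file unixfile recfm=n;        /* write bytes, add no line endings    */
       input byte $char1.;           /* one byte per iteration              */
       /* drop CR ('0D'x); keep LF and every other byte unchanged */
       if byte ne '0D'x then put byte $char1.;
    run;
    ```

    Because the bytes are copied without being decoded, the character encoding of the file is left as it was.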

    In addition to my own testing, a couple of users out there have confirmed that my changes fix the issues -- at least for now. Thanks for that! If you want to try the latest, get it now from here:

    >> Download the Copy Files task

    Related articles

    Copying files in SAS Enterprise Guide
    Inspecting SAS macro variables in SAS Enterprise Guide

    tags: FTP, macro programming, SAS custom tasks, SAS Enterprise Guide

    TOP NEWS

    Headline | Company | Type
    Amgen, Novartis aim to fuel biotech startups in alliances with Atlas Venture | Atlas Ventures | Collaboration
    St. Jude snags CE mark for 3-D stent imaging | St. Jude | EMA Approval
    FDA Approves Janssen Biotech's Simponi to Treat Ulcerative Colitis | Janssen Biotech | FDA Approval

    CLINICAL TRIALS

    Headline | Product | Therapeutic Area | Company | State
    Experimental Gilead Drug Effective in Early-Stage Leukemia Trial | Pharmaceutical | Oncology | Gilead | CA

     

    ALSO IN THE NEWS

    Are you an NFL fan, or curious about analyzing social media data? -- Well, in either case, this blog's for you! I recently read a fascinating Facebook article that included a U.S. county map showing which NFL (U.S. football) team had the most 'likes' in each county (based on ~35 million [...]

    I’ve already written about one highlight of SAS Global Forum 2013: the SAS Web Editor. Here are some more features that I think deserve mention. Please note that I make no claims about the comprehensiveness or completeness of this list.

    SAS 9.4 and Enterprise Guide 6.1 are scheduled for release in June and bring some important new features.  Here are a few:

    ODS POWERPOINT destination  The usefulness of this destination is obvious.  There will be two new styles designed specifically for PowerPoint: one with a white background and one with a black background.  You can use other styles too, but these new styles have the advantage of being fully compatible with the PowerPoint theme selector.  Any graphs you create using ODS Graphics will be embedded in this destination.

    ODS LAYOUT  If I remember correctly, the first time I ever heard about the ODS LAYOUT statement was at SUGI 28 in 2003.  I don’t know why it has taken so long to move to production, but I’m glad it’s finally here. If you need custom reports that combine results from multiple procedures, then you will probably love ODS LAYOUT.

    PROC ODSLIST and PROC ODSTEXT  These new procedures allow you to create bulleted lists and formatted blocks of text in reports. The content can be static or dynamic (based on a data set).

    ODS Graphics  The SG procedures continue to mature.  When I attended Dan Heath’s super-demo on SG procedures, members of the audience repeatedly said “Oh, good, I need that.” New features include a SORT= option in SGPANEL, insets in SGPANEL, split characters for tick and axis labels, and PERCENT and MEDIAN options for STAT=.

    Enterprise Guide  For a long time, one of the problems with Enterprise Guide was that it kept evolving so quickly that users felt like they had to learn it all over again with each release. EG users will be glad to know that EG 6.1 uses the same basic layout as EG 4.2 and 5.1.  Improvements in EG 6.1 include sticky notes and a log summary to help people who write code.  Developer Casey Smith said there will be better integration with the ODS Graphics Editor.  I am glad to hear that ODS Graphics is being supported in EG, and I would like to see this dramatically increased. EG users should have the best graphics that SAS can offer.

    SGF is finally global  28 percent of attendees were from outside the US.  I have attended SGF and its predecessor SUGI for decades, and most of that time I didn’t see a single attendee from outside the US. It’s exciting to see SAS Global Forum living up to its name.

    Finally, if you didn’t get to attend SGF (or even if you did) there were some great presentations that you should watch.  I know these have been mentioned by many other people, but they are surprisingly hard to find online. So here are the links:

    Opening Session The Opening Session was informative and included an amazing performance by the dance troupe Les Ombres.

    Roger Craig’s talk about how he used analytics to train to be a contestant on Jeopardy! was fascinating.


    SAS Global Forum ended two weeks ago.  I thought by now someone would have written about SAS on the Mac and saved me the trouble, but since I don’t see much discussion of this in the blogosphere, here are my belated two cents.

    If you have been using SAS as long as I have, then you probably know that running SAS on a Mac is nothing new.  SAS Institute released SAS for the Mac lo these many years ago, but then dropped it just a couple years later because there weren’t enough users (read licenses) to justify it. And since then, of course, Mac users have gotten several different products that allow them to run Windows software.  So anyone who really wants to run SAS on a Mac has had that ability for a while.

    Given that history, the last thing I expected to see at the Opening Session was a demo of SAS on the Mac–much less on the iPad.

    Of course, this is not the same SAS for the Mac that was dropped so long ago.  This is the SAS Web Editor.

    The SAS Web Editor is a nimble version of Display Manager that runs in a browser (any HTML 5 compliant browser).  I learned about it just over a month ago when my husband mentioned to me, as we ate dinner, that he had read an interesting blog describing the SAS Web Editor.  Thank you to AnnMaria deMars for getting the word out!  Here is an official press release from SAS Institute dated March 6, 2013.  The SAS Web Editor is a client-server application.  The editor is the client.  To use it, you must have SAS running on some server. That server can be local or remote.  Considering how aggressively SAS Institute has promoted cloud computing over the last decade, it is perhaps surprising that it has taken this long to come up with Display Manager for the Web.  The SAS Web Editor feels like a missing link.  It makes a lot of sense.

    Here are some specifics from the Opening Session.  They used the SAS Web Editor in a browser on the Mac to access VMware to run SAS for Linux on the same Mac.  Then they demoed the SAS Web Editor on an iPad (pictured above) which also used the Mac as its server.  (Currently academic users of the SAS Web Editor use SAS Institute’s servers.  Maybe for the opening session they were concerned about slow connection speeds to Cary.  Given the complaints I’ve heard about the internet service at the Moscone Center, this is easy to believe.)

    Of course, you can use the SAS Web Editor on Windows (which is what I am doing).  So I find it interesting that they chose to demo it on Apple hardware.  Not only did they show Macs and iPads in the Opening Session, but I saw a lot of iPads being used by SAS staff at the conference.  I think this was a smart move for SAS Institute.  Firstly, there is an undeniable Cool Factor associated with Apple hardware that can only help SAS’s reputation.  At present, SAS is losing the battle for the academic market.  Maybe this will help turn the tide.  Secondly, this is a good time to distance oneself from Windows.  This fact was underscored for me by an article in last week’s Economist magazine titled “Microsoft blues: Windows 8 is only the beginning of Microsoft’s problems.”

    A few other interesting tidbits about the SAS Web Editor:  It is not exactly the same as Display Manager, but the developers showing it in the Demo Room made it clear that they are working hard to get the kinks out. It is currently available only for academic use, but in the Opening Session it was said that it will be available as a free download–no mention of when. They also mentioned that it will be available for Android platforms.

    You can still view the Opening Session online. The SAS Web Editor demo starts around 1 hour in.



    Previously, I discussed how to do a confirmatory factor analysis with Mplus. What if you aren’t sure what variables should load on what factor? Then you are doing an exploratory factor analysis. Really, you should probably do the exploratory factor analysis first unless you have some very large body of research behind you saying that there should be X number of factors and these exact variables should load on them. If you’re analyzing the Wechsler Intelligence Scale, you probably could skip the exploratory step. For everyone else …. here is how you do an exploratory factor analysis with Mplus.

    TITLE: Exploratory Factor Analysis ;
    DATA: FILE IS 'values.dat' ;
    VARIABLE: NAMES ARE q1f1 q2f1 q3f1 q1f2 q2f2 q3f2 ;
    ANALYSIS: TYPE = EFA 1 3 ;
    ESTIMATOR = ML ;

    When no rotation is specified using the ROTATION option of the ANALYSIS command, the default oblique GEOMIN rotation is used.

    I explained the first three statements earlier this week.

    The fourth statement is new. Like the other statements, you need to follow the ANALYSIS key word with a colon and end each statement in the command (or if you are familiar with SAS, think of it as a procedure) with a semi-colon.

    TYPE = EFA 1 3 ;

    Requests an exploratory factor analysis with a 1 factor solution, 2-factor solution and 3-factor solution.  Of course, depending upon your own study, you can request whatever solutions you want. This is really useful because often in an exploratory study you aren’t quite sure of the number of factors. Maybe it is two or maybe three will work better. Mplus gives you a really simple way to request multiple solutions and compare them. I’ll talk more about that in the next post.

    ESTIMATOR = ML ;

    requests maximum likelihood estimation.

    If you are interested in factor analysis at all, there is a really good video on the Mplus site. Most of it covers exploratory and confirmatory factor analysis (methods, goodness of fit tests, equations, interpretation of the factor matrix) rather than Mplus itself, which, as you can see, is pretty easy. Even if you are using some other software, the video is definitely worth checking out.

     

     

    May 15, 2013

    TOP NEWS

    Headline | Company | Type
    Roche gains FDA sign-off on lung cancer companion Dx | Roche | FDA Approval
    GlaxoSmithKline gains blockbuster FDA approval of lung drug Breo | Theravance | FDA Approval
    Ranbaxy inks record-setting $500M manufacturing settlement with the feds | Ranbaxy | Report

    FUNDING

    Headline | Company | Therapeutic Area | State
    Billionaires back Kite Pharma in $20M raise for cancer immunotherapies | Kite Pharma | Oncology | CA
    AbbVie grabs rights to biotech startup’s anti-gluten drug for $70M upfront | Alvine Pharma | Gastroenterology | CA

    ALSO IN THE NEWS

    Industry Voices: VC questions thinking of Obama’s BRAIN initiative

    EVENT

    Date | Name | City | Time | Price
    5/22 | EXECUTIVE BREAKFAST WITH KAREN LICITRA | Menlo Park | 8-10A | FREE
    5/28 | Bio2Device Group – Treating Sleep Apnea | Sunnyvale | 830-1030A | FREE

     

    Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression.

    This is the kind of comment statisticians find funny that leaves other people scratching their heads. The point is that it’s not that difficult to get output for some fairly complex statistical procedures.

    Let’s start with the confirmatory factor analysis I mentioned in my last post. Once you get past the standard stuff that tells you that your model terminated successfully, the number of variables and factors, you see this:

    Chi-Square Test of Model Fit

    Value                              8.707
    Degrees of Freedom                 8
    P-Value                           0.3676

    The null hypothesis is that there is no difference between the patterns observed in these data and the model specified. So, unlike many cases where you are hoping to reject the null hypothesis, in this case I certainly do NOT want to reject the hypothesis that this is a good fit. As you can see from my chi-square value above, this model is acceptable.
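
    If you want to check a p-value like this one outside of Mplus, here is a minimal sketch in a SAS DATA step. It assumes nothing beyond the two numbers printed above (the chi-square value and its degrees of freedom); the variable names are my own.

    ```sas
    data _null_;
       chisq = 8.707;                    /* chi-square value copied from the Mplus output */
       df    = 8;                        /* degrees of freedom copied from the Mplus output */
       p     = sdf('CHISQ', chisq, df);  /* upper-tail probability, same as 1 - probchi(chisq, df) */
       put 'p-value = ' p 6.4;           /* written to the SAS log */
    run;
    ```

    The value written to the log should be close to the 0.3676 that Mplus reports. Because it is well above .05, we do not reject the hypothesis that the model fits.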

    Another measure of goodness of fit is the root mean square error of approximation (RMSEA).

    RMSEA (Root Mean Square Error Of Approximation)

    Estimate                           0.011
    90 Percent C.I.                    0.000  0.046
    Probability RMSEA <= .05           0.973

    An acceptable model should have an RMSEA less than .05. You can see above that the estimate for RMSEA is .011, the 90 percent confidence interval is 0 – .046 and the probability that the population RMSEA is less than .05 is 97.3%. Again, consistent with our chi-square, the model appears to fit.

                                                        Two-Tailed
                        Estimate       S.E.  Est./S.E.    P-Value

    F1       BY
    Q1F1               1.000      0.000    999.000    999.000
    Q2F1               1.828      0.267      6.833      0.000
    Q3F1               1.697      0.235      7.231      0.000

    F2       BY
    Q1F2               1.000      0.000    999.000    999.000
    Q2F2               1.438      0.291      4.943      0.000
    Q3F2               1.085      0.191      5.687      0.000

    Here are the unstandardized estimates. By default, the loading of the first variable on each factor is fixed at 1, so, of course, it has no real standard error, z-value or p-value; Mplus prints 0.000 and 999.000 as placeholders. It isn’t really an estimate, because it was set. Let’s look at the other two. Since they are unstandardized, the more useful measure for us is the estimate divided by its standard error, for example 1.828 / .267. This is done for us in the column under Est./S.E. and in that case comes out to 6.833. You interpret these values in the same way as any z-score, with 1.96 as the critical value, and you can see in the last column that all of my variables loaded on the hypothesized factor with a p-value much less than .05.
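
    To make the arithmetic concrete, here is a rough sketch that recomputes the ratio and a two-tailed p-value from the printed estimates and standard errors. The data set and variable names (loadings, Z, P, Significant) are my own, and the two fixed loadings are left out because their standard error is zero.

    ```sas
    data loadings;
       input Variable $ Estimate SE;
       Z = Estimate / SE;                    /* same ratio as the Est./S.E. column */
       P = 2 * (1 - probnorm(abs(Z)));       /* two-tailed p-value from the standard normal */
       Significant = (abs(Z) > 1.96);        /* 1 if beyond the .05 critical value, else 0 */
       datalines;
    Q2F1 1.828 0.267
    Q3F1 1.697 0.235
    Q2F2 1.438 0.291
    Q3F2 1.085 0.191
    ;
    run;

    proc print data=loadings noobs;
       format Z 8.3 P pvalue6.4;
    run;
    ```

    The ratios may differ slightly in the second or third decimal place from the Mplus column, since the printed estimates and standard errors are already rounded to three decimals.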

    The next thing I look at is the residual variances. At this point my only concern is that I *not* have a residual variance that is negative. It makes no sense that you would have a negative variance because (among other reasons) variance is a sum of squares and squares cannot be negative. Also, when a residual variance is negative, the communality is greater than 1, meaning you have explained over 100% of the variance in this variable by its relation to the latent construct. This also makes no sense. These are referred to as Heywood cases and are explained beautifully here (even though the linked documentation is from SAS, it applies to any confirmatory factor analysis).

    The final thing I want to look at, for right now, anyway, is the R-squared:

    R-SQUARE

    Observed                                        Two-Tailed
    Variable        Estimate       S.E.  Est./S.E.    P-Value

    Q1F1               0.142      0.032      4.473      0.000
    Q2F1               0.475      0.065      7.256      0.000
    Q3F1               0.438      0.061      7.123      0.000
    Q1F2               0.174      0.045      3.883      0.000
    Q2F2               0.376      0.078      4.827      0.000
    Q3F2               0.179      0.044      4.057      0.000

    You can see that the r-square is pretty decent overall. These are interpreted just like any other R-square values. I didn’t show the standardized factor loadings here but just take my word for it that the R-squared values are the standardized loadings squared. So this is the variance in q1f1, for example, explained by factor 1.

    I started this whole thing working with Mplus to do a factor analysis and overall, I’d have to call it a pretty painless experience.

     

    We’re just coming back from SAS Global Forum, and what a show! SAS Books was there to provide users with the highest-quality resources for learning SAS, and our users were there to tell us what new books they were most looking forward to reading. Kevin Smith's PROC TEMPLATE Made Easy: [...]

    In my last post, I promised to lift the lid on British etiquette in time for your visit to the Analytics conference. Well, I’m going to be a bit more specific and focus my insight on London Etiquette, primarily, Greetings and Tube Etiquette. Greetings Around the world, we greet people [...]

    As we saw in my previous post, "Seeing SAS data through metadata", there is a fundamental difference between accessing a SAS library using a physical reference or a metadata reference to that library. By now, you should be an expert on the nuances of physical references to SAS data versus metadata references! This time, we are going to dive into one of the more subtle aspects of metadata library management: pre-assigned libraries.

    Why should you pre-assign SAS libraries?

    As outlined in the SAS 9.3 Intelligence Platform: Data Administration Guide, there are two ways you can make sure that your server is aware of your library reference: 

    • pre-assigning the library
    • letting the client application define the library reference

    Even if you use a LIBNAME statement in your program, it is useful to register SAS libraries and tables in SAS metadata so that they can be used by some of the SAS clients (for example, SAS Data Integration Studio). We saw this demonstrated in our last article when we assigned a libref through a LIBNAME statement versus one assigned in SAS metadata.  Different clients may assign libraries differently, such as SAS Data Integration Studio or SAS OLAP Cube Studio where the library can be automatically generated based on the metadata. 

    When you pre-assign a library, you are making explicit which engine will be used to control the library (such as BASE or ORACLE). Pre-assigning the library has a number of benefits:

    • Pre-assigning librefs helps maintain consistency across users and applications
    • Libraries are always available to the server, regardless of how the program is run (batch versus interactive)
    • Libraries are easier to migrate to new storage since the physical location can be abstracted from the user
    • Pre-assigning libraries makes it easier for developers to eliminate redundant code (for example, not having to manage library references in stored processes)

    As discussed last time, if you don’t pre-assign the library and try to use that library in SAS Enterprise Guide, you get an error.

    PROC Print data=metaref.postassessments;
    run;
    ERROR: Libname METAREF is not assigned.
    

    As a result, you are forced to right click on the library in SAS Enterprise Guide and select “Assign” manually each time you start a new session.

    To see this libref as “Assigned” when connecting to the server, you would need to pre-assign the library.  This action tells SAS to run the assignment code on the server whenever the server is started. If you select “Library is Pre-assigned” in the Advanced options (SAS Management Console → Data Library Manager → <Library> → Properties), you see three Pre-Assignment Types listed:

    • By native library engine
    • By metadata library engine
    • By external configuration

    Understanding three methods for pre-assigning libraries

    So which of the three options is right for your libraries? Here is a nice graphic and description of what really happens behind the scenes when you access a library governed by one of these processes.  

    Key to understanding these choices is knowing the difference between the Metadata Library Engine method (option 2 above) and the other two approaches, which use the native engine. 

    • In the case of the native engine, the system does not check the SAS metadata to see if the user has read access on the table; however, it does check the physical security and whether the user has permission to ReadMetadata. When you use a native engine, the data-level authorizations of Read, Write, Create, and Delete are not checked.
    • If you want to use the metadata authorization layer to control Read, Write, Create, and Delete permissions, then you must pre-assign the library and use the Metadata Library Engine method. But remember, operating system level file permissions should always be considered whether you use the SAS Metadata Libname Engine or the underlying native engine.
    • Additionally, as we have noted, it is not always efficient to pre-assign every library that you might want to use. (In fact, a little birdie reminded me that too much of a good thing may affect performance.) You should determine which libraries will actually be needed before you choose to pre-assign them. You should also set READMETADATA privileges to limit which libraries a group of users sees, as not everyone needs every library.
    • One related area you may want to consider is the notion of read-only access. You will note that in the properties for a library, you have the option of specifying the library as READONLY. This capability augments the metadata permissions on the library or table in that it eliminates the possibility of anyone writing to or destroying elements contained in the library, even if they have metadata permissions to do so.

    In the discussion that follows, I will focus only on the mainstream servers (Workspace Servers, Pooled Workspace Servers, Stored Process Servers, SAS/SHARE servers and OLAP servers) as these servers automatically read metadata when they start and assign the libraries. You will have to edit the server configuration files if you want these to be read by other servers such as SAS/CONNECT, SAS Data Step Batch Server or SAS/IntrNet.

    So let’s summarize the three assignment options in SAS metadata and explore how each one works and why we would select one over another.

    By Native Library Engine

    When you select the native library engine as the Pre-Assignment Type, you will notice that, in SAS Enterprise Guide, the client immediately has access to the library and the defined metadata tables.

    The library is assigned when the SAS session starts and referenced through a SAS System Option called METAAUTORESOURCES. SAS uses the library engine defined for the library in SAS metadata. As noted above, the data-level authorizations are not checked in the SAS metadata. To illustrate this, we will set the metadata permissions in this example to read only.

    Then we will try to replace the table.

    Here, you notice that despite the metadata permissions of Read and ReadMetadata, the user was able to delete the dataset. As we mentioned above, if the native engine is used to access the table, then metadata permissions are not evaluated. 

    Note: For SAS 9.3, SAS has added a feature called metadata-bound libraries, which cannot be overridden. Metadata-bound libraries offer better protection than do other metadata-based approaches to access control because the enforcement begins with the physical data. The enforcement occurs regardless of how the user requests access to the data (through metadata clients such as SAS Web Report Studio, or directly through a LIBNAME statement submitted from SAS Enterprise Guide). For more information, see the documentation for metadata-bound libraries.

    By Metadata Library Engine

    When you select the Metadata Library Engine as the Pre-Assignment Type, the library is similarly assigned through the METAAUTORESOURCES option; however, this option uses the Metadata Library Engine, which ensures that access controls placed on the library and its tables and columns are enforced. Here, we submit a simple DATA step that tries to replace an existing dataset. Note that our metadata permissions did what they were supposed to do.

    By External Configuration

    The third and final option for pre-assigning libraries is to use an external definition or an autoexec file. This method essentially makes the library available to client applications but moves the code out of the user interface to an external file. This option is particularly useful if you want to manage your configuration outside of SAS metadata or share the configuration among multiple services such as SAS/Connect.
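
    As a rough illustration of what such an external file might contain, here is a sketch of a LIBNAME statement that could sit in an autoexec file. The path "/data/assessments" and the library name "Assessments" are placeholders that I made up, and the commented META engine line assumes that the metadata server connection options are already set for the session.

    ```sas
    /* Hypothetical autoexec lines; the libref, path and library name are placeholders */

    /* Native engine: the BASE engine reads the files directly, so metadata
       Read/Write/Create/Delete permissions are not evaluated */
    libname metaref base "/data/assessments";

    /* Metadata Library Engine alternative: the META engine looks up the library
       definition in metadata, so data-level metadata permissions are enforced */
    /* libname metaref meta library="Assessments"; */
    ```

    The point of the sketch is only the contrast between the two engines; your site's actual statements will depend on how the library is registered.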

    Four questions to help choose the best method

    Since I am a visual thinker, I also thought it might be helpful to show these comparisons in a flow chart. Admittedly, this is a simple set of decisions, but the chart should help you decide which pre-assigned library option might be best for your case.

    Hopefully, this post was a useful discussion, and you know a bit more about what it means to pre-assign libraries and how the various options affect whether users can read or both read and write data.

    Happy Data!

    --greg

    tags: metadata, SAS Administrators

    I was pleased to be invited to present a paper on Visual Techniques for Problem Solving and Debugging at this year's SAS Global Forum (SGF) conference. I spoke about the importance of human interaction in solving complex issues; the process and people make a far greater contribution than the associated software tools. I spoke about seven more-or-less visual techniques, some of which I've highlighted in NOTE: before:

    • DMAIC is an excellent end-to-end process to give structure to your whole problem solving endeavour.
    • 5 Whys is a flexible technique for probing root causes.
    • Ishikawa is a terrific approach to information gathering and helps ensure comprehensive coverage of the problem area.

    The Ishikawa diagram (and most of the other techniques I discussed) is a top-down approach. The distinctive element of the Affinity diagram is that it is created bottom-up. Whilst the Ishikawa (and Mind Map) are drawn by starting with general topics (or questions) and then drilling down into detail, the process of drawing an Affinity diagram begins with a brainstormed set of detailed observations and facts.

    The bottom-up idea can sound unstructured, but is it ever a bad thing to have too many ideas? Probably not, but if you've ever experienced information overload or struggled to know where to begin with a wealth of data you've been given, you may have wondered how you can use all of these ideas effectively.

    When there's lots of "stuff" coming at you, it is hard to sort through everything and organise the information in a way that makes sense and helps you make decisions. Whether you're brainstorming ideas, trying to solve a problem or analysing a situation, when you are dealing with lots of information from a variety of sources, you can end up spending a huge amount of time trying to assimilate all the little bits and pieces. Rather than letting the disjointed information get the better of you, you can use an Affinity diagram to help you organise it.

    Also called the KJ method, after its developer Kawakita Jiro (a Japanese anthropologist), an Affinity diagram helps to organise large amounts of data by finding relationships between ideas. The information is then gradually structured from the bottom up into meaningful groups. From there you can clearly "see" what you have, and then begin your analysis or come to a decision.

    Here’s how it works:
    1. Make sure you have a good definition of your problem (ref: DMAIC)
    2. Use a brainstorm exercise (or similar) to generate ideas, writing each on a sticky note. Remember that it’s a brainstorm session, so don’t restrict the number of ideas/notes, don’t be judgemental, don’t be afraid to re-use and enhance ideas on existing sticky notes, and don’t try to start solving the problem (yet)
    3. Now that you have a wall full of sticky notes, sort the ideas into themes. Look for similar or connected ideas. This is similar to the Ishikawa’s ribs, but we’re working bottom-up, and we’re not constrained by a set of ribs as our start points. When you’re doing this, it may help to split everybody into smaller teams
    4. Aim for complete agreement amongst all attendees. Discuss each other’s opinions and move the sticky notes around until agreement is reached. You may find some ideas that are completely unrelated to all other ideas; in which case, you can put them into an “Unrelated” group
    5. Now create a sticky note for each theme and then super-themes, etc. until you've reached the highest meaningful level of categorisation. Arrange the sticky notes to reflect the hierarchical structure of the (super)themes

    You’re now in a similar position to where you would be with an Ishikawa diagram and can proceed accordingly. The benefit of the Affinity diagram over Ishikawa is that the bottom-up approach can produce different results and thereby offer different perspectives on your problem.

    Affinity diagrams are great tools for assimilating and understanding large amounts of information. When you work through the process of creating relationships and working backward from detailed information to broad themes, you get an insight you would not otherwise find. The next time you are confronting a large amount of information or number of ideas and you feel overwhelmed at first glance, use the Affinity diagram approach to discover all the hidden linkages. When you cannot see the forest for the trees, an Affinity diagram may be exactly what you need to get back in focus.

    If you'd like to know more about some of the other techniques, you can catch an audiovisual recording of my whole paper on Brainshark.

    Last week someone posted an interesting question to the SAS/IML Support Community. The problem involved four nested DO loops and took hours to run. By transforming several nested DO loops into an equivalent matrix operation, I was able to reduce the run time to about one second.

    The process of converting loops into vector or matrix operations is called vectorization. The programmer who posted the question knew that vectorization would improve the program, but also stated that "vectorization is really difficult." I agree. However, life is full of difficult but necessary tasks. In statistical programming, vectorization is necessary (but rewarding!) because it can dramatically decrease the run time of a program.

    Statistical programmers need to know how to vectorize programs. Whether you write your programs in SAS/IML language, MATLAB, or R, it is essential to know how to convert loops into equivalent vector operations. Therefore, I have decided to post a series of examples that show how to recognize certain patterns that arise in loops, and how to replace those loops with matrix operations. These examples will be tagged with the "vectorization" tag in the word cloud (in the right sidebar) on my blog. Today is the first example.

    Here are some general tips and techniques:

    • When you have nested loops, vectorize the inner loop first.
    • Convert a loop with scalar operations into a single vector operation. For example, convert sums into vector dot products.
    • Convert a loop of vector operations into a single matrix-vector multiplication.
    • Convert a loop of matrix-vector multiplications into a single matrix-matrix multiplication.

    In terms of matrix computations, follow this general rule:

    Solve the problem by using the highest possible level of the Basic Linear Algebra Subprograms (BLAS).

    A level-0 BLAS is a scalar operation. A level-1 BLAS is a vector operation, such as a vector addition, a dot product, or a vector norm. A level-2 BLAS is a matrix-vector operation, such as a matrix-vector multiplication, an outer product, or solving a triangular linear system. A level-3 BLAS is a matrix-matrix operation, such as matrix multiplication, forming a cross-product matrix, or solving triangular linear systems with multiple right-hand sides (see the SAS/IML TRISOLV function).
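
    To make the levels concrete, here is a small sketch with one SAS/IML expression per level. The vectors and matrices are toy values that I made up for illustration.

    ```sas
    proc iml;
    x = {1, 2, 3};                     /* toy column vectors */
    y = {4, 5, 6};
    A = {1 2 3, 4 5 6, 7 8 10};        /* toy 3 x 3 matrix */
    B = A`;

    dot   = x` * y;                    /* level 1: vector-vector (dot product) */
    Ax    = A * x;                     /* level 2: matrix-vector product */
    outer = x * y`;                    /* level 2: outer product of two vectors */
    AB    = A * B;                     /* level 3: matrix-matrix product */
    print dot, Ax, outer, AB;
    quit;
    ```

    Each line does the work of one, two, or three nested DO loops over scalars, which is why moving up a level usually shortens both the program and the run time.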

    The notions of "level" and BLAS are discussed in Chapter 1 of Golub and van Loan's Matrix Computations.

    An example problem with three nested loops

    The problem in the discussion forum involved four nested loops, but I'll extract the essence of the problem into a simpler example. Suppose that you have implemented a textbook formula and your SAS/IML program looks like the following:

    proc iml;
    X = {1 2 3,            /* example data */
         4 3 1,
         1 1 2,
         0 0 3};
    n = nrow(X);
    p = ncol(X);
     
    S=j(p,p,.);            /* allocate matrix to hold results */
    do j = 1 to p;
       do k = 1 to p;
          u=0;     
          do m = 1 to n;
             u = u +  X[m,j]*X[m,k];
          end;
          S[j,k] = u;      /* the (j,k)th result */
       end;
    end;
    print S;

    At this stage, the program consists of three loops and scalar operations (level-0 BLAS).

    When I study these loops, I notice the following:

    • The outer loop (the j loop) is a loop over the columns of the matrix X because the expression X[m,j] appears in the innermost loop.
    • The middle loop (the k loop) is also a loop over the columns of the matrix X because the expression X[m,k] appears in the innermost loop.
    • The inner loop (the m loop) is performing a summation. (Notice that u=0 before the loop and u=u+stuff inside the loop.) The terms that are being summed are the product of the jth and kth columns.

    Step 1: Attack the innermost loop

    Because the inner loop is a sum, replace the inner loop by the SUM function. The SUM function takes a vector. For this example, that vector will be the elementwise product of the jth and kth columns: X[ ,j]#X[ ,k]. The following statements show the simplified program, which now has only two loops:

    /* Step 1: Replace inner loop with sum of the product of j_th col and k_th col. */
    S=j(p,p,.);
    do j = 1 to p;
       do k = 1 to p;
          S[j,k] = sum( X[ ,j] # X[,k] );
       end;
    end;

    Step 2: Replace summations by dot products

    The inner loop now involves the sum of the elementwise product of two vectors. This sum has a more common name: the vector dot product. Consequently, replace the SUM function by a dot product. In order to get the dimensions to match, transpose the first vector, as follows:

    /* Step 2: The sum of the product of two vectors is their dot product. */
    S=j(p,p,.);
    do j = 1 to p;
       do k = 1 to p;
          S[j,k] = X[ ,j]` * X[,k];
       end;
    end;

    At this stage, the program consists of two loops and vector operations (level-1 BLAS).

    Step 3: Replace vector operations with matrix-vector operations

    The next step involves converting vector operations into a matrix-vector operation. Notice that the X[,j] term does not change in the inner DO loop. Furthermore, the product is over all columns, so the inner loop is equivalent to a vector-matrix multiplication. The program is thereby reduced to one loop, as follows:

    /* Step 3: Convert to vector-matrix product */
    S=j(p,p,.);
    do j=1 to p;
       S[j, ] = X[ ,j]` * X;
    end;

    At this stage, the program consists of one loop and matrix-vector operations (level-2 BLAS).

    Step 4: Replace matrix-vector multiplications with matrix-matrix operations

    The last step is to recognize that the only remaining loop iterates over the columns of X. But that is equivalent to iterating over the rows of X` (the transpose of X). The loop of vector-matrix multiplication is consequently equivalent to the cross-product operation, X`X, as follows:

    /* Step 4: A loop over all rows is equivalent to multiplication by X` */
    S = X` * X;
    print S;

    Notice that the three loops have been replaced by a single level-3 BLAS: matrix multiplication. As stated earlier, for large problems you can realize substantial savings of time if you eliminate loops in favor of high-level linear algebra operations.
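
    If you want to see the difference on your own machine, the following sketch times the original triple loop against the cross-product on simulated data and checks that the two results agree. The sizes (n=2000, p=50) and the seed are arbitrary choices of mine, and the timings will vary from system to system.

    ```sas
    proc iml;
    call randseed(1);
    n = 2000;  p = 50;                 /* arbitrary sizes for the experiment */
    X = j(n, p);
    call randgen(X, "Normal");         /* simulated data, not from the original post */

    t0 = time();
    S1 = j(p, p, .);
    do j = 1 to p;                     /* original formulation: three scalar loops */
       do k = 1 to p;
          u = 0;
          do m = 1 to n;
             u = u + X[m,j]*X[m,k];
          end;
          S1[j,k] = u;
       end;
    end;
    tLoop = time() - t0;

    t0 = time();
    S2 = X` * X;                       /* vectorized formulation: one matrix product */
    tMatrix = time() - t0;

    maxDiff = max(abs(S1 - S2));       /* should be zero up to rounding error */
    print tLoop tMatrix maxDiff;
    quit;
    ```

    The largest difference between the two results should be tiny (floating-point rounding only), while the elapsed times show how much the loops cost.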

    One final comment. If you've struggled and labored, but still can't figure out how to vectorize your SAS/IML program, post it to the SAS/IML Support Community. That community is a helpful place to discuss issues related to efficiency and programming in the SAS/IML language.

    tags: Efficiency, vectorization

    Someone had a question about factor analysis with Mplus and even though it is not a piece of software I work with normally, we aim to please at The Julia Group, so I downloaded the demo version and away I went.

    It truly was, as my granddaughter says, easy-peasy lemon squeezie.

    You might not think so, because the first thing you are confronted with is pretty much a blank window like this

    For people who are used to Excel, SPSS, SAS Enterprise Guide or other friendly GUI interfaces, this might be a bit off-putting. However, doing a confirmatory factor analysis was this easy.

    1. Create a .dat file from the original file. The file was in a SAS format and I did not have SAS on the laptop I was working on (I’m in Cambridge, MA at the moment). What I did was

    • Open the file in SPSS: from the FILE menu, select READ TEXT DATA and then select SAS as the format
    • Run an SPSS command from the syntax window to output a tab-delimited file with no header, which is the type of input Mplus expects.

    2. Type in this program to do a two-factor solution with the first three variables loading on the first factor and the next three loading on the second factor.

    TITLE: Confirmatory Factor Analysis ;
    DATA: FILE IS '/Users/annmaria/Documents/mplustest/values.dat' ;
    VARIABLE: NAMES ARE q1f1 q2f1 q3f1 q1f2 q2f2 q3f2 ;
    MODEL: f1 BY q1f1 q2f1 q3f1 ;
           f2 BY q1f2 q2f2 q3f2 ;
    OUTPUT: standardized ;

    3. Click the RUN button.

    That is really all there was to it.

    Okay, well that is easy if you knew what to type so let me explain a few things. If you know SAS or SPSS this will be easy.

    Each of those things that I put in all capitals is a command in Mplus, analogous to a DATA or PROC step in SAS and a command in SPSS. They don’t need to be in all caps, I just did that for ease for the reader. They DO need to be followed by a colon and then end the statement in a semi-colon.

    Title – pretty obvious, gives your output a title.

    DATA: FILE IS – gives the path to locate your data. If your file is in the same directory as your program, you don’t need a fully qualified path and can just call it 'values.dat'.

    VARIABLE: NAMES ARE

    Give the names of your variables. You can specify a format, but if you do not, Mplus assumes they are in free format, which is the same as what SAS refers to as list format.  You might want to note that if you are using the demo version you can only have a maximum of 6 dependent and 2 independent variables.

    MODEL:  This is my model (duh) and I am modeling two factors. The first factor I creatively named f1 and it is represented BY (notice the BY in the command) variables also creatively named q1f1 q2f1 and q3f1.

    Similarly, I have a second factor named f2, represented BY q1f2, q2f2 and q3f2.

    I added an OUTPUT statement with a standardized option because I wanted (surprise) standardized estimates. That statement is not required but as you’ll see in my next post on interpreting factor analysis data, you do want it.

    I am intrigued by Mplus. It sort of assumes you have close to perfectly cleaned up data because I wouldn’t want to be doing a lot of data management with it, but for doing some relatively complex models  – factor analysis, path analysis, structural equation modeling – it looks pretty cool.

     

    May 14, 2013

    The annual conference in SAS Francisco is now a pleasant memory. And what a pleasant one it was!  I saw so many of my user friends and met some new ones too. When I attend this conference I wear at least two hats. The first is that of a SAS [...]

    This year we’re holding the first of our Analytics Series conferences in London, right by Waterloo Station.   Not too far from the hotel is the Florence Nightingale Museum.   Most of us (at least in Britain) know her well from school as ‘the Lady with the Lamp’.  Florence Nightingale is most [...]

    I recently stumbled across the work of John Graunt, a London resident in the mid 17th century. Graunt used London's Bills of Mortality to publish an insight into the causes and spread of the plague. Among other things, he was able to use the data to prove that plague was not spread by person-to-person contact, and peaks of plague were not related to the reign of a new king. He found that more boys were born than girls but that infant mortality equalised the ratio. Most importantly, he found that by analysing data you actually uncover knowledge.

    From humble beginnings as a haberdasher, he rose to the respect of King Charles II and was elected a member of the Royal Society. Graunt was a self-educated man, yet the statistical, epidemiological and demographic work evidenced in his Observations set him out as a pioneer. 350 years ago, Graunt was doing what we might now call "public health intelligence". Graunt calculated that 36% of children didn't reach the age of 6 (a startling figure by today's standards). With further categorisation and analysis, he deduced that people were dying of causes unrelated to age - preventable diseases.

    Graunt's 17C London
    Graunt's work helped to encourage medical practitioners of the day to move from merely treating symptoms to investigating preventative measures. There are strong similarities with the evolution of business intelligence techniques (from reporting on history, to predicting the future, to influencing the future).

    Despite Graunt's successes with the analysis of the data, routine collection and analysis of health data didn't start until 200 years later (William Farr was appointed as the 1st compiler of scientific abstracts). Nonetheless, we should acknowledge his achievements and his pioneering of "analytics".

    Further reading:

    Excerpt from The Lancet, 1996:
    http://www.epidemiology.ch/history/PDF%20bg/Rothman%20KJ%20lessons%20from%20john%20graunt.pdf

    Ed Stephan's collection:
    http://www.edstephan.org/Graunt/graunt.html

    StatProb Encyclopedia:
    http://statprob.com/encyclopedia/JohnGRAUNT.html

    Missing data can be a pain.  Having missing data and not knowing where it is can be even more of a pain.  Here is a quick tip for handling missing values during an ETL process, or during any data processing step, and for spotting them quickly afterwards.  Mileage may vary depending on the business requirements for processing your data.

    Coalesce the Missing Values

    The coalesce function (alternatively the coalesceC function for character values) is very useful for selectively loading a field depending on the state of data.  The parameters are simple.  Just reference variables in your data or explicit hard-coded values and the coalesce function picks the first non-missing value for that observation.  It selects based on the order variables are entered, from left to right.

    coalesce( [first variable], [second variable], .... , [Nth variable])

    Sometimes I hard-code the following values at the end of the coalesce parameter list to ensure something gets entered (depending on requirements):

    • !UNKNOWN
    • !MISSING
    • !HEY LOOK AT ME

    Using these standardized values can help the business spot missing values very quickly, especially if you use a special character such as the exclamation point which sorts missing values at the top when viewing in ascending order.  
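
    Here is a small sketch of the left-to-right behaviour with more than one variable. The data set and the phone variables are made up for illustration; a period in the data lines stands for a missing character value.

    ```sas
    data contacts;
       input id home_phone :$12. work_phone :$12.;
       length best_phone $12;
       /* first non-missing argument wins, reading left to right */
       best_phone = coalesceC(home_phone, work_phone, '!MISSING');
       datalines;
    1 555-0100 555-0199
    2 . 555-0142
    3 . .
    ;
    run;

    proc print data=contacts noobs;
    run;
    ```

    The first row keeps the home number, the second falls back to the work number, and the third gets the hard-coded !MISSING flag.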

    The following code fills missing values of ‘DeathCause’ in the SASHELP.HEART dataset:

    data out;
      set SASHELP.HEART; 
      DeathCause = coalesceC(DeathCause, '!UNKNOWN');
    run;

    The missing values are converted to !UNKNOWN:

    Coalesce Missing Values

    Identify Missing Data when Loading a Dimensional Model

    Coalescing missing foreign key values can also be useful when loading a dimensional model.  In a star schema, categorical values are stored in dimension tables, and fact tables reference these values through corresponding foreign keys.   The purpose of foreign keys is to describe the factual numeric values contained in the fact table by joining to the related dimension table.  A good practice is to always load explicit non-NULL foreign key values, both to ensure numeric data is always identified and because your DBA may not like NULL values within integrity constraints.  If a numeric value truly has a missing dimension, you can use the coalesce function to stage a “zero” value for the foreign key in a fact table.  You could also use a value of “-1” as the “missing” foreign key value.  This also acts as a “catch all” to make sure the ETL process completes with no errors due to attempting to insert a missing or NULL value in a fact table. 

    This is an example DIMENSION table I’m using to reference address locations in a fact table.  

    Unknown Values DIM Table

    The fact table can reference the ‘address_key’ of 0 for anything that is missing or unknown.
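
    As a minimal sketch of that staging step, the following code replaces missing numeric foreign keys with 0. The table and column names (stage_fact, fact_orders, order_id, amount) are hypothetical and are used only for illustration; only 'address_key' and the key value 0 come from the example above.

    ```sas
    /* Hypothetical staging data: address_key is missing when no address matched */
    data stage_fact;
       input order_id address_key amount;
       datalines;
    1 101 25.00
    2 .   40.50
    3 205 12.75
    ;

    /* Replace missing foreign keys with 0, the key of the "unknown" dimension row */
    data fact_orders;
       set stage_fact;
       address_key = coalesce(address_key, 0);
    run;
    ```

    Because coalesce() returns its first non-missing argument, rows that already have a key are left unchanged.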

    These are two ways I’ve used the COALESCE() and COALESCEC() functions.  Do you have any other uses?

    The post Coalesce Missing Data to Highlight the Unknown appeared first on Business Intelligence Notes for SAS® BI Users.

    Here are four more of Dr. De Mars’ 55 things I have learned in (almost) 55 years: four things students should have learned in school but often didn’t.

    1. Say what you mean. I don’t know who those teachers are who reinforce students for using longer words, longer sentences and writing more pages, but I hope someone finds them and beats them senseless with The Elements of Style, which nearly a century after it was first published I still think is one of the best books on writing out there. When you write,

    In the experiment under discussion we utilized two conditions in the manner such that one group of the subjects referred to in the preceding paragraph received no treatment, that is, they were what is referenced as the control group. The other group, that is the second group, which was the group receiving our treatment described in the section under procedures which follows is hereafter referred to as the treatment group. A treatment group is defined by Academic-Guy (2012) as …

    instead of,

    Subjects were randomly assigned to either a treatment or control group.

    You may think the first example makes you sound intelligent and well-educated but it doesn’t. It makes you sound like you learned English by watching the Power Puff Girls and imitating Mojo Jojo. People – clients, your boss – are busy, and grant applications have page limits.

    2. Don’t be a pain in the ass. I wrote a post about this, Why the cool kids won’t hang out with you. In brief, no matter how smart you are, if you constantly run down your co-workers, flout the policies of your organization and are rude to your boss, at some point they will replace you with an equally smart person who is less of a pain. This may sound hypocritical because if you have been reading this blog for long you are well aware that I swear, don’t do mornings and, if I have to wear a suit, I charge extra. However, I work with clients that are cool with that.

    Really points 1 & 2 generally reveal a person trying to prove that he or she is smarter than the other people in the room. That usually reflects an underlying insecurity. I have met some absolutely brilliant scientists and businessmen/women. None felt the need to try to impress me. I was already impressed when I met them, and I’m sure that was the reaction they got from almost everyone.

    3. Mean what you say. If you say you will be in the office at 8 a.m., be in the office at 8. I tell clients I will be in by 9:30 or 10 if necessary because I know there is no way on God’s earth I am dragging myself out of bed at 7 a.m. It’s not happening. On the other hand, they know that if I say I will be in by 10, I will. If you say you can write programs in Perl or are experienced creating multi-media PowerPoint presentations, then when I ask you to do that, you should be able to do it. [I don't really need anyone to do either so if you are applying for our summer intern position, you don't need to mention these. It was just an example.]

    child at computer

    4. Learn to code. It doesn’t matter what language. It’s absolute bullshit that once you know one programming language you know them all, but it is certainly true that once you have the idea of loops, arrays, properties, methods, classes, extend, functions and a few dozen other key concepts, it will be much easier for you to pick up a second, third or fourth programming language. The Perfect Jennifer is an amazingly great history teacher and she is in one of the minority of fields where you can not do any programming and have a decent, stable job. Did I mention she is amazingly great, and works an enormous amount of extra hours? However, if you are planning on going into consulting, management or a large number of other fields, knowing how to code will help you immensely. Even our Chief Marketing Officer, who only focuses on marketing, has done a little coding and has some idea of the constraints of developing a new product. I’m so convinced of the personal and professional value of learning at least a little bit of programming that I have gone back to requiring it in my statistics courses. Often students don’t learn to code because they underestimate themselves. They believe programming is done by people who are smarter, more focused or in some way better than them. That’s simply not true and learning to code will give them both more skills and more confidence.

    So, those are four more things I have learned in (almost) 55 years and that I think any student graduating should learn as well.

     

     

    May 13, 2013

    The Little SAS Book: A Primer, Fifth Edition

    One of the problems that Lora Delwiche and I face as authors of two books with similar titles (The Little SAS Book and The Little SAS Book for Enterprise Guide) and multiple editions (five of LSB and three of LSBEG) is explaining how the books are different.

    The two books are totally different–and complementary.

    So I was delighted to see that someone at SAS Press has written a great summary comparing the various editions.

    Did you know that the title The Little SAS Book was originally a joke? We explain that and give a little history on sasCommunity.org.


    Ever heard your grandmother say when you were little: If you have your heart set in the right place, you can achieve anything you set out to do! That’s what SAS users tried to do at SAS Global Forum 2013 held at Moscone West, San Francisco. The conference had a heart, and it was filled with passionate people who were industrious, intelligent and loved challenges.

    SAS users from all over the globe were accustomed to meet-ups and team building at conferences to talk about SAS, software and solutions. But the team that came together this time had a different agenda in mind. A team that believed in giving back to the local community as much as it got from the opportunity to gather in San Francisco to reconnect and learn from each other. The Build-a-Bike charity event was one such moment that brought back childhood memories and the carefree joys of riding a bike.

    Attendees scattered into smaller groups armed with wrenches and excitement to assemble bikes for underprivileged children in the San Francisco area. “I love bicycles, I cycle to work,” said one such enthusiastic attendee, stating that every time she comes to the conference, she meets new people and learns a lot. They built 50 bikes for young boys and girls in an hour. The mood was upbeat and it gave everyone involved a rare opportunity to connect outside of work, re-live their childhood and, most importantly, bring big smiles to the faces of some very lucky children. Many thanks to the participants for making this a memorable event and certainly one of the highlights of the conference.

    What was your memorable moment from the conference and how would you like to give back to the community? Please share.

    tags: SAS Global Forum

    I've conducted a lot of univariate analyses in SAS, yet I'm always surprised when the best way to carry out the analysis uses a SAS regression procedure. I always think, "This is a univariate analysis! Why am I using a regression procedure? Doesn't a regression require at least two variables?"

    Then it dawns on me. In the SAS regression procedures, a MODEL statement that does not contain explanatory variables simply performs a univariate analysis on the response variable. For example, when there are no explanatory variables, a classical regression analysis produces sample statistics, such as the mean, the variance, and the standard error of the mean, as produced by the following statements:

    /* estimates of mean, variance, and std err of mean */
    ods select FitStatistics ParameterEstimates ;
    proc reg data=sashelp.cars;
       model MPG_City= ;
    quit;

    The estimates in this PROC REG output are the same as are produced by the following call to PROC MEANS:

    proc means data=sashelp.cars mean std stderr cv;
     var MPG_City;
    run;

    The ODS graphics that are produced by PROC REG also include a histogram of the centered data and a normal Q-Q plot.

    Here are some other instances in which a SAS regression procedure can be used to carry out a univariate analysis:

    • Robust estimates of scale and location. This univariate analysis is usually performed by using PROC UNIVARIATE with the ROBUSTSCALE option. However, you can also use the ROBUSTREG procedure to estimate robust statistics. The ROBUSTREG procedure provides four different estimation methods, which you can control by using the METHOD= option. For example, the following statements display robust estimates of location and scale:
      ods select ParameterEstimates;
      proc robustreg data=sashelp.cars method=M;
         model MPG_City= ;
      run;
    • Detection of univariate outliers. How can you detect univariate outliers in SAS? One way is to call the ROBUSTREG procedure! Again, there are four estimation methods that you can use.
      ods select DiagSummary;
      proc robustreg data=sashelp.cars method=LTS;
         model MPG_City= ;
         output out=out outlier=outliers;
      run;
       
      proc print data=out(where=(outliers=1));
         var make model type mpg_city;
      run;
    • Estimation of quantiles, with confidence intervals. Last week I showed how to use PROC UNIVARIATE to compute sample quantiles and confidence intervals. An alternative approach is to use the QUANTREG procedure, as follows:
      /* estimate univariate quantiles, CIs, and std errors */
      ods output ParameterEstimates=QntlCI;
      proc quantreg data=sashelp.cars;
         model MPG_City= / quantiles=0.025 0.2 0.8 0.975;
      run;
       
      proc print data=QntlCI; run;
    • Fit discrete parametric models to univariate data. I've previously shown how to use the GENMOD procedure to fit a Poisson model to data, and the same technique can be used to fit other discrete distributions, including the binomial, geometric, multinomial, negative binomial, and some zero-inflated distributions.

    • Fit parameters for a mixed density model to univariate data. I've previously demonstrated how to use the FMM procedure to fit a finite mixture distribution to data.
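
    The last two items can be sketched in the same intercept-only style as the earlier examples. The simulated data set (Counts), the seed, the rate of 4, and the choice of k=2 components are assumptions made only for illustration; they are not taken from the articles mentioned above.

    ```sas
    /* Simulated counts; the seed and the rate lambda=4 are arbitrary choices */
    data Counts;
       call streaminit(1);
       do i = 1 to 200;
          y = rand("Poisson", 4);
          output;
       end;
    run;

    /* Intercept-only Poisson model: the intercept estimates log(lambda) */
    proc genmod data=Counts;
       model y = / dist=poisson;
    run;

    /* Intercept-only finite mixture of two normal components (k=2 is a guess) */
    proc fmm data=sashelp.cars;
       model MPG_City = / k=2;
    run;
    ```

    In both calls the MODEL statement has no explanatory variables, so the procedures fit a distribution to the response variable alone.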

    There are other examples, but I hope you see that the SAS regression procedures are useful for computing univariate statistics and analyses.

    Do you have a favorite univariate analysis that can be accomplished by using a SAS regression procedure? Let me know about it by leaving a comment.

    tags: Data Analysis, SAS Programming

    May 11, 2013

    There are, or so I have heard, people who are energized by parties, meet-ups and social events. I am not one of those people.

    Dinner with the family

    If I had my choice, I would never go to any gathering larger than our family dinners for the rest of my life. It’s not that I don’t enjoy talking to intelligent people nor that I don’t appreciate all of the great people that I get to work with in the course of the year – I really do. However, I have to confess, that is a fringe benefit. What I am most interested in doing is sitting at my computer solving problems. If there was some way to get anyone else to go to the meet-ups, demos, conferences and pitches, I would do it.

    Most of our staff at The Julia Group is like that. When meet-ups or other networking opportunities come up, there is more whining than when taking a kindergarten class to church.

    “Oh, man, do I *have* to go?”

    “I just went last time.”

    “Can’t I go next time?”

    “Isn’t it somebody else’s turn?”

    In fact, we DID hire someone, our new Chief Marketing Officer to handle these responsibilities because I got so tired of hearing the whining from everyone, including me. Now I only go when she tells me that I have to – and I still whine.

    In my experience, most meet-ups will have from zero to one good point that  is worth knowing. Usually that comes from whoever they have as a speaker, but not always. You’ll meet, if you are lucky, one interesting person with whom you wish to follow up, several people who want to sell you stuff and a couple of people who have an idea and are looking for someone to give them money so they can pay someone else to make it. Yet, I still go because that one point is worth hearing and the one person is worth knowing.

    Here are five points I have learned from start-up meet-ups. Since you read my blog you can tell your CMO that you get to skip the next five (she probably won’t buy it, but it’s worth a try).

    1. Cash is more than king. – From Jenny Q. Ta, founder of sqeeqee.com. This advice from a highly successful founder confirmed what I have thought for years. At one point our company rented an office because I thought we should have one to look like a “real company”. Almost no one ever went there. Most of us work at home and we have people in several states. Now we Skype, FaceTime, email or meet in the office downstairs in my house. If we need a conference room, I rent one at the business center a half-mile away. Sometimes people are unimpressed that we still haven’t permanently moved out of the downstairs, but what we save on renting offices for a dozen people goes a long way to making sure we are in the black every month. If you have a healthy cash flow, you can get by without investor money for a long time.

    2. Put off taking investor money as long as you possibly can. – This is another good tip from Jenny Q. Ta. The sooner in the game that investors come in, the more of a risk they are taking and the larger percentage of your business they are going to want.

    I find it ironic that the two things that might impress a casual observer – paying for office space and getting angel investor money – are the exact points that she argued against. (She’s not the only one; check Paul Hawken’s wonderful book Growing a Business.) We have people putting in considerably more hours than they are getting paid for in exchange for a share of the business – those are co-founders, and that is the best investment we can get because not only is it equivalent to funds but it brings the talent with it.

    3. Don’t believe everyone knows more than you. I heard this at a General Assembly start-up event and it is worth repeating. There was a time when I thought all of these people spouting so confidently that the target market for their product was in the hundreds of millions (it isn’t) or that the best choice for an application was Ruby (it wasn’t) knew so much more than me. Now I realize that many of them are just posturing. They’re either trying to sound confident for investors, or they just have a different world view than me. I’m a statistician. If I tell you we’ll make $5 million on a product I believe there is a greater than 50% chance based on the facts at my disposal. Others, if they say they’ll make $500 million are basing it on an assumed 5% chance and convinced they’ll make it with the right strategy.

    4. Find a co-founder or two. I believe the optimal number of co-founders is three. More than that, you dilute decision-making too much. Less, and you probably haven’t covered all of the key skills.

     

    The fifth and most important thing I have learned and I have heard it several times – most of success is just keeping working even when it’s hard and frustrating.

    Speaking of which, I was taking a break from revising our first game to write this post but now I’m going to get some sleep and hit it in the morning.

    (And there you have five more things I have learned in almost 55 years.)

    May 10, 2013

    This week's SAS tip is from Frederick Pratter and his handy book Web Development with SAS by Example, Third Edition. If you're a regular reader of this blog, you've probably noticed that this book is frequently excerpted. There's just so much to choose from! View Pratter's previously featured tips here. The [...]

    I didn't see this paper presented at SAS Global Forum(!) even though there's plenty of pattern matching and analytics involved in the project, but maybe I'd have benefited from having the associated software installed on my Android tablet whilst writing notes and blog posts.

    It's (yet another) alternative keyboard for mobile (phone and tablet) devices. It dares to diverge from QWERTY, and it's thumb-focused, i.e. it doesn't expect you to be a Mavis Beacon alumnus. Thus, the researchers claim "it will take about 8 hours of practice to reach the typing rate that is comparable to that of a regular Qwerty keyboard on the same device. Practice beyond that point will improve the rate further". However, it promises much because the layout has the following properties:

    • The division of work is almost equal, at 54% and 46% for the right and left thumb, respectively.
    • Alternation is rapid: 62% of the taps are switches.
    • Travel distances are short: On average, the left thumb moves 86 px, the right 117.
    • The space bar is centrally located.
    • The right thumb handles all vowels except y. The clustering of vowels around the space bar favours quick switches and minimises travel distance. The right thumb is responsible for 64% of same-side taps.
    • The left thumb has most of the consonants, exploiting its ability to hover above the next button sooner. It has most first letters of words and most of the consonants.

    I'll confess. I bought a Nexus 7 in San Francisco, sitting alongside my Galaxy Nexus phone and my Asus TF101 tablet/laptop. Yes, I'm an Android fan. But, in my defence, the battery on my Asus had run dry and I'd brought the wrong recharging kit, so what was I to do!

    I saw a lot of people at SGF writing notes on tablets and phones, so KALQ has a large target market. I'm going to try it on my Nexus 7. I'll let you know if it's a success.

    Do you remember when CPU time was a high-priced commodity? "Today, if you are any good at what you do, the constrained resource is you," says Timothy Berryhill from Wells Fargo. Berryhill has years of experience with SAS on "many platforms and operating systems." He says there are several things you can do to save your time - and your company's money.

    According to Berryhill, there are two very important things you can do to make the best of your time and those who look at your code later: Make sure your code is clear and correct. "To me, the main thing is correct. If the answers are wrong, it doesn't matter how you got there," he says.

    Here are three of his tried-and-true tips:

    1. "I like to use the %LET, particularly at the top of my code where variables are going to change," says Berryhill. He uses a series of %LET statements at the top of the program to remind him of changes he needs to make.
    2. The NOBS= option on the SET statement tells you how many observations you have in a dataset. According to Berryhill, the option is most useful when NOBS is 0. "If you try to do a PROC PRINT or a global dataset, you set that empty dataset, and Boom you're gone. (This only works for disk files, not views or tapes.)"
    3. Try your luck. When he isn't on a tight deadline, Berryhill says that he likes to experiment with code just to see if it  works. Recently, he found that the double question mark will suppress expected errors in the input function. "I was surprised to find out that it is also supported in a statement."
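
    The three tips can be combined in a short sketch. This is not code from Berryhill's paper; the macro variable names (InLib, InData, NumObs) and the data set name Converted are hypothetical, chosen only to illustrate the ideas.

    ```sas
    /* Tip 1: keep the values that change between runs in %LET statements at the top */
    %let InLib  = sashelp;
    %let InData = class;

    /* Tip 2: NOBS= is filled in when the step compiles, so no observations are read */
    data _null_;
       if 0 then set &InLib..&InData nobs=n;
       call symputx('NumObs', n);
       stop;
    run;
    %put NOTE: &InLib..&InData has &NumObs observations.;

    /* Tip 3: the ?? modifier suppresses invalid-data messages in the INPUT function */
    data Converted;
       length raw $8;
       input raw $;
       num = input(raw, ?? 8.);   /* num is missing for 'abc'; no error is logged */
       datalines;
    12
    abc
    3.5
    ;
    ```

    After the DATA _NULL_ step, a program can test &NumObs and skip later steps when the input data set is empty.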

    Read 27 more of Berryhill's SAS tricks in his paper, "30 in 20 things you may not know about SAS."

    tags: papers & presentations, performance, SAS Global Forum, SAS Programmers

    May 09, 2013

    San Francisco! The Bay Area and Silicon Valley: innovation, the brightest minds, creating, inspiring new technology and style. This city is known for its healthy food choices.  For example, I spotted a homeless man on my way from the Westin St. Francis to the Moscone Conference Center with this never- [...]

    Developing a new concept for a book can be daunting, but it can also be very rewarding. Think about The Little SAS Book. I bet Susan Slaughter and Lora Delwiche didn’t know they were stumbling on a huge hit when they created that format. I love to see authors create [...]

    Back in November last year I mentioned Metadata-Bound Libraries. This v9.3 M2 (and above) functionality allows you to force access to your data through metadata libraries, thereby enforcing your metadata security plans.

    One of the nuggets of information I learned at SAS Global Forum 2013 was that v9.4 will introduce menus in SAS Management Console to ease the effort of building PROC AUTHLIB code. Plus, the process of unbinding data sets from the metadata libraries will be made easier and simpler. Currently, one has to copy the data sets to an unbound library; v9.4 will allow unbinding to be performed in-place.

    In a future release, administrators will optionally be able to make encryption compulsory for all data sets and libraries; and support for AES encryption will be provided. Finally, the metadata server will be able to store the encryption key and send it (encrypted) when required. This will remove the current need to hard-code keys into batch code (and thereby remove the security weakness).

    No sooner has SAS Global Forum 2013 finished than we get to see the 2014 web site. Next year's conference is in Washington, D.C. between March 23rd and 26th.

    I hear there are some changes afoot in the organisation of the conference.  Along with the absence of a Closing Session at this year's conference, there was no announcement of section chairs for the streams of papers in next year's conference. The web site offers no further information on section chairs, but it does tell us that the Call For Content opens in July. This appears to be different to previous years' Call For Papers, and it's much earlier in the year too. All-in-all, I'm intrigued to see what the plan is.

    I clearly need to get my skates on and do more than just think about next year's papers over the next few months.

    May 08, 2013

    In today's fast-paced, jam-packed work day, many people answer email and read reports after business hours. And more and more, they're doing those things on a smartphone or tablet. How are your users accessing and using your reports? Statistics South Africa has found that their end-users would prefer a mobile environment for accessing reports, so it has moved to a mobile BI platform.

    Koketso Moeng says that Statistics South Africa has allowed its users to access internal sites from their smartphones for some time, but reading text heavy reports and tables on an iPhone can be frustrating. And even in the areas on the website where they were using visualization, the information still wasn't as accessible as they would have liked.

    Choosing a platform

    Moeng says the team had to carefully consider the level of security that would be offered. "We have a lot of very sensitive information that should not be released before its time," he said.

    Additionally, Statistics South Africa had to ensure that the platform integrated with its existing infrastructure.  "We have invested a lot of money in our SAS environment and our operating environment," Moeng said. "So whatever we bring into that environment must integrate - you don't want to have to reshuffle things just to bring an additional tool into your environment."

    Finally, Moeng's team wanted to make sure their users were happy with the platform so they would use it. They conducted focus groups to test the mobile BI on several platforms. "We said, 'Play around with these tools and give us feedback. Which ones do you like? Why do you like them? and Which one is most appealing to you?'"

    Pinch, swipe, tap

    They decided to go with Roambi because of its easy interface and because many of the end-users were already using the iPad and iPhone. Moeng says there are "two flavors" of Roambi: Roambi Analytics produces beautiful graphics, and with Roambi Flow, you can embed those graphics. This gives the end-user an interactive document that gives context to the data visualization.

    "Configuration is very easy," says Moeng. In a SAS Enterprise Business Intelligence environment, it takes only a matter of minutes to configure. (Roambi runs off TomCat and MySQL.) Roambi uses the URL from your SAS Web Report Studio environment to interact with SAS.

    As you can imagine, Roambi is touch-enabled - pinch, swipe and tap to access and manipulate the reports. Moeng did his entire presentation using Roambi Flow on an iPad.

    In Moeng's paper, "Extending SAS Reports to your iPhone," you can read more about publishing reports and giving access to your end-users. Also check out what SAS Visual Analytics can do on the mobile.

    tags: papers & presentations, SAS Administrators, SAS Global Forum

    One of the most notable features of v9.4 wasn't mentioned in the SAS Global Forum Technology Connection but I caught a paper by Bryan Wolfe on the subject. SAS v9.4 will remove SAS's most notable "single point of failure" - the metadata server. SAS architects and administrators will optionally be able to specify and create a cluster of metadata servers (with real-time shared data) to mitigate metadata server failure.

    For those with SAS systems providing high value operational services, this enhancement could be a key deciding factor in choosing to upgrade to v9.4. Sites with less demanding applications can choose to retain a single metadata server.

    Whilst SAS has hitherto offered a large degree of resilience for failure of most processes and servers (particularly with the use of Grid and EGO), the metadata server has always been a weak link. V9.4 resolves this shortcoming by introducing the ability to cluster a group of metadata servers, all of which are running 24x7, communicating with each other, and able to take over the work of a failed metadata server.

    The coordinated cluster of metadata servers appears as a normal metadata server to SAS users. Hence, no code changes will be required if your site implements this technology. The chosen approach is intrinsically scalable.

    The cluster requires three or more nodes; each is a full metadata server. One is nominally a master, the others are slaves. The system decides who is the master at any point in time. Each metadata server must have access to a shared backup disk area.

    Client connections go to slaves. Load balancing causes redirects when required. The load balancing means that read performance is the same or better when compared with v9.3 performance. To keep all metadata server instances synchronised, slaves pass write requests to the master, and the master then passes those requests asynchronously to all other slaves so that they can update their own copy of the metadata storage (in-memory and on disk).

    SAS clients (such as Enterprise Guide and Data Integration Studio) keep a list of all nodes. Each client is responsible for reconnection. This is transparent to users. Hence, in the event of a slave failure, the client will automatically establish communication with an alternate server. If the master fails, the remaining slaves need to negotiate with each other to "elect" a new master. As a result, there can be a more noticeable delay, although it's unlikely to exceed 10 seconds.

    The new functionality will be supported in v9.4 on all SAS platforms except IBM z/OS. All metadata servers must be on the same OS. The cluster license is included in SAS Integration Technologies. Unlike some of SAS's other high availability and failover solutions, no additional 3rd party software is required.

    All-in-all, this is a very significant enhancement for those who rely on their SAS systems to reliably deliver information, knowledge and decisions.

    At a recent conference, I talked with a SAS customer who told me that he was using an R package to create a three-panel visualization of a distribution. Unfortunately, he couldn't remember the name of the package, and he has not returned my e-mails, so the purpose of today's article is to discuss some ideas related to this visualization and to solicit critiques of my implementation.

    The customer wanted to create a paneled display in SAS that includes three graphs: a histogram, a box plot, and a normal quantile-quantile (Q-Q) plot. We sketched out an idea, and the plot at the left is my implementation of our sketch.

    One question I asked was, "Do you want the usual Q-Q plot, or should we flip it?" The usual Q-Q plot is a scatter plot of the ordered data values (on the vertical axis) plotted against the corresponding quantiles of a normal distribution (on the horizontal axis). The purpose of a Q-Q plot is to see whether the points fall along a straight line, which would indicate that the data are normally distributed. I remarked that if we flip the plot so that the data values are displayed horizontally, then the histogram, boxplot, and Q-Q plot can all share a common horizontal axis. The customer said that this seemed like a good idea.

    In SAS, you can create this kind of paneled layout by using the Graph Template Language (GTL) and the SGRENDER procedure.

    A GTL template for a three-panel display

    When I returned from the conference, I created a GTL template that defines a three-panel display. The top panel, which occupies 40% of the height of the display, is a histogram of the data overlaid with a normal curve and a kernel density estimate. The second panel, which occupies 10% of the height, is a horizontal box plot. The third panel, which occupies the remaining 50%, is a normal Q-Q plot, but is flipped so that the normal quantiles are plotted on the vertical axis. A diagonal reference line is added to the Q-Q plot. Normally distributed data should fall near the reference line.

    The threepanel template takes five dynamic variables. The data and the normal quantiles are referenced by the dynamic variables _X and _QUANTILE, respectively. The title is supplied by using the _Title variable. Lastly, the parameter estimates for the normal curve that best fits the data are supplied by using the _mu and _sigma dynamic variables. The template definition follows:

    /* define 'threepanel' template that displays a histogram, box plot, and Q-Q plot */
    proc template;
    define statgraph threepanel;
    dynamic _X _QUANTILE _Title _mu _sigma;
    begingraph;
       entrytitle halign=center _Title;
       layout lattice / rowdatarange=data columndatarange=union 
          columns=1 rowgutter=5 rowweights=(0.4 0.10 0.5);
          layout overlay;
             histogram   _X / name='histogram' binaxis=false;
             densityplot _X / name='Normal' normal();
             densityplot _X / name='Kernel' kernel() lineattrs=GraphData2(thickness=2 );
             discretelegend 'Normal' 'Kernel' / border=true halign=right valign=top location=inside across=1;
          endlayout;
          layout overlay;
             boxplot y=_X / boxwidth=0.8 orient=horizontal;
          endlayout;
          layout overlay;
             scatterplot x=_X y=_QUANTILE;
             lineparm x=_mu y=0.0 slope=eval(1./_sigma) / extend=true clip=true;
          endlayout;
          columnaxes;
             columnaxis;
          endcolumnaxes;
       endlayout;
    endgraph;
    end;
    run;

    You can download the %ThreePanel macro, which creates a three-panel display for any variable in any data set. If you want to learn more about how to write GTL templates, I recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.

    The macro calls PROC UNIVARIATE to compute the normal parameter estimates and the quantiles. That information is then used by PROC SGRENDER to create the plot according to the specifications in the threepanel template. The macro uses a cool trick: I get the data for the Q-Q plot by using an ODS OUTPUT statement on a graph that is created by PROC UNIVARIATE.
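
    For readers who want to experiment before downloading the macro, the following is a rough sketch of how the template might be rendered by hand. It is not the %ThreePanel macro itself: it swaps in PROC RANK (with NORMAL=BLOM) and PROC SQL instead of the PROC UNIVARIATE and ODS OUTPUT trick described above, and the names _QQ, _Quantile, mu, and sigma are hypothetical. It assumes that the threepanel template has already been compiled.

    ```sas
    /* Hypothetical sketch: render the 'threepanel' template for one variable */
    %let DSName = Sashelp.Cars;   /* example data set */
    %let Var    = MPG_City;       /* example variable */

    /* 1. Keep nonmissing values and compute normal scores (Blom's formula).
          PROC RANK is a stand-in for the Q-Q data that the macro obtains
          from PROC UNIVARIATE. */
    proc rank data=&DSName(keep=&Var where=(&Var is not missing))
              out=_QQ normal=blom;
       var &Var;
       ranks _Quantile;
    run;

    /* 2. Estimate the normal parameters and store them in macro variables */
    proc sql noprint;
       select mean(&Var), std(&Var) into :mu trimmed, :sigma trimmed
       from _QQ;
    quit;

    /* 3. Pass the column names and the estimates to the dynamic variables */
    proc sgrender data=_QQ template=threepanel;
       dynamic _X="&Var" _QUANTILE="_Quantile"
               _Title="Distribution of &Var"
               _mu=&mu _sigma=&sigma;
    run;
    ```

    The column-valued dynamics (_X and _QUANTILE) receive quoted variable names, whereas _mu and _sigma receive numeric values. The normal scores from PROC RANK might differ slightly from the quantiles that PROC UNIVARIATE uses for its Q-Q plot.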

    The image at the top of this post shows how the template renders the MPG_City variable in the Sashelp.Cars data set. The image was created as follows:

    ods graphics on;
    %ThreePanel(Sashelp.Cars, MPG_City)

    The MPG_City variable is not normally distributed, as is evident from the poor fit of the data in the Q-Q plot (lower panel). In contrast, the distribution of the SepalLength variable in the Sashelp.Iris data set appears to be more normal, as shown below:

    %ThreePanel(Sashelp.Iris, SepalLength)

    Discussion

    What do you think? Try it out on your own data and let me know if you have suggestions to improve it. Is this a useful display? Leave a comment.

    tags: Data Analysis, Statistical Graphics

    May 07, 2013

    Clinical information tends to be complex and comes from multiple sources in different formats. As a result, clinical data submission has become time-consuming, costly, and error-prone. CDISC® (Clinical Data Interchange Standards Consortium) established new data standards to speed up data review and improve clinical data exchange, storage, and archival. Our technology edge, combined with our experience in standards implementation, allows us to develop tailored CDISC solutions to accelerate your FDA review. Clinovo has introduced a new opportunity to learn these recognized clinical data standards!

    Clinovo’s new “CDISC Standards: Theory and Application” class is an 8-week training program starting on June 11, 2013. The TechTrainings are technical hands-on classes for entry-level or experienced clinical trial professionals, designed to help them reach the next step in their professional careers. The class will be held in Palo Alto at Dentons Offices or remotely.

    Taught by Sy Truong, President at Meta-Xceed and author of award-winning papers, this new course will give an overview of CDISC standards: ODM, SDTM, ADaM and Define.XML. Students will learn how to transform legacy data into these clinical standards through real-life examples. Case studies will include data exchange, archival, and electronic submission to regulatory agencies such as the FDA.

    Clinovo will continue to offer the “Base Clinical SAS Programming” class to help entry-level programmers prepare for the Base SAS certification, as well as the “Advanced Clinical SAS Programming” class to tackle advanced real-world SAS programming challenges. Clinovo offers $50 gift cards for referrals.

    More information on the class can be found on clinovo.com/techtrainings.

    TechTrainings by Clinovo

    Olivier Roth launched the TechTrainings by Clinovo in 2012, a series of hands-on courses for clinical trial professionals, leveraging his company’s years of field experience and industry expertise. He is the Marketing & Communication Coordinator at Clinovo, a CRO based in Sunnyvale, focused on streamlining clinical trials for life science companies through technology solutions. Olivier helps manage Clinovo’s marketing and communication, from marketing strategy to partnership management, lead generation, event planning, and new business opportunities. Prior to Clinovo, Olivier worked as a Strategic Marketing Consultant at VivaSante, an international consumer healthcare company based in Paris.

    Even though it's been around for well over a decade, SAS Enterprise Guide was still a hot topic among attendees at SAS Global Forum this year.

    In the Technology Connection -- the big session on Monday morning -- SAS R&D staff used the conference agenda content to demonstrate the power of SAS Text Miner. By categorizing the papers for this year and comparing to previous years, you can see the continued growing interest in several key topics, including SAS Enterprise Guide.

    I captured this (grainy) screenshot from the Livestream archive of the Technology Connection. Enterprise Guide papers are represented by the yellow bar in this screenshot from SAS Visual Analytics, showing 4 years of conference data:

    I presented one of those papers: For All the Hats You Wear: SAS Enterprise Guide Has Got You Covered. In the presentation (which you can watch on SAS Global Forum Take-out), I describe several types of users who accomplish work in SAS Enterprise Guide, including:

    • The Newbie
    • The Business Analyst
    • The Programmer
    • The Statistician
    • The Data Scientist
    • The Administrator
    • The Consultant

    I've done my share of blogging and presenting about SAS Enterprise Guide over the years, but with over 80 papers or posters that addressed it at this year's conference, it's obvious that others are also keen to share their experiences. That's great, because I have an obvious bias when I describe SAS Enterprise Guide as an essential tool for SAS users; you no longer have to simply take my word for it.

    I've also seen others sharing on their own blogs. For example, here's a series from the bi-notes.com blog:

    And another series from OptimalBI in New Zealand:

    I'm happy to see that the SAS user community now creates, sustains, and propagates some really excellent information on these topics. Keep up the great work!

    tags: SAS Enterprise Guide, SAS GloFo, sasgf13

    Two questions I get asked occasionally are:

    • Do I get paid to write nice things about software?
    • Why don’t I write more about things I DON’T like?

    1. No. The only one who pays for my blog is BlogHer – which is the ads you see – and they don’t seem to care what I write as long as people read it. The checks from them pretty much cover my Chardonnay bill. Incidentally, they pay WAY better than Google AdSense.

    2. My original reason for writing this blog was to remind myself of stuff. Ever see that comic where the kid raises his hand and asks the teacher

    “May I be excused from class? My brain is full.”

    Well, my brain is like that a lot. At any given time I may go from javascript to jquery, impact.js, SAS, PHP, SQL to giving a lecture on logistic regression. I forget stuff. That terrific site, really cool application, a function that allowed me to do exactly what I wanted, so I write it down here to remember the next time I need to know something like that.

    Here is a fact: A lot of stuff sucks.

    There are over 1,000,000 books on Amazon. It’s a lot more productive to write about a book I read that was good than the 867,345 books on Amazon that are mediocre or worse. You’ll find something good much faster by looking for stuff that’s good than ruling out everything that sucks.

    Whether you’re talking about a mobile app, a new suite of software or added functions or formats, there will be a lot that are boring, useless or just plain suck. It just seems really inefficient to waste my time writing about them unless they rise to a truly notable level of suckiness – and most things don’t, they don’t even excel at sucking and will fade out of the general consciousness, sinking under the weight of their own mediocrity.

    I honestly don’t understand people who spend a lot of time writing about the stuff they DON’T like.

    May 06, 2013

    I just completed the conference survey. Overall I had a good conference - nobody booed during either of my papers, so that's a positive outcome! I liked the conference city (San Francisco), the conference venue was relatively compact (avoiding long walks between papers), I attended a number of well-presented papers, I learned useful stuff about current releases of SAS plus additional stuff about forthcoming SAS releases and how to plan for them.

    I tried to offer some constructive feedback in my survey. Here's what I wrote. What do you think? Did your experiences match (or diverge)?

    • I was very disappointed to see the loss of the Closing Session. It felt like the conference just petered-out. There was no opportunity to say "thank you" to organisers and volunteers, and no recognition for presenters (best paper??). A big shame

    • I was greatly disappointed with the conference wi-fi. Even stood in one place it still seemed to come-and-go, and required a fresh login every time the signal was lost. As an overseas attendee (with very expensive data rates because I was roaming) I relied on conference wi-fi to keep in touch with emails, etc. The wi-fi caused me huge frustration

    • I'd like a few more papers on general software development best practice, e.g. requirement, design and testing. It's great to learn about the technology and the language(s), but my clients' investments are wasted if they don't build stuff effectively and efficiently. So, I don't mean papers about syntax, I mean advice, recommendations and experience about the operation of a SAS development and support team

    • I liked the conference's Android app. Hopefully it'll be a bit more "finished" next year, e.g. maps of the venue included in the app rather than requiring an internet connection to access them (see my comments re: conference wi-fi above). The ability to search for papers by name-of-author would be appreciated. An ability to filter My Agenda would be much appreciated because it involved a lot of scrolling to see my plans for Wednesday (having to scroll past Sunday, Monday and Tuesday to see Wednesday)

    • I enjoyed the keynotes from Billy Beane and Roger Craig. Both were eloquent and amusing speakers, and both were talking about analytics, hence there was a strong tie-in with the conference

    Organising a conference is a demanding task. Organising an international conference for 4,200 attendees must be a mammoth task. My list of "enhancement opportunities" is minuscule when viewed in the context of the conference organisers' achievements. But there's no harm in trying to make next year's even better!

    PROC UNIVARIATE has provided confidence intervals for standard percentiles (quantiles) for eons. However, in SAS 9.3M2 (featuring the 12.1 analytical procedures) you can use a new feature in PROC UNIVARIATE to compute confidence intervals for a specified list of percentiles.

    To be clear, percentiles and quantiles are essentially the same thing. For example, the median value of a set of data is the 0.5 quantile, which is also the 50th percentile. In general, the pth quantile is the (100 p)th percentile.

    The CIPCTLDF option on the PROC UNIVARIATE statement produces distribution-free confidence intervals for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles as shown in the following example:

    /* CI for standard percentiles: 1, 5, 10, 25, 50, 75, 90, 95, 99 */
    ods select Quantiles;
    proc univariate data=Sashelp.Cars cipctldf;
       var MPG_City;
    run;

    However, prior to the 12.1 release of the analytics procedures, there was not an easy way to obtain confidence intervals for arbitrary percentiles. (Recall that you can specify nonstandard percentiles by using the PCTLPTS= option on the OUTPUT statement.)

    I am happy to report that the OUTPUT statement in the UNIVARIATE procedure now supports the CIPCTLDF= option, which you can use as follows:

    proc univariate data=sashelp.cars noprint;
       var MPG_City;
       output out=pctl pctlpts=2.5 20 80 97.5 pctlpre=p
              cipctldf=(lowerpre=LCL upperpre=UCL);    /* 12.1 options (SAS 9.3m2) */
    run;
     
    proc print noobs; run;

    The CIPCTLDF= option computes distribution-free confidence intervals for the percentiles that are specified on the PCTLPTS= option. The LOWERPRE= option specifies the prefix to use for lower confidence limits; the UPPERPRE= option specifies the prefix to use for upper confidence limits.
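
    For intuition about the term "distribution-free," here is the textbook order-statistic argument. I am assuming that it captures the general idea behind the option; the exact rule that PROC UNIVARIATE uses to choose the ranks is described in the documentation. Let X_(1) <= ... <= X_(n) be the sorted data from a continuous distribution, and let xi_p be the pth quantile. The number of observations that are less than or equal to xi_p is binomial(n, p), so for ranks l < u:

    ```latex
    P\left( X_{(l)} \le \xi_p < X_{(u)} \right)
      = \sum_{i=l}^{u-1} \binom{n}{i}\, p^{i} (1-p)^{n-i}
    ```

    Choosing l and u so that this sum is at least 1 - alpha gives an interval whose coverage does not depend on the shape of the underlying distribution.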

    If your data are normally distributed, you can use the CIPCTLNORMAL= option on the OUTPUT statement to compute confidence limits. However, if your data are not normally distributed, the CIPCTLNORMAL= option might produce inaccurate results. For example, on the MPG_City data, which is highly skewed, the confidence intervals for large percentiles (like the 99th percentile) do not contain the corresponding point estimate. For this reason, I prefer the distribution-free intervals for most analyses.
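
    If you want to compare the two kinds of intervals side by side, something like the following sketch might work. It assumes that the CIPCTLNORMAL= option accepts the same LOWERPRE= and UPPERPRE= suboptions as the CIPCTLDF= option and that both options can appear in one OUTPUT statement; the prefixes and the name of the output data set are arbitrary choices.

    ```sas
    /* Sketch: distribution-free vs. normal-based limits for large percentiles */
    proc univariate data=sashelp.cars noprint;
       var MPG_City;
       output out=pctl2 pctlpts=90 95 99 pctlpre=p
              cipctldf=(lowerpre=DF_L upperpre=DF_U)       /* distribution-free */
              cipctlnormal=(lowerpre=N_L upperpre=N_U);    /* assumes normality */
    run;

    proc print data=pctl2 noobs; run;
    ```

    In the printed output, compare each point estimate (the variables that begin with p) with its normal-based limits (N_L and N_U) and its distribution-free limits (DF_L and DF_U).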

    tags: Data Analysis

    I highly recommend The Dip, a short book by Seth Godin that lauds the value of quitting. I wrote about this at greater length on my other blog on judo and life, under the topic “Know when to hold ‘em and know when to fold ‘em,” where, being the horrible mean old woman that I am, I suggested that giving up trying to make the Olympic team, going back to school and getting a real job might be a better path for some people.

    Or in the words of not one, but two of my professors in graduate school, at two different institutions thousands of miles apart, the 19th thing I have learned in (almost) 55 years is

    “Never play with a stacked deck.”

    The deck might be stacked against you for a number of reasons. One of the professors who told me that was an African-American woman and at the end of the academic year, she left for another university. She was right that she would probably never get the job she wanted at that university. Her research wasn’t African-American studies – it was policy analysis, and she taught not multi-cultural something or other but statistics. She could have stuck around hoping to get tenure and make them see that she really was just as good, just as smart – or she could have gone to another university where they already knew that.

    The other professor was white, male and vice-president of a major corporation who had come to teach in the MBA program for a year because he felt like it and he was rich and important, so there. We were glad to have him. He was a great professor. He pointed out there are times that you are not going to get what you want, because, say, the company was a family business and the owner’s son was going to end up as president no matter how wonderful you are. It could also be that there is an entrenched group and they are not going to support you in your job no matter what you do. They’ve worked together for twenty years and you just came in here because the boss hired you over them. One of the students asked,

    “Isn’t that letting them win if you just give up and leave?”

    The professor answered,

    “Or, you could stay there for five years and fight them and maybe after five years, bring them around to recognize your contribution to the team and support you. In the meantime, you’ve wasted five years when you could have been working somewhere else where people got behind you and got the job done and been five years further ahead in your career. So tell me, what did you win?”

    Sometimes, it’s not people that have stacked the deck against you. Maybe you have had too many injuries to come back and compete. That may sound hypocritical since I won the world championships with a knee missing all the cartilage and 2/3 the ligaments. The fact is, I was lucky and if I had taken one more shot that took out that last ligament, I would have been done not just competing but probably walking.

    So, that brings me to my 20th thing,

    Know what you are willing to risk.

    In the case of competing, I was willing to risk never walking again without crutches. Thank God for the medical advances in knee replacements or I’d be on crutches now. Right now, I’m making half the money I could be making because I’m spending a lot of time on starting up 7 Generation Games and not taking any new consulting clients.

    This might sound hypocritical again, because isn’t doing a start-up something for only young people? As Vivek Wadhwa said, isn’t it true that the average venture capitalist portfolio consists solely of white and Asian males barely old enough to shave? So isn’t this playing with a stacked deck?

    Not at all. We may not get $10 million in venture capital but I’m okay with that (really). We have learned not to trade our lives for stuff. We’re pretty happy with life because we’ve learned not to want too much what we haven’t got.

    We’re willing to risk some of our own funds and half (or more) of our time for two or three years to make this game happen. Looking at the progress we’ve made so far, the people we have working with us and the work we are all doing, I am pretty optimistic, but it’s a risk. If it doesn’t succeed, I will be disappointed, we all will. Then, we’ll pick ourselves up and after some swearing and possibly a martini or two, we’ll go on to the next idea, because we have learned that failure is never permanent and neither is success.
    See how it all fits together – it’s like Legos.

    I know that’s actually a picture of a capybara and not Legos, but you see, I didn’t have a picture of Legos and I had this one of a capybara and I really do like capybaras.

    Which brings me to my twenty-first thing I have learned …

    You’ll be a lot happier in life if you don’t take yourself too seriously.

    May 03, 2013

    Infographic: why choose open-source EDC?

     

    This infographic is a visual and appealing way to understand why open-source Electronic Data Capture (EDC) is an alternative of choice to proprietary systems or paper-based studies. Do you own a website, a forum, or a blog on clinical trials or Electronic Data Capture (EDC)? You are invited to share this infographic with your readers! How? Simply copy this code to your page: <a href="http://blog.clinovo.com/new-infographic-why-choose-open-source-edc/"><img alt="Infographic: Why choose open-source EDC?" src="http://blog.clinovo.com/wp-content/uploads/2013/05/infographic-2-resized.png" style="height: 1817px; width: 675px;" /></a>[Source: <a href="http://blog.clinovo.com" title="Infographic: Why choose open-source EDC?">eClinical Trends by Clinovo</a>]

     

    These activities may have been announced at SAS Global Forum, but they apply to a much broader audience!  And you can take advantage of them today.  New Program:  Publications is pleased to announce that you can now use Training Points to buy books.  Training Points (previously known as EPTO's) is [...]