Here is the last video from the last day at SAS Global Forum 2012. This one is great! Make sure you stay in there long enough to watch the outtakes: they are so funny!
This past January, I wrote, “As we wrapped up 2011 and began preparing for 2012, we were notified by the Society for Technical Communication, Carolina Chapter, that three SAS Press books received awards in the 2011-2012 competition. . . . Entries receiving a Distinguished or Excellence award from the local [...]
Annette Harris spends several minutes during this video extolling many of the high-performance virtues of Pete Lund, Information Systems Manager, Looking Glass Analytics. One thing she didn't mention (it was mentioned to me later) is that Pete is a long-time member of SAS-L. Do you know how many other SAS User Feedback Award Winners have also been SAS-Lers?
Here is a list of some of the SUGI, SAS Global Forum and SUG papers that Lund has published.
SAS stored processes are similar to SAS programs in that they use the same programming language. Many of my SAS programs I created early on were only used by me, so I could live with a little uncertainty and it was easy enough to check the logs for any issues. If anything went wrong then I knew what I had to change.
Stored processes required a whole new level of thinking – my first few stored processes back in the SAS 9.1.3 days were – well they left a lot to be desired. When I would roll out a new stored process I would often get a call from a user who had done something crazy and the stored process wouldn’t work. For instance, why would anyone put a state name in a customer name field to see if it would return all the customers from that state? Only one thing could happen – no report!
Here are some usability tricks that I have learned with my SAS Stored Processes to make them more robust and harder to break. The out-of-the-box prompts provide a lot of functionality that really helps. That’s right – let’s build a better mousetrap!
If your SAS stored process code will break unless it gets a value from the prompt, require the user to supply a value. For example, if your code is something like the following, your stored process will fail if the user does not make a selection:
proc print data=OPS.orders;
where status = "&StatusPrompt";
run;
One way around this situation is to require an answer. When you create the stored process, select the Requires a non-blank value check box on the General pane. With this selection, SAS will not let the user move forward without an answer. SAS adds an asterisk to the prompt and if the user selects Run without providing an answer, an error message is generated. Very nice built-in functionality. It will save you hours of coding.
Tip #1 can be annoying and you might want to make sure the prompt situation warrants it. As an alternative you can also set a default value for the prompt. Based on the prompt type, there are many ways to indicate the value. This does allow a user to just go with the defaults so the stored process does not generate an error message. The only downside I see is if the user doesn’t understand they can make changes – but most people are computer-savvy enough that they can understand it.
Hint: Set the default value when testing your stored process so you don’t have to fill out the values each time.
If you know the data starts and ends within a certain time frame, then don’t allow the user to select dates outside that range. See the post “SAS Stored Process: How do I use a Date Range Prompt?” for an example of setting minimum and maximum date values.
Here are some examples to give you other ideas. For numeric prompts, if you know the user cannot select a value greater than 100, then limit the value to 100. Likewise, if you are allowing the user to type a value, then set the minimum length to 5; an entry that long is more likely to be a real word. You have to decide based on your data – but I hope these examples give you some ideas of how to use the prompts’ built-in error checking.
These built-in error checks from SAS will save you a lot of coding – so use them to make your stored process more robust and usable, and above all to prevent users from calling you to report what your stored process does not do.
You can learn more tips and tricks for creating, debugging, and using SAS stored processes in the 50 Keys to Learning SAS Stored Processes book. It's a complete guide to SAS stored processes! Download a sample chapter or view the table of contents.
I've been a fan of statistical simulation and other kinds of computer experimentation for many years. For me, simulation is a good way to understand how the world of statistics works, and to formulate and test conjectures. Last week, while investigating the efficiency of the power method for finding dominant eigenvalues, I generated symmetric matrices to use as test cases. Each element of the matrix was drawn randomly from U[0,1], the uniform distribution on the unit interval. As I checked my work for correctness, I noticed a few curious characteristics about the eigenvalues of these matrices:
I noticed that the same pattern held for all of my examples, so I turned to simulation to try to understand what was going on.
I had previously explored the distribution of eigenvalues for random orthogonal matrices and shown that they have complex eigenvalues that are (asymptotically) distributed uniformly on the unit circle in the complex plane. I suspected that a similar mathematical truth was behind the pattern I was seeing for random symmetric matrices.
The key statistical observation is that if the elements of a matrix are random variables, then the eigenvalues (which are the roots of a polynomial of those elements) are also random variables. As such, they have an expected value and a distribution. My conjecture was that the expected value of the largest eigenvalue was n/2 and that the expected values of the smaller eigenvalues are clustered near zero.
In a series of simulations, I generated a large number of random symmetric nxn matrices with entries drawn from U[0,1]. For each matrix I computed the n eigenvalues and stored them in a matrix with n columns. Each row of this matrix contains the eigenvalues for one random symmetric matrix. The ith column contains simulated values for the position of the ith eigenvalue. (Recall that the EIGEN routine in SAS/IML software returns the eigenvalues in (descending) sorted order.) The following graph shows the distribution of eigenvalues for 5,000 random symmetric 10x10 matrices:
Each color represents an eigenvalue, and the histogram of that color shows the distribution of the eigenvalue for 5,000 random 10x10 matrices. Notice that the largest eigenvalue is always close to 5. The largest eigenvalue has a distribution that looks like it might be approximately normally distributed. The distributions for the smaller eigenvalues overlap. A typical 10x10 matrix has 4 smaller positive eigenvalues, 4 smaller negative eigenvalues, and one eigenvalue whose expected value seems to be close to zero.
You can convince yourself that this result is reasonable by considering the constant matrix, C, for which every element is identically 0.5. (I got this idea from a paper that I will discuss later in this article.) This matrix C is singular with n-1 zero eigenvalues. Because the sum of the eigenvalues is the trace of the matrix, C has a single positive eigenvalue with value n/2. The random matrices that we are considering have an expected value of 0.5 in each element, so it makes sense that the eigenvalues will be close to the eigenvalues of C.
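The argument about the constant matrix C can be checked numerically. The following is a minimal sketch in Python (not the SAS/IML code used for this article) that applies a simple power iteration to C and to one random symmetric matrix with U(0,1) entries. The function name power_method, the iteration count, and the random seed are assumptions chosen for illustration.

```python
import random

def power_method(matrix, iterations=100):
    """Estimate the dominant eigenvalue (in absolute value) by power iteration."""
    n = len(matrix)
    v = [1.0] * n                      # positive start vector, largest entry is 1
    estimate = 0.0
    for _ in range(iterations):
        # w = A*v
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        estimate = max(abs(x) for x in w)   # scale factor, since max|v| = 1
        v = [x / estimate for x in w]       # rescale so the largest entry is 1
    return estimate

n = 10

# Constant matrix C: every element is 0.5
C = [[0.5] * n for _ in range(n)]
lambda_C = power_method(C)

# One random symmetric matrix with U(0,1) entries (seed chosen arbitrarily)
random.seed(1)
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        A[i][j] = A[j][i] = random.random()
lambda_A = power_method(A)

print(lambda_C)   # 5.0, which equals n/2
print(lambda_A)   # dominant eigenvalue of the random matrix
```

For C the estimate is exactly n/2. For the random matrix the estimate should land in the neighborhood of n/2, in line with the histograms described above, although the exact value depends on the seed.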
I did a few more simulations with matrices of different sizes. The patterns were similar. I suspected that there was an underlying mathematical theorem at work. As for orthogonal matrices, I conjectured that the theorem was probably an asymptotic result: as the size of the matrix gets large, the eigenvalues have certain statistical properties. To test my conjecture, I repeated the simulation for random 100x100 matrices. The following graph shows the distribution of the eigenvalues for 5,000 simulated matrices:
The scale of the plot is determined by the distribution of the eigenvalue near 50=100/2. The other histograms are so scrunched up near zero that you can barely see their colors.
To better understand the expected values of the eigenvalues, I computed the means of each distribution. These sample means estimate the expected values of the eigenvalues for n=100. For my simulated data, I found that the dominant eigenvalue is centered at 50.16. The confidence interval for that estimate does NOT include 50, so I would conjecture that the eigenvalue approaches n/2 from above.
If you plot a histogram of the non-dominant eigenvalues, you get the following graph:
When I saw the shape of that histogram, I was surprised. I expected to see a uniform distribution. However, my intuition was mistaken and instead I saw a shape that curves down sharply at both ends. After I generated eigenvalues for even LARGER matrices, I commented to a colleague that "it looks like the density is semi-circular." However, I had never encountered a semi-circular distribution before.
At the end of my previous article, I mentioned a few of these conjectures and asked if anyone knew of a theorem that describes the statistical properties of the eigenvalues of a random symmetric matrix. Remarkably, I had an answer within 24 hours. Professor Steve Strogatz from Cornell University commented as follows:
Each entry for such a matrix has an expected value of mu= 1/2, and there's a theorem by Furedi and Komlos that implies the largest eigenvalue in this case will be asymptotic to n*mu. That's why you are getting n/2. And the distribution of eigenvalues (except for this largest eigenvalue) will follow the Wigner semicircle law.
The reference Strogatz cites is a 1981 article in Combinatorica titled "The eigenvalues of random symmetric matrices." The first sentence of the paper is "E. P. Wigner published in 1955 his famous semi-circle law for the distribution of eigenvalues of random symmetric matrices." I suppose "famous" is a relative term: I had never heard of the "Wigner semicircle distribution," but it is famous enough to have its own article in Wikipedia.
The paper goes on to formulate and prove theorems concerning the eigenvalues of random symmetric matrices. The theorems explain the phenomena that I noticed in my simulations, including that the dominant eigenvalue is approximately normally distributed and its expected value converges to n/2 from above. See the paper for further details.
If you would like to conduct your own simulations, this section includes the SAS/IML program used to generate the symmetric random matrices. Although I generated the elements from U[0,1], Wigner's result holds for a broad class of element distributions (the usual statements assume finite variance), as do the results of Furedi and Komlos under the assumptions stated in their paper. The following program writes the eigenvalues to a SAS data set with 5,000 rows and n variables named e1, e2, ..., en. The value of n is determined by the size macro variable.
%let size=10; /* controls the size of the random matrix */
proc iml;
/* find eigenvalues of symmetric matrices with A[i,j]~U(0,1) */
NumSim = 5000;
call randseed(1);
n = &size;
r = j(n*(n+1)/2, 1);/* allocate array for symmetric elements */
results = j(NumSim, n);
do i = 1 to NumSim;
call randgen(r, "uniform"); /* fill r with U(0,1) */
A = sqrvech(r); /* A is symmetric */
eval = eigval(A); /* find eigenvalues */
results[i,] = eval`; /* save as ith row */
end;
labl = "e1":("e"+strip(char(n))); /* var names "e1", "e2", ... */
create Eigen from results[c=labl];
append from results;
close Eigen;
quit;

The following SAS program is used to analyze the eigenvalues, including making the graphs shown in this article.

/* make a histogram statement for e1-e&n */
%macro eigenhist(n);
%do i = 1 %to &n;
histogram e&i / transparency = 0.7;
%end;
%mend;

/* overlay distributions of e1-e&size */
proc sgplot data=eigen noautolegend;
%eigenhist(&size);
yaxis grid;
xaxis grid label="Eigenvalue";
run;

/* dist of expected values */
/* compute mean of each variable (or use PROC SQL or PROC MEANS...) */
proc iml;
use eigen;
read all var ("e1":"e&size") into X;
close eigen;
mean = T(mean(X));
rank = T(do(&size,1,-1)); /* &size, ..., 2, 1 */
create EigMeans var {"rank" "Mean"};
append;
close EigMeans;
quit;

/* histogram of the mean non-dominant eigenvalues (rank < &size) */
proc univariate data=EigMeans;
where rank<&size;
var Mean;
histogram Mean;
run;
There are some things to like about Statistica. The scatter plot matrix, for one. I’d done a sentiment analysis of a data set on blog posts (not mine). For each post, I had three variables
I thought people who comment a lot would be the ones who had the most negative comments, whereas there would not be as much of a correlation between positive comments and frequency.
I like the graphic output you get, which shows a frequency distribution for each variable and a plot for each pair. All at once you can get a sense of the strength of the correlation, whether it might be affected by restriction of range – as shown by a skewed distribution – or by outliers.
There seems to be an actual correlation between the number of positive comments and the number of negative comments. Also, positive comments outnumber negative comments almost three to one.
One might be tempted at this point to run out and say,
“Oh, look! Sentiment is very positive!”
Also, it appears that people who have more negative comments also have more positive comments, this means that ….
Just stop right there.
Before saying this means anything, you should go back and take a look at the comments being categorized as positive or negative. The first thing you will note is that computers are very poor at detecting sarcasm, subject changes and idioms. The data came from comments on blogs related to Apple computer products. Here are just a few of the cases where I disagreed with the computer.
I’m not saying that Statistica is bad - I don’t think it is – or that text mining is useless – I don’t think that, either.
What I DO think is that text mining has to be an iterative process. First, you get your results and then you examine them, make some changes – in this case I would start with the synonyms data – and you re-run your analysis.
Off to bed. I have to be up in six hours and head to the Black Belt Magazine studios for a photo shoot on our new book that is coming out this fall, Winning on the Ground: Championship matwork for judo, grappling and mixed martial arts.
It’s a bit of a leap from text mining, but, variety IS the spice of life.
SAS users, by definition, do not embrace the mysterious.
That's one of the main reasons that they use SAS: to demystify some data or process. And so, when you (as a SAS user) have gone to the trouble of designing a process flow in SAS Enterprise Guide, you like to be aware of some basic metrics, such as "how long will it take to run?"
It's difficult to predict how long a SAS process will take to run, as it depends more upon the data than on the actual program instructions. But one thing that we're very good at is telling you how long it took to run the last time that you ran it. In SAS Enterprise Guide, you can find this information at the task level by right-clicking on the task (or program node) and selecting Properties. On the General tab, you'll see the "Last execution time".
If you want to gather this information at a process flow or project level, you can repeat these steps for each item in the flow, make a note of the "execution time", then add up the numbers (expressed in hours, minutes, and seconds) to create a grand total. This tedious assignment makes for a perfect torture device for a summer intern who, in this economy, should be grateful to have a job at all.
Or, to make the job less tedious, you could use the Project Reviewer task. This is a custom task (available for download here) that shows a summary view of your process flows and allows you to create a report from the information. The task works with SAS Enterprise Guide 4.3 and 5.1.
Features of the Project Reviewer task include:
1. A selection list with all of the process flows within your project.
2. A list of each "runnable" task; that is, program, task, query, export step, etc. Each task has an "ordinal" (its sequence in the process flow), a name, a descriptive type, the user ID of the last person to modify it, the running time for its most recent iteration, the date/time modified, the date/time created, and whether the task generated errors ("red X" in the flow).
You can sort the items in the list by clicking on the column header for the value you want to sort. Click on the column header again to reverse the sort sequence.
3. A summary of the task count, and total clock time that "running" the flow represents.
4. A Create Report button, which generates a SAS program to produce a simple report of all of the project contents, summarized by each process flow. If you have multiple SAS environments, you can use the Report server list to select which SAS server to use when processing the report.
Here's a sample of the report output:

The reporting process also generates a SAS program and a data set (which are added to your project), so that you can easily adapt these for custom reports.
Let me know if you find this task to be useful and whether you have any improvements to suggest.
Some final notes/links:
~ Contributed by Lelia McConnel, Technical Support Consultant, SAS ~ Base SAS 9.3 has made creating high quality graphics output easier than ever. Did you know that you can create great looking, high resolution graphs with Base SAS? The Customer Support Website provides sample code to help you create graphs [...]
SAS Publications participates in a number of conferences, from SAS events to solution and industry-focused conferences. We know our customers are looking to make the most of their conference experience, and we want to make your visit to the Exhibit Hall as helpful as possible. We want to know your [...]
I have been a SAS fan in a number of roles over the past 15+ years (programmer, statistician, data warehouse developer, business analyst, consultant, trainer, partner), and I am also the Queensland Users Exploring SAS Technology (QUEST) chairperson. I feel very privileged to be able to contribute to the wider SAS community in this blog post.
QUESTors meet four times each year at a venue in central Brisbane, Australia. At the next QUEST meeting, to be held May 31, I will be presenting Learning SAS – where to start?, what is available?, how can I network?, who should I know?
Having recently attended the SAS Global Forum in Orlando, Florida I was inspired by Joe Theismann’s Keynote Presentation, Game Plan for Success where he spoke about his life experiences and his steadfast attitude towards life. What interested me was his approach to life in adapting to change. He spoke about his injuries, and the paths he has taken, with passion and enthusiasm.
Passion and enthusiasm are things I have always enjoyed in my journey with SAS, and I get a buzz out of sharing with other SAS users in the community. Whether you’re a seasoned SAS user with experiences to share, or a newcomer who’d like to interact with other users, a SAS user group and the many internet-based networking mediums are an excellent way to learn, network, share and collaborate with your local SAS community and beyond. And, in the words of Joe … “People don’t care about how much you know, until they know how much you care.”
My presentation is a guide for programmers and non-programmers (analysts, business users, administrators, management) about learning SAS and connecting with SAS users.
First stop is to gain an understanding of the SAS 9 architecture, the components and how they fit together. This can be easily achieved by reading the SAS 9.3 Intelligence Platform: Overview document. This document is available on the support.sas.com website, which contains a plethora of information relating to documentation, training, support and access to the SAS community.
Get to know your local and global resources. Reach out to the people in your local SAS office, SAS partners in your area and other SAS users. Or use websites such as support.sas.com, communities.sas.com, blogs.sas.com, sascommunity.org, LinkedIn.com groups, and sasprofessionals.net for resources and experts.
Going to your local SAS user group is a good first step. Of course there’s social media too, which is also a great way for introverts to interact. These include Twitter, Facebook, LinkedIn, YouTube, blogs and discussion forums. Pick the platform that you use most often and/or subscribe to RSS feeds and keep up-to-date using an RSS reader such as Google Reader.
You could start off by getting to know the people in your local community. Perhaps aim to meet one new person at each local SAS user group event and maybe connect with them on LinkedIn. Be involved in the discussion forums in your area of interest and leave comments on blog posts. It comes down to your interests and what you would like to learn. There are many ways to connect, participate and collaborate – taking the first step can be the hardest. As Chris Hemedinger said in the closing session of SAS Global Forum 2012 “Social media has spoiled us … information comes to us, we don’t have to find it.”
I hope you too can see that there are many opportunities available in your QUEST to learn SAS – you just have to use them….
Register for QUEST by Thursday May 24 by emailing us at quest@oz.sas.com.
OR register online at www.sas.com/australia/usergroups/quest.
SESUG (Southeast SAS Users Group Conference) is an annual conference held in the Southeast US - typically in September or October. This year, the conference will be held at the Sheraton Imperial Hotel and Convention Center in Durham, North Carolina, October 14-16.
According to Peter Eberhardt, SESUG 2012 Academic Chair, the SESUG 2011 Junior Professional Program (new in 2011) was very successful. "I am proud to say, that for 2012 we are continuing the program. Check out our website for more details."
The grant covers conference registration and one four-hour workshop. Workshops will be offered on Sunday at 11:30 am. Travel expenses and food are not covered. Junior Professionals presenting a paper will get priority consideration.
To qualify, applicants must have been using SAS in their job for 36 months or less and would otherwise not be funded by their company to attend the conference.
"The important date to remember is June 4, when applications will be available," says Eberhardt. "But, you don't have to wait until then to get more information. Check out our website and contact Barbara or Deb with any questions you might have."
SAS Global Forum 2012 was a success! After a whirlwind week of activities followed by a vacation and week of rest – I’m ready to give you some highlights. It was a lot of fun! Tip: Click on any picture to enlarge it.
Day 1 – Saturday Ready for the Tweet-Up
The biggest drama was at the airport – our flight was delayed due to mechanical failure so I decided it might be better to take a later flight. Met @Steve0verton and @PhilipB at the airport; both were headed to Orlando. As a result of the later flight we were late to the Tweet-Up so we missed the first round of drinks. It was sunset when we landed and the weather was mild – very nice for Florida. We had a lot of fun. @WaynetteTubbs hosted the event and she had a trivia contest. I won an LED key chain and some SAS Post-It notes. I love Post-It notes. Plus I got to meet Anna Maria who writes the sassy Julia Group blog – she’s such a sweetie. And Andrea Wainwright Zimmerman told me that she was the academic chair for SESUG in St. Pete, FL next year. I may have to volunteer – sounds like a conference I don’t want to miss. Plus if I’m not mistaken she also won one of the best contributed paper awards. [Check out: Quick and Dirty Excel® Workbooks Without DDE or ODS - a little birdie told me it was one of the most popular papers in the session!]
Day 2 – Sunday Opening Session
Definitely had too much fun on Saturday night – just some helpful advice – never forget to bring aspirin! It didn’t matter; I was still ready to get registered. @AngelaHall1, @Steve0verton, and I had a good time hanging at breakfast and walking around the beautiful Disney property. Angela and I were discussing a new book about Dashboards. We would like to include videos this time around – what do you think? @GordonCox and Greg Nelson taught an 8 AM class about SAS BI System Administration, which I only heard good things about. @saspublishing might have new book authors! [Hey - if you have a book idea tweet @SSessoms about it!] After a great lunch with Andrew Karp, who runs the Sierra Information Systems site (he has some free SAS presentation offers after you sign up), it was back to the beautiful resort to wait for the Demo Hall to open! At 4pm we were finally able to enter the Demo Hall so I could find the winner of the 50 Keys to Learning SAS Stored Processes book – which was Michelle Homes. She was working the Metacoda booth, so she gave me a demonstration of the Metacoda Security Plug-in – it is AWESOME! It makes managing your SAS BI users a snap. Look for a later post about this SAS Management Console Plug-In, which her wonderful husband Paul Homes coded all by himself. At the evening party I finally got to meet Chris Swenson and talk about blogging for an hour or so. Also had not seen Ben Zenick in a month of Sundays – so it was awesome catching up with him. [SAS BI developers looking for a job ... check out Zencos.] The opening ceremony was something else. It was a huge room so you can get an idea of how large the screens were. The graphics were beautiful – the following 30-second clip gives you an idea of how amazing the entire presentation was. I really enjoyed watching Dr. G get behind the computer to drive the new High Performance Analytics software – turn an 18-hour job into a 14-minute one. Wow – that’s intense.
Here’s the Livestream of the Opening Ceremony.
Day 3 – Conference Begins!
Attended @Steve0verton’s award-winning presentation, “Lost in Wonderland? Methodology for a Guided Drill-Through Analysis Out of the Rabbit Hole?”. After his presentation he was mobbed by folks wanting to ask questions about BI. I caught up with Greg Nelson later in the day as he was being interviewed by AllAnalytics.com for their man on the street at SAS Global Forum. In the evening there was a huge party in the demo hall, lots of people milling about checking out all the new toys from SAS and other vendors. Oooh … got to see Don Henderson’s super cool new book SAS Server Pages: Generating Dynamic Content - it’s an online book with videos. It’s such a super cool idea!! The highlight of the day was attending the Authors’ Dinner. I sat with Chris H (a real SAS Dummy) and Julie Pratt (my favorite editor at SAS Press). It was a great combination of fun, food, and friends. I laughed so hard my stomach hurt.
Day 4 – Presentation Day
Tuesday was our big presentation day for “Get Your Fast Pass to Building Business Intelligence with SAS and Google Analytics”. It was a great turnout – over 125 people and standing room only. SAS Press gave away a free copy of the book, which a lovely lady from Western Kentucky won. Special thanks to section chair Nancy Brucken for hiding the post-it note under a chair at the last minute!! You rock!!! We were so pleased that everyone enjoyed our presentation and had so many nice things to say about it. [We love praise! And hey ... we wrote a book!!!!] Charlie Huang attended the presentation and we chatted more about using Google Analytics. Angela told me that she ran into a group of users who had 5 copies of the Building Business Intelligence Book at their office … Wow! Sounds like that team is going to be SAS BI Driven! Later I attended Kirk Paul Lafler’s talk about Top Ten SAS Performance Tuning Techniques. Picked up some thoughts about how the I/O may be causing more issues than I had considered before. He also gave me a SAS Nerd ribbon to wear – first I had to proclaim my undying loyalty to SAS. Easy! Afterwards, Angela and Brent Whitesel talked about changing your metadata – very enlightening. I volunteered to help with a few afternoon sessions – it was fun, and I encourage you to do it also! In the evening I had dinner with a large group at Shula’s. Beth Schultz, the AllAnalytics.com editor, joined us. Judging from her subsequent video blog, I think she caught the SAS bug and she’s not even a user! [Oops!]
Day 5 – Over so Quick?
Last day of the conference is sad but I was so tired I could hardly hold my head up. Guess I need more than 4-5 hours of sleep a night. While waiting to get our makeup put on, Andrew T. Kuligowski dropped by to wish us good luck. We appreciated his thoughtfulness considering how busy he was! Eric, Angela, and I presented a SAS Talk about business intelligence that really put us on the spot! It was fun but a little scary. Roxie put a lot of makeup on us! [Chris talks about the makeup of SAS Global Forum.] After we got all gussied-up, we were ready to talk to the users. The worst part is that it was hard to hear the audience questions. So I know for a few questions we just heard a keyword and just started talking about that subject. So if you think we gave you a goofy answer – we didn’t hear you. Thanks to all the audience members who asked questions and encouraged us – we could not have done it without you.
And just like that … it was time for the closing session. Chris closed the conference with some High Performance blah, blah, blah. I didn’t get to see everything I wanted or talk to everyone I wanted … guess I’ll see you in San Francisco! Here were my tips for surviving the conference.
Really needed a vacation after all of that excitement. We ran over to Tampa for a few days – sunsets over the bay and fancy Aussie wine! I miss all the SAS nerds — group hug!!!! xoxoxox
In example 9.30 we explored the effects of adjusting for multiple testing using the Bonferroni and Benjamini-Hochberg (or false discovery rate, FDR) procedures. At the time we claimed that it would probably be inappropriate to extract the adjusted p-values from the FDR method from their context. In this entry we attempt to explain our misgivings about this practice.
The FDR procedure is described in Benjamini and Hochberg (JRSSB, 1995). It is usually classified as a "step-up" procedure, because it starts with the largest p-value and works toward the smallest. Put simply, the procedure has the following steps:
0. Choose the familywise alpha
1. Rank order the unadjusted p-values
2. Beginning with the Mth (largest) of the ordered p-values, set m = M and consider p(m),
2a. if p(m) <= alpha*(m/M), then reject all tests 1 ... m,
2b. if not, m = m-1
3. Repeat steps 2a and 2b until the condition is met
or p(1) > alpha/M
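The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the SAS or R code used in this entry; the function name bh_reject is hypothetical, the p-values are the ten values used in the examples in this entry, and the alpha levels are arbitrary choices for the demonstration.

```python
def bh_reject(pvalues, alpha):
    """Return m, the number of ordered tests rejected (tests 1..m), or 0 if none."""
    ordered = sorted(pvalues)            # step 1: rank order the p-values
    M = len(ordered)
    for m in range(M, 0, -1):            # step 2: start with the Mth p-value
        if ordered[m - 1] <= alpha * m / M:
            return m                     # step 2a: reject all tests 1 ... m
        # step 2b: otherwise move on to m - 1
    return 0                             # p(1) > alpha/M: nothing is rejected

pvals = [.001, .001, .001, .001, .001, .03, .035, .04, .05, .05]

rejected_strict = bh_reject(pvals, alpha=0.01)     # an assumed, stricter level
rejected_none = bh_reject(pvals, alpha=0.0005)     # an assumed, very strict level
print(rejected_strict, rejected_none)   # 5 0
```

At alpha = 0.01 the loop passes over the five larger p-values and stops at m = 5, because .001 <= .01*(5/10), so the five smallest p-values are rejected together.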
The FDR-adjusted p-values ap(m) are computed as follows:
1. Rank order the unadjusted p-values
2. For ordered p-values p(m), m = M down to 1 (with ap(M) = p(M)),
2a. candidate ap(m) = p(m) *(M/m)
2b. if candidate ap(m) > ap(m+1) then ap(m) = ap(m+1)
2c. else ap(m) = candidate ap(m)
data fdr;
array pvals [10] pval1 - pval10
(.001 .001 .001 .001 .001 .03 .035 .04 .05 .05);
array cfdrpvals [10] cfdr1 - cfdr10;
array fdrpvals [10] fdr1 - fdr10;
fdrpvals[10] = pvals[10];
do i = 9 to 1 by -1;
cfdrpvals[i] = pvals[i] * 10/i;
if cfdrpvals[i] > fdrpvals[i+1] then fdrpvals[i] = fdrpvals[i+1];
else fdrpvals[i] = cfdrpvals[i];
end;
run;
data compare;
set fdr (in = cfdr rename=(cfdr1=c1 cfdr2=c2 cfdr3=c3 cfdr4=c4
cfdr5=c5 cfdr6=c6 cfdr7=c7 cfdr8=c8 cfdr9=c9))
fdr (in = fdr rename=(fdr1=c1 fdr2=c2 fdr3=c3 fdr4=c4 fdr5=c5
fdr6=c6 fdr7=c7 fdr8=c8 fdr9=c9));
if cfdr then adjustment = "Candidate fdr";
if fdr then adjustment = "Final fdr";
run;
proc print data = compare; var adjustment c1-c9; run;
adjustment      c1     c2     c3     c4     c5     c6    c7    c8    c9
Candidate fdr   0.010  .005   .0033  .0025  .002   .05   .05   .05   .055
Final fdr       0.002  .002   .0020  .0020  .002   .05   .05   .05   .050
fakeps = c(rep(.2, 5), 6, 7, 8, 10, 10)/200
cfdr = fakeps * 10/(1:10)
rbind(cfdr, fdr=p.adjust(fakeps, "fdr"))[,1:9]
      [,1]  [,2]   [,3]   [,4]  [,5] [,6] [,7] [,8]   [,9] [,10]
cfdr 0.010 0.005 0.0033 0.0025 0.002 0.05 0.05 0.05 0.0556  0.05
fdr  0.002 0.002 0.0020 0.0020 0.002 0.05 0.05 0.05 0.0500  0.05
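To trace the adjustment outside SAS or R, the steps listed earlier can also be sketched in plain Python. This is only a minimal sketch under our own assumptions: the function name bh_adjust is made up for illustration, the cap at 1 mirrors what R's p.adjust() does, and the ten p-values are the ones used in the examples above.

```python
def bh_adjust(pvalues):
    """Return FDR-adjusted p-values, in the same order as the input."""
    M = len(pvalues)
    # Positions of the p-values when sorted from smallest to largest.
    order = sorted(range(M), key=lambda k: pvalues[k])
    adjusted = [0.0] * M
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest p-value (rank M) down to the smallest (rank 1).
    for rank in range(M, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * M / rank       # step 2a
        running_min = min(running_min, candidate)  # steps 2b and 2c
        adjusted[idx] = running_min
    return adjusted


pvals = [.001, .001, .001, .001, .001, .03, .035, .04, .05, .05]
for p, ap in zip(pvals, bh_adjust(pvals)):
    print(f"unadjusted {p:.3f}  adjusted {ap:.4f}")
```

Because each adjusted value depends on the running minimum over all larger p-values, changing or removing other tests in the family changes the adjusted value for a given test, which is one way to see why the adjusted p-values are hard to interpret out of context.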
One of the first skills that a beginning SAS/IML programmer learns is how to read data from a SAS data set into SAS/IML vectors. (Alternatively, you can read data into a matrix.) The beginner is sometimes confused about the syntax of the READ statement: do you specify the names of the variables in the data set, or the names of the SAS/IML vectors that you are trying to create?
The answer is "yes." :-) By default, SAS/IML creates vectors that have the same name as the data set variables that you specify on the READ statement. For example, if you want to read variables from the Sashelp.Class data set, use the VAR clause on the READ statement to specify the variable names, like so:
proc iml;
use Sashelp.Class;
read all var {sex height weight};
This code snippet creates three vectors (Sex, Height, and Weight) that each contain 19 rows, which is the number of observations in the data set. However, the question that I've been asked is "What exactly is that thingy between the curly braces?" Is it a list of vector names? Is it something else?
The answer is "It is a character vector that specifies the names of variables in the data set." The confusion arises because the "thingy between curly braces" doesn't have any quotation marks. However, in the SAS/IML language, characters inside of curly braces are transformed to upper-case strings. In other words, the following two statements are equivalent:
c = {sex height weight}; /* converted to upper case strings */
c = {"SEX" "HEIGHT" "WEIGHT"};
The first statement does not contain quotation marks, but the parser recognizes that it is an array of character values. Therefore, each character string is converted to upper case before it is stored in the vector c.
Because SAS variables are not case-sensitive, SAS software doesn't care how you specify the names of data set variables. Upper case, lower case, mixed case,...it's all the same to SAS. Consequently, the original SAS/IML statements are equivalent to the following statements:
read all var {"SEX" "Height" "WeIgHt"}; /* names are not case sensitive */
In either case, the READ statement creates three vectors that have the same names as specified on the VAR clause.
Since all specifications read the same variables, you might wonder what label appears when the PRINT statement is used to display a vector. Upper case? Mixed case? The answer is that the PRINT statement uses the same case as the name of the variable in the data set. For this example, the variables in the Sashelp.Class data are mixed case with a leading upper-case letter, as shown in the output of the following PRINT statement:
print sex height weight; /* print in same case as in data set */
For more details on how to read data into SAS/IML variables, see my article "Reading SAS data sets."
The advantage of learning a new language is it sometimes makes you re-think the old languages you know. For example, here is a problem that happens often:
Some people are morons.
For example, say I were to ask you the following question:
“How old are you?”
YOU would probably answer something like, 42 or 21. You didn’t mistake that for an essay question, now, did you? That, my dear reader, is because YOU are not a moron. However, trust me when I tell you that other people are not as smart as you.
A rather annoying percentage of people enter responses along the lines of:
I am 47 years old.
I just turned 21. Happy Birthday to me.
36 years
87 (yes, eighty-seven)
54 yrs.
and so on ….
Just using the sub-string function to read in the first two characters won’t work, obviously.
Well, I was doing something in javascript where I asked the person their age and then stripped off everything but numbers before I tried to use the age they had given me, like so:
var age = prompt("How old are you?");
var ageyears = age.replace(/[\D]/g, '');
Usually, in my SAS programs, I would just define age as a numeric variable, and everyone who included text had their value set to missing. Or, if I wanted to minimize missing data, I would write a statement to read in just the first two characters, or maybe to strip out “years” and “yrs”. However, the latest data set I have seems to be a sample of people who are creatively annoying, so I had to either settle for a lot of missing data or do something else. I got to thinking that there MUST be some function in SAS that does something similar.
Well, wouldn’t you know ….
Age_numonly = compress(age,'0123456789','K');
Having the ‘K’ at the end reverses what the COMPRESS function normally does: instead of deleting your numbers, it keeps them. I don’t know how I did not know this. Maybe I knew it at one point and forgot it? Be sure you have the ‘K’ in quotes, by the way.
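Just to show the same keep-only-the-digits idea in a third language, here is a minimal Python sketch. The function name digits_only is made up for this example, and the test strings are the responses listed above; it is an illustration, not something from my actual project.

```python
import re


def digits_only(response):
    """Drop every character that is not a digit, like COMPRESS with 'K'."""
    return re.sub(r"\D", "", response)


responses = [
    "I am 47 years old.",
    "I just turned 21. Happy Birthday to me.",
    "36 years",
    "87 (yes, eighty-seven)",
    "54 yrs.",
]
for answer in responses:
    print(repr(answer), "->", digits_only(answer))
```

One caution that applies to all three versions: if a response contains two numbers, the digits run together, so an answer that mentions both 47 and 45 comes back as 4745. A range check on the result is still a good idea.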
Well, now I have it stored in my blog, which is better than having it in memory, because unlike my memory, this blog gets backed up regularly.
Remarkably, this week's tip was initially inspired by the Guinness Brewery in Dublin, Ireland. In his new book Statistical Analysis for Business Using JMP, Professor Willbann Terpening provides lots of useful information - including the origin and usage of the Student t-distribution. If you'd like to get to know Willbann and [...]
Part of what captivated me about this paper and poster presentation were the presenters - these guys are high school kids using SAS to do a visual analysis of Internet use by high schoolers. The idea was so compelling that Anna Brown and Inside SAS Global Forum went to talk to two of the presenters to learn what they researched and why they started the project.
Aren't these guys fascinating?! They are definitely going to be competing for one of the sexy statistician jobs of the future! How can you use their inspiration in your research?
Here's a link to their paper, "A Week in the Life": A Visual Analysis of Internet Use by School-Age Students.
ods html file = 'c:\tmp\prdsal2.xls' style = minimal;
title;
proc print data = sashelp.prdsal2 noobs;
run;
ods html close;
Sub createPT()
    ' Set storage path for the pivot table
    Dim myDataset As String, myFilepath As String
    myDataset = "sashelp.prdsal2"
    myFilepath = "c:\tmp\" & myDataset & "_" & Format(Date, "dd-mm-yyyy") & ".xlsx"
    Dim myPTCache As PivotCache
    Dim myPT As PivotTable
    ' Delete the sheet containing the previous pivot table
    Application.ScreenUpdating = False
    On Error Resume Next
    Application.DisplayAlerts = False
    Sheets("Pivot_Table_Sheet").Delete
    On Error GoTo 0
    ' Create the cache
    Set myPTCache = ActiveWorkbook.PivotCaches.Create( _
        SourceType:=xlDatabase, SourceData:=Range("A1").CurrentRegion)
    ' Add a new sheet for the pivot table
    Worksheets.Add
    ActiveSheet.Name = "Pivot_Table_Sheet"
    ' Create the pivot table
    Set myPT = ActiveSheet.PivotTables.Add( _
        PivotCache:=myPTCache, TableDestination:=Range("A5"))
    With myPT
        .PivotFields("COUNTRY").Orientation = xlPageField
        .PivotFields("STATE").Orientation = xlRowField
        .PivotFields("PRODTYPE").Orientation = xlRowField
        .PivotFields("PRODUCT").Orientation = xlRowField
        .PivotFields("YEAR").Orientation = xlColumnField
        .PivotFields("QUARTER").Orientation = xlColumnField
        .PivotFields("MONTH").Orientation = xlColumnField
        .PivotFields("ACTUAL").Orientation = xlDataField
        .PivotFields("PREDICT").Orientation = xlDataField
        .DataPivotField.Orientation = xlRowField
        ' Add a calculated field to compare the predicted value and the actual value
        .CalculatedFields.Add "DIFF", "=PREDICT-ACTUAL"
        .PivotFields("DIFF").Orientation = xlDataField
        ' Specify a number format
        .DataBodyRange.NumberFormat = "$#,##0.00"
        ' Apply a style for pivot table
        .TableStyle2 = "PivotStyleLight18"
    End With
    Range("A1").FormulaR1C1 = "Pivot table made from data set" & " " & myDataset
    Range("A2").FormulaR1C1 = "Prepared by WWW.SASANALYSIS.COM on " & Date
    ActiveWorkbook.SaveAs Filename:=myFilepath, _
        FileFormat:=xlOpenXMLWorkbook, CreateBackup:=False
    ' Restore the application settings that were switched off above
    Application.DisplayAlerts = True
    Application.ScreenUpdating = True
End Sub
wherein I came across some code for reading in Windows environment variables.
In the previous episode, we built our own custom SAS function - a masterful trick indeed. Gordon Keener, a developer here at SAS, responded exuberantly "You think that's cool? - try THIS!" and proceeded to demonstrate prodigious powers with the SAS by using a custom function in a custom informat [...]
The semester is coming to an end and summer break is just around the corner, so why should you be thinking about conferences and grant opportunities? Conferences like SESUG provide a lot of great benefits to students.
SAS users group conferences are a great source of professional development. They offer workshops and presentations for those with advanced skills as well as those new to SAS who are interested in learning more. They are also a great place to hone your skills as a presenter. This is a place where students can showcase their work and abilities in front of a large number of potential employers. There will be numerous networking and social events, and many of the participants are hiring managers who are interested in meeting recent and upcoming graduates who have the knowledge and skills they are looking for.
The SESUG conference will be held at the Sheraton Imperial Hotel and Convention Center in Durham, North Carolina, October 14-16. The SESUG Student Grant application process is open now. There are two levels of support, one of which includes a travel stipend. This is a great way to build out your resume and network with professionals from a variety of industries.
A message from Deborah Skinner, SESUG Student Grant Coordinator
Students selected for the grant program receive the VIP treatment at our conference. Grant winners are recognized at the opening session and are guests at a special student luncheon, as well as having special ribbons on their conference badges. In addition, there are meals and mixers for all of the attendees – so there are plenty of opportunities to mingle and meet. Previous student grant winners have consistently told us what a wonderful time they had at the conference and what an eye-opening experience it was for them in planning for their future.
I can promise you that you will walk away from SESUG 2012 with new knowledge about SAS, new insights into your future, and lots of new friends and colleagues. You will have lots of opportunities to learn, grow, and frankly just have a lot of fun. So what are you waiting for? The SESUG website contains all of the details and a link to the application.
NOTE: Originally published on Generation SAS
One SAS Enterprise Guide feature I particularly like is the ability to import Microsoft Excel data quickly and easily. SAS offers many ways to work with Excel spreadsheets but often I find I just want to extract data from Excel and get on with my job.
If you are trying this process for the first time, use a “known good” or simple spreadsheet so if any issues arise you can at least eliminate the data as the cause. When this process fails, I generally find that the spreadsheet has something odd going on, such as pasted text, etc. SAS Enterprise Guide has some sample spreadsheets available, which I use in this example.
The SupplyInfo.xls spreadsheet is available in the SAS Enterprise Guide Sample data sub-directory. It has two sheets: Suppliers and Shippers. Let’s import the Suppliers spreadsheet for some quick analysis. Here is the location of my sample files. [Read Create Your Own Sample Data for SAS BI for ideas about where other sample data lives.]
I’m using SAS Enterprise Guide 5.1; as far as I can tell the wizard has not changed much from earlier releases so you should be able to follow along.
After the import completes you will have a fresh dataset to use for analysis.
If you don’t like how the data appears, you can tinker with the results. The Modify Task button re-starts the Data Import wizard. You can also right-click the Import Data icon to make changes.
If you later add more rows to the spreadsheet, just Run the Process Flow again. You can re-import the spreadsheet a thousand times if you want to spend your day doing that.
Would you like to get to know others who share a common interest in SAS books and documentation? We’ve made it easy for you. Besides reading this blog, here are 3 places to discuss our publications and get real-time announcements. Fans of SAS Books on LinkedIn SAS Publishing on Twitter SAS Publishing’s Facebook [...]
In a previous post I showed how you can use Windows PowerShell (with the SAS Local Data Provider) to create a SAS data set viewer. This approach doesn't require that you have SAS installed, and allows you to read or export the records within a SAS data set file.
In this post, I'll present two companion scripts that allow you to:
If you make use of the SAS DICTIONARY tables (as seen in SASHELP.VMEMBER and SASHELP.VCOLUMN), these scripts will provide familiar information. But like my previous example, these scripts do not require a SAS installation.
Why is this useful? Of course, the best way to read SAS data sets is to use SAS. And if you have SAS data sets, the probability is high that you have SAS installed somewhere, so why not use it? It turns out that even among companies that use SAS, not every employee has access to a SAS environment. (Tragic, right?) And since SAS data sets are often treated as a corporate asset (or, at least, the information within the data sets is), these are subject to cataloging and auditing by staff who don't use SAS. These scripts can enable a light-weight auditing process with a minimum of installation/licensing complications.
Here are links to all three scripts. To use them, save each file to your local PC as a .PS1 file. You will also need to make sure that you can run Windows PowerShell scripts, and that you have the SAS OLE DB Local Data Provider installed (free to download).
The output of each of these scripts is in the form of PowerShell objects, which are most useful when piped into another PowerShell cmdlet such as Out-GridView (for visual display) or Export-CSV (for use as input to Excel or another data-driven process).
To view the table information about all SAS data sets in the file path w:\ (including subfolders):
.\ReadSasDataTables.ps1 w:\ | Out-GridView
To export the SAS table information to a CSV file:
.\ReadSasDataTables.ps1 w:\ | Export-CSV -NoType -Path c:\report\tables.csv
"FileName","Path","FileTime","FileSize","TableName","Label","Created","Modified","LogicalRecords","PhysicalRecords","RecordLength","Compressed","Indexed","Type","Encoding","WindowsCodepage"
"users.sas7bdat","w:\","5/8/2012 9:37:03 AM","466944","users","","5/8/2012 9:37:03 AM","5/8/2012 9:37:03 AM","412","412","952","NO","False","","20","65001"
"bloglist.sas7bdat","w:\","5/8/2012 9:37:03 AM","73728","bloglist","","5/8/2012 9:37:03 AM","5/8/2012 9:37:03 AM","28","28","1360","NO","False","","20","65001"
"posts.sas7bdat","w:\","5/8/2012 9:37:09 AM","103555072","posts","","5/8/2012 9:37:07 AM","5/8/2012 9:37:07 AM","41077","41077","2496","NO","False","","20","65001"
"postviews.sas7bdat","w:\","5/8/2012 9:37:09 AM","5120000","postviews","","5/8/2012 9:37:09 AM","5/8/2012 9:37:09 AM","4808","4808","1040","NO","False","","20","65001"
"comments.sas7bdat","w:\","5/8/2012 9:37:12 AM","26943488","comments","","5/8/2012 9:37:11 AM","5/8/2012 9:37:11 AM","7807","7807","3432","NO","False","","20","65001"
"published_posts.sas7bdat","w:\","5/8/2012 9:37:13 AM","11739136","published_posts","","5/8/2012 9:37:13 AM","5/8/2012 9:37:13 AM","4628","4628","2512","NO","False","","20","65001"
"blogsocial.sas7bdat","w:\","5/6/2012 1:19:06 PM","401408","blogsocial","","5/6/2012 1:19:06 PM","5/6/2012 1:19:06 PM","682","682","544","NO","False","","20","65001"
.\ReadSasDataColumns.ps1 w:\ | Out-GridView
To export the SAS columns information to a CSV file:
.\ReadSasDataColumns.ps1 w:\ | Export-CSV -NoType -Path c:\report\columns.csv
"File name","Column","Label","Pos","Type","Length","Format","Informat","Indexed","Path","File time","File size"
"users.sas7bdat","ID","ID","1","NUM","0","","","False","w:\","5/8/2012 9:37:03 AM","466944"
"users.sas7bdat","user_login","user_login","2","130","90","$180.","$180.","False","w:\","5/8/2012 9:37:03 AM","466944"
"users.sas7bdat","user_registered","user_registered","3","NUM","0","DATETIME19.","DATETIME19.","False","w:\","5/8/2012 9:37:03 AM","466944"
"users.sas7bdat","display_name","display_name","4","130","375","$750.","$750.","False","w:\","5/8/2012 9:37:03 AM","466944"
"bloglist.sas7bdat","blog_id","blog_id","1","NUM","0","","","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","name","option_value","2","130","512","$1024.","$1024.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","path","path","3","130","150","$300.","$300.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","registered","registered","4","NUM","0","DATETIME19.","DATETIME19.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","last_updated","last_updated","5","NUM","0","DATETIME19.","DATETIME19.","False","w:\","5/8/2012 9:37:03 AM","73728"
"bloglist.sas7bdat","public","public","6","NUM","0","","","False","w:\","5/8/2012 9:37:03 AM","73728"
So, when you go to the game, do you buy a hot dog, a beer and a banner before the first quarter? Do you buy them all from the same vendor? Do you go back during the half? Does the score impact how much money you spend on concessions? All of these questions and more are being considered by the Orlando Magic as data points in the customer experience.
Watch this great Inside SAS Global Forum interview with Anna Brown and Anthony Perez, Director of Business Strategy with the Orlando Magic.
Read this fantastic SAS Global Forum 2012 paper about creating the customer experience by Toshi Tsuboi. Tsuboi writes, "If you are involved in a consumer-oriented business, you probably get a sense that consumers are never satisfied....while you strive to provide them with what they want, if you mess up just once, they decide to tell everyone around the globe about it using Twitter and Facebook.
This change in customer expectations was foreseen back in 1999 by B. Joseph Pine II and James H. Gilmore in a book titled The Experience Economy. They described a new economy they called the experience economy, in which experience is the new currency. They argue that businesses must orchestrate memorable events for their customers, and that memory, or "experience," becomes the product. If a business is more advanced in providing experiences, that business can begin charging for the value of the transformation that an experience offers."
Intrigued? Read the paper.
When I was at SAS Global Forum last week, a SAS user asked my advice regarding a SAS/IML program that he wrote. One step of the program was taking too long to run and he wondered if I could suggest a way to speed it up. The long-running step was a function that finds the largest eigenvalue (and associated eigenvector) for a matrix that has thousands of rows and columns. He was using the EIGEN subroutine, which computes all eigenvalues and eigenvectors—even though he was only interested in the eigenvalue with the largest magnitude.
I asked some questions about his matrix and discovered that it had some important properties:
I told him that the power iteration method is an algorithm that can quickly compute the largest eigenvalue (in absolute value) and associated eigenvector for any matrix, provided that the largest eigenvalue is real and distinct. Distinct eigenvalues are a generic property of the spectrum of a symmetric matrix, so, almost surely, the eigenvalues of his matrix are both real and distinct.
The power iteration method requires that you repeatedly multiply a candidate eigenvector, v, by the matrix and then renormalize the image to have unit norm. If you repeat this process many times, the iterates approach the largest eigendirection for almost every choice of the vector v. You can use that fact to find the eigenvalue and eigenvector.
The power method produces the eigenvalue of the largest magnitude (called the dominant eigenvalue) and its associated eigenvector provided that
It is easy to write a SAS/IML module that implements the power iteration method for a matrix whose dominant eigenvalue is positive. You can generate a random vector to serve as an initial value for v, or you can use a fixed vector such as a vector of ones. In either case, you form the image A v, normalize that value, and repeat until convergence. This is implemented in the following function:
proc iml;
/* If the power method converges, the function returns the largest eigenvalue.
   The associated eigenvector is returned in the first argument, v.
   If the power method does not converge, the function returns a missing value.
   The arguments are:
   v        Upon input, contains an initial guess for the eigenvector.
            Upon return it contains an approximation to the eigenvector.
   A        The matrix whose largest eigenvalue is desired.
   maxIters The maximum number of iterations.
   This implementation assumes that the largest eigenvalue is positive.
*/
start PowerMethod(v, A, maxIters);
   /* specify relative tolerance used for convergence */
   tolerance = 1e-6;
   v = v / sqrt( v[##] );            /* normalize */
   iteration = 0; lambdaOld = 0;
   do while ( iteration <= maxIters);
      z = A*v;                       /* transform */
      v = z / sqrt( z[##] );         /* normalize */
      lambda = v` * z;
      iteration = iteration + 1;
      if abs((lambda - lambdaOld)/lambda) < tolerance then
         return ( lambda );
      lambdaOld = lambda;
   end;
   return ( . );                     /* no convergence */
finish;
/* test on small example */
A = {-261 209  -49,
     -530 422  -98,
     -800 631 -144 };
v = {1,2,3};                         /* guess */
lambda = PowerMethod(v, A, 40 );
if lambda^=. then do;
   /* check that result is correct */
   z = (A - lambda*I(nrow(A))) * v;  /* test if v is eigenvector for lambda */
   normZ = sqrt( z[##] );            /* || z || should be ~ 0 */
   print lambda normZ;
end;
else print "Power method did not converge";
Finding the complete set of eigenvalues and eigenvectors for a dense symmetric matrix is computationally expensive. Multiplying a matrix and a vector is, in comparison, a trivial computation. I know that the power method will be much, much faster than computing the full eigenstructure, but I'd like to know how much faster. Let's say I have a moderate-sized symmetric matrix. About how much faster is the power method than computing all eigenvectors? To be definite, I'll compare the times for symmetric matrices that have up to 2,500 rows and columns.
I previously have blogged about how to compare the performance of algorithms for solving linear systems. I will use the same technique to compare the performance of the PowerMethod function and the EIGEN subroutine. The following loop constructs a random symmetric matrix for a range of matrix sizes. For each matrix, the program times how long it takes the PowerMethod and EIGEN routines to run:
/***********************************/
/* large random symmetric matrices */
/***********************************/
sizes = do(500, 2500, 250);          /* 500, 750, ..., 2500 */
results=j(ncol(sizes), 3);           /* allocate room for results */
call randseed(12345);
do i = 1 to ncol(sizes);
   n = sizes[i];
   results[i,1] = n;                 /* save size of matrix */
   r = j(n*(n+1)/2, 1);
   call randgen(r, "uniform");
   r = sqrvech(r);                   /* make symmetric */
   q = j(n,1,1);
   t0=time();
   lambda = PowerMethod(q, r, 1000 );
   results[i,2] = time()-t0;         /* time for power method */
   t0=time();
   call eigen(evals, evects, r);
   results[i,3] = time()-t0;         /* time for all eigenvals */
end;
labl = {"Size" "PowerT" "EigenT"};
print results[c=labl];
The results are pretty spectacular. The power method algorithm is virtually instantaneous, even for large matrices. In comparison, the time required by the EIGEN computation grows polynomially with the size of the matrix. You can graph the timing by writing the times to a data set and using the SGPLOT procedure:
create eigen from results[c=labl];
append from results;
close;

proc sgplot data=eigen;
   series x=Size y=EigenT / legendlabel="All Eigenvalues";
   series x=Size y=PowerT / legendlabel="Largest Eigenvalue";
   yaxis grid label="Time to Compute Eigenvalues";
   xaxis grid label="Size of Matrix";
run;
In the interest of full disclosure, the power method converges at a rate that is determined by the ratio of the two largest eigenvalues (in magnitude), so it might take a while to converge if you are unlucky and that ratio is close to 1. However, for large matrices the power method should still be much, much faster than using the EIGEN routine to compute all eigenvalues. The conclusion is clear: The power method wins this race, hands down!
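To get a feel for how the eigenvalue ratio affects convergence, here is a small sketch in plain Python (not SAS/IML) that counts iterations for two diagonal matrices. The helper name power_iterations and the two test matrices are made up for this illustration; the stopping rule imitates the relative tolerance used in the PowerMethod module above.

```python
import math


def power_iterations(A, v0, tol=1e-6, max_iters=100000):
    """Return (iterations used, eigenvalue estimate) for the power method."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    v = [x / norm0 for x in v0]          # start from a unit vector
    lam_old = 0.0
    lam = 0.0
    for k in range(1, max_iters + 1):
        z = [sum(a * x for a, x in zip(row, v)) for row in A]  # z = A*v
        lam = math.sqrt(sum(x * x for x in z))                 # ||A*v||
        v = [x / lam for x in z]                               # renormalize
        if abs((lam - lam_old) / lam) < tol:
            return k, lam
        lam_old = lam
    return max_iters, lam


well_separated = [[1.0, 0.0], [0.0, 0.5]]    # eigenvalue ratio 0.5
nearly_equal = [[1.0, 0.0], [0.0, 0.99]]     # eigenvalue ratio 0.99
for name, A in [("ratio 0.5", well_separated), ("ratio 0.99", nearly_equal)]:
    iters, lam = power_iterations(A, [1.0, 1.0])
    print(name, "iterations:", iters, "estimate:", round(lam, 6))
```

Both runs approach the dominant eigenvalue 1, but the matrix whose two eigenvalues are nearly equal needs many more iterations, which is the "unlucky" case.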
If the dominant eigenvalue is negative and v is its eigenvector, v and A*v point in opposite directions. In this case, the PowerMethod function needs a slight modification to return the dominant eigenvalue with the correct sign. There are two ways to do this. The simplest is to compute z = A*(A*v) until the algorithm converges, and then compute the eigenvalue for A in a separate step. An alternative approach is to modify the normalization of v so that it always lies in a particular half-space. This can be accomplished by choosing the direction (v or -v) that has the greatest correlation with the initial guess for v. I leave this modification to the interested reader.
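For readers who want a starting point, here is a minimal sketch of the first approach (iterate with A*(A*v), then recover the sign in a separate step), written in plain Python rather than SAS/IML. The function names are hypothetical, and the final step assumes that the product v'(A*v) for the unit vector v is an acceptable estimate of the signed eigenvalue. The test matrix is the negative of the small example above, so its dominant eigenvalue is negative.

```python
import math


def matvec(A, v):
    """Multiply the matrix A (a list of rows) by the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def signed_power_method(A, v0, max_iters=1000, tol=1e-6):
    """Return (eigenvalue, eigenvector), or (None, v) if there is no convergence."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    v = [x / norm0 for x in v0]
    mu_old = 0.0
    for _ in range(max_iters):
        z = matvec(A, matvec(A, v))              # z = A*(A*v)
        mu = math.sqrt(sum(x * x for x in z))    # estimates lambda squared
        if mu == 0.0:
            return None, v                       # v was mapped to zero
        v = [x / mu for x in z]                  # renormalize
        if abs((mu - mu_old) / mu) < tol:
            Av = matvec(A, v)
            lam = sum(x * y for x, y in zip(v, Av))  # signed estimate
            return lam, v
        mu_old = mu
    return None, v


# Negative of the small test matrix used earlier.
B = [[261.0, -209.0, 49.0],
     [530.0, -422.0, 98.0],
     [800.0, -631.0, 144.0]]
lam, vec = signed_power_method(B, [1.0, 2.0, 3.0])
print("dominant eigenvalue estimate:", lam)
```

Iterating with A twice means the sign of v never flips from one step to the next, which is why the convergence test can stay the same as before. The price is two matrix-vector products per iteration.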
And to my mathematical friends: did you notice that I used random symmetric matrices when timing the algorithm? Experimentation shows that the dominant eigenvalue for these matrices is always positive. Can anyone point me to a proof that this is always true? Furthermore, the dominant eigenvalue is approximately equal to n/2, where n is the size of the matrix. What is the expected value and distribution of the eigenvalues for these matrices?
There is a huge inequality in the way the economy has played out among people I know.
Business is good and has been getting better all year for me. My friend Jake is doing great. He was an anesthesiologist and several years ago became board-certified in geriatrics. He loves working with older people and his patients love him.
When I travel around the country, though, or catch up with old friends, I find that is not true everywhere. Others have been unemployed either continuously, or on and off, for a period over the 99 weeks of unemployment.
“People I know” is hardly a random sample, despite what your average sophomore seems to think, but I still thought it would be interesting to look at the people I’ve known for a decade or more and see where our paths diverged. Because these were all people I have known for 10, 20, or 30 years, we all were at the same place at one point. So, what happened?
I’ll tell you what DIDN’T happen. No one who I know that is long-term unemployed or under-employed is lazy. These are people who have worked construction, cleaned houses, loaded trucks, worked in factories and put in twelve-hour days as middle managers. Also, as you can guess by that list, they also are people who don’t consider manual labor “beneath them”.
None of these people are stupid. Some speak two languages. Some have two or three years of college.
Here are three things that did happen, though. One is that they just got old and for those who had spent a lifetime doing physical labor, they just could not do it any more. Their knees, shoulders, hips, hands, back – you name it – gave out and they were unable to do physical work. These folks didn’t have the skills to do a desk job. Even taking a six-month training course on how to use a computer, word-processing, spreadsheets and social media only got them up to where my eighth-grade daughter is already. There aren’t a lot of positions for people with eighth-grade level computer skills.
A second thing that happened was they settled down. They had families. They married. They bought houses. When they lost their jobs, they had a husband or wife who still had a job. They had a mortgage to pay. When the factory closed, they couldn’t just leave town and let the husband/wife watch the kids, work and pay the bills while they went somewhere else and got another apartment and a new job.
Here is the big, big difference between the two groups of people, though- the people who are unemployed quit their education. They got comfortable as a COBOL programmer, teamster, regional manager. When that job was gone, it turned out there was not a real demand for the person who knew more about the blueprint archive at General Dynamics than anyone else in the world.
Why I Won’t Be Unemployed Five Years from Now
Lately, I have been learning javascript/ jQuery. I put them together because I didn’t get very far in the project with javascript before jQuery seemed like a really good addition. I hadn’t done much with IDEs (integrated development environment), mostly using textwrangler up to this point or just the SAS editor. I tried a couple of others before settling on Webstorm, which I like A LOT. Now using an IDE is kind of the programming equivalent to learning Excel, that is not a hugely marketable anything, but more of an assumption. (I confess I still use Textwrangler for quick stuff.)
Even though my last blog was on how you shouldn’t be writing things from scratch, I opened up a new directory in Webstorm, created a new HTML page and wrote a slideshow from scratch right from the CSS to </body> . Just because. I feel like my progress in javascript is slow as mud, but at least now I can write some things. I’ve also written a couple of basic games. When I read things like the jQuery chapter in Flanagan’s book JavaScript: The Definitive Guide, I feel like I only know the tiniest portion, but I was looking at someone else’s code today and while there was no way in hell I could have written it in less than a month, I did understand almost everything that was going on, so that’s progress, too.
I needed to capture some audio so I downloaded audio hijack and invested 10 minutes in learning to use that. I also needed a voice over so I fiddled with Garageband for half an hour. I’d used that before but not lately. Every time I use it, it takes me less time to remember, “Oh yeah, that’s how you do that again.”
I needed to output some mp3 files as ogg files. I don’t even remember why I had audacity in my applications folder. I don’t *think* it came with my new computer. To export as an mp3 file I needed lame, which I also needed to download.
JavaScript is definitely a marketable skill. Being able to mess around with sound files, perhaps not as much so, but it may eventually be another “given”, like an expectation that you can use a word processor.
Most of my career has been spent processing structured data, specifically doing what is now sneeringly called “frequentist” statistics. Looking to update my syllabus for next year, I’ve been looking into data mining, both with SAS Enterprise Miner and Statistica. I was able to download the trial version of Statistica and the On-Demand version of SAS Enterprise Miner.
So …. this is what I have been up to in the last month. Some of those things will not pan out. At one point, I was pretty good with Tel-a-Graf (graphic design software for plotters – yes, plotters), Foresight – another programming language which I don’t think is around any more. I used Lotus Notes and learned SAS FSP (for “Full Screen Product”). I’ve used a VAX, IBM, DEC, Franklin Ace, Lisa and Next computer.
I think I have identified the dividing line between those whose careers stayed on an upward trend all of these years and those whose careers did not.
It’s apples.
Or, to be specific, it’s the idea of Johnny Appleseed. The way I see it, each new thing I learned is like scattering some appleseeds. Most of them will probably get eaten by birds, fall on rocks or be bought by Microsoft and killed off. If you toss enough seeds around, though, some of them will bear fruit and twenty years later, you’ll have people lined up to pay you for your knowledge of apples.
Stanley Fogleman says that SAS can be hard to learn on your own - not because it is a difficult language - but because of the various business requirements. In fact, even college students entering the workforce are often ill-prepared in some ways. That's why Fogleman believes that a SAS mentoring program can be so effective.
Fogleman says that when he first began learning SAS, he had only six months of mentoring - that was the length of a SAS consultant's contract that he was working with. During that time, he could ask any question that he wanted. After that, he was on his own to learn and figure out the courses to take.
Since that time, Fogleman has refined a mentoring program for junior programmers. He believes the plan should span one to two years, have executive buy-in and include SAS users group conferences.
"Creating a structured learning environment should be the goal of a mentor," said Fogleman.
Here are some of his tips for success:
"I wish more managers knew about the value that local, regional and national SAS users group conferences have," said Fogleman. "I can say very confidently that most of the SAS code that I've learned has been at conferences."
What not to do:
"It's about guidance. Structured learning is more efficient," said Fogleman. "There are many different ways to solve programming problems, but there are also many blind alleys. A SAS mentor can help programmers avoid the blind alleys."
Read Fogleman's paper for more advice on What is a SAS Mentor? If you are interested in becoming a mentor, I'd suggest you contact Fogleman. In his presentation, he included a slide showing how to structure the learning process and accomplishments.
SAS already has some cool mobile Business Intelligence apps. Now, Scott McQuiggan tells Anna Brown, in this Inside SAS Global Forum interview, that you can view the really cool high-performance analytics reports that you've created on your desktop - right from your mobile device. Check this out!!
Interestingly, traffic analysis of my tiny blog shows that the most searched keyword is PROC SQL. The likely reason: nowadays everybody knows SQL, more or less, so some parts of a SAS job can be done in PROC SQL without any other procedure or a DATA step.
/* Copy SASHELP.CLASS and add a row number for the order-dependent queries */
data class;
   set sashelp.class;
   obs = _n_;
run;

/* Median of WEIGHT through a self-join */
proc sql;
   select avg(weight)
   from (select e.weight
         from class e, class d
         group by e.weight
         having sum(case when e.weight = d.weight then 1 else 0 end)
                ge abs(sum(sign(e.weight - d.weight))));
quit;

/* Horizontal histogram of AGE */
proc sql;
   select age, '|',
          repeat('*',count(*)*4) as frequency
   from class
   group by age
   order by age;
quit;

/* Running total of WEIGHT in row order */
proc sql;
   select name, weight,
          (select sum(a.weight) from class as a where a.obs <= b.obs) as running_total
   from class as b;
quit;

/* Append a grand total row */
proc sql;
   select name, weight
   from class
   union all
   select 'Total', sum(weight)
   from class;
quit;

/* List the columns of WORK.CLASS from the dictionary view */
proc sql;
   select name, type, varnum
   from sashelp.vcolumn
   where libname = 'WORK' and memname = 'CLASS';
quit;

/* Dense rank of WEIGHT */
proc sql;
   select name, a.weight, (select count(distinct b.weight)
                           from class b
                           where b.weight <= a.weight) as rank
   from class a;
quit;

/* Random sample of 8 rows */
proc sql outobs = 8;
   select *
   from class
   order by ranuni(1234);
quit;

/* Create an empty table with the same structure */
proc sql;
   create table class2 like class;
quit;

/* List female and male names side by side */
proc sql;
   select max(case when sex='F'
              then name else ' ' end) as Female,
          max(case when sex='M'
              then name else ' ' end) as Male
   from (select e.sex,
                e.name,
                (select count(*) from class d
                 where e.sex=d.sex and e.obs < d.obs) as level
         from class e)
   group by level;
quit;

/* Count all rows, missing values and nonmissing values of WEIGHT */
proc sql;
   select count(*), nmiss(weight), n(weight)
   from class;
quit;
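For readers without a SAS session at hand, two of the correlated-subquery patterns above (the running total and the dense rank) can be tried in any SQL engine. The following is a minimal sketch using Python's built-in sqlite3 module. The four rows are illustrative values typed in by hand, not read from sashelp.class, and the table and column names simply mirror the queries above.

```python
import sqlite3

# Illustrative rows only (an assumption): (obs, name, weight)
rows = [
    (1, "Alfred", 112.5),
    (2, "Alice", 84.0),
    (3, "Barbara", 98.0),
    (4, "Carol", 102.5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (obs INTEGER, name TEXT, weight REAL)")
conn.executemany("INSERT INTO class VALUES (?, ?, ?)", rows)

# Running total: for each row, sum the weights of all rows up to and including it
running = conn.execute(
    """
    SELECT b.name,
           (SELECT SUM(a.weight) FROM class AS a WHERE a.obs <= b.obs) AS running_total
    FROM class AS b
    ORDER BY b.obs
    """
).fetchall()

# Dense rank: count the distinct weights that are less than or equal to this one
ranks = conn.execute(
    """
    SELECT a.name,
           (SELECT COUNT(DISTINCT b.weight) FROM class AS b
            WHERE b.weight <= a.weight) AS rnk
    FROM class AS a
    ORDER BY a.obs
    """
).fetchall()

conn.close()

print(running)
print(ranks)
```

Both queries rescan the table once per row, so they are convenient for small tables but scale quadratically.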
As a follow-up to my earlier post on taking advantage of OLAP member properties, you can also display OLAP member properties through the Add-in for Microsoft Office as well as SAS Enterprise Guide. I’m a huge fan of Enterprise Guide, so it’s nice to have that ability but even nicer when the more common information consumer can display member properties through a Pivot Table in Excel.
Simply right-click the level and select Show Properties in Report. From there you can check the properties you have defined.
I like the way the Pivot Table displays member properties as each column in Excel. This information gives the user a little more insight into the data.
To display member properties through SAS Enterprise Guide, the approach is similar: right-click the level and select Show Member Property. From there you can check the member properties you want to display.
Check out my earlier post to see how these same OLAP member properties are defined and displayed in SAS Web Report Studio.

We've been more sensitive to accounting for multiple comparisons recently, in part due to work that Nick and colleagues published on the topic.
In this entry, we consider results from a randomized trial (Kypri et al., 2009) to reduce problem drinking in Australian university students.
Seven outcomes were pre-specified: three designated as primary and four as secondary. No adjustment for multiple comparisons was undertaken. The seven p-values were reported as 0.001, 0.001, 0.001, 0.02, 0.22, 0.59 and 0.87.
In this entry, we detail how to adjust for multiplicity using R and SAS.
R
The p.adjust() function in R implements a variety of multiplicity adjustments for a vector of p-values. These include the Bonferroni procedure (where the alpha is divided by the number of tests or, equivalently, the p-value is multiplied by that number and truncated back to 1 if the result is not a probability). Other, less conservative corrections are also included (these are Holm (1979), Hochberg (1988), Hommel (1988), Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001)). The Bonferroni, Holm, Hochberg and Hommel methods are designed to control the family-wise error rate, and the latter three all dominate the Bonferroni procedure; the Benjamini and Hochberg and Benjamini and Yekutieli methods control the false discovery rate instead. Here we compare the results from the unadjusted, Benjamini and Hochberg (method="BH") and Bonferroni procedures for the Kypri et al. study.
pvals = c(.001, .001, .001, .02, .22, .59, .87)
BONF = p.adjust(pvals, "bonferroni")
BH = p.adjust(pvals, "BH")
res = cbind(pvals, BH=round(BH, 3), BONF=round(BONF, 3))
res

     pvals    BH  BONF
[1,] 0.001 0.002 0.007
[2,] 0.001 0.002 0.007
[3,] 0.001 0.002 0.007
[4,] 0.020 0.035 0.140
[5,] 0.220 0.308 1.000
[6,] 0.590 0.688 1.000
[7,] 0.870 0.870 1.000

matplot(res, ylab="p-values", xlab="sorted outcomes")
abline(h=0.05, lty=2)
matlines(res)
legend(1, .9, legend=c("Bonferroni", "Benjamini-Hochberg", "Unadjusted"),
       col=c(3, 2, 1), lty=c(3, 2, 1), cex=0.7)
SAS
data a;
   input Test$ Raw_P @@;
   datalines;
test01 0.001 test02 0.001 test03 0.001
test04 0.02 test05 0.22 test06 0.59
test07 0.87
;
proc multtest inpvalues=a bon fdr plots=adjusted(unpack);
run;
                                  False
                                  Discovery
Test    Raw       Bonferroni      Rate
1       0.0010    0.0070          0.0023
2       0.0010    0.0070          0.0023
3       0.0010    0.0070          0.0023
4       0.0200    0.1400          0.0350
5       0.2200    1.0000          0.3080
6       0.5900    1.0000          0.6883
7       0.8700    1.0000          0.8700
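To make the arithmetic behind the two adjustments explicit, here is a minimal Python sketch of the Bonferroni and Benjamini-Hochberg calculations. It is not part of the R or SAS code above; the function names `bonferroni` and `benjamini_hochberg` are made up for this illustration, and the p-values are the seven used in this entry.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals):
    """Step-up FDR adjustment: p_(i) * n / i, made monotone from the largest p-value down."""
    n = len(pvals)
    # Visit the p-values from largest to smallest
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.001, 0.001, 0.02, 0.22, 0.59, 0.87]
bonf = [round(p, 3) for p in bonferroni(pvals)]
bh = [round(p, 3) for p in benjamini_hochberg(pvals)]

for raw, b, f in zip(pvals, bonf, bh):
    print(raw, b, f)
```

Rounded to three decimals, the values agree with the BONF and BH columns from R and with the Bonferroni and False Discovery Rate columns from PROC MULTTEST.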

In statistical programming, I often test a program by running it on a problem for which I know the correct answer. I often use a single expression to compute the maximum value of the absolute difference between the computed vector and the correct one:
maxDiff = max( abs( z-correct ) ); /* largest absolute difference */
In this expression, z is the vector that I have computed and correct is the correct answer.
Let's break this expression down into pieces:
For example, last week I showed how you can use the DIF function to compute simple finite-difference approximations to derivatives. In that article, I computed an approximate derivative to the sine function and compared it to the true derivative, as follows:
proc iml;
h = 0.1;
x = T( do(0, 6.28, h) );               /* x in [0, 2 pi] */
y = sin(x);
approx = dif(y, 1) / h;                /* f'(x) ~ (f(x)-f(x-h))/h */
correct = cos(x);                      /* true derivatives at x */
maxBDiff = max(abs(approx - correct)); /* find maximum difference */
print maxBDiff;
The output tells me that the approximate derivative and the true value differ by about 0.05 for some value of the x vector.
It is interesting to note that you can use the exact same expression if correct is a scalar value. In this case, you are computing the maximum absolute deviation between a vector of values and a target value.
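The same check can be sketched outside of SAS/IML. The following plain-Python version is an illustration only: the variable names are my own, and the first grid point is skipped because the backward difference is undefined there (the DIF function returns a missing value for it, which the MAX function ignores). The last two lines show the scalar-target case.

```python
import math

h = 0.1
n_points = int(6.28 / h) + 1                # grid 0, 0.1, ..., 6.2
x = [i * h for i in range(n_points)]
y = [math.sin(v) for v in x]

# Backward difference f'(x) ~ (f(x) - f(x-h)) / h, defined from the second point on
approx = [(y[i] - y[i - 1]) / h for i in range(1, n_points)]
correct = [math.cos(v) for v in x[1:]]      # true derivative at the same points

# Largest absolute difference between the computed and the correct vector
max_diff = max(abs(a - c) for a, c in zip(approx, correct))
print(round(max_diff, 3))

# The same expression with a scalar target: maximum absolute deviation
z = [1.0, 2.5, 4.0]
target = 2.0
max_dev = max(abs(v - target) for v in z)
print(max_dev)
```

Python lists do not broadcast a scalar the way SAS/IML vectors do, which is why the scalar case is written as a generator expression over `z`.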
In statistics, a difference between two values is called a deviation, especially when one expression is an estimate and another is an expected value. In the language of statistics, the expression in the previous section is similar to the maximum absolute deviation. There are other statistical concepts that you can use to measure the difference between two vectors, or between a vector and a target value:
There are other measures that you can use (such as relative quantities), but these are some common ways to compute a measure of how much one vector of values differs from another.
So I say,
“There may be some useful information in the text fields. For example, people who use credit to buy commodities, such as meat may differ from people who buy finished goods like clothing who may differ from people who purchase machinery on credit. Perhaps you may want to consider some kind of clustering.”
The very bright young people nod and one says quite brightly to another,
“Well, you better get out your grep statements.”
All of the rest nod in pleased agreement, while I say to my old and faded self,
Grep? That’s what you’ve got? Seriously? What. The. Fuck.
Okay, ten points for the bright young people for knowing any Unix commands or even that Unix exists, which puts them ahead of a lot of people. However, please explain to me why in the name of God you would not even consider using something like Statistica or SAS Enterprise Miner? There is an R text mining package. I have never used it because I don’t use R (long discussion of that here) but these young people had spent three semesters learning to program in R and did not even know it existed.
At one point I was playing around with both Ruby and SAS to write a program to parse text. Do you have any idea how much of a pain in the ass that is? In that case, because it was on a set of data with a VERY limited scope, we could do it by using just a few hundred words. It was a small project with a very small budget and at the time I was wanting a project that gave me an excuse to learn Ruby.
For a more general project with the whole English language as its scope, that would be an insane undertaking. It would cost the client several times the cost of buying a SAS license or Statistica (not sure about the SPSS offering) – and what I could write would not be within shouting distance of as well done and comprehensive as something a team of people had worked on for years.
The most recent client who asked me this actually has a SAS Enterprise Miner license at their organization! (So, yes, while the license fee is humongous, since it had already been paid, the additional cost to use it on this project would be zero.)
While this is the latest, most outrageous example, the “Do-It-Yourself” fallacy happens all the time. Recently, I needed to do a slideshow. I thought I would write it using JavaScript/jQuery because that is something else I have been wanting to learn better and the Codecademy thing just didn’t do it for me. I wanted an actual project.
I started on it and after about ten minutes of reflection realized there were probably dozens of jQuery plug-ins that did slide shows of every size, shape and form. Sure enough, 5 seconds on Google gave me a couple to download.
I downloaded one, modified it a bit and it was okay, though I’m not sure it is exactly what I want either. When I looked at the code in detail, it was evident the author had done the same as me, downloaded someone else’s code and modified it, because there were entire directories in there that did nothing. So, I deleted those.
After playing with that for a while, I thought perhaps there were other, better, slideshow plug-ins available. I downloaded another one because, even though I knew it probably wouldn’t suit my purpose, it was written so much more succinctly, I found it interesting.
So …. two lessons
I remember when blogging was cool.
Before the specializing and monetizing and Twitter-izing.
Well, I think blogging is still cool (and awesome and awesome …). The most appealing reason for me is that blog posts are searchable on Google and suitable for archiving, while tweets are not. Admittedly, I hold to some sort of Existentialism 2.0:
if it can’t be found on Google, it doesn’t exist!
Last month I wrote a post on how to keep pace with CDISC through its official channels, and it feels cool to add an appendix of sources from the awesome blogosphere. Fortunately or not, CDISC is still a niche topic, so it takes little effort to compile the list (let me know if I missed anyone! If you are a Google Reader user, simply import this file, my Google Reader subscription on CDISC):
1. Blog @ Assero by Dave Iberson-Hurst (“Dave IH”)
Insightful and full of humor. I retweeted all of its latest posts, and you can get a feel for it from these titles (YES, they are about CDISC):
What I Want, What I Really Really Want
Churchill, the FDA and a Fall
Mad March and the FDA
Btw, I write my blog in a casual way, while IH’s writing is very impressive and reminds me of George Orwell’s style.
2. d-Wise Technologies Blog
It is my employer’s official blog site, where Chris Decker is the key contributor on CDISC. You can check out his latest posts on the FDA/PhUSE Annual Computational Science Symposium, where he served as committee lead:
Overcoming Industry Challenges: A Shift to Collaboration
Validation and Quality: Are They the Same?
I will also commit to contributing to this blog as my understanding of clinical standards grows. As the saying goes:
look to the master,
follow the master,
walk with the master,
see through the master,
become the master.
3. XML4Pharma Blog
Industry news and hard-core (but cool) writing on XML (CDISC ODM, define.xml).
4. eClinical Trends by Clinovo
Clinovo jumped into this topic by launching CDISC Express, a CDISC SDTM converter.
5. eClinicalOpinion
This blog is mostly focused on EDC, the clinical data management part. I like its series of discussions on CDISC ODM.
6. eCTD Regulatory Submissions Network
This is a personal blog by Shakul Hameed. I read it mostly to get some information on submission requirements from European regulators.
7. HL7 Watch
While it is not directly related to CDISC (nor is #6), it’s nice to hear some voices on HL7, which could be the future of CDISC.
8. From a Logical Point of View-CDISC
Yes, this one: my 2 cents. I will keep recording my personal immersion in and understanding of CDISC and related clinical standards. (What a privilege to cross-reference oneself in one’s own blog! Keep awesome, keep blogging.)
9. Linked Data and URI:s for Enterprises
Look at the colon (:) in the title of this blog and you’re right: this blog plays (at least) with XML. I find it a good resource (thanks @kerfors for referencing!) for learning ODM, the foundation of CDISC. The latest post is
Semantic models for CDISC based standard and metadata management
P.S.: Blogger Chris Hemedinger maintains a nice list of SAS bloggers (blogs by SAS employees, and blogs by SAS customers, consultants, and the analytics community).
It is becoming more and more apparent that social media is a gold mine of unstructured data that is just waiting to be analysed so that the nuggets can be extracted. At SAS Global Forum, I was particularly impressed with the diversified use of sentiment analysis and the exploration that has been conducted into the field of social media. I attended a number of great presentations and an extremely interesting Super Demo on the analysis of consumers’ moods during Super Bowl commercials.
The Super Demo detailed how to use mood statements alongside sentiment analysis to measure in more detail the emotion displayed by people - more than would be possible with sentiment analysis alone. For example, the underlying purpose of advertising is to generate a reaction, hopefully positive, to a particular product or service. The key, therefore, is to understand this reaction through the use of social media to determine the best marketing strategies to implement.
Text analytics can be used here to derive the emotions people are displaying through the words and phrases they use on social networking sites such as Twitter and Facebook. From this data, sentiment and intensity (defined here as the “passion” component) can be derived to determine which commercials hit the mark with their targeted audience. Read this blog post by Richard Foley about analyzing sentiment for more information about the Super Bowl research.
Another thought-provoking presentation on a novel implementation of sentiment analysis and forecasting was given on the topic of predicting electoral outcomes. The purpose of this presentation and paper was to try to predict the outcomes of popular elections through social media when polling data is not necessarily available. It also demonstrated the ability to validate election outcomes and check for potential instances of fraudulent election administration.
What was interesting (maybe more than the demonstration on popular elections) was the demonstration of this same methodology on the popular television show American Idol!
The four-step methodology given to achieve this through the extraction, validation, analysis, and prediction of outcomes from the relevant social media data was:
This process allows researchers to surface the general opinions of the social sphere at differing time points to determine a view of sentiment before and after a particular event, for example an eviction from the show.
Not only is sentiment analysis crucial for this exploration, but there are also forecasting applications to determine future events given the textual information that has been determined from the sentiment analysis. Check out Jenn Sykes’ full paper, Predicting Electoral Outcomes with SAS ® Sentiment Analysis and SAS ® Forecast Studio. Also take a minute to watch her in this short Inside SAS Global Forum interview.
With regards to the application of sentiment analysis in other sectors, I can see that there is certainly potential here in the financial sector, where there is a great need for information on sentiment from customers, not only for marketing-related activities, but also customer retention and acquisition.
This year’s conference was a fantastic display of what to look forward to in the world of analytics, and the next SAS Global Forum, in San Francisco from April 28th through May 1st, is already in the diary!
This week's SAS author's tip comes from Carol Matthews and Brian Shilling and their book Validating Clinical Trial Data Reporting with SAS. SAS users have raved about this guide. In her review, Susan Fehrer said "Carol and Brian's book provides a good overview, practical hands-on tips, and many examples of how to perform [...]
A reader asked:
I want to create a vector as follows. Suppose there are two given vectors x=[A B C] and f=[1 2 3]. Here f indicates the frequency vector. I hope to generate a vector c=[A B B C C C]. I am trying to use the REPEAT function in the SAS/IML language, but there is always something wrong. Can you help me?
This is probably a good time to remind everyone about the SAS/IML Community (formerly known as a Discussion Forum). You can post your SAS/IML questions there 24 hours a day. That is always a better plan than making a personal appeal to me, because I receive dozens of questions like this every month, and there is no way that I can personally reply. There are lots of experienced SAS/IML experts out there, so please use the SAS/IML Community to tap into that knowledge.
That said, I think the answer to this reader's question makes an interesting example of statistical programming with SAS/IML software. It is trivial to solve this in the DATA step (see the end of this article), but how might you solve it in the SAS/IML language? If you'd like to try to solve this problem yourself, stop reading here. Spoilers ahead!
The goal is to write a function that duplicates or "expands" data that have a frequency variable. The important function to use for this task is the CUSUM function, which computes the cumulative frequencies. Let's look at a simple example and apply the CUSUM function to the frequency vector:
proc iml;
values={A,B,C,E};
freq = {2,1,3,4};
cumfreq = cusum(freq);
print values freq cumfreq;
As shown in the output, the cumfreq variable contains the indices for the expanded data. The expanded data will be a vector that contains 10 elements. The first data value (A) repeats twice (the freq value), so it repeats until element 2 (the cumfreq value) in the expanded vector. The second category fills element 3. The next category repeats 3 times, so it occupies up through element 6 in the expanded vector. The last category repeats until element 10. The following DO loop specifies each data value and the indices of the expanded vector that it should occupy:
print (values[1])[label="value"] (1:cumFreq[1])[label="Indices"];
do i = 2 to nrow(values);
   bIdx = 1 + cumFreq[i-1]; /* begin index */
   eIdx = cumFreq[i];       /* end index */
   value = values[i];
   print value (bIdx:eIdx)[label="Indices"];
end;
The output shows that we have all the information we need to allocate a vector of length 10 and fill it with the data values, where the ith value is repeated freq[i] times. The key, it turns out, is to use the CUSUM function to find the indices that correspond to each data value.
In SAS procedures that support a FREQ statement, the frequency values must be positive integers. If the frequency value is missing or is a nonpositive value, the corresponding data value is excluded from the analysis. It is easy to add that same feature to a module that takes a vector of values and a vector of frequencies and returns a vector that contains the data in expanded form. This is implemented in the following SAS/IML module, which allocates the result vector with the first data value in order to avoid handling the first element outside of the DO loop:
start expandFreq(_x, _freq);
   /* Optional: handle nonpositive and fractional frequencies */
   idx = loc(_freq > 0); /* trick: in SAS this also handles missing values */
   if ncol(idx)=0 then return (.);
   x = _x[idx];
   freq = round( _freq[idx] );
   /* all frequencies are now positive integers */
   cumfreq = cusum(freq);
   /* Initialize result with x[1] to get correct char/num type */
   N = nrow(x);
   expand = j(cumfreq[N], 1, x[1]); /* useful trick */
   do i = 2 to N;
      bIdx = 1 + cumFreq[i-1]; /* begin index */
      eIdx = cumFreq[i]; /* end index */
      expand[bIdx:eIdx] = x[i];/* you could use the REPEAT function here */
   end;
   return ( expand );
finish;

/* test the module */
values={A,B,C,D,E,F};
freq = {2,1,3,0,4,.}; /* include nonpositive and missing frequencies */
y = expandFreq(values, freq);
print values freq y;
Notice that you don't actually need to use the REPEAT function because SAS/IML is happy to assign a scalar value into a vector. The scalar is automatically repeated as often as needed in order to fill the vector.
As indicated at the beginning of this post, the DATA step solution is quite simple: merely use the OUTPUT statement in a loop, as shown in the following example:
data Orig;
input x $ Freq;
datalines;
A 2
B 1
C 3
D 0
E 4
F .
;
run;

/* expand original data by frequency variable */
data Expand;
keep x;
set Orig;
if Freq<1 then delete;
do i = 1 to int(Freq);
   output;
end;
run;

proc print data=Expand; run;
The output data set contains the same data as the y vector in the SAS/IML program.
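For comparison, the cumulative-sum indexing idea carries over to other languages. Here is a minimal Python sketch, not a translation endorsed by the post: the function name `expand_freq` is made up, missing frequencies are represented by `None`, and frequencies are assumed to be whole numbers.

```python
from itertools import accumulate


def expand_freq(values, freqs):
    """Repeat values[i] freqs[i] times, skipping missing or nonpositive frequencies."""
    kept = [(v, int(f)) for v, f in zip(values, freqs) if f is not None and f > 0]
    if not kept:
        return []
    xs = [v for v, _ in kept]
    fs = [f for _, f in kept]
    cum = list(accumulate(fs))              # cumulative frequencies, like CUSUM
    # Allocate the result with the first value, then fill the remaining blocks
    expand = [xs[0]] * cum[-1]
    for i in range(1, len(xs)):
        begin, end = cum[i - 1], cum[i]     # 0-based half-open block for xs[i]
        expand[begin:end] = [xs[i]] * fs[i]
    return expand


y = expand_freq(["A", "B", "C", "D", "E", "F"], [2, 1, 3, 0, 4, None])
print(y)
```

Because Python slices are 0-based and half-open, the begin index is simply the previous cumulative frequency rather than one more than it.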
Between attending presentations and networking, you should make your way to the SAS Bookstore at PharmaSUG. While there are many reasons to add us to your list, here are the top 5: Save money on our books and documentation. We’re offering a special conference discount to PharmaSUG attendees. Talk to a [...]
A SAS user (who lives in the US) emailed me a question about SAS functions. He was reading UTC (Coordinated Universal Time) datetime values from server logs, and to make future calculations and comparisons easier, he wanted to transform the value to local datetime. The INTNX() function worked great, but [...]
No, this is not a post about politics or life-hacking, although the same title could apply in either case. I am talking about statistical power. People often ask me what the power of a test is, but the problem is that they are asking the wrong question. Power is not a single number. I understand where the confusion can occur.
What is power & how do you get it?
There are two errors people worry about, Type I and Type II. The probability of making a Type I error is set and it is called alpha (α). Alpha is usually set at .05. It is the probability of rejecting a true null hypothesis. Now, what is a null hypothesis? It is a hypothesis of ZERO difference between the means, ZERO relationship between X and Y. A Type I error can occur in the case of ONE number, zero. If the effect is zero and you say it isn’t, you have made a Type I error.
A Type II error is accepting a false null hypothesis. The probability of a Type II error is called beta (β). A Type II error can occur in an infinite number of cases, for any number other than zero. If the effect isn’t zero and you say it is, you have made a Type II error. Power = 1 – β. Depending on what the actual value of your statistic is, the power will be different.
Look at it logically. If in an infinite population your experimental group is a million times better than your control group, then, just logically, the probability of you pulling two samples and incorrectly deciding there was no difference is very low. Similarly, if your experimental group performs .01% better than your control group, although the difference is not zero, you can logically conclude that a good percentage of the time you might conclude that there is zero difference, which is, incorrect statistically, although perhaps not for practical purposes.
Dr. Park, at Indiana University, has a very nice explanation of hypothesis testing and power analysis. He says, assume that we are testing the hypothesis that the mean is 4 when in actuality the mean is 7.
Let’s just say we are hypothesizing that people feed their office guinea pigs hay an average of four times a month. (I had to do something with the office guinea pigs to make them feel part of the team, so I put them in here.)
This variable is normally distributed with a standard deviation of 1. (Just a reminder that the standard deviation OF THE MEAN is the standard error.)
The cut off for rejecting our hypothesis is 5.96 because, computing a z score, we get (5.96 - 4) / 1 = 1.96, and at 1.96, p is not less than .05; p = .05. So, 5.96 is the highest number at which we accept the hypothesis that the mean = 4.
This hypothesis is, in fact, wrong. People really feed their office guinea pigs hay a mean of 7 times per month.
We know this because God told us so in a spare moment when he was not busy telling Republicans they needed to become candidates for president.
Given that the true mean is 7, we can compute the z-score for the cutoff:
(5.96 - 7) / 1 = -1.04
We look up 1.04 in a z table because, although we have a direct line to God, we don’t have a calculator with statistical functions, and we find that about 15% of the time we’ll get a value of 1.04 or greater. (14.92% of the time, actually, if you’re a precision freak).
So, this tells us IF we hypothesize the mean is 4 but it really and truly honest-to-God is 7, and IF the standard error is 1, then our power is .85, because only 15% of the time will the sample mean fall below the 5.96 cutoff (a z-score below -1.04) and lead us to accept the false hypothesis.
So, our power is .85, right?
Well, not so right. It is – IF the standard error is 1 and IF the “true” value is 7 and IF we were doing a z-test. But if we knew the true value, what was the point of doing any tests?
What if the true value is 6? Then z = (5.96 - 6) / 1 = -.04. The percentage of the z-distribution (which is normal) that is greater than -.04 is about 50%, so our power is around .50.
Important point number one – power depends on the true value, and you don’t know the true value

This is the first important point to keep in mind …. the power of a test is different based on what the true population value is. But you don’t really know what it is, since God is too busy worrying if people are having gay sex or eating pork to talk to you about guinea pig cuisine.
Generally what people do (if “what people do” means what I do), is enter a number of possible values into software like PROC POWER. So, I enter 6, 6.5, 6.75 and 7 and find that the values for power are .51, .70, .78 and .85. I can say that the power of the test is at least .85 if the true mean number of office guinea pig hay purchases is 7 per month or higher. That is, when the true figure is at least 7 we would reject the false null hypothesis at least 85% of the time. If it was a lot more than 7, we’d reject it a lot more than 85%.
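If you don't have PROC POWER handy, the z-test arithmetic above is easy to sketch yourself. This is a minimal Python illustration, not what PROC POWER does internally: the function name `z_test_power` is made up, it uses the same 1.96 cutoff as the example, and it only counts the upper rejection region (the lower one contributes next to nothing here).

```python
from statistics import NormalDist


def z_test_power(true_mean, null_mean=4.0, std_error=1.0, z_crit=1.96):
    """Probability that the sample mean lands above the rejection cutoff."""
    cutoff = null_mean + z_crit * std_error          # 5.96 in the example
    # Sample means are distributed around the TRUE mean, not the hypothesized one
    return 1.0 - NormalDist(mu=true_mean, sigma=std_error).cdf(cutoff)


for true_mean in (6, 6.5, 6.75, 7):
    print(true_mean, round(z_test_power(true_mean), 3))
```

The results agree with the figures quoted above to within rounding in the second decimal place, and they make the main point visible: one test, one alpha, four different values of power.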
Important point number two – power depends on the variability
In the example above, I forced the standard error to equal one by assuming my standard deviation was 10 and my sample size was 100. That isn’t very realistic, but I was just going with the example in his paper. Let’s say instead that the standard deviation is 1, which is more reasonable, and the sample size is 10. Then my standard error is about .32 (1 divided by the square root of 10) and the power is going to be greater than .999.
Important point number three – power depends on the test statistic
A z-test is a test where you compare the mean to a constant value. Generally, you don’t have a constant value. More likely, you have two groups. Say, we want to know if office guinea pigs get hay as often as home guinea pigs. My hypothesis is that the office guinea pigs will get it seven times a month, because they need more energy to keep up with their official duties, while the home guinea pigs will only get hay four times a month. I select a total sample of 10 with only 5 in each group, because I want an equal number and for some reason it is difficult to locate people who have office guinea pigs. The standard deviation of number of times of hay per month is still 1. When I compute the power of this test, it is .985.
SO …. even if you go with the standard .05 level of significance (level of significance ALSO affects power) and the standard two-tailed tests (whether you have a one or two-tailed test ALSO affects power) and you don’t have to bother about correlations between groups (the correlation between groups in a paired t-test ALSO affects power) you STILL can have a whole bunch of numbers that MAY be the power of the test depending on what the test statistic, variability and hypothesized value are.
The one thing that affects power people usually ask about is sample size
Yes, sample size also affects the power of a test. So, if I only had 4 guinea pigs per group, my power would be .939. If I had 10 guinea pigs in each group, it would be above .999
However, if you ask me to tell you how many people you need in your sample to have a power of .80, you’re asking the wrong question. The answer depends on how large of an effect size (in these examples difference between means), how much variability, the specific statistical test you are doing and other factors like whether it is a one or two-tailed tests and correlations between your groups.
The best answer you are going to get from me is that if you have 128 people total in your sample you will have power of AT LEAST .80 IF you are doing an independent t-test, IF there is AT LEAST a half-standard deviation difference between the two groups, AND you are doing a two-tailed test AND your null hypothesis is that there is zero difference between the groups. However, if there is a smaller difference than that, the power will be less. Also, if you are doing a different test, say, a logistic regression, power calculation is more complicated.
But I know that you are going to nod knowingly, turn around and walk out the door saying,
“128. Got it. Thanks!”
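To see how quickly that "128" moves when the assumptions change, here is a rough Python sketch based on the normal approximation for a two-group, two-tailed comparison of means. The function name `n_per_group` is made up, and the formula is the textbook approximation rather than the exact t-test calculation that power software performs.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate sample size per group; effect_size is the difference in SD units."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_beta = z.inv_cdf(power)            # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d), 2 * n_per_group(d))
```

For a half-standard deviation difference the approximation gives 63 per group, a shade under the 64 per group (128 total) that the t distribution requires. For a difference of 0.2 standard deviations it gives 393 per group, which is why "128" on its own is not an answer.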
FINALLY…the simplest ESTIMATE statements to write are for continuous variables not involved in interactions or higher order terms. Consider a data set containing the 2004 SAT scores for each of the 50 states. The file includes the combined math and verbal SAT scores (TOTAL), the state (STATE) and the percent [...]
One of the other "futures" sessions I attended at SAS Global Forum was The New SAS Programming Language: DS2 with SAS's Jason Secosky. Jason was at pains to point out that DS2 is not intended as a replacement for the good old DATA step. DS2 is an alternative to DATA step and has more of a focus than the generalistic DATA step.
Generally available in 9.4, PROC DS2 is currently available in SAS V9.3 as an experimental technology. Its focus is on high performance for data manipulation and data analysis. It incorporates threading.
DATA steps are in control of their data; they specify the source of their input data, and they specify the location of their output data. In contrast, DS2 is simply a node in a flow; DS2 uses data streams rather than specific data objects. So, DS2 is not a DATA step replacement, it's new technology.
DS2's syntax is similar in parts to DATA step, with DATA and SET statements, if/then/else statements, expressions and functions. However, DS2 adds structure to code. Some of its syntax will be familiar to SAS/AF SCL coders; it includes methods (including init, term, and run). It has lots more types of variables when compared with DATA step, e.g. integer and varchar. DS2 integrates with other languages (such as R, C, C++, IML, and SAS FCMP functions) through the concept of a package. Interestingly, we'll be able to edit our DS2 code in the Eclipse editor, wherein a debugger will be included.
In essence, DS2 is the means of taking code to data (ref: big data) and promises linear scalability.
To a statistician, the DIF function (which was introduced in SAS/IML 9.22) is useful for time series analysis. To a numerical analyst and a statistical programmer, the function has many other uses, including computing finite differences.
The DIF function computes the difference between the original vector and a shifted version of that vector. In terms of the LAG function, DIF(x,k) = x - LAG(x,k) for any value of the lag parameter, k. I blogged about the usefulness of the LAG function earlier this week.
For a function that is given by a formula, you can use the NLPFDD subroutine to compute finite difference derivatives. However, sometimes a function is known only at a finite set of points. In that case, you have a choice: you can either model the function by using regression techniques or you can assume that the function is piecewise linear.
Some curves really are piecewise linear. For example, an ROC curve is piecewise linear, and you can compute the exact derivatives by using a forward or backward difference scheme. You can also compute an exact area under the piecewise linear function by using the trapezoidal rule of integration.
The DIF function makes it easy to compute lagged differences (finite differences) in a sequence of values. As an example, the derivative of a function can be approximated by the backward difference formula: f'(x) ≈ (f(x)-f(x-h))/h for small values of h. If you know the values of f at a discrete set of points x1 < x2 < ... < xn, then you can use the DIF function to evaluate the backward difference because the expression f(xi-h) is the lagged term f(xi-1). For example, the following SAS/IML program computes a sequence of evenly spaced x values and evaluates the sine function at these points. The points of the backDiff vector approximate the derivative of the sine at each value of x:
proc iml;
h = 0.1;
x = T( do(0, 6.28, h) );
y = sin(x);
backDiff = dif(y, 1) / h; /* f'(x) ~ (f(x)-f(x-h))/h */
When the DIF function is called with a single argument, a lag of 1 is assumed, so you can also write backDiff = dif(y)/h.
We know from calculus that the exact derivative of the sine function is the cosine. The following statements compute the exact derivative at each value of x and compare it with the finite difference approximation:
deriv = cos(x);
maxBDiff = max(abs(deriv-backDiff)); /* find maximum difference */
print maxBDiff;
The following plot shows the exact derivative and the backward difference approximation at each point of x:
The finite difference approximations are in close agreement with the exact values. You can also plot the forward difference approximation, which is similar. The forward difference requires using a shift value of -1. When you work through the formula, you find that forwardDiff = -dif(y, -1) / h.
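For readers without SAS/IML at hand, the same computation can be sketched in plain Python. The lag and dif helpers below are hypothetical stand-ins written for this illustration (they pad with None where IML would return a missing value); they are not the IML implementation.

```python
import math

def lag(v, k=1):
    """Shift v by k positions, padding with None (k < 0 gives a lead)."""
    n = len(v)
    return [v[i - k] if 0 <= i - k < n else None for i in range(n)]

def dif(v, k=1):
    """Lagged difference v - lag(v, k); None where the lag is missing."""
    return [None if b is None else a - b for a, b in zip(v, lag(v, k))]

h = 0.1
x = [i * h for i in range(63)]    # 0, 0.1, ..., 6.2
y = [math.sin(t) for t in x]

# Backward difference uses lag 1; forward difference uses lag -1 and a sign flip
back_diff = [None if d is None else d / h for d in dif(y, 1)]
fwd_diff = [None if d is None else -d / h for d in dif(y, -1)]

# Largest gap between the exact derivative, cos(x), and the backward difference
max_err = max(abs(math.cos(t) - b) for t, b in zip(x, back_diff) if b is not None)
print(max_err)
```

The forward difference at one point is the backward difference at the next point, which is why the two plots look so similar.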
The DIF function also "works" on irregularly spaced data. For data that are not evenly spaced, the h parameter, which is the difference between adjacent x values, is no longer constant. You can use the DIF function to compute the distance between x values, and then compute the slopes as shown in the following statements:
/* irregular spacing and no formula */
x = {0.0, 0.1, 0.2, 0.4, 0.5, 0.8, 1.0};
y = {0.3, 0.6, 0.7, 0.7, 0.9, 1.0, 1.0};
dx = dif(x); /* difference for adjacent x values (lag=1) */
dy = dif(y); /* difference for adjacent y values (lag=1) */
slopes = dy/dx;
print dx dy slopes;
You can also use the DIF and LAG function to implement integration schemes. For example, in my article on the trapezoidal rule of integration, I could have implemented the trapezoidal rule by using LAG and DIF instead of using indexes to form the lag of the data vectors manually.
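As a rough illustration of both ideas outside SAS/IML, the Python sketch below computes the slopes for the irregularly spaced data above and then applies the trapezoidal rule using lagged values. The lag and dif helpers are hypothetical re-creations for this example, with None playing the role of a missing value.

```python
def lag(v, k=1):
    """Shift v by k positions, padding with None (k < 0 gives a lead)."""
    n = len(v)
    return [v[i - k] if 0 <= i - k < n else None for i in range(n)]

def dif(v, k=1):
    """Lagged difference v - lag(v, k); None where the lag is missing."""
    return [None if b is None else a - b for a, b in zip(v, lag(v, k))]

x = [0.0, 0.1, 0.2, 0.4, 0.5, 0.8, 1.0]
y = [0.3, 0.6, 0.7, 0.7, 0.9, 1.0, 1.0]

dx = dif(x)    # widths of the intervals
dy = dif(y)    # rises over the intervals
slopes = [None if w is None else r / w for w, r in zip(dx, dy)]

# Trapezoidal rule: each interval contributes width * average height
mean_y = [None if b is None else (a + b) / 2 for a, b in zip(y, lag(y))]
area = sum(w * m for w, m in zip(dx, mean_y) if w is not None)

print(slopes)
print(area)
```

Working with whole vectors this way avoids building index ranges such as 1:(n-1) and 2:n by hand.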
In the past, when I had to do any type of parsing of text, I wrote my own code with a zillion SUBSTR functions and IF statements and it did the job but it was so-o-o ugly and painful that I never even considered including text mining in any courses I taught.
I looked into SAS Enterprise Miner years ago but the commercial version costs (and this is approximate) $1,278,544,899,711,315 and your left kidney.
The SAS On-Demand version sucked. You know how some programs you can get a cup of coffee while waiting for them to run? With the original SAS On-Demand for Enterprise Miner you could fly to Colombia, work as a day laborer to earn the money to buy land, start your own plantation, breed a strain of genetically superior coffee beans and skip the country on the last plane out just before the latest government coup nationalizes your business – and your results STILL wouldn’t be available when you got back.
Having had such good luck with SAS On-Demand for Enterprise Guide last semester, I thought I’d give Enterprise Miner another look.
Oh.My.God.
Last year, The Spoiled One was in the living room with her boring parents, complaining they were watching The Daily Show with boring news when it turned out that Justin Bieber was the guest.
She must have felt like this.
The latest version is unbelievably faster. I cannot tell you if it is better because it ran so slow in the past it was impossible to tell. It is easy to use. Let me give an example.
First, you register with SAS On-Demand and register a course for use with Enterprise Miner. This is really easy.
Second, you start Enterprise Miner which requires nothing more than clicking on the Get Software link on your log in page.
Next, create a project. Just go to FILE > NEW > Project and click next a lot. Along the way you give it a name. It’s pretty obvious.
It may not be obvious that you need to have a data source available and create a diagram. Again, it’s pretty easy to figure out, though.
Creating a data source – go to FILE > NEW > DATA SOURCE
A window pops up and the default is SAS TABLE, which is what you want if your data is in a SAS dataset (they now call them tables. I blame the damn SQL people.). Click Next.
In the next window, you browse to where your data are. Because I am just testing this for use in a class, I used the abstract data set in the Sampsio library.
So, you have a project, a blank diagram and a data source. Now what?
Text parsing:
1. Drag the icon under data sources on to your diagram
2. Click on the Text Mining Tab
3. Click on the Text Parsing tab (hovering over each tab with the mouse will give you its name) and drag it to the diagram
4. Click on the little grey stem sticking out of the end of your data source and drag it to the text parsing box.
5. Now, right-click on the Text Parsing box and from the drop-down menu, select RUN
After a bit, it will come up with a window that has two choices, OK and Results. Click on Results. The most interesting bit in the results, I think, is the table of frequency for each word. You can see which words are most common in your documents.
STOP WORDS AND OTHER OPTIONS
This is just the beginning, of course. As you can imagine, if you had to actually write a program to read every word separately, that would take a bit of time. It would take far more time to have it ignore words that are useless, like “the”, “that” and “there”. These are called stop words. Enterprise Miner has a stop list and you can add or delete words from it.
Click on the thing that looks like a page to add a row and type in another stop word. For example, these abstracts come from the SAS Global Forum proceedings so they probably all have some words like data and SAS that occur in every one of them, so in this case, that is pretty useless as far as analyzing the documents. You can add those to your stop list.
If there is a word you want to keep, you can remove it from the stop list by selecting it and clicking that X at the top (right next to the thing that looks like a new page). You’ll be asked if you are sure you want to delete that row.
How do you get the stop row list, you may ask, quivering with excitement.

If you have clicked on the Text Parsing box, making it active, you’ll see in the left window pane a number of options.
These include:
the language to use,
a list of multi-word terms, everything from “a lot” to “keep in mind” to “zero in”,
parts of speech to ignore, like adjectives, and, of course,
the stop list.
To modify any of these, just click on the three dots next to it and a window will pop up, like the one shown above for the stop list.
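If you want a feel for what the parsing step is doing conceptually, here is a minimal Python sketch of a word-frequency count with a stop list. It is not how Enterprise Miner works internally; the three documents and the stop list are made-up stand-ins for the abstracts, and it skips parts of speech and multi-word terms entirely.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for conference abstracts
documents = [
    "The data step reads the data and writes a SAS data set.",
    "Text mining of SAS abstracts finds the common terms.",
    "Mining text is easier when the stop list removes common words.",
]

# A tiny general stop list plus corpus-specific additions
stop_words = {"the", "that", "there", "a", "and", "of", "is", "when"}
stop_words |= {"data", "sas"}

def term_frequencies(docs, stop):
    """Count how often each non-stop word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in stop)
    return counts

for term, n in term_frequencies(documents, stop_words).most_common(5):
    print(term, n)
```

Even this toy version shows why the stop list matters: without it, the most frequent "terms" would be words like "the" and "data" that say nothing about what separates one document from another.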
If you haven’t actually had to analyze text data before, you have NO IDEA how amazingly awesomely cool this all is.
When I was in graduate school, we would actually print out multiple copies of the documents, cut the pages into paragraphs and sort them into categories.
More recently, this is why I started using Ruby: parsing text was much easier in Ruby than in SAS. There were some cheaper and open source solutions that I looked at, but their documentation was non-existent and their interfaces were clear as mud.
The Bad and Good News
Speaking of unclear interfaces … I’m not sure I would have guessed that the page with the corner folded meant “add new row”. Also, there is a LOT of stuff on the Enterprise Miner screen. You have all of these different panes in the window and the options in them are completely different depending on whether you have clicked on the text mining tab, the text parsing box or something else. I’ve read a couple of data mining books, one specifically on Enterprise Miner, and they still were very sparse, particularly in their treatment of text mining, which is what I was most interested in.
That’s the bad news. The good news is that when I was at SAS Global Forum, I picked up a copy of Practical Text Mining. I almost didn’t buy it because it’s over 1,000 pages and my suitcase was already pretty full, which meant I’d have to lug it through the airport. Even worse, it did not have an electronic version, which is tough for me because even with contacts and glasses worn OVER my contacts, I still have difficulty reading some of the screen shots in it. (I expect if I had normal eyesight, I’d be fine.)
All that being said, this book is really useful. I know I got a discount at the conference, but still, it was about $70, which for a textbook like this is super-cheap. A thousand pages sounds like a lot, but that’s because it starts with the very basics and is a bit redundant. That’s not so terrible, though, because that makes it easy to read. I was lying in bed sick this morning and read the first 120 pages in about two hours.
This is a godsend to anyone doing a qualitative dissertation. The real tragedy is that a lot of people in areas that do qualitative research – education, psychology, nursing, social work, to name a few – probably won’t even be aware that Enterprise Miner exists, much less that they can get it for free to use in teaching their courses.
Seriously, people, this is a huge opportunity for you to teach your students about text mining and it’s really not that hard.
Many SAS customers are quickly adopting 64-bit versions of Microsoft Windows, and they are pleased-as-punch when they find a 64-bit version of SAS to run on it. They waste no time in deploying the new version, only to find that a few things don't work quite the same as they did with the 32-bit version. This post describes the top snags that end users encounter, and how to work around them.
Imagine you have a program that looks like this:
proc import out=work.class datafile="c:\temp\class.xls" DBMS = EXCEL; run;
On 64-bit SAS for Windows, you might be surprised to encounter this error:
ERROR: Connect: Class not registered
ERROR: Error in the LIBNAME statement
Connection Failed. See log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time 0.11 seconds
      cpu time 0.04 seconds
The Cause:
Your 64-bit SAS process cannot use the built-in data providers for Microsoft Excel or Microsoft Access, which are usually 32-bit modules. In a previous blog post, I've provided a bit of explanation about this limitation.
The Fix:
Use DBMS=EXCELCS for Excel files, or DBMS=ACCESSCS for Microsoft Access. For LIBNAME access, try LIBNAME PCFILES. These approaches use the PC Files Server, which is a separate small application that is provided with SAS/ACCESS to PC Files. Note that you may need to go back and install this application, as it might not have been placed in your installation automatically. However, you can use the Autostart feature to skip having to configure it as a service, and thus minimize the changes to your SAS programs.
Alternatively, you can try DBMS=XLSX to remove the data providers from the equation.
NOTE: There are a few feature differences between the EXCELCS and EXCEL options. Read this SAS note to determine whether these differences will affect your work.
A Caution:
I've heard of a few customers who decide to work around this limitation by installing the 64-bit version of Microsoft Office (and thus using the 64-bit data providers). That works, but it might introduce other incompatibilities with how you use your Microsoft Office applications. Microsoft recommends the 64-bit version of Office in only a few circumstances; consider the implications carefully before you head down this road.
Suppose that you have a library of user-defined formats that you once created by using PROC FORMAT. User-defined formats are stored in SAS catalogs, which are a sort of SAS-specific file system structure that SAS can access during your session.
If you created and used these user-defined formats with 32-bit SAS, you'll see this message when you try to use them with 64-bit SAS:
15   libname library "c:\datasources\32bit";
NOTE: Libref LIBRARY was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\datasources\32bit
16   proc print data=sashelp.air;
17   format date benefit.;
ERROR: File LIBRARY.FORMATS.CATALOG was created for a different operating system.
18   run;
The Fix:
SAS provides the utility procedures CPORT and CIMPORT to allow you to transfer catalog content across different operating environments, and you can certainly take that approach for this scenario.
If you have a mixed environment on your team where some people have 32-bit SAS and others have 64-bit SAS, it might be easier to decompose the format definitions down to data sets (by using PROC FORMAT and the CNTLOUT= option). You can then easily recreate the formats "on the fly" by using PROC FORMAT and the CNTLIN= option.
This works well because SAS data sets are compatible between the 32-bit and 64-bit versions of SAS...mostly. That brings us to the last "gotcha".
If you use SAS data sets that were created by a 32-bit version of SAS, you can read them without modification in 64-bit SAS. But you might see a message like this:
NOTE: Data file TEST.HMEQ.DATA is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
The Cause:
SAS data set files are written with an encoding that is specific to the SAS operating environment. In 32-bit SAS on Windows, the encoding is WINDOWS_32. On 64-bit SAS, it's WINDOWS_64. When the data set encoding differs from the native SAS session encoding, CEDA kicks in.
The good news is that in SAS 9.3, the SAS developers "taught" SAS for Windows to bypass the CEDA layer when the only encoding difference is WINDOWS_32 versus WINDOWS_64.
The Fix:
You don't have to do anything about this issue unless you want to update the data sets. And if you have SAS 9.3, you probably won't see this message at all...at least not when the data originates from 32-bit SAS for Windows.
If you decide to convert entire data set libraries to the new native encoding, you can achieve this by using PROC MIGRATE.
I'll finish this post with just a few general points to guide you:
Myths about 64-bit computing on Windows
Are 64-bit client applications twice as good as 32-bit applications?
How do I export from SAS to Excel files: Let me count the ways
Should you care about 64-bit applications?
Robert Allison's SAS/GRAPH: Beyond the Basics collects examples that demonstrate a variety of techniques you can use to create custom graphs using SAS/GRAPH software. To celebrate the book’s publication, we asked Robert to tell us more about why he loves SAS/GRAPH. Here’s what he had to say: A graph is [...]
My team received what turned out to be an interesting call for help from one of our clients today. We resolved the client's coding error but it also served as a reminder of a little-used feature of Base SAS, namely the ability to specify directory names in code rather than bother with libnames. There are pros and cons for doing this. I'll discuss these below after I explain the feature.
We're used to specifying data sets on DATA statements in the "libname.dataset" style. However, instead of using a data set name, you can specify the physical pathname to the file, using syntax that your operating system understands. The pathname must be enclosed in single or double quotation marks. Here's an example:
data "c:\mydata\mydataset";
In the foregoing example, the DATA step would create a SAS data set file named mydataset.sas7bdat in the c:\mydata directory.
There's more information in the section titled "Accessing Permanent SAS Files without a Libref" in the SAS 9.3 Language Reference: Concepts. You will see that we can use the same naming technique in almost any situation where a library and data set name are expected, e.g. a SET statement, a MERGE statement, an UPDATE statement, a MODIFY statement, the DATA= option of a SAS procedure, and the OPEN function.
My client's coding error resulted from the fact that they had specified a macro parameter intended as a data set name and they had surrounded it with quotes. The call %demo("name") resulted in a DATA statement like this: data "name". As a result, SAS tried to create a file named name.sas7bdat in the SAS session's current directory. That directory was the root directory of the SASApp server, the user didn't have permission to write to it, and hence the code failed. The intention was to create a data set named "name" in the work directory, the actuality was significantly different. It was all caused by a common misunderstanding/mistake - using quotes around character strings in macros.
So, we understand how we can dispense with LIBNAME statements, but should we take advantage of this capability? Well, I can't see too many advantages, but I can see plenty of disadvantages!
The disadvantages include i) need to accurately specify directory paths throughout the program (rather than eight character libnames), ii) cannot quickly and easily change a directory location (as can be useful when testing), and iii) cannot specify an engine for the library.
Can you think of any advantages? Let us know your suggestions in a comment.
Before we've even got SAS Global Forum under our belts, registration for SAS Professionals Convention in Marlow, July 10th to 12th 2012, is open!
I’m excited to let you know about two opportunities for students at the Analytics 2012 Conference, Oct. 8-9 in Las Vegas. The first is the Student Poster Contest. If you have some research to share with the analytics community, consider submitting an abstract. If your abstract is accepted, then you [...]
A recent exchange on the R-sig-teaching list featured a discussion of how best to teach new students R. The initial post included an exercise: write a function that, given n, draws n rows of a triangle made up of "*". The post noted that for a beginner, this may require two for loops. For example, in pseudo-code:
for i = 1 to n
    for j = 1 to i
        print "*"
> ifelse(outer(1:5, 1:5, `>=`), "*", " ")
[,1] [,2] [,3] [,4] [,5]
[1,] "*" " " " " " " " "
[2,] "*" "*" " " " " " "
[3,] "*" "*" "*" " " " "
[4,] "*" "*" "*" "*" " "
[5,] "*" "*" "*" "*" "*"
> lapply(1:5, function(x) cat(rep("*", x), "\n"))
*
* *
* * *
* * * *
* * * * *
data test;
array star [5] $ star1 - star5;
do i = 1 to 5;
star[i] = "*";
output;
end;
run;
proc print noobs; var star1 - star5; run;
star1 star2 star3 star4 star5
*
* *
* * *
* * * *
* * * * *
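For comparison with the R and SAS versions, here is a short Python sketch of the same exercise. It is only an illustration added here, not something from the mailing-list thread, and the function name is made up. Like the vectorized R answers, it builds each row in one expression instead of using an inner loop.

```python
def triangle(n):
    """Return the n rows of a left-aligned triangle of asterisks."""
    # "*" * i makes a run of i stars; join puts a space between them
    return [" ".join("*" * i) for i in range(1, n + 1)]

for row in triangle(5):
    print(row)
```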
I was pleased to see some papers on the subject of software development processes at SAS Global Forum this year. The IT industry hasn't yet reached a point where a consensus on the perfect software development process has been reached (will it ever?). So, it's no surprise that opinions differ on some matters.
To a statistician, the LAG function (which was introduced in SAS/IML 9.22) is useful for time series analysis. To a numerical analyst and a statistical programmer, the function provides a convenient way to compute quantities that involve adjacent values in any vector.
The LAG function is essentially a "shift operator." It shifts a vector of values and pads the result with missing values so that the returned vector has the same number of elements as the original vector. For example, the following SAS/IML statements define the first few terms of the Fibonacci series and call the LAG function to shift the series by one element.
proc iml;
v = {1, 1, 2, 3, 5, 8, 13, 21}; /* Fibonacci sequence */
lag1 = lag(v); /* by default, lag=1 ==> shift forward */
first = 1:(nrow(v)-1); /* index 1:(N-1) */
v1 = v[first]; /* extract all but the last element */
print lag1 v1;
The returned vector, lag1, contains a missing value in the first element and does not contain the last element of v. Notice that the nonmissing values are the same as v1, which is obtained by subsetting the first N-1 elements of the vector v.
You can shift elements the other way by using a negative value for the lag parameter. (This is sometimes called computing a lead.)
lag2 = lag(v, -1); /* shift backward */
last = 2:nrow(v); /* index 2:N */
v2 = v[last]; /* extract all but first element */
The returned vector (not shown) contains a missing value in the last element and does not contain the first element of v.
The LAG function is valuable when you want to compute a quantity that involves adjacent elements. For example, the following statements compute the ratio of adjacent values in the Fibonacci sequence:
z = v/lag(v); /* ratio of adjacent values */
print z;
This ratio quickly converges to the Golden Ratio, which is 1.61803399.... In a previous post, I show how you can understand this result by looking at the eigenvalues of a certain linear transformation.
So, yes, by all means, use the LAG function to compute lags and leads in time series data. However, the LAG function is also useful for any numerical computation that involves adjacent values in a sequence.
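The shift-and-divide idea is not specific to SAS/IML. As a sketch, the Python below re-creates it with a hypothetical lag helper that pads with None in place of a missing value, then checks how close the last ratio is to the Golden Ratio.

```python
def lag(v, k=1):
    """Shift v by k positions, padding with None (k < 0 gives a lead)."""
    n = len(v)
    return [v[i - k] if 0 <= i - k < n else None for i in range(n)]

v = [1, 1, 2, 3, 5, 8, 13, 21]    # Fibonacci sequence

# Ratio of each term to the one before it; the first has no predecessor
ratios = [None if b is None else a / b for a, b in zip(v, lag(v))]

golden = (1 + 5 ** 0.5) / 2
print(ratios)
print(abs(ratios[-1] - golden))
```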
I was pleased to see a number of papers at this year's SAS Global Forum that dared to focus on topics outside of SAS technology and syntax. Two papers that particularly caught my interest were How to Create a Business Intelligence Strategy by Guy Garrett, and The Systems Development Life Cycle (SDLC) as a Standard: Beyond the Documentation by Dianne Rhodes. These papers were good demonstrations of the fact that you can buy the best software in the world, but you'll not optimise your return on investment if you don't put it to use in a planned, structured manner.
The focus of SAS Global Forum should always be SAS software and solutions. I'm not suggesting the event should be turned into a computer science conference, but there's a balance that can be struck. In my opinion, the balance lies at a point whereby attendees' interest in planning and process can be piqued such that they want to find out more once they return to their office.