21 Oct

## How cheap will gasoline prices go?

Have you noticed lower gasoline prices lately? How low will they go, and how long will they stay down? Let's use SAS to analyze some of the data! First, let's look at just the price of gasoline over time. Here's a plot of the US average gasoline price, each week […]

21 Oct

## Example 2014.12: Changing repeated measures data from wide to narrow format

Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.


```
Obs  ID  CESD1  CESD2  CESD3  CESD4  CESD
  1   1      7      .      8      5    49
  2   2     11      .      .      .    30
  3   3     14      .      .     49    39
  4   4     44      .      .     20    15
  5   5     26     27     15     28    39
...
```

Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and a column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:

```
Obs  ID  time  CESD  cesd_tv
  1   1     1    49        7
  2   1     2    49        .
  3   1     3    49        8
  4   1     4    49        5
...
```

In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.

**R**

We'll begin by making a narrower data frame with just the columns noted above. We use the `select()` function from the dplyr package to do this; the syntax is simply to provide the name of the input data frame as the first argument, followed by the names of the columns to be included in the output data frame. We use this function instead of the similar base R function `subset(..., select=)` because of dplyr's useful `starts_with()` function, which operates on the column names as character vectors in a hopefully obvious way.

```r
load("c:/book/savedfile")
library(dplyr)
wide = select(ds, id, starts_with("cesd"))
```
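The prefix matching that `starts_with()` performs is easy to picture in any language. Here is a quick pure-Python sketch of the same selection; the column list is illustrative, not the full HELP variable set:

```python
# Hypothetical column names; only "id" plus the cesd* columns survive,
# mirroring select(ds, id, starts_with("cesd")) from the R code above.
columns = ["id", "age", "cesd", "cesd1", "cesd2", "cesd3", "cesd4", "pcs"]

selected = ["id"] + [c for c in columns if c.startswith("cesd")]
print(selected)  # ['id', 'cesd', 'cesd1', 'cesd2', 'cesd3', 'cesd4']
```

Note that the prefix test also keeps the baseline `cesd` column itself, which is exactly what we want here.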

Now we'll convert to the long format. The standard R approach is to use the `reshape()` function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.

```r
long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
               v.names = "cesd_tv",
               idvar = c("id", "cesd"), direction = "long")
long[long$id == 1,]
```

```
       id cesd time cesd_tv
1.49.1  1   49    1       7
1.49.2  1   49    2      NA
1.49.3  1   49    3       8
1.49.4  1   49    4       5
```

In the preceding, the `varying` parameter lists the columns that vary over time, while the `idvar` columns appear at each time. The `v.names` parameter is the name of the column that will hold the values of the `varying` variables.

Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.

```r
c1 = subset(wide, select = c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"
c2 = subset(wide, select = c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"
c3 = subset(wide, select = c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"
c4 = subset(wide, select = c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"
long = rbind(c1, c2, c3, c4)
long[long$id == 1,]
```

```
     id cesd cesd_tv time
1     1   49       7    1
454   1   49      NA    2
907   1   49       8    3
1360  1   49       5    4
```

This is cumbersome, but effective.

More interesting is to use the tools provided by dplyr and tidyr.

```r
library(tidyr)
gather(wide, key = names, value = cesd_tv, cesd1, cesd2, cesd3, cesd4) %>%
  mutate(time = as.numeric(substr(names, 5, 5))) %>%
  arrange(id, time) -> long
head(long)
```

```
  id cesd names cesd_tv time
1  1   49 cesd1       7    1
2  1   49 cesd2      NA    2
3  1   49 cesd3       8    3
4  1   49 cesd4       5    4
5  2   30 cesd1      11    1
6  2   30 cesd2      NA    2
```

The `gather()` function takes a data frame (the first argument) and returns new columns named in the `key` and `value` parameters. The contents of the columns are the names (in the `key`) and the values (in the `value`) of the former columns listed. The result is a new data frame with a row for every column in the original data frame, for every row in the original data frame. Any columns not named are repeated in the output data frame. The `mutate()` function is like the base R function `transform()` but has some additional features and may be faster in some settings. Finally, the `arrange()` function is a much more convenient sorting facility than is available in standard R. The input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display.

The `%>%` operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the last function into the next function as the first argument. This can cut down on the use of nested parentheses and may make reading R code easier for some folks. The effect of the piping is that the `mutate()` function should be read as taking the result of `gather()` as its input data frame and sending its output data frame into the `arrange()` function. For Ken, the right assignment arrow (`-> long`) makes sense as a way to finish off this set of piping rules, but Nick and many R users would prefer to write this as `long = gather(...) %>% ...` instead.
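The semantics of `gather()` are simple enough to spell out in a few lines of plain Python, which may help readers coming from other languages. The record below is ID 1 from the printed data; the dictionary representation is ours, purely for illustration:

```python
# Wide-format record for ID 1, taken from the printout above
wide_row = {"id": 1, "cesd": 49, "cesd1": 7, "cesd2": None, "cesd3": 8, "cesd4": 5}

# Each time-varying column becomes its own row; id and the baseline
# cesd are repeated on every row, just as gather() repeats unnamed columns.
long_rows = [
    {"id": wide_row["id"], "cesd": wide_row["cesd"],
     "time": t, "cesd_tv": wide_row["cesd%d" % t]}
    for t in (1, 2, 3, 4)
]

for r in long_rows:
    print(r["id"], r["time"], r["cesd"], r["cesd_tv"])
# 1 1 49 7
# 1 2 49 None
# 1 3 49 8
# 1 4 49 5
```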


**SAS**

In SAS, we'll make the narrow data set using the `keep` statement in the `data` step, demonstrating along the way the convenient colon operator, which performs the same function provided by `starts_with()` in dplyr.

```sas
data all;
  set "c:/book/help.sas7bdat";
run;

data wide;
  set all;
  keep id cesd:;
run;
```

The simpler way to make the desired data set is with the transpose procedure. Here the `by` statement forces the variables listed in that statement not to be transposed. The `notsorted` option saves us having to actually sort the variables. Otherwise the procedure works like `gather()`: each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for `gather()`'s `key` (SAS: `_NAME_`) and `value` (SAS: `COL1`), though these can be changed.

```sas
proc transpose data = wide out = long_a;
  by notsorted id notsorted cesd;
run;

data long;
  set long_a;
  time = substr(_name_, 5);
  rename col1 = cesd_tv;
run;

proc print data = long;
  where id eq 1;
  var id time cesd cesd_tv;
run;
```

```
Obs  ID  time  CESD  cesd_tv
  1   1     1    49        7
  2   1     2    49        .
  3   1     3    49        8
  4   1     4    49        5
```

As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.

```sas
data long;
  set wide;
  time = 1; cesd_tv = cesd1; output;
  time = 2; cesd_tv = cesd2; output;
  time = 3; cesd_tv = cesd3; output;
  time = 4; cesd_tv = cesd4; output;
run;

proc print data = long;
  where id eq 1;
  var id time cesd cesd_tv;
run;
```

```
Obs  ID  time  CESD  cesd_tv
  1   1     1    49        7
  2   1     2    49        .
  3   1     3    49        8
  4   1     4    49        5
```

**An unrelated note about aggregators:** We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

21 Oct

## SAS Certification at Analytics 2014

Attending an industry conference requires an investment in time away from the office and maximizing that investment makes a lot of sense. In addition to gaining insight into challenges facing the analytics industry today, discovering and evaluating new products and services, and networking with the largest gathering of analytics professionals […]

21 Oct

## Automated testing by pytest

The hardest part of testing is writing the test cases, which is time-consuming and error-prone. Fortunately, besides Python's built-in modules such as `doctest` and `unittest`, there are quite a few third-party packages that can help with automated testing. My favorite is pytest, which enjoys a proven track record and pleasant syntactic sugar.

### Step 1: test-driven development

For example, there is a coding challenge on Leetcode:

> Find Minimum in Rotated Sorted Array
>
> Suppose a sorted array is rotated at some pivot unknown to you beforehand
> (i.e., `0 1 2 4 5 6 7` might become `4 5 6 7 0 1 2`).
>
> Find the minimum element.
>
> You may assume no duplicate exists in the array.

The straightforward way to find the minimal element in an array (or list, in Python) is sequential search, which goes through every element and has a time complexity of `O(N)`. If the array were sorted, the minimal element would simply be the first one, costing only `O(1)`. However, this question provides a rotated sorted array, which suggests a binary search and reduces the complexity from `O(N)` to `O(logN)`.

As usual, write the test cases first. The great thing about pytest is that it significantly simplifies the effort of coding test cases: in this example, 3 lines suffice to generate 101 test cases, covering every rotation from 0 to 99 plus a null test.

The next step is to code the function. It is easy to transplant the iterative approach of binary search to this question: if the pointer falls within a sorted segment, return the leftmost element as the minimum; otherwise, adjust the right or left boundary.

```python
# test1.py
import pytest

# Prepare 101 test cases
array = list(range(100))
_testdata = [[array[i:] + array[:i], 0] for i in range(100)]
_testdata += [pytest.mark.empty(([], None))]

# Code the initial binary search function
def findMinPrev(num):
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid - 1
        else:
            lo = mid + 1

@pytest.mark.parametrize('input, expected', _testdata)
def test_findMinPrev(input, expected):
    assert findMinPrev(input) == expected
```

After running the `py.test -v test1.py` command, part of the results is shown below. 65 tests passed and 36 failed; the failed cases return much bigger values, which suggests the search runs out of bounds during the loops, and that the choice of boundaries may be too aggressive.

```
_________________________ test_findMinPrev[input98-0] _________________________

input = [98, 99, 0, 1, 2, 3, ...], expected = 0

    @pytest.mark.parametrize('input, expected', _testdata)
    def test_findMinPrev(input, expected):
>       assert findMinPrev(input) == expected
E       assert 98 == 0
E        +  where 98 = findMinPrev([98, 99, 0, 1, 2, 3, ...])

test1.py:20: AssertionError
==================== 36 failed, 65 passed in 0.72 seconds =====================
```

Now I adjust the right boundary slightly and finally come up with a solution that passes all the tests.

```python
def findMin(num):
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1
```
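Note that this is Python 2 code: `(lo + hi) / 2` relies on integer division of two ints. Under Python 3 the same function needs floor division, `//`. A minimal port with a few spot checks:

```python
def find_min(num):
    """Binary search for the minimum of a rotated sorted array (Python 3)."""
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:       # current segment is already sorted
            return num[lo]
        mid = (lo + hi) // 2         # floor division: mid stays an int
        if num[mid] < num[hi]:
            hi = mid                 # minimum is at mid or to its left
        else:
            lo = mid + 1             # minimum is strictly to the right
    return None                      # empty input

assert find_min([4, 5, 6, 7, 0, 1, 2]) == 0
assert find_min([0, 1, 2, 4, 5, 6, 7]) == 0
assert find_min([]) is None
```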

### Step 2: performance profiling

Besides the right solution, I am also interested in whether the binary search has indeed improved performance. For this step I choose line_profiler, given its line-by-line profiling ability. I take the most basic method, sequential search, as the benchmark, and also include the method that applies the `min()` function, since built-in functions like it are implemented in C and avoid the overhead of an explicit Python loop. The test case is a rotated sorted array with 10 million elements.

```python
# test2.py
from line_profiler import LineProfiler
from sys import maxint

@profile
def findMinRaw(num):
    """Sequential search"""
    if not num:
        return
    min_val = maxint
    for x in num:
        if x < min_val:
            min_val = x
    return min_val

@profile
def findMinLst(num):
    """Search with the built-in min()"""
    if not num:
        return
    return min(num)

@profile
def findMin(num):
    """Binary search"""
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1

# Prepare a rotated array
array = list(range(10000000))
_testdata = array[56780:] + array[:56780]

# Profile the three functions
findMinRaw(_testdata)
findMinLst(_testdata)
findMin(_testdata)
```

After running `kernprof -l -v test2.py`, I have the output shown below. The sequential search hits its loop 10,000,001 times and costs almost 14 seconds. The `min()` function encapsulates all the details inside and uses 0.5 seconds, which is about 28 times faster. By contrast, the binary search takes only 20 iterations to find the minimal value and spends just 0.0001 seconds. When dealing with a large number of elements, an improved algorithm can really save time.

```
Total time: 13.8512 s
File: test2.py
Function: findMinRaw at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def findMinRaw(num):
     6         1           13     13.0      0.0      if not num:
     7                                                   return
     8         1            3      3.0      0.0      min_val = maxint
     9  10000001     16031900      1.6     47.5      for x in num:
    10  10000000     17707821      1.8     52.5          if x < min_val:
    11         2            5      2.5      0.0              min_val = x
    12         1            3      3.0      0.0      return min_val

Total time: 0.510298 s
File: test2.py
Function: findMinLst at line 15

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
    15                                             @profile
    16                                             def findMinLst(num):
    17         1            4        4.0      0.0      if not num:
    18                                                     return
    19         1      1243016  1243016.0    100.0      return min(num)

Total time: 0.000101812 s
File: test2.py
Function: findMin at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def findMin(num):
    24         1           15     15.0      6.0      lo, hi = 0, len(num) - 1
    25        20           40      2.0     16.1      while lo <= hi:
    26        20           48      2.4     19.4          if num[lo] <= num[hi]:
    27         1            2      2.0      0.8              return num[lo]
    28        19           54      2.8     21.8          mid = (lo + hi) / 2
    29        19           50      2.6     20.2          if num[mid] < num[hi]:
    30         5           10      2.0      4.0              hi = mid
    31                                                   else:
    32        14           29      2.1     11.7              lo = mid + 1
```
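line_profiler is a third-party tool; the same qualitative comparison can be sketched with nothing beyond the standard library. The sketch below reuses a Python 3 port of the binary search and shrinks the array to one million elements to keep it quick; absolute timings will of course vary by machine:

```python
import time

def find_min(num):
    """Binary search on a rotated sorted array (Python 3 port)."""
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) // 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1

data = list(range(1000000))
data = data[56780:] + data[:56780]   # rotate, as in the post

t0 = time.perf_counter()
linear = min(data)                    # scans all million elements
t1 = time.perf_counter()
binary = find_min(data)               # touches only ~20 elements
t2 = time.perf_counter()

assert linear == binary == 0
print("min():  %.6f s" % (t1 - t0))
print("binary: %.6f s" % (t2 - t1))
```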

20 Oct

## Higher education and analytics

It's my favorite time of year! The leaves are changing. Football is back. And it's also time for our annual Analytics conference. One of the best parts about my job is getting to attend the conference each year and host the Inside Analytics video series. Not everyone at the conference gets […]

20 Oct

## My five favorite memories of MWSUG 2014

The Chicago weather cooperated for MWSUG 2014 with nice crisp fall temperatures, clear skies and beautiful sunrises over the lake. Aside from the weather, here are my five favorite memories of MWSUG 2014: 1. Location-location-location. The location and venue were fabulous! Not only were we along the Chicago River with a beautiful […]

20 Oct

## Compute the log-determinant of an arbitrary matrix

A few years ago I wrote an article that shows how to compute the log-determinant of a covariance matrix in SAS. This computation is often required to evaluate a log-likelihood function. My algorithm used the ROOT function in SAS/IML to compute a Cholesky decomposition of the covariance matrix. The Cholesky […]

19 Oct

## What Good is Cook’s D?

I’ve written here before about visual literacy, and Cook’s D is just my latest example. Most people intuitively understand that any sample can have outliers: say, an 80-year-old man who is the father of a six-year-old child, or the new college graduate who is making $150,000 a year. We understand that those people may throw off […]

19 Oct

## Consistent Group Colors by Value

Getting consistent group colors across different data sets for a graph is a common topic of interest. Recently a user wrote in to ask how to ensure that specific group "values" for a bar chart get specific colors. The group values may arrive in different order, or some may […]

19 Oct

## Care and Feeding of Volunteers

Since it is the weekend, I decided to blog about weekend stuff. Look for more statistics tomorrow. For most of the past quarter-century, I have been roped into being a volunteer for one organization or another. Here is a very, very partial list: American Association on Mental Retardation National Council on Family Relations United States […]