18
Jun

## Example 9.35: Discrete randomization and formatted output

A colleague asked for help with randomly choosing a kid within a family. This is for a trial in which families are recruited at well-child visits, but in each family only one of the children having a well-child visit that day can be in the study. The idea is that after recruiting the family, the research assistant needs to choose one child, but if they make that choice themselves, the children are unlikely to be representative. Instead, we'll allow them to make a random decision through an easily used slip that can be put into sealed envelopes. The envisioned process is that the RA will recruit the family, determine the number of eligible children, then open the envelope to find out which child was randomly selected.

One thought here would be to generate separate stacks of envelopes for each given family size, and have the research assistant open an envelope from the appropriate stack. However, this could be logistically challenging, especially since the RAs will spend weeks away from the home office. Instead, we'll include all plausible family sizes on each slip of paper. It seems unlikely that more than 5 children in a family will have well-child visits on the same day.

SAS
We'll use the SAS example to demonstrate using SAS macros to write SAS code, as well as showing a plausible use for SAS formats (section 1.4.12) and making use of proc print.
`/* the following macro will write out equal probabilities for selecting each integer between 1 and the argument, in the format needed for the rand function.  E.g., if the argument is 3, it will write out1/3,1/3,1/3*/%macro tbls(n);%do i = 1 %to &n;1/&n %if &i < &n %then ,%end;%mend tbls;/* then we can use the %tbls macro to create the randomizationvia rand("TABLE") (section 1.10.4). */ data kids;do family = 1 to 10000;  nkids = 2; chosen = rand("TABLE",%tbls(2)); output;  nkids = 3; chosen = rand("TABLE",%tbls(3)); output;  nkids = 4; chosen = rand("TABLE",%tbls(4)); output;  nkids = 5; chosen = rand("TABLE",%tbls(5)); output;end;run;/* check randomization */proc freq data = kids;table nkids * chosen / nocol nopercent;run;    nkids     chosen   Frequency|   Row Pct  |       1|       2|       3|       4|       5|  Total   ---------+--------+--------+--------+--------+--------+          2 |  50256 |  49744 |      0 |      0 |      0 | 100000            |  50.26 |  49.74 |   0.00 |   0.00 |   0.00 |   ---------+--------+--------+--------+--------+--------+          3 |  33429 |  33292 |  33279 |      0 |      0 | 100000            |  33.43 |  33.29 |  33.28 |   0.00 |   0.00 |   ---------+--------+--------+--------+--------+--------+          4 |  25039 |  24839 |  25245 |  24877 |      0 | 100000            |  25.04 |  24.84 |  25.25 |  24.88 |   0.00 |   ---------+--------+--------+--------+--------+--------+          5 |  19930 |  20074 |  20188 |  20036 |  19772 | 100000            |  19.93 |  20.07 |  20.19 |  20.04 |  19.77 |   ---------+--------+--------+--------+--------+--------+   Total      128654   127949    78712    44913    19772   400000`

Looks pretty good. Now we need to make the output usable to the research assistants, by formatting the results into English. We'll use the same format for each number of kids. This saves some keystrokes now, but may possibly cause the RAs some confusion-- it means that we might refer to the "4th oldest" of 4 children, rather than the "youngest". We could fix this using a different format for each number of children, analogous to the R version below.
`proc format;value chosen1 = "oldest"2 = '2nd oldest'3 = '3rd oldest'4 = '4th oldest'5 = '5th oldest';run; /* now, make a text variable the concatenates (section 1.4.5) the variables and some explanatory text */data k2;set kids;if nkids eq 2 then  t1 = "If there are " || strip(nkids) ||" children then choose the " ||       strip(put(chosen,chosen.)) || " child.";else  t1 = "             " || strip(nkids) ||" ________________________ " ||       strip(put(chosen,chosen.));run;/* then we print.  Notice the options to print in plain text, shorten the page length and width, and remove the date and page number from the SAS output, aswell as in the proc print statement to remove the observation number andshow the line number, with a few other tricks */options nonumber nodate ps = 60 ls = 68;OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";proc print data = k2 (obs = 3) noobs label sumlabel;by family;var t1;label t1 = '00'x family = "Envelope";run;---------------------------- Envelope=1 ----------------------------     If there are 2 children then choose the 2nd oldest child.                  3 ________________________ 3rd oldest                  4 ________________________ 4th oldest                  5 ________________________ 5th oldest---------------------------- Envelope=2 ----------------------------     If there are 2 children then choose the 2nd oldest child.                  3 ________________________ oldest                  4 ________________________ oldest                  5 ________________________ 3rd oldest---------------------------- Envelope=3 ----------------------------     If there are 2 children then choose the 2nd oldest child.                  3 ________________________ 2nd oldest                  4 ________________________ 3rd oldest                  5 ________________________ 2nd oldest`

R
For R, we leave some trial code in place, to demonstrate how one might discover, test, and build R code in this setting. Most results have been omitted.
`sample(5, size = 1)   # choose a (discrete uniform) random integer between 1 and 5apply(matrix(2:5),1,sample,size=1)   # choose a random integer between 1 and 2, then between 1 and 3, etc., # using apply() to repeat the call to sample() with different maximum number# apply() needs a matrix or array input# result of this is the raw data needed for one familyreplicate(3,apply(matrix(2:5),1,sample,size=1))# replicate() is in the apply() family and just repeats the # function n times     [,1] [,2] [,3][1,]    2    1    2[2,]    2    1    2[3,]    2    2    2[4,]    3    5    4`

Now we have the raw data for the envelopes. Before formatting it for printing, let's check it to make sure it works correctly.
`test=replicate(100000, apply(matrix(2:5), 1, sample, size=1))apply(test, 1, summary)        [,1] [,2]  [,3]  [,4]Min.     1.0    1 1.000 1.0001st Qu.  1.0    1 1.000 2.000Median   1.0    2 2.000 3.000Mean     1.5    2 2.492 3.0033rd Qu.  2.0    3 3.000 4.000Max.     2.0    3 4.000 5.000# this is not so helpful-- need the count or percent for each number# this would be the default if the data were factors, but they aren't# check to see if we can trick summary() into treating these integers# as if they were factorsmethods(summary)# yes, there's a summary() method for factors-- let's apply it# there's also apply(test,1,table) which might be better, if you remember itapply(test, 1, summary.factor)[[1]]    1     2 50025 49975 [[2]]    1     2     3 33329 33366 33305 [[3]]    1     2     3     4 25231 25134 24849 24786 [[4]]    1     2     3     4     5 19836 20068 20065 20022 20009 # apply(test,1,table) will give similar results, if you remember it`

Well, that's not too pretty, but it's clear that the randomization is working. Now it's time to work on formatting the output.
`mylist=replicate(5, apply(matrix(2:5), 1, sample, size=1))# brief example data set# We'll need to use some formatted values (section 1.14.12), as in SAS. # Here, we'll make new value labels for each number of children,# which will make the output easier to read.  We add in an envelope # number and wrap it all into a data frame.df = data.frame(envelope = 1:5,   twokids=factor(mylist[1,],1:2,labels=c("youngest","oldest")),  threekids=factor(mylist[2,],1:3,labels=c("youngest", "middle", "oldest")),  fourkids=factor(mylist[3,],1:4,labels=c("youngest", "second youngest",       "second oldest", "oldest")),  fivekids=factor(mylist[4,],1:5,labels=c("youngest", "second youngest",       "middle", "second oldest", "oldest")))# now we need a function to take a row of the data frame and make a single slip# the paste() function (section 1.4.5) puts together the fixed and variable # content of each row, while the cat() function will print it without quotesslip = function(kidvec) {  cat(paste("------------- Envelope", kidvec[1], "------------------"))  cat(paste("\nIf there are", 2:5, " children, select the", kidvec[2:5],"child"))  cat("\n \n \n")}# test it on one rowslip(df[1,])# looks good-- now we can apply() it to each row of the data frameapply(df, 1, slip)------------- Envelope 1 ------------------If there are 2  children, select the youngest child If there are 3  children, select the youngest child If there are 4  children, select the second youngest child If there are 5  children, select the youngest child  ------------- Envelope 2 ------------------If there are 2  children, select the youngest child If there are 3  children, select the youngest child If there are 4  children, select the second oldest child If there are 5  children, select the middle child------------- Envelope 3 ------------------If there are 2  children, select the youngest child If there are 3  children, select the youngest child If there are 4  children, select the youngest child If there are 5  children, select the second youngest child# and so forth# finally, we can save the result in a file with# capture.output()capture.output(apply(df,1,slip), file="testslip.txt")`

18
Jun

## Example 9.35: Discrete randomization and formatted output

A colleague asked for help with randomly choosing a kid within a family. This is for a trial in which families are recruited at well-child visits, but in each family only one of the children having a well-child visit that day can be in the study. The idea is that after recruiting the family, the research assistant needs to choose one child, but if they make that choice themselves, the children are unlikely to be representative. Instead, we'll allow them to make a random decision through an easily used slip that can be put into sealed envelopes. The envisioned process is that the RA will recruit the family, determine the number of eligible children, then open the envelope to find out which child was randomly selected.

One thought here would be to generate separate stacks of envelopes for each given family size, and have the research assistant open an envelope from the appropriate stack. However, this could be logistically challenging, especially since the RAs will spend weeks away from the home office. Instead, we'll include all plausible family sizes on each slip of paper. It seems unlikely that more than 5 children in a family will have well-child visits on the same day.

SAS
We'll use the SAS example to demonstrate using SAS macros to write SAS code, as well as showing a plausible use for SAS formats (section 1.4.12) and making use of proc print.
`/* the following macro will write out equal probabilities for selecting each integer between 1 and the argument, in the format needed for the rand function.  E.g., if the argument is 3, it will write out1/3,1/3,1/3*/%macro tbls(n);%do i = 1 %to &n;1/&n %if &i < &n %then ,%end;%mend tbls;/* then we can use the %tbls macro to create the randomizationvia rand("TABLE") (section 1.10.4). */ data kids;do family = 1 to 10000;  nkids = 2; chosen = rand("TABLE",%tbls(2)); output;  nkids = 3; chosen = rand("TABLE",%tbls(3)); output;  nkids = 4; chosen = rand("TABLE",%tbls(4)); output;  nkids = 5; chosen = rand("TABLE",%tbls(5)); output;end;run;/* check randomization */proc freq data = kids;table nkids * chosen / nocol nopercent;run;    nkids     chosen   Frequency|   Row Pct  |       1|       2|       3|       4|       5|  Total   ---------+--------+--------+--------+--------+--------+          2 |  50256 |  49744 |      0 |      0 |      0 | 100000            |  50.26 |  49.74 |   0.00 |   0.00 |   0.00 |   ---------+--------+--------+--------+--------+--------+          3 |  33429 |  33292 |  33279 |      0 |      0 | 100000            |  33.43 |  33.29 |  33.28 |   0.00 |   0.00 |   ---------+--------+--------+--------+--------+--------+          4 |  25039 |  24839 |  25245 |  24877 |      0 | 100000            |  25.04 |  24.84 |  25.25 |  24.88 |   0.00 |   ---------+--------+--------+--------+--------+--------+          5 |  19930 |  20074 |  20188 |  20036 |  19772 | 100000            |  19.93 |  20.07 |  20.19 |  20.04 |  19.77 |   ---------+--------+--------+--------+--------+--------+   Total      128654   127949    78712    44913    19772   400000`

Looks pretty good. Now we need to make the output usable to the research assistants, by formatting the results into English. We'll use the same format for each number of kids. This saves some keystrokes now, but may possibly cause the RAs some confusion-- it means that we might refer to the "4th oldest" of 4 children, rather than the "youngest". We could fix this using a different format for each number of children, analogous to the R version below.
`proc format;value chosen1 = "oldest"2 = '2nd oldest'3 = '3rd oldest'4 = '4th oldest'5 = '5th oldest';run; /* now, make a text variable the concatenates (section 1.4.5) the variables and some explanatory text */data k2;set kids;if nkids eq 2 then  t1 = "If there are " || strip(nkids) ||" children then choose the " ||       strip(put(chosen,chosen.)) || " child.";else  t1 = "             " || strip(nkids) ||" ________________________ " ||       strip(put(chosen,chosen.));run;/* then we print.  Notice the options to print in plain text, shorten the page length and width, and remove the date and page number from the SAS output, aswell as in the proc print statement to remove the observation number andshow the line number, with a few other tricks */options nonumber nodate ps = 60 ls = 68;OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";proc print data = k2 (obs = 3) noobs label sumlabel;by family;var t1;label t1 = '00'x family = "Envelope";run;---------------------------- Envelope=1 ----------------------------     If there are 2 children then choose the 2nd oldest child.                  3 ________________________ 3rd oldest                  4 ________________________ 4th oldest                  5 ________________________ 5th oldest---------------------------- Envelope=2 ----------------------------     If there are 2 children then choose the 2nd oldest child.                  3 ________________________ oldest                  4 ________________________ oldest                  5 ________________________ 3rd oldest---------------------------- Envelope=3 ----------------------------     If there are 2 children then choose the 2nd oldest child.                  3 ________________________ 2nd oldest                  4 ________________________ 3rd oldest                  5 ________________________ 2nd oldest`

R
For R, we leave some trial code in place, to demonstrate how one might discover, test, and build R code in this setting. Most results have been omitted.
`sample(5, size = 1)   # choose a (discrete uniform) random integer between 1 and 5apply(matrix(2:5),1,sample,size=1)   # choose a random integer between 1 and 2, then between 1 and 3, etc., # using apply() to repeat the call to sample() with different maximum number# apply() needs a matrix or array input# result of this is the raw data needed for one familyreplicate(3,apply(matrix(2:5),1,sample,size=1))# replicate() is in the apply() family and just repeats the # function n times     [,1] [,2] [,3][1,]    2    1    2[2,]    2    1    2[3,]    2    2    2[4,]    3    5    4`

Now we have the raw data for the envelopes. Before formatting it for printing, let's check it to make sure it works correctly.
`test=replicate(100000, apply(matrix(2:5), 1, sample, size=1))apply(test, 1, summary)        [,1] [,2]  [,3]  [,4]Min.     1.0    1 1.000 1.0001st Qu.  1.0    1 1.000 2.000Median   1.0    2 2.000 3.000Mean     1.5    2 2.492 3.0033rd Qu.  2.0    3 3.000 4.000Max.     2.0    3 4.000 5.000# this is not so helpful-- need the count or percent for each number# this would be the default if the data were factors, but they aren't# check to see if we can trick summary() into treating these integers# as if they were factorsmethods(summary)# yes, there's a summary() method for factors-- let's apply it# there's also apply(test,1,table) which might be better, if you remember itapply(test, 1, summary.factor)[[1]]    1     2 50025 49975 [[2]]    1     2     3 33329 33366 33305 [[3]]    1     2     3     4 25231 25134 24849 24786 [[4]]    1     2     3     4     5 19836 20068 20065 20022 20009 # apply(test,1,table) will give similar results, if you remember it`

Well, that's not too pretty, but it's clear that the randomization is working. Now it's time to work on formatting the output.
`mylist=replicate(5, apply(matrix(2:5), 1, sample, size=1))# brief example data set# We'll need to use some formatted values (section 1.14.12), as in SAS. # Here, we'll make new value labels for each number of children,# which will make the output easier to read.  We add in an envelope # number and wrap it all into a data frame.df = data.frame(envelope = 1:5,   twokids=factor(mylist[1,],1:2,labels=c("youngest","oldest")),  threekids=factor(mylist[2,],1:3,labels=c("youngest", "middle", "oldest")),  fourkids=factor(mylist[3,],1:4,labels=c("youngest", "second youngest",       "second oldest", "oldest")),  fivekids=factor(mylist[4,],1:5,labels=c("youngest", "second youngest",       "middle", "second oldest", "oldest")))# now we need a function to take a row of the data frame and make a single slip# the paste() function (section 1.4.5) puts together the fixed and variable # content of each row, while the cat() function will print it without quotesslip = function(kidvec) {  cat(paste("------------- Envelope", kidvec[1], "------------------"))  cat(paste("\nIf there are", 2:5, " children, select the", kidvec[2:5],"child"))  cat("\n \n \n")}# test it on one rowslip(df[1,])# looks good-- now we can apply() it to each row of the data frameapply(df, 1, slip)------------- Envelope 1 ------------------If there are 2  children, select the youngest child If there are 3  children, select the youngest child If there are 4  children, select the second youngest child If there are 5  children, select the youngest child  ------------- Envelope 2 ------------------If there are 2  children, select the youngest child If there are 3  children, select the youngest child If there are 4  children, select the second oldest child If there are 5  children, select the middle child------------- Envelope 3 ------------------If there are 2  children, select the youngest child If there are 3  children, select the youngest child If there are 4  children, select the youngest child If there are 5  children, select the second youngest child# and so forth# finally, we can save the result in a file with# capture.output()capture.output(apply(df,1,slip), file="testslip.txt")`

21
May

## Example 9.32: Multiple testing simulation

In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracting p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" and "weak" control of the family-wise error rate offered in multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more than the nominal proportion of times.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform (0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement-- this can be risky, but works fine here.)
`%macro fdr(nsims=1, ntests = 20, nfalse=10, howfalse=.01);ods select none;data test;do sim = 1 to &nsims;  do i = 1 to &ntests;    raw_p = uniform(0) *       ( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );    output;  end;end;run;ods output pvalues = __pv;proc multtest inpvalues=test fdr;by sim;run; `

With the results in hand, (still within the macro) we need to do some massaging to make the results usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transform the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows. After transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.
`data __pv1;set __pv;if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;else fdrprop =0;run;proc transpose data = __pv1 (keep =sim fdrprop) out = pvals_a;by sim; run;data pvals;set pvals_a;prop = sum(of col1 - col&ntests);run;ods select all;proc freq data = pvals; tables prop; run;%mend fdr;%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);                                      Cumulative    Cumulative     prop    Frequency     Percent     Frequency      Percent     ---------------------------------------------------------      0.5         758       75.80           758        75.80     0.55         210       21.00           968        96.80      0.6          27        2.70           995        99.50     0.65           5        0.50          1000       100.00`

So true nulls were rejected 24% of the time, which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject them only 5% of the time. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we make them first for all of tests with the false, then for all of the tests with the true nulls. The matrix function reads these in by column, by default, meaning that the first nfalse columns get the nsims*nfalse observations. The apply function generates the FDR p-values for each row of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests, and non-rejections as 0; we do this with the ifelse() statement. Then we sum across the simulations with another call to apply() and show the results with a simple table.
`checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,                    runif((ntests-nfalse) * nsims)), nrow=nsims)  fdr = t(apply(raw_p, 1, p.adjust, "fdr"))  reject = ifelse(fdr<.05, 1/ntests,0)  prop = apply(reject, 1, sum)  prop.table(table(prop)) }> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)prop  0.5  0.55   0.6  0.65 0.755 0.210 0.032 0.003 `

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.
`checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {   pvals = matrix(c(runif(nfalse * nsims) * howfalse,                     runif((ntests-nfalse) * nsims)), nrow=nsims)   hochberg = t(apply(pvals, 1, p.adjust,"hochberg"))   reject = ifelse(hochberg<.05,1/ntests,0)   prop = apply(reject, 1, sum)   prop.table(table(prop)) } > checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)prop  0.5  0.55   0.6 0.951 0.046 0.003`

With this procedure one or more of the true nulls is rejected an appropriate 4.9% of the time. For the most part, we feel more comfortable using multiple testing procedures with "strong control".

21
May

## Example 9.32: Multiple testing simulation

In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracting p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" and "weak" control of the family-wise error rate offered in multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more than the nominal proportion of times.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform (0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement-- this can be risky, but works fine here.)
`%macro fdr(nsims=1, ntests = 20, nfalse=10, howfalse=.01);ods select none;data test;do sim = 1 to &nsims;  do i = 1 to &ntests;    raw_p = uniform(0) *       ( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );    output;  end;end;run;ods output pvalues = __pv;proc multtest inpvalues=test fdr;by sim;run; `

With the results in hand, (still within the macro) we need to do some massaging to make the results usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transform the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows. After transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.
`data __pv1;set __pv;if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;else fdrprop =0;run;proc transpose data = __pv1 (keep =sim fdrprop) out = pvals_a;by sim; run;data pvals;set pvals_a;prop = sum(of col1 - col&ntests);run;ods select all;proc freq data = pvals; tables prop; run;%mend fdr;%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);                                      Cumulative    Cumulative     prop    Frequency     Percent     Frequency      Percent     ---------------------------------------------------------      0.5         758       75.80           758        75.80     0.55         210       21.00           968        96.80      0.6          27        2.70           995        99.50     0.65           5        0.50          1000       100.00`

So true nulls were rejected 24% of the time, which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject them only 5% of the time. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we make them first for all of tests with the false, then for all of the tests with the true nulls. The matrix function reads these in by column, by default, meaning that the first nfalse columns get the nsims*nfalse observations. The apply function generates the FDR p-values for each row of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests, and non-rejections as 0; we do this with the ifelse() statement. Then we sum across the simulations with another call to apply() and show the results with a simple table.
`checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,                    runif((ntests-nfalse) * nsims)), nrow=nsims)  fdr = t(apply(raw_p, 1, p.adjust, "fdr"))  reject = ifelse(fdr<.05, 1/ntests,0)  prop = apply(reject, 1, sum)  prop.table(table(prop)) }> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)prop  0.5  0.55   0.6  0.65 0.755 0.210 0.032 0.003 `

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.
`checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {   pvals = matrix(c(runif(nfalse * nsims) * howfalse,                     runif((ntests-nfalse) * nsims)), nrow=nsims)   hochberg = t(apply(pvals, 1, p.adjust,"hochberg"))   reject = ifelse(hochberg<.05,1/ntests,0)   prop = apply(reject, 1, sum)   prop.table(table(prop)) } > checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)prop  0.5  0.55   0.6 0.951 0.046 0.003`

With this procedure one or more of the true nulls is rejected an appropriate 4.9% of the time. For the most part, we feel more comfortable using multiple testing procedures with "strong control".

9
Apr

## Example 9.26: More circular plotting

SAS's Rick Wicklin showed a simple loess smoother for the temperature data we showed here. Then he came back with a better approach that does away with edge effects. Rick's smoothing was calculated and plotted on a cartesian plane. In this entry we'll explore another option or two for smoothing, and plot the results on the same circular plot.

Since Rick is showing SAS code, and Robert Allison has done the circular plot (plot) (code), we'll stick to the R again for this one.

R
We'll start out by getting the data and setting it up as we did earlier. We add the year back into the matrix t3old because it'll be needed later.
`temp1 = read.table("http://academic.udayton.edu/kissock/http/Weather/           gsod95-current/NYALBANY.txt")leap = c(0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1)days = rep(365,18) + leapmonthdays = c(31,28,31,30,31,30,31,31,30,31,30,31)temp1\$V3 = temp1\$V3 - 1994yearpart = function(daytvec,yeardays,mdays=monthdays){  part = (sum(mdays[1:(daytvec[1]-1)],(daytvec[1] > 2) * (yeardays[daytvec[3]]==366))           + daytvec[2] - ((daytvec[1] == 1)*31)) / yeardays[daytvec[3]]  return(part)}temp2 = as.matrix(temp1)radians = 2* pi * apply(temp2, 1, yearpart, days, monthdays)t3old = matrix(c(temp1\$V4[temp1\$V4 != -99 & ((temp1\$V3 < 18) )],                 radians[temp1\$V4 != -99  &  ((temp1\$V3 < 18) )],                 temp1\$V3[temp1\$V4 != -99 & ((temp1\$V3 < 18) )]), ncol=3)t3now= matrix(c(temp1\$V4[temp1\$V4 != -99 & ((temp1\$V3 == 18) |                            (temp1\$V3 == 17 & temp1\$V1 == 12))],                radians[temp1\$V4 != -99 & ((temp1\$V3 == 18) |                            (temp1\$V3 == 17 & temp1\$V1 == 12))]), ncol=2)library(plotrix)radial.plot(t3old[,1],t3old[,2],rp.type="s", point.col = 2, point.symbols=46,     clockwise=TRUE, start = pi/2, label.pos = (1:12)/6 * (pi),    radial.lim=c(-20,10,40,70,100),  labels=c("February 1","March 1",      "April 1","May 1","June 1","July 1","August 1","September 1",      "October 1","November 1","December 1","January 1"))radial.plot(t3now[,1],t3now[,2],rp.type="s", point.col = 1, point.symbols='*',     clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))`

If you didn't happen to see the update on the previous entry, note that the radial.lim option makes the axes for the added points match those for the initial plot. Otherwise, the added points plotted lower than they appeared, making the recent winter look cooler.

Rick started with a smoother, but often cyclic data can be fit well parametrically, using the sine and cosine of the cycle length as the covariates. With the data set up in radians already, this is trivially simple. The predicted values for the data can be retrieved with the fitted() function (e.g., section 3.7.3), which works with many model objects. These can then be fed into the radial.plot() function with rp.type="p" to make a line plot. The result is shown at the top-- the parametric fit appears to do a good job. Of course, you can fit on a square plot very easily with the plot() function, with result shown below.
`simple = lm(t3old[,1] ~ sin(t3old[,2]) + cos(t3old[,2]))radial.plot(fitted(simple),t3old[,2],rp.type="p", clockwise=TRUE,            start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))plot(t3old[,1] ~ t3old[,2], pch='.')lines(t3old[,2],fitted(simple))`

I didn't change the order of the data, so the line comes back to the beginning of the plot at the end of the year.

Adding a smoothed fit is nearly as easy. Just replace the lm() call with a loess() (section 5.2.6) call. The new line is added on top of the old one, to see just how they differ. The result is show below.
`simploess = loess(t3old[,1] ~ t3old[,2])radial.plot(fitted(simploess),t3old[,2],rp.type="p", line.col="blue",           clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))`

The parametric fit is pretty good, but misses the sharp dip seen in January, and the fit in the late fall and early spring appear to be slightly affected.

But this approach stacks up all the data from 18 years. It might be more appropriate to stretch the data across the calendar time, fit the smoothed line to that, and then wrap it around the circular plot. To do this, we'll need to add the year back into the radians. Finding an acceptable smoother was a challenge-- the smooth.spline() function used here was adequate, but as the second plot below shows, it misses some highs and lows. Adding the smoothed curve to the plot is as easy as before, however. The plot with smoothing by year is immediately below.
`radyear = t3old[,2] + (2 * pi * t3old[,3])better = smooth.spline(y=t3old[,1],x= radyear, all.knots=TRUE,spar=1.1)radial.plot(fitted(better),t3old[,2],rp.type="p", line.col="green",         clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))plot(t3old[,1] ~ radyear, pch = '.')lines(better)`

The relatively poor fit seen below makes the new (green) line at least as poor as the parametric fit. The extra variability in the winter is reflected in distinct lines in the winter. Rick's approach, to fit the data lumping across years, seems to be the best for fitting, though it's easier to see the heteroscedaticity in the ciruclar plot. But however you slice it, this winter has had an unusual number of very warm days.

9
Apr

## Example 9.26: More circular plotting

SAS's Rick Wicklin showed a simple loess smoother for the temperature data we showed here. Then he came back with a better approach that does away with edge effects. Rick's smoothing was calculated and plotted on a cartesian plane. In this entry we'll explore another option or two for smoothing, and plot the results on the same circular plot.

Since Rick is showing SAS code, and Robert Allison has done the circular plot (plot) (code), we'll stick to the R again for this one.

R
We'll start out by getting the data and setting it up as we did earlier. We add the year back into the matrix t3old because it'll be needed later.
`temp1 = read.table("http://academic.udayton.edu/kissock/http/Weather/           gsod95-current/NYALBANY.txt")leap = c(0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1)days = rep(365,18) + leapmonthdays = c(31,28,31,30,31,30,31,31,30,31,30,31)temp1\$V3 = temp1\$V3 - 1994yearpart = function(daytvec,yeardays,mdays=monthdays){  part = (sum(mdays[1:(daytvec[1]-1)],(daytvec[1] > 2) * (yeardays[daytvec[3]]==366))           + daytvec[2] - ((daytvec[1] == 1)*31)) / yeardays[daytvec[3]]  return(part)}temp2 = as.matrix(temp1)radians = 2* pi * apply(temp2, 1, yearpart, days, monthdays)t3old = matrix(c(temp1\$V4[temp1\$V4 != -99 & ((temp1\$V3 < 18) )],                 radians[temp1\$V4 != -99  &  ((temp1\$V3 < 18) )],                 temp1\$V3[temp1\$V4 != -99 & ((temp1\$V3 < 18) )]), ncol=3)t3now= matrix(c(temp1\$V4[temp1\$V4 != -99 & ((temp1\$V3 == 18) |                            (temp1\$V3 == 17 & temp1\$V1 == 12))],                radians[temp1\$V4 != -99 & ((temp1\$V3 == 18) |                            (temp1\$V3 == 17 & temp1\$V1 == 12))]), ncol=2)library(plotrix)radial.plot(t3old[,1],t3old[,2],rp.type="s", point.col = 2, point.symbols=46,     clockwise=TRUE, start = pi/2, label.pos = (1:12)/6 * (pi),    radial.lim=c(-20,10,40,70,100),  labels=c("February 1","March 1",      "April 1","May 1","June 1","July 1","August 1","September 1",      "October 1","November 1","December 1","January 1"))radial.plot(t3now[,1],t3now[,2],rp.type="s", point.col = 1, point.symbols='*',     clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))`

If you didn't happen to see the update on the previous entry, note that the radial.lim option makes the axes for the added points match those for the initial plot. Otherwise, the added points plotted lower than they appeared, making the recent winter look cooler.

Rick started with a smoother, but often cyclic data can be fit well parametrically, using the sine and cosine of the cycle length as the covariates. With the data set up in radians already, this is trivially simple. The predicted values for the data can be retrieved with the fitted() function (e.g., section 3.7.3), which works with many model objects. These can then be fed into the radial.plot() function with rp.type="p" to make a line plot. The result is shown at the top-- the parametric fit appears to do a good job. Of course, you can fit on a square plot very easily with the plot() function, with result shown below.
`simple = lm(t3old[,1] ~ sin(t3old[,2]) + cos(t3old[,2]))radial.plot(fitted(simple),t3old[,2],rp.type="p", clockwise=TRUE,            start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))plot(t3old[,1] ~ t3old[,2], pch='.')lines(t3old[,2],fitted(simple))`

I didn't change the order of the data, so the line comes back to the beginning of the plot at the end of the year.

Adding a smoothed fit is nearly as easy. Just replace the lm() call with a loess() (section 5.2.6) call. The new line is added on top of the old one, to see just how they differ. The result is show below.
`simploess = loess(t3old[,1] ~ t3old[,2])radial.plot(fitted(simploess),t3old[,2],rp.type="p", line.col="blue",           clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))`

The parametric fit is pretty good, but misses the sharp dip seen in January, and the fit in the late fall and early spring appear to be slightly affected.

But this approach stacks up all the data from 18 years. It might be more appropriate to stretch the data across the calendar time, fit the smoothed line to that, and then wrap it around the circular plot. To do this, we'll need to add the year back into the radians. Finding an acceptable smoother was a challenge-- the smooth.spline() function used here was adequate, but as the second plot below shows, it misses some highs and lows. Adding the smoothed curve to the plot is as easy as before, however. The plot with smoothing by year is immediately below.
`radyear = t3old[,2] + (2 * pi * t3old[,3])better = smooth.spline(y=t3old[,1],x= radyear, all.knots=TRUE,spar=1.1)radial.plot(fitted(better),t3old[,2],rp.type="p", line.col="green",         clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))plot(t3old[,1] ~ radyear, pch = '.')lines(better)`

The relatively poor fit seen below makes the new (green) line at least as poor as the parametric fit. The extra variability in the winter is reflected in distinct lines in the winter. Rick's approach, to fit the data lumping across years, seems to be the best for fitting, though it's easier to see the heteroscedaticity in the ciruclar plot. But however you slice it, this winter has had an unusual number of very warm days.

2
Apr

## Example 9.25: It's been a mighty warm winter? (Plot on a circular axis)

Updated (see below)

People here in the northeast US consider this to have been an unusually warm winter. Was it?

The University of Dayton and the US Environmental Protection Agency maintain an archive of daily average temperatures that's reasonably current. In the case of Albany, NY (the most similar of their records to our homes in the Massachusetts' Pioneer Valley), the data set as of this writing includes daily records from 1995 through March 12, 2012.

In this entry, we show how to use R to plot these temperatures on a circular axis, that is, where January first follows December 31st. We'll color the current winter differently to see how it compares. We're not aware of a tool to enable this in SAS. It would most likely require a bit of algebra and manual plotting to make it work.

R
The work of plotting is done by the radian.plot() function in the plotrix package. But there are a number of data management tasks to be employed first. Most notably, we need to calculate the relative portion of the year that's elapsed through each day. This is trickier than it might be, because of leap years. We'll read the data directly via URL, which we demonstrate in Example 8.31. That way, when the unseasonably warm weather of last week is posted, we can update the plot with trivial ease.
`library(plotrix)temp1 = read.table("http://academic.udayton.edu/kissock/http/            Weather/gsod95-current/NYALBANY.txt")leap = c(0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1)days = rep(365, 18) + leapmonthdays = c(31,28,31,30,31,30,31,31,30,31,30,31)temp1\$V3 = temp1\$V3 - 1994`

The leap, days, and monthdays vectors identify leap years, count the corrrect number of days in each year, and have the number of days in the month in non-leap years, respectively. We need each of these to get the elapsed time in the year for each day. The columns in the data set are the month, day, year, and average temperature (in Fahrenheit). The years are renumbered, since we'll use them as indexes later.

The yearpart() function, below, counts the proportion of days elapsed.
`yearpart = function(daytvec,yeardays,mdays=monthdays){  part = (sum(mdays[1:(daytvec[1]-1)],           (daytvec[1] > 2) * (yeardays[daytvec[3]]==366))           + daytvec[2] - ((daytvec[1] == 1)*31)) / yeardays[daytvec[3]]  return(part)}`

The daytvec argument to the function will be a row from the data set. The function works by first summing the days in the months that have passed (,sum(mdays[1:(daytvec[1]-1)]) adding one if it's February and a leap year ((daytvec[1] > 2) * (yeardays[daytvec[3]]==366))). Then the days passed so far in the current month are added. Finally, we subtract the length of January, if it's January. This is needed, because sum(1:0) = 1, the result of which is that that January is counted as a month that has "passed" when the sum() function quoted above is calculated for January days. Finally, we just divide by the number of days in the current year.

The rest is fairy simple. We calculate the radians as the portion of the year passed * 2 * pi, using the apply() function to repeat across the rows of the data set. Then we make matrices with time before and time since this winter started, admittedly with some ugly logical expressions (section 1.14.11), and use the radian.plot() function to make the plots. The options to the function are fairly self-explanatory.
`temp2 = as.matrix(temp1)radians = 2* pi * apply(temp2,1,yearpart,days,monthdays)t3old = matrix(c(temp1\$V4[temp1\$V4 != -99 & ((temp1\$V3 < 18) | (temp1\$V2 < 12))],          radians[temp1\$V4 != -99  &  ((temp1\$V3 < 18) | (temp1\$V2 < 2))]),ncol=2)t3now= matrix(c(temp1\$V4[temp1\$V4 != -99 &           ((temp1\$V3 == 18) | (temp1\$V3 == 17 & temp1\$V1 == 12))],          radians[temp1\$V4 != -99 & ((temp1\$V3 == 18) |           (temp1\$V3 == 17 & temp1\$V1 == 12))]),ncol=2)# from plottrix libraryradial.plot(t3old[,1],t3old[,2],rp.type="s", point.col = 2, point.symbols=46,            clockwise=TRUE, start = pi/2, label.pos = (1:12)/6 * (pi),             labels=c("February 1","March 1","April 1","May 1","June 1",            "July 1","August 1","September 1","October 1","November 1",            "December 1","January 1"), radial.lim=c(-20,10,40,70,100))radial.plot(t3now[,1],t3now[,2],rp.type="s", point.col = 1, point.symbols='*',             clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))`

The result is shown at the top. The dots (point.symbol is like pch so 20 is a point (section 5.2.2) show the older data, while the asterisks are the current winter. An alternate plot can be created with the rp.type="p" option, which makes a line plot. The result is shown below, but the lines connecting the dots get most of the ink and are not what we care about today.

Either plot demonstrates clearly that a typical average temperature in Albany is about 60 to 80 in August and about 10 to 35 in January, the coldest monthttp://www.blogger.com/img/blank.gifh.

Update
The top figure shows that it has in fact been quite a warm winter-- most of the black asterisks are near the outside of the range of red dots. Updating with more recent weeks will likely increase this impression. In the first edition of this post, the radial.lim option was omitted, which resulted in different axes in the original and "add" calls to radial.plot. This made the winter look much cooler. Many thanks to Robert Allison for noticing the problem in the main plot. Robert has made many hundreds of beautiful graphics in SAS, which can be found here. He also has a book. Robert also created a version of the plot above in SAS, which you can find here, with code here. Both SAS and R (not to mention a host of other environments) are sufficiently general and flexible that you can do whatever you want to do-- but varying amounts of expertise might be required.

2
Apr

## Example 9.25: It's been a mighty warm winter? (Plot on a circular axis)

Updated (see below)

People here in the northeast US consider this to have been an unusually warm winter. Was it?

The University of Dayton and the US Environmental Protection Agency maintain an archive of daily average temperatures that's reasonably current. In the case of Albany, NY (the most similar of their records to our homes in the Massachusetts' Pioneer Valley), the data set as of this writing includes daily records from 1995 through March 12, 2012.

In this entry, we show how to use R to plot these temperatures on a circular axis, that is, where January first follows December 31st. We'll color the current winter differently to see how it compares. We're not aware of a tool to enable this in SAS. It would most likely require a bit of algebra and manual plotting to make it work.

R
The work of plotting is done by the radian.plot() function in the plotrix package. But there are a number of data management tasks to be employed first. Most notably, we need to calculate the relative portion of the year that's elapsed through each day. This is trickier than it might be, because of leap years. We'll read the data directly via URL, which we demonstrate in Example 8.31. That way, when the unseasonably warm weather of last week is posted, we can update the plot with trivial ease.
`library(plotrix)temp1 = read.table("http://academic.udayton.edu/kissock/http/            Weather/gsod95-current/NYALBANY.txt")leap = c(0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1)days = rep(365, 18) + leapmonthdays = c(31,28,31,30,31,30,31,31,30,31,30,31)temp1\$V3 = temp1\$V3 - 1994`

The leap, days, and monthdays vectors identify leap years, count the corrrect number of days in each year, and have the number of days in the month in non-leap years, respectively. We need each of these to get the elapsed time in the year for each day. The columns in the data set are the month, day, year, and average temperature (in Fahrenheit). The years are renumbered, since we'll use them as indexes later.

The yearpart() function, below, counts the proportion of days elapsed.
`yearpart = function(daytvec,yeardays,mdays=monthdays){  part = (sum(mdays[1:(daytvec[1]-1)],           (daytvec[1] > 2) * (yeardays[daytvec[3]]==366))           + daytvec[2] - ((daytvec[1] == 1)*31)) / yeardays[daytvec[3]]  return(part)}`

The daytvec argument to the function will be a row from the data set. The function works by first summing the days in the months that have passed (,sum(mdays[1:(daytvec[1]-1)]) adding one if it's February and a leap year ((daytvec[1] > 2) * (yeardays[daytvec[3]]==366))). Then the days passed so far in the current month are added. Finally, we subtract the length of January, if it's January. This is needed, because sum(1:0) = 1, the result of which is that that January is counted as a month that has "passed" when the sum() function quoted above is calculated for January days. Finally, we just divide by the number of days in the current year.

The rest is fairy simple. We calculate the radians as the portion of the year passed * 2 * pi, using the apply() function to repeat across the rows of the data set. Then we make matrices with time before and time since this winter started, admittedly with some ugly logical expressions (section 1.14.11), and use the radian.plot() function to make the plots. The options to the function are fairly self-explanatory.
`temp2 = as.matrix(temp1)radians = 2* pi * apply(temp2,1,yearpart,days,monthdays)t3old = matrix(c(temp1\$V4[temp1\$V4 != -99 & ((temp1\$V3 < 18) | (temp1\$V2 < 12))],          radians[temp1\$V4 != -99  &  ((temp1\$V3 < 18) | (temp1\$V2 < 2))]),ncol=2)t3now= matrix(c(temp1\$V4[temp1\$V4 != -99 &           ((temp1\$V3 == 18) | (temp1\$V3 == 17 & temp1\$V1 == 12))],          radians[temp1\$V4 != -99 & ((temp1\$V3 == 18) |           (temp1\$V3 == 17 & temp1\$V1 == 12))]),ncol=2)# from plottrix libraryradial.plot(t3old[,1],t3old[,2],rp.type="s", point.col = 2, point.symbols=46,            clockwise=TRUE, start = pi/2, label.pos = (1:12)/6 * (pi),             labels=c("February 1","March 1","April 1","May 1","June 1",            "July 1","August 1","September 1","October 1","November 1",            "December 1","January 1"), radial.lim=c(-20,10,40,70,100))radial.plot(t3now[,1],t3now[,2],rp.type="s", point.col = 1, point.symbols='*',             clockwise=TRUE, start = pi/2, add=TRUE, radial.lim=c(-20,10,40,70,100))`

The result is shown at the top. The dots (point.symbol is like pch so 20 is a point (section 5.2.2) show the older data, while the asterisks are the current winter. An alternate plot can be created with the rp.type="p" option, which makes a line plot. The result is shown below, but the lines connecting the dots get most of the ink and are not what we care about today.

Either plot demonstrates clearly that a typical average temperature in Albany is about 60 to 80 in August and about 10 to 35 in January, the coldest monthttp://www.blogger.com/img/blank.gifh.

Update
The top figure shows that it has in fact been quite a warm winter-- most of the black asterisks are near the outside of the range of red dots. Updating with more recent weeks will likely increase this impression. In the first edition of this post, the radial.lim option was omitted, which resulted in different axes in the original and "add" calls to radial.plot. This made the winter look much cooler. Many thanks to Robert Allison for noticing the problem in the main plot. Robert has made many hundreds of beautiful graphics in SAS, which can be found here. He also has a book. Robert also created a version of the plot above in SAS, which you can find here, with code here. Both SAS and R (not to mention a host of other environments) are sufficiently general and flexible that you can do whatever you want to do-- but varying amounts of expertise might be required.