Adapting Cassell’s SAS Randomization Test Wrapper For Unbalanced MANOVA

From sasCommunity
Jump to: navigation, search

Adapting Cassell’s SAS Randomization Test Wrapper For Unbalanced Multivariate Analysis Of Variance

Introduction

It is often necessary to perform a multivariate analysis of variance (MANOVA) to determine the effects of one or more independent variables on a set of dependent variables taken together. When the error is not normally distributed, when variances are heterogeneous or when sample sizes are insufficient to invoke the central limit theorem, using a standard MANOVA may yield unreliable tests of significance.

Randomization tests are one way to solve this problem. Instead of performing a full permutation test, which would be too computationally intensive for many data sets, a randomization test does an analysis on a random sample of all possible permutations of the data. The results from the original MANOVA can be compared to the results produced by the many random reshufflings of the data to estimate the probability of type I error. There are currently several stand-alone programs available that can perform a randomization MANOVA, such as RT 2.1 (Manly 2011) and PERMANOVA (Anderson 2005). However, most, if not all, of these programs require a balanced design for a factorial MANOVA. Adapting Cassell’s (2002) SAS randomization wrapper for MANOVA will allow analysis of unbalanced designs. Cassell (2002) concluded that the wrapper yields “…a simple, flexible way to provide randomization tests in place of the standard SAS ® analyses”. As designed, the Cassell’s wrapper handles univariate data. Here, we develop an extension of this wrapper for the unbalanced, multivariate problem and illustrate a simple way to generate the appropriate randomizations of a multifactor, multivariate data set.

Data Set Preparation

Because multivariate data come as a set of dependent variables that are linked, the procedure for randomly shuffling the data set must be modified from that in the original wrapper designed for univariate analysis. This involves modifying the data set so that the dependent variables associated with each observation remain associated after the data set is shuffled. The original data set should contain all dependent and independent variables. This approach cannot accommodate covariates, so independent variables must be strictly categorical. For the randomization process to work, you will need to create one new indicator variable that conveys each observation’s membership in a “treatment” group, whose values uniquely represent every combination of all independent variables (i.e., as though this were a one way design).

For example, consider a data set named “grasshoppers” for an experiment on the effects of food and temperature on reproductive output by grasshoppers. There are two independent variables: temperature and food, and three dependent variables: day of oviposition (=ovday), number of eggs (=eggs), and mean egg mass (=eggmass). The design is factorial, with three temperatures: 28°, 32°and 36°C, and two food groups: low and high. Thus, there are 6 “treatment” combinations. You would create the new variables “Treatment”, where “Treatment=1” indicates 28° and low food, “Treatment=2” indicates 28° and high food, “Treatment=3” indicates 32° and low food, and so on.

The Code

The following SAS macros are adapted from Cassell (2002) to accommodate a multivariate data set. The first macro is %randomize. It will first make the desired number of replicates, which you specify, of the original data set, &INDATA. It also makes an array of all the indicator variables of the original replicate. Then, for each new replicate, it randomly replaces each observation’s indicator variables with one of the numbers from the array. This ensures that the original distribution of indicator variables is maintained within each replicate, which is equivalent to randomly shuffling the multivariate observations back into the original sets of treatment conditions. For further detail on this code, please see Cassell (2002).

%macro randomize( 
    indata=_last_, 
    outdata=Randoutput, 
    treat=y, 
    numreps=1000, 
    seed=1337);

proc sql noprint;
    select count(*) into :numrecs from &INDATA;
quit;

data temp1;
    retain seed &SEED;
    drop seed;
    set &INDATA;
    do replicate=1 to &NUMREPS;
        call ranuni(seed,rand_dep);
        output;
    end;
run;


proc sort data=temp1;
    by replicate rand_dep;
run;

data &OUTDATA;
    array deplist{&NUMRECS} _temporary_;
    set &INDATA(in=in_orig) temp1(drop=rand_dep);
    if in_orig then do;
        replicate=0;
        deplist{_n_}=&TREAT;
    end;
    else &TREAT=deplist{1+mod(_n_,&NUMRECS)};
run;

%mend randomize;


Calling the randomization macro to run 1,000 replications would look like this:

%randomize(indata=grasshoppers, outdata=Randoutput, treat=Treatment, numreps=1000, seed=1337)

Next, you must code in the new independent variables for each observation in the output data set &RANDOUTPUT based on the randomly assigned indicator variable. To do this, you must create new variables corresponding to each of your original independent variables. In this example, you might choose NewTemp and NewFood.


data Randoutput;
    set Randoutput;
    if Treatment=1 or Treatment=2 then NewTemp=28;
    else if Treatment=3 or Treatment=4 then NewTemp=32;
    else NewTemp=36;
    if Treatment=1 or Treatment=3 or Treatment=5 then 	NewFood='Low';
    else NewFood='High';
run;



The goal now is to perform a MANOVA on each replicate in Randoutput and compare the original results with the set of 1,000 results from the randomized data. SAS stores multivariate statistics in a data set called MultStat. However, MultStat gets rewritten each time a MANOVA is performed, so we need to use the output delivery system (ODS) to create a new data set, &MULTISTATS. This data set will contain the test statistics that SAS produces for every effect within each replicate shuffled data set, so it must then be trimmed down to a data set, &TRIMSTATS, containing only entries with the test statistic of interest (in this case we choose Pillai’s Trace) and its associated p-value for each effect.


ods output MultStat=MultiStats;
proc glm data=Randoutput;
    by replicate;
    class NewTemp NewFood;
    model ovday eggs eggmass = newtemp newfood newtemp*newfood  /ss3 nouni;
    manova h=_all_;
run;
quit;
ods output close;

data TrimStats;
    set MultiStats;
    where Statistic="Pillai's Trace";
run;



TrimStats is now the data set that will be used in the analysis macro, %mvanalysis, which will find the percentage of replicates with p-values lower than that observed in the analysis of the original data set for the effect in question.


%macro mvanalysis(
    randdata=TrimStats,
    where=,
    testprob=ProbF,
    testlabel=F test,);


data Analysis;
    retain p numsig numtot 0;
    set &RANDDATA end=endofile;
    %if "&WHERE" ne ""
        %then where &WHERE %str(;);
    if replicate=0 then p=&TESTPROB;
    else do;
        numtot+1;
        numsig + (&TESTPROB<p);
    end;
    if endofile then do;
        ratio=numsig/numtot;
        put "Randomization test for &TESTLABEL "
        %if "&WHERE" ne "" %then "where &WHERE ";
        "has a significance level of " ratio 6.4;
    end;
run;

%mend mvanalysis;

It is important to note that when the data set &ANALYSIS is created, the variable named “p”, which corresponds to the original replicate’s p-value, must not be named any version of “pvalue” or else the macro will not work. MultStat automatically generates its own variable named “Pvalue”, which will end up in &TRIMSTATS, and SAS will assume this is the value to which you refer if you use the name “pvalue” in the retain statement.

To specify the effect you want to test, enter “Hypothesis=‘[variable name]’” after “where=”. Here, we call for an analysis of the interaction term.

%mvanalysis(randdata=TrimStats, where=Hypothesis='newtemp_newfood', testprob=ProbF, testlabel=NewTemp*NewFood test)</big> 

The results of %mvanalysis for interaction will appear in the log window.

Randomization test for NewTemp*NewFood test where Hypothesis='newtemp_newfood' has a significance level of 0.1540.

From this, you can conclude that the interaction is not significant and move on to testing the main effects.

%mvanalysis(randdata=TrimStats, where=Hypothesis='newtemp', testprob=ProbF, testlabel=NewTemp test) 

%mvanalysis(randdata=TrimStats, where=Hypothesis='newfood', testprob=ProbF, testlabel=NewFood test)


The results of %mvanalysis for newtemp and newfood will appear in the log window.

Randomization test for NewTemp test where Hypothesis='newtemp' has a significance level of 0.0000
Randomization test for NewFood test where Hypothesis='newfood' has a significance level of 0.0000


From this, we conclude that the two main effects are significant.

Data set

Below the data set used for this analysis.

data grasshoppers;
    input temp food$ ovday eggs initmass eggmass treatment newtemp newfood$;
    datalines;
28 l 51 41 4.69 0.01137 1 . .
28 l 56 42 2.63 0.01498 1 . .
28 l 43 35 2.20 0.01056 1 . .
28 l 49 33 2.55 0.01292 1 . .
28 l 50 36 2.92 0.01610 1 . .
28 l 50 37 3.28 0.01462 1 . .
28 h 46 57 3.52 0.01367 2 . .
28 h 49 33 3.12 0.02052 2 . .
28 h 42 39 2.62 0.01382 2 . .
28 h 47 48 2.72 0.02606 2 . .
28 h 44 47 3.04 0.01281 2 . .
28 h 43 50 3.02 0.02862 2 . .
28 h 47 44 2.96 0.01452 2 . .
28 h 46 49 3.38 0.01875 2 . .
28 h 42 55 4.47 0.01477 2 . .
32 l 49 27 2.93 0.01518 3 . .
32 l 40 29 2.89 0.01522 3 . .
32 l 45 29 2.56 0.01497 3 . .
32 l 45 24 2.73 0.01671 3 . .
32 l 48 29 3.31 0.01652 3 . .
32 l 44 31 2.76 0.01609 3 . .
32 l 42 27 3.12 0.01534 3 . .
32 h 36 56 2.83 0.01409 4 . .
32 h 53 25 3.00 0.01221 4 . .
32 h 35 43 2.57 0.01308 4 . .
32 h 51 17 2.47 0.01856 4 . .
32 h 34 46 3.28 0.01334 4 . .
32 h 32 67 3.79 0.01368 4 . .
32 h 33 40 4.15 0.01654 4 . .
36 l 47 29 3.01 0.01649 5 . .
36 l 38 24 2.51 0.01344 5 . .
36 l 41 20 2.44 0.01522 5 . .
36 l 41 22 2.83 0.01401 5 . .
36 l 46 24 3.19 0.01640 5 . .
36 l 32 22 4.57 0.01504 5 . .
36 h 38 35 2.85 0.01600 6 . .
36 h 35 51 2.95 0.01523 6 . .
36 h 40 32 2.69 0.01429 6 . .
36 h 32 50 3.14 0.02512 6 . .
36 h 25 60 4.89 0.01559 6 . .
36 h 32 52 2.91 0.02461 6 . .
36 h 26 51 2.95 0.02379 6 . .
;
run;


Colleen R. Stephens (colleenrstephens@gmail.com)

Steven A. Juliano (sajulian@ilstu.edu)

School of Biological Sciences

Illinois State University

Normal, IL USA


References

Cassell, D.L. 2002. A Randomization-test Wrapper for SAS PROCs. SUGI 27. Paper 251-27.
(http://www2.sas.com/proceedings/sugi27/p251-27.pdf )

Manly, B.F.J. 2011. Heritage Windows Programs from the RT Package. Western EcoSystems Technology Inc., Cheyenne, Wyoming.
(http://www.west-inc.com/computerprograms.html )

Anderson, M.J. 2005. PERMANOVA: Permutational multivariate analysis of variance. Department of Statistics, University of Auckland, New Zealand.
(http://www.stat.auckland.ac.nz/~mja/prog/PERMANOVA_UserNotes.pdf )