Forest Plotting Analysis Macro %FORESTPLOT

From sasCommunity

Abstract

The forest plot is a powerful and versatile tool for visually presenting model estimates from a multivariable analysis, or for illustrating association measures for a key factor of interest across various models or subgroup analyses. Seeing whether two values differ significantly from each other, or whether a covariate level differs significantly from a reference value, is much simpler in a forest plot than when sifting through numbers in a report table. The amount of data preparation needed to build a high-quality forest plot in SAS can be tremendous: the programmer must run the analyses, extract the estimates to be plotted, and structure the estimates in a format conducive to generating a forest plot. The code required for this process is often replicated for multiple models, which is inefficient and prone to errors. While some SAS procedures can produce forest plots automatically using Output Delivery System (ODS) graphics, the plots are generally not publication ready and are difficult to customize even if the programmer is familiar with the Graph Template Language. The FORESTPLOT macro is designed to perform all of the steps of building a high-quality forest plot efficiently and automatically. It currently performs regression analyses common in clinical oncology research (Cox proportional hazards and logistic models) and calculates Kaplan-Meier event-free rates and binomial success rates. Additionally, to improve flexibility, the user can specify a pre-built data set to transform into a forest plot if the automated analysis options of the macro do not fit the user's needs.

Version Control

Platforms

Fully Tested: Linux
Partially Tested: Windows

SAS Versions

9.2: Yes
9.3: Yes
9.4: Yes

Online Materials

SAS Global Forum 2015

View the PDF for this paper: Media:3419-2015.pdf

View the PowerPoint Slides: Media:3419-2015_presentation.pptx

Version of the macro: Media:Forestplot_sgf2015.sas

Key Features

Automates Analysis

The FORESTPLOT macro automates several different types of analyses to save the user time and effort in generating a plot-ready data set. The METHOD parameter determines which type of analysis is used. Multiple models can be run with one macro call. %FORESTPLOT automates the following analyses using the listed procedures:

  • Cox Proportional Hazards Regression (PROC PHREG)
  • Logistic Regression (PROC LOGISTIC or PROC GENMOD)
  • Kaplan-Meier Event-Free Rates (PROC LIFETEST)
  • Binomial Success Rates (PROC FREQ)

There are many parameters for customizing the automated analysis, including:

  • Stratification where available
  • Categorical factors
      • Reference values
      • Order of values
  • Continuous factors
      • Step size is customizable
  • Replicating analysis using a BY variable
      • Order of BY levels customizable
  • Ability to subset with a WHERE clause without making a new data set
  • Ability to indicate censor value or event value

Displays Statistics from Analysis in Plot

The macro can display different statistics from the automated analysis within the graph itself.

Cox Proportional Hazards Regression:

  • Hazard ratios and 95% confidence interval
  • Number of events and patients
  • P-values (stratified or unstratified)
      • Global tests: Score, likelihood ratio, and Wald
      • Type-3 tests: Score, likelihood ratio, and Wald
      • Covariate level comparisons: Wald
  • Concordance index and 95% confidence interval

Logistic Regression:

  • Odds ratios and 95% confidence interval
  • Number of events and patients
  • P-values (stratified (PROC LOGISTIC only) and unstratified)
      • Global tests (PROC LOGISTIC only): Score, likelihood ratio, and Wald
      • Type-3 tests: Likelihood ratio (PROC GENMOD only) and Wald
      • Covariate level comparisons: Wald
  • Concordance index and 95% confidence interval

Kaplan-Meier Event-Free Rates:

  • Event-free rates and 95% confidence intervals at specified time-points
  • Number of patients and events for entire Kaplan-Meier curve
  • P-values (stratified and unstratified) for entire Kaplan-Meier curves
      • Logrank and Wilcoxon

Binomial Success Rates:

  • Success rate and 95% confidence interval
  • Number of successes and number of patients
  • P-values
      • Chi-square test and Fisher's exact test

Builds its Own Plot Data Set

The FORESTPLOT macro pulls the results from the procedures it runs and inserts them into a final data set used for plotting. The data set is built according to the following logic:

  • Each row of the data set corresponds to a row of the forest plot
  • The weight and indentation of the row headers are defined by indicator variables
  • The first row of the data set is used to set up the column headers
  • Multiple formats are created for each statistic that can be displayed

Figure: Excerpt from the paper showing how the plot data set mirrors the graph (image: Dataset image example.jpg)

The variables shown in the data set serve the following purposes:

  • SUBIND: Determines how many indentations the row header has (1=1 indent, 2=2 indents, etc.)
  • BOLDIND: Determines if the row header is bold weight (1) or normal weight (0)
  • SUBTITLE: Text to serve as the row header
  • ESTIMATE/LCL/UCL: Estimate and lower/upper bounds for the scatterplot markers
  • EV_T: Number of events and patients formatted together into one column
  • OR_EST_RANGE: Odds ratio estimate and 95% confidence interval formatted together into one column
  • PVAL: Displays any requested p-values
      • Depending on the analysis method, can hold global p-values, type-3 p-values, and covariate level comparison p-values
      • Footnote markers are automatically added to mark the type of p-value (this can be disabled)

The data set can be output from the macro, modified, and then fed back into the macro or output as a text table instead.

Customizable Graph

Nearly all parts of the graph are customizable with multiple options. The following sections cover options associated with common components.

Row Header Text

  • Size, weight, and indentations
  • Automatic or custom labels
  • Subtitles for each model

Estimate Points and Confidence Bounds

The following are modifiable on a model-by-model basis

  • Estimate point color, size and symbol type
  • Confidence bound line color and thickness

The following are modifiable on a graph-wide basis

  • Confidence bounds line caps

Display Statistics

  • Text size and weight
  • Multiple formats for each statistic
      • Estimates (hazard ratios, odds ratios, concordance indexes, event-free rates, and success rates) and confidence intervals
          • Estimate (##.##), lower limit (##.##), and upper limit (##.##)
          • Range: lower limit-upper limit (##.##-##.##)
          • Estimate range (##.## (##.##-##.##))
      • Number of patients and number of events
          • Number of patients and number of events separately (###)
          • Number of events/number of patients (###/###)
          • Percent of events (##.#%)
          • Number of events/number of patients and percent (###/### (##.#%))
      • P-values
          • P-value format (#.####), values less than 0.0001 shown as <0.0001
          • Automatic footnotes can be enabled/disabled to indicate p-value type

Statistical Column Headers

  • Text size and weight
  • Text manually modifiable
  • Can be split into multiple rows

Titles/Footnotes

  • Text size and weight
  • Can be split into multiple rows
  • Superscripts/subscripts/Unicode available

X-axis

  • Two axis types
      • Linear
      • Log
  • Customizable minimum, maximum and increments
  • Customizable label
  • 3 different bases available for log axes (e, 2, and 10)
  • Label and tick value text size and weight

Automated Analysis

Cox Proportional Hazards Regression

The PHREG procedure is used to perform Cox proportional hazards regression modeling. Models can also be stratified using the STRATA parameter.

Hazard Ratios

Hazard ratios are generated using the HAZARDRATIO statement and are output with an ODS OUTPUT statement specifying the HAZARDRATIOS data set. The reference group for the categorical covariates is determined by the CATREF parameter, and the step size for continuous covariates is determined by the CONTSTEP parameter.
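The following is a minimal sketch of the kind of PROC PHREG call described above, not the macro's actual code. It assumes the SASHELP.BMT sample data set (variables T, Status, and Group) as a stand-in for the user's data. The reference level 'ALL' plays the role that CATREF plays in the macro, and the output data set name WORK.HR is arbitrary.

```sas
/* Capture the hazard ratio table as a data set */
ods output HazardRatios=work.hr;

proc phreg data=sashelp.bmt;
  /* Group is categorical; 'ALL' is the chosen reference level */
  class Group (ref='ALL');
  /* Status=0 marks censored observations */
  model T*Status(0) = Group;
  /* Compare each non-reference level against the reference level */
  hazardratio 'Group vs ALL' Group / diff=ref cl=wald;
run;
```

For a continuous covariate, the UNITS= option of the HAZARDRATIO statement sets the step size for the reported ratio, which is presumably what CONTSTEP controls.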

Number of Patients and Events

The numbers of patients and events are obtained in two different ways. A cumulative number of patients and events for the entire model is generated from the ODS OUTPUT statement specifying the CENSOREDSUMMARY data set. The macro can also calculate the number of patients and events for each level of a categorical covariate by pulling these numbers with a SQL procedure query.
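As a rough illustration of the two approaches, the sketch below again assumes SASHELP.BMT as stand-in data. The output data set names and the column names TOTAL and EVENTS are arbitrary choices, and the macro's own query may treat missing values differently.

```sas
/* Model-level totals: event and censored counts for the whole model */
ods output CensoredSummary=work.model_n;
proc phreg data=sashelp.bmt;
  class Group;
  model T*Status(0) = Group;
run;

/* Per-level totals: patients and events within each level of Group */
proc sql;
  create table work.level_n as
  select Group,
         count(*)        as total,   /* patients in this level */
         sum(Status = 1) as events   /* a true comparison evaluates to 1 */
  from sashelp.bmt
  group by Group;
quit;
```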

P-Values

The macro can calculate three levels of p-values: a global model test, type 3 tests for each covariate, and pairwise comparisons within a given covariate when one level is set as the reference group. The Score, likelihood-ratio, and Wald p-values are available for the global model test and are output using the ODS OUTPUT statement specifying the GLOBALTESTS data set. The Score, likelihood-ratio, and Wald p-values are also available for the type 3 tests and are output using the ODS OUTPUT statement specifying either the TYPE3 data set or the MODELANOVA data set depending on the SAS version (a later release of 9.4 changed the data set name). The Wald p-value is available for the individual covariate tests and is output using the ODS OUTPUT statement specifying the PARAMETERESTIMATES data set. Stratified p-values are used automatically when the STRATA parameter is specified.
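A hedged sketch of capturing the global and covariate-level tests, using SASHELP.BMT as stand-in data and arbitrary output data set names. The type 3 table is left out of the ODS OUTPUT statement here because, as noted above, its name depends on the SAS release.

```sas
/* Global tests and per-parameter Wald tests as data sets */
ods output GlobalTests=work.global
           ParameterEstimates=work.parms;

proc phreg data=sashelp.bmt;
  class Group (ref='ALL');
  model T*Status(0) = Group;
  /* Type 3 tests would be captured by adding Type3= or ModelANOVA=
     to the ODS OUTPUT statement, whichever the installed release uses */
run;
```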

Concordance Indexes

The concordance index for Cox proportional hazards regression is not automatically computed by any SAS procedure, and there is no universally accepted macro for this purpose. The method for calculating concordance indexes used by the survConcordance function [1] in the R survival package developed by Terry Therneau is widely recognized, and was therefore chosen as the basis for the SAS calculation code included in this macro. The method uses a binary tree approach to calculate the weights, the sum of squares, and eventually the standard error. The model predicted values used in the binary tree method are taken from the OUTPUT statement defining the XBETA variable.

Logistic Regression

Either the LOGISTIC procedure or the GENMOD procedure can be used for logistic regression. The procedure used is determined by the LOGPROC parameter.

Odds Ratios

There are two different methods that are used depending on whether the covariate is categorical or continuous. Odds ratios for categorical covariates are calculated with the LSMEANS statement along with the DIFF, CL, and EXP options. These odds ratios are then output using an ODS OUTPUT statement specifying the DIFFS data set. Odds ratios for continuous covariates are calculated with the ESTIMATE statement along with the CL (if LOGISTIC procedure) and EXP options. The odds ratios are then output using an ODS OUTPUT statement specifying the ESTIMATES data set.
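The two approaches can be sketched as follows for the LOGISTIC procedure. This is an illustration only: it assumes the SASHELP.HEART sample data set (Status, Sex, AgeAtStart), a step size of 10 for the continuous covariate, and arbitrary output data set names.

```sas
/* Diffs holds the categorical odds ratios; Estimates holds the continuous one */
ods output Diffs=work.or_cat
           Estimates=work.or_cont;

proc logistic data=sashelp.heart;
  /* LSMEANS requires GLM parameterization of CLASS variables */
  class Sex / param=glm;
  model Status(event='Dead') = Sex AgeAtStart;
  /* Differences of LS-means on the logit scale, exponentiated to odds ratios */
  lsmeans Sex / diff cl exp;
  /* Odds ratio for a 10-unit increase in the continuous covariate */
  estimate 'AgeAtStart per 10 years' AgeAtStart 10 / cl exp;
run;
```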

Number of Patients and Events

The numbers of patients and events are obtained in two different ways. A cumulative number of patients and events for the entire model is generated from the ODS OUTPUT statement specifying the RESPONSEPROFILE data set. The macro can also calculate the number of patients and events for each level of a categorical covariate by pulling these numbers with a SQL procedure query.

P-Values

The p-values available depend on which procedure is used for analysis.

Type 3 tests for each covariate and pairwise comparisons within a given covariate are available when using the GENMOD procedure. The likelihood-ratio and Wald p-values are available for the type 3 tests, and are output using the ODS OUTPUT statement specifying either the TYPE3 data set or the MODELANOVA data set depending on the SAS version (a later release of 9.4 changed the data set name). The Wald p-value is available for the individual covariate tests, and is output using the ODS OUTPUT statement specifying the PARAMETERESTIMATES data set. Stratified p-values are not available when using the GENMOD procedure in this macro because the GENMOD procedure can only do exact stratified analyses, which this macro does not automate.

Global tests, type 3 tests, and individual covariate tests are available when using the LOGISTIC procedure. The likelihood-ratio and Wald p-values are available for the global model test, and are output using the ODS OUTPUT statement specifying the GLOBALTESTS data set. The Wald p-value is available for the type 3 tests, and is output using the ODS OUTPUT statement specifying either the TYPE3 data set or the MODELANOVA data set depending on the SAS version (a later release of 9.4 changed the data set name). The Wald p-value is available for the individual covariate tests, and is output using the ODS OUTPUT statement specifying the PARAMETERESTIMATES data set. Stratified p-values are used automatically when the STRATA parameter is specified.

Concordance Indexes

The method for calculating concordance indexes follows the methods described in a paper by JA Hanley and BJ McNeil [2]. While the concordance index for logistic regression can be output automatically from the LOGISTIC procedure, the standard error is not. Without a standard error, the confidence bounds for the concordance index cannot be calculated. The paper by Hanley and McNeil provides a method for calculating the standard error that has been commonly used within the Biomedical Statistics and Informatics division at Mayo Clinic. The model predicted values used in this method are taken from the OUTPUT statement defining the XBETA variable.
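For orientation, the sketch below applies the commonly quoted Hanley and McNeil approximation for the standard error of an area under the ROC curve. It is an assumption that this is the variant the macro implements. The concordance value and group sizes are made-up inputs, and all variable names are hypothetical.

```sas
data work.cindex_ci;
  /* Made-up inputs for illustration only */
  c          = 0.75;   /* concordance index (area under the ROC curve) */
  n_event    = 50;     /* number of patients with the event */
  n_nonevent = 100;    /* number of patients without the event */

  /* Hanley-McNeil intermediate quantities */
  q1 = c / (2 - c);
  q2 = 2 * c**2 / (1 + c);

  /* Standard error of the concordance index */
  se = sqrt( ( c*(1 - c)
             + (n_event    - 1)*(q1 - c**2)
             + (n_nonevent - 1)*(q2 - c**2) )
             / (n_event * n_nonevent) );

  /* Normal-approximation 95% confidence bounds */
  z   = probit(0.975);
  lcl = c - z*se;
  ucl = c + z*se;
run;
```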

Kaplan-Meier Event-Free Rates

The LIFETEST procedure is used to calculate the event-free rates.

Event-Free Rates

The TIMELIST option is used to specify the time-point for the event-free rate, and the OUTSURV option in combination with the REDUCEOUT option is used to output the event-free rates to a data set.

Number of Patients and Events

An ODS OUTPUT statement specifying the CENSOREDSUMMARY data set is used to output the number of patients and number of events.
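A minimal sketch of these options, assuming SASHELP.BMT as stand-in data (T is recorded in days) and time-points of 365 and 730 days chosen purely for illustration:

```sas
/* Counts of patients, events, and censored observations per group */
ods output CensoredSummary=work.km_n;

proc lifetest data=sashelp.bmt
              timelist=365 730        /* time-points of interest */
              reduceout               /* keep only the TIMELIST rows in OUTSURV= */
              outsurv=work.km_rates   /* event-free rates and confidence limits */
              plots=none;
  time T*Status(0);
  strata Group;
run;
```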

P-Values

The logrank test and the Wilcoxon test can be called within the macro for Kaplan-Meier event-free rates when a BY parameter is specified. These are generated with a TEST option within a STRATA statement, and output with an ODS OUTPUT statement specifying the HOMTESTS data set. Stratified versions of these p-values are generated when the STRATA parameter is used. These are calculated slightly differently by running a second LIFETEST procedure call and specifying the STRATA variables within the STRATA statement. A GROUP option is then used specifying the BY variable. The p-values are output in the same method as the unstratified versions.
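Both variants can be sketched as below. SASHELP.BMT is assumed as stand-in data with Group in the role of the BY variable. The SITE variable is an artificial stratification factor created only so the second call has something to stratify on.

```sas
/* Unstratified: Group is the STRATA variable itself */
ods output HomTests=work.km_p;
proc lifetest data=sashelp.bmt plots=none notable;
  time T*Status(0);
  strata Group / test=(logrank wilcoxon);
run;

/* Artificial stratification variable, for illustration only */
data work.bmt_site;
  set sashelp.bmt;
  site = mod(_n_, 2);
run;

/* Stratified: SITE defines the strata, GROUP= names the compared groups */
ods output HomTests=work.km_p_strat;
proc lifetest data=work.bmt_site plots=none notable;
  time T*Status(0);
  strata site / group=Group test=(logrank wilcoxon);
run;
```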

Binomial Success Rate

The FREQ procedure is used to calculate binomial success rates.

Success Rates

The TABLES statement is used with the BIN option to generate the binomial success rates. A data step is run beforehand to create a variable that contains the counts for each level of the binomial variable. This variable is then used in a WEIGHT statement with the ZEROS option (which forces PROC FREQ to include counts of zero) to avoid errors that can arise with 0 percent and 100 percent success rates. The estimates are then output using an OUTPUT statement with the BIN option.
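The zero-count handling can be sketched as follows. The counts are made up to show a 100 percent success rate, and the data set and variable names are hypothetical.

```sas
/* One row per level of the binomial variable, including a zero count */
data work.counts;
  input response $ count;
  datalines;
Yes 12
No 0
;
run;

proc freq data=work.counts;
  /* ZEROS keeps the level with a count of zero in the table */
  weight count / zeros;
  /* Binomial proportion for the 'Yes' level */
  tables response / binomial(level='Yes');
  /* Write the proportion and its confidence limits to a data set */
  output out=work.rate binomial;
run;
```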

Number of Patients and Successes

The data set created in the Success Rates step above also contains the number of patients. Using this along with the success rate, both the number of patients and the number of successes can be calculated.

P-Values

When a BY parameter is specified, a p-value can be computed to test whether the success rates differ significantly across groups. Either a Chi-square p-value or a Fisher exact p-value can be calculated, and these are computed by adding the CHISQ and FISHER options to the TABLES statement. These are output to the same data set mentioned in the Success Rates step above by adding the CHISQ and FISHER options to the OUTPUT statement.
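A small sketch of the options involved, assuming SASHELP.HEART as stand-in data with Sex in the role of the BY variable and Status as the binomial outcome:

```sas
proc freq data=sashelp.heart;
  /* Group variable by binomial outcome */
  tables Sex*Status / chisq fisher;
  /* The output data set holds the test statistics and p-values,
     for example P_PCHI (chi-square) and XP2_FISH (two-sided Fisher) */
  output out=work.pvals chisq fisher;
run;
```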

Pre-Generated Data

The statistics displayed for pre-generated data all come from the input data itself. The variables containing these statistics are pointed to using several macro parameters. The format that the input data set is required to be in is described in section 5.

Estimates

The VESTIMATE, VLCL, and VUCL parameters point to the estimate, lower confidence limit and upper confidence limit respectively. These must be numeric variables.

Number of Patients and Events

The VTOTAL and VEVENTS parameters are optional and point to the number of patients and the number of events respectively. These must be numeric columns.

P-values

The VPVAL parameter is an optional parameter that points to p-values. This can be a numeric or character variable.

Other Statistics

The VOTHER parameter exists to give the flexibility to add statistics not covered by the macro to the plot. These can be numeric or character columns, and multiple variables can be pointed to by specifying a list separated by spaces. The column headers for these variables are determined by the VOTHERLABELS parameter.
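As an illustration of what a pre-generated input data set might look like, the sketch below builds two rows by hand. All values are invented, and the data set and variable names are hypothetical.

```sas
/* Hand-built input data: one row per row of the intended forest plot */
data work.pregen;
  length label $40 pvalue $10;
  infile datalines dsd;   /* comma-delimited, so labels may contain blanks */
  input label $ est lower upper n events pvalue $;
  datalines;
Age >= 65 vs < 65,1.34,0.25,4.20,300,245,0.6100
Female vs Male,0.80,0.55,1.16,300,245,0.2400
;
run;
```

With a data set like this, the parameters described above would be set to VESTIMATE=est, VLCL=lower, VUCL=upper, VTOTAL=n, VEVENTS=events, and VPVAL=pvalue. Here PVALUE is stored as character, which VPVAL allows.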

Data Set Construction

The plot data set is constructed using the SQL procedure. A blank data set is created with the columns needed using a CREATE TABLE statement, and then macro loops are combined with INSERT statements to add rows to the plot data set. In the case of calculated analysis, the INSERT statements make use of the SET clause in combination with subqueries to pull values from the data sets output by the analysis procedures. When specifying pre-generated data, the INSERT statements make use of the SELECT statement to pull a query from the input data set.
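The two INSERT forms can be sketched as follows. This is a simplified illustration without the macro loops. WORK.HR_OUT is a hypothetical stand-in for a procedure output data set, and the plot data set here carries only a handful of the real columns.

```sas
/* Hypothetical stand-in for a data set produced by an analysis procedure */
data work.hr_out;
  length level $20;
  level = 'B vs A'; hr = 1.34; lower = 0.25; upper = 4.20;
run;

proc sql;
  /* Blank plot data set with the needed columns */
  create table work.plot
    (subtitle char(40), subind num, boldind num,
     estimate num, lcl num, ucl num);

  /* Header row: label only, bold, no indentation */
  insert into work.plot
    set subtitle='Treatment Arm', subind=0, boldind=1;

  /* SET form: each value is pulled with a scalar subquery */
  insert into work.plot
    set subtitle='B vs A', subind=1, boldind=0,
        estimate=(select hr    from work.hr_out where level='B vs A'),
        lcl     =(select lower from work.hr_out where level='B vs A'),
        ucl     =(select upper from work.hr_out where level='B vs A');

  /* SELECT form, as used for pre-generated data */
  insert into work.plot (subtitle, subind, boldind, estimate, lcl, ucl)
    select level, 1, 0, hr, lower, upper
    from work.hr_out;
quit;
```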

Variable Construction

The variables that make up the plot data set are separated into different subtypes.

Row headers

Each row of the forest plot has a row header, or subtitle, to show the variable label or value. The row header is contained within one variable:

  • SUBTITLE: Contains the row header such as variable label, variable level, group label, group level, or model title.

Estimates and Confidence Limits

There are numeric variables for each calculated estimate, such as a hazard ratio or odds ratio, and for the upper and lower confidence limits. These variables are used for making the scatterplot and error bars in the plot.

  • ESTIMATE: Calculated ratio/estimate/rate
  • LCL: Calculated lower 95% confidence limit
  • UCL: Calculated upper 95% confidence limit

When running a logistic model, odds ratios and concordance indexes are both calculated, so the data set contains additional variables to hold both of these statistics. These variables have the same names but with a prefix. For example, the variables for hazard ratios would be HR_ESTIMATE, HR_LCL, and HR_UCL. Whichever statistic is being plotted is then copied into the ESTIMATE, LCL and UCL variables. Each of these variables also has a character variable equivalent. The character versions of these columns are specially formatted to be used for displaying statistics in the plot. There are additional variables that are combinations of these variables:

  • RANGE: LCL - UCL. Example: 0.25-4.20
  • EST_RANGE: ESTIMATE (LCL - UCL). Example: 1.34 (0.25-4.20)
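One way such combined character variables could be built in a data step is sketched below. The 8.2 format and the variable lengths are assumptions, not the macro's actual code.

```sas
data work.est_fmt;
  length range est_range $40;
  estimate = 1.34; lcl = 0.25; ucl = 4.20;

  /* CATX strips the leading blanks left by PUT and joins with the delimiter */
  range     = catx('-', put(lcl, 8.2), put(ucl, 8.2));
  est_range = catx(' ', put(estimate, 8.2), cats('(', range, ')'));
run;
```

With these inputs RANGE is 0.25-4.20 and EST_RANGE is 1.34 (0.25-4.20), matching the examples above.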

Number of Patients and Events

The plot data set holds numeric versions of the patient and event counts as well as character versions for use in the summary statistics panel of the plot. Additional variables with space-saving combined formats can also be used in the summary statistics panel:

  • TOTAL: Total number of patients within model/group
  • EVENTS: Number of events or successes (for binomial) within model/group
  • EV_T: Events/Total. Example: 245/300
  • PCT: Percentage %. Example: 60%
  • EV_T_PCT: Events/Total (Percentage %). Example: 245/300 (81.7%)
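A hedged sketch of how these combined count variables could be derived in a data step. The 8.1 format and the variable lengths are assumptions, not the macro's actual code.

```sas
data work.count_fmt;
  length ev_t $20 pct $10 ev_t_pct $40;
  events = 245; total = 300;

  /* Events/Total */
  ev_t = catx('/', events, total);
  /* Percentage with one decimal place */
  pct = cats(put(100*events/total, 8.1), '%');
  /* Events/Total (Percentage %) */
  ev_t_pct = catx(' ', ev_t, cats('(', pct, ')'));
run;
```

With 245 events out of 300 patients this gives 245/300, 81.7%, and 245/300 (81.7%).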

P-Values

There is one variable in the plot data set for the p-values. This is a character variable, and the macro can recognize that a footnote exists if the variable is in the following format:

  • #.####"{sup "4"}". Example: 0.0032"{sup "F"}" is displayed as 0.0032 followed by a superscript F

Plot Indicators

There are three different indicator variables in the plot data set to serve the following purposes:

  • SUBIND: determines the number of indentations of the subtitle
  • BOLDIND: determines if the subtitle is bold or normal weight
  • SHADEIND: determines if the row in the plot has a shaded background when SHADING=2

These variables are either automatically calculated by the macro for automated analyses, or can be provided in the input data set for METHOD=data by using the VSUBIND, VBOLDIND, and VSHADEIND parameters.

Reasoning for this Format

The data set is designed to be a tabular view of the forest plot: each row in the data set corresponds to a row in the graph, and the columns of the data set align with the columns of the graph (see the figure in the Key Features section).

Graph Template Language

(To be added)

Example Images

(Work in progress)

Example 1: Example of listing multiple models within one forest plot

Example1.jpeg


Contact Info

Personal E-mail: jpmeyers.spa@gmail.com

--Jpmeyers (talk) 18:49, 12 October 2014 (CDT)