Good Programming Practice for Clinical Trials

From sasCommunity

The following are draft recommendations originating from the SAS Good Programming Practice used by a UCB Pharma Global Statistical Programming group, as of October 1, 2008.

Feel free to comment on the related "Discussion" tab/page and/or to contribute changes directly to this page (preferably with a rationale).

The purpose is to encourage contributions from across companies in an attempt to create a consensus recommendation. The ambition is for this page to become recognized by the management teams of Statistical Programming in the Pharmaceutical Industry and by Regulatory Authorities.

The hope is that the Practice can be reviewed and endorsed by the relevant management teams of several Pharmaceutical companies and major Contract Research Organizations at the PharmaSUG (June 2009) and PhUSE (October 2009) conferences. Timelines will be posted here as soon as they are known.

Disclaimer

  • This article reflects the opinions of its authors and contributors at the time of writing, and not necessarily the view of their respective companies.

Introduction

The Good Programming Practices are defined in order to:

  • Ensure the clarity of the code and facilitate code review;
  • Save time in case of maintenance, and ease the transfer of code among programs or programmers;
  • Minimize the need for code maintenance by robust programming;
  • Minimize the development effort by development and re-use of standard code and by use of dynamic (easily adaptable) code;
  • Minimize the resources needed at execution time (improve the efficiency of the code);
  • Prevent as far as possible the risk of logical errors.

Note: The guidelines below may conflict with one another if applied too rigidly. Clarity, efficiency, re-usability, adaptability and robustness of the code are all important, and must be balanced in the programming practice.


Readability and Maintainability

Language

  • SAS code and comments must be written in English.

Header and Revision History

  • Include a header for every program (template below).
**********************************************************;
* Program name      :
*
* Author            :
*
* Date created      :
*
* Study             : (Study number)
*                     (Study title)
*
* Purpose           :
*
* Template          :
*
* Inputs            :
*
* Outputs           :
*
* Program completed : Yes/No
*
* Updated by        : (Name) – (Date): 
*                            (Modification and Reason)
**********************************************************;
  • In addition to your name or initials, use your login ID to identify yourself in the header, so that there is no ambiguity about the identity of each programmer.
  • Update the revision history at each code modification made after the finalization of the first version of a program.

Note: When you copy a program from another study, you become the author of this program, and you should clear the revision history. You can specify the origin of the program under the “Template” section of the header.

Comments

  • Include a comment before each major DATA/PROC step, especially when you are doing something complex or non-standard. Comments should be comprehensive, and should describe the rationale and not simply the action. For example, do not comment "Access demography data"; instead explain which data elements and why they are needed.
  • Organize the comments into a hierarchy.
  • Do not number the comments (e.g. section or step numbers).

Reason: Numbered comments all have to be renumbered whenever a section is removed or inserted.
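
Example (a sketch only; the RAWDATA libref and the dataset and variable names are assumptions made for illustration): a section-level banner followed by a step-level comment that gives the rationale rather than the action, with no numbering to maintain.

```sas
**********************************************************;
* DEMOGRAPHY                                              ;
**********************************************************;

*--- Age is summarized at informed consent, so only the ---;
*--- birth and consent dates are read from demography   ---;
data demo01;
   set rawdata.demog (keep=patno brthdt icdt);
run;
```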

Naming Conventions

  • Use explicit names for variables and datasets, with a maximum length of 8 characters.
  • For permanent datasets, use a meaningful dataset label and variable labels.
  • Where possible, do not use the same dataset name more than once in a program.

Note: However, keep in mind that large intermediate files take up a lot of space in the SAS WORK library.

  • Name each IN= variable using “in” plus a meaningful reference to the dataset.

Example:

data aelst;
   merge aesaes (in=inae) patpat (in=inpat);
   by patno;
   if inae and inpat;
run; 
  • Labels must have a maximum length of 40 characters.

Code Structure

  • It is mandatory to include libnames, options and formats in a separate setup program (e.g. init.sas), unless these are temporary formats or temporary options that are reset after being used.

Reason: It will guarantee that changes of the environment are taken into account in all programs run afterwards.
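
Example (a minimal sketch of such a setup program; the librefs, the paths and the choice of options are placeholders, not a prescribed configuration):

```sas
* init.sas: common environment, included at the top of every program;
libname rawdata "/path/to/study/raw" access=readonly;
libname addata  "/path/to/study/ad";
libname fmtlib  "/path/to/study/formats";

options msglevel=i fmtsearch=(fmtlib work);
```

Each program would then start with a single %INCLUDE of this file, so that a change of path or option made here reaches every program on its next run.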

  • Use standard company macros (e.g. _empty, _savdata and _savlog for AD programs and _run for reporting programs).
  • Write one statement per line; several short statements may share a line if they are repeated or closely related. Long statements should be split across multiple lines.
  • Use a standard sequence for placing statements and group like statements together.
  1. Within a program:
    1. %LET statements and macro definitions
    2. Input steps
    3. Calculations
    4. Save final (permanent) datasets and created outputs
  2. Within a DATA step:
    1. All non-executable statements first (e.g. ATTRIB, LENGTH, KEEP...)
    2. All executable statements next

Reason: It increases the readability of the program.
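
Example (a sketch of the DATA step ordering only; the RAWDATA.VITALS dataset, its variables and the BMI cut-off are assumptions made for illustration):

```sas
data vital01;
   * Non-executable statements first;
   length bmicat $11;
   attrib bmi label="Body mass index (kg/m2)" format=5.1;
   keep patno visit bmi bmicat;

   * Executable statements next;
   set rawdata.vitals (keep=patno visit weight height);
   if height > 0 and not missing(weight) then bmi = weight / (height/100)**2;
   if missing(bmi)  then bmicat = "Missing";
   else if bmi < 25 then bmicat = "Below 25";
   else                  bmicat = "25 or above";
run;
```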

  • Left-justify DATA, PROC and OPTIONS statements, and indent all statements within them.

Example:

proc means data=osevit;
   var prmres;
   by prmcod treat;
run; 
  • End every DATA/PROC step with a left-aligned RUN statement.

Reason: It explicitly defines the step boundary.

  • Insert at least one blank line after each RUN statement in DATA/PROC steps.
  • Indent statements within a DO loop, and align END with DO.
  • Avoid too many nested DO loops and IF-ELSE statements.
  • For nested DO loops, add a comment at the start (DO) and end (END) of each loop.

Example:

data test01;
  do patno=1 to 40; * cycle thru patients;
    do visit=1 to 3; * cycle thru visits;
      output; 
    end; * cycle thru visits;
  end; * cycle thru patients;
run;
  • Insert parentheses in meaningful places in order to clarify the sequence in which mathematical or logical operations are performed.

Example:

data test02;
  set test01;
  if (visit=0  and vdate lt adate1) 
  or (visit=99 and vdate gt adate2) then delete;
run;

Efficiency

  • When you read or write a SAS dataset, use the KEEP= dataset option (preferred to DROP=) to keep only the needed variables.

Reason: With KEEP= on an input dataset, the SAS system loads only the specified variables into the Program Data Vector; all other variables are never loaded. A KEEP statement, by contrast, only affects the output dataset.
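
Example (a sketch; the AETERM and AESER variables are assumed for illustration): the input KEEP= lists every variable the step needs, and the output KEEP= drops the one that was needed only for subsetting.

```sas
data aelst02 (keep=patno aeterm);
   set aesaes (keep=patno aeterm aeser);
   where aeser = "Y";
run;
```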

  • When subsetting a SAS dataset, use a WHERE statement rather than a subsetting IF, if possible.

Reason: WHERE selects observations before they are loaded into the Program Data Vector, whereas IF is evaluated only after each observation has been read in.
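
Example (a sketch using the PATVIS01 dataset from later on this page; the output dataset names are arbitrary): both steps produce the same observations, but only the second avoids loading the observations that are discarded.

```sas
* Subsetting IF: every observation is read, then tested;
data vis1a;
   set patvis01;
   if visit = 1;
run;

* WHERE: only the matching observations are read into the step;
data vis1b;
   set patvis01;
   where visit = 1;
run;
```

Note that WHERE can only refer to variables that exist in the input dataset; a condition on a variable created within the step still requires IF.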

  • When using IF conditions, use IF/ELSE for mutually exclusive conditions, and check the most likely condition first.

Reason: The ELSE/IF will check only those observations that fail the first IF condition. With the IF/IF, all observations will be checked twice.
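
Example (a sketch; the meaning given to the visit codes and the assumption that scheduled visits are the most frequent are illustrative only):

```sas
data vis02;
   set patvis01;
   length vistype $9;
   * The most frequent case is tested first, so most observations need one test;
   if      1 <= visit <= 98 then vistype = "Scheduled";
   else if visit = 0        then vistype = "Screening";
   else if visit = 99       then vistype = "Follow-up";
   else                          vistype = "Unknown";
run;
```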

  • Avoid unnecessary sorting. The CLASS statement can be used in some procedures to obtain by-group results without sorting the data.

Example:

proc means data=osevit;
  var prmres;
  class treat;
run; 
  • If possible (i.e. not a sorting variable), use character values for categorical variables or flags instead of numeric values.

Reason: It saves space. A character “1” uses one byte (if length is set to one), whereas a numeric 1 uses eight bytes.

  • Use the LENGTH statement to reduce variable size.

Reason: Storage space can be reduced significantly. Note: Keep in mind that too short a variable length can reduce the robustness of the code (it can lead to truncation with a different set of data).
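
Example (a sketch; the AGE variable and the cut-off are assumptions made for illustration): a one-byte character flag and a shortened numeric code. Shortened numeric lengths are only safe for integers within a platform-dependent range, so they should be reserved for small codes.

```sas
data patflg01;
   * One byte for the flag and three bytes for the small integer code;
   length eldfl $1 agegrpn 3;
   set patpat (keep=patno age);
   if missing(age)  then do; eldfl = " "; agegrpn = .; end;
   else if age < 65 then do; eldfl = "N"; agegrpn = 1; end;
   else                  do; eldfl = "Y"; agegrpn = 2; end;
run;
```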

  • Use simple macros for repeating code.
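
Example (a sketch; the macro name FREQCHK and its parameters are hypothetical, not a company standard): a short macro replaces several near-identical PROC FREQ steps.

```sas
%macro freqchk(data=, var=);
   proc freq data=&data;
      tables &var / missing;
   run;
%mend freqchk;

%freqchk(data=patvis01, var=visit)
%freqchk(data=aesaes,   var=aeser)
```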

Robustness

  • Use the MSGLEVEL=I option in order to have all informational, note, warning, and error messages sent to the LOG.
  • The final code must contain no dead code, i.e. code that does not work or that is not used. Remove it from the program.
  • Code to allow checking of the program or of the data (on all data or on a subset of patients such as clean patients, discontinued patients, patients with SAE or patients with odd data) is encouraged and should be built throughout the program. This code can easily be activated during the development phase or commented out for a production run using the code described under Proposed Code for Data Checks.
  • Avoidable notes or warnings in the log are not permitted (mandatory).

Reason: They can often lead to ambiguities, confusion, or actual error (e.g. uninitialized variables, automatic numeric/character conversions, automatic formatting, operation on missing data...). Note: If such a warning message is unavoidable, an explanation has to be given in the program.
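
Example (a sketch; the RAWDATA.LABS dataset and the RESC variable are assumptions made for illustration): explicit INPUT and PUT calls avoid the automatic conversion notes. The ?? modifier also suppresses the invalid-data messages, so values that cannot be read become missing silently and should be checked separately.

```sas
data lab01;
   set rawdata.labs (keep=patno resc);
   * INPUT converts character to numeric without a conversion note;
   resn = input(resc, ?? 12.);
   * PUT converts numeric to character without a conversion note;
   patc = left(put(patno, 8.));
run;
```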

  • Always use DATA= in a PROC statement (mandatory).

Reason: It ensures correct dataset referencing, makes program easy to follow, and provides internal documentation.
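
Example (a sketch; the AETERM variable and the AESORT dataset name are assumed for illustration):

```sas
proc sort data=aesaes out=aesort;
   by patno;
run;

* Without DATA=, PROC PRINT would use the most recently created dataset;
proc print data=aesort;
   var patno aeterm;
run;
```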

  • Be careful when merging two datasets. Erroneous merging may occur when:
  1. No BY statement is specified.
  2. More than one dataset contains repeats of BY values.
  3. Some variables, other than BY variables, exist in both datasets (the values from the second dataset are kept).

The second case produces a message in the LOG and can be checked (automatically done by the _VIEWLOG macro). If a many-to-many merge is really needed, perform it with PROC SQL rather than MERGE. The last case produces a message in the LOG when the option MSGLEVEL=I is in use (this should always be the case). The first case, by contrast, produces no message in the LOG although the results can be completely different from those expected, so it requires particular attention when programming.
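
Example (a sketch; the CONMED dataset, the AETERM and CMTERM variables and the output dataset names are assumptions made for illustration): the first two steps list the patients whose BY value is repeated, so that the repeats can be reviewed before a MERGE; the last step is a deliberate many-to-many join written in PROC SQL.

```sas
* List PATNO values that occur more than once in PATPAT;
proc sort data=patpat out=patsrt;
   by patno;
run;

data patdup;
   set patsrt;
   by patno;
   if not (first.patno and last.patno);
run;

* A deliberate many-to-many join;
proc sql;
   create table aecm01 as
   select a.patno, a.aeterm, c.cmterm
   from aesaes as a inner join conmed as c
   on a.patno = c.patno;
quit;
```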

  • When coding IF-THEN-ELSE constructs use a final ELSE statement to trap any observations that do not meet the conditions in the IF-THEN clauses.

Reason: You can only be sure that all possible combinations of data are covered if there is a final ELSE statement.
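
Example (a sketch; the AESEV variable, its values and the wording of the message are assumptions made for illustration): the final ELSE assigns a defined value and writes the unexpected case to the LOG so that it can be reviewed.

```sas
data aesev01;
   set aesaes;
   if      aesev = "MILD"     then aesevn = 1;
   else if aesev = "MODERATE" then aesevn = 2;
   else if aesev = "SEVERE"   then aesevn = 3;
   else do;
      aesevn = .;
      put "CHECK: unexpected severity " patno= aesev=;
   end;
run;
```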

  • When coding a user-defined FORMAT, include the keyword ‘other’ on the left side of the equals sign so that all possible values have an entry in the format.

Reason: A missing entry in a user-defined FORMAT can be difficult to detect. The simplest way to identify this potential problem is to ensure that all values are assigned a format. Note: This does not apply to INFORMATs. It could be more helpful to get a WARNING message when trying to INPUT data of unexpected format.
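
Example (a sketch; the format name $SEXF and its values are assumptions made for illustration): any value not listed explicitly is displayed as "Unexpected value" instead of passing through unformatted.

```sas
proc format;
   value $sexf
      "M"   = "Male"
      "F"   = "Female"
      " "   = "Missing"
      other = "Unexpected value";
run;
```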

  • Try to produce code that will operate correctly with unusual data in unexpected situations (e.g. missing data).

Proposed Code for Data Checks

Activate/Deactivate Pieces of Code

At the beginning of the program, define a macro variable that you set to blank during the development phase or that you set equal to * for the production run:

%let c=;  or  %let c=*;

For the pieces of code that check the data/program, start each line with the macro variable defined above:

&c title "Check the visits for each patient";
&c proc freq data=patvis01;
&c    table patno*visit;
&c run;

This code will be executed if &c is blank (development), but will be commented out when &c=* (production).

Perform Checks on a Subset of Patients

In a separate program that you store under the study MACRO folder, list the subset of patients (clean patients, discontinued patients, patients with SAE or patients with odd data) that you want to look at:

%macro select;

2076 2162 2271 2449

%mend;

At the beginning of the program, define a second macro variable that you set equal to * when you want to perform checks on all data or to blank when you are interested in a subset of patients:

%let s=*;  or  %let s=;

For each check, add a piece of code that subsets the data, and start each line of this piece of code with the two macro variables defined above:

&c title "Check the visits for each patient";
&c proc freq data=patvis01;
&c    table patno*visit;
&c &s where patno in (%select);
&c run;

The check will be performed only if &c is blank, and it will be applied to all patients if &s=* or on the subset of patients if &s is blank.

Your Opinion

  • Like this page? Want to add to it or improve it? Feel free to edit this page (use the edit tab at the top of the page; login required). Constructive contributions will be appreciated and retained.
  • Just wish to provide some comments? Use the discussion tab at the top of the page (login required).

--Jmbodart 06:22, 26 October 2008 (EDT)