Using Formats for Data Review

From sasCommunity
Jump to: navigation, search

keywords: bricolage :: tinkering

To refer to this page use: TinyUrl http://tinyurl.com/24ra8uw

Overview

History, Background

Cody and Fehd recommend adding an other value to formats in order to be able to identify invalid values

Cody:

 
proc format;
value abc ... other = 'invalid';

Fehd

 
%Let Invalid = Invalid;*allocate in autoexec;
proc format;
value abc ... other = "&Invalid.";

Using a macro variable allows the value to be referenced in data review programs:

 
if put(Variable,format.) eq "&Invalid." then do;

At SESUG 2010 Woodruff and Dunn presented a paper on using a double DoW to summarize data. Their idea prompted this R&D.

Task

  • Description: This article presents a single-step, multi-pass read of an unsorted data set which summarizes the data and creates a data set with outliers: invalid values and values greater than one standard deviation from the mean.
  • Purpose: bricolage; illustrates the technique of multiple reads of a data set within one step and arrays to accumulate date
  • Given an unsorted data set with
    • id == primary key
    • categorical variable == foreign key
    • measurement == fact
  • Produce a data set with standard statistics
    • --- mean min max n std ---
    • of the measurement variable
    • by (stratified) the categorical variable

Compare the multi-step process with the single-step process.

Programs

Project algorithm

  1. make autoexec
  2. create format+value for categorical variable
    1. program: formats
  3. convert format+value to format+index:
    1. subroutine: formats-value-to-index.sas
    2. program: formats-value-to-index-sashelp-class-sex.sas
  4. summarize data:
    1. subroutine: smry-std.sas
    2. program: smry-std-sashelp-class.sas

SetUp

autoexec

Parameterized includes shown here depend on fileref Project

 
title   'Fehd, 2011: Using formats to summarize unsorted data';
filename Project  '.'       ;*  here;
libname  Library  '..\sas7b';* there: sibling folder;

Formats

Create a format for sashelp.class.sex.

 
Proc Format library = Library fmtlib;
 
value $Sex 'F'   = 'feminine'
           'M'   = 'masculine'
           other = 'invalid';
 
*todo: program to convert above to this:;
*value $_Sex 'F' = 1
             'M' = 2
           other = 0;
run;


Make library.class.sex

 
DATA Library.Class;
attrib     Name length = $20
       SaveName length = $20;
retain seed 157;       
 
do until(EndoFile);
   set sashelp.class
       end = EndoFile;
   SaveName = Name;
   do I = 1 to 1501/19;
      Name = catt(SaveName,I);
      if %*mod(I,57) eq 0;
         ranuni(I+seed) ge 0.999 then Sex = 'I';
      output;
      end;
   end;
stop;
 
PROC Freq data   = &SysLast.;
          tables   Sex;
run;

Multi-step Process

Proc Summary demo

 
* name: proc-summary-sashelp-class.sas;
* is.a demo: multi-step data review;
* see smry-std;
 
%Let In_Lib  = sashelp;
*Let In_Lib  = Library;* see make-big-data;
%Let In_Data = Class;
%Let In_Id   = Name;
%Let In_By   = Sex;
%Let In_ByFmt= $Sex.;
%Let In_Name = Age;
%Let In_Name = Height;
%Let In_Name = Weight;
 
%Let Out_Std  = 1;
%Let Out_Lib  = work;
%Let Out_Data = Class;
 
%Let Out_Smry = Summary;
 
PROC Sort  data = &In_Lib..&In_Data
           out  =     work.&In_Data;
           by     &In_By;
 
PROC Summary data   = work.&In_Data.
                      n mean min max sum std var;
             var      &In_Name.;
             by       &In_By;
             output
                out = Work.&Out_Smry.(drop=_FREQ_ _Type_ );
 
Proc Print;
Proc Transpose data = Work.&Out_Smry.
               out  = Work.Transposed;
               var    &In_Name.;
               by     &In_By.;
               id     _stat_;
Proc Print;
*endSAS;
 
DATA  &Out_Lib..&Out_Smry.;
array _by   (3) $1;
array _Mean (3);
array _Std  (3);
drop  _: I;   
 
set work.Transposed (keep   =  _Name_ 
                     rename = (_Name_ = Stats_for)
                     obs    = 1);
attrib I length = 4;
retain I 0;
 
do I = 1 to N_obs;
   set work.Transposed (keep = &In_By. Mean Std)
       nobs = N_Obs;
   _by  (I) = &In_by;
   _Mean(I) = Mean;
   _Std (I) = Std;
   end;
 
do until(EndoFile);
   set work.&In_Data.   end = EndoFile;
   by       &In_By.;
   if first.&In_By. then do;      
      do I = 1 to dim(_Std);
         if &In_By. eq _by(I) then leave;
         end;
         put 'note2: ' &In_By.= I=;
      end;
   if abs(&In_Name. - _Mean(I)) / _Std (I) ge &Out_Std.
      or put(&In_By.,&In_ByFmt.) eq 'invalid'
      then do;
      putlog &In_by.= &In_Id.= &In_Name.= ;
      output;
      end;
   end;
stop;
run;   
 
Proc Print;
run;

Proc Summary log

NOTE: The SAS System used:
      real time           0.25 seconds
      user cpu time       0.07 seconds
      system cpu time     0.10 seconds
      Memory                            4475k

Proc Summary listing

Obs    Sex    _NAME_     N     MIN     MAX       MEAN       STD

 1      F     Weight     9    50.5    112.5     90.111    19.3839
 2      M     Weight    10    83.0    150.0    108.950    22.7272
 
       Stats_
Obs     for      Sex     MEAN       STD      Name       Age    Height    Weight

 1     Weight     F     108.95    22.7272    Janet       15     62.5      112.5
 2     Weight     F     108.95    22.7272    Joyce       11     51.3       50.5
 3     Weight     F     108.95    22.7272    Mary        15     66.5      112.0
 4     Weight     M     108.95    22.7272    James       12     57.3       83.0
 5     Weight     M     108.95    22.7272    Jeffrey     13     62.5       84.0
 6     Weight     M     108.95    22.7272    Philip      16     72.0      150.0
 7     Weight     M     108.95    22.7272    Ronald      15     67.0      133.0
 8     Weight     M     108.95    22.7272    Thomas      11     57.5       85.0

Format-value-to-index

Convert a dimension or lookup table to a value==index table

 
;/*    name: formats-value-to-index
description: make new format.value with value = index
       is.a: parameterized include program
purpose    : provide lookup table for smry-std program
usage:
%Let In_Value  = $Sex;
%Let Out_Value = $_Sex;
%Include Project(formats-value-to-index);
;/******************************/
 
Proc Format library = Library 
            out     = Work.Values;
            select    &In_Value.;
 
PROC SQL;  describe table &SysLast.;
           quit;            
PROC Print data = &SysLast.;
run;
 
DATA   Ctrl;
attrib Label length = $8;
drop   I;
retain fmtname "&Out_Value."
       type %sysfunc(ifc(%substr(&In_Value.,1,1) eq $,'c','n'));
       * if char1 of In_Value is $ then c else n;
 
do I = 1 to N_obs; 
   set Work.Values(keep = Start Label) 
       nobs = N_obs;
   Label = left(put(I,best.));
   output;
   end;
* assign row: Other = zero;
Hlo   = 'O'; * Oh::Other;
Label = '0'; * zero;
Start = '?';
output;
stop;
 
PROC Print  data    = Ctrl    noobs;
 
PROC Format library = Library fmtlib
            cntlin  = Ctrl;
run;

Format-value-to-index for sashelp.class.sex

 
%Let In_Value  = $Sex;
%Let Out_Value = $_Sex;
%Include Project(formats-value-to-index)/source2;

Smry-Std

 
;/*    name: smry-std   suggest new name: DataSmry
       is.a: parameterized include program
description: calculate summary stats: n mean min max std var
             of data set using multiple reads of unsorted data
             output outliers
purpose    : bricolage: demonstrate multiple reads,
                        array usage for accumulation
author     : Ronald J. Fehd copyleft 2010
notes      : value.index &In_ByFmt. created by format-value-to-index
usage:
%Let In_data  = Library.Class;
%Let In_data  = sashelp.Class;
%Let In_id    = Name;
%Let In_by    = Sex;
%Let In_ByFmt = $_Sex;
%Let In_name  = Height;
%Let In_name  = Weight;
%Let Out_Std  = 1.0;*if In_name.value ge &Out_Std. then output;
%Let Out_Lib  = Library;
%Let Out_Lib  = Work;
%Let Out_Data = Out_Data;
%Let Out_Smry = Summary;
%Include Project(smry-std);
;/******************************/
* get dimension table == lookup table with index in (0::unknown : N);
PROC Format library = Library
            cntlout = List;
            select    &In_ByFmt.;
PROC Print  data    = &SysLast.;
            title3    &SysLast.; 
 
PROC SQL noprint;
         select max(Label), max(length(Start))
         into  :H_bound, :Len_Start
         from   List
         where  Label ne '0';
         select quote(trim(Start))
         into  :List separated by ','
         from   List
         order  by Label;
         %Let   H_bound = &H_bound.;
         %Let   Len_Start = &Len_Start.;
%Put note2: H_bound  : &H_bound.;
%Put note2: Len_Start: &Len_Start.;
%Put note2: List     : &List.;
 
DATA &Out_Lib..&Out_Smry
        (drop  =                  &In_Name &In_id &In_Name._N_Std
         label = "&In_Data &In_by &In_Name ")
     &Out_Lib..&Out_Data
        (keep  =           &In_by &In_Name &In_id &In_Name._N_Std
         label = "&In_Data &In_by &In_Name ");
array _values(0:&H_bound.) $&Len_Start. _temporary_ (&List.);
array _counts(0:&H_bound.);
array _diffs (0:&H_bound.);
array _mean  (0:&H_bound.);
array _min   (0:&H_bound.);
array _max   (0:&H_bound.);
array _sum   (0:&H_bound.);
array _std   (0:&H_bound.);
array _var   (0:&H_bound.);
 
drop I _:;
 
do until(EndoFile);
   set &In_Data.(keep = &In_By. &In_Name.)  end = EndoFile;
   link Assign_I;
   _Counts(I) = sum(_Counts(I),1        );
   _Min   (I) = min(_Min   (I),&In_Name.);
   _Max   (I) = max(_Max   (I),&In_Name.);
   _Sum   (I) = sum(_Sum   (I),&In_Name.);
   end;
 
** calculate population statistics: x-hat == mean == sum(x)/n;
do I = lbound(_Values) to hbound(_Values);
   _Mean(I) = _Sum(I) / _Counts(I);
   end;
 
* reset; EndoFile = 0; * for next read;
do until(EndoFile);
   set &In_Data.(keep = &In_By. &In_Name.)  end = EndoFile;
   link Assign_I;
   * calculate for variance: difference^2;
   _Diffs(I) = sum(_Diffs(I),(&In_Name. - _Mean (I))**2);
   end;
 
** calculate population statistics: variance and std;
do I = lbound(_Values) to hbound(_Values);
   * s^2 is the variance
         1/D * sum((w sub i) * ((x sub i) - x-hat))**2
         (w sub i) is weight == 1
         where D is vardef: either N or N-1(==default);
   _Var(I)         = (1/(_Counts(I) -1)) * _Diffs(I);
   * STD is square root of the variance;
   _Std(I)         = sqrt(_Var(I));
   &In_By.         = _Values(I);
   &In_By._Count   = _Counts(I);
   &In_Name._Diffs = _Diffs (I);
   &In_Name._Mean  = _Mean  (I);
   &In_Name._Min   = _Min   (I);
   &In_Name._Max   = _Max   (I);
   &In_Name._Sum   = _Sum   (I);
   &In_Name._Std   = _Std   (I);
   &In_Name._Var   = _Var   (I);
   output &Out_Lib..&Out_Smry.;
   end;
 
* reset; EndoFile = 0; * for next read;
do until(EndoFile);
   set &In_Data.(keep = &In_Id. &In_By. &In_Name.)  end = EndoFile;
   link Assign_I;
   &In_Name._N_Std = abs(&In_Name. - _Mean(I)) / _Std (I);
   if    I eq 0
      or &In_Name._N_Std ge &Out_Std. then output &Out_Lib..&Out_Data.;
   end;
stop;
 
Assign_I:  I = put(&In_By.,&In_ByFmt..);* char to numeric conversion;	
           return;
 
PROC Print data  = &Out_Lib..&Out_Smry;  title3 "summary &In_Data.";
PROC Print data  = &Out_Lib..&Out_Data.;
           title3 "&In_Data. outliers: N-std ge &Out_Std.";
run;       title3 ;

Smry-Std for sashelp.class.sex

 
* name: smry-std-sashelp-class.sas;
options source2;
%Let In_data     = sashelp.Class;
%Let In_id       = Name;
%Let In_by       = Sex;
%Let In_ByValues = $_Sex;
%Let In_name     = Height;
%Let In_name     = Weight;
%Let Out_Std     = 1.0;*if value ge &Out_Std. then output;
%Let Out_Lib     = Library;
%Let Out_Lib     = Work;
%Let Out_Data    = Out_Data;
%Let Out_Smry    = Summary;
 
%Include Project(smry-std);

Smry-Std for sashelp.class log

NOTE: The SAS System used:
      real time           0.28 seconds
      user cpu time       0.07 seconds
      system cpu time     0.12 seconds
      Memory                            7354k

Smry-Std for sashelp.class listing

               Sex_    Weight_    Weight_    Weight_    Weight_    Weight_    Weight_    Weight_
Obs    Sex    Count     Diffs       Mean       Min        Max        Sum        Std        Var

 1      *        .         .         .           .          .           .       .           .   
 2      F        9     3005.89     90.111      50.5      112.5       811.0    19.3839    375.736
 3      M       10     4648.73    108.950      83.0      150.0      1089.5    22.7272    516.525

sashelp.Class outliers: N-std ge 1.0

                                   Weight_
Obs    Sex    Weight    Name        N_Std

 1      M       83.0    James      1.14180
 2      F      112.5    Janet      1.15502
 3      M       84.0    Jeffrey    1.09780
 4      F       50.5    Joyce      2.04350
 5      F      112.0    Mary       1.12923
 6      M      150.0    Philip     1.80621
 7      M      133.0    Ronald     1.05820
 8      M       85.0    Thomas     1.05380

Smry-Std for library.class.sex

 
* name: smry-std-lib-class.sas;
%Let In_data  = Library.class;
%Let In_id    = Name;
%Let In_By    = Sex;
%Let In_ByFmt = $_Sex;
%Let In_Name  = Height;
%Let In_Name  = Weight;
%Let Out_Std  = 1.0;*if value ge &Out_Std. then output;
%Let Out_Lib  = Library;
%Let Out_Lib  = Work;
%Let Out_Data = &In_Name._Outliers;
%Let Out_Smry = &In_Name._Summary;
 
%Include Project(smry-std)/source2;

References

--macro maven == the radical programmer 12:06, 4 October 2010 (UTC)