As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


Split Data into Subsets

From sasCommunity
Jump to: navigation, search

Tiny url: http://tinyurl.com/5abz5h

This is a FAQ on SAS-L. Here is an simple example of the desired code:

data F M;
set sashelp.class;
select (sex);
   when ('F')  output F;
   when ('M')  output M;
   otherwise;
  end;
run;

Result:

NOTE: The data set WORK.F has 9 observations and 5 variables.
NOTE: The data set WORK.M has 10 observations and 5 variables.

So the observations from the data set SASHELP.CLASS have been partitioned into two data sets, according to the value of the variable SEX. Of course the real problem probably has a lot more than two levels of the driving variable, and those levels may not be the same from one instance to the next. So hard-coding is undesirable. How can this be automated?

Best Practice: Just Don't Do It

There is a pretty strong consensus holding that such proliferation of data sets is undesirable, and that the subsets should be maintained together in one inclusive table. The reasoning is that BY statements can be used to stratify processing, WHERE statements can be used for selective processing, and indexes can be created to support both BY and WHERE statements. If the partitioning is proposed to compartmentalize access to different subsets, views may provide a better solution.

However, there are situations where partitioning may still be necessary (maybe the boss said Do It This Way).

One-Pass Solutions

Until the advent of the hash object in SAS 9, solutions required two passes of the data, the first to discover the values of the driving variable, which had to be known before the splitting step could be compiled.

The hash object can create data sets whose names do not have to be determined until run time, making possible one-pass solutions. See Jack Hamilton's Creating Data-Driven Data Set Names in a Single Pass Using Hash Objects. Richard DeVenezia took this a step further with a hash-of-hashes strategy which eliminates the need to sort the data in advance.

Two-Pass Solutions

Here is a solution offered by --macro maven == the radical programmer 17:35, 7 July 2008 (EDT)

 /* routine name: split-into-subsets;
* description : Call Execute parameterized Include program;
* purpose     : list processing of subroutine;
* note        : This is a Derivative Work of CallExecInclude;
%Include SiteIncl(split-into-subsets);
* input       : SsData
  *             SsList
  *             SsRowId
  *             SsOutLib
* process     : make subset data list, make subset statements
* output      : subsets
* parameters  :
%let SsData   = sashelp.Class;
%let SsList   = Library.Class_Sex(drop = Percent);
%let SsRowId  = RowNmbr;
%let SsOutLib = Work;
* Notes       : RJF2 7/5/2008 made routine from Call Execute Suite;
***************************************/

DATA   _Null_;
if 0   then set &SsList.;
array  MvarC(*) $32 _character_;
array  MvarN(*)   8 _numeric_;

attrib Stmnt   length = $1024
       Vname   length = $  32
       RowNmbr length = 4
       Testing length = 4;

retain RowNmbr 0
       Testing %eval(0 or %sysfunc(getoption(Source2)) eq SOURCE2);

* make statement: DATA OutLib._1 OutLib._2 OutLib._N;
Stmnt = 'Data ';
link ExecStmnt;

do until(EndoFile);
   set &SsList. end = EndoFile;
   RowNmbr+ +1;
   Stmnt = catt('&SsOutLib.._',&SsRowId.);
   link ExecStmnt;
   end;

* end Data statement;
Stmnt = ';do until(EndoFile);set &SsData. end = EndoFile;';
link ExecStmnt;

* reset for next read;
EndoFile = 0;
RowNmbr  = 0;

* make subsetting statements: *if Var = value then output OutLib._?;
do until(EndoFile);
   set &SsList. end = EndoFile;
   RowNmbr+ +1;

   Stmnt = 'if ';
   do I = 1 to dim(MvarC);
      call   vname(MvarC(I),Vname);
      if I ge 2 then Stmnt = cat(trim(Stmnt),' and ');
      Stmnt = cat(trim(Stmnt),' ',trim(Vname),' ="',MvarC(I),'" ');
      end;

   Stmnt = cat(trim(Stmnt),' '
              ,' then output &SsOutLib.._',&SsRowId.,'; else');
   link ExecStmnt;

   end;* do until(EndoFile);

Stmnt = ';end;stop;run;';
link ExecStmnt;

stop;
return; ExecStmnt:
   if   Testing then putlog    Stmnt=;
   call execute(cats('%nrstr(',Stmnt,')'));
return;
run;
%Put Note2: MakeSubsets of &SsData. &SsList. &SsRowId. ending;
%*symdel SsData SsList SsRowId SsOutLib;

Save the above program in your site includes folder and allocate a fileref for it in your site autoexec.

* name: autoexec.sas;
Title  'Sorting RnD';

Filename Project  '.' ;* here;
Filename SiteIncl '.' ;* here;
Libname  Library  '..\sas7b';

Prepare a data set with the list of subsets you want to create.

7          %let Libname  = SAShelp;
8          %let Memname  = Class;
9          %let Names    = Sex ;
10         %let Out_Data = Library.Class_Sex;
11         options source2;
12         %Include Project(make-unique-freq);
NOTE: %INCLUDE (level 1) file PROJECT(make-unique-freq) is file C:\Temp\sorting\sas\make-unique-freq.sas.
13        +* name       : make-unique-freq;
14        +* description: ;
15        +* purpose    : ;
16        +* parms:;
17        +%*let Libname  = ;
18        +%*let Memname  = ;
19        +%*let Names    = ;
20        +%*let Out_Data = ;
21        +
22        +PROC Freq data   = &Libname..&Memname.
23        +         (keep   = &Names.);
24        +          tables   &Names.
25        +                 / list missing noprint
26        +             out = &Out_Data.;
27        +run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set LIBRARY.CLASS_SEX has 2 observations and 3 variables.

This is the calling program.

* name       : split-into-subsets-Test;
* description: make subsets from data;
* purpose    : list processing routine;
* parms:;
options source2;* testing;
%let SsData   = sashelp.Class;
%let SsList   = Library.Class_Sex(drop = Count Percent);
%let SsRowId  = RowNmbr;
%let SsOutLib = Work;

%Include SiteIncl(split-into-subsets);

%let SsRowId = Sex;
%Include SiteIncl(split-into-subsets);

%let SsList   = Library.Class_Sex_Count(drop = Percent);
%let SsRowId = Count;
%Include SiteIncl(split-into-subsets);
%let SsList   = Library.Class_Sex(drop = Count Percent);
%let SsRowId  = RowNmbr;* default;
...
NOTE: CALL EXECUTE generated line.
1         + Data
2         +  &SsOutLib.._1
3         +  &SsOutLib.._2
4         + ;do until(EndoFile);set  &SsData. end = EndoFile;
5         + if Sex ="F"  then output  &SsOutLib.._1; else
6         + if Sex ="M"  then output  &SsOutLib.._2; else
7         + ;end;stop;run;
%let SsRowId = Sex;
...
NOTE: CALL EXECUTE generated line.
1         + Data
2         +  &SsOutLib.._F
3         +  &SsOutLib.._M
4         + ;do until(EndoFile);set  &SsData. end = EndoFile;
5         + if Sex ="F"  then output  &SsOutLib.._F; else
6         + if Sex ="M"  then output  &SsOutLib.._M; else
7         + ;end;stop;run;
%let SsList   = Library.Class_Sex_Count(drop = Percent);
%let SsRowId = Count;
...
NOTE: CALL EXECUTE generated line.
1         + Data
2         +  &SsOutLib.._10
3         +  &SsOutLib.._9
4         + ;do until(EndoFile);set  &SsData. end = EndoFile;
5         + if Sex ="M"  then output  &SsOutLib.._10; else
6         + if Sex ="F"  then output  &SsOutLib.._9; else
7         + ;end;stop;run;

Using Hashing

The original post can be found at: http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0406B&L=sas-l&P=R34571

The following is an example splitting sashelp.class into two files, F and M.

data temp (index = (sex));
  set sashelp.class;
run;

data _null_ ;
  dcl hash hh   (             ) ;
  hh.definekey  ('k'          ) ;
  hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;
  hh.definedone () ;
  do k = 1 by 1 until ( last.sex ) ;
    set temp;
    by sex ;
    hh.add () ;
  end ;
  hh.output (dataset: sex) ;
run ;

Other solutions

--macro maven == the radical programmer 10:15, 7 January 2009 (EST)