As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


Difference between revisions of "Cardinality Ratio"

From sasCommunity
Jump to: navigation, search
m (added Category: Papers by R Fehd)
m (References fixed broken link to paper 299-2013.pdf)
Line 492: Line 492:
 
== References ==
 
== References ==
  
* [http://support.sas.com/resources/papers/proceedings13/299-2013.pdf| SGF-2013.299: Data Review Information: N-Levels or Cardinality Ratio]
+
* [http://support.sas.com/resources/papers/proceedings13/299-2013.pdf SGF-2013.299: Data Review Information: N-Levels or Cardinality Ratio]
 
* [[Database_Vocabulary]]
 
* [[Database_Vocabulary]]
 
* originally developed in [[SmryEachVar_A_Data_Review_Suite]]
 
* originally developed in [[SmryEachVar_A_Data_Review_Suite]]

Revision as of 15:19, 14 June 2013

Tiny Url: http://tinyurl.com/5grpgn

The Cardinality of a variable is the ratio of the number of rows of the frequency data set divided by the number of rows of the data set. It is a Pure Number, because it has no units.

http://en.wikipedia.org/wiki/Dimensionless

Characteristics

  • The range is (0,...,1]; cannot be zero and can be one.
    • Explanation: There is at least one value for a variable: character = ' ' and numeric = .; therefore the range is 1/nobs : nobs/nobs.
  • If the Cardinality Ratio is one then the variable is unique, and probably a Primary Key or Row Identifier.
  • The distribution of values for variables in a data has two groups: each near the extreme values of near zero and near one.
  • Values in the left group, near zero, are usually Foreign Keys; you expect a Dimension (LookUp) Table, with the variable name
  • Values in the right group, near one, are probably continuous.
    • In a Fact Table, numeric variables are probably facts.
    • Note: Row-Id in a Fact Table may be an integer: length = 4; if its length is 8, then it may be a datetime-stamp.

Usage in Data Review

The Cardinality Ratio of variables in a data set highlights this information:

  1. Primary Key: Row-Id, Row-Nmbr, Sequence-Nmbr: a unique variable, usually an integer (length = 4), which identifies each row.
  2. Foreign Key: Variables with low cardinality.
  3. Facts: variables with high cardinality.

Note: this value is calculated by the data review suite: SmryEachVar http://www.sascommunity.org/wiki/SmryEachVar_A_Data_Review_Suite

Examples

Program: Proc Freq Nlevels

This programs illustrates the Nlevels option in Proc Freq.

Proc Freq data = sashelp.class 
          nlevels;
run;

output:

The FREQ Procedure
 
Number of Variable Levels
 
Variable      Levels
--------      ------
Name              19
Sex                2
Age                6
Height            17
Weight            15

Note: this is written to the output destination and is not saved in output data sets.

2008-May-08: SAS-L to the rescue!

"data _null_" suggests:

PROC Freq data     = sashelp.CitiDay
                     levels;
          ods 
           output  
           nlevels = Work.Nlevels;
 
PROC SQL; describe table work.Nlevels;
          quit;
Proc Print data = Work.Nlevels;
run;

This is the describe table of sashelp.class Nlevels.

create table WORK.NLEVELS( label='Number of Variable Levels' bufsize=4096 )
  (
   TableVar char(6) label='Table Variable',
   NLevels num format=BEST8. label='Number of Levels'
  );

This is the describe table of sashelp.citiday Nlevels.

create table WORK.NLEVELS( label='Number of Variable Levels' bufsize=8192 )
  (
   TableVar char(8) label='Table Variable',
   TableVarLabel char(40) label='Table Variable Label',
   NLevels num format=BEST8. label='Number of Levels',
   NMissLevels num format=BEST8. label='Number of Missing Levels',
   NNonMissLevels num format=BEST8. label='Number of Nonmissing Levels'
  );

Program: Proc Freq from SAS-L's data _null_

data work.class;
set sashelp.class;
if _n_ eq 3 then call missing(sex,height);
run;
 
proc freq data      = work.class
          levels;
          ods 
            output 
            nlevels = nlevels;
          tables      _all_ 
                    / noprint;
          run;
 
data work.ratio;
if 0 then set work.class(drop=_all_) nobs=nobs;
attrib CardRatio  length = 8 label = 'Card. Ratio 1= unique';
nobsData = nobs;
do until(eof);
   set work.nlevels end=eof;
   CardRatio = Nlevels / NobsData;
   output;
   end;
stop;
run;
 
Proc Print label;
run;

Program: CrdRatio-demo

 /*    name: CrdRatio-demo
description: provide a data set with Cardinality Ratio
             Nobs(data)/Nobs(levels) of each variable in data set
    purpose: data review
             Cardinality Ratio is an indicator of type of variable
             primary key
             foreign key
             fact
             CR = 1 : var is Primary Key :: identifies row
          1.0 > CR >= 0.5 : var is continous, may be a fact: summable
          0.5 < CR >  0 : var values are few, may be a foreign key
   note:  0.5 is a relative cut-off
RJF2 5/11/2012
usage:
%Let Libname = sashelp;
%Let Memname = Class;
%Let Memname = Heart;
*Let Memname = Shoes;
 
%Let Out_Lib = Library;
%Let Out_Lib = Work;
%Let Out_Data = Nlevels;
%Include SiteIncl(CrdRatio-demo);
********/
PROC Freq data    = &Libname..&Memname.
          nlevels ;
          ods       exclude OneWayFreqs;
          ods       output
          NLevels = work.&Out_Data
         (rename  = (TableVar = Name));
 
DATA Work.&Out_Data.
     (keep = Data_Set Name  
             CardinalityRatio is_a NLevels Nobs);
     if 0 then do;
        attrib Data_Set   length = $42;
        set    &Out_Data  (keep = Name);
        attrib CardinalityRatio length = 8
               is_a             length = $%length(primary key?!);
        set    &Out_Data  (keep   = Nlevels);
        set    &Libname..&Memname.(obs=0)
               nobs = Nrows ;
        end;
     retain Data_Set  "&Libname..&Memname"
            Nobs;
     Nobs = Nrows;
 
do until(EndoFile);
   set &Out_Data end = EndoFile;
   CardinalityRatio = Nlevels/Nobs;
   if CardinalityRatio eq 1 then             is_a = 'primary key!?';
   else if (1 > CardinalityRatio > 0.5) then is_a = 'fact?';
   else                                      is_a = 'foreign key?';
   output;
   end;
stop;
run;
Proc Print data = Work.&Out_Data;
           title3 "Cardinality Ratios for &Libname..&Memname.";
run;

Program: CrdRatio-demo-caller

This is the calling program:

 /*    name: CrdRatio-demo-caller.sas
description: demo program
    purpose: show output data set of CrdRatio
RJF2 2012-May-12    
***********/     
options source2;
%Let Libname = sashelp;
%Let Memname = Class;          
*Let Memname = Heart;          
*Let Memname = Shoes;          
 
%Let Out_Lib = Library;
%Let Out_Lib = Work;
%Let Out_Data = Nlevels;
%Include SiteIncl(CrdRatio-demo);

Program: CrdRatio-demo-caller.lst

This is the output for sashelp.class.

The FREQ Procedure

Number of Variable Levels
 
Variable      Levels
--------------------
Name              19
Sex                2
Age                6
Height            17
Weight            15
 
Cardinality Ratios for sashelp.Class
                                  Cardinality
Obs      Data_Set       Name         Ratio       is_a              NLevels    Nobs

 1     sashelp.Class    Name        1.00000      primary key!?          19     19 
 2     sashelp.Class    Sex         0.10526      foreign key?            2     19 
 3     sashelp.Class    Age         0.31579      foreign key?            6     19 
 4     sashelp.Class    Height      0.89474      fact?                  17     19 
 5     sashelp.Class    Weight      0.78947      fact?                  15     19 


Program: Proc Freq

Title2 RnD-Card-Ratio;
%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
PROC Freq data   = &Libname..&MemName.;
          tables   &Name.
                 / list missing noprint
             out = Work.Unique;
 
DATA   Work.Unique;
attrib NobsData  length = 4
       NLevels   length = 4 label = 'N Levels'
       CardRatio length = 8 label = 'Card. Ratio 1= unique';
drop   DsId Rc;
 
DsId     = open ("&Libname..&MemName.",'i');
NobsData = attrn(dsid,'nobs');
Rc       = close(dsid);
 
DsId     = open ("Work.Unique",'i');
NLevels  = attrn(dsid,'nobs');
Rc       = close(dsid);
 
CardRatio = NobsUnique / NobsData;
output;
stop;
 
Proc Print data = CardRatio
           label;

Listings

RnD-Card-Ratio
sashelp.Class.Name
                         Card.
       Nobs    N       Ratio 1=
Obs    Data    Levels   unique
 
 1      19      19         1
RnD-Card-Ratio
sashelp.Class.Sex
                         Card.
       Nobs    N       Ratio 1=
Obs    Data    Levels    unique
 
 1      19       2      0.10526

Program: Proc Sort

%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
PROC Sort data = &Libname..&MemName.(keep = &Name.)
          out  = Work.Unique
                 nodupkey;
          by     &Name.;
 
*same as above;

Program: Proc SQL

%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
PROC SQL; create table Work.Unique as
          select count(*)                as NobsData
               , count(distinct(&Name.)) as NLevels
               , calculated NLevels 
               / calculated NobsData     as CardRatio
          from &Libname..&MemName.(keep = &Name.);
          quit;
PROC Print data = Work.Unique;
run;

Here is sql code using list processing to make several tables and append them to create a list for a data set.

 /*    name: CardRatio-sql
description: calculate Nlevels and Cardinality Ratio for each variable
             of a data set
    purpose: bricolage: use sql to calculate Cardinality Ratio
RJF2 2012-May-17 in answer to Q at GASUG
******/ 
options nocenter;
options mprint;
%Let Libname = sashelp;
%Let Memname = class;
%Let Name    = Sex;
 
PROC SQL ;
         select Nobs
           into:Nobs
          from  dictionary.tables
          where Libname eq "%upcase(&Libname)"
            and Memname eq "%upcase(&Memname)";
          %Put note Nobs: &Nobs;
 
         select '%_(' !! trim(Name)
           into:List separated by ')'
          from  dictionary.columns
          where Libname eq "%upcase(&Libname)"
            and Memname eq "%upcase(&Memname)";
%Macro _(Name);
         create table &Name as
         select "&Name"                  as Name length =32
               ,count(distinct(&Name))   as Nlevels
               ,calculated Nlevels/&Nobs as CardRatio
           from &Libname..&Memname(keep = &Name);
         select * from &Name.;
%mend;
&List
         /*********************************************
         *RnD: hard-coded example;
         create table          Sex as
         select               "Sex"    as Name
               ,count(distinct(Sex))   as Nlevels
               ,calculated Nlevels/&Nobs as CardRatio
           from &Libname..&Memname
               (keep = Sex);
         *RnD: soft-coded example;
         create table &Name as
         select "&Name"                  as Name
               ,count(distinct(&Name))   as Nlevels
               ,calculated Nlevels/&Nobs as CardRatio
           from &Libname..&Memname(keep = &Name);
         select * from &Name.;
         /*********************************************/
         quit;
%Macro _(Name);
Proc Append base = NLevels
            data = &Name;
%mend;
&List
 
PROC Print data = Nlevels;
           title3 Nlevels;
run;

listing:

%_(Name)                            
%_(Sex)                             
%_(Age)                             
%_(Height)                          
%_(Weight)                          

Name                               Nlevels  CardRatio
------
Name                                    19          1

Name                               Nlevels  CardRatio
------
Sex                                      2   0.105263

Name                               Nlevels  CardRatio
------
Age                                      6   0.315789

Name                               Nlevels  CardRatio
------
Height                                  17   0.894737
 

Name                               Nlevels  CardRatio
------
Weight                                    15   0.789474
 

Nlevels

                              Card
Obs    Name      Nlevels     Ratio

 1     Name         19      1.00000
 2     Sex           2      0.10526
 3     Age           6      0.31579
 4     Height       17      0.89474
 5     Weight       15      0.78947

Program: Hash Table

%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
DATA   Work.Unique;
attrib NobsData  length = 4
       NLevels   length = 4 label = 'N Levels'
       CardRatio length = 8 label = 'Card. Ratio 1= unique';
drop   DsId Rc;
 
*instantiate hash table here;
 
DsId     = open ("&Libname..&MemName.",'i');
NobsData = attrn(dsid,'nobs');
Rc       = close(dsid);
 
NLevels = ??? count(of hash-table array where value ne . or ' ');
 
CardRatio = NLevels / NobsData;
output;
stop;
 
PROC Print data = Work.Unique;
run;

References

-- created by User:Rjf2 10:10, 24 April 2008 (EDT)

--Ronald_J._Fehd macro.maven == the radical programmer