Cardinality Ratio

From sasCommunity
Jump to: navigation, search
Tiny Url: http://tinyurl.com/5grpgn

The Cardinality of a variable is the ratio of the number of rows of the frequency data set divided by the number of rows of the data set. It is a Pure Number, because it has no units.

(See Wikipedia for more about Dimensionless units.)

Note: 8/26/2015, new program; see page Making_Lists, section Making List of Variables, subsection Freq.nlevels

Characteristics

  • The range is (0,...,1]; cannot be zero and can be one.
    • Explanation: There is at least one value for a variable: character = ' ' and numeric = .; therefore the range is 1/nobs : nobs/nobs.
  • If the Cardinality Ratio is one then the variable is unique, and probably a Primary Key or Row Identifier.
  • The distribution of values for variables in a data has two groups: each near the extreme values of near zero and near one.
  • Values in the left group, near zero, are usually Foreign Keys; you expect a Dimension (LookUp) Table, with the variable name
    • See Wikipedia: Foreign key
    • example:
      • variable name: Customer_Id
      • dimension table name: Customer_Id
  • Values in the right group, near one, are probably continuous.
    • In a Fact Table, numeric variables are probably facts.
    • Note: Row-Id in a Fact Table may be an integer: length = 4; if its length is 8, then it may be a datetime-stamp.

Usage in Data Review

The Cardinality Ratio of variables in a data set highlights this information:

  1. Primary Key: row-id, row-number, sequence-number: a unique variable, usually an integer (length = 4), which identifies each row.
  2. Foreign Key: variables with low cardinality, discrete, classification variables.
  3. Facts: variables with high cardinality, continuous, analysis variables.

Note: this value is calculated by the data review suite: SmryEachVar

Examples

Program: Proc Freq Nlevels

This programs illustrates the Nlevels option in Proc Freq.

Proc Freq data = sashelp.class
          nlevels;
run;

output:

The FREQ Procedure
 
Number of Variable Levels
 
Variable      Levels
--------      ------
Name              19
Sex                2
Age                6
Height            17
Weight            15

Note: this is written to the output destination and is not saved in output data sets.

2008-May-08: SAS-L to the rescue!

"data _null_" suggests:

PROC Freq data     = sashelp.CitiDay
                     levels;
          ods
           output
           nlevels = Work.Nlevels;
 
PROC SQL; describe table work.Nlevels;
          quit;
Proc Print data = Work.Nlevels;
run;

This is the describe table of sashelp.class Nlevels.

create table WORK.NLEVELS( label='Number of Variable Levels' bufsize=4096 )
  (
   TableVar char(6) label='Table Variable',
   NLevels num format=BEST8. label='Number of Levels'
  );

This is the describe table of sashelp.citiday Nlevels.

create table WORK.NLEVELS( label='Number of Variable Levels' bufsize=8192 )
  (
   TableVar char(8) label='Table Variable',
   TableVarLabel char(40) label='Table Variable Label',
   NLevels num format=BEST8. label='Number of Levels',
   NMissLevels num format=BEST8. label='Number of Missing Levels',
   NNonMissLevels num format=BEST8. label='Number of Nonmissing Levels'
  );

Program: Proc Freq from SAS-L's data _null_

data work.class;
set sashelp.class;
if _n_ eq 3 then call missing(sex,height);
run;
 
proc freq data      = work.class
          levels;
          ods
            output
            nlevels = nlevels;
          tables      _all_
                    / noprint;
          run;
 
data work.ratio;
if 0 then set work.class(drop=_all_) nobs=nobs;
attrib CardRatio  length = 8 label = 'Card. Ratio 1= unique';
nobsData = nobs;
do until(eof);
   set work.nlevels end=eof;
   CardRatio = Nlevels / NobsData;
   output;
   end;
stop;
run;
 
Proc Print label;
run;

Program: CrdRatio-demo

 /*    name: CrdRatio-demo
description: provide a data set with Cardinality Ratio
             Nobs(data)/Nobs(levels) of each variable in data set
    purpose: data review
             Cardinality Ratio is an indicator of type of variable
             primary key
             foreign key
             fact
             CR = 1 : var is Primary Key :: identifies row
          1.0 > CR >= 0.5 : var is continous, may be a fact: summable
          0.5 < CR >  0 : var values are few, may be a foreign key
   note:  0.5 is a relative cut-off
RJF2 5/11/2012
usage:
%Let Libname = sashelp;
%Let Memname = Class;
%Let Memname = Heart;
*Let Memname = Shoes;
 
%Let Out_Lib = Library;
%Let Out_Lib = Work;
%Let Out_Data = Nlevels;
%Include SiteIncl(CrdRatio-demo);
********/
PROC Freq data    = &Libname..&Memname.
          nlevels ;
          ods       exclude OneWayFreqs;
          ods       output
          NLevels = work.&Out_Data
         (rename  = (TableVar = Name));
 
DATA Work.&Out_Data.
     (keep = Data_Set Name
             CardinalityRatio is_a NLevels Nobs);
     if 0 then do;
        attrib Data_Set   length = $42;
        set    &Out_Data  (keep = Name);
        attrib CardinalityRatio length = 8
               is_a             length = $%length(primary key?!);
        set    &Out_Data  (keep   = Nlevels);
        set    &Libname..&Memname.(obs=0)
               nobs = Nrows ;
        end;
     retain Data_Set  "&Libname..&Memname"
            Nobs;
     Nobs = Nrows;
 
do until(EndoFile);
   set &Out_Data end = EndoFile;
   CardinalityRatio = Nlevels/Nobs;
   if CardinalityRatio eq 1 then             is_a = 'primary key!?';
   else if (1 > CardinalityRatio > 0.5) then is_a = 'fact?';
   else                                      is_a = 'foreign key?';
   output;
   end;
stop;
run;
Proc Print data = Work.&Out_Data;
           title3 "Cardinality Ratios for &Libname..&Memname.";
run;

Program: CrdRatio-demo-caller

This is the calling program:

 /*    name: CrdRatio-demo-caller.sas
description: demo program
    purpose: show output data set of CrdRatio
RJF2 2012-May-12
***********/
options source2;
%Let Libname = sashelp;
%Let Memname = Class;
*Let Memname = Heart;
*Let Memname = Shoes;
 
%Let Out_Lib = Library;
%Let Out_Lib = Work;
%Let Out_Data = Nlevels;
%Include SiteIncl(CrdRatio-demo);

Program: CrdRatio-demo-caller.lst

This is the output for sashelp.class.

The FREQ Procedure

Number of Variable Levels

Variable      Levels
--------------------
Name              19
Sex                2
Age                6
Height            17
Weight            15

Cardinality Ratios for sashelp.Class
                                  Cardinality
Obs      Data_Set       Name         Ratio       is_a              NLevels    Nobs

 1     sashelp.Class    Name        1.00000      primary key!?          19     19
 2     sashelp.Class    Sex         0.10526      foreign key?            2     19
 3     sashelp.Class    Age         0.31579      foreign key?            6     19
 4     sashelp.Class    Height      0.89474      fact?                  17     19
 5     sashelp.Class    Weight      0.78947      fact?                  15     19


Program: Proc Freq

Title2 RnD-Card-Ratio;
%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
PROC Freq data   = &Libname..&MemName.;
          tables   &Name.
                 / list missing noprint
             out = Work.Unique;
 
DATA   Work.Unique;
attrib NobsData  length = 4
       NLevels   length = 4 label = 'N Levels'
       CardRatio length = 8 label = 'Card. Ratio 1= unique';
drop   DsId Rc;
 
DsId     = open ("&Libname..&MemName.",'i');
NobsData = attrn(dsid,'nobs');
Rc       = close(dsid);
 
DsId     = open ("Work.Unique",'i');
NLevels  = attrn(dsid,'nobs');
Rc       = close(dsid);
 
CardRatio = NobsUnique / NobsData;
output;
stop;
 
Proc Print data = CardRatio
           label;

Listings

RnD-Card-Ratio
sashelp.Class.Name
                         Card.
       Nobs    N       Ratio 1=
Obs    Data    Levels   unique
 
 1      19      19         1
RnD-Card-Ratio
sashelp.Class.Sex
                         Card.
       Nobs    N       Ratio 1=
Obs    Data    Levels    unique
 
 1      19       2      0.10526

Program: Proc Sort

%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
PROC Sort data = &Libname..&MemName.(keep = &Name.)
          out  = Work.Unique
                 nodupkey;
          by     &Name.;
 
*same as above;

Program: Proc SQL

%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
PROC SQL; create table Work.Unique as
          select count(*)                as NobsData
               , count(distinct(&Name.)) as NLevels
               , calculated NLevels
               / calculated NobsData     as CardRatio
          from &Libname..&MemName.(keep = &Name.);
          quit;
PROC Print data = Work.Unique;
run;

Here is sql code using list processing to make several tables and append them to create a list for a data set.

 /*    name: CardRatio-sql
description: calculate Nlevels and Cardinality Ratio for each variable
             of a data set
    purpose: bricolage: use sql to calculate Cardinality Ratio
RJF2 2012-May-17 in answer to Q at GASUG
******/
options nocenter;
options mprint;
%Let Libname = sashelp;
%Let Memname = class;
%Let Name    = Sex;
 
PROC SQL ;
         select Nobs
           into:Nobs
          from  dictionary.tables
          where Libname eq "%upcase(&Libname)"
            and Memname eq "%upcase(&Memname)";
          %Put note Nobs: &Nobs;
 
         select '%_(' !! trim(Name)
           into:List separated by ')'
          from  dictionary.columns
          where Libname eq "%upcase(&Libname)"
            and Memname eq "%upcase(&Memname)";
%Macro _(Name);
         create table &Name as
         select "&Name"                  as Name length =32
               ,count(distinct(&Name))   as Nlevels
               ,calculated Nlevels/&Nobs as CardRatio
           from &Libname..&Memname(keep = &Name);
         select * from &Name.;
%mend;
&List
         /*********************************************
         *RnD: hard-coded example;
         create table          Sex as
         select               "Sex"    as Name
               ,count(distinct(Sex))   as Nlevels
               ,calculated Nlevels/&Nobs as CardRatio
           from &Libname..&Memname
               (keep = Sex);
         *RnD: soft-coded example;
         create table &Name as
         select "&Name"                  as Name
               ,count(distinct(&Name))   as Nlevels
               ,calculated Nlevels/&Nobs as CardRatio
           from &Libname..&Memname(keep = &Name);
         select * from &Name.;
         /*********************************************/
         quit;
%Macro _(Name);
Proc Append base = NLevels
            data = &Name;
%mend;
&List
 
PROC Print data = Nlevels;
           title3 Nlevels;
run;

listing:

%_(Name)
%_(Sex)
%_(Age)
%_(Height)
%_(Weight)

Name                               Nlevels  CardRatio
------
Name                                    19          1

Name                               Nlevels  CardRatio
------
Sex                                      2   0.105263

Name                               Nlevels  CardRatio
------
Age                                      6   0.315789

Name                               Nlevels  CardRatio
------
Height                                  17   0.894737


Name                               Nlevels  CardRatio
------
Weight                                    15   0.789474


Nlevels

                              Card
Obs    Name      Nlevels     Ratio

 1     Name         19      1.00000
 2     Sex           2      0.10526
 3     Age           6      0.31579
 4     Height       17      0.89474
 5     Weight       15      0.78947

Program: Hash Table

%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;
 
Title3 "&Libname..&Memname..&Name.";
 
DATA   Work.Unique;
attrib NobsData  length = 4
       NLevels   length = 4 label = 'N Levels'
       CardRatio length = 8 label = 'Card. Ratio 1= unique';
drop   DsId Rc;
 
*instantiate hash table here;
 
DsId     = open ("&Libname..&MemName.",'i');
NobsData = attrn(dsid,'nobs');
Rc       = close(dsid);
 
NLevels = ??? count(of hash-table array where value ne . or ' ');
 
CardRatio = NLevels / NobsData;
output;
stop;
 
PROC Print data = Work.Unique;
run;

References

-- created by User:Rjf2 10:10, 24 April 2008 (EDT)

--Ronald_J._Fehd macro.maven == the radical programmer