As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.

Here are some tips for How to share your SAS knowledge with your professional network.

# Cardinality Ratio

Jump to: navigation, search
Tiny Url: http://tinyurl.com/5grpgn

2017-11-09 see latest paper on Cardinality Ratios and Types

The Cardinality of a variable is the ratio of the number of rows of the frequency data set divided by the number of rows of the data set. It is a Pure Number, because it has no units.

(See Wikipedia for more about Dimensionless units.)

Note: 8/26/2015, new program; see page Making_Lists, section Making List of Variables, subsection Freq.nlevels

## Characteristics

• The range is (0,...,1]; cannot be zero and can be one.
• Explanation: There is at least one value for a variable: character = ' ' and numeric = .; therefore the range is 1/nobs : nobs/nobs.
• If the Cardinality Ratio is one then the variable is unique, and probably a Primary Key or Row Identifier.
• The distribution of values for variables in a data has two groups: each near the extreme values of near zero and near one.
• Values in the left group, near zero, are usually Foreign Keys; you expect a Dimension (LookUp) Table, with the variable name
• See Wikipedia: Foreign key
• example:
• variable name: Customer_Id
• dimension table name: Customer_Id
• Values in the right group, near one, are probably continuous.
• In a Fact Table, numeric variables are probably facts.
• Note: Row-Id in a Fact Table may be an integer: length = 4; if its length is 8, then it may be a datetime-stamp.

## Usage in Data Review

The Cardinality Ratio of variables in a data set highlights this information:

1. Primary Key: row-id, row-number, sequence-number: a unique variable, usually an integer (length = 4), which identifies each row.
2. Foreign Key: variables with low cardinality, discrete, classification variables.
3. Facts: variables with high cardinality, continuous, analysis variables.

Note: this value is calculated by the data review suite: SmryEachVar

## Examples

### Program: Proc Freq Nlevels

This programs illustrates the Nlevels option in Proc Freq.

```Proc Freq data = sashelp.class
nlevels;
run;```

output:

```The FREQ Procedure

Number of Variable Levels

Variable      Levels
--------      ------
Name              19
Sex                2
Age                6
Height            17
Weight            15```

Note: this is written to the output destination and is not saved in output data sets.

2008-May-08: SAS-L to the rescue!

"data _null_" suggests:

```PROC Freq data     = sashelp.CitiDay
levels;
ods
output
nlevels = Work.Nlevels;

PROC SQL; describe table work.Nlevels;
quit;
Proc Print data = Work.Nlevels;
run;```

This is the describe table of sashelp.class Nlevels.

```create table WORK.NLEVELS( label='Number of Variable Levels' bufsize=4096 )
(
TableVar char(6) label='Table Variable',
NLevels num format=BEST8. label='Number of Levels'
);```

This is the describe table of sashelp.citiday Nlevels.

```create table WORK.NLEVELS( label='Number of Variable Levels' bufsize=8192 )
(
TableVar char(8) label='Table Variable',
TableVarLabel char(40) label='Table Variable Label',
NLevels num format=BEST8. label='Number of Levels',
NMissLevels num format=BEST8. label='Number of Missing Levels',
NNonMissLevels num format=BEST8. label='Number of Nonmissing Levels'
);```

### Program: Proc Freq from SAS-L's data _null_

```data work.class;
set sashelp.class;
if _n_ eq 3 then call missing(sex,height);
run;

proc freq data      = work.class
levels;
ods
output
nlevels = nlevels;
tables      _all_
/ noprint;
run;

data work.ratio;
if 0 then set work.class(drop=_all_) nobs=nobs;
attrib CardRatio  length = 8 label = 'Card. Ratio 1= unique';
nobsData = nobs;
do until(eof);
set work.nlevels end=eof;
CardRatio = Nlevels / NobsData;
output;
end;
stop;
run;

Proc Print label;
run;```

### Program: CrdRatio-demo

``` /*    name: CrdRatio-demo
description: provide a data set with Cardinality Ratio
Nobs(data)/Nobs(levels) of each variable in data set
purpose: data review
Cardinality Ratio is an indicator of type of variable
primary key
foreign key
fact
CR = 1 : var is Primary Key :: identifies row
1.0 > CR >= 0.5 : var is continous, may be a fact: summable
0.5 < CR >  0 : var values are few, may be a foreign key
note:  0.5 is a relative cut-off
RJF2 5/11/2012
usage:
%Let Libname = sashelp;
%Let Memname = Class;
%Let Memname = Heart;
*Let Memname = Shoes;

%Let Out_Lib = Library;
%Let Out_Lib = Work;
%Let Out_Data = Nlevels;
%Include SiteIncl(CrdRatio-demo);
********/
PROC Freq data    = &Libname..&Memname.
nlevels ;
ods       exclude OneWayFreqs;
ods       output
NLevels = work.&Out_Data
(rename  = (TableVar = Name));

DATA Work.&Out_Data.
(keep = Data_Set Name
CardinalityRatio is_a NLevels Nobs);
if 0 then do;
attrib Data_Set   length = \$42;
set    &Out_Data  (keep = Name);
attrib CardinalityRatio length = 8
is_a             length = \$%length(primary key?!);
set    &Out_Data  (keep   = Nlevels);
set    &Libname..&Memname.(obs=0)
nobs = Nrows ;
end;
retain Data_Set  "&Libname..&Memname"
Nobs;
Nobs = Nrows;

do until(EndoFile);
set &Out_Data end = EndoFile;
CardinalityRatio = Nlevels/Nobs;
if CardinalityRatio eq 1 then             is_a = 'primary key!?';
else if (1 > CardinalityRatio > 0.5) then is_a = 'fact?';
else                                      is_a = 'foreign key?';
output;
end;
stop;
run;
Proc Print data = Work.&Out_Data;
title3 "Cardinality Ratios for &Libname..&Memname.";
run;```

### Program: CrdRatio-demo-caller

This is the calling program:

``` /*    name: CrdRatio-demo-caller.sas
description: demo program
purpose: show output data set of CrdRatio
RJF2 2012-May-12
***********/
options source2;
%Let Libname = sashelp;
%Let Memname = Class;
*Let Memname = Heart;
*Let Memname = Shoes;

%Let Out_Lib = Library;
%Let Out_Lib = Work;
%Let Out_Data = Nlevels;
%Include SiteIncl(CrdRatio-demo);```

### Program: CrdRatio-demo-caller.lst

This is the output for sashelp.class.

```The FREQ Procedure

Number of Variable Levels

Variable      Levels
--------------------
Name              19
Sex                2
Age                6
Height            17
Weight            15

Cardinality Ratios for sashelp.Class
Cardinality
Obs      Data_Set       Name         Ratio       is_a              NLevels    Nobs

1     sashelp.Class    Name        1.00000      primary key!?          19     19
2     sashelp.Class    Sex         0.10526      foreign key?            2     19
3     sashelp.Class    Age         0.31579      foreign key?            6     19
4     sashelp.Class    Height      0.89474      fact?                  17     19
5     sashelp.Class    Weight      0.78947      fact?                  15     19
```

### Program: Proc Freq

```Title2 RnD-Card-Ratio;
%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;

Title3 "&Libname..&Memname..&Name.";

PROC Freq data   = &Libname..&MemName.;
tables   &Name.
/ list missing noprint
out = Work.Unique;

DATA   Work.Unique;
attrib NobsData  length = 4
NLevels   length = 4 label = 'N Levels'
CardRatio length = 8 label = 'Card. Ratio 1= unique';
drop   DsId Rc;

DsId     = open ("&Libname..&MemName.",'i');
NobsData = attrn(dsid,'nobs');
Rc       = close(dsid);

DsId     = open ("Work.Unique",'i');
NLevels  = attrn(dsid,'nobs');
Rc       = close(dsid);

CardRatio = NobsUnique / NobsData;
output;
stop;

Proc Print data = CardRatio
label;```

### Listings

```RnD-Card-Ratio
sashelp.Class.Name
Card.
Nobs    N       Ratio 1=
Obs    Data    Levels   unique

1      19      19         1```
```RnD-Card-Ratio
sashelp.Class.Sex
Card.
Nobs    N       Ratio 1=
Obs    Data    Levels    unique

1      19       2      0.10526```

### Program: Proc Sort

```%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;

Title3 "&Libname..&Memname..&Name.";

PROC Sort data = &Libname..&MemName.(keep = &Name.)
out  = Work.Unique
nodupkey;
by     &Name.;

*same as above;```

### Program: Proc SQL

```%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;

Title3 "&Libname..&Memname..&Name.";

PROC SQL; create table Work.Unique as
select count(*)                as NobsData
, count(distinct(&Name.)) as NLevels
, calculated NLevels
/ calculated NobsData     as CardRatio
from &Libname..&MemName.(keep = &Name.);
quit;
PROC Print data = Work.Unique;
run;```

Here is sql code using list processing to make several tables and append them to create a list for a data set.

``` /*    name: CardRatio-sql
description: calculate Nlevels and Cardinality Ratio for each variable
of a data set
purpose: bricolage: use sql to calculate Cardinality Ratio
RJF2 2012-May-17 in answer to Q at GASUG
******/
options nocenter;
options mprint;
%Let Libname = sashelp;
%Let Memname = class;
%Let Name    = Sex;

PROC SQL ;
select Nobs
into:Nobs
from  dictionary.tables
where Libname eq "%upcase(&Libname)"
and Memname eq "%upcase(&Memname)";
%Put note Nobs: &Nobs;

select '%_(' !! trim(Name)
into:List separated by ')'
from  dictionary.columns
where Libname eq "%upcase(&Libname)"
and Memname eq "%upcase(&Memname)";
%Macro _(Name);
create table &Name as
select "&Name"                  as Name length =32
,count(distinct(&Name))   as Nlevels
,calculated Nlevels/&Nobs as CardRatio
from &Libname..&Memname(keep = &Name);
select * from &Name.;
%mend;
&List
/*********************************************
*RnD: hard-coded example;
create table          Sex as
select               "Sex"    as Name
,count(distinct(Sex))   as Nlevels
,calculated Nlevels/&Nobs as CardRatio
from &Libname..&Memname
(keep = Sex);
*RnD: soft-coded example;
create table &Name as
select "&Name"                  as Name
,count(distinct(&Name))   as Nlevels
,calculated Nlevels/&Nobs as CardRatio
from &Libname..&Memname(keep = &Name);
select * from &Name.;
/*********************************************/
quit;
%Macro _(Name);
Proc Append base = NLevels
data = &Name;
%mend;
&List

PROC Print data = Nlevels;
title3 Nlevels;
run;```

listing:

```%_(Name)
%_(Sex)
%_(Age)
%_(Height)
%_(Weight)

Name                               Nlevels  CardRatio
------
Name                                    19          1

Name                               Nlevels  CardRatio
------
Sex                                      2   0.105263

Name                               Nlevels  CardRatio
------
Age                                      6   0.315789

Name                               Nlevels  CardRatio
------
Height                                  17   0.894737

Name                               Nlevels  CardRatio
------
Weight                                    15   0.789474

Nlevels

Card
Obs    Name      Nlevels     Ratio

1     Name         19      1.00000
2     Sex           2      0.10526
3     Age           6      0.31579
4     Height       17      0.89474
5     Weight       15      0.78947
```

### Program: Hash Table

```%Let Libname = sashelp;
%Let Memname = Class;
%Let Name    = Sex;

Title3 "&Libname..&Memname..&Name.";

DATA   Work.Unique;
attrib NobsData  length = 4
NLevels   length = 4 label = 'N Levels'
CardRatio length = 8 label = 'Card. Ratio 1= unique';
drop   DsId Rc;

*instantiate hash table here;

DsId     = open ("&Libname..&MemName.",'i');
NobsData = attrn(dsid,'nobs');
Rc       = close(dsid);

NLevels = ??? count(of hash-table array where value ne . or ' ');

CardRatio = NLevels / NobsData;
output;
stop;

PROC Print data = Work.Unique;
run;```