As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


User:Paulkaefer/sandbox

From sasCommunity
Jump to: navigation, search
This is a work in progress. You can contribute to this article.
Once it is at a reasonable point, it will be...joined...with the SQL joins page.

SQL joins are the SQL equivalent of DATA Step merges. This page explains the way SQL joins are done using SAS PROC SQL, as well as key concepts.

Introduction: SAS syntax

The general syntax is as follows. Multiple tables may be joined together in one query, up to 256.[1]

   proc sql;
       SELECT col1, col2, etc.
       FROM table1
           inner/left/right/full join table2 on ...
           [inner/left/right/full join table3 on ...]
           [...]
   quit;

Other SQL statement syntax may also be used to subset the data to be joined. See Anatomy of a SQL query for more on SQL syntax.

Relational databases

The idea of relational databases is to have multiple tables that may be linked to other tables using unique keys. For example, in a company's database, an employees table may be linked to a salary table via a unique employee ID.

For much more on relational databases, see the Wikipedia article on the subject.

What does a join do?

A join transforms data by combining multiple sets together. While this sounds like a simple task, there are many possible combinations, even for just two input sets. Do you want to include items that have a match in both sets, or items that only match one set? Understanding basic set theory will reveal that there are many possible subsets based on the combination of two sets.

Join terminology

A subset is a "set within a set." A subset may have some or all of the elements in its superset. If this is confusing, read more and see simple examples at Wikipedia. You can join on full datasets or subsets, by filtering out the data via a DATA step or SQL query.

A union is all the distinct elements that are in one or both sets.[2] Intersection refers to the elements that belong to both sets, not one or the other.[3] In joins, the keywords UNION and INTERSECT are used.

The universal set is the set of all possible elements. For example, the universal set of numbers contains pi and -42. The subset of integers does not contain pi, and the subset of positive integers does not contain -42. A dataset is likely a subset of a universal set, but in some cases, a data table may contain every possible outcome or event in a given field.

SQL joins may be performed on more than two sets, following the same rules.

Types of joins

[4]

In the following explanations, these two tables will be used as examples:

Natural join

no keys; a.k.a. Cartesian product

Cartesian join

The Cartesian join or Cartesian product of

Inner join

Outer join

Left, right, or full

Left join

Right join

A right join is the same as a left join, but on the other set being joined (on the "right side" of the query). If the order of the datasets is switched, a left join will become a right join.

full join

Full outer join ref: [5]

Full outer join vs. cross join / Cartesian product, with an example: [6]

Other types of joins

See SQL Joins -- The Long and The Short of It.

Keys and indexes

Indexes: basic or composite, see [7]

Primary key

Secondary key

Foreign key

Business key

Surrogate key

Things to consider

  1. Chasm trap [8]
  2. Fan trap [9] <-- not SAS per se, but has examples of this and chasm trap
  3. Hash join (up with indexes?) [10] [11]
  4. Sort-merge join [12]
  5. table scans [13] [14]
One example on why they may not be recommended: [15]
  1. magic numbers; just avoid them? see [16]; maybe ask Cameron to clarify further or provide expertise
  2. threaded joins [17]
  3. partitioning
  4. normalization and denormalization
  5. And so on....

See also

and Joining Data: Data Step Merge or SQL?