As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


Difference between revisions of "Fun with PROC SQL Summary Functions"

From sasCommunity
Jump to: navigation, search
(Caveat, Especially Regarding MEDIAN: - Median has been supported since 2012, SAS 9.4)
(migrated)
 
Line 1: Line 1:
 +
{{Template:Article_Migrated|https://communities.sas.com/t5/SAS-Communities-Library/Fun-with-PROC-SQL-Summary-Functions/ta-p/475821}}
 
The [[SQL procedure]] is a wonderful tool for summarizing (or aggregating) data. It provides a number of useful summary (or aggregate) functions to help perform calculations, descriptive statistics, and other aggregating operations in a [[SELECT statement]] or HAVING clause. These functions are designed to summarize information and not display detail about data.
 
The [[SQL procedure]] is a wonderful tool for summarizing (or aggregating) data. It provides a number of useful summary (or aggregate) functions to help perform calculations, descriptive statistics, and other aggregating operations in a [[SELECT statement]] or HAVING clause. These functions are designed to summarize information and not display detail about data.
 
Without the availability of summary functions, you would have to construct the necessary logic using somewhat complicated SQL programming constructs. When using a summary function without a GROUP BY clause, all the rows in a table are treated as a single group. Consequently, the results are often a single row value.
 
Without the availability of summary functions, you would have to construct the necessary logic using somewhat complicated SQL programming constructs. When using a summary function without a GROUP BY clause, all the rows in a table are treated as a single group. Consequently, the results are often a single row value.

Latest revision as of 14:46, 27 July 2018


As part of the decommissioning effort for sasCommunity.org this article/tip has been migrated to communities.sas.com.

The new home for this article/tip is Fun with PROC SQL Summary Functions (https://communities.sas.com/t5/SAS-Communities-Library/Fun-with-PROC-SQL-Summary-Functions/ta-p/475821)



The SQL procedure is a wonderful tool for summarizing (or aggregating) data. It provides a number of useful summary (or aggregate) functions to help perform calculations, descriptive statistics, and other aggregating operations in a SELECT statement or HAVING clause. These functions are designed to summarize information and not display detail about data. Without the availability of summary functions, you would have to construct the necessary logic using somewhat complicated SQL programming constructs. When using a summary function without a GROUP BY clause, all the rows in a table are treated as a single group. Consequently, the results are often a single row value.

A number of summary functions are available including facilities to count non-missing values; determine the minimum and maximum values in specific columns; return the range of values; compute the mean, standard deviation, and variance of specific values; and other aggregating functions. The next table shows an alphabetical listing of the available PROC SQL summary functions and, when multiple names for the same function are available, the ANSI-approved name appears first.

Summary Function      Description
AVG, MEAN             Average or mean of values
COUNT, FREQ, N        Aggregate number of non-missing values
CSS                   Corrected sum of squares
CV                    Coefficient of variation
MAX                   Largest value
MIN                   Smallest value
NMISS                 Number of missing values
PRT                   Probability of a greater absolute value of Student’s t
RANGE                 Difference between the largest and smallest values
STD                   Standard deviation
STDERR                Standard error of the mean
SUM                   Sum of values
SUMWGT                Sum of the weight variable values which is 1
T                     Testing the hypothesis that the population mean is zero
USS                   Uncorrected sum of squares
VAR                   Variance

Examples

The next example uses the COUNT function with the (*) argument to produce a total number of rows, regardless if data is missing. The asterisk (*) is specified as the argument to the COUNT function to count all rows in the MOVIES table.

PROC SQL Code

PROC SQL;
 SELECT COUNT(*) AS Row_Count
  FROM MOVIES;
QUIT;

Results

Row_Count
       22

Unlike the COUNT(*) function syntax that counts all rows, regardless if data is missing or not, the next example uses the COUNT function with the (column-name) argument to produce a total number of non-missing rows based on the column, RATING.

SQL Code

PROC SQL;
 SELECT COUNT(RATING) AS Non_Missing_Row_Count
  FROM MOVIES;
QUIT;

Results

Non_Missing_
   Row_Count
          22

The MIN summary function can be specified to determine what the shortest running movie is in the MOVIES table.

SQL Code

PROC SQL;
 SELECT MIN(length) AS Shortest
           Label=’Shortest Length’
  FROM MOVIES;
QUIT;

Results

Shortest
  Length
      97

In the next example, the RANGE function is specified to determine the difference between the largest and smallest values for a selected column. Suppose you wanted to determine the range of movie lengths for all movies by their movie rating (e.g., G, PG, PG-13, and R). You could construct the following query as follows.

SQL Code

PROC SQL;
 SELECT RANGE(length)
              AS Range_Length
   FROM MOVIES
    GROUP BY RATING;
QUIT;

Results

Range_
Length
     0
    32
    97
    72

Caveat, Especially Regarding MEDIAN

The menu of statistics presented above looks a lot like the menu of statistics which can be calculated by PROC MEANS. However, there are some statistics which PROC MEANS can produce but PROC SQL cannot. The most prominent of these is probably MEDIAN.

However, if you call for a column MEDIAN calculation with PROC SQL, you will not get an ERROR message. That's because PROC SQL assumes that you are requesting the row-wise MEDIAN function. Of course this function is trivial when called with just one argument (the median of one value is that value). Depending on what else your SQL code is doing, you may or may not notice the misinterpretation and its consequences on your results.

This issue has been discussed at length on [SAS-L] (broken link).

NOTE: This issue was fixed in SAS 9.4. Proc SQL now supports the standard behavior for the MEDIAN calculation.

Character Arguments

Most of the summary statistics which PROC SQL can compute are conceptually numeric. A few (COUNT, NMISS, MIN, MAX) make sense with character values as well, and PROC SQL will allow you to use these four functions with character arguments. Note that this capability is limited to column-wise usage only.

Expressions as Arguments

The examples above use column names as the function arguments (except for the special asterisk argument used with the COUNT function). SQL is not that limited however. You can use any appropriate expression (that is, formula) as the argument of a summary function.