As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


Difference between revisions of "Converting 2010 Census Summary File 1 (SF1) Data into SAS Data Sets"

From sasCommunity
Jump to: navigation, search
(added link to HASH paper, additional example, some minor text edits)
m (took out "stray" quote mark)
Line 201: Line 201:
  
 
<source lang="sas">
 
<source lang="sas">
* library for census data sets';
+
* library for census data sets;
 
libname c "&dd.:\census_2010\&state.\datasets";
 
libname c "&dd.:\census_2010\&state.\datasets";
  

Revision as of 13:02, 4 July 2011

INTRODUCTION

According to United States Census Bureau ... "Summary File 1 (SF1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied)." The detailed explanation as to what information is available in the SF1 files can be found in the SF1 Technical Documentation (the description of the SF1 contents is taken from that publication).

SF1 data are being released a state-at-a-time and a list of currently available data can be found at a 2010 census news web site. There is a link on the news site to the data file, but you can go directly to the SF1 data site by clicking here. That web site will display the available states. SF1 data for each state are available in a single ZIP file. That ZIP file contains 49 files comprising: a packing list (PKG); a file with geographic information (GEO); 47 files with data on population and housing (DATA). The DATA files do not contain any geographic information and each DATA file must be linked to the GEO file to assign the data to a county, place, census tract, block group, etc.

The 47 DATA files contain one or more sets of information. The "sets of information" are referred to in the SF1 Technical Documentation as tables. For example, one DATA file (file #1 in each state) contains only one table and that table has only one variable, the total population of geographic areas. Another file (file #20 in each state) with only one table contains the information on the white-only population by gender and age (with 103 age group-specific populations for each gender) down to the census tract level. Many files contain multiple tables. There are a total of 331 tables in the 47 tables and a list of the tables within each file can found in the PKG file. A list showing the range of tables/variables within each of the 47 data files can also be found on page 2-5 of the SF1 Technical Documentation.

DETAILS

The SAS code converts the SF1 DATA files into SAS data sets. One data set is created for each of the 47 DATA files. The tables found within each data set are the same as those found within the data files and the variable names used in the data sets are the same as those found in the SF1 Technical Documentation. For example, file #1 for any state produces a data set named FILE1 and that data set contains one population variable named P0010001 (the same variable name one finds on page 6-21 of the SF1 Technical Documentation). Assume that you are using SF1 data from the state of Delaware.

There is an assumed directory (folder) structure for the location of both the data and the data sets ...

data: e:\census_2010\delaware\data\<files> data sets: e:\census_2010\delaware\data\

There is a macro variable (&DD) that allows you to specify a disk, but the rest of the directory structure is hard-coded (there is no reason that it cannot be changed if you would rather use a different structure). Macro variables also allow you to change the state name and the two-letter state code. The state code is a portion of the name of each SF1 data file (for example, de000012010.sf1 is file #1 for Delaware) and that file produces data set DE1).

There are many summary levels within the SF1 data. Summary levels correspond to geographic areas (referred to as "geographic elements" in the SF1 Technical Documentation). A detailed description of summary levels is available starting on page 4-1 of the SF1 Technical Documentation. The default list of summary levels selected by the SAS code are as follows ...

040 STATE
050 STATE-COUNTY
060 STATE-COUNTY-COUNTY SUBDIVISION
070 STATE-COUNTY-COUNTY SUBDIVISION-PLACE/REMAINDER
140 STATE-COUNTY-CENSUS TRACT
160 STATE-PLACE
871 STATE-5-DIGIT ZIP CODE TABULATION AREA (ZCTA)
881 STATE-5-DIGIT ZIP CODE TABULATION AREA-COUNTY (ZCTA)

The list can be changed by by editing the value of macro variable &SUMLEVS.

The default list of summary level values is associated with variables that are added to the data sets. There is default set of names used for the geographic areas represented by the summary levels. Notice that though there is list of eight summary levels, there are only five summary level variables: there is no STATE variable since all data are state-specific; there are two summary levels for PLACE (070 and 160) and only one variable is needed for the PLACE value; there are two summary levels for ZCTA (871 and 881) and only one variable is needed for the ZCTA value

The list of files converted to data sets can be changed by changing the value of the macro variable &FNS. The default is all 47 files (with a range expressed as 1:47). If not all tables are desired, change the value of the macro variable. For example, if one wanted to convert file 1 through 5 and files 42 through 47, the macro variable value would be 1:5 42:47.

SAS CODE

  • Set parameters with macro variables.
* disk drive for all file;
%let dd=e;
 
* state name (folder) and two-letter state abreviation (in all census file names for given state);
%let state=delaware;
%let state2=de;
 
*
specify SUMMARY LEVELS that will be in the data sets
or comment the IF statement in first data step to select all summary levels
default selection is ...
040 STATE
050 STATE-COUNTY
060 STATE-COUNTY-COUNTY SUBDIVISION
070 STATE-COUNTY-COUNTY SUBDIVISION-PLACE/REMAINDER
140 STATE-COUNTY-CENSUS TRACT
160 STATE-PLACE
871 STATE-5-DIGIT ZIP CODE TABULATION AREA
881 STATE-5-DIGIT ZIP CODE TABULATION AREA-COUNTY
;
%let sumlevs=%str('040' '050' '060' '070' '140' '160' '871' '881');
 
* locations of default summary level variable names (no state needed, data are state-specific);
%let input=%str(
@030 county    $3.
@037 cousub    $5.
@046 place     $5.
@055 tract     $6.
@172 zcta      $5.
);
 
*
specify VARIABLE NAMES  (in quotes, comma-separated) to match above names
default selection is COUNTY, COUNTY SUBDIVISION, PLACE, CENSUS TRACT, ZCTA (ZIP)
add names if additional summary levels are chosen
;
%let vars=%str('county','cousub','place','tract', 'zcta');
 
* specify FILES ... one data set per file (default selection is all 47 files);
%let fns=%str(1:47);
 
*
if data and data set locations follow a recommended pattern, for example using data from Delaware ...
 
e:\census_2010\delaware\data\<files>
e:\census_2010\delaware\datasets\<data sets>
 
there is no need to change anything below unless
 
*******************************************************************************************;
  • Read the geographic information file (for Delaware, that is degeo2010.sf1) and create data set GEO. Variables in that data set are later added to information in the DATA files using a hash data step object.
* location of the GEO file;
filename st "&dd.:\census_2010\&state.\data\&state2.geo2010.sf1";
 
* create data set GEO;
data geo;
infile st lrecl=500 truncover;
input
@009 sumlev    $3.
@012 geocomp   $2.
@019 logrecno   7.
&input
;
if geocomp eq '00';
 
*
choose summary levels in macro variable (above)
COMMENT next statement to select all summary levels
choosing all summary levels will require adding
geographic areas to the INPUT statement
;
if sumlev in (&sumlevs);
 
drop geocomp;
run;
  • Read the packing list file (for Delaware, that is de2010.sf1.prd.packinglist.txt) and create data set RULES. Variables in that data set are used as instructions for writing SAS code in the subsequent data setp.
* location of the file PACKINGLIST file;
filename packing "&dd.:\census_2010\&state.\data\&state2.2010.sf1.prd.packinglist.txt";
 
*
read the "rules' used to create 47 data steps ... one per SF1 file
use an IF statement to select files if not all 47 files are needed
;
data rules;
length var $6 vartyp $4;
infile packing firstobs=21 obs=351;
input txt : $20.;
filen  = input(scan(txt,2,'|:'),best.);
 
if filen in (&fns);
 
var    = scan(txt,1,'|');
varnum = input(compress(var,,'kd'),best.);
vartyp = substr(var,1,findc(var,,'d') - 1);
pops   = input(scan(txt,3,'|:'),best.);
if findc(char(right(var),6),,'a') then ltr = char(right(var),6);
keep filen vartyp varnum ltr pops;
run;
  • Write SAS code that reads the DATA files, adds information from the GEO data set, and creates SAS data sets. The DATA _NULL_ section writes SAS code to a temporary location. One data step is written for each file selected for conversion to a data set (the default is 47 files, so 47 data steps are written).
* destination for the 47 data steps (temporary disk space);
filename census temp;
 
*location of population files ... folder + 2-letter state abbreviation;
%let loc=%str(&dd.:\census_2010\&state\data\&state2);
 
* data step that creates the 47 data steps;
data _null_;
file census;
do until (last.filen);
   set rules ;
   by filen notsorted;
   if first.filen then do;
      if filen not in ('40' '41') then ctype = 'binary';
	  else ctype = 'no';
      fn = cats("&state2", filen);
      fnm = cats("'&loc", put(filen,z5.), "2010.sf1'");
  put "data c." fn "(compress=" ctype ");" /
	      "set geo (obs=1);" /
          "declare hash h(dataset:'geo');" /
          "h.defineKey('logrecno');" /
          "h.defineData('sumlev', &vars);" /
          "h.defineDone();" /
          "infile " fnm " lrecl=5000 truncover dsd end=done;" /
		  "do until(done);" /
          "input (d1-d4) (: $1.) logrecno" ;
   end;
   select;
      when (missing(ltr) or
            vartyp eq : 'h') do; v1 = cats(vartyp, put(varnum,z3.), ltr, '0001');
                                 v2 = cats(vartyp, put(varnum,z3.), ltr, put(pops,z4.));
                             end;
      otherwise              do; v1 = cats(vartyp, put(varnum,z3.), ltr, '001');
                                 v2 = cats(vartyp, put(varnum,z3.), ltr, put(pops,z3.));
                                 end;
   end;
   put  v1 '- ' v2;
end;
put ";" /
    "if h.find() eq 0 then output;" /
	"end;" /
	"stop;" /
    "drop d1-d4 logrecno;" /
    "run;" //;
run;
  • Use %INCLUDE to run the data steps that were written in the previous data step.
* library for census data sets;
libname c "&dd.:\census_2010\&state.\datasets";
 
*
create the data sets
source2 puts statments in the LOG
;
%include census / source2;
 
filename census clear;
  • VOILA !!!

You can cut/paste the SAS code from all the above sections and use the complete SAS job to convert the data files from any state into SAS data sets. If you prefer, you can download all the SAS code in one file by clicking here. The file will also contain the examples shown in the EXAMPLES section.

HOW DOES IT ALL WORK

The above SAS code works by writing SAS code (47 data steps) using a DATA _NULL_ step and then executing that code using the %INCLUDE statement. If you would like to see the SAS code that is written and executed, all you have to do is change the one line in the DATA _NULL_ step ... change FILE CENSUS; to FILE PRINT; ... and the SAS code will appear in the output window if you are SAS in display manager mode. The first of the 47 data steps looks as follows (note: you can just cut/paste the SAS code from the OUTPUT window to the EDITOR window and submit the code from there if you would like to try creating one table "manually").

data c.de1 (compress=binary );
set geo (obs=1);
declare hash h(dataset:'geo');
h.defineKey('logrecno');
h.defineData('sumlev', 'county','cousub','place','tract', 'zcta');
h.defineDone();
infile 'e:\census_2010\delaware\data\de000012010.sf1'  lrecl=5000 truncover dsd end=done;
do until(done);
input (d1-d4) (: $1.) logrecno
p0010001 - p0010001
;
if h.find() eq 0 then output;
end;
stop;
drop d1-d4 logrecno;
run;

There is only one population variable written to data set DE1. The last of the 47 SAS job creates data set DE47 and that data set has 74 variables with information on housing characteristics.

data c.de47 (compress=binary );
set geo (obs=1);
declare hash h(dataset:'geo');
h.defineKey('logrecno');
h.defineData('sumlev', 'county','cousub','place','tract', 'zcta');
h.defineDone();
infile 'e:\census_2010\delaware\data\de000472010.sf1'  lrecl=5000 truncover dsd end=done;
do until(done);
input (d1-d4) (: $1.) logrecno
hct0010001 - hct0010035
hct0020001 - hct0020013
hct0030001 - hct0030013
hct0040001 - hct0040013
;
if h.find() eq 0 then output;
end;
stop;
drop d1-d4 logrecno;
run;

The first portion of each data step creates a hash data step object and puts information from the GEO data set into memory. If you are not familiar with hash data step objects, a good place to start reading about them is the paper Getting Started with the DATA Step Hash Object by Jason Secosky and Janice Bloom. Remember that the GEO data set contains observations for only selected summary levels. Once the GEO data set is in memory, a DATA file is read and only those observations with a LOGRECNO matching a LOGRECNO in the GEO data set are written to the data set.

EXAMPLES

Now that the census data are in data sets, it is easy to use.

title 'DELAWARE:  2010 COUNTY POPULATIONS';
proc print data=c.de1 noobs;
where sumlev eq '050';
var county;
sum p0010001;
format p001: comma7.;
run;
DELAWARE:  2010 COUNTY POPULATIONS

county    p0010001

 001      162,310
 003      538,479
 005      197,145
          ========
          897,934

There are only three counties in Delaware and 60% of the population in Delaware resides in county '003'. Those are FIPS (Federal Information Processing Standards) county numbers. The county numbers and names are available in the data set ZIPCODE in the SASHELP library.

* get county numbers and names from SASHELP.ZIPCODE;
proc sql;
create table fmt as
select distinct put(county,z3.) as start, countynm as label, "$cou2nam" as fmtname
from sashelp.zipcode
where statecode eq 'DE';
quit;
 
* create a format to convert county numbers to names;
proc format cntlin=fmt;
run;
 
title 'DELAWARE:  2010 COUNTY POPULATIONS';
proc print data=c.de1 label noobs;
where sumlev eq '050';
var county;
sum p0010001;
label 
p0010001='POPULATION'
county='COUNTY'
;
format p001: comma7. county $cou2nam.;
run;
DELAWARE:  2010 COUNTY POPULATIONS

COUNTY        POPULATION

Kent           162,310
New Castle     538,479
Sussex         197,145
              ==========
               897,934

Now you can see that 60% of the population lives in New Castle county. To look at zip-specific populations choose another summary level. Summary level '871' shows both county and zip (some zips are in multiple counties). There is no county if you use summary level '881' (another zip-specific summary level).

* get county numbers and names / zip codes and city names from SASHELP.ZIPCODE;
proc sql;
create table cfmt as
select distinct put(county,z3.) as start, countynm as label, "$cou2nam" as fmtname
from sashelp.zipcode
where statecode eq 'DE';
create table zfmt as
select distinct put(zip,z5.) as start, city as label, "$zip2nam" as fmtname
from sashelp.zipcode;
quit;
 
* create formats to convert county numbers to names and zips to cities;
proc format cntlin=cfmt;
run;
 
proc format cntlin=zfmt;
run;
 
title 'DELAWARE:  2010 ZIP POPULATIONS (SOME ZIPS ARE IN MULTIPLE COUNTIES)';
proc report data=c.de1 nowd;
where sumlev eq '881';
columns county zcta zcta=temp p0010001;
define county   / 'COUNTY' format=$cou2nam.;
define zcta     / 'ZIP';
define temp     / 'CITY' format=$zip2nam.;
define p0010001 / 'POPULATION' format=comma10.;
run;

Only a portion of the output is shown.

DELAWARE:  2010 ZIP POPULATIONS (SOME ZIPS ARE IN MULTIPLE COUNTIES)

  COUNTY      ZIP    CITY                          POPULATION
  New Castle  19701  Bear                              39,194
  New Castle  19702  Newark                            51,899
  New Castle  19703  Claymont                          14,471
  New Castle  19706  Delaware City                      1,822
  New Castle  19707  Hockessin                         16,483
  New Castle  19709  Middletown                        35,107
  New Castle  19710  Montchanin                            41
  New Castle  19711  Newark                            51,322
  New Castle  19713  Newark                            30,408
  New Castle  19716  Newark                             1,568
  New Castle  19717  Newark                             3,770
  New Castle  19720  New Castle                        59,250
  New Castle  19730  Odessa                               451
  New Castle  19731  Port Penn                            252
  New Castle  19732  Rockland                              25
  New Castle  19733  Saint Georges                        110
  New Castle  19734  Townsend                          11,651
  New Castle  19735  Winterthur                            12
  New Castle  19736  Yorklyn                               32
  New Castle  19801  Wilmington                        16,286
  New Castle  19802  Wilmington                        24,621
  New Castle  19803  Wilmington                        21,364
  New Castle  19804  Wilmington                        17,944
  New Castle  19805  Wilmington                        40,937
  New Castle  19806  Wilmington                         9,560
  New Castle  19807  Wilmington                         7,405
  New Castle  19808  Wilmington                        38,442
  New Castle  19809  Wilmington                        14,049
  New Castle  19810  Wilmington                        25,011
  Kent        19901  Dover                             35,055
  Kent        19902  Dover Air Force Base                 283
  Kent        19904  Dover                             34,132
  Sussex      19930  Bethany Beach                      2,768
  Sussex      19931  Bethel                               211
  Sussex      19933  Bridgeville                        8,535

QUESTIONS/COMMENT

If you have any questions about this posting, you are welcome to send me a note by clicking here ... email Mike. To see my other SAS Community postings click here.