As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.
Here are some tips for How to share your SAS knowledge with your professional network.
Converting 2010 Census Summary File 1 (SF1) Data into SAS Data Sets
According to United States Census Bureau ... "Summary File 1 (SF1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied)." The detailed explanation as to what information is available in the SF1 files can be found in the SF1 Technical Documentation (the description of the SF1 contents is taken from that publication).
The SF1 data can be directly accessed here: clicking here. SF1 data for each state are available in a single ZIP file. That ZIP file contains 49 files comprising: a packing list (PKG); a file with geographic information (GEO); 47 files with data on population and housing (DATA). The DATA files do not contain any geographic information and each DATA file must be linked to the GEO file to assign the data to a county, place, census tract, block group, etc.
The 47 DATA files contain one or more sets of information. The "sets of information" are referred to in the SF1 Technical Documentation as tables. For example, one DATA file (file #1 in each state) contains only one table and that table has only one variable, the total population of geographic areas. Another file (file #20 in each state) with only one table contains the information on the white-only population by gender and age (with 103 age group-specific populations for each gender) down to the census tract level. Many files contain multiple tables. There are a total of 331 tables in the 47 tables and a list of the tables within each file can found in the PKG file. A list showing the range of tables/variables within each of the 47 data files can also be found on page 2-5 of the SF1 Technical Documentation (pdf page 15). A list of the variables within each table is given beginning on page 6-21 and continuing through page 6-349 (pdf pages 183-511).
The SAS code converts the SF1 DATA files into SAS data sets. One data set is created for each of the 47 DATA files. The tables found within each data set are the same as those found within the data files and the variable names used in the data sets are the same as those found in the SF1 Technical Documentation. For example, file #1 for any state produces a data set named FILE1 and that data set contains one population variable named P0010001 (the same variable name one finds on page 6-21 of the SF1 Technical Documentation). Assume that you are using SF1 data from the state of Delaware.
There is an assumed directory (folder) structure for the location of both the data and the data sets ...
data sets: e:\census_2010\delaware\datasets\
There is a macro variable (&DD) that allows you to specify a disk, but the rest of the directory structure is hard-coded (there is no reason that it cannot be changed if you would rather use a different structure). Macro variables also allow you to change the state name and the two-letter state code. The state code is a portion of the name of each SF1 data file (for example, de000012010.sf1 is file #1 for Delaware) and that file produces data set DE1).
There are many summary levels within the SF1 data. Summary levels correspond to geographic areas (referred to as "geographic elements" in the SF1 Technical Documentation). A detailed description of summary levels is available starting on page 4-1 of the SF1 Technical Documentation. The default list of summary levels selected by the SAS code are as follows ...
040 STATE 050 STATE-COUNTY 060 STATE-COUNTY-COUNTY SUBDIVISION 070 STATE-COUNTY-COUNTY SUBDIVISION-PLACE/REMAINDER 140 STATE-COUNTY-CENSUS TRACT 160 STATE-PLACE 871 STATE-5-DIGIT ZIP CODE TABULATION AREA (ZCTA) 881 STATE-5-DIGIT ZIP CODE TABULATION AREA-COUNTY (ZCTA)
The list can be changed by by editing the value of macro variable &SUMLEVS.
The default list of summary level values is associated with variables that are added to the data sets. There is default set of names used for the geographic areas represented by the summary levels. Notice that though there is list of eight summary levels, there are only five summary level variables: there is no STATE variable since all data are state-specific; there are two summary levels for PLACE (070 and 160) and only one variable is needed for the PLACE value; there are two summary levels for ZCTA (871 and 881) and only one variable is needed for the ZCTA value
The list of files converted to data sets can be changed by changing the value of the macro variable &FNS. The default is all 47 files (with a range expressed as 1:47). If not all tables are desired, change the value of the macro variable. For example, if one wanted to convert file 1 through 5 and files 42 through 47, the macro variable value would be 1:5 42:47.
- Set parameters with macro variables.
* disk drive for all file; %let dd=e; * state name (folder) and two-letter state abreviation (in all census file names for given state); %let state=delaware; %let state2=de; * specify SUMMARY LEVELS that will be in the data sets or comment the IF statement in first data step to select all summary levels default selection is ... 040 STATE 050 STATE-COUNTY 060 STATE-COUNTY-COUNTY SUBDIVISION 070 STATE-COUNTY-COUNTY SUBDIVISION-PLACE/REMAINDER 140 STATE-COUNTY-CENSUS TRACT 160 STATE-PLACE 871 STATE-5-DIGIT ZIP CODE TABULATION AREA 881 STATE-5-DIGIT ZIP CODE TABULATION AREA-COUNTY ; %let sumlevs=%str('040' '050' '060' '070' '140' '160' '871' '881'); * locations of default summary level variable names (no state needed, data are state-specific); %let input=%str( @030 county $3. @037 cousub $5. @046 place $5. @055 tract $6. @172 zcta $5. ); * specify VARIABLE NAMES (in quotes, comma-separated) to match above names default selection is COUNTY, COUNTY SUBDIVISION, PLACE, CENSUS TRACT, ZCTA (ZIP) add names if additional summary levels are chosen ; %let vars=%str('county','cousub','place','tract', 'zcta'); * specify FILES ... one data set per file (default selection is all 47 files); %let fns=%str(1:47); * if data and data set locations follow a recommended pattern, for example using data from Delaware ... e:\census_2010\delaware\data\<files> e:\census_2010\delaware\datasets\<data sets> there is no need to change anything below unless *******************************************************************************************;
- Read the geographic information file (for Delaware, that is degeo2010.sf1) and create data set GEO. Variables in that data set are later added to information in the DATA files using a hash data step object.
* location of the GEO file; filename st "&dd.:\census_2010\&state.\data\&state2.geo2010.sf1"; * create data set GEO; data geo; infile st lrecl=500 truncover; input @009 sumlev $3. @012 geocomp $2. @019 logrecno 7. &input ; if geocomp eq '00'; * choose summary levels in macro variable (above) COMMENT next statement to select all summary levels choosing all summary levels will require adding geographic areas to the INPUT statement ; if sumlev in (&sumlevs); drop geocomp; run;
- Read the packing list file (for Delaware, that is de2010.sf1.prd.packinglist.txt) and create data set RULES. Variables in that data set are used as instructions for writing SAS code in the subsequent data setp.
* location of the file PACKINGLIST file; filename packing "&dd.:\census_2010\&state.\data\&state2.2010.sf1.prd.packinglist.txt"; * read the "rules' used to create 47 data steps ... one per SF1 file use an IF statement to select files if not all 47 files are needed ; data rules; length var $6 vartyp $4; infile packing firstobs=21 obs=351; input txt : $20.; filen = input(scan(txt,2,'|:'),best.); if filen in (&fns); var = scan(txt,1,'|'); vartyp = scan(txt,1,'123456789'); varnum = input(compress(var,,'kd'),best.); pops = input(scan(txt,3,'|:'),best.); ltr = compress(char(left(reverse(var)),1),,'ka'); keep filen vartyp varnum ltr pops; run;
- Write SAS code that reads the DATA files, adds information from the GEO data set, and creates SAS data sets. The DATA _NULL_ section writes SAS code to a temporary location. One data step is written for each file selected for conversion to a data set (the default is 47 files, so 47 data steps are written).
* destination for the 47 data steps (temporary disk space); filename census temp; *location of population files ... folder + 2-letter state abbreviation; %let loc=%str(&dd.:\census_2010\&state\data\&state2); * data step that creates the 47 data steps; data _null_; file census; do until (last.filen); set rules ; by filen notsorted; if first.filen then do; if filen not in ('40' '41') then ctype = 'binary'; else ctype = 'no'; fn = cats("&state2", filen); fnm = cats("'&loc", put(filen,z5.), "2010.sf1'"); put "data c." fn "(compress=" ctype ");" / "set geo (obs=1);" / "declare hash h(dataset:'geo');" / "h.defineKey('logrecno');" / "h.defineData('sumlev', &vars);" / "h.defineDone();" / "infile " fnm " lrecl=5000 truncover dsd end=done;" / "do until(done);" / "input (d1-d4) (: $1.) logrecno" ; end; select; when (missing(ltr) or vartyp eq : 'h') do; v1 = cats(vartyp, put(varnum,z3.), ltr, '0001'); v2 = cats(vartyp, put(varnum,z3.), ltr, put(pops,z4.)); end; otherwise do; v1 = cats(vartyp, put(varnum,z3.), ltr, '001'); v2 = cats(vartyp, put(varnum,z3.), ltr, put(pops,z3.)); end; end; put v1 '- ' v2; end; put ";" / "if h.find() eq 0 then output;" / "end;" / "stop;" / "drop d1-d4 logrecno;" / "run;" //; run;
- Use %INCLUDE to run the data steps that were written in the previous data step.
* library for census data sets; libname c "&dd.:\census_2010\&state.\datasets"; * create the data sets source2 puts statments in the LOG ; %include census / source2; filename census clear;
- VOILA !!!
You can cut/paste the SAS code from all the above sections and use the complete SAS job to convert the data files from any state into SAS data sets. If you prefer, you can download all the SAS code in one file by clicking here. The file will also contain the examples shown in the EXAMPLES section.
HOW DOES IT ALL WORK
The above SAS code works by writing SAS code (47 data steps) using a DATA _NULL_ step and then executing that code using the %INCLUDE statement. If you would like to see the SAS code that is written and executed, all you have to do is change the one line in the DATA _NULL_ step ... change FILE CENSUS; to FILE PRINT; ... and the SAS code will appear in the output window if you are SAS in display manager mode. The first of the 47 data steps looks as follows (note: you can just cut/paste the SAS code from the OUTPUT window to the EDITOR window and submit the code from there if you would like to try creating one table "manually").
data c.de1 (compress=binary ); set geo (obs=1); declare hash h(dataset:'geo'); h.defineKey('logrecno'); h.defineData('sumlev', 'county','cousub','place','tract', 'zcta'); h.defineDone(); infile 'e:\census_2010\delaware\data\de000012010.sf1' lrecl=5000 truncover dsd end=done; do until(done); input (d1-d4) (: $1.) logrecno p0010001 - p0010001 ; if h.find() eq 0 then output; end; stop; drop d1-d4 logrecno; run;
There is only one population variable written to data set DE1. The last of the 47 SAS job creates data set DE47 and that data set has 74 variables with information on housing characteristics.
data c.de47 (compress=binary ); set geo (obs=1); declare hash h(dataset:'geo'); h.defineKey('logrecno'); h.defineData('sumlev', 'county','cousub','place','tract', 'zcta'); h.defineDone(); infile 'e:\census_2010\delaware\data\de000472010.sf1' lrecl=5000 truncover dsd end=done; do until(done); input (d1-d4) (: $1.) logrecno hct0010001 - hct0010035 hct0020001 - hct0020013 hct0030001 - hct0030013 hct0040001 - hct0040013 ; if h.find() eq 0 then output; end; stop; drop d1-d4 logrecno; run;
The first portion of each data step creates a hash data step object and puts information from the GEO data set into memory. If you are not familiar with hash data step objects, a good place to start reading about them is the paper Getting Started with the DATA Step Hash Object by Jason Secosky and Janice Bloom. Remember that the GEO data set contains observations for only selected summary levels. Once the GEO data set is in memory, a DATA file is read and only those observations with a LOGRECNO matching a LOGRECNO in the GEO data set are written to the data set.
Now that the census data are in data sets, it is easy to use.
title 'DELAWARE: 2010 COUNTY POPULATIONS'; proc print data=c.de1 noobs; where sumlev eq '050'; var county; sum p0010001; format p001: comma7.; run;
DELAWARE: 2010 COUNTY POPULATIONS county p0010001 001 162,310 003 538,479 005 197,145 ======== 897,934
There are only three counties in Delaware and 60% of the population in Delaware resides in county '003'. Those are FIPS (Federal Information Processing Standards) county numbers. The county numbers and names are available in the data set ZIPCODE in the SASHELP library.
* get county numbers and names from SASHELP.ZIPCODE; proc sql; create table fmt as select distinct put(county,z3.) as start, countynm as label, "$cou2nam" as fmtname from sashelp.zipcode where statecode eq 'DE'; quit; * create a format to convert county numbers to names; proc format cntlin=fmt; run; title 'DELAWARE: 2010 COUNTY POPULATIONS'; proc print data=c.de1 label noobs; where sumlev eq '050'; var county; sum p0010001; label p0010001='POPULATION' county='COUNTY' ; format p001: comma7. county $cou2nam.; run;
DELAWARE: 2010 COUNTY POPULATIONS COUNTY POPULATION Kent 162,310 New Castle 538,479 Sussex 197,145 ========== 897,934
Now you can see that 60% of the population lives in New Castle county. To look at zip-specific populations choose another summary level. Summary level '871' shows both county and zip (some zips are in multiple counties). There is no county if you use summary level '881' (another zip-specific summary level).
* get county numbers and names / zip codes and city names from SASHELP.ZIPCODE; proc sql; create table cfmt as select distinct put(county,z3.) as start, countynm as label, "$cou2nam" as fmtname from sashelp.zipcode where statecode eq 'DE'; create table zfmt as select distinct put(zip,z5.) as start, city as label, "$zip2nam" as fmtname from sashelp.zipcode; quit; * create formats to convert county numbers to names and zips to cities; proc format cntlin=cfmt; run; proc format cntlin=zfmt; run; title 'DELAWARE: 2010 ZIP POPULATIONS (SOME ZIPS ARE IN MULTIPLE COUNTIES)'; proc report data=c.de1 nowd; where sumlev eq '881'; columns county zcta zcta=temp p0010001; define county / 'COUNTY' format=$cou2nam.; define zcta / 'ZIP'; define temp / 'CITY' format=$zip2nam.; define p0010001 / 'POPULATION' format=comma10.; run;
Only a portion of the output is shown.
DELAWARE: 2010 ZIP POPULATIONS (SOME ZIPS ARE IN MULTIPLE COUNTIES) COUNTY ZIP CITY POPULATION New Castle 19701 Bear 39,194 New Castle 19702 Newark 51,899 New Castle 19703 Claymont 14,471 New Castle 19706 Delaware City 1,822 New Castle 19707 Hockessin 16,483 New Castle 19709 Middletown 35,107 New Castle 19710 Montchanin 41 New Castle 19711 Newark 51,322 New Castle 19713 Newark 30,408 New Castle 19716 Newark 1,568 New Castle 19717 Newark 3,770 New Castle 19720 New Castle 59,250 New Castle 19730 Odessa 451 New Castle 19731 Port Penn 252 New Castle 19732 Rockland 25 New Castle 19733 Saint Georges 110 New Castle 19734 Townsend 11,651 New Castle 19735 Winterthur 12 New Castle 19736 Yorklyn 32 New Castle 19801 Wilmington 16,286 New Castle 19802 Wilmington 24,621 New Castle 19803 Wilmington 21,364 New Castle 19804 Wilmington 17,944 New Castle 19805 Wilmington 40,937 New Castle 19806 Wilmington 9,560 New Castle 19807 Wilmington 7,405 New Castle 19808 Wilmington 38,442 New Castle 19809 Wilmington 14,049 New Castle 19810 Wilmington 25,011 Kent 19901 Dover 35,055 Kent 19902 Dover Air Force Base 283 Kent 19904 Dover 34,132 Sussex 19930 Bethany Beach 2,768 Sussex 19931 Bethel 211 Sussex 19933 Bridgeville 8,535