As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


Difference between revisions of "Converting 2010 Census Summary File 1 (SF1) Data into SAS Data Sets"

From sasCommunity
Jump to: navigation, search
m (DETAILS)
(add link to entire SAS job, show SAS code that is written by the DATA _NULL_ step)
Line 9: Line 9:
 
The SAS code converts the SF1 DATA files into SAS data sets.  One data set is created for each of the 47 DATA files.  The tables found within each data set are the same as those found within the data files and the variable names used in the data sets are the same as those found in the Technical Documentation.  For example, file #1 for any state produces a data set named FILE1 and that data set contains one population variable named P0010001 (the same variable name one finds on page 6-21 of the Technical Documentation).  Assume that you are using SF1 data from the state of Delaware.
 
The SAS code converts the SF1 DATA files into SAS data sets.  One data set is created for each of the 47 DATA files.  The tables found within each data set are the same as those found within the data files and the variable names used in the data sets are the same as those found in the Technical Documentation.  For example, file #1 for any state produces a data set named FILE1 and that data set contains one population variable named P0010001 (the same variable name one finds on page 6-21 of the Technical Documentation).  Assume that you are using SF1 data from the state of Delaware.
  
1/ There is an assumed directory (folder) structure for the location of both the data and the data sets ...
+
There is an assumed directory (folder) structure for the location of both the data and the data sets ...
  
 
data:  e:\census_2010\delaware\data\<files>
 
data:  e:\census_2010\delaware\data\<files>
Line 16: Line 16:
 
There is a macro variable (&DD) that allows you to specify a disk, but the rest of the directory structure is hard-coded (there is no reason that it cannot be changed if you would rather use a different structure).  Macro variables also allow you to change the state name and the two-letter state code.  The state code is a portion of the name of each SF1 data file (for example, de000012010.sf1 is file #1 for Delaware) and that file produces data set DE1).
 
There is a macro variable (&DD) that allows you to specify a disk, but the rest of the directory structure is hard-coded (there is no reason that it cannot be changed if you would rather use a different structure).  Macro variables also allow you to change the state name and the two-letter state code.  The state code is a portion of the name of each SF1 data file (for example, de000012010.sf1 is file #1 for Delaware) and that file produces data set DE1).
  
2/  There are many summary levels within the SF1 data.  Summary levels correspond to geographic areas (referred to as "geographic elements" in the Technical Documentation).  A detailed description of summary levels is available starting on page 4-1 of the Technical Documentation.  The default list of summary levels selected by the SAS code are as follows ...
+
There are many summary levels within the SF1 data.  Summary levels correspond to geographic areas (referred to as "geographic elements" in the Technical Documentation).  A detailed description of summary levels is available starting on page 4-1 of the Technical Documentation.  The default list of summary levels selected by the SAS code are as follows ...
  
 
<pre>
 
<pre>
Line 31: Line 31:
 
The list can be changed by by editing the value of macro variable &SUMLEVS.
 
The list can be changed by by editing the value of macro variable &SUMLEVS.
  
3/  The default list of summary level values is associated with variables that are added to the data sets.  There is default set of names used for the geographic areas represented by the summary levels.  Notice that though there is list of eight summary levels, there are only five summary level variables: there is no STATE variable since all data are state-specific; there are two summary levels for PLACE (070 and 160) and only one variable is needed for the PLACE value;  there are two summary levels for ZCTA (871 and 881) and only one variable is needed for the ZCTA value
+
The default list of summary level values is associated with variables that are added to the data sets.  There is default set of names used for the geographic areas represented by the summary levels.  Notice that though there is list of eight summary levels, there are only five summary level variables: there is no STATE variable since all data are state-specific; there are two summary levels for PLACE (070 and 160) and only one variable is needed for the PLACE value;  there are two summary levels for ZCTA (871 and 881) and only one variable is needed for the ZCTA value
  
4/  The list of files converted to data sets can be changed by changing the value of the macro variable &FNS.  The default is all 47 files (with a range expressed as 1:47).  If not all tables are desired, change the
+
The list of files converted to data sets can be changed by changing the value of the macro variable &FNS.  The default is all 47 files (with a range expressed as 1:47).  If not all tables are desired, change the
value of the macro variable.  For example, if one wanted to convert file 1 through 5 and files 42 through 47, the macro variable value would be 1:5 42:47 .
+
value of the macro variable.  For example, if one wanted to convert file 1 through 5 and files 42 through 47, the macro variable value would be 1:5 42:47.
  
 
==SAS CODE==
 
==SAS CODE==
The following SAS code works as follows ...
 
 
 
1/  Set parameters with macro variables.
 
1/  Set parameters with macro variables.
  
Line 216: Line 214:
  
 
6/  VOILA !!!
 
6/  VOILA !!!
 +
You can cut/paste the SAS code from all the above sections and use the complete SAS job to convert the data files from any state into SAS data sets.  If you prefer, you can download all the SAS code in one file by [http://www.albany.edu/faculty/msz03/census_2010/sf1_data_2_data_sets.zip clicking here].  The file will also contain the examples shown in the next section.
 +
 +
The above SAS code works by writing SAS code (47 data steps) using a DATA _NULL_ step and then executing that code using the %INCLUDE statement.  If you would like to see the SAS code that is written and executed, all you have to do is change the one line in the DATA _NULL_ step ... change FILE CENSUS; to FILE PRINT; ... and the SAS code will appear in the output window if you are SAS in display manager mode.  The first of the 47 data steps looks as follows.
 +
 +
<source lang="sas">
 +
data c.de1 (compress=binary );
 +
set geo (obs=1);
 +
declare hash h(dataset:'geo');
 +
h.defineKey('logrecno');
 +
h.defineData('sumlev', 'county','cousub','place','tract', 'zcta');
 +
h.defineDone();
 +
infile 'e:\census_2010\delaware\data\de000012010.sf1'  lrecl=5000 truncover dsd end=done;
 +
do until(done);
 +
input (d1-d4) (: $1.) logrecno
 +
p0010001 - p0010001
 +
;
 +
if h.find() eq 0 then output;
 +
end;
 +
stop;
 +
drop d1-d4 logrecno;
 +
run;
 +
</source>
 +
 +
There is only one population variable written to data set DE1.  The last of the 47 SAS job creates data set DE47 and that data set 74 variables with data on housing characteristics.
 +
 +
<source lang="sas">
 +
data c.de47 (compress=binary );
 +
set geo (obs=1);
 +
declare hash h(dataset:'geo');
 +
h.defineKey('logrecno');
 +
h.defineData('sumlev', 'county','cousub','place','tract', 'zcta');
 +
h.defineDone();
 +
infile 'e:\census_2010\delaware\data\de000472010.sf1'  lrecl=5000 truncover dsd end=done;
 +
do until(done);
 +
input (d1-d4) (: $1.) logrecno
 +
hct0010001 - hct0010035
 +
hct0020001 - hct0020013
 +
hct0030001 - hct0030013
 +
hct0040001 - hct0040013
 +
;
 +
if h.find() eq 0 then output;
 +
end;
 +
stop;
 +
drop d1-d4 logrecno;
 +
run;
 +
</source>
  
 
==EXAMPLE==
 
==EXAMPLE==
Now that the data are in data sets, it is easy to use.
+
Now that the census data are in data sets, it is easy to use.
  
 
<source lang="sas">
 
<source lang="sas">
Line 242: Line 286:
 
</pre>
 
</pre>
  
There are only three counties in Delaware and 60% of the population in Delaware resides in county '003'.  Those are FIPS (Federal Information Processing Standards) county numbers.  The county numbers and names are available in data set in the SASHELP library.
+
There are only three counties in Delaware and 60% of the population in Delaware resides in county '003'.  Those are FIPS (Federal Information Processing Standards) county numbers.  The county numbers and names are available in the data set ZIPCODE in the SASHELP library.
  
 
<source lang="sas">
 
<source lang="sas">
Line 281: Line 325:
 
               897,934
 
               897,934
 
</pre>
 
</pre>
 +
 +
Now you can see that 60% of the population lives in New Castle county.
  
 
==QUESTIONS/COMMENT==
 
==QUESTIONS/COMMENT==
 
If you have any questions about this posting, you are welcome to send me a note by  [[Special:Emailuser/Msz03| clicking here ... email Mike]].  To see my other SAS Community postings [http://www.sascommunity.org/wiki/User:Msz03 click here].
 
If you have any questions about this posting, you are welcome to send me a note by  [[Special:Emailuser/Msz03| clicking here ... email Mike]].  To see my other SAS Community postings [http://www.sascommunity.org/wiki/User:Msz03 click here].

Revision as of 22:40, 3 July 2011

INTRODUCTION

According to United States Census Bureau ... "Summary File 1 (SF1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied)." The detailed explanation as to what information is available in the SF1 files can be found in the SF1 Technical Documentation (the description of the SF1 contents is taken from that publication).

SF1 data are being released a state-at-a-time and a list of currently available data can be found at a 2010 census news web site. There is a link on the news site to the data file, but you can go directly to the SF1 data site by clicking here. That web site will display the available states. SF1 data for each state are available in a single ZIP file. That ZIP file contains 49 files comprising: a packing list (PKG); a file with geographic information (GEO); 47 files with data on population and housing (DATA). The DATA files do not contain any geographic information and each DATA file must be linked to the GEO file to assign the data to a county, place, census tract, block group, etc.

The 47 DATA files contain one or more sets of information. The "sets of information" are referred to in the Technical Documentation as tables. For example, one DATA file (file #1 in each state) contains only one table and that table has only one variable, the total population of geographic areas. Another file (file #20 in each state) with only one table contains the information on the white-only population by gender and age (with 103 age group-specific populations for each gender) down to the census tract level. Many files contain multiple tables. There are a total of 331 tables in the 47 tables and a list of the tables within each file can found in the PKG file.

DETAILS

The SAS code converts the SF1 DATA files into SAS data sets. One data set is created for each of the 47 DATA files. The tables found within each data set are the same as those found within the data files and the variable names used in the data sets are the same as those found in the Technical Documentation. For example, file #1 for any state produces a data set named FILE1 and that data set contains one population variable named P0010001 (the same variable name one finds on page 6-21 of the Technical Documentation). Assume that you are using SF1 data from the state of Delaware.

There is an assumed directory (folder) structure for the location of both the data and the data sets ...

data: e:\census_2010\delaware\data\<files> data sets: e:\census_2010\delaware\data\

There is a macro variable (&DD) that allows you to specify a disk, but the rest of the directory structure is hard-coded (there is no reason that it cannot be changed if you would rather use a different structure). Macro variables also allow you to change the state name and the two-letter state code. The state code is a portion of the name of each SF1 data file (for example, de000012010.sf1 is file #1 for Delaware) and that file produces data set DE1).

There are many summary levels within the SF1 data. Summary levels correspond to geographic areas (referred to as "geographic elements" in the Technical Documentation). A detailed description of summary levels is available starting on page 4-1 of the Technical Documentation. The default list of summary levels selected by the SAS code are as follows ...

040 STATE
050 STATE-COUNTY
060 STATE-COUNTY-COUNTY SUBDIVISION
070 STATE-COUNTY-COUNTY SUBDIVISION-PLACE/REMAINDER
140 STATE-COUNTY-CENSUS TRACT
160 STATE-PLACE
871 STATE-5-DIGIT ZIP CODE TABULATION AREA (ZCTA)
881 STATE-5-DIGIT ZIP CODE TABULATION AREA-COUNTY (ZCTA)

The list can be changed by by editing the value of macro variable &SUMLEVS.

The default list of summary level values is associated with variables that are added to the data sets. There is default set of names used for the geographic areas represented by the summary levels. Notice that though there is list of eight summary levels, there are only five summary level variables: there is no STATE variable since all data are state-specific; there are two summary levels for PLACE (070 and 160) and only one variable is needed for the PLACE value; there are two summary levels for ZCTA (871 and 881) and only one variable is needed for the ZCTA value

The list of files converted to data sets can be changed by changing the value of the macro variable &FNS. The default is all 47 files (with a range expressed as 1:47). If not all tables are desired, change the value of the macro variable. For example, if one wanted to convert file 1 through 5 and files 42 through 47, the macro variable value would be 1:5 42:47.

SAS CODE

1/ Set parameters with macro variables.

* disk drive for all file;
%let dd=e;
 
* state name (folder) and two-letter state abreviation (in all census file names for given state);
%let state=delaware;
%let state2=de;
 
*
specify SUMMARY LEVELS that will be in the data sets
or comment the IF statement in first data step to select all summary levels
default selection is ...
040 STATE
050 STATE-COUNTY
060 STATE-COUNTY-COUNTY SUBDIVISION
070 STATE-COUNTY-COUNTY SUBDIVISION-PLACE/REMAINDER
140 STATE-COUNTY-CENSUS TRACT
160 STATE-PLACE
871 STATE-5-DIGIT ZIP CODE TABULATION AREA
881 STATE-5-DIGIT ZIP CODE TABULATION AREA-COUNTY
;
%let sumlevs=%str('040' '050' '060' '070' '140' '160' '871' '881');
 
* locations of default summary level variable names (no state needed, data are state-specific);
%let input=%str(
@030 county    $3.
@037 cousub    $5.
@046 place     $5.
@055 tract     $6.
@172 zcta      $5.
);
 
*
specify VARIABLE NAMES  (in quotes, comma-separated) to match above names
default selection is COUNTY, COUNTY SUBDIVISION, PLACE, CENSUS TRACT, ZCTA (ZIP)
add names if additional summary levels are chosen
;
%let vars=%str('county','cousub','place','tract', 'zcta');
 
* specify FILES ... one data set per file (default selection is all 47 files);
%let fns=%str(1:47);
 
*
if data and data set locations follow a recommended pattern, for example using data from Delaware ...
 
e:\census_2010\delaware\data\<files>
e:\census_2010\delaware\datasets\<data sets>
 
there is no need to change anything below unless
 
*******************************************************************************************;

2/ Read the geographic information file (for Delaware, that is degeo2010.sf1) and create data set GEO. Variables in that data set are later added to information in the DATA files using a hash data step object.

* location of the GEO file;
filename st "&dd.:\census_2010\&state.\data\&state2.geo2010.sf1";
 
* create data set GEO;
data geo;
infile st lrecl=500 truncover;
input
@009 sumlev    $3.
@012 geocomp   $2.
@019 logrecno   7.
&input
;
if geocomp eq '00';
 
*
choose summary levels in macro variable (above)
COMMENT next statement to select all summary levels
choosing all summary levels will require adding
geographic areas to the INPUT statement
;
if sumlev in (&sumlevs);
 
drop geocomp;
run;

3/ Read the packing list file (for Delaware, that is de2010.sf1.prd.packinglist.txt) and create data set RULES. Variables in that data set are used as instructions for writing SAS code in the subsequent data setp.

* location of the file PACKINGLIST file;
filename packing "&dd.:\census_2010\&state.\data\&state2.2010.sf1.prd.packinglist.txt";
 
*
read the "rules' used to create 47 data steps ... one per SF1 file
use an IF statement to select files if not all 47 files are needed
;
data rules;
length var $6 vartyp $4;
infile packing firstobs=21 obs=351;
input txt : $20.;
filen  = input(scan(txt,2,'|:'),best.);
 
if filen in (&fns);
 
var    = scan(txt,1,'|');
varnum = input(compress(var,,'kd'),best.);
vartyp = substr(var,1,findc(var,,'d') - 1);
pops   = input(scan(txt,3,'|:'),best.);
if findc(char(right(var),6),,'a') then ltr = char(right(var),6);
keep filen vartyp varnum ltr pops;
run;

4/ Write SAS code that reads the DATA files, adds information from the GEO data set, and creates SAS data sets. The DATA _NULL_ section writes SAS code to a temporary location. One data step is written for each file selected for conversion to a data set (the default is 47 files, so 47 data steps are written).

* destination for the 47 data steps (temporary disk space);
filename census temp;
 
*location of population files ... folder + 2-letter state abbreviation;
%let loc=%str(&dd.:\census_2010\&state\data\&state2);
 
* data step that creates the 47 data steps;
data _null_;
file census;
do until (last.filen);
   set rules ;
   by filen notsorted;
   if first.filen then do;
      if filen not in ('40' '41') then ctype = 'binary';
	  else ctype = 'no';
      fn = cats("&state2", filen);
      fnm = cats("'&loc", put(filen,z5.), "2010.sf1'");
  put "data c." fn "(compress=" ctype ");" /
	      "set geo (obs=1);" /
          "declare hash h(dataset:'geo');" /
          "h.defineKey('logrecno');" /
          "h.defineData('sumlev', &vars);" /
          "h.defineDone();" /
          "infile " fnm " lrecl=5000 truncover dsd end=done;" /
		  "do until(done);" /
          "input (d1-d4) (: $1.) logrecno" ;
   end;
   select;
      when (missing(ltr) or
            vartyp eq : 'h') do; v1 = cats(vartyp, put(varnum,z3.), ltr, '0001');
                                 v2 = cats(vartyp, put(varnum,z3.), ltr, put(pops,z4.));
                             end;
      otherwise              do; v1 = cats(vartyp, put(varnum,z3.), ltr, '001');
                                 v2 = cats(vartyp, put(varnum,z3.), ltr, put(pops,z3.));
                                 end;
   end;
   put  v1 '- ' v2;
end;
put ";" /
    "if h.find() eq 0 then output;" /
	"end;" /
	"stop;" /
    "drop d1-d4 logrecno;" /
    "run;" //;
run;

5/ Use %INCLUDE to run the data steps that were written in the previous data step.

* library for census data sets';
libname c "&dd.:\census_2010\&state.\datasets";
 
*
create the data sets
source2 puts statments in the LOG
;
%include census / source2;
 
filename census clear;

6/ VOILA !!! You can cut/paste the SAS code from all the above sections and use the complete SAS job to convert the data files from any state into SAS data sets. If you prefer, you can download all the SAS code in one file by clicking here. The file will also contain the examples shown in the next section.

The above SAS code works by writing SAS code (47 data steps) using a DATA _NULL_ step and then executing that code using the %INCLUDE statement. If you would like to see the SAS code that is written and executed, all you have to do is change the one line in the DATA _NULL_ step ... change FILE CENSUS; to FILE PRINT; ... and the SAS code will appear in the output window if you are SAS in display manager mode. The first of the 47 data steps looks as follows.

data c.de1 (compress=binary );
set geo (obs=1);
declare hash h(dataset:'geo');
h.defineKey('logrecno');
h.defineData('sumlev', 'county','cousub','place','tract', 'zcta');
h.defineDone();
infile 'e:\census_2010\delaware\data\de000012010.sf1'  lrecl=5000 truncover dsd end=done;
do until(done);
input (d1-d4) (: $1.) logrecno
p0010001 - p0010001
;
if h.find() eq 0 then output;
end;
stop;
drop d1-d4 logrecno;
run;

There is only one population variable written to data set DE1. The last of the 47 SAS job creates data set DE47 and that data set 74 variables with data on housing characteristics.

data c.de47 (compress=binary );
set geo (obs=1);
declare hash h(dataset:'geo');
h.defineKey('logrecno');
h.defineData('sumlev', 'county','cousub','place','tract', 'zcta');
h.defineDone();
infile 'e:\census_2010\delaware\data\de000472010.sf1'  lrecl=5000 truncover dsd end=done;
do until(done);
input (d1-d4) (: $1.) logrecno
hct0010001 - hct0010035
hct0020001 - hct0020013
hct0030001 - hct0030013
hct0040001 - hct0040013
;
if h.find() eq 0 then output;
end;
stop;
drop d1-d4 logrecno;
run;

EXAMPLE

Now that the census data are in data sets, it is easy to use.

title 'DELAWARE:  2010 COUNTY POPULATIONS';
proc print data=c.de1 noobs;
where sumlev eq '050';
var county;
sum p0010001;
format p001: comma7.;
run;
DELAWARE:  2010 COUNTY POPULATIONS

county    p0010001

 001      162,310
 003      538,479
 005      197,145
          ========
          897,934

There are only three counties in Delaware and 60% of the population in Delaware resides in county '003'. Those are FIPS (Federal Information Processing Standards) county numbers. The county numbers and names are available in the data set ZIPCODE in the SASHELP library.

* get county numbers and names from SASHELP.ZIPCODE;
proc sql;
create table fmt as
select distinct put(county,z3.) as start, countynm as label, "$cou2nam" as fmtname
from sashelp.zipcode
where statecode eq 'DE';
quit;
 
* create a format to convert county numbers to names;
proc format cntlin=fmt;
run;
 
itle 'DELAWARE:  2010 COUNTY POPULATIONS';
proc print data=c.de1 label noobs;
where sumlev eq '050';
var county;
sum p0010001;
label 
p0010001='POPULATION'
county='COUNTY'
;
format p001: comma7. county $cou2nam.;
run;
DELAWARE:  2010 COUNTY POPULATIONS

COUNTY        POPULATION

Kent           162,310
New Castle     538,479
Sussex         197,145
              ==========
               897,934

Now you can see that 60% of the population lives in New Castle county.

QUESTIONS/COMMENT

If you have any questions about this posting, you are welcome to send me a note by clicking here ... email Mike. To see my other SAS Community postings click here.