Downloading American Community Survey 5 year data into a workable SAS format

From sasCommunity
Jump to: navigation, search

How to Download American Community Survey 5 year data into a workable SAS format.

The American Community Survey (ACS) has information on demographics for every census tract in the US. Every question of the ACS corresponds to a table and every table can be downloaded either by state or for every state in the US. The tables can be broken up into different census subdivisions such as state, county, tract and block group. This article provides instructions and SAS code to download all the tables for every census tract in every state in the US, using the 2006-2010 5-year files. For other 5-year periods, the Census Bureau used slightly different file naming conventions and file compression types that will require modifications to the code which follows.

The first step is to download the estimate, margin of error, and geography files from the ACS ftp site (3 GB) to your preferred folder. These can be accessed through the following links: [1]

[2]

or use the following SAS code:

  dm 'wbrowse "dm 'wbrowse "http://www2.census.gov/acs2010_5yr/summaryfile/2006-2010_ACSSF_All_In_2_Giant_Files(Experienced-Users-Only)/Tracts_Block_Groups_Only.zip"'; 
  dm 'wbrowse "dm 'wbrowse "http://www2.census.gov/acs2010_5yr/summaryfile/2006-2010_ACSSF_All_In_2_Giant_Files(Experienced-Users-Only)/2010_ACS_Geography_Files.zip"';

run;

The first file requires the .zip extension to be changed to .tar.gz before unzipping. Both files can then be unzipped using any unzipping program, such as [3]. For more information on this step see the file Supplemental_Instructions_for_Using_2_Giant_Files.pdf located in the same path as the data files.

Next, download and save the “Sequence Number and Table Number Lookup” table for reference from [4]. There is also a SAS version of this file located in the same path.

This table (referred to in this article as the lookup table) provides the sequence number that corresponds to each ACS table. It also gives the start position of the variables and the total cells in the table; you need to find the starting position and the number of cells from the lookup table for the specific table you need. For example, sequence number 46 covers tables B17001H, B17001I, B17003, B17004, B17005, B17006, and C17002. There are a total of 220 variables in these tables combined. The start position of table B17001I is 66 and there are 59 cells in the table.

You can then run the SAS program (below) to download a specific table from the ACS) using the sequence number and positions from the lookup table referenced above. The program first appends all the estimate files for the specific table for each state together then does the same for the margin of error files and the geographic files. The geographic files do not have specific sequence numbers associated with them, so these are appended identically for any table. In the code below, let sn= the sequence number of the table you want; let startpos=the position of the first variable for the table you want (located in the lookup table in the first row for a specific sequence number); let endpos= the position for the last variable for the table you want (calculating by adding the Total Cells to the Start Position minus 1); and let last= the last variable position for the sequence number you want (from the sequence number in the lookup table).

For this article, I used table C17002 (Income to Poverty Ratio) which corresponds to sequence number 46 and positions 219-226 (there are 220 actual estimates and margins of error for that sequence number and 6 general variables to indicate table, logical record number, etc). Position 226 is the last variable for that sequence number.

At the end, the program merges the three datasets together and eliminates extraneous variables.

 
 
 libname acs "C:\acs "; *this is where you will put your permanent datasets;
 %let sn=046; *this is the sequence number of the table you want;
 %let startpos=219; *this is the start position of the variables you want within the sequence number from the lookup table;
 %let endpos=226; *this is the end position of the variables you want within the sequence number from the lookup table;
 %let last=226; *this is the last variable position in the sequence number you have chosen from the lookup table;
 
 
 
 *create a dataset with the state abbreviations and Fips codes;
 
 proc sort data=sashelp.zipcode nodupkey out=state;
 by state statecode;
 run;
 
 data state;
 set state;
 keep state statecode;
 if state le 56;
 run;
 
 
 
 * import datasets for each state and for estimate (est) and margin of error(moe), and append states together;
 
 %macro num;	
 
 PROC IMPORT DATAFILE= " C:\acs\e20095&statem.0&sn.000.txt " OUT= work.est&j DBMS=CSV REPLACE ;getnames=no;DATAROW=1;
 RUN;
 proc datasets nolist force;
 append out=est1 data=est&j;
 quit;
 run;
 
 PROC IMPORT DATAFILE= " C:\acs\m20095&statem.0&sn.000.txt " OUT= work.moe&j DBMS=CSV REPLACE; getnames=no; DATAROW=1; 	
 RUN;
 proc datasets nolist force;
 append out=moe1 data=moe&j;
 quit;
 run;
 %mend;
 
 
 %macro states;
 %do j=1 %to 56;
 data _null_;
 set state;
 if _n_=&j;
 call symputx('statem', statecode);
 %num;
 run;
 %end;
 %mend;
 
 %states;
 
 
 
 *make variables consistent in order to merge estimates (est) and margin of error (moe) files ;
 
 data acs.aktotalest;
 set est1;
 state=upcase(var3);
 rename var6=logrecnum;
 run;
 
 
 data acs.aktotalmoe;
 set moe1;
 state=upcase(var3);
 rename var6=logrecnum;
 run;
 
 
 proc sort data=acs.aktotalmoe noduprecs out=acstotalmoe;
 by state var5 logrecnum;
 run;
 
 
 proc sort data=acs.aktotalest noduprecs out=acstotalest;
 by state var5 logrecnum;
 run; 
 
 
 %macro rename;
 proc datasets library=work;
 modify acstotalmoe;
 rename var2=mvar2 ;
 rename %do i=7 %to &last;
 var&i=mvar&i.
 %end;
 ;
 run; quit;
 %mend;
 
 
 %rename;		   
 proc sort data=acstotalest;
 by state logrecnum;
 run;
 
 proc sort data=acstotalmoe;
 by state logrecnum;
 run;
 
 data estmoe;
 merge acstotalest (in=a) acstotalmoe (in=c);
 by state logrecnum;
 run;
 
 
 
 *import the geographic files and append them together.  This would be the same for any table you download for census tract information.;
 
 %macro geography;
 data	geo;
	infile "C:\acs\g20095&statem..txt" delimiter = ' ' MISSOVER DSD lrecl=32767 ;
	input 
	@7 state $2.
	@9 sumlev $3.
	@12 geocom $2.
	@14 logrecnum1 $7.
	@26 state_fips $2.
	@28 county $3.
	@36 place $5.
	@41 tract $6.
	@47 blockgroup $1.
	;
	run;
 
 data geo&j;
	set geo;
	if tract='' then delete;
	if place='';
	if blockgroup='';
	run;
 
 proc append base=geo1 data=geo&j force;
 quit;
 run;
 
 proc sort data=geo1;
 by state county tract;
 run;
 %mend;
 
 %macro geog1;
 %do j=1 %to 56;
 data _null_;
 set state;
 if _n_=&j;
 call symputx('statem', statecode);
 %geography;
 run;
 %end;
 %mend;
 
 %geog1;
 
 
 proc sort data=geo1;
 by state county tract;
 run;
 
 data acsgeototal (drop=logrecnum1);
 set geo1;
 by state county tract;
 if first.tract;
 logrecnum=logrecnum1*1;
 run;
 
 
 
 *Merge geographic files and estimate and margin of error files;
 
 proc sort data=estmoe;
 by state logrecnum;
 run;
 
 
 proc sort data=acsgeototal;
 by state logrecnum;
 run;
 
 
 data estmoegeo;
 merge acsgeototal (in=a) estmoe (in=b);
 by state logrecnum;
 if a;
 run;
 
 
 proc sort data=estmoegeo;
 by state county tract;
 run;
 
 
 
 
 *Restrict the dataset to just the needed variables;
 
 data estmoegeo1 (keep=state_fips county tract logrecnum1 var&startpos-var&endpos mvar&startpos-mvar&endpos);
 set estmoegeo;
 by state county tract;
 if first.tract;
 run;
 
 
 *this is the final dataset for the ACS data Income to Poverty Ratio table.  Variable names will change for different tables that are selected;
 
 PROC DATASETS LIBRARY = WORK;
 MODIFY estmoegeo1;
 RENAME var219= total_pop var220 = _under50
 
 var221 = _50to99
 var222 = _100to124
 var223 = _125to149
 var224 = _150to184
 var225 = _185to199
 var226 = _200over
 mvar219 = moe_total_pop
 mvar220 = moe_000to050
 mvar221 = moe_50to99
 mvar222 = moe_100to124
 mvar223 = moe_125to149
 mvar224 = moe_150to184
 mvar225 = moe_185to199
 mvar226 = moe_200over
 ;
 LABEL total_pop=estimate total pop
 _under50 = under 50%
 _50to99 = 50% to 99%
 _100to124 = 100% to 124%
 _125to149 = 125% to 149%
 _150to184 = 150% to 184%
 _185to199 = 185% to 199%
 _200over = over 200%
 moe_total_pop= margin of error total pop
 moe_000to050 = margin of error under 50%
 moe_50to99 = margin of error 50% to 99%
 moe_100to124 = margin of error 100% to 124%
 moe_125to149 = margin of error 125% to 149%
 moe_150to184 = margin of error 150% to 184%
 moe_185to199 = margin of error 185% to 199%
 moe_200over = margin of error over 200%
 ;
 QUIT; RUN;
 
 
 Data acs.acsinctopovratio1;
 Set estmoegeo1;
 geoid=cat(state_fips, county, tract);
 Run;
 
 
 *this should say zero cases deleted;
 
 proc sort data=acs.acsinctopovratio1 nodupkey out=checkacs;
 by geoid;
 run;