Planning Methods: working with Zip Business Patterns data

In this lab, you will download data on job by industry from the Zip Business Patterns data. You can find the data on the US Census website. Be sure to download the data as a CSV file. For now, download the full 2002 and 2012 data.  The links are:2002 & 2012

Be sure to read the descriptions of the data and data collection methods on the website. The FAQ may also prove useful.

The objectives of the lab are to learn to work with ZBP data using the midpoint method and to learn a few data processing commands in R. Specifically, we will take subsets of data and do a simple data merge. Although this may seem a little complicated to do some simple file management and calculations, the pay offs are also big. By the end of the lab, you should be able to create mid-point methods of jobs for all US Census Tracts by any available NAICS code or subsets of NAICS codes with just a few changes to your code.

To load the data:
1) Set your working directory to wherever you saved your files. Note that the data are stored a bit differently across the two

2) Run the following two lines of code:

zbp02 <- read.csv(“zbp02detail.txt”)
zbp12 <- read.csv(“zbp12detail.txt”)

This will take a moment to load, if you’re working on a home computer. To make sure that your data loaded properly check and have the same dimensions as below,

dim(zbp02)
## [1] 3251827      12
dim(zbp12)
## [1] 3179204      12

look the same,

head(zbp02)
##   ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499 N500_999
## 1 501 ------  11    9    1      0      1      0        0        0        0
## 2 501 23----   1    1    0      0      0      0        0        0        0
## 3 501 235310   1    1    0      0      0      0        0        0        0
## 4 501 42----   3    3    0      0      0      0        0        0        0
## 5 501 421990   1    1    0      0      0      0        0        0        0
## 6 501 422920   1    1    0      0      0      0        0        0        0
##   N1000
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
head(zbp12)
##    zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1  501 ------   2    2    0      0      0      0        0        0
## 2  501 81----   2    2    0      0      0      0        0        0
## 3  501 813110   2    2    0      0      0      0        0        0
## 4 1001 ------ 453  206   77     75     55     22       14        3
## 5 1001 22----   2    1    0      0      1      0        0        0
## 6 1001 221310   1    1    0      0      0      0        0        0
##   n500_999 n1000
## 1        0     0
## 2        0     0
## 3        0     0
## 4        1     0
## 5        0     0
## 6        0     0

and have the same column types. Most importantly, only NAICS should be a character or factor.

str(zbp02)
## 'data.frame':    3251827 obs. of  12 variables:
##  $ ZIP     : int  501 501 501 501 501 501 501 501 501 501 ...
##  $ NAICS   : Factor w/ 1104 levels "------","11----",..: 1 59 73 562 599 627 631 632 655 859 ...
##  $ EST     : int  11 1 1 3 1 1 1 1 1 2 ...
##  $ N1_4    : int  9 1 1 3 1 1 1 1 1 2 ...
##  $ N5_9    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N10_19  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N20_49  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N50_99  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N100_249: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N250_499: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N500_999: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
str(zbp12)
## 'data.frame':    3179204 obs. of  12 variables:
##  $ zip     : int  501 501 501 1001 1001 1001 1001 1001 1001 1001 ...
##  $ naics   : Factor w/ 999 levels "------","11----",..: 1 950 986 1 48 60 61 63 65 67 ...
##  $ est     : int  2 2 2 453 2 1 1 47 1 2 ...
##  $ n1_4    : int  2 2 2 206 1 1 0 30 1 1 ...
##  $ n5_9    : int  0 0 0 77 0 0 0 7 0 0 ...
##  $ n10_19  : int  0 0 0 75 0 0 0 3 0 0 ...
##  $ n20_49  : int  0 0 0 55 1 0 1 7 0 1 ...
##  $ n50_99  : int  0 0 0 22 0 0 0 0 0 0 ...
##  $ n100_249: int  0 0 0 14 0 0 0 0 0 0 ...
##  $ n250_499: int  0 0 0 3 0 0 0 0 0 0 ...
##  $ n500_999: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ n1000   : int  0 0 0 0 0 0 0 0 0 0 ...

If for some reason, one of the other variables did not load as you want, you can change it like so:

zbp02$EST <- as.numeric(zbp02$EST) ##as.integer() is also fine.

Note that the 2002 data include the same zip codes multiple times. These are the counts by different sector. (The 2012 data only includes totals.) To better understand this structure, look at the first Zip Code from the 2002 data:

subset(zbp02, ZIP == 501)
##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 3  501 235310   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 5  501 421990   1    1    0      0      0      0        0        0
## 6  501 422920   1    1    0      0      0      0        0        0
## 7  501 422990   1    1    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 9  501 445110   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 11 501 541110   1    1    0      0      0      0        0        0
## 12 501 541211   1    1    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
## 14 501 621111   1    1    0      0      0      0        0        0
## 15 501 81----   2    0    1      0      1      0        0        0
## 16 501 813110   2    0    1      0      1      0        0        0
## 17 501 99----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 3         0     0
## 4         0     0
## 5         0     0
## 6         0     0
## 7         0     0
## 8         0     0
## 9         0     0
## 10        0     0
## 11        0     0
## 12        0     0
## 13        0     0
## 14        0     0
## 15        0     0
## 16        0     0
## 17        0     0

The NAICS includes the total number of establishments by all sectors [NAICS==“——”], by two-digit industry code, and by six-digit code. EST is the total number of establishments, and the columns to the right are the number of establishments by number of employees (1 to 4, 5 to 9, etc.) To check that the numbers add up try adding the establishments by sector.

zbp02$EST[zbp02$NAICS == "------" & zbp02$ZIP == 501] 
## [1] 11

And check it against the sum of the other columns.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501] )
## [1] 21

Whoops! What happened? The data includes six digit and two digit counts, which led to some double counting. Try including only NAICS that end in “—-” with the substring command.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501 & substring(zbp02$NAICS,3,6)== "----"] )
## [1] 11

To clean up the data a bit, create a a new file that includes only the 2-digit codes and totals. If you want to save a more complete file for future use, you can ignore this step. Note that no matter what you do, the original CSV files remain unchanged unless you specifically overwrite them. Thus you can run and rerun your code without ever changing the underlying data.

zbp02 <- subset(zbp02, substring(zbp02$NAICS,3,6)=="----")
zbp12 <- subset(zbp12, substring(zbp12$naics,3,6)=="----")

Again check your dimensions against the below to make sure that the commands worked.

dim(zbp02)
## [1] 478497     12
dim(zbp12)
## [1] 454477     12

And look at the data.

head(zbp02)
##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 4         0     0
## 8         0     0
## 10        0     0
## 13        0     0
head(zbp12)
##     zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1   501 ------   2    2    0      0      0      0        0        0
## 2   501 81----   2    2    0      0      0      0        0        0
## 4  1001 ------ 453  206   77     75     55     22       14        3
## 5  1001 22----   2    1    0      0      1      0        0        0
## 8  1001 23----  47   30    7      3      7      0        0        0
## 27 1001 31----  48   10    8      9      9      6        3        3
##    n500_999 n1000
## 1         0     0
## 2         0     0
## 4         1     0
## 5         0     0
## 8         0     0
## 27        0     0

Now, use the midpoint method to estimate the number of jobs by establishment size. Note that the average size of 1-to-4 employee establishment is 2.5. For establishments with 1000 or more employees, the midpoint method sets the number of employees to 1000. (Do the 2002 data in steps and then 2012 all at once.) The within() command is one of several nifty ways to avoid writing zbp02$ before each variable.

zbp02 <- within(zbp02, jobs <- N1_4 *2.5 + N5_9 *7 + N10_19 * 14.5 + 
    N20_49 * 34.5 +  N50_99 *74.5 + 
    N100_249 * 174.5 + N250_499 *374.5 + 
    N500_999 * 749.5 +  N1000 *1000)

For this lab, drop all none PA Zip Codes to conserve space. You may want to go back and keep the other states as well at some point in time.PA’s Zip Codes run between 15000 and 19999. Note that you

zbp02<- subset(zbp02, as.numeric(zbp02$ZIP) 
                    >= 15000 & as.numeric(zbp02$ZIP) <= 19999) 

And get rid of the variables that are not needed.

keepers <- c("ZIP", "NAICS", "jobs")
zbp02 <- zbp02[keepers]

And finally clean up the labeling of the NAICS codes.

zbp02$NAICS <- substring(zbp02$NAICS,1,2)
zbp02$NAICS[zbp02$NAICS=="--"] <- "tot"

Have a look

head(zbp02)
##          ZIP NAICS   jobs
## 452595 14001   tot 3180.5
## 452596 14001    23  224.0
## 452610 14001    31 1383.0
## 452629 14001    42  820.5
## 452638 14001    44  189.0
## 452654 14001    48   20.0

This type of data is often referred to as long data. There are multiple entries for each Zip Code and the NAICS column indicates the industry. For future work, it is going to be a little easier to have the data in a wide format, which you can do with ease using the reshape() command.

zbp02 <- reshape(zbp02, 
             timevar = "NAICS",
             idvar = c("ZIP"),
             direction = "wide")

Have a look

head(zbp02)
##          ZIP jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48 jobs.51
## 452595 14001   3180.5   224.0  1383.0   820.5   189.0    20.0       5
## 452714 14003     12.0      NA      NA      NA     7.0      NA      NA
## 452721 14004   2270.5   292.0   467.0    39.0   527.5    60.5      14
## 452854 14005    167.0     7.5    34.5     9.5    31.0    37.0      NA
## 452887 14006   1898.5    92.0   489.5    71.5   295.0    96.5      19
## 452991 14008     51.5    10.0      NA      NA    12.5      NA      NA
##        jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 452595   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 452714      NA      NA      NA      NA      NA      NA      NA     2.5
## 452721    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 452854     2.5      NA      NA     2.5      NA     2.5      NA      NA
## 452887    85.0       7    31.5    22.0      NA    97.0    39.5   440.0
## 452991      NA      NA      NA     2.5      NA    14.5      NA     7.0
##        jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 452595    56.0      49     7.5      NA      NA      NA      NA
## 452714     2.5      NA      NA      NA      NA      NA      NA
## 452721   169.5      NA      NA     9.5      NA      NA      NA
## 452854    19.0      NA      NA    14.0       7      NA      NA
## 452887   106.0      NA     7.0      NA      NA      NA      NA
## 452991     5.0      NA      NA      NA      NA      NA      NA

The default setting sets missing entries to NA, but you should convert these to zero, since there are no reported establishments rather than unknown reported establishments.

Note that zbp02[is.na(zbp02)==T] <- 0 is a logic statement that replaces all variables in the dataset with zeros if they currently equal NA. If you only wanted to do this for job totals, you could write: zbp02$jobs.tot[is.na(zbp02$jobs.tot)==T] <- 0

zbp02[is.na(zbp02)==T] <- 0

The following goes through the same steps for the 2012 data. Try to go through it slowly and understand each step of the process. Use ?command to get a description of functions if needed. Use internet searches as well.

zbp12<- subset(zbp12, as.numeric(zbp12$zip) 
               >= 14000 & as.numeric(zbp12$zip) <= 19999) 

#calculate job totals
zbp12 <- within(zbp12, jobs <- n1_4 *2.5 + n5_9 *7 + n10_19 * 14.5 + 
                  n20_49 * 34.5 +  n50_99 *74.5 + 
                  n100_249 * 174.5 + n250_499 *374.5 + 
                  n500_999 * 749.5 +  n1000 *1000)

zbp12$ZIP <- zbp12$zip ##to make consistent with the 02 data
zbp12$NAICS <- zbp12$naics
##long to wide

zbp12 <- zbp12[keepers] #Keepers should still exist
zbp12$NAICS <- substring(zbp12$NAICS,1,2)
zbp12$NAICS[zbp12$NAICS=="--"] <- "tot"

#long to wide
zbp12 <- reshape(zbp12, 
                 timevar = "NAICS",
                 idvar = c("ZIP"),
                 direction = "wide")

#switch NA to zero
zbp12[is.na(zbp12)==T] <- 0

Finally, add the total number of jobs from 2010 to the 2000 dataset to use for making predictions in a later lab. Create a new dataset so that zbp12 does not change.

zipmerge <- zbp12[c("ZIP", "jobs.tot")]
zipmerge$jobs_plus10 <- zipmerge$jobs.tot
zipmerge <- zipmerge[c("ZIP", "jobs_plus10")]
zbp02 <- merge(zipmerge, zbp02, by ="ZIP", all.x = T, all.y =T)
rm(zipmerge, keepers)

Look at the file. This time keep the NAs since they do in fact represent missing data. For example there was no Zip code 14003 in 2012.

head(zbp02)
##     ZIP jobs_plus10 jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48
## 1 14001      3015.0   3180.5   224.0  1383.0   820.5   189.0    20.0
## 2 14003          NA     12.0     0.0     0.0     0.0     7.0     0.0
## 3 14004      2368.0   2270.5   292.0   467.0    39.0   527.5    60.5
## 4 14005       182.5    167.0     7.5    34.5     9.5    31.0    37.0
## 5 14006      1370.0   1898.5    92.0   489.5    71.5   295.0    96.5
## 6 14008        73.0     51.5    10.0     0.0     0.0    12.5     0.0
##   jobs.51 jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 1       5   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 2       0     0.0       0     0.0     0.0     0.0     0.0     0.0     2.5
## 3      14    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 4       0     2.5       0     0.0     2.5     0.0     2.5     0.0     0.0
## 5      19    85.0       7    31.5    22.0     0.0    97.0    39.5   440.0
## 6       0     0.0       0     0.0     2.5     0.0    14.5     0.0     7.0
##   jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 1    56.0      49     7.5     0.0       0       0       0
## 2     2.5       0     0.0     0.0       0       0       0
## 3   169.5       0     0.0     9.5       0       0       0
## 4    19.0       0     0.0    14.0       7       0       0
## 5   106.0       0     7.0     0.0       0       0       0
## 6     5.0       0     0.0     0.0       0       0       0

If desired, save this as an R data file. In the future, you can double click on this file to open the PA 2002 and 2012 job estimates by 2-digit NAICS for PA.

save(zbp02, zbp12,file = "zip_jobs.Rda")

As you process your own data in the future, you will lean a number of different commands and packages to help you along the way. There are always dozens of different ways to do the same thing. If you are having trouble to get the program to do what you want it to do, it often makes sense to think about the problem differently. For example, maybe you want a nice summary table of jobs by industry and by state, but you are having trouble getting a single command to do it. Try instead to do each operation separately, and then combine them.

EXERCISE

  1. Look up the 2-digit NAICS code for Manufacturing. What Zip Codes had the most gains or losses (both total and proportional) by State. Now, make a summary table by State that includes total estimated manufacturing jobs in 2002 and 2012 and the total and proportional change.
  2. Make the same table as above but only include Zip Codes that have 100 or more jobs. Describe differences between the tables.
  3. Make a table of the number of establishments that have 1000+ employees and the total number of establishments by 2-digit NAICS in 2012. (Hint. Either create two new datasets–one for 1000+ establishments by 2-digit NAICS, one for all establishments by 2-digit NAICS–and then merge them. Or use the sum() command on subsets of the data.) What industries have the highest proportion of 1000+ establishments.
  4. Make the same table as in question 3, but list the estimated number of jobs instead of establishments. Describe similarities and differences between the two tables.

 

 

This entry was posted in Planning Methods. Bookmark the permalink.