Planning Methods: Population trend model

Now that you have loaded and examined the UN data on central African cities, we are going to run some ordinary least squares (OLS) regressions to predict Libreville’s population. For a review of OLS, please refer to the class readings:

An Introduction to Statistical Learning, chapter 3

Video lectures available here (also, chapter 3)

Tufte’s Data Analysis for Politics and Policy ($2 for ebook)

Let’s start by looking at the population over time:

plot(pop$Year, pop$Libreville)

It looks like we have a pretty linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict the population by year using linear OLS and the full set of data, we’ll get the line that minimizes the sum of squared distances between the line and the observed points. In R, we can use the lm function (short for linear model) to estimate the slope and intercept of the line.

lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
## 
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.81 -24.46  -4.11  21.42  91.48 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.185e+04  1.314e+03  -16.62 3.84e-09 ***
## pop$Year     1.116e+01  6.637e-01   16.82 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.77 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9626, Adjusted R-squared:  0.9592 
## F-statistic:   283 on 1 and 11 DF,  p-value: 3.389e-09

What do the estimates mean?

The predicted population (a point on the least-squares line) is equal to the intercept plus the year times the estimated coefficient for year.

So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16. We can also use the outputs from the regression to save the trouble of entering the numbers manually.

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000
## [1] 481.7582

Or we could predict each year in the data:

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year
##  [1] -76.48352 -20.65934  35.16484  90.98901 146.81319 202.63736 258.46154
##  [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330

Or look at the predictions next to the data (note how I use the predict function instead of the manual estimation from above).

cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop))#NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind
##    [,1] [,2]      [,3]
## 1  1950   15 -76.48352
## 2  1955   17 -20.65934
## 3  1960   29  35.16484
## 4  1965   49  90.98901
## 5  1970   77 146.81319
## 6  1975  155 202.63736
## 7  1980  234 258.46154
## 8  1985  293 314.28571
## 9  1990  366 370.10989
## 10 1995  439 425.93407
## 11 2000  496 481.75824
## 12 2005  559 537.58242
## 13 2010  631 593.40659
## 14 2015   NA 649.23077
## 15 2020   NA 705.05495
## 16 2025   NA 760.87912
## 17 2030   NA 816.70330

Is this a good predictor of population from 2015 to 2030?

The R2 from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To try to answer that question, let’s start by looking at the regression line.

plot(pop$Year, pop$Libreville)
abline(lbv)

The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1965 and 1990, and underpredicting again after 1990. We also predict negative population before 1960. This is quite apparent when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)

plot(predict(lbv), resid(lbv))
abline(h=0)

Do you expect a population prediction for 2020 based on the trend line from 1950 to 2010 to overestimate or underestimate the population? Why?


If you guessed underpredict, then I agree. While we don’t necessarily know that the ongoing trend will continue, we’re making increasingly bad estimates from 1995 on. If you knew Libreville, you would also probably not guess that population growth is going to slow in the near future.
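You can check this directly by lining the residuals up with their years (a quick sketch; lm drops the four missing years, so the residual vector matches the 13 observed years):

yrs <- pop$Year[!is.na(pop$Libreville)]  # the 13 years with observed population
cbind(yrs, round(resid(lbv), 1))         # positive residuals mean we underpredicted that year

Note the run of increasingly positive residuals from 1995 to 2010.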

Another way that we might test our model is to see how well our general modeling framework predicts values outside of the sample used in the regression. For cross-sectional data, we might fit our model on a random sample of the data and then see how well it predicts the rest.
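Here is a minimal sketch of that random-split idea, assuming a hypothetical cross-sectional data frame df with outcome y and predictor x (our short, ordered population series is not a good candidate for it):

set.seed(1)                                     # make the random split reproducible
train <- sample(nrow(df), floor(nrow(df) / 2))  # fit on a random half...
fit <- lm(y ~ x, data = df[train, ])
mean((df$y[-train] - predict(fit, newdata = df[-train, ]))^2)  # ...and test on the other half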

Since we are predicting growth trends, however, it makes more sense to intentionally remove the most recent datapoints and see how well a model using older data predicts them.

lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))

plot(pop$Year, pop$Libreville)
abline(lbv_back)

This does not look nearly so good. The R2 of lbv_back suggests that we explain 90% of the variance in the data, but the prediction error for the held-out years is much worse. We can compare the mean squared error of the out-of-sample predictions to the in-sample residuals.

mean(resid(lbv_back)^2)
## [1] 1438.128
mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)
## [1] 9702.728

Our average squared prediction error for 1995 to 2010 is almost seven times higher than the in-sample error for 1950 to 1990. In fact, we would have better predictions of the population from 1995 to 2010 just by taking the average population between 1995 and 2010 and calling it a day.
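You can verify that claim with a quick sketch (obs is a new helper object):

obs <- pop$Libreville[pop$Year > 1994 & pop$Year < 2015]  # observed 1995-2010 populations
mean((obs - mean(obs))^2)  # MSE from predicting the period mean for every year

This naive benchmark comes in well under the 9,703 above.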

There are a few ways that we might try to get better predictions.

The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:

plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
summary(lbv)
## 
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year > 
##     1970))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6667 -3.5595 -0.9524  3.5060  8.8095 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.636e+04  3.534e+02  -74.58 3.91e-10 ***
## Year         1.343e+01  1.774e-01   75.70 3.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.747 on 6 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.9988 
## F-statistic:  5731 on 1 and 6 DF,  p-value: 3.576e-10
abline(lbv)

Though not necessarily well-suited to these data, adding a quadratic term can often improve model fit:

lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
## 
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.701 -11.787   6.736  15.260  29.637 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.190e+05  8.743e+04   4.792 0.000733 ***
## pop$Year      -4.341e+02  8.832e+01  -4.915 0.000609 ***
## I(pop$Year^2)  1.124e-01  2.230e-02   5.042 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.95 on 10 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9873 
## F-statistic: 468.3 on 2 and 10 DF,  p-value: 1.315e-10
plot(pop$Year, pop$Libreville)

lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)

Or we could try predicting Libreville’s population as a function of a larger geography, such as all Sub-Saharan African cities over 300,000.

plot(pop$Sub_Saharan_300kplus, pop$Libreville)

summary(lm(Libreville~Sub_Saharan_300kplus,pop))
## 
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.44 -27.04 -23.48  41.54  51.95 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          22.016224  15.931117   1.382    0.194    
## Sub_Saharan_300kplus  0.004803   0.000242  19.850 5.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.14 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9704 
## F-statistic:   394 on 1 and 11 DF,  p-value: 5.795e-10
abline(lm(Libreville~Sub_Saharan_300kplus,pop))

Or we could take advantage of knowing the population in the period immediately before our prediction. Our previous models ignored the fact that the 2010 data are far more informative than any earlier year for predicting future population.

First, we need to create a new time-lagged population variable. Walk through the code slowly, running head(pop_n1) after each line to see exactly what each step does.

pop_n1 <- pop[c("Libreville", "Year")]             # copy just the population and year columns
pop_n1$Year <- pop_n1$Year + 5                     # shift each observation forward five years
pop_n1$Libreville_m5 <- pop_n1$Libreville          # relabel it as the five-year-lagged population
pop_n1 <- pop_n1[c(2:3)]                           # keep only Year and Libreville_m5
pop_n1 <- subset(pop_n1, pop_n1$Year < 2015)       # keep only years with observed population data
pop <- merge(pop, pop_n1, by = "Year", all.x = T)  # join the lag back onto the main data

Now predict Libreville’s population using the population 5 years earlier and the year.

lbv3 <- lm(Libreville ~ Libreville_m5 + Year, pop)
summary(lbv3)
## 
## Call:
## lm(formula = Libreville ~ Libreville_m5 + Year, data = pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.877  -6.747  -1.215   8.818  15.653 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -9.271e+03  1.961e+03  -4.726  0.00108 ** 
## Libreville_m5  6.705e-01  9.021e-02   7.432 3.96e-05 ***
## Year           4.740e+00  9.995e-01   4.742  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.79 on 9 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9972, Adjusted R-squared:  0.9966 
## F-statistic:  1601 on 2 and 9 DF,  p-value: 3.271e-12

This model explains 99.7% of the variance in the data and indicates that the predicted population in 2015 is equal to the intercept plus .67 times the 2010 population plus 4.74 times the year (2015). To estimate the population in 2020, you first need a 2015 population estimate to serve as the lagged value.
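For example, here is a sketch that chains the model forward (pop2015_hat and pop2020_hat are new helper names):

b <- coef(lbv3)  # intercept, lag coefficient, year coefficient
pop2015_hat <- b[1] + b[2] * pop$Libreville[pop$Year == 2010] + b[3] * 2015
pop2020_hat <- b[1] + b[2] * pop2015_hat + b[3] * 2020  # the 2015 estimate becomes the lag
c(pop2015_hat, pop2020_hat)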

plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014 & pop$Year > 1950], predict(lbv3), col=3)

EXERCISE Download population data for New York City and read it into R.

  1. Plot New York City’s population from 1790 to 2010 and describe the major trends.
  2. Based on these trends and your knowledge of NYC, describe what you think NYC’s population will be in 2020 and why.
  3. Use a linear regression to make a prediction of New York’s 2020 population. Is the prediction any good? Why?
  4. Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).
  5. Try a quadratic regression.
  6. Now try a cubic function (hint: add the term I(pop^3) to the quadratic regression).
  7. Use earlier data (i.e., excluding the prediction years) to predict NYC’s 2000 and 2010 population. Describe the predictions in relationship to the earlier estimates.
  8. Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?
  9. Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.
  10. Create a time-lagged population variable and use it to predict New York’s population. What is the expected population in 2020?

Moving to access report

Brookings Institution report with Gilles Duranton:

Developing a common narrative on urban accessibility: An urban planning perspective


Beyond Mobility book cover

Sneak peek at the cover of my forthcoming Island Press book with Robert Cervero and Stefan Al:

Traditionally, transport planning has sought to enhance mobility. Shifting the focus from movement to accessing destinations leads to an entirely different framework for planning and designing cities and the infrastructure that serves them. This book illustrates principles and practices of linking sustainable transport programs with place-making.

Queretaro Bike Planning Studio

Last fall, I took 8 graduate students to Queretaro, Mexico, for a bicycle planning studio. The final planning document is available here: CYCLE_QRO_Small

A few group images from the trip:


Consumer preference survey on electric motorcycles in Solo, Indonesia

https://kotakita-journal.tumblr.com/post/137522690302/what-would-it-take-for-you-to-give-up-your


Planning for cars that drive themselves

Driverless cars are coming. Nearly all major car manufacturers and several large technology firms are racing to develop vehicles that can operate with limited-to-no human interaction. Despite the probability of technological, legal, and political pitfalls along the way, fully autonomous vehicles (buses and cars) will likely be commercially available and driving themselves on city streets and highways in the next 20 years—well within a long-range planning horizon.

Read more.

Examining Philadelphia land use data by Census Tract





The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.

Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value is aggregated at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and city hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.

You can also find a description of the underlying tax assessor data here.

It is a fairly wide dataset (77 columns), so start by looking at the variable names and column numbers.

names(dat)
##  [1] "fips"                       "tract"                     
##  [3] "sfr.built.sf"               "mfr.built.sf"              
##  [5] "retail.built.sf"            "comm.built.sf"             
##  [7] "indust.built.sf"            "vacant.built.sf"           
##  [9] "exempt.built.sf"            "sfr.lot.sf"                
## [11] "mfr.lot.sf"                 "retail.lot.sf"             
## [13] "comm.lot.sf"                "indust.lot.sf"             
## [15] "vacant.lot.sf"              "exempt.lot.sf"             
## [17] "sfr.mark.val"               "mfr.mark.val"              
## [19] "retail.mark.val"            "comm.mark.val"             
## [21] "indust.mark.val"            "vacant.mark.val"           
## [23] "exempt.mark.val"            "total.built.sf"            
## [25] "total.lot.sf"               "total.mark.val"            
## [27] "ft.2.city.hall"             "ft.2.train.st"             
## [29] "ft.2.bus.st"                "pop.2010"                  
## [31] "pop_density.2010"           "square_miles.2010"         
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"            
## [35] "pop_asian.2010"             "pop_hispanic.2010"         
## [37] "households.2010"            "households_family.2010"    
## [39] "households_nonfamily.2010"  "avg_hh_size.2010"          
## [41] "pop_25_plus.2010"           "edu_less_highschool.2010"  
## [43] "edu_highschool.2010"        "edu_collegeplus.2010"      
## [45] "pop_civ_employed.2010"      "pop_civ_unemployed.2010"   
## [47] "median_hh_income.2010"      "per_capita_income.2010"    
## [49] "housing_vacant.2010"        "housing_occupied.2010"     
## [51] "housing_median_age.2010"    "percent_poverty.2010"      
## [53] "pop_change.2000"            "pop_plus10.2000"           
## [55] "pop.2000"                   "pop_density.2000"          
## [57] "square_miles.2000"          "pop_white_nonhispanic.2000"
## [59] "pop_black.2000"             "pop_asian.2000"            
## [61] "pop_hispanic.2000"          "households.2000"           
## [63] "households_family.2000"     "households_nonfamily.2000" 
## [65] "avg_hh_size.2000"           "pop_25_plus.2000"          
## [67] "edu_less_highschool.2000"   "edu_highschool.2000"       
## [69] "edu_collegeplus.2000"       "pop_civ_employed.2000"     
## [71] "pop_civ_unemployed.2000"    "median_hh_income.2000"     
## [73] "per_capita_income.2000"     "housing_vacant.2000"       
## [75] "housing_occupied.2000"      "housing_median_age.2000"   
## [77] "percent_poverty.2000"

Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.

summary(dat[c(3:26)])
##   sfr.built.sf      mfr.built.sf     retail.built.sf  comm.built.sf     
##  Min.   :      0   Min.   :      0   Min.   :     0   Min.   :       0  
##  1st Qu.: 848570   1st Qu.: 144324   1st Qu.: 22700   1st Qu.:   52053  
##  Median :1429116   Median : 270901   Median : 71180   Median :  120155  
##  Mean   :1432991   Mean   : 395834   Mean   : 93474   Mean   :  357440  
##  3rd Qu.:1970797   3rd Qu.: 461973   3rd Qu.:133074   3rd Qu.:  239220  
##  Max.   :4357472   Max.   :3099629   Max.   :780672   Max.   :19805858  
##  indust.built.sf   vacant.built.sf    exempt.built.sf   
##  Min.   :      0   Min.   :   0.000   Min.   :    3771  
##  1st Qu.:      0   1st Qu.:   0.000   1st Qu.:  178846  
##  Median :  32572   Median :   0.000   Median :  319638  
##  Mean   : 272199   Mean   :   8.906   Mean   :  712816  
##  3rd Qu.: 198236   3rd Qu.:   0.000   3rd Qu.:  683018  
##  Max.   :6372334   Max.   :1520.000   Max.   :28206707  
##    sfr.lot.sf         mfr.lot.sf      retail.lot.sf     comm.lot.sf      
##  Min.   :       0   Min.   :      0   Min.   :     0   Min.   :       0  
##  1st Qu.:  955739   1st Qu.: 115597   1st Qu.: 23083   1st Qu.:   79158  
##  Median : 1845223   Median : 251603   Median : 54802   Median :  168255  
##  Mean   : 2476646   Mean   : 400486   Mean   : 66522   Mean   :  497081  
##  3rd Qu.: 3139156   3rd Qu.: 516614   3rd Qu.: 89680   3rd Qu.:  455709  
##  Max.   :33126247   Max.   :3026616   Max.   :633754   Max.   :17741673  
##  indust.lot.sf      vacant.lot.sf      exempt.lot.sf     
##  Min.   :       0   Min.   :       0   Min.   :   29248  
##  1st Qu.:       0   1st Qu.:   68008   1st Qu.:  348156  
##  Median :   31122   Median :  167437   Median :  746535  
##  Mean   :  548498   Mean   :  448650   Mean   : 1803958  
##  3rd Qu.:  212020   3rd Qu.:  415549   3rd Qu.: 1628277  
##  Max.   :26713019   Max.   :13823323   Max.   :40611433  
##   sfr.mark.val        mfr.mark.val       retail.mark.val   
##  Min.   :        0   Min.   :        0   Min.   :       0  
##  1st Qu.: 17434950   1st Qu.:  2777300   1st Qu.:  522825  
##  Median : 40370150   Median :  6816500   Median : 1359150  
##  Mean   : 48218979   Mean   : 13362086   Mean   : 2320078  
##  3rd Qu.: 70492725   3rd Qu.: 14893450   3rd Qu.: 2729175  
##  Max.   :249615900   Max.   :248868900   Max.   :32522100  
##  comm.mark.val       indust.mark.val     vacant.mark.val   
##  Min.   :0.000e+00   Min.   :        0   Min.   :       0  
##  1st Qu.:1.369e+06   1st Qu.:        0   1st Qu.:  214075  
##  Median :3.658e+06   Median :   393350   Median :  481950  
##  Mean   :1.887e+07   Mean   :  4313410   Mean   : 1652835  
##  3rd Qu.:9.132e+06   3rd Qu.:  2400625   3rd Qu.: 1139750  
##  Max.   :1.671e+09   Max.   :154563400   Max.   :36376000  
##  exempt.mark.val     total.built.sf      total.lot.sf     
##  Min.   :5.657e+05   Min.   :   17886   Min.   :  605741  
##  1st Qu.:6.623e+06   1st Qu.: 2190500   1st Qu.: 2539507  
##  Median :1.379e+07   Median : 2830458   Median : 4162716  
##  Mean   :4.344e+07   Mean   : 3264763   Mean   : 6241842  
##  3rd Qu.:3.132e+07   3rd Qu.: 3534784   3rd Qu.: 6811800  
##  Max.   :1.502e+09   Max.   :28206707   Max.   :91325669  
##  total.mark.val     
##  Min.   :4.546e+06  
##  1st Qu.:5.499e+07  
##  Median :9.375e+07  
##  Mean   :1.322e+08  
##  3rd Qu.:1.392e+08  
##  Max.   :2.286e+09

There are also four variables related to geographic location. The fips codes can be matched to a census tract shapefile; the others measure the number of feet between a census tract centroid and city hall, the closest train station, and the closest bus stop.

summary(dat[c(1,27:29)])
##       fips          ft.2.city.hall  ft.2.train.st      ft.2.bus.st     
##  Min.   :4.21e+10   Min.   : 2881   Min.   :  258.3   Min.   :   6.36  
##  1st Qu.:4.21e+10   1st Qu.:16596   1st Qu.: 4087.1   1st Qu.: 265.63  
##  Median :4.21e+10   Median :28692   Median : 6777.2   Median : 628.73  
##  Mean   :4.21e+10   Mean   :31776   Mean   : 7443.2   Mean   : 781.29  
##  3rd Qu.:4.21e+10   3rd Qu.:43484   3rd Qu.: 9832.9   3rd Qu.:1083.03  
##  Max.   :4.21e+10   Max.   :87673   Max.   :26020.5   Max.   :7140.68

The remaining data are taken from the 2000 Census and 2010 5-year ACS. Note that not all of the 2000 data match and there are several null values in different categories.
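A quick sketch for locating the missing values:

na.counts <- colSums(is.na(dat))  # count the missing entries in each column
na.counts[na.counts > 0]          # show only the columns with missing data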

Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.

sum(dat$square_miles.2010)
## [1] 134.1014
sum(dat$total.lot.sf) / 27.8784e6 # divided by the number of square feet in a square mile
## [1] 85.97578

Now start creating and examining some key ratios, like total floor area ratio (FAR). There is significant variation, with some tracts having almost no built area and others a high net FAR.

dat$far <- dat$total.built.sf / dat$total.lot.sf
summary(dat$far)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.005557  0.419600  0.696300  0.875800  1.052000 13.330000

Try excluding vacant land from the FAR calculation.

summary(dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.005557  0.423300  0.748200  0.943900  1.165000 14.650000

And look at how vacancy varies across census tracts:

hist(dat$vacant.lot.sf/ dat$total.lot.sf )


Now, create some net FAR calculations by land use type.

summary(dat$comm.built.sf/ dat$comm.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00376  0.33490  0.61070  1.09600  1.16400 24.78000        9
summary(dat$indust.built.sf/ dat$indust.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00711  0.59420  1.04900  1.44700  1.66800 16.04000      104
summary(dat$retail.built.sf/ dat$retail.lot.sf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0355  0.9365  1.3690  1.3830  1.8030  4.3190      29
summary(dat$exempt.built.sf/ dat$exempt.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001703 0.209100 0.522600 0.783200 1.033000 7.203000

It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understanding how much space new households are likely to consume. Note the substantial variation.

dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010
dat$res.sf.per.hh[dat$res.sf.per.hh == Inf] <- NA ## drop Inf values caused by tracts with zero households (n/0 == Inf)
summary( dat$res.sf.per.hh )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   104.1  1092.0  1220.0  1237.0  1367.0  4538.0       8

This looks about right, with most households having between 1,000 and 1,400 square feet.

hist( dat$res.sf.per.hh )


And per capita

dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010
dat$res.sf.per.person[dat$res.sf.per.person == Inf] <- NA
hist( dat$res.sf.per.person )


Look at the relationship between the population and square footage of housing.

plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )


Perhaps surprisingly, there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood.

plot(dat$res.sf.per.person , dat$ft.2.city.hall)


There is a bit more of a pattern between location and the proportion of housing that is multifamily.

dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf )
plot(dat$per.mf , dat$ft.2.city.hall)


Which land uses tend to go together? Use a correlation table to examine total square footage.

cor(dat[c(3:7,9)])
##                 sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## sfr.built.sf       1.0000000 -0.227687793     0.208504423   -0.21049419
## mfr.built.sf      -0.2276878  1.000000000     0.009489829    0.45513983
## retail.built.sf    0.2085044  0.009489829     1.000000000    0.04310543
## comm.built.sf     -0.2104942  0.455139826     0.043105428    1.00000000
## indust.built.sf   -0.1466598 -0.135759385    -0.038551580    0.04383247
## exempt.built.sf   -0.2613878  0.103988183    -0.049405209    0.16346712
##                 indust.built.sf exempt.built.sf
## sfr.built.sf        -0.14665984     -0.26138775
## mfr.built.sf        -0.13575939      0.10398818
## retail.built.sf     -0.03855158     -0.04940521
## comm.built.sf        0.04383247      0.16346712
## indust.built.sf      1.00000000      0.06878225
## exempt.built.sf      0.06878225      1.00000000

And land area

cor(dat[c(10:16)])
##                 sfr.lot.sf   mfr.lot.sf retail.lot.sf comm.lot.sf
## sfr.lot.sf     1.000000000  0.174062104   -0.01660241  0.03719301
## mfr.lot.sf     0.174062104  1.000000000   -0.06027767  0.12670109
## retail.lot.sf -0.016602407 -0.060277674    1.00000000 -0.01112620
## comm.lot.sf    0.037193013  0.126701092   -0.01112620  1.00000000
## indust.lot.sf -0.071661971 -0.009073780   -0.10092578  0.39273649
## vacant.lot.sf -0.013189068  0.009647197   -0.05241089  0.71002277
## exempt.lot.sf  0.000523062  0.011982805   -0.16823137  0.37194754
##               indust.lot.sf vacant.lot.sf exempt.lot.sf
## sfr.lot.sf      -0.07166197  -0.013189068   0.000523062
## mfr.lot.sf      -0.00907378   0.009647197   0.011982805
## retail.lot.sf   -0.10092578  -0.052410890  -0.168231370
## comm.lot.sf      0.39273649   0.710022767   0.371947536
## indust.lot.sf    1.00000000   0.643286320   0.489299242
## vacant.lot.sf    0.64328632   1.000000000   0.531508105
## exempt.lot.sf    0.48929924   0.531508105   1.000000000

Neither of these measures normalizes the data. Now compare how the proportion of area in a given land use compares with the proportion in other land uses.

cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))
##              mfr.built.sf
## sfr.built.sf   -0.2779901
cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))
##              retail.built.sf
## sfr.built.sf       0.1160276
cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))
##              comm.built.sf
## sfr.built.sf    -0.4443231
cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))
##              indust.built.sf
## sfr.built.sf      -0.3305134
cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))
##              vacant.built.sf
## sfr.built.sf     0.008721208
cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))
##              exempt.built.sf
## sfr.built.sf      -0.6608287
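The same comparison can be run in a single call by building the shares first (a sketch; shares is a new helper object):

shares <- dat[3:9] / dat$total.built.sf  # each use's share of the total built square footage
cor(shares, use = "complete.obs")        # correlations computed on complete rows only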

Try using regressions to better understand partial correlations between land uses.

summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf 
           +  exempt.built.sf +  vacant.lot.sf  , dat))
## 
## Call:
## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf + 
##     comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1548586  -500194   -17272   498795  2940319 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.550e+06  6.833e+04  22.686  < 2e-16 ***
## mfr.built.sf    -3.486e-01  1.005e-01  -3.469 0.000582 ***
## retail.built.sf  1.605e+00  3.848e-01   4.172 3.76e-05 ***
## comm.built.sf   -5.509e-02  3.127e-02  -1.762 0.078859 .  
## indust.built.sf -1.539e-01  6.400e-02  -2.405 0.016662 *  
## exempt.built.sf -8.591e-02  1.978e-02  -4.344 1.80e-05 ***
## vacant.lot.sf   -1.419e-02  3.871e-02  -0.367 0.714091    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728300 on 377 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1661 
## F-statistic: 13.71 on 6 and 377 DF,  p-value: 4.216e-14

EXERCISE

  1. Estimate a gross measure of total FAR using Census land area and plot this against net FAR. Describe any important variation.

  2. Try developing a model to predict neighborhood FAR. What are the important and statistically significant predictors?

  3. Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses?


Planning Methods: Predicting changes in jobs by Zip Code





For this lab, load the Zip Code data from the previous lab. This should include the following files with the following dimensions. You can also download the data here.

dim(zbp02)
## [1] 2207   24
dim(zbp12)
## [1] 2125   22

The purpose of this lab is to reinforce the regression and visualization skills learned in lab 2 and lab 3 using the Zip Business Patterns data.

Start by summarizing the 2002 data, which you will use to predict job changes in 2012.

summary(zbp02)
##       ZIP         jobs_plus10         jobs.tot           jobs.23      
##  Min.   :15001   Min.   :    2.5   Min.   :    2.50   Min.   :   0.0  
##  1st Qu.:16010   1st Qu.:   71.0   1st Qu.:   72.25   1st Qu.:   2.5  
##  Median :17235   Median :  411.0   Median :  387.50   Median :  22.0  
##  Mean   :17318   Mean   : 2654.3   Mean   : 2597.33   Mean   : 138.6  
##  3rd Qu.:18618   3rd Qu.: 2327.0   3rd Qu.: 2234.25   3rd Qu.: 126.5  
##  Max.   :19980   Max.   :79883.5   Max.   :77848.50   Max.   :4476.5  
##                  NA's   :82        NA's   :40         NA's   :40      
##     jobs.31           jobs.42          jobs.44           jobs.48       
##  Min.   :    0.0   Min.   :   0.0   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:    0.0   1st Qu.:   0.0   1st Qu.:    2.5   1st Qu.:   0.00  
##  Median :   34.5   Median :   7.5   Median :   31.0   Median :   7.00  
##  Mean   :  357.2   Mean   : 128.8   Mean   :  377.0   Mean   :  73.68  
##  3rd Qu.:  341.5   3rd Qu.:  74.5   3rd Qu.:  273.2   3rd Qu.:  47.25  
##  Max.   :10076.0   Max.   :5025.5   Max.   :10491.0   Max.   :5435.00  
##  NA's   :40        NA's   :40       NA's   :40        NA's   :40       
##     jobs.51           jobs.52            jobs.53           jobs.54       
##  Min.   :   0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.0  
##  1st Qu.:   0.00   1st Qu.:    0.00   1st Qu.:   0.00   1st Qu.:    0.0  
##  Median :   0.00   Median :    7.00   Median :   0.00   Median :    5.0  
##  Mean   :  72.96   Mean   :  155.58   Mean   :  39.46   Mean   :  158.4  
##  3rd Qu.:  14.50   3rd Qu.:   50.25   3rd Qu.:  17.50   3rd Qu.:   57.0  
##  Max.   :4203.50   Max.   :15671.50   Max.   :1739.50   Max.   :22421.0  
##  NA's   :40        NA's   :40         NA's   :40        NA's   :40       
##     jobs.56          jobs.61           jobs.62          jobs.71       
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.00  
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.:   0.0   1st Qu.:   0.00  
##  Median :   5.0   Median :   0.00   Median :  16.5   Median :   0.00  
##  Mean   : 145.6   Mean   :  66.51   Mean   : 362.8   Mean   :  41.55  
##  3rd Qu.:  56.0   3rd Qu.:   8.50   3rd Qu.: 208.2   3rd Qu.:  17.00  
##  Max.   :9158.5   Max.   :3734.50   Max.   :8389.5   Max.   :2858.00  
##  NA's   :40       NA's   :40        NA's   :40       NA's   :40       
##     jobs.72          jobs.81          jobs.95           jobs.99       
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:   0.0   1st Qu.:   2.5   1st Qu.:   0.00   1st Qu.:  0.000  
##  Median :  19.5   Median :  19.5   Median :   0.00   Median :  0.000  
##  Mean   : 213.5   Mean   : 139.4   Mean   :  26.79   Mean   :  1.355  
##  3rd Qu.: 154.2   3rd Qu.: 114.8   3rd Qu.:   0.00   3rd Qu.:  0.000  
##  Max.   :5445.5   Max.   :4624.5   Max.   :1961.00   Max.   :214.500  
##  NA's   :40       NA's   :40       NA's   :40        NA's   :40       
##     jobs.21            jobs.11          jobs.22           jobs.55      
##  Min.   :   0.000   Min.   :  0.00   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.000   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:   0.0  
##  Median :   0.000   Median :  0.00   Median :   0.00   Median :   0.0  
##  Mean   :   8.461   Mean   :  2.05   Mean   :  18.45   Mean   :  69.4  
##  3rd Qu.:   0.000   3rd Qu.:  0.00   3rd Qu.:   0.00   3rd Qu.:   2.5  
##  Max.   :1538.500   Max.   :194.00   Max.   :1785.50   Max.   :6602.0  
##  NA's   :40         NA's   :40       NA's   :40        NA's   :40

As with the census tract data, not all of the Zip Codes match across the two years, resulting in missing data.
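A one-line sketch to count them:

sum(is.na(zbp02$jobs_plus10))  # 2002 Zip Codes with no matching 2012 job count

This matches the 82 NA's reported in the summary above.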

Plot the relationship between the number of jobs in 2002 and 2012.

plot(zbp02$jobs.tot, zbp02$jobs_plus10)


And look at the distribution of jobs across Zip Codes. Most Zip Codes have very few jobs, with a long tail of job-rich Zip Codes.

plot(density(zbp02$jobs_plus10, na.rm = T))


hist(zbp02$jobs_plus10)


Look at a specific Zip Code (19104).

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP == 19104], zbp02$jobs.tot[zbp02$ZIP==19104], col="red")


And all Philadelphia Zip Codes, plus a few others. These Zip Codes do not look particularly different from others in PA.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP > 19018 & zbp02$ZIP < 19256], 
       zbp02$jobs.tot[zbp02$ZIP> 19018 & zbp02$ZIP < 19256], col="red")


Now, predict the 2012 jobs as a function of the number of jobs in 2002. The regression line fits the data almost perfectly.

reg1 <- lm(jobs_plus10 ~ jobs.tot, zbp02) 
summary(reg1)
## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.tot, data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7887.2  -158.2   -82.6    26.7 10332.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.743743  26.051260   3.176  0.00151 ** 
## jobs.tot     0.967799   0.004028 240.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1081 on 2083 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9652, Adjusted R-squared:  0.9652 
## F-statistic: 5.774e+04 on 1 and 2083 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
abline(reg1)


It also looks like there might be some systematic differences in prediction quality by the number of jobs.

plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)


Try to see which types of jobs predict job losses or job gains.

summary(
  lm(jobs_plus10 ~jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
       jobs.53 + jobs.54 +jobs.56  + jobs.61 + jobs.62 + jobs.71  + jobs.72 + 
       jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22   + jobs.55, zbp02)
)
## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + 
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + 
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + 
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      1.48264    0.11445  12.954  < 2e-16 ***
## jobs.31      0.86957    0.03842  22.633  < 2e-16 ***
## jobs.42      1.75718    0.11463  15.329  < 2e-16 ***
## jobs.44      0.90575    0.05998  15.100  < 2e-16 ***
## jobs.48      0.66433    0.10652   6.237 5.41e-10 ***
## jobs.51      0.68900    0.13370   5.153 2.80e-07 ***
## jobs.52      0.59305    0.06731   8.811  < 2e-16 ***
## jobs.53      2.82325    0.31606   8.933  < 2e-16 ***
## jobs.54      1.51710    0.07039  21.552  < 2e-16 ***
## jobs.56      0.17040    0.08420   2.024 0.043112 *  
## jobs.61      1.47967    0.11901  12.433  < 2e-16 ***
## jobs.62      0.85000    0.04409  19.281  < 2e-16 ***
## jobs.71      0.75343    0.20036   3.760 0.000174 ***
## jobs.72      1.52875    0.12252  12.478  < 2e-16 ***
## jobs.81      0.44943    0.18831   2.387 0.017095 *  
## jobs.95      0.74847    0.19179   3.903 9.82e-05 ***
## jobs.99      5.25699    4.88099   1.077 0.281591    
## jobs.21      1.13877    0.36736   3.100 0.001962 ** 
## jobs.11      3.63400    2.12568   1.710 0.087495 .  
## jobs.22      0.88365    0.22376   3.949 8.11e-05 ***
## jobs.55      0.11145    0.11360   0.981 0.326664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711 
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16

Most of the parameter estimates are statistically different from zero with a high degree of confidence, but that is not really a very useful finding. You really want to know whether job types are statistically different from 1 (i.e., are certain job types more or less likely to be decreasing over time?). You can roughly approximate this by looking at the coefficient estimate and the standard error: if the coefficient estimate plus or minus two standard errors crosses 1, then the estimate is not statistically different from one with 95% confidence.
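Here is a sketch of that rough check, assuming you first save the model above to an object (reg_jobs is a hypothetical name, e.g. reg_jobs <- lm(jobs_plus10 ~ jobs.23 + ..., zbp02)):

cf <- summary(reg_jobs)$coefficients  # estimate, std. error, t value, and p value per term
t_vs_one <- (cf[-1, "Estimate"] - 1) / cf[-1, "Std. Error"]  # t-statistic against 1 rather than 0
names(t_vs_one)[abs(t_vs_one) < 2]    # sectors not statistically different from 1 at roughly 95%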

You can also set up the regression with the total number of 2002 jobs as an offset, so that each coefficient is estimated relative to 1.

summary(
  lm(jobs_plus10 ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
       jobs.53 + jobs.54 +jobs.56  + jobs.61 + jobs.62 + jobs.71  + jobs.72 + 
       jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22   + jobs.55, zbp02, offset= 1.00*jobs.tot)
)
## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + 
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + 
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + 
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02, 
##     offset = 1 * jobs.tot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262    
## jobs.48     -0.33567    0.10652  -3.151 0.001650 ** 
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *  
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608    
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 ** 
## jobs.95     -0.25153    0.19179  -1.311 0.189836    
## jobs.99      4.25699    4.88099   0.872 0.383223    
## jobs.21      0.13877    0.36736   0.378 0.705657    
## jobs.11      2.63400    2.12568   1.239 0.215437    
## jobs.22     -0.11635    0.22376  -0.520 0.603127    
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711 
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16

Each job in Real Estate in a Zip Code in 2002 (sector 53) correlates with another 1.82 jobs in 2012. Better yet, use the types of jobs in 2002 to predict the net change in jobs from 2002 to 2012.

reg2 <- lm(jobs_plus10-jobs.tot ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
       jobs.53 + jobs.54 +jobs.56  + jobs.61 + jobs.62 + jobs.71  + jobs.72 + 
       jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22   + jobs.55, zbp02)
summary(reg2)
## 
## Call:
## lm(formula = jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 + 
##     jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + 
##     jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + 
##     jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, 
##     data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262    
## jobs.48     -0.33567    0.10652  -3.151 0.001650 ** 
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *  
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608    
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 ** 
## jobs.95     -0.25153    0.19179  -1.311 0.189836    
## jobs.99      4.25699    4.88099   0.872 0.383223    
## jobs.21      0.13877    0.36736   0.378 0.705657    
## jobs.11      2.63400    2.12568   1.239 0.215437    
## jobs.22     -0.11635    0.22376  -0.520 0.603127    
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.2035, Adjusted R-squared:  0.1954 
## F-statistic:  25.1 on 21 and 2063 DF,  p-value: < 2.2e-16

Note how all the parameter estimates are the same, but there is now a much lower and much more useful R-squared. Instead of showing that Zip Codes with a lot of jobs in 2002 tend to have a lot of jobs in 2012, the model now shows that the number and types of jobs in each Zip Code can be used to predict whether a Zip Code gained or lost jobs. Instead of explaining 97% of the variation in the relationship, the model now claims to explain about 20%.

The distribution of this dependent variable also looks a lot more normal, with most Zip Codes not having changed very much.

plot(density(zbp02$jobs_plus10- zbp02$jobs.tot, na.rm=T))


The residual plot also looks a bit more homoscedastic.

plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)


EXERCISE

  1. Try to make the most parsimonious model that does a good job of predicting the change in jobs from 2002 to 2012. Use a full model vs. reduced model ANOVA test to compare this model to the fully specified model (with all job types) and describe the results.

  2. Try predicting the percent change in jobs instead of the absolute change. Compare and contrast the two models. Which model do you prefer and why? Hint: use the identity function to generate the percent change variable inside of your regression: I(jobs_plus10/jobs.tot-1). Input industry codes as a percentage of total jobs: I(jobs.23/jobs.tot).

  3. What types of jobs are the best indicators of job growth and job losses? Describe the quantitative relationship as expressed by the two models in question 2. Does this contradict or concur with your expectations?

  4. Look at the differences in jobs in sector 99 across the two time periods. What do you think is happening?


Planning Methods: Predicting Census Tract change





In this lab, you will use Philadelphia's 2000 Census Tract data to make predictions about the 2010 and 2020 population, and to understand what types of Census Tracts have been growing or shrinking. Download the data here (password is PennDesign). You can load it by double clicking on the file or using the load command.

Start by summarizing the data:

summary(dat2000)
##     Geo_FIPS          pop_change         pop_plus10        pop       
##  Min.   :4.21e+10   Min.   :-1310.00   Min.   :   0   Min.   :    0  
##  1st Qu.:4.21e+10   1st Qu.: -394.50   1st Qu.:2721   1st Qu.: 2329  
##  Median :4.21e+10   Median :  -85.00   Median :3774   Median : 4008  
##  Mean   :4.21e+10   Mean   :  -43.79   Mean   :3919   Mean   : 3983  
##  3rd Qu.:4.21e+10   3rd Qu.:  309.50   3rd Qu.:4987   3rd Qu.: 5630  
##  Max.   :4.21e+10   Max.   : 2311.00   Max.   :9022   Max.   :10479  
##                     NA's   :215        NA's   :106    NA's   :109    
##   pop_density     square_miles     pop_white_nonhispanic   pop_black   
##  Min.   :    0   Min.   :0.06491   Min.   :   0          Min.   :   0  
##  1st Qu.: 8462   1st Qu.:0.18056   1st Qu.: 114          1st Qu.: 166  
##  Median :15692   Median :0.25657   Median : 821          Median : 903  
##  Mean   :16120   Mean   :0.35457   Mean   :1695          Mean   :1721  
##  3rd Qu.:22543   3rd Qu.:0.37694   3rd Qu.:2901          3rd Qu.:2959  
##  Max.   :54824   Max.   :3.26893   Max.   :8948          Max.   :8468  
##  NA's   :109     NA's   :109       NA's   :109           NA's   :109   
##    pop_asian       pop_hispanic      households   households_family
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0   
##  1st Qu.:  13.0   1st Qu.:  33.0   1st Qu.: 852   1st Qu.: 428.0   
##  Median :  46.0   Median :  89.0   Median :1502   Median : 899.0   
##  Mean   : 177.6   Mean   : 336.7   Mean   :1549   Mean   : 924.8   
##  3rd Qu.: 168.0   3rd Qu.: 230.0   3rd Qu.:2187   3rd Qu.:1323.0   
##  Max.   :2625.0   Max.   :6802.0   Max.   :6115   Max.   :2738.0   
##  NA's   :109      NA's   :109      NA's   :109    NA's   :109      
##  households_nonfamily  avg_hh_size     pop_25_plus   edu_less_highschool
##  Min.   :   0         Min.   :0.000   Min.   :   0   Min.   :   0.0     
##  1st Qu.: 303         1st Qu.:2.220   1st Qu.:1366   1st Qu.: 286.0     
##  Median : 529         Median :2.520   Median :2422   Median : 661.0     
##  Mean   : 624         Mean   :2.424   Mean   :2536   Mean   : 729.9     
##  3rd Qu.: 796         3rd Qu.:2.760   3rd Qu.:3624   3rd Qu.:1023.0     
##  Max.   :4806         Max.   :3.600   Max.   :7155   Max.   :2733.0     
##  NA's   :109          NA's   :109     NA's   :109    NA's   :109        
##  edu_highschool   edu_collegeplus  pop_civ_employed pop_civ_unemployed
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0     Min.   :   0.0    
##  1st Qu.: 362.0   1st Qu.: 119.0   1st Qu.: 780     1st Qu.:  75.0    
##  Median : 789.0   Median : 286.0   Median :1433     Median : 158.0    
##  Mean   : 845.3   Mean   : 453.1   Mean   :1535     Mean   : 187.9    
##  3rd Qu.:1260.0   3rd Qu.: 557.0   3rd Qu.:2225     3rd Qu.: 266.0    
##  Max.   :2809.0   Max.   :5735.0   Max.   :5692     Max.   :1678.0    
##  NA's   :109      NA's   :109      NA's   :109      NA's   :109       
##  median_hh_income per_capita_income housing_vacant  housing_occupied
##  Min.   :     0   Min.   :     0    Min.   :  0.0   Min.   :   0    
##  1st Qu.: 21500   1st Qu.: 10817    1st Qu.: 55.0   1st Qu.: 852    
##  Median : 29839   Median : 15615    Median :149.0   Median :1502    
##  Mean   : 32002   Mean   : 17604    Mean   :188.7   Mean   :1549    
##  3rd Qu.: 39528   3rd Qu.: 20613    3rd Qu.:279.0   3rd Qu.:2187    
##  Max.   :200001   Max.   :109633    Max.   :834.0   Max.   :6115    
##  NA's   :109      NA's   :109       NA's   :109     NA's   :109     
##  housing_median_age percent_poverty  
##  Min.   :   0       Min.   :0.00000  
##  1st Qu.:1939       1st Qu.:0.09552  
##  Median :1942       Median :0.18480  
##  Mean   :1885       Mean   :0.21014  
##  3rd Qu.:1952       3rd Qu.:0.30020  
##  Max.   :1983       Max.   :0.84337  
##  NA's   :109        NA's   :120

Notice how there are a number of missing data entries. This is because Census Tract numbers and borders change over time; sometimes combining, sometimes splitting. We have data for 106 2000 Census Tracts that are missing in 2010, and 109 2010 Census Tracts that are missing in 2000. There are several ways to combine Census Tracts to make them consistent over time. This website provides tools to estimate Census data from 1970 to 2000 within the 2010 boundaries.

For the purposes of this exercise, we will exclude the non-matching tracts. Look at the relationship between tract population in 2000 and in 2010.

plot(dat2000$pop, dat2000$pop_plus10)


The density and histogram functions are convenient ways to understand the distribution of the data (in this case the 2010 population in each Census Tract). The population is somewhat normally distributed and ranges from around zero to 10,000.

plot(density(dat2000$pop_plus10, na.rm = T))


hist(dat2000$pop_plus10)


Before proceeding, compare the Census Tracts that are consistent between 2000 and 2010 with those that are not. As you would expect, low population and high population Census Tracts were most likely to change boundaries.

hist(dat2000$pop[is.na(dat2000$pop_plus10)==T])


hist(dat2000$pop[is.na(dat2000$pop_plus10)==F])


How well can you predict the 2010 Census Tract population from the 2000 Census Tract population?

reg1 <- lm(pop_plus10~pop, dat2000)
summary(reg1)
## 
## Call:
## lm(formula = pop_plus10 ~ pop, data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1271.96  -357.73   -45.54   355.51  2334.68 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -92.07364   88.72101  -1.038      0.3    
## pop           1.01166    0.01986  50.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 552 on 273 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9045 
## F-statistic:  2595 on 1 and 273 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that the 2000 population explains about 90% of the variation in the 2010 population. The following shows how well the regression line fits the data.

plot(dat2000$pop, dat2000$pop_plus10)
abline(reg1)


Furthermore, each person in 2000 correlates with 1.01 persons in 2010, though with a standard error of 0.02, this is hardly a ringing endorsement of population growth. Plotting the population change from 2000 to 2010 shows that almost the same number of Census Tracts gained and lost population.

plot(density(dat2000$pop_change, na.rm = T))


Plotting the error term shows that the regression line fits the low population tracts a bit better than the high population tracts. (Note how closely this tracks with the population change number, since the prediction of the 2010 population relies entirely on the 2000 population.)

plot(density(resid(reg1)))


Plotting the predicted population against the error terms reveals some additional patterns to help understand the regression. Ideally this pattern would look much more random with points equally and randomly distributed around the zero line.

plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)


One way to try to deal with underpredicting the more populous Census Tracts is to introduce a quadratic term.

reg2 <- lm(pop_plus10~pop + I(pop^2), dat2000)
summary(reg2)
## 
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2), data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1228.67  -338.37   -48.65   365.88  2334.17 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.354e+02  1.750e+02   2.488 0.013460 *  
## pop         7.114e-01  8.863e-02   8.027 3.02e-14 ***
## I(pop^2)    3.587e-05  1.033e-05   3.473 0.000599 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 541.2 on 272 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.9082 
## F-statistic:  1356 on 2 and 272 DF,  p-value: < 2.2e-16

Does this fit the data better? A full model vs. reduced model ANOVA test suggests that it does. So does the small improvement in the adjusted R-squared.

anova(reg1,reg2)
## Analysis of Variance Table
## 
## Model 1: pop_plus10 ~ pop
## Model 2: pop_plus10 ~ pop + I(pop^2)
##   Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
## 1    273 83185653                                 
## 2    272 79653918  1   3531735 12.06 0.0005991 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual plot looks better…slightly…perhaps…

plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)


Try testing some more interesting hypotheses. For example, do you expect higher income neighborhoods to grow more or less quickly than lower income neighborhoods? You could probably construct a reasonable theory in either direction, but my guess is that wealthier Census Tracts in Philadelphia were the most likely to grow.

reg3 <- lm(pop_plus10 ~ pop + I(pop^2)+ I(dat2000$median_hh_inc/1000), dat2000)
summary(reg3)
## 
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(dat2000$median_hh_inc/1000), 
##     data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1167.48  -361.12   -48.03   332.20  2313.85 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   2.836e+02  1.959e+02   1.448 0.148838    
## pop                           7.188e-01  8.843e-02   8.128 1.56e-14 ***
## I(pop^2)                      3.512e-05  1.030e-05   3.408 0.000753 ***
## I(dat2000$median_hh_inc/1000) 4.375e+00  2.570e+00   1.702 0.089840 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 539.3 on 271 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9098, Adjusted R-squared:  0.9088 
## F-statistic: 911.2 on 3 and 271 DF,  p-value: < 2.2e-16

Every $1,000 more in median income correlates with about 4 more people in 2010, which is not a tremendously strong relationship. Furthermore, the coefficient is not statistically different from 0 at the 95% confidence level (though it is at 90%).
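
Rather than reading significance off the stars, you can ask for the intervals directly with confint(). A quick sketch using the reg3 fit above; given the p-value of about 0.09, the income interval should straddle zero at 95% but not at 90%:

confint(reg3, level = 0.95) # 95% intervals; the income coefficient's interval should include 0
confint(reg3, level = 0.90) # 90% intervals; here the income interval should exclude 0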

Income may also be correlated with other variables associated with population gain or loss, such as race.

reg4 <-  lm(pop_plus10~ pop + I(pop^2)+ I(median_hh_income/1000) + I(100*pop_black/pop), dat2000)
summary(reg4)
## 
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(median_hh_income/1000) + 
##     I(100 * pop_black/pop), data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1306.97  -327.63    -4.05   289.08  2213.82 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.204e+02  2.224e+02   2.789 0.005667 ** 
## pop                       7.208e-01  9.168e-02   7.862 9.19e-14 ***
## I(pop^2)                  3.365e-05  1.062e-05   3.170 0.001703 ** 
## I(median_hh_income/1000) -4.582e-01  2.871e+00  -0.160 0.873310    
## I(100 * pop_black/pop)   -3.354e+00  9.524e-01  -3.521 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 528.9 on 269 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.9122, Adjusted R-squared:  0.9109 
## F-statistic: 698.6 on 4 and 269 DF,  p-value: < 2.2e-16

Each additional percentage point of black residents in a Census Tract in 2000 correlates with about 3 fewer people in 2010. Using this regression, let's apply the predict function to estimate Philadelphia's 2010 population:

sum(predict(reg4))
## [1] 1126700

This is quite a bit lower than the actual 2010 population count:

sum(dat2000$pop_plus10, na.rm=T)
## [1] 1504950

The regression, however, drops the two hundred or so Census Tracts that changed between 2000 and 2010, since their populations are missing in one of the two years. Predicting over the full dataset adds them back and brings the estimate much closer to the actual population.

sum(predict(reg4, newdata = dat2000), na.rm =T)
## [1] 1543117

Finally, use a similar command to estimate the 2020 population by applying our regression equation to the 2010 Census data.

sum(predict(reg4, newdata = dat2010), na.rm =T)
## [1] 1488331

EXERCISE

  1. Despite recent population growth, the model predicts that population will decline between 2010 and 2020. Do you agree with this prediction? What do you think is causing it?

  2. Replace median income with a measure of the proportion of the population over 25 with a college degree or higher (edu_collegeplus). Do you expect neighborhoods with a higher proportion of well-educated residents to have grown or lost population between 2000 and 2010? Describe the results of the model in relation to your expectations.

  3. Does this change improve the model? Be sure to consider statistical fit, residual plots, and theory in providing your answer.

  4. Develop a different set of models predicting population change (pop_change) instead of the population ten years later (pop_plus10). What factors do you expect to be good predictors of total population change? Describe the model, model fits, and residual plots.


Planning Methods: working with Zip Business Patterns data

In this lab, you will download data on jobs by industry from the Zip Business Patterns (ZBP) data. You can find the data on the US Census website. Be sure to download the data as a CSV file. For now, download the full 2002 and 2012 data. The links are: 2002 & 2012

You want the Complete ZIP Code Industry Detail File.

Be sure to read the descriptions of the data and data collection methods on the website. The FAQ may also prove useful.

The objectives of the lab are to learn to work with ZBP data using the midpoint method and to learn a few data-processing commands in R. Specifically, we will take subsets of data and do a simple data merge. Although this may seem like a complicated way to do some simple file management and calculations, the payoffs are big. By the end of the lab, you should be able to create midpoint job estimates for all US Zip Codes for any available NAICS code, or subsets of NAICS codes, with just a few changes to your code.

To load the data:
1) Set your working directory to wherever you saved your files (if needed, see the sketch after the code below). Note that the data are stored a bit differently across the two files.

2) Run the following two lines of code:

zbp02 <- read.csv("zbp02detail.txt")
zbp12 <- read.csv("zbp12detail.txt")
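
If read.csv() complains that it cannot find the files, check where R is looking. The path in the sketch below is hypothetical; point it at whatever folder holds your downloaded files.

getwd()              # shows the folder R is currently reading from
# setwd("C:/data/zbp") # hypothetical path: replace with your own folder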

This may take a moment to load if you're working on a home computer. To make sure that your data loaded properly, check that the data frames have the same dimensions as below,

dim(zbp02)
## [1] 3251827      12
dim(zbp12)
## [1] 3179204      12

look the same,

head(zbp02)
##   ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499 N500_999
## 1 501 ------  11    9    1      0      1      0        0        0        0
## 2 501 23----   1    1    0      0      0      0        0        0        0
## 3 501 235310   1    1    0      0      0      0        0        0        0
## 4 501 42----   3    3    0      0      0      0        0        0        0
## 5 501 421990   1    1    0      0      0      0        0        0        0
## 6 501 422920   1    1    0      0      0      0        0        0        0
##   N1000
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
head(zbp12)
##    zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1  501 ------   2    2    0      0      0      0        0        0
## 2  501 81----   2    2    0      0      0      0        0        0
## 3  501 813110   2    2    0      0      0      0        0        0
## 4 1001 ------ 453  206   77     75     55     22       14        3
## 5 1001 22----   2    1    0      0      1      0        0        0
## 6 1001 221310   1    1    0      0      0      0        0        0
##   n500_999 n1000
## 1        0     0
## 2        0     0
## 3        0     0
## 4        1     0
## 5        0     0
## 6        0     0

and have the same column types. Most importantly, only NAICS should be a character or factor.

str(zbp02)
## 'data.frame':    3251827 obs. of  12 variables:
##  $ ZIP     : int  501 501 501 501 501 501 501 501 501 501 ...
##  $ NAICS   : Factor w/ 1104 levels "------","11----",..: 1 59 73 562 599 627 631 632 655 859 ...
##  $ EST     : int  11 1 1 3 1 1 1 1 1 2 ...
##  $ N1_4    : int  9 1 1 3 1 1 1 1 1 2 ...
##  $ N5_9    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N10_19  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N20_49  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N50_99  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N100_249: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N250_499: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N500_999: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
str(zbp12)
## 'data.frame':    3179204 obs. of  12 variables:
##  $ zip     : int  501 501 501 1001 1001 1001 1001 1001 1001 1001 ...
##  $ naics   : Factor w/ 999 levels "------","11----",..: 1 950 986 1 48 60 61 63 65 67 ...
##  $ est     : int  2 2 2 453 2 1 1 47 1 2 ...
##  $ n1_4    : int  2 2 2 206 1 1 0 30 1 1 ...
##  $ n5_9    : int  0 0 0 77 0 0 0 7 0 0 ...
##  $ n10_19  : int  0 0 0 75 0 0 0 3 0 0 ...
##  $ n20_49  : int  0 0 0 55 1 0 1 7 0 1 ...
##  $ n50_99  : int  0 0 0 22 0 0 0 0 0 0 ...
##  $ n100_249: int  0 0 0 14 0 0 0 0 0 0 ...
##  $ n250_499: int  0 0 0 3 0 0 0 0 0 0 ...
##  $ n500_999: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ n1000   : int  0 0 0 0 0 0 0 0 0 0 ...

If, for some reason, one of the other variables did not load as you want, you can change it like so:

zbp02$EST <- as.numeric(zbp02$EST) ##as.integer() is also fine.

Note that the data include the same Zip Codes multiple times; these are the counts for different sectors. To better understand this structure, look at the first Zip Code in the 2002 data:

subset(zbp02, ZIP == 501)
##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 3  501 235310   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 5  501 421990   1    1    0      0      0      0        0        0
## 6  501 422920   1    1    0      0      0      0        0        0
## 7  501 422990   1    1    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 9  501 445110   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 11 501 541110   1    1    0      0      0      0        0        0
## 12 501 541211   1    1    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
## 14 501 621111   1    1    0      0      0      0        0        0
## 15 501 81----   2    0    1      0      1      0        0        0
## 16 501 813110   2    0    1      0      1      0        0        0
## 17 501 99----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 3         0     0
## 4         0     0
## 5         0     0
## 6         0     0
## 7         0     0
## 8         0     0
## 9         0     0
## 10        0     0
## 11        0     0
## 12        0     0
## 13        0     0
## 14        0     0
## 15        0     0
## 16        0     0
## 17        0     0

The NAICS column includes the total number of establishments across all sectors [NAICS == "------"], counts by two-digit industry code, and counts by six-digit code. EST is the total number of establishments, and the columns to the right break establishments out by employee-size class (1 to 4, 5 to 9, etc.). To check that the numbers add up, first pull the all-sector total.

zbp02$EST[zbp02$NAICS == "------" & zbp02$ZIP == 501] 
## [1] 11

And check it against the sum over the individual sector rows.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501] )
## [1] 21

Whoops! What happened? The data include both two-digit and six-digit counts, so summing every row double counts. Try including only NAICS codes that end in "----" using the substring command.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501 & substring(zbp02$NAICS,3,6)== "----"] )
## [1] 11
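
If substring() is new to you, it simply extracts characters by position; here it pulls characters 3 through 6 of the NAICS code, so only rows whose codes end in four dashes survive the filter. Two standalone examples:

substring("23----", 3, 6) # returns "----", so two-digit rows pass the filter
substring("235310", 3, 6) # returns "5310", so six-digit rows are dropped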

To clean up the data a bit, create a new file that includes only the 2-digit codes and totals. (If you want to save a more complete file for future use, you can skip this step.) Note that no matter what you do, the original CSV files remain unchanged unless you specifically overwrite them, so you can run and rerun your code without ever changing the underlying data.

zbp02 <- subset(zbp02, substring(zbp02$NAICS,3,6)=="----")
zbp12 <- subset(zbp12, substring(zbp12$naics,3,6)=="----")

Again, check your dimensions against those below to make sure that the commands worked.

dim(zbp02)
## [1] 478497     12
dim(zbp12)
## [1] 454477     12

And look at the data.

head(zbp02)
##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 4         0     0
## 8         0     0
## 10        0     0
## 13        0     0
head(zbp12)
##     zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1   501 ------   2    2    0      0      0      0        0        0
## 2   501 81----   2    2    0      0      0      0        0        0
## 4  1001 ------ 453  206   77     75     55     22       14        3
## 5  1001 22----   2    1    0      0      1      0        0        0
## 8  1001 23----  47   30    7      3      7      0        0        0
## 27 1001 31----  48   10    8      9      9      6        3        3
##    n500_999 n1000
## 1         0     0
## 2         0     0
## 4         1     0
## 5         0     0
## 8         0     0
## 27        0     0

Now, use the midpoint method to estimate the number of jobs by establishment size: assume each establishment has the midpoint number of employees for its size class, so the average size of a 1-to-4 employee establishment is 2.5. For establishments with 1,000 or more employees, the midpoint method sets the number of employees to 1,000. (Do the 2002 data in steps and then 2012 all at once.) The within() command is one of several nifty ways to avoid writing zbp02$ before each variable.

zbp02 <- within(zbp02, jobs <- N1_4 *2.5 + N5_9 *7 + N10_19 * 14.5 + 
    N20_49 * 34.5 +  N50_99 *74.5 + 
    N100_249 * 174.5 + N250_499 *374.5 + 
    N500_999 * 749.5 +  N1000 *1000)
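
As a sanity check, work the formula by hand for the first row of the data (ZIP 501, all sectors): 9 establishments of 1 to 4 employees, 1 of 5 to 9, and 1 of 20 to 49 gives 9*2.5 + 1*7 + 1*34.5 = 64 estimated jobs. The new column should agree:

zbp02$jobs[zbp02$ZIP == 501 & zbp02$NAICS == "------"] # should return 64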

For this lab, drop all non-PA Zip Codes to conserve space. (You may want to go back and keep the other states at some point.) PA's Zip Codes run from 15000 to 19999; the subset below starts at 14000, matching the 2012 step later in the lab, so a few western New York Zip Codes (like 14001, which appears in the output below) remain in the file. Note that the as.numeric() wrapper ensures the Zip Codes are compared as numbers.

zbp02<- subset(zbp02, as.numeric(zbp02$ZIP) 
                    >= 14000 & as.numeric(zbp02$ZIP) <= 19999) 

And get rid of the variables that are not needed.

keepers <- c("ZIP", "NAICS", "jobs")
zbp02 <- zbp02[keepers]

And finally clean up the labeling of the NAICS codes.

zbp02$NAICS <- substring(zbp02$NAICS,1,2)
zbp02$NAICS[zbp02$NAICS=="--"] <- "tot"
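
A quick tabulation (a sketch; your exact counts will depend on your download) shows how many Zip Code rows report each sector:

table(zbp02$NAICS) # number of Zip Code rows per 2-digit sector code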

Have a look

head(zbp02)
##          ZIP NAICS   jobs
## 452595 14001   tot 3180.5
## 452596 14001    23  224.0
## 452610 14001    31 1383.0
## 452629 14001    42  820.5
## 452638 14001    44  189.0
## 452654 14001    48   20.0

This type of data is often referred to as long data: there are multiple entries for each Zip Code, and the NAICS column indicates the industry. For future work, it will be a little easier to have the data in wide format (one row per Zip Code), which you can do with the reshape() command.

zbp02 <- reshape(zbp02, 
             timevar = "NAICS",
             idvar = c("ZIP"),
             direction = "wide")

Have a look

head(zbp02)
##          ZIP jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48 jobs.51
## 452595 14001   3180.5   224.0  1383.0   820.5   189.0    20.0       5
## 452714 14003     12.0      NA      NA      NA     7.0      NA      NA
## 452721 14004   2270.5   292.0   467.0    39.0   527.5    60.5      14
## 452854 14005    167.0     7.5    34.5     9.5    31.0    37.0      NA
## 452887 14006   1898.5    92.0   489.5    71.5   295.0    96.5      19
## 452991 14008     51.5    10.0      NA      NA    12.5      NA      NA
##        jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 452595   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 452714      NA      NA      NA      NA      NA      NA      NA     2.5
## 452721    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 452854     2.5      NA      NA     2.5      NA     2.5      NA      NA
## 452887    85.0       7    31.5    22.0      NA    97.0    39.5   440.0
## 452991      NA      NA      NA     2.5      NA    14.5      NA     7.0
##        jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 452595    56.0      49     7.5      NA      NA      NA      NA
## 452714     2.5      NA      NA      NA      NA      NA      NA
## 452721   169.5      NA      NA     9.5      NA      NA      NA
## 452854    19.0      NA      NA    14.0       7      NA      NA
## 452887   106.0      NA     7.0      NA      NA      NA      NA
## 452991     5.0      NA      NA      NA      NA      NA      NA

The default setting fills missing entries with NA, but you should convert these to zero, since an NA here means no reported establishments rather than an unknown number of establishments.

Note that zbp02[is.na(zbp02)] <- 0 uses a logical test to replace every value in the dataset that is currently NA with zero. (Appending == T is harmless but redundant, since is.na() already returns TRUE/FALSE.) If you only wanted to do this for job totals, you could write: zbp02$jobs.tot[is.na(zbp02$jobs.tot)] <- 0

zbp02[is.na(zbp02)] <- 0

The following goes through the same steps for the 2012 data. Try to go through it slowly and understand each step of the process. Use ?command to get a description of functions if needed. Use internet searches as well.

zbp12<- subset(zbp12, as.numeric(zbp12$zip) 
               >= 14000 & as.numeric(zbp12$zip) <= 19999) 

#calculate job totals
zbp12 <- within(zbp12, jobs <- n1_4 *2.5 + n5_9 *7 + n10_19 * 14.5 + 
                  n20_49 * 34.5 +  n50_99 *74.5 + 
                  n100_249 * 174.5 + n250_499 *374.5 + 
                  n500_999 * 749.5 +  n1000 *1000)

zbp12$ZIP <- zbp12$zip ##to make the names consistent with the 02 data
zbp12$NAICS <- zbp12$naics

zbp12 <- zbp12[keepers] #keepers should still exist from above
zbp12$NAICS <- substring(zbp12$NAICS,1,2)
zbp12$NAICS[zbp12$NAICS=="--"] <- "tot"

#long to wide
zbp12 <- reshape(zbp12, 
                 timevar = "NAICS",
                 idvar = c("ZIP"),
                 direction = "wide")

#switch NA to zero
zbp12[is.na(zbp12)] <- 0

Finally, add the total number of jobs from 2012 to the 2002 dataset to use for making predictions in a later lab. Build the merge through a new dataset so that zbp12 itself does not change.

zipmerge <- zbp12[c("ZIP", "jobs.tot")]
zipmerge$jobs_plus10 <- zipmerge$jobs.tot
zipmerge <- zipmerge[c("ZIP", "jobs_plus10")]
zbp02 <- merge(zipmerge, zbp02, by ="ZIP", all.x = T, all.y =T)
rm(zipmerge, keepers)
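
Before moving on, a couple of optional checks, sketched here, can confirm that the merge behaved as expected:

dim(zbp02)                    # still one row per Zip Code, now with jobs_plus10 as an extra column
sum(is.na(zbp02$jobs_plus10)) # Zip Codes with no 2012 match, such as 14003 below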

Look at the file. This time, keep the NAs, since they genuinely represent missing data: for example, there was no Zip Code 14003 in 2012.

head(zbp02)
##     ZIP jobs_plus10 jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48
## 1 14001      3015.0   3180.5   224.0  1383.0   820.5   189.0    20.0
## 2 14003          NA     12.0     0.0     0.0     0.0     7.0     0.0
## 3 14004      2368.0   2270.5   292.0   467.0    39.0   527.5    60.5
## 4 14005       182.5    167.0     7.5    34.5     9.5    31.0    37.0
## 5 14006      1370.0   1898.5    92.0   489.5    71.5   295.0    96.5
## 6 14008        73.0     51.5    10.0     0.0     0.0    12.5     0.0
##   jobs.51 jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 1       5   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 2       0     0.0       0     0.0     0.0     0.0     0.0     0.0     2.5
## 3      14    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 4       0     2.5       0     0.0     2.5     0.0     2.5     0.0     0.0
## 5      19    85.0       7    31.5    22.0     0.0    97.0    39.5   440.0
## 6       0     0.0       0     0.0     2.5     0.0    14.5     0.0     7.0
##   jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 1    56.0      49     7.5     0.0       0       0       0
## 2     2.5       0     0.0     0.0       0       0       0
## 3   169.5       0     0.0     9.5       0       0       0
## 4    19.0       0     0.0    14.0       7       0       0
## 5   106.0       0     7.0     0.0       0       0       0
## 6     5.0       0     0.0     0.0       0       0       0

If desired, save this as an R data file. In the future, you can double-click this file to open the 2002 and 2012 PA job estimates by 2-digit NAICS.

save(zbp02, zbp12,file = "zip_jobs.Rda")
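
Loading the file later, whether by double-clicking it or with load(), restores both data frames under their original names:

load("zip_jobs.Rda") # restores zbp02 and zbp12 into the workspace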

As you process your own data in the future, you will learn a number of different commands and packages to help you along the way. There are always dozens of ways to do the same thing. If you are having trouble getting the program to do what you want, it often makes sense to think about the problem differently. For example, maybe you want a nice summary table of jobs by industry and by state, but you are having trouble getting a single command to produce it. Try instead doing each operation separately, and then combine the results, as in the sketch below.
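
Here is a sketch of that divide-and-combine approach using the wide files built above: total the jobs columns for each year separately, then bind the two sets of totals into one table. (This tabulates by industry only; adding state would first require mapping Zip Codes to states.)

tot02 <- colSums(zbp02[grep("^jobs\\.", names(zbp02))], na.rm = TRUE) # 2002 jobs by 2-digit sector
tot12 <- colSums(zbp12[grep("^jobs\\.", names(zbp12))], na.rm = TRUE) # 2012 jobs by 2-digit sector
cbind(`2002` = tot02, `2012` = tot12[names(tot02)]) # combine; sectors absent in 2012 show NA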

EXERCISE

  1. Look up the 2-digit NAICS code for Manufacturing. Which Zip Codes had the largest gains and losses (both total and proportional) in each state? Now make a summary table by state that includes total estimated manufacturing jobs in 2002 and 2012 and the total and proportional change.
  2. Make the same table as above but only include Zip Codes that have 100 or more jobs. Describe differences between the tables.
  3. Make a table of the number of establishments that have 1000+ employees and the total number of establishments by 2-digit NAICS in 2012. (Hint: Either create two new datasets, one for 1000+ establishments by 2-digit NAICS and one for all establishments by 2-digit NAICS, and then merge them; or use the sum() command on subsets of the data.) Which industries have the highest proportion of 1000+ establishments?
  4. Make the same table as in question 3, but list the estimated number of jobs instead of establishments. Describe similarities and differences between the two tables.

