Planning Methods: Predicting changes in jobs by Zip Code





For this lab, load the Zip Code data from the previous lab. This should include the following files with the following dimensions. You can also download the data here.

dim(zbp02)
## [1] 2207   24
dim(zbp12)
## [1] 2125   22

The purpose of this lab is to reinforce the regression and visualization skills learned in lab 2 and lab 3 using the Zip Business Patterns data.

Start by summarizing the 2002 data, which you will use to predict job changes in 2012.

summary(zbp02)
##       ZIP         jobs_plus10         jobs.tot           jobs.23      
##  Min.   :15001   Min.   :    2.5   Min.   :    2.50   Min.   :   0.0  
##  1st Qu.:16010   1st Qu.:   71.0   1st Qu.:   72.25   1st Qu.:   2.5  
##  Median :17235   Median :  411.0   Median :  387.50   Median :  22.0  
##  Mean   :17318   Mean   : 2654.3   Mean   : 2597.33   Mean   : 138.6  
##  3rd Qu.:18618   3rd Qu.: 2327.0   3rd Qu.: 2234.25   3rd Qu.: 126.5  
##  Max.   :19980   Max.   :79883.5   Max.   :77848.50   Max.   :4476.5  
##                  NA's   :82        NA's   :40         NA's   :40      
##     jobs.31           jobs.42          jobs.44           jobs.48       
##  Min.   :    0.0   Min.   :   0.0   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:    0.0   1st Qu.:   0.0   1st Qu.:    2.5   1st Qu.:   0.00  
##  Median :   34.5   Median :   7.5   Median :   31.0   Median :   7.00  
##  Mean   :  357.2   Mean   : 128.8   Mean   :  377.0   Mean   :  73.68  
##  3rd Qu.:  341.5   3rd Qu.:  74.5   3rd Qu.:  273.2   3rd Qu.:  47.25  
##  Max.   :10076.0   Max.   :5025.5   Max.   :10491.0   Max.   :5435.00  
##  NA's   :40        NA's   :40       NA's   :40        NA's   :40       
##     jobs.51           jobs.52            jobs.53           jobs.54       
##  Min.   :   0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.0  
##  1st Qu.:   0.00   1st Qu.:    0.00   1st Qu.:   0.00   1st Qu.:    0.0  
##  Median :   0.00   Median :    7.00   Median :   0.00   Median :    5.0  
##  Mean   :  72.96   Mean   :  155.58   Mean   :  39.46   Mean   :  158.4  
##  3rd Qu.:  14.50   3rd Qu.:   50.25   3rd Qu.:  17.50   3rd Qu.:   57.0  
##  Max.   :4203.50   Max.   :15671.50   Max.   :1739.50   Max.   :22421.0  
##  NA's   :40        NA's   :40         NA's   :40        NA's   :40       
##     jobs.56          jobs.61           jobs.62          jobs.71       
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.00  
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.:   0.0   1st Qu.:   0.00  
##  Median :   5.0   Median :   0.00   Median :  16.5   Median :   0.00  
##  Mean   : 145.6   Mean   :  66.51   Mean   : 362.8   Mean   :  41.55  
##  3rd Qu.:  56.0   3rd Qu.:   8.50   3rd Qu.: 208.2   3rd Qu.:  17.00  
##  Max.   :9158.5   Max.   :3734.50   Max.   :8389.5   Max.   :2858.00  
##  NA's   :40       NA's   :40        NA's   :40       NA's   :40       
##     jobs.72          jobs.81          jobs.95           jobs.99       
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:   0.0   1st Qu.:   2.5   1st Qu.:   0.00   1st Qu.:  0.000  
##  Median :  19.5   Median :  19.5   Median :   0.00   Median :  0.000  
##  Mean   : 213.5   Mean   : 139.4   Mean   :  26.79   Mean   :  1.355  
##  3rd Qu.: 154.2   3rd Qu.: 114.8   3rd Qu.:   0.00   3rd Qu.:  0.000  
##  Max.   :5445.5   Max.   :4624.5   Max.   :1961.00   Max.   :214.500  
##  NA's   :40       NA's   :40       NA's   :40        NA's   :40       
##     jobs.21            jobs.11          jobs.22           jobs.55      
##  Min.   :   0.000   Min.   :  0.00   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.000   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:   0.0  
##  Median :   0.000   Median :  0.00   Median :   0.00   Median :   0.0  
##  Mean   :   8.461   Mean   :  2.05   Mean   :  18.45   Mean   :  69.4  
##  3rd Qu.:   0.000   3rd Qu.:  0.00   3rd Qu.:   0.00   3rd Qu.:   2.5  
##  Max.   :1538.500   Max.   :194.00   Max.   :1785.50   Max.   :6602.0  
##  NA's   :40         NA's   :40       NA's   :40        NA's   :40

As with the census tract data, not all of the Zip Codes match across the two years, resulting in missing data.
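You can count the affected Zip Codes directly; a quick check, whose result matches the NA count in the summary above.

sum(is.na(zbp02$jobs_plus10))
## [1] 82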

Plot the relationship between the number of jobs in 2002 and 2012.

plot(zbp02$jobs.tot, zbp02$jobs_plus10)


And look at the distribution of jobs across Zip Codes. Most Zip Codes have very few jobs, with a long tail of job-rich Zip Codes.

plot(density(zbp02$jobs_plus10, na.rm = T))


hist(zbp02$jobs_plus10)

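Because of that long tail, the distribution is easier to read on a log scale. A quick sketch:

plot(density(log10(zbp02$jobs_plus10), na.rm = TRUE))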

Look at a specific Zip Code (19104).

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP == 19104], zbp02$jobs.tot[zbp02$ZIP == 19104], col = "red")


And all Philadelphia Zip Codes, plus a few others. These Zip Codes do not look particularly different from others in PA.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP > 19018 & zbp02$ZIP < 19256],
       zbp02$jobs.tot[zbp02$ZIP > 19018 & zbp02$ZIP < 19256], col = "red")


Now, predict the 2012 jobs as a function of the number of jobs in 2002. The regression line fits the data almost perfectly.

reg1 <- lm(jobs_plus10 ~ jobs.tot, zbp02) 
summary(reg1)
## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.tot, data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7887.2  -158.2   -82.6    26.7 10332.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.743743  26.051260   3.176  0.00151 ** 
## jobs.tot     0.967799   0.004028 240.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1081 on 2083 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9652, Adjusted R-squared:  0.9652 
## F-statistic: 5.774e+04 on 1 and 2083 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.

plot(zbp02$jobs.tot, zbp02$jobs_plus10, col = "tan")
abline(reg1)


It also looks like there might be some systematic differences in prediction quality by the number of jobs.

plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)


Try to see which types of jobs predict job losses or job gains.

reg_full <- lm(jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + jobs.48 +
                 jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + jobs.61 +
                 jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + jobs.99 +
                 jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
summary(reg_full)
## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + 
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + 
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + 
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      1.48264    0.11445  12.954  < 2e-16 ***
## jobs.31      0.86957    0.03842  22.633  < 2e-16 ***
## jobs.42      1.75718    0.11463  15.329  < 2e-16 ***
## jobs.44      0.90575    0.05998  15.100  < 2e-16 ***
## jobs.48      0.66433    0.10652   6.237 5.41e-10 ***
## jobs.51      0.68900    0.13370   5.153 2.80e-07 ***
## jobs.52      0.59305    0.06731   8.811  < 2e-16 ***
## jobs.53      2.82325    0.31606   8.933  < 2e-16 ***
## jobs.54      1.51710    0.07039  21.552  < 2e-16 ***
## jobs.56      0.17040    0.08420   2.024 0.043112 *  
## jobs.61      1.47967    0.11901  12.433  < 2e-16 ***
## jobs.62      0.85000    0.04409  19.281  < 2e-16 ***
## jobs.71      0.75343    0.20036   3.760 0.000174 ***
## jobs.72      1.52875    0.12252  12.478  < 2e-16 ***
## jobs.81      0.44943    0.18831   2.387 0.017095 *  
## jobs.95      0.74847    0.19179   3.903 9.82e-05 ***
## jobs.99      5.25699    4.88099   1.077 0.281591    
## jobs.21      1.13877    0.36736   3.100 0.001962 ** 
## jobs.11      3.63400    2.12568   1.710 0.087495 .  
## jobs.22      0.88365    0.22376   3.949 8.11e-05 ***
## jobs.55      0.11145    0.11360   0.981 0.326664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711 
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16

Most of the parameter estimates are statistically different from zero with a high degree of confidence, but that is not a very useful finding. You really want to know whether job types are statistically different from 1 (i.e., are certain job types more or less likely to be decreasing over time?). You can roughly approximate this by looking at the coefficient estimate and the standard error. If the coefficient estimate plus or minus two standard errors crosses the number 1, then the estimate is not statistically different from one with 95% confidence.
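You can automate this rough check; a minimal sketch using the model stored above as reg_full:

# flag coefficients whose approximate 95% interval (estimate +/- 2 standard
# errors) does not cross 1
est <- coef(reg_full)[-1]             # drop the intercept
se <- sqrt(diag(vcov(reg_full)))[-1]  # standard errors, in the same order
names(est)[abs(est - 1) > 2 * se]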

You can also set up the regression with an offset to compare each coefficient estimate against the total number of jobs in 2002.

summary(
  lm(jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + jobs.48 + jobs.51 +
       jobs.52 + jobs.53 + jobs.54 + jobs.56 + jobs.61 + jobs.62 + jobs.71 +
       jobs.72 + jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 +
       jobs.55, zbp02, offset = 1.00*jobs.tot)
)
## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + 
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + 
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + 
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02, 
##     offset = 1 * jobs.tot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262    
## jobs.48     -0.33567    0.10652  -3.151 0.001650 ** 
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *  
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608    
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 ** 
## jobs.95     -0.25153    0.19179  -1.311 0.189836    
## jobs.99      4.25699    4.88099   0.872 0.383223    
## jobs.21      0.13877    0.36736   0.378 0.705657    
## jobs.11      2.63400    2.12568   1.239 0.215437    
## jobs.22     -0.11635    0.22376  -0.520 0.603127    
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711 
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16

Each job in Real Estate (sector 53) in a Zip Code in 2002 correlates with another 1.82 jobs in 2012. Better yet, use the types of jobs in 2002 to predict the net change in jobs from 2002 to 2012.

reg2 <- lm(jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
             jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
             jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
             jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
summary(reg2)
## 
## Call:
## lm(formula = jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 + 
##     jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + 
##     jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + 
##     jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, 
##     data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262    
## jobs.48     -0.33567    0.10652  -3.151 0.001650 ** 
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *  
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608    
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 ** 
## jobs.95     -0.25153    0.19179  -1.311 0.189836    
## jobs.99      4.25699    4.88099   0.872 0.383223    
## jobs.21      0.13877    0.36736   0.378 0.705657    
## jobs.11      2.63400    2.12568   1.239 0.215437    
## jobs.22     -0.11635    0.22376  -0.520 0.603127    
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.2035, Adjusted R-squared:  0.1954 
## F-statistic:  25.1 on 21 and 2063 DF,  p-value: < 2.2e-16

Note how all the parameter estimates are the same, but there is now a much lower and much more useful R-squared. Instead of showing that Zip Codes with a lot of jobs in 2002 tend to have a lot of jobs in 2012, the model now shows that the number and types of jobs in each Zip Code can be used to predict whether a Zip Code gained or lost jobs. Instead of explaining 97% of the variation in the relationship, the model now claims to explain about 20%.
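As a quick sanity check on what the model is now doing, count how many Zip Codes it predicts to have gained versus lost jobs (TRUE indicates a predicted gain):

table(predict(reg2) > 0)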

The distribution of this dependent variable also looks a lot more normal with most Zip Codes not having changed very much.

plot(density(zbp02$jobs_plus10- zbp02$jobs.tot, na.rm=T))


The residual plot also looks a bit more homoscedastic.

plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)


EXERCISE

  1. Try to make the most parsimonious model that does a good job of predicting the change in jobs from 2002 to 2012. Use a full model vs. reduced model ANOVA test to compare this model to the fully specified model (with all job types) and describe the results.

  2. Try predicting the percent change in jobs instead of the absolute change. Compare and contrast the two models. Which model do you prefer and why? Hint: use the I() function to generate the percent change variable inside of your regression: I(jobs_plus10/jobs.tot-1). Input industry codes as a percentage of total jobs: I(jobs.23/jobs.tot).

  3. What types of jobs are the best indicators of job growth and job losses? Describe the quantitative relationship as expressed by the two models in question 2. Does this contradict or concur with your expectations?

  4. Look at the differences in jobs in sector 99 across the two time periods. What do you think is happening?


Planning Methods: Predicting Census Tract change





In this lab, you will use Philadelphia's 2000 Census Tract data to make predictions about the 2010 and 2020 population, and to understand what types of Census Tracts have been growing or shrinking. Download the data here (password is PennDesign). You can load it by double clicking on the file or using the load command.

Start by summarizing the data:

summary(dat2000)
##     Geo_FIPS          pop_change         pop_plus10        pop       
##  Min.   :4.21e+10   Min.   :-1310.00   Min.   :   0   Min.   :    0  
##  1st Qu.:4.21e+10   1st Qu.: -394.50   1st Qu.:2721   1st Qu.: 2329  
##  Median :4.21e+10   Median :  -85.00   Median :3774   Median : 4008  
##  Mean   :4.21e+10   Mean   :  -43.79   Mean   :3919   Mean   : 3983  
##  3rd Qu.:4.21e+10   3rd Qu.:  309.50   3rd Qu.:4987   3rd Qu.: 5630  
##  Max.   :4.21e+10   Max.   : 2311.00   Max.   :9022   Max.   :10479  
##                     NA's   :215        NA's   :106    NA's   :109    
##   pop_density     square_miles     pop_white_nonhispanic   pop_black   
##  Min.   :    0   Min.   :0.06491   Min.   :   0          Min.   :   0  
##  1st Qu.: 8462   1st Qu.:0.18056   1st Qu.: 114          1st Qu.: 166  
##  Median :15692   Median :0.25657   Median : 821          Median : 903  
##  Mean   :16120   Mean   :0.35457   Mean   :1695          Mean   :1721  
##  3rd Qu.:22543   3rd Qu.:0.37694   3rd Qu.:2901          3rd Qu.:2959  
##  Max.   :54824   Max.   :3.26893   Max.   :8948          Max.   :8468  
##  NA's   :109     NA's   :109       NA's   :109           NA's   :109   
##    pop_asian       pop_hispanic      households   households_family
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0   
##  1st Qu.:  13.0   1st Qu.:  33.0   1st Qu.: 852   1st Qu.: 428.0   
##  Median :  46.0   Median :  89.0   Median :1502   Median : 899.0   
##  Mean   : 177.6   Mean   : 336.7   Mean   :1549   Mean   : 924.8   
##  3rd Qu.: 168.0   3rd Qu.: 230.0   3rd Qu.:2187   3rd Qu.:1323.0   
##  Max.   :2625.0   Max.   :6802.0   Max.   :6115   Max.   :2738.0   
##  NA's   :109      NA's   :109      NA's   :109    NA's   :109      
##  households_nonfamily  avg_hh_size     pop_25_plus   edu_less_highschool
##  Min.   :   0         Min.   :0.000   Min.   :   0   Min.   :   0.0     
##  1st Qu.: 303         1st Qu.:2.220   1st Qu.:1366   1st Qu.: 286.0     
##  Median : 529         Median :2.520   Median :2422   Median : 661.0     
##  Mean   : 624         Mean   :2.424   Mean   :2536   Mean   : 729.9     
##  3rd Qu.: 796         3rd Qu.:2.760   3rd Qu.:3624   3rd Qu.:1023.0     
##  Max.   :4806         Max.   :3.600   Max.   :7155   Max.   :2733.0     
##  NA's   :109          NA's   :109     NA's   :109    NA's   :109        
##  edu_highschool   edu_collegeplus  pop_civ_employed pop_civ_unemployed
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0     Min.   :   0.0    
##  1st Qu.: 362.0   1st Qu.: 119.0   1st Qu.: 780     1st Qu.:  75.0    
##  Median : 789.0   Median : 286.0   Median :1433     Median : 158.0    
##  Mean   : 845.3   Mean   : 453.1   Mean   :1535     Mean   : 187.9    
##  3rd Qu.:1260.0   3rd Qu.: 557.0   3rd Qu.:2225     3rd Qu.: 266.0    
##  Max.   :2809.0   Max.   :5735.0   Max.   :5692     Max.   :1678.0    
##  NA's   :109      NA's   :109      NA's   :109      NA's   :109       
##  median_hh_income per_capita_income housing_vacant  housing_occupied
##  Min.   :     0   Min.   :     0    Min.   :  0.0   Min.   :   0    
##  1st Qu.: 21500   1st Qu.: 10817    1st Qu.: 55.0   1st Qu.: 852    
##  Median : 29839   Median : 15615    Median :149.0   Median :1502    
##  Mean   : 32002   Mean   : 17604    Mean   :188.7   Mean   :1549    
##  3rd Qu.: 39528   3rd Qu.: 20613    3rd Qu.:279.0   3rd Qu.:2187    
##  Max.   :200001   Max.   :109633    Max.   :834.0   Max.   :6115    
##  NA's   :109      NA's   :109       NA's   :109     NA's   :109     
##  housing_median_age percent_poverty  
##  Min.   :   0       Min.   :0.00000  
##  1st Qu.:1939       1st Qu.:0.09552  
##  Median :1942       Median :0.18480  
##  Mean   :1885       Mean   :0.21014  
##  3rd Qu.:1952       3rd Qu.:0.30020  
##  Max.   :1983       Max.   :0.84337  
##  NA's   :109        NA's   :120

Notice how there are a number of missing data entries. This is because Census Tract numbers and borders change over time; sometimes combining, sometimes splitting. We have data for 106 2000 Census Tracts that are missing in 2010, and 109 2010 Census Tracts that are missing in 2000. There are several ways to combine Census Tracts to make them consistent over time. This website provides tools to estimate Census data from 1970 to 2000 within the 2010 boundaries.

For the purposes of this exercise, we will exclude the non-matching tracts. Look at the relationship between tract population in 2000 and in 2010.

plot(dat2000$pop, dat2000$pop_plus10)


The density and histogram functions are convenient ways to understand the distribution of the data (in this case the 2010 population in each Census Tract). The population is somewhat normally distributed and ranges from around zero to 10,000.

plot(density(dat2000$pop_plus10, na.rm = T))


hist(dat2000$pop_plus10)


Before proceeding, compare the Census Tracts that are consistent between 2000 and 2010 with those that are not. As you would expect, low population and high population Census Tracts were most likely to change boundaries.

hist(dat2000$pop[is.na(dat2000$pop_plus10)==T])


hist(dat2000$pop[is.na(dat2000$pop_plus10)==F])


How well can you predict the 2010 Census Tract population from the 2000 Census Tract population?

reg1 <- lm(pop_plus10~pop, dat2000)
summary(reg1)
## 
## Call:
## lm(formula = pop_plus10 ~ pop, data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1271.96  -357.73   -45.54   355.51  2334.68 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -92.07364   88.72101  -1.038      0.3    
## pop           1.01166    0.01986  50.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 552 on 273 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9045 
## F-statistic:  2595 on 1 and 273 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that the 2000 population explains about 90% of the variation in the 2010 population. The following shows how well the regression line fits the data.

plot(dat2000$pop, dat2000$pop_plus10)
abline(reg1)


Furthermore, each person in 2000 correlates with 1.01 persons in 2010, though with a standard error of 0.02, this is hardly a ringing endorsement of population growth. Plotting the population change from 2000 to 2010 shows that almost the same number of Census Tracts gained and lost population.

plot(density(dat2000$pop_change, na.rm = T))


Plotting the error term shows that the regression line fits the low population tracts a bit better than the high population tracts. (Note how closely this tracks with the population change number, since the prediction of the 2010 population relies entirely on the 2000 population.)

plot(density(resid(reg1)))


Plotting the predicted population against the error terms reveals some additional patterns to help understand the regression. Ideally this pattern would look much more random with points equally and randomly distributed around the zero line.

plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)


One way to try to deal with underpredicting the more populous Census Tracts is to introduce a quadratic term.

reg2 <- lm(pop_plus10~pop + I(pop^2), dat2000)
summary(reg2)
## 
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2), data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1228.67  -338.37   -48.65   365.88  2334.17 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.354e+02  1.750e+02   2.488 0.013460 *  
## pop         7.114e-01  8.863e-02   8.027 3.02e-14 ***
## I(pop^2)    3.587e-05  1.033e-05   3.473 0.000599 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 541.2 on 272 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.9082 
## F-statistic:  1356 on 2 and 272 DF,  p-value: < 2.2e-16

Does this fit the data better? A full model vs. reduced model ANOVA test suggests that it does. So does the small improvement to the adjusted R-squared.

anova(reg1,reg2)
## Analysis of Variance Table
## 
## Model 1: pop_plus10 ~ pop
## Model 2: pop_plus10 ~ pop + I(pop^2)
##   Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
## 1    273 83185653                                 
## 2    272 79653918  1   3531735 12.06 0.0005991 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual plot looks better…slightly…perhaps…

plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)


Try testing some more interesting hypotheses. For example, do you expect higher income neighborhoods to grow more or less quickly than lower income neighborhoods? You could probably construct a reasonable theory in either direction, but my guess is that wealthier Census Tracts in Philadelphia were the most likely to grow.

reg3 <- lm(pop_plus10 ~ pop + I(pop^2) + I(median_hh_income/1000), dat2000)
summary(reg3)
## 
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(median_hh_income/1000), 
##     data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1167.48  -361.12   -48.03   332.20  2313.85 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.836e+02  1.959e+02   1.448 0.148838    
## pop                       7.188e-01  8.843e-02   8.128 1.56e-14 ***
## I(pop^2)                  3.512e-05  1.030e-05   3.408 0.000753 ***
## I(median_hh_income/1000)  4.375e+00  2.570e+00   1.702 0.089840 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 539.3 on 271 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9098, Adjusted R-squared:  0.9088 
## F-statistic: 911.2 on 3 and 271 DF,  p-value: < 2.2e-16

Every $1,000 more in median income correlates with about 4 more people in 2010, not a tremendously strong relationship. Furthermore, the coefficient is not statistically different from 0 with 95% confidence (though it is with 90% confidence).
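You can see this directly by comparing confidence intervals at the two levels; a quick sketch:

confint(reg3, level = 0.95)  # the interval for the income term crosses zero
confint(reg3, level = 0.90)  # at 90% confidence, it does not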

Median income may also be correlated with other variables that are associated with increasing or decreasing population, such as race.

reg4 <-  lm(pop_plus10~ pop + I(pop^2)+ I(median_hh_income/1000) + I(100*pop_black/pop), dat2000)
summary(reg4)
## 
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(median_hh_income/1000) + 
##     I(100 * pop_black/pop), data = dat2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1306.97  -327.63    -4.05   289.08  2213.82 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.204e+02  2.224e+02   2.789 0.005667 ** 
## pop                       7.208e-01  9.168e-02   7.862 9.19e-14 ***
## I(pop^2)                  3.365e-05  1.062e-05   3.170 0.001703 ** 
## I(median_hh_income/1000) -4.582e-01  2.871e+00  -0.160 0.873310    
## I(100 * pop_black/pop)   -3.354e+00  9.524e-01  -3.521 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 528.9 on 269 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.9122, Adjusted R-squared:  0.9109 
## F-statistic: 698.6 on 4 and 269 DF,  p-value: < 2.2e-16

Each percentage point higher concentration of black residents in Census Tracts in 2000 correlates with 3 fewer people in 2010. Using this regression, let's use the predict function to estimate Philadelphia's 2010 population:

sum(predict(reg4))
## [1] 1126700

This is quite a bit lower than the actual 2010 population count:

sum(dat2000$pop_plus10, na.rm=T)
## [1] 1504950

The regression, however, excludes the hundred or so Census Tracts that changed between 2000 and 2010. Adding these back makes the prediction much closer to the actual population.

sum(predict(reg4, newdata = dat2000), na.rm =T)
## [1] 1543117

Finally, use a similar command to estimate the 2020 population by applying our regression equation to the 2010 Census data.

sum(predict(reg4, newdata = dat2010), na.rm =T)
## [1] 1488331

EXERCISE

  1. Despite recent population growth, the model predicts that population will decline between 2010 and 2020. Do you agree with this prediction? What do you think is causing it?

  2. Replace median income with a measure of the proportion of the population over 25 with a college degree or higher (edu_collegeplus). Do you expect neighborhoods with a higher proportion of well-educated residents to have grown or lost population between 2000 and 2010? Describe the results of the model in relationship to your expectations.

  3. Does this change improve the model? Be sure to consider statistical fit, residual plots, and theory in providing your answer.

  4. Develop a different set of models predicting population change (pop_change) instead of the population ten years later (pop_plus10). What factors do you expect to be good predictors of total population change? Describe the model, model fits, and residual plots.


Planning Methods: working with Zip Business Patterns data

In this lab, you will download data on jobs by industry from the Zip Business Patterns data. You can find the data on the US Census website. Be sure to download the data as a CSV file. For now, download the full 2002 and 2012 data. The links are: 2002 & 2012

You want the Complete ZIP Code Industry Detail File.

Be sure to read the descriptions of the data and data collection methods on the website. The FAQ may also prove useful.

The objectives of the lab are to learn to work with ZBP data using the midpoint method and to learn a few data processing commands in R. Specifically, we will take subsets of data and do a simple data merge. Although this may seem a little complicated for some simple file management and calculations, the payoffs are big. By the end of the lab, you should be able to create midpoint estimates of jobs for all US Zip Codes by any available NAICS code or subsets of NAICS codes with just a few changes to your code.

To load the data:
1) Set your working directory to wherever you saved your files. Note that the data are stored a bit differently across the two years.

2) Run the following two lines of code:

zbp02 <- read.csv("zbp02detail.txt")
zbp12 <- read.csv("zbp12detail.txt")

This will take a moment to load if you're working on a home computer. To make sure that your data loaded properly, check that the files have the same dimensions as below,

dim(zbp02)
## [1] 3251827      12
dim(zbp12)
## [1] 3179204      12

look the same,

head(zbp02)
##   ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499 N500_999
## 1 501 ------  11    9    1      0      1      0        0        0        0
## 2 501 23----   1    1    0      0      0      0        0        0        0
## 3 501 235310   1    1    0      0      0      0        0        0        0
## 4 501 42----   3    3    0      0      0      0        0        0        0
## 5 501 421990   1    1    0      0      0      0        0        0        0
## 6 501 422920   1    1    0      0      0      0        0        0        0
##   N1000
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
head(zbp12)
##    zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1  501 ------   2    2    0      0      0      0        0        0
## 2  501 81----   2    2    0      0      0      0        0        0
## 3  501 813110   2    2    0      0      0      0        0        0
## 4 1001 ------ 453  206   77     75     55     22       14        3
## 5 1001 22----   2    1    0      0      1      0        0        0
## 6 1001 221310   1    1    0      0      0      0        0        0
##   n500_999 n1000
## 1        0     0
## 2        0     0
## 3        0     0
## 4        1     0
## 5        0     0
## 6        0     0

and have the same column types. Most importantly, only NAICS should be a character or factor.

str(zbp02)
## 'data.frame':    3251827 obs. of  12 variables:
##  $ ZIP     : int  501 501 501 501 501 501 501 501 501 501 ...
##  $ NAICS   : Factor w/ 1104 levels "------","11----",..: 1 59 73 562 599 627 631 632 655 859 ...
##  $ EST     : int  11 1 1 3 1 1 1 1 1 2 ...
##  $ N1_4    : int  9 1 1 3 1 1 1 1 1 2 ...
##  $ N5_9    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N10_19  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N20_49  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N50_99  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N100_249: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N250_499: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N500_999: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
str(zbp12)
## 'data.frame':    3179204 obs. of  12 variables:
##  $ zip     : int  501 501 501 1001 1001 1001 1001 1001 1001 1001 ...
##  $ naics   : Factor w/ 999 levels "------","11----",..: 1 950 986 1 48 60 61 63 65 67 ...
##  $ est     : int  2 2 2 453 2 1 1 47 1 2 ...
##  $ n1_4    : int  2 2 2 206 1 1 0 30 1 1 ...
##  $ n5_9    : int  0 0 0 77 0 0 0 7 0 0 ...
##  $ n10_19  : int  0 0 0 75 0 0 0 3 0 0 ...
##  $ n20_49  : int  0 0 0 55 1 0 1 7 0 1 ...
##  $ n50_99  : int  0 0 0 22 0 0 0 0 0 0 ...
##  $ n100_249: int  0 0 0 14 0 0 0 0 0 0 ...
##  $ n250_499: int  0 0 0 3 0 0 0 0 0 0 ...
##  $ n500_999: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ n1000   : int  0 0 0 0 0 0 0 0 0 0 ...

If, for some reason, one of the other variables did not load as you want, you can change it like so:

zbp02$EST <- as.numeric(zbp02$EST) ##as.integer() is also fine.

Note that the 2002 data include the same Zip Codes multiple times. These are the counts by sector: the total, each two-digit sector, and each six-digit industry. (The 2012 data follow the same structure.) To better understand this structure, look at the first Zip Code from the 2002 data:

subset(zbp02, ZIP == 501)
##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 3  501 235310   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 5  501 421990   1    1    0      0      0      0        0        0
## 6  501 422920   1    1    0      0      0      0        0        0
## 7  501 422990   1    1    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 9  501 445110   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 11 501 541110   1    1    0      0      0      0        0        0
## 12 501 541211   1    1    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
## 14 501 621111   1    1    0      0      0      0        0        0
## 15 501 81----   2    0    1      0      1      0        0        0
## 16 501 813110   2    0    1      0      1      0        0        0
## 17 501 99----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 3         0     0
## 4         0     0
## 5         0     0
## 6         0     0
## 7         0     0
## 8         0     0
## 9         0     0
## 10        0     0
## 11        0     0
## 12        0     0
## 13        0     0
## 14        0     0
## 15        0     0
## 16        0     0
## 17        0     0

The NAICS column includes the total number of establishments across all sectors [NAICS == "------"], by two-digit industry code, and by six-digit code. EST is the total number of establishments, and the columns to the right are the number of establishments by number of employees (1 to 4, 5 to 9, etc.). To check that the numbers add up, try adding the establishments by sector.

zbp02$EST[zbp02$NAICS == "------" & zbp02$ZIP == 501] 
## [1] 11

And check it against the sum of the other columns.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501] )
## [1] 21

Whoops! What happened? The data include both six-digit and two-digit counts, which led to some double counting. Try including only NAICS codes that end in "----" with the substring command.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501 & substring(zbp02$NAICS,3,6)== "----"] )
## [1] 11
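If the substring() call is unfamiliar: it extracts characters 3 through 6 of each NAICS code, which equal "----" only for the two-digit sector rows. For example:

substring("23----", 3, 6)
## [1] "----"
substring("235310", 3, 6)
## [1] "5310"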

To clean up the data a bit, create a new file that includes only the 2-digit codes and totals. If you want to save a more complete file for future use, you can skip this step. Note that no matter what you do, the original CSV files remain unchanged unless you specifically overwrite them. Thus you can run and rerun your code without ever changing the underlying data.

zbp02 <- subset(zbp02, substring(zbp02$NAICS,3,6)=="----")
zbp12 <- subset(zbp12, substring(zbp12$naics,3,6)=="----")

Again check your dimensions against the below to make sure that the commands worked.

dim(zbp02)
## [1] 478497     12
dim(zbp12)
## [1] 454477     12

And look at the data.

head(zbp02)
##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 4         0     0
## 8         0     0
## 10        0     0
## 13        0     0
head(zbp12)
##     zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1   501 ------   2    2    0      0      0      0        0        0
## 2   501 81----   2    2    0      0      0      0        0        0
## 4  1001 ------ 453  206   77     75     55     22       14        3
## 5  1001 22----   2    1    0      0      1      0        0        0
## 8  1001 23----  47   30    7      3      7      0        0        0
## 27 1001 31----  48   10    8      9      9      6        3        3
##    n500_999 n1000
## 1         0     0
## 2         0     0
## 4         1     0
## 5         0     0
## 8         0     0
## 27        0     0

Now, use the midpoint method to estimate the number of jobs by establishment size. Note that the assumed average size of a 1-to-4 employee establishment is 2.5 employees. For establishments with 1000 or more employees, the midpoint method sets the number of employees to 1000. (Do the 2002 data in steps and then 2012 all at once.) The within() command is one of several nifty ways to avoid writing zbp02$ before each variable.

zbp02 <- within(zbp02, jobs <- N1_4 *2.5 + N5_9 *7 + N10_19 * 14.5 + 
    N20_49 * 34.5 +  N50_99 *74.5 + 
    N100_249 * 174.5 + N250_499 *374.5 + 
    N500_999 * 749.5 +  N1000 *1000)

For this lab, drop all non-PA Zip Codes to conserve space. (You may want to go back and keep the other states at some point.) PA's Zip Codes run between 15000 and 19999.

zbp02<- subset(zbp02, as.numeric(zbp02$ZIP) 
                    >= 15000 & as.numeric(zbp02$ZIP) <= 19999) 

And get rid of the variables that are not needed.

keepers <- c("ZIP", "NAICS", "jobs")
zbp02 <- zbp02[keepers]

And finally clean up the labeling of the NAICS codes.

zbp02$NAICS <- substring(zbp02$NAICS,1,2)
zbp02$NAICS[zbp02$NAICS=="--"] <- "tot"

Have a look:

head(zbp02)
##          ZIP NAICS   jobs
## 452595 14001   tot 3180.5
## 452596 14001    23  224.0
## 452610 14001    31 1383.0
## 452629 14001    42  820.5
## 452638 14001    44  189.0
## 452654 14001    48   20.0

This type of data is often referred to as long data. There are multiple entries for each Zip Code and the NAICS column indicates the industry. For future work, it is going to be a little easier to have the data in a wide format, which you can do with ease using the reshape() command.

zbp02 <- reshape(zbp02, 
             timevar = "NAICS",
             idvar = c("ZIP"),
             direction = "wide")

Have a look:

head(zbp02)
##          ZIP jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48 jobs.51
## 452595 14001   3180.5   224.0  1383.0   820.5   189.0    20.0       5
## 452714 14003     12.0      NA      NA      NA     7.0      NA      NA
## 452721 14004   2270.5   292.0   467.0    39.0   527.5    60.5      14
## 452854 14005    167.0     7.5    34.5     9.5    31.0    37.0      NA
## 452887 14006   1898.5    92.0   489.5    71.5   295.0    96.5      19
## 452991 14008     51.5    10.0      NA      NA    12.5      NA      NA
##        jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 452595   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 452714      NA      NA      NA      NA      NA      NA      NA     2.5
## 452721    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 452854     2.5      NA      NA     2.5      NA     2.5      NA      NA
## 452887    85.0       7    31.5    22.0      NA    97.0    39.5   440.0
## 452991      NA      NA      NA     2.5      NA    14.5      NA     7.0
##        jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 452595    56.0      49     7.5      NA      NA      NA      NA
## 452714     2.5      NA      NA      NA      NA      NA      NA
## 452721   169.5      NA      NA     9.5      NA      NA      NA
## 452854    19.0      NA      NA    14.0       7      NA      NA
## 452887   106.0      NA     7.0      NA      NA      NA      NA
## 452991     5.0      NA      NA      NA      NA      NA      NA

The default setting sets missing entries to NA, but you should convert these to zero, since they represent Zip Codes with no reported establishments rather than unknown values.

Note that zbp02[is.na(zbp02)==T] <- 0 is a logical statement that replaces every value in the dataset that currently equals NA with a zero. If you only wanted to do this for job totals, you could write: zbp02$jobs.tot[is.na(zbp02$jobs.tot)==T] <- 0

zbp02[is.na(zbp02)==T] <- 0

The following goes through the same steps for the 2012 data. Try to go through it slowly and understand each step of the process. Use ?command to get a description of functions if needed. Use internet searches as well.

zbp12<- subset(zbp12, as.numeric(zbp12$zip) 
               >= 14000 & as.numeric(zbp12$zip) <= 19999) 

#calculate job totals
zbp12 <- within(zbp12, jobs <- n1_4 *2.5 + n5_9 *7 + n10_19 * 14.5 + 
                  n20_49 * 34.5 +  n50_99 *74.5 + 
                  n100_249 * 174.5 + n250_499 *374.5 + 
                  n500_999 * 749.5 +  n1000 *1000)

zbp12$ZIP <- zbp12$zip ##to make consistent with the 02 data
zbp12$NAICS <- zbp12$naics
##long to wide

zbp12 <- zbp12[keepers] #Keepers should still exist
zbp12$NAICS <- substring(zbp12$NAICS,1,2)
zbp12$NAICS[zbp12$NAICS=="--"] <- "tot"

#long to wide
zbp12 <- reshape(zbp12, 
                 timevar = "NAICS",
                 idvar = c("ZIP"),
                 direction = "wide")

#switch NA to zero
zbp12[is.na(zbp12)==T] <- 0

Finally, add the total number of jobs from 2012 to the 2002 dataset to use for making predictions in a later lab. Create a new dataset so that zbp12 does not change.

zipmerge <- zbp12[c("ZIP", "jobs.tot")]        # keep only the 2012 totals
zipmerge$jobs_plus10 <- zipmerge$jobs.tot      # rename: jobs ten years later
zipmerge <- zipmerge[c("ZIP", "jobs_plus10")]
zbp02 <- merge(zipmerge, zbp02, by = "ZIP", all.x = T, all.y = T)  # keep Zip Codes from both years
rm(zipmerge, keepers)

Look at the file. This time, keep the NAs, since they do in fact represent missing data. For example, there was no Zip Code 14003 in 2012.

head(zbp02)
##     ZIP jobs_plus10 jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48
## 1 14001      3015.0   3180.5   224.0  1383.0   820.5   189.0    20.0
## 2 14003          NA     12.0     0.0     0.0     0.0     7.0     0.0
## 3 14004      2368.0   2270.5   292.0   467.0    39.0   527.5    60.5
## 4 14005       182.5    167.0     7.5    34.5     9.5    31.0    37.0
## 5 14006      1370.0   1898.5    92.0   489.5    71.5   295.0    96.5
## 6 14008        73.0     51.5    10.0     0.0     0.0    12.5     0.0
##   jobs.51 jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 1       5   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 2       0     0.0       0     0.0     0.0     0.0     0.0     0.0     2.5
## 3      14    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 4       0     2.5       0     0.0     2.5     0.0     2.5     0.0     0.0
## 5      19    85.0       7    31.5    22.0     0.0    97.0    39.5   440.0
## 6       0     0.0       0     0.0     2.5     0.0    14.5     0.0     7.0
##   jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 1    56.0      49     7.5     0.0       0       0       0
## 2     2.5       0     0.0     0.0       0       0       0
## 3   169.5       0     0.0     9.5       0       0       0
## 4    19.0       0     0.0    14.0       7       0       0
## 5   106.0       0     7.0     0.0       0       0       0
## 6     5.0       0     0.0     0.0       0       0       0

If desired, save this as an R data file. In the future, you can double click on this file to load the 2002 and 2012 job estimates by 2-digit NAICS for PA.

save(zbp02, zbp12,file = "zip_jobs.Rda")

As you process your own data in the future, you will learn a number of different commands and packages to help you along the way. There are always dozens of different ways to do the same thing. If you are having trouble getting the program to do what you want it to do, it often makes sense to think about the problem differently. For example, maybe you want a nice summary table of jobs by industry and by state, but you are having trouble getting a single command to do it. Try instead to do each operation separately, and then combine them.
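For example, here is a minimal sketch of that approach using the wide zbp02 and zbp12 files built above (PA only, so the by-state piece is omitted): compute each year's industry totals separately, then combine them into one table.

# total estimated jobs by industry in each year, computed separately
tot02 <- colSums(zbp02[grep("^jobs\\.", names(zbp02))], na.rm = TRUE)
tot12 <- colSums(zbp12[grep("^jobs\\.", names(zbp12))], na.rm = TRUE)
# then combined into a single summary table
cbind(jobs2002 = tot02, jobs2012 = tot12[names(tot02)])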

EXERCISE

  1. Look up the 2-digit NAICS code for Manufacturing. Which Zip Codes had the most gains or losses (both total and proportional) by state? Now, make a summary table by state that includes total estimated manufacturing jobs in 2002 and 2012 and the total and proportional change.
  2. Make the same table as above but only include Zip Codes that have 100 or more jobs. Describe differences between the tables.
  3. Make a table of the number of establishments that have 1000+ employees and the total number of establishments by 2-digit NAICS in 2012. (Hint: either create two new datasets, one for 1000+ establishments by 2-digit NAICS and one for all establishments by 2-digit NAICS, and then merge them; or use the sum() command on subsets of the data.) What industries have the highest proportion of 1000+ establishments?
  4. Make the same table as in question 3, but list the estimated number of jobs instead of establishments. Describe similarities and differences between the two tables.


Planning Methods: Loading and examining data

This page describes how to load and look at data in R. First, we will download a CSV of UN population estimates on central African cities to import into R: Central Africa West.

The first thing to do is to let R know where to find the CSV. You can do this by navigating through the bottom right window in R Studio (Files/More/Set As Working Directory) or setting your working directory using the setwd command.
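For example, a one-line sketch (the path is a placeholder; substitute your own folder):

setwd("~/Documents/planning-methods")  # hypothetical path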

Then read the CSV file: read.csv("Central Africa West.csv")
Note how the data are printed directly to the command console. Next we will create a new object called pop so that we can call on the data more easily and interact with it.

Try: pop <- read.csv("Central Africa West.csv")

NB: you can use = instead of <-, but it is better coding practice to save the = sign for other uses.
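For example, = is how you set named arguments inside a function call:

head(pop, n = 3)  # here '=' names the argument n; '<-' creates objects like pop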

You can also load data in Rstudio using a point-and-click interface. To set your working directory, click on Session at the top of the window and then click Set Working Directory/Choose Directory. To import a data file, click on File/Import Dataset.

To look at the data, try cutting and pasting the following commands into the command console and hitting enter.

str(pop)
## 'data.frame':    17 obs. of  12 variables:
##  $ Year                   : int  1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 ...
##  $ Period                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Douala                 : int  95 117 153 205 298 433 571 740 940 1184 ...
##  $ Yaounde                : int  32 49 75 112 183 292 415 578 777 1025 ...
##  $ Libreville             : int  15 17 29 49 77 155 234 293 366 439 ...
##  $ Brazzaville            : int  83 92 124 172 238 329 446 596 704 830 ...
##  $ Pointe_Noire           : int  16 35 64 89 116 154 217 300 363 439 ...
##  $ Sub_Saharan_300kplus   : int  3923 4909 7083 10779 16335 24143 34813 47767 62327 75996 ...
##  $ Central_Africa_300kplus: int  3660 4521 5651 7047 8921 11495 14465 18043 22566 28525 ...
##  $ Cameroon_urban         : int  417 557 747 1011 1375 2112 2851 3761 4787 5930 ...
##  $ Congo_urban            : int  201 253 320 408 522 672 860 1086 1295 1535 ...
##  $ Gabon_urban            : int  54 68 87 126 189 279 397 515 655 814 ...

This tells you the structure of the data. In this case, all of the variables are integers, but it is also common to see characters, factors, and other types of data.

names(pop)
##  [1] "Year"                    "Period"                 
##  [3] "Douala"                  "Yaounde"                
##  [5] "Libreville"              "Brazzaville"            
##  [7] "Pointe_Noire"            "Sub_Saharan_300kplus"   
##  [9] "Central_Africa_300kplus" "Cameroon_urban"         
## [11] "Congo_urban"             "Gabon_urban"

This command gives the names of all the variables in the data. The country, city, and region variables contain population estimates in thousands.

summary(pop)
##       Year          Period       Douala          Yaounde      
##  Min.   :1950   Min.   : 1   Min.   :  95.0   Min.   :  32.0  
##  1st Qu.:1970   1st Qu.: 5   1st Qu.: 205.0   1st Qu.: 112.0  
##  Median :1990   Median : 9   Median : 571.0   Median : 415.0  
##  Mean   :1990   Mean   : 9   Mean   : 804.8   Mean   : 693.8  
##  3rd Qu.:2010   3rd Qu.:13   3rd Qu.:1184.0   3rd Qu.:1025.0  
##  Max.   :2030   Max.   :17   Max.   :2361.0   Max.   :2349.0  
##                              NA's   :4        NA's   :4       
##    Libreville     Brazzaville      Pointe_Noire   Sub_Saharan_300kplus
##  Min.   : 15.0   Min.   :  83.0   Min.   : 16.0   Min.   :  3923      
##  1st Qu.: 49.0   1st Qu.: 172.0   1st Qu.: 89.0   1st Qu.: 16335      
##  Median :234.0   Median : 446.0   Median :217.0   Median : 62327      
##  Mean   :258.5   Mean   : 575.3   Mean   :293.1   Mean   : 91980      
##  3rd Qu.:439.0   3rd Qu.: 830.0   3rd Qu.:439.0   3rd Qu.:137283      
##  Max.   :631.0   Max.   :1574.0   Max.   :815.0   Max.   :300153      
##  NA's   :4       NA's   :4        NA's   :4                           
##  Central_Africa_300kplus Cameroon_urban   Congo_urban    Gabon_urban    
##  Min.   :  3660          Min.   :  417   Min.   : 201   Min.   :  54.0  
##  1st Qu.:  8921          1st Qu.: 1375   1st Qu.: 522   1st Qu.: 189.0  
##  Median : 22566          Median : 4787   Median :1295   Median : 655.0  
##  Mean   : 34795          Mean   : 6835   Mean   :1723   Mean   : 819.6  
##  3rd Qu.: 51883          3rd Qu.:10625   3rd Qu.:2600   3rd Qu.:1334.0  
##  Max.   :107747          Max.   :20492   Max.   :4804   Max.   :2122.0  
## 

Another good way to get a sense for the data is to look at the first or last entries, using the head or tail commands.

head(pop)
##   Year Period Douala Yaounde Libreville Brazzaville Pointe_Noire
## 1 1950      1     95      32         15          83           16
## 2 1955      2    117      49         17          92           35
## 3 1960      3    153      75         29         124           64
## 4 1965      4    205     112         49         172           89
## 5 1970      5    298     183         77         238          116
## 6 1975      6    433     292        155         329          154
##   Sub_Saharan_300kplus Central_Africa_300kplus Cameroon_urban Congo_urban
## 1                 3923                    3660            417         201
## 2                 4909                    4521            557         253
## 3                 7083                    5651            747         320
## 4                10779                    7047           1011         408
## 5                16335                    8921           1375         522
## 6                24143                   11495           2112         672
##   Gabon_urban
## 1          54
## 2          68
## 3          87
## 4         126
## 5         189
## 6         279

For help with any of these commands, use the help function by typing ? before the command name. For example, try typing ?head into the command console.

EXERCISE

Walk through this brief introduction to R and R Studio.


Exploring US bus ridership with the National Transit Database

I frequently use the National Transit Database for analysis or quick summaries of key information about public transportation in the US. I also have my students use it to perform a back-of-the-envelope cost-benefit analysis for a proposed light rail system in one of my transportation planning courses. In both instances, I find the downloadable Excel sheets to be a bit clunky and slow to manipulate. Most recently, I wanted a quick comparison of bus ridership per route kilometer for a system in Indonesia. As usual, I downloaded the Excel sheets and spent about an hour playing with the data until I got a single reasonable answer. As usual, I'd have to do another bit of work if I wanted to change anything. Instead, I decided to clean up a subset of the data and rearrange it into a long-format panel. For a description of the data and some of the choices made (like when to make an entry a 0 or an NA), see here.

I’m also hoping that the clean data set will encourage my students to choose to work with the data in R. I use R in my Planning Methods and Transportation Planning courses because:

  1. Data sets are ever larger and more readily available. Many are useful to planners and for planning. The ability to manage and work with large amounts of data is an important skill for planners, particularly transportation planners.
  2. It’s a great program for managing, visualizing, and analyzing data. Many of its primary competitors are only really good at the third. Poor data management can be embarrassing.
  3. It has a strong user community that builds add-on packages to accomplish a range of tasks, provides tutorials, and answers questions online. It’s also frequently used in free, online courses and is a program that grows with users as they learn more. Here are some links to great free courses from professors at Johns Hopkins and Stanford.
  4. It’s free, so even the most cash-strapped municipality has access.

Unfortunately, the learning curve can be steep for new users. Processing data, in particular, can be slow and is rarely immediately rewarding. New users are unlikely to see much benefit to downloading, cleaning, and formatting data just to use it in a class assignment. I wouldn’t either.

Now back to my original question: how does the number of passengers per kilometer in my Indonesian city stack up against ridership in US cities?

After loading the file, this code does the trick. I only include directly operated routes (Service == "DO") because the privately operated routes (Service == "PT") are less accurately reported. Unlinked passenger trips (UPT) aren't the same thing as total bus passengers (riders who transfer once count as two unlinked passenger trips) but they're collected in much the same way that corridor bus counts in Indonesia would likely have been collected. I divide directional route miles (DRM) by two because DRM counts 1 mile of bidirectional service as 2 miles, and the halved figure is closer to what the Indonesian counts measure.

summary(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    0.2915   14.9500   32.9000   67.7300   64.3100 1582.0000       162
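
Since the same calculation shows up several more times below, wrapping it in a small helper function would cut down on retyping and typos. A minimal sketch (the function name is my own):

upt.per.km <- function(upt, drm) {
  # daily unlinked trips per kilometer of two-way route: DRM is in miles
  # and counts each direction separately, so convert to km and halve
  (upt / 365) / ((1.61 * drm) / 2)
}
summary(with(subset(NTD.ts, Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0),
             upt.per.km(UPT, DRM)))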

And this provides even more information about the distribution of ridership (Unlinked Passenger Trips anyway) per kilometer of bus route.

plot(density(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ), na.rm=T))

And this tells me how many systems perform better than my Indonesian system, which carries only 16 passengers per kilometer.

table(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )<16)
## 
## FALSE  TRUE 
##   317   158

Before going any further, I recommend trying three commands to get a better feel for the data (in the interest of space, I’m excluding the outputs):

head(NTD.ts)
str(NTD.ts)
summary(NTD.ts)

One of the handiest features of R is the ease with which it handles multiple data sets, so I’m going to add a couple of variables and then create a new data set that only looks at buses in 2013 that are directly operated by public agencies. I switched from Stata to R halfway into my dissertation (an otherwise foolish decision) because R made it so much easier to interact with and combine several large data sets.

# daily unlinked passenger trips per kilometer of two-way route
NTD.ts$passenger.km <- (NTD.ts$UPT/365) / ((1.61*NTD.ts$DRM)/2)
# share of operating expenses covered by fare revenue
NTD.ts$fare.recovery <- NTD.ts$FARES / NTD.ts$OPEXP_TOTAL
# directly operated 2013 bus service with nonzero ridership
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0 & Service == "DO")

Now let’s plot passengers per kilometer against fare recovery.

plot(new.dat$fare.recovery, new.dat$passenger.km)
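
A couple of cosmetic additions make the scatter easier to read; the axis labels below are my own wording, and the dashed line marks a fare recovery ratio of one:

plot(new.dat$fare.recovery, new.dat$passenger.km,
     xlab="Fare recovery (FARES / OPEXP_TOTAL)",
     ylab="Daily UPT per route kilometer")
abline(v=1, lty=2)  # systems right of the line collect more in fares than they spend on operations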

I’m a bit suspicious of the systems drawing more fare revenue than they spend on operations, so I want to look at them more closely: two university systems, an airport shuttle, and one I’m not sure about.

subset(new.dat, new.dat$fare.recovery>1)
##       Year TRS.ID Mode Service Last.Report.Year
## 58601 2013   2166   MB      DO             2013
## 59799 2013   2132   MB      DO             2013
## 59988 2013   4180   MB      DO             2013
## 60326 2013   8107   MB      DO             2013
##                                                    System.Name
## 58601                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799               New Jersey Transit Corporation-45(NJTC-45)
## 59988                University of Georgia Transit System(UGA)
## 60326 The University of Montana - ASUM Transportation(ASUM OT)
##       Small.Systems.Waiver      City State Census.Year
## 58601                    N Elizabeth    NJ        2010
## 59799                    N    Newark    NJ        2010
## 59988                    N    Athens    GA        2010
## 60326                    N  Missoula    MT        2010
##                        UZA.Name UZA UZA.Area.SQ.Miles UZA.Population
## 58601 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59799 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59988  Athens-Clarke County, GA 249                98         128754
## 60326              Missoula, MT 348                45          82157
##       OPEXP_TOTAL OPEXP_VO OPEXP_VM OPEXP_NVM OPEXP_GA    FARES VOMS
## 58601    15077606  9014072  4403391    708772   951371 16883604   52
## 59799     8299771  5067517  2285398    217667   729189 10034922   41
## 59988     5355484  4052395   791332     52669   459088  6281720   47
## 60326      598190   327957   164685     13613    91935   601851    6
##           VRM    VRH  DRM      UPT      PMT CPI passenger.km fare.recovery
## 58601 1750683 191541 69.8 10294583 32942666   1     501.9548      1.119780
## 59799 1079382 135429 68.9  5504389  8461884   1     271.8950      1.209060
## 59988  853644 114959 96.0 11058944  4202399   1     392.0610      1.172951
## 60326  153163  11578  6.3   421694   737965   1     227.8076      1.006120

In case I want to look at only a few columns:

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$fare.recovery>1)
##       TRS.ID                                              System.Name
## 58601   2166                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799   2132               New Jersey Transit Corporation-45(NJTC-45)
## 59988   4180                University of Georgia Transit System(UGA)
## 60326   8107 The University of Montana - ASUM Transportation(ASUM OT)
##       OPEXP_TOTAL  DRM      UPT      PMT passenger.km fare.recovery
## 58601    15077606 69.8 10294583 32942666     501.9548      1.119780
## 59799     8299771 68.9  5504389  8461884     271.8950      1.209060
## 59988     5355484 96.0 11058944  4202399     392.0610      1.172951
## 60326      598190  6.3   421694   737965     227.8076      1.006120
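
Column positions shift whenever the NTD adds or drops a field, so selecting by name is a bit safer than the numeric indices above. An equivalent call for the same eight columns, using subset()’s select argument:

subset(new.dat, fare.recovery > 1,
       select=c(TRS.ID, System.Name, OPEXP_TOTAL, DRM, UPT, PMT, passenger.km, fare.recovery))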

Which systems are busiest?

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$passenger.km > 750)
##       TRS.ID
## 58641   2008
## 59050   5066
## 59876   5158
##                                                            System.Name
## 58641                                  MTA New York City Transit(NYCT)
## 59050                                   Chicago Transit Authority(CTA)
## 59876 University of Michigan Parking and Transportation Services(UMTS)
##       OPEXP_TOTAL    DRM       UPT        PMT passenger.km fare.recovery
## 58641  2487134393 1659.0 770962014 1614997081    1581.6043     0.3417458
## 59050   764280757 1311.2 300116357  728561319     778.9902     0.3909879
## 59876     6995443   20.7   7263234   20293476    1194.1832     0.2824377

I might also want to see how DRM and UPT look against each other, since these two numbers form the basis of the calculation. It looks an awful lot like fare recovery against productivity.

plot(new.dat$DRM, new.dat$UPT)

Lastly, I want to look at how passengers per kilometer have changed over time. First I’ll redefine new.dat to include all years. Then I’ll do a box plot by year.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F)
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

Clearly there’s some bad data. This merits further investigation, but for now I’ll just remove the worst offenders.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 )
boxplot(passenger.km ~ Year, data=new.dat, main="Bus ridership density over time", na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

There’s so much variation that it’s hard to read, so I’ll make two plots: one for high-productivity systems and one for low.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 & passenger.km > 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")


National Transit Database panel by system, mode, and service

This page contains links to a long-formatted panel of the National Transit Database’s TS2.1 – Service Data and Operating Expenses Time-Series by Mode. I maintain the NTD data conventions, so that the column names match the NTD’s data glossary. I also added a column labeled CPI that is generated from the Bureau of Labor Statistics’ Inflation Calculator and provides the 2013 value of one dollar in the reported year. I have not reformatted the other time series NTD data, which I use less frequently and which are not available at the same level of disaggregation. For example, the capital expenditures data include system, mode, and year, but not service type. Some data products have also been discontinued, such as the detailed sources of local revenues data. The reformatted panel is available in the following formats:

R (password: NTD1),

Stata (password: NTD2),

and as a CSV (password: NTD3).
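
Because the CPI column holds the 2013 value of one dollar in the reported year, restating any nominal column in 2013 dollars is a one-line multiplication; for example (the new column name is my own):

# total operating expenses restated in 2013 dollars
NTD.ts$OPEXP_TOTAL.2013 <- NTD.ts$OPEXP_TOTAL * NTD.ts$CPI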

I haven’t looked closely at the Stata file but exported it using the foreign package. The code used to format the data is available here (password: NTD4), if you want to make any changes. Below, I describe three important choices about when to treat missing values as an NA or a 0. The NTD data do not take a particularly clear approach to the problem. For example, entries for missing cost data vary between a null entry and $0 with no apparent pattern. For additional information about how the data were constructed, see the code above.

  1. Keep zeros. In 1992, Denver did not have a light rail. Instead of replacing entries of zero with NAs, I keep the data. This will make it easier to analyze the influence of new light rail lines in the future. It is, however, worth noting that this choice is somewhat inconsistent and introduces some problems. It is inconsistent because Honolulu, which has never had a light rail system, is excluded from the panel, so its data are effectively treated as NAs rather than zeros. It is potentially problematic because I will miscalculate average light rail ridership or operating expenses in 1992 if I’m not careful:
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992), summary(OPEXP_TOTAL) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0        0        0  6334000  5050000 62260000
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992 & NTD.ts$OPEXP_TOTAL > 0), summary(OPEXP_TOTAL) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   261200  4800000 11440000 17730000 21360000 62260000
  2. Replace known missing data with an NA. For example, the NTD does not report fare revenues earlier than 2002, so I replaced all earlier fare entries with NAs.
  3. Systematically coerce likely incorrect zeros into NAs (sketched below). The NTD data are frequently messy, particularly for small systems and privately operated systems. In one year, an agency might report unlinked passenger trips but fail to report any operating costs or passenger miles. Since ridership is particularly important, I used PMT and UPT to identify unreported data entered as zeros. There were only 12 instances where an agency reported positive PMT but no UPT. By contrast, there were numerous instances where an agency reported positive UPT but no PMT. I first coerced these 12 cases of UPT to be NAs. I then replaced zeros in the remaining columns (including PMT) with NAs whenever the agency reported positive UPT. Note that this does not remove any data, but it makes it much less likely that someone will generate bad numbers (by, for example, dividing total fare revenues or operating expenses by unlinked passenger trips or passenger miles).
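
For reference, the heart of step 3 amounts to something like the sketch below; the actual processing code linked above covers all of the remaining columns rather than just PMT.

# positive passenger miles but zero trips: the 12 cases treated as unreported
NTD.ts$UPT[which(NTD.ts$PMT > 0 & NTD.ts$UPT == 0)] <- NA
# positive trips but zero passenger miles: treat the zero as unreported, too
NTD.ts$PMT[which(NTD.ts$UPT > 0 & NTD.ts$PMT == 0)] <- NA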

The New Suburbs: Evolving travel behavior, the built environment, and subway investments in Mexico City

Mexico City is a suburban metropolis, yet most of its suburbs would be unfamiliar to urbanists accustomed to thinking about US metropolitan regions. Mexico City’s suburbs are densely populated—not thinly settled—and its residents rely primarily on informal transit rather than privately-owned automobiles for their daily transportation. These types of dense and transit-dependent suburbs have emerged as the fastest-growing form of human settlement in cities throughout Latin America, Asia, and Africa. Wealthier and at a later stage in its economic development than other developing-world metropolises, Mexico City is a compelling place to investigate the effects of rising incomes, increased car ownership, and transit investments in the dense, peripheral areas that have grown rapidly around informal transit in the past decades, and is a bellwether for cities like Dakar, Cairo, Lima, and Jakarta.

I begin this dissertation with a historical overview of the demographic, economic, and political trends that have helped shape existing urban form, transportation infrastructure, and travel behavior in Mexico City. Despite an uptick in car ownership and use, most households—both urban and suburban—continue to rely on public transportation. Furthermore, suburban Mexico City has lower rates of car ownership and use than its central areas. In subsequent chapters, I frame, pose, and investigate three interrelated questions about Mexico City’s evolving suburban landscape, the nature of households’ travel decisions, and the relationship between the built environment and travel behavior. Together, these inquiries tell a story that differs significantly from narratives about US suburbs, and provide insight into the future transportation needs and likely effects of land and transportation policy in these communities and others like them in Mexico and throughout the developing world.

First, how has the influence of the built environment on travel behavior changed as more households have moved into the suburbs and aggregate car use has increased? Using two large metropolitan household travel surveys from 1994 and 2007, I model two related-but-distinct household travel decisions: whether to drive on an average weekday, and if so, how far to drive. After controlling for income and other household attributes, I find that the influence of population and job density on whether a household undertakes any daily car trips is strong and has increased marginally over time. By contrast, high job and population densities have a much smaller influence on the total distance of weekday car travel that a household generates. For the subset of households whose members drive on a given weekday, job and population densities have no statistical effect at all. Contrary to expectations, a household’s distance from the urban center is strongly correlated with a lower probability of driving, even after controlling for income. This effect, however, appears to be diminishing over time, and when members of a household drive, they drive significantly more if they live farther from the urban center. The combination of informal transit, public buses, and the Metro has provided sufficient transit service to constrain car use in the densely populated suburban environments of Mexico City. Once suburban residents drive, however, they tend to drive a lot regardless of transit or the features of the built environment.

Second, how much are the recent trends of increased suburbanization, rising car-ownership, and the proliferation of massive commercially-built peripheral housing developments interrelated? To investigate this question, I first disentangle urban growth and car ownership trends by geographic area. The fastest-growing areas tend to be poorer and have had a much smaller impact on the size of the metropolitan car fleet than wealthier, more established neighborhoods in the center and western half of the metropolis. I then zoom in to examine several recent commercial housing developments. These developments, supported by publicly-subsidized mortgages, contain thousands of densely-packed, small, and modestly-priced housing units. Their residents remain highly reliant on public transportation, particularly informal transit, and the neighborhoods become less homogeneous over time as homeowners convert units and parking spaces to shops and offices. Finally, I use the 2007 household travel survey to model households’ intertwined decisions of where to live and whether to own a car. As expected, wealthier and smaller households are more likely to purchase vehicles. However, they prefer to live in more central areas where households with cars tend to drive shorter distances. If housing policy and production cannot adapt to provide more centrally-located housing, growing incomes will tend to increase car ownership but concentrate more of it in areas where car-owning households drive much farther.

Third, how has the Metro’s Line B, one of the first and only suburban high-capacity transit investments, influenced local and regional travel behavior and land use? To explore this question, I compare travel behavior and land use measures at six geographic scales, including the investment’s immediate catchment area, across two time periods: six years before and seven years after the investment opened. Line B, which opened in stages in 1999 and 2000, significantly expanded Metro coverage into the densely populated and fast-growing suburban municipality of Ecatepec. While the investment sparked a significant increase in local Metro use, most of this increase came from people relying on informal transit, rather than cars. While this shift reduced transit fares and increased transit speeds for local residents, it also increased government subsidies for the Metro and had no apparent effect on road speeds. Furthermore, the Metro remains highly dependent on informal transit to provide feeder service even within Ecatepec. In terms of land use, the investment increased density around the stations but appears to have had little to no effect on downtown commercial development, where it might have been expected to have a significant influence. In short, the effects of Line B demonstrate much of the promise and problem with expanding high capacity transit service into the suburbs. Ridership is likely to be high, but so too will be the costs and subsidies, while the effects on car ownership and urban form are likely to be modest.

View of Ecatepec from Line B (2012)

Individually, each chapter contributes to a specific body of transportation and planning literature drawn from the US as well as developing countries. Collectively, they point to a connection between land use and transportation in Mexico City that is different from the connection in the US and other rich-world cities. In particular, there is a physical disconnect between the generally suburban homes of transit users and the generally central location of high-capacity public transit. Addressing this disconnect by shifting housing production from the periphery to the center or by expanding high-capacity transit to the periphery would require significant amounts of time and public subsidy. Thus, contemporary policies to reduce car use or increase accessibility for the poor in the short and medium term would do well to focus on improving the flexible, medium-capacity informal transit around which the city’s dense and transit-dependent suburbs have grown and continue to grow.

http://www.uctc.net/research/UCTC-DISS-2013-01.pdf


Is a Half-Mile Circle the Right Standard for TODs?

Planners and researchers use transit catchment areas—the land around stations—as geographic units for predicting ridership, assessing the impacts of transit investments and, recently, for designing transit-oriented developments (TODs). In the US, a half-mile-radius circle has become the de facto standard for rail-transit catchment areas.

There is surprisingly little evidence to justify any particular catchment area. Why a half mile? Why not a quarter mile or two-fifths of a mile? Is there anything special about a half mile or is this simply a convenient figure that has become an industry standard? A half mile roughly corresponds to the distance someone can walk in 10 minutes at 3 miles per hour and is a common estimate for the distance people will walk to get to a rail station. The half-mile ring is a little more than 500 acres in size.
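
Both figures are easy to verify with a line of arithmetic, here in R:

3 * (10/60)        # miles walked in 10 minutes at 3 mph: 0.5
pi * 0.5^2 * 640   # acres in a half-mile-radius circle, at 640 acres per square mile: ~502.7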

Read more.


Accessibility, it’s not just a wonkish term for being near stuff

I was recently at a transportation conference where one of the speakers admonished academics for not doing more to get their research out of the academy and into the hands of policy makers and the public. He compared this with think tanks, which spend around a quarter of their budgets on marketing.

Perhaps even more than marketing budgets, inaccessible language, research methods, and publication practices keep too many books and articles locked away in the ivory tower. While simplifying methods may not always be desirable (though it often is), we could certainly do a lot more to make our writing and publications accessible.

I recently had the opportunity to rewrite an article with Robert Cervero in the Journal of the American Planning Association into an ACCESS article. If you click on the former, you will likely be directed to an abstract and the opportunity to purchase the article. Unless you have an institutional proxy server that grants you free access to the PDF, you might instead look for the free draft working paper that’s available through the University of California Transportation Center. That working paper, however, did not benefit from JAPA’s peer review process (many thanks to David Sawicki, Randy Crane, and three anonymous reviewers), which led us to make serious modifications to our data collection and analysis methods. In short, it shows a much earlier iteration of a work in progress.

If you click on the ACCESS link, by contrast, you will be taken directly to a 2,000-word rewrite and summary of the full JAPA article. You can view it as HTML, download a PDF, or order a paper subscription. If you have additional questions about the findings or how the study was conducted, the article provides the JAPA citation. ACCESS’s goal “…is to translate academic research into readable prose that is useful for policymakers and practitioners. Articles in ACCESS are intended to catapult academic research into debates about public policy, and convert knowledge into action.”

Since the ACCESS piece came out, I have received far more inquiries about the work, Robert Cervero was on the local news, and a friend working in New Zealand pointed me to this discussion. This kind of feedback has been important to me as a young, and hopefully budding, academic. I’m grateful to Don Shoup and the rest of the ACCESS editing team for their hard work. I’ll strive to remember that accessibility is a lot more than a measure of the desired destinations, opportunities, and experiences that can be reached in a given amount of time.


Informal Public Transportation Networks in Three Indonesian Cities

“As Indonesian cities today become more prosperous, the demand for mobility among the urban poor is rapidly growing. This is nowhere more the case than in Jakarta – each day city streets become frozen with congestion. Government efforts to supply public transportation are insufficient at best. Informal transportation providers driving ojeks are a common sight weaving among unmoving traffic, simultaneously offering a faster route and contributing to congestion. Growth is also coming to other medium-sized Indonesian cities like Jogja and Solo in Central Java and Palembang in Sumatra—and so is the traffic. Yet while motorized transport is increasing in these cities, the environmental problems of congestion and pollution have not reached the scale of Indonesia’s biggest cities. In fact, as this report describes, informal public transportation offers potential alternatives to the negative stresses of growth on urban transportation systems as well as innovative approaches to provide service to people in poverty.

This report looks at informal public transportation (IPT) from different perspectives and reconsiders its value not just in improving urban mobility, but also as a provider of employment and backbone of the informal economy.”
Read More
