## Queretaro Bike Planning Studio

Last fall, I took eight graduate students to Queretaro, Mexico, for a bicycle planning studio. The final planning document is available here: CYCLE_QRO_Small

A few group images from the trip:

## Planning for cars that drive themselves

Driverless cars are coming. Nearly all major car manufacturers and several large technology firms are racing to develop vehicles that can operate with limited-to-no human interaction. Despite likely technological, legal, and political pitfalls along the way, fully autonomous vehicles (buses and cars) will probably be commercially available and driving themselves on city streets and highways in the next 20 years—well within a long-range planning horizon.

## Examining Philadelphia land use data by Census Tract

The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.

Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value is aggregated at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and city hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.

You can also find a description of the underlying tax assessor data here.
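If the download is a CSV, `read.csv` is the usual route into RStudio. A minimal sketch — the temp file below just stands in for the downloaded file, whose actual name is not given here:

```r
# Stand-in for the downloaded file; substitute the real path/name
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(fips = 42101000100, tract = 1), tmp, row.names = FALSE)

dat <- read.csv(tmp)  # for the lab: dat <- read.csv("your-downloaded-file.csv")
dim(dat)
```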

It is a fairly wide dataset (77 columns), so start by looking at the variable names and numbers.

names(dat)

##  [1] "fips"                       "tract"
##  [3] "sfr.built.sf"               "mfr.built.sf"
##  [5] "retail.built.sf"            "comm.built.sf"
##  [7] "indust.built.sf"            "vacant.built.sf"
##  [9] "exempt.built.sf"            "sfr.lot.sf"
## [11] "mfr.lot.sf"                 "retail.lot.sf"
## [13] "comm.lot.sf"                "indust.lot.sf"
## [15] "vacant.lot.sf"              "exempt.lot.sf"
## [17] "sfr.mark.val"               "mfr.mark.val"
## [19] "retail.mark.val"            "comm.mark.val"
## [21] "indust.mark.val"            "vacant.mark.val"
## [23] "exempt.mark.val"            "total.built.sf"
## [25] "total.lot.sf"               "total.mark.val"
## [27] "ft.2.city.hall"             "ft.2.train.st"
## [29] "ft.2.bus.st"                "pop.2010"
## [31] "pop_density.2010"           "square_miles.2010"
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"
## [35] "pop_asian.2010"             "pop_hispanic.2010"
## [37] "households.2010"            "households_family.2010"
## [39] "households_nonfamily.2010"  "avg_hh_size.2010"
## [41] "pop_25_plus.2010"           "edu_less_highschool.2010"
## [43] "edu_highschool.2010"        "edu_collegeplus.2010"
## [45] "pop_civ_employed.2010"      "pop_civ_unemployed.2010"
## [47] "median_hh_income.2010"      "per_capita_income.2010"
## [49] "housing_vacant.2010"        "housing_occupied.2010"
## [51] "housing_median_age.2010"    "percent_poverty.2010"
## [53] "pop_change.2000"            "pop_plus10.2000"
## [55] "pop.2000"                   "pop_density.2000"
## [57] "square_miles.2000"          "pop_white_nonhispanic.2000"
## [59] "pop_black.2000"             "pop_asian.2000"
## [61] "pop_hispanic.2000"          "households.2000"
## [63] "households_family.2000"     "households_nonfamily.2000"
## [65] "avg_hh_size.2000"           "pop_25_plus.2000"
## [67] "edu_less_highschool.2000"   "edu_highschool.2000"
## [69] "edu_collegeplus.2000"       "pop_civ_employed.2000"
## [71] "pop_civ_unemployed.2000"    "median_hh_income.2000"
## [73] "per_capita_income.2000"     "housing_vacant.2000"
## [75] "housing_occupied.2000"      "housing_median_age.2000"
## [77] "percent_poverty.2000"


Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.

summary(dat[c(3:26)])

##   sfr.built.sf      mfr.built.sf     retail.built.sf  comm.built.sf
##  Min.   :      0   Min.   :      0   Min.   :     0   Min.   :       0
##  1st Qu.: 848570   1st Qu.: 144324   1st Qu.: 22700   1st Qu.:   52053
##  Median :1429116   Median : 270901   Median : 71180   Median :  120155
##  Mean   :1432991   Mean   : 395834   Mean   : 93474   Mean   :  357440
##  3rd Qu.:1970797   3rd Qu.: 461973   3rd Qu.:133074   3rd Qu.:  239220
##  Max.   :4357472   Max.   :3099629   Max.   :780672   Max.   :19805858
##  indust.built.sf   vacant.built.sf    exempt.built.sf
##  Min.   :      0   Min.   :   0.000   Min.   :    3771
##  1st Qu.:      0   1st Qu.:   0.000   1st Qu.:  178846
##  Median :  32572   Median :   0.000   Median :  319638
##  Mean   : 272199   Mean   :   8.906   Mean   :  712816
##  3rd Qu.: 198236   3rd Qu.:   0.000   3rd Qu.:  683018
##  Max.   :6372334   Max.   :1520.000   Max.   :28206707
##    sfr.lot.sf         mfr.lot.sf      retail.lot.sf     comm.lot.sf
##  Min.   :       0   Min.   :      0   Min.   :     0   Min.   :       0
##  1st Qu.:  955739   1st Qu.: 115597   1st Qu.: 23083   1st Qu.:   79158
##  Median : 1845223   Median : 251603   Median : 54802   Median :  168255
##  Mean   : 2476646   Mean   : 400486   Mean   : 66522   Mean   :  497081
##  3rd Qu.: 3139156   3rd Qu.: 516614   3rd Qu.: 89680   3rd Qu.:  455709
##  Max.   :33126247   Max.   :3026616   Max.   :633754   Max.   :17741673
##  indust.lot.sf      vacant.lot.sf      exempt.lot.sf
##  Min.   :       0   Min.   :       0   Min.   :   29248
##  1st Qu.:       0   1st Qu.:   68008   1st Qu.:  348156
##  Median :   31122   Median :  167437   Median :  746535
##  Mean   :  548498   Mean   :  448650   Mean   : 1803958
##  3rd Qu.:  212020   3rd Qu.:  415549   3rd Qu.: 1628277
##  Max.   :26713019   Max.   :13823323   Max.   :40611433
##   sfr.mark.val        mfr.mark.val       retail.mark.val
##  Min.   :        0   Min.   :        0   Min.   :       0
##  1st Qu.: 17434950   1st Qu.:  2777300   1st Qu.:  522825
##  Median : 40370150   Median :  6816500   Median : 1359150
##  Mean   : 48218979   Mean   : 13362086   Mean   : 2320078
##  3rd Qu.: 70492725   3rd Qu.: 14893450   3rd Qu.: 2729175
##  Max.   :249615900   Max.   :248868900   Max.   :32522100
##  comm.mark.val       indust.mark.val     vacant.mark.val
##  Min.   :0.000e+00   Min.   :        0   Min.   :       0
##  1st Qu.:1.369e+06   1st Qu.:        0   1st Qu.:  214075
##  Median :3.658e+06   Median :   393350   Median :  481950
##  Mean   :1.887e+07   Mean   :  4313410   Mean   : 1652835
##  3rd Qu.:9.132e+06   3rd Qu.:  2400625   3rd Qu.: 1139750
##  Max.   :1.671e+09   Max.   :154563400   Max.   :36376000
##  exempt.mark.val     total.built.sf      total.lot.sf
##  Min.   :5.657e+05   Min.   :   17886   Min.   :  605741
##  1st Qu.:6.623e+06   1st Qu.: 2190500   1st Qu.: 2539507
##  Median :1.379e+07   Median : 2830458   Median : 4162716
##  Mean   :4.344e+07   Mean   : 3264763   Mean   : 6241842
##  3rd Qu.:3.132e+07   3rd Qu.: 3534784   3rd Qu.: 6811800
##  Max.   :1.502e+09   Max.   :28206707   Max.   :91325669
##  total.mark.val
##  Min.   :4.546e+06
##  1st Qu.:5.499e+07
##  Median :9.375e+07
##  Mean   :1.322e+08
##  3rd Qu.:1.392e+08
##  Max.   :2.286e+09
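Beyond tract-by-tract summaries, citywide totals give useful context — for example, the share of all lot area that sits vacant. A sketch using the dataset's column names but invented values:

```r
# Toy stand-in for dat; the real values come from the lab dataset
dat <- data.frame(vacant.lot.sf = c(68008, 167437, 415549),
                  total.lot.sf  = c(2539507, 4162716, 6811800))

# Citywide share of lot area that is vacant
sum(dat$vacant.lot.sf) / sum(dat$total.lot.sf)
```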


There are also four variables related to geographic location. The fips codes can be matched to a census tract shapefile; the others measure the number of feet between a census tract centroid and city hall, the closest train station, and the closest bus stop.

summary(dat[c(1,27:29)])

##       fips          ft.2.city.hall  ft.2.train.st      ft.2.bus.st
##  Min.   :4.21e+10   Min.   : 2881   Min.   :  258.3   Min.   :   6.36
##  1st Qu.:4.21e+10   1st Qu.:16596   1st Qu.: 4087.1   1st Qu.: 265.63
##  Median :4.21e+10   Median :28692   Median : 6777.2   Median : 628.73
##  Mean   :4.21e+10   Mean   :31776   Mean   : 7443.2   Mean   : 781.29
##  3rd Qu.:4.21e+10   3rd Qu.:43484   3rd Qu.: 9832.9   3rd Qu.:1083.03
##  Max.   :4.21e+10   Max.   :87673   Max.   :26020.5   Max.   :7140.68
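Since these distances are in feet, dividing by 5,280 puts them in more familiar miles — the median tract centroid, for instance, is about 5.4 miles from city hall. A small sketch with a few of the values from the summary above:

```r
# A few of the city hall distances (in feet) from the summary above
ft.2.city.hall <- c(2881, 28692, 87673)
ft.2.city.hall / 5280  # convert feet to miles
```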


The remaining data are taken from the 2000 Census and the 2010 5-year ACS. Note that not all of the 2000 data match, and there are several null values in different categories.
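A quick way to see where those nulls live is to count the missing values in each column; a sketch on a toy two-column frame (for the lab data the call is simply `colSums(is.na(dat))`):

```r
# Toy stand-in: one tract is missing its 2000 population
dat <- data.frame(pop.2010 = c(2329, 4008, 5630),
                  pop.2000 = c(2400, NA, 5500))
colSums(is.na(dat))  # NA count per column
```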

Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.

sum(dat$square_miles.2010)

## [1] 134.1014

sum(dat$total.lot.sf) / 27.8784e6  # divided by the number of square feet in a square mile

## [1] 85.97578
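The 27.8784e6 constant is just the number of square feet in a square mile: 5,280 feet per mile, squared.

```r
5280^2  # square feet in one square mile
## [1] 27878400
```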


Now start creating and examining some key ratios, like total FAR (floor area ratio). There is significant variation, with some tracts having almost no built area and others a high net FAR.

dat$far <- dat$total.built.sf / dat$total.lot.sf
summary(dat$far)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##  0.005557  0.419600  0.696300  0.875800  1.052000 13.330000
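With the ratio in hand, `which.max` picks out the extreme tracts; a sketch with invented tract IDs and FARs (in the lab, dat already has tract and far columns after the step above):

```r
# Toy stand-in with invented values
dat <- data.frame(tract = c(101, 102, 103),
                  far   = c(0.0056, 0.70, 13.33))
dat$tract[which.max(dat$far)]  # the tract with the highest FAR
```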


Try excluding vacant land from the FAR calculation.

summary(dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##  0.005557  0.423300  0.748200  0.943900  1.165000 14.650000

And look at how vacancy varies across census tracts.

hist(dat$vacant.lot.sf / dat$total.lot.sf)

Now, create some net FAR calculations by land use type.

summary(dat$comm.built.sf / dat$comm.lot.sf)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##  0.00376  0.33490  0.61070  1.09600  1.16400 24.78000        9

summary(dat$indust.built.sf / dat$indust.lot.sf)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##  0.00711  0.59420  1.04900  1.44700  1.66800 16.04000      104

summary(dat$retail.built.sf / dat$retail.lot.sf)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.0355  0.9365  1.3690  1.3830  1.8030  4.3190      29

summary(dat$exempt.built.sf / dat$exempt.lot.sf)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##  0.001703  0.209100  0.522600  0.783200  1.033000  7.203000

It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understanding how much space new households are likely to consume. Note the substantial variation.

dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010
dat$res.sf.per.hh[dat$res.sf.per.hh == Inf] <- NA  # n/0 == Inf, so drop tracts with zero households
summary(dat$res.sf.per.hh)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   104.1  1092.0  1220.0  1237.0  1367.0  4538.0       8


This looks about right, with most households having between 1,000 and 1,400 square feet.

hist(dat$res.sf.per.hh)

And per capita:

dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010
dat$res.sf.per.person[dat$res.sf.per.person == Inf] <- NA
hist(dat$res.sf.per.person)


Look at the relationship between the population and square footage of housing.

plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )

Perhaps surprisingly, there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood.

plot(dat$res.sf.per.person, dat$ft.2.city.hall)

There is a bit more of a pattern between location and the proportion of housing that is multifamily.

dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf)
plot(dat$per.mf, dat$ft.2.city.hall)

Which land uses tend to go together? Use a correlation table to examine total square footage.

cor(dat[c(3:7,9)])

##                 sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## sfr.built.sf       1.0000000 -0.227687793     0.208504423   -0.21049419
## mfr.built.sf      -0.2276878  1.000000000     0.009489829    0.45513983
## retail.built.sf    0.2085044  0.009489829     1.000000000    0.04310543
## comm.built.sf     -0.2104942  0.455139826     0.043105428    1.00000000
## indust.built.sf   -0.1466598 -0.135759385    -0.038551580    0.04383247
## exempt.built.sf   -0.2613878  0.103988183    -0.049405209    0.16346712
##                 indust.built.sf exempt.built.sf
## sfr.built.sf        -0.14665984     -0.26138775
## mfr.built.sf        -0.13575939      0.10398818
## retail.built.sf     -0.03855158     -0.04940521
## comm.built.sf        0.04383247      0.16346712
## indust.built.sf      1.00000000      0.06878225
## exempt.built.sf      0.06878225      1.00000000

And land area.

cor(dat[c(10:16)])

##                  sfr.lot.sf   mfr.lot.sf retail.lot.sf comm.lot.sf
## sfr.lot.sf      1.000000000  0.174062104   -0.01660241  0.03719301
## mfr.lot.sf      0.174062104  1.000000000   -0.06027767  0.12670109
## retail.lot.sf  -0.016602407 -0.060277674    1.00000000 -0.01112620
## comm.lot.sf     0.037193013  0.126701092   -0.01112620  1.00000000
## indust.lot.sf  -0.071661971 -0.009073780   -0.10092578  0.39273649
## vacant.lot.sf  -0.013189068  0.009647197   -0.05241089  0.71002277
## exempt.lot.sf   0.000523062  0.011982805   -0.16823137  0.37194754
##               indust.lot.sf vacant.lot.sf exempt.lot.sf
## sfr.lot.sf      -0.07166197  -0.013189068   0.000523062
## mfr.lot.sf      -0.00907378   0.009647197   0.011982805
## retail.lot.sf   -0.10092578  -0.052410890  -0.168231370
## comm.lot.sf      0.39273649   0.710022767   0.371947536
## indust.lot.sf    1.00000000   0.643286320   0.489299242
## vacant.lot.sf    0.64328632   1.000000000   0.531508105
## exempt.lot.sf    0.48929924   0.531508105   1.000000000

Neither of these measures normalizes the data. Now compare how the proportion of area in a given land use compares with the proportion in other land uses.

cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))

##              mfr.built.sf
## sfr.built.sf   -0.2779901

cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))

##              retail.built.sf
## sfr.built.sf       0.1160276

cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))

##              comm.built.sf
## sfr.built.sf    -0.4443231

cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))

##              indust.built.sf
## sfr.built.sf      -0.3305134

cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))

##              vacant.built.sf
## sfr.built.sf     0.008721208

cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))

##              exempt.built.sf
## sfr.built.sf      -0.6608287

Try using regressions to better understand the partial correlations between land uses.

summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf, dat))

## 
## Call:
## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf + 
##     comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1548586  -500194   -17272   498795  2940319 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.550e+06  6.833e+04  22.686  < 2e-16 ***
## mfr.built.sf    -3.486e-01  1.005e-01  -3.469 0.000582 ***
## retail.built.sf  1.605e+00  3.848e-01   4.172 3.76e-05 ***
## comm.built.sf   -5.509e-02  3.127e-02  -1.762 0.078859 .  
## indust.built.sf -1.539e-01  6.400e-02  -2.405 0.016662 *  
## exempt.built.sf -8.591e-02  1.978e-02  -4.344 1.80e-05 ***
## vacant.lot.sf   -1.419e-02  3.871e-02  -0.367 0.714091    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728300 on 377 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1661 
## F-statistic: 13.71 on 6 and 377 DF,  p-value: 4.216e-14

EXERCISE

1. Estimate a gross measure of total FAR using Census land area and plot this against net FAR. Describe any important variation.
2. Try developing a model to predict neighborhood FAR. What are the important and statistically significant predictors?
3. Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses?

## Planning Methods: Predicting changes in jobs by Zip Code

For this lab, upload the Zip Code data from the previous lab. This should include the following files with the following dimensions. You can also download the data here.

dim(zbp02)

## [1] 2207   24

dim(zbp12)

## [1] 2125   22

The purpose of this lab is to reinforce the regression and visualization skills learned in lab 2 and lab 3 using the Zip Business Patterns data. Start by summarizing the 2002 data, which you will use to predict job changes in 2012.

summary(zbp02)

##       ZIP         jobs_plus10        jobs.tot           jobs.23
##  Min.   :15001   Min.   :    2.5   Min.   :    2.50   Min.   :   0.0
##  1st Qu.:16010   1st Qu.:   71.0   1st Qu.:   72.25   1st Qu.:   2.5
##  Median :17235   Median :  411.0   Median :  387.50   Median :  22.0
##  Mean   :17318   Mean   : 2654.3   Mean   : 2597.33   Mean   : 138.6
##  3rd Qu.:18618   3rd Qu.: 2327.0   3rd Qu.: 2234.25   3rd Qu.: 126.5
##  Max.   :19980   Max.   :79883.5   Max.   :77848.50   Max.   :4476.5
##                  NA's   :82        NA's   :40         NA's   :40
##     jobs.31           jobs.42          jobs.44           jobs.48
##  Min.   :    0.0   Min.   :   0.0   Min.   :    0.0   Min.   :   0.00
##  1st Qu.:    0.0   1st Qu.:   0.0   1st Qu.:    2.5   1st Qu.:   0.00
##  Median :   34.5   Median :   7.5   Median :   31.0   Median :   7.00
##  Mean   :  357.2   Mean   : 128.8   Mean   :  377.0   Mean   :  73.68
##  3rd Qu.:  341.5   3rd Qu.:  74.5   3rd Qu.:  273.2   3rd Qu.:  47.25
##  Max.   :10076.0   Max.   :5025.5   Max.   :10491.0   Max.   :5435.00
##  NA's   :40        NA's   :40       NA's   :40        NA's   :40
##     jobs.51           jobs.52            jobs.53           jobs.54
##  Min.   :   0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.0
##  1st Qu.:   0.00   1st Qu.:    0.00   1st Qu.:   0.00   1st Qu.:    0.0
##  Median :   0.00   Median :    7.00   Median :   0.00   Median :    5.0
##  Mean   :  72.96   Mean   :  155.58   Mean   :  39.46   Mean   :  158.4
##  3rd Qu.:  14.50   3rd Qu.:   50.25   3rd Qu.:  17.50   3rd Qu.:   57.0
##  Max.   :4203.50   Max.   :15671.50   Max.   :1739.50   Max.   :22421.0
##  NA's   :40        NA's   :40         NA's   :40        NA's   :40
##     jobs.56           jobs.61           jobs.62           jobs.71
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.00
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.:   0.0   1st Qu.:   0.00
##  Median :   5.0   Median :   0.00   Median :  16.5   Median :   0.00
##  Mean   : 145.6   Mean   :  66.51   Mean   : 362.8   Mean   :  41.55
##  3rd Qu.:  56.0   3rd Qu.:   8.50   3rd Qu.: 208.2   3rd Qu.:  17.00
##  Max.   :9158.5   Max.   :3734.50   Max.   :8389.5   Max.   :2858.00
##  NA's   :40       NA's   :40        NA's   :40       NA's   :40
##     jobs.72           jobs.81           jobs.95            jobs.99
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :  0.000
##  1st Qu.:   0.0   1st Qu.:   2.5   1st Qu.:   0.00   1st Qu.:  0.000
##  Median :  19.5   Median :  19.5   Median :   0.00   Median :  0.000
##  Mean   : 213.5   Mean   : 139.4   Mean   :  26.79   Mean   :  1.355
##  3rd Qu.: 154.2   3rd Qu.: 114.8   3rd Qu.:   0.00   3rd Qu.:  0.000
##  Max.   :5445.5   Max.   :4624.5   Max.   :1961.00   Max.   :214.500
##  NA's   :40       NA's   :40       NA's   :40        NA's   :40
##     jobs.21            jobs.11          jobs.22            jobs.55
##  Min.   :   0.000   Min.   :  0.00   Min.   :   0.00   Min.   :   0.0
##  1st Qu.:   0.000   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:   0.0
##  Median :   0.000   Median :  0.00   Median :   0.00   Median :   0.0
##  Mean   :   8.461   Mean   :  2.05   Mean   :  18.45   Mean   :  69.4
##  3rd Qu.:   0.000   3rd Qu.:  0.00   3rd Qu.:   0.00   3rd Qu.:   2.5
##  Max.   :1538.500   Max.   :194.00   Max.   :1785.50   Max.   :6602.0
##  NA's   :40         NA's   :40       NA's   :40        NA's   :40

Like with the census tract data, not all of the Zip Codes match across the two years, resulting in missing data. Plot the relationship between the number of jobs in 2002 and 2012.

plot(zbp02$jobs.tot, zbp02$jobs_plus10)

And look at the distribution of jobs across Zip Codes.
Most Zip Codes have very few jobs, with a long tail of job-rich Zip Codes.

plot(density(zbp02$jobs_plus10, na.rm = T))


hist(zbp02$jobs_plus10)

Look at a specific Zip Code (19104).

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP == 19104], zbp02$jobs.tot[zbp02$ZIP == 19104], col = "red")

And all Philadelphia Zip Codes, plus a few others. These Zip Codes do not look particularly different from others in PA.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP > 19018 & zbp02$ZIP < 19256],
       zbp02$jobs.tot[zbp02$ZIP > 19018 & zbp02$ZIP < 19256], col = "red")

Now, predict the 2012 jobs as a function of the number of jobs in 2002. The regression line fits the data almost perfectly.

reg1 <- lm(jobs_plus10 ~ jobs.tot, zbp02)
summary(reg1)

## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.tot, data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7887.2  -158.2   -82.6    26.7 10332.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.743743  26.051260   3.176  0.00151 ** 
## jobs.tot     0.967799   0.004028 240.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1081 on 2083 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9652, Adjusted R-squared:  0.9652 
## F-statistic: 5.774e+04 on 1 and 2083 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
abline(reg1)

It also looks like there might be some systematic differences in prediction quality by the number of jobs.

plot(predict(reg1), resid(reg1))
abline(h = 0, col = 3, lty = 3)

Try to see which job types are more or less likely to predict job losses or job increases.
summary( lm(jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02) )

## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + 
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + 
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + 
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      1.48264    0.11445  12.954  < 2e-16 ***
## jobs.31      0.86957    0.03842  22.633  < 2e-16 ***
## jobs.42      1.75718    0.11463  15.329  < 2e-16 ***
## jobs.44      0.90575    0.05998  15.100  < 2e-16 ***
## jobs.48      0.66433    0.10652   6.237 5.41e-10 ***
## jobs.51      0.68900    0.13370   5.153 2.80e-07 ***
## jobs.52      0.59305    0.06731   8.811  < 2e-16 ***
## jobs.53      2.82325    0.31606   8.933  < 2e-16 ***
## jobs.54      1.51710    0.07039  21.552  < 2e-16 ***
## jobs.56      0.17040    0.08420   2.024 0.043112 *  
## jobs.61      1.47967    0.11901  12.433  < 2e-16 ***
## jobs.62      0.85000    0.04409  19.281  < 2e-16 ***
## jobs.71      0.75343    0.20036   3.760 0.000174 ***
## jobs.72      1.52875    0.12252  12.478  < 2e-16 ***
## jobs.81      0.44943    0.18831   2.387 0.017095 *  
## jobs.95      0.74847    0.19179   3.903 9.82e-05 ***
## jobs.99      5.25699    4.88099   1.077 0.281591    
## jobs.21      1.13877    0.36736   3.100 0.001962 ** 
## jobs.11      3.63400    2.12568   1.710 0.087495 .  
## jobs.22      0.88365    0.22376   3.949 8.11e-05 ***
## jobs.55      0.11145    0.11360   0.981 0.326664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711 
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16

Most of the parameter estimates are statistically different from zero with a high degree of confidence, but that is not a very useful finding. You really want to know whether job types are statistically different from 1 (i.e., are certain job types more or less likely to be decreasing over time?). You can roughly approximate this by looking at the coefficient estimate and the standard error: if the coefficient estimate plus or minus two standard errors crosses 1, then the estimate is not statistically different from one with 95% confidence. You can also set up the regression to compare the coefficient estimates against the total number of jobs in 2002.

summary( lm(jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02, offset = 1.00*jobs.tot) )

## 
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + 
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + 
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + 
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02, 
##     offset = 1 * jobs.tot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262    
## jobs.48     -0.33567    0.10652  -3.151 0.001650 ** 
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *  
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608    
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 ** 
## jobs.95     -0.25153    0.19179  -1.311 0.189836    
## jobs.99      4.25699    4.88099   0.872 0.383223    
## jobs.21      0.13877    0.36736   0.378 0.705657    
## jobs.11      2.63400    2.12568   1.239 0.215437    
## jobs.22     -0.11635    0.22376  -0.520 0.603127    
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711 
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16

Each job in real estate (sector 53) in a Zip Code in 2002 correlates with another 1.82 jobs in 2012. Better yet, use the types of jobs in 2002 to predict the net change in jobs from 2002 to 2012.
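The two-standard-error rule of thumb can also be checked with `confint()`, which returns 95% confidence intervals for each coefficient; an interval that excludes 1 marks a sector whose jobs grow faster or slower than one-for-one. A sketch on simulated data (the variable names echo the lab's, but the values are invented):

```r
set.seed(1)
# Simulated stand-in for zbp02: 2012 jobs grow ~20% from the 2002 base
toy <- data.frame(jobs.tot = runif(200, 10, 1000))
toy$jobs_plus10 <- 1.2 * toy$jobs.tot + rnorm(200, sd = 20)

reg <- lm(jobs_plus10 ~ jobs.tot, toy)
ci  <- confint(reg)   # 95% confidence intervals for the coefficients
ci["jobs.tot", ]      # this interval excludes 1, so growth is detectable
```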
reg2 <- lm(jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
summary(reg2)

## 
## Call:
## lm(formula = jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 + 
##     jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + 
##     jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + 
##     jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, 
##     data = zbp02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7705.6  -159.0   -51.4    39.6  9224.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.75265   25.06739   1.905 0.056923 .  
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262    
## jobs.48     -0.33567    0.10652  -3.151 0.001650 ** 
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *  
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608    
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 ** 
## jobs.95     -0.25153    0.19179  -1.311 0.189836    
## jobs.99      4.25699    4.88099   0.872 0.383223    
## jobs.21      0.13877    0.36736   0.378 0.705657    
## jobs.11      2.63400    2.12568   1.239 0.215437    
## jobs.22     -0.11635    0.22376  -0.520 0.603127    
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.2035, Adjusted R-squared:  0.1954 
## F-statistic:  25.1 on 21 and 2063 DF,  p-value: < 2.2e-16

Note how all the parameter estimates are the same, but there is now a much lower and much more useful R-squared. Instead of showing that Zip Codes with a lot of jobs in 2002 tend to have a lot of jobs in 2012, the model now shows that the number and types of jobs in each Zip Code can be used to predict whether a Zip Code gained or lost jobs. Instead of explaining 97% of the variation in the relationship, the model now claims to explain about 20%. The distribution of this dependent variable also looks a lot more normal, with most Zip Codes not having changed very much.

plot(density(zbp02$jobs_plus10 - zbp02$jobs.tot, na.rm = T))

The residual plot also looks a bit more homoscedastic.

plot(predict(reg2), resid(reg2))
abline(h = 0, col = 3, lty = 3)

EXERCISE

1. Try to make the most parsimonious model that does a good job of predicting the change in jobs from 2002 to 2012. Use a full-model vs. reduced-model ANOVA test to compare this model to the fully specified model (with all job types) and describe the results.
2. Try predicting the percent change in jobs instead of the absolute change. Compare and contrast the two models. Which model do you prefer and why? Hint: use the identity function to generate the percent change variable inside of your regression: I(jobs_plus10/jobs.tot-1). Input industry codes as a percentage of total jobs: I(jobs.23/jobs.tot).
3. What types of jobs are the best indicators of job growth and job losses? Describe the quantitative relationship as expressed by the two models in question 2. Does this contradict or concur with your expectations?
4. Look at the differences in jobs in sector 99 across the two time periods. What do you think is happening?
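For the percent-change idea in exercise 2, the `I()` identity function lets you build the transformed outcome and share-of-total predictors right inside the formula; a sketch on simulated data (the lab's zbp02 is not reproduced here):

```r
set.seed(2)
# Simulated stand-in using the lab's variable names
toy <- data.frame(jobs.tot = runif(100, 50, 500),
                  jobs.23  = runif(100, 0, 50))
toy$jobs_plus10 <- toy$jobs.tot * (1 + rnorm(100, 0, 0.1))

# Percent change in jobs regressed on construction's share of 2002 jobs
reg.pct <- lm(I(jobs_plus10 / jobs.tot - 1) ~ I(jobs.23 / jobs.tot), toy)
coef(reg.pct)
```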
## Planning Methods: Predicting Census Tract change

In this lab, you will use Philadelphia's 2000 Census Tract data to make predictions about the 2010 and 2020 population, and to understand what types of Census Tracts have been growing or shrinking. Download the data here (password is PennDesign). You can load it by double clicking on the file or using the load command. Start by summarizing the data:

summary(dat2000)

##     Geo_FIPS          pop_change         pop_plus10        pop
##  Min.   :4.21e+10   Min.   :-1310.00   Min.   :   0   Min.   :    0
##  1st Qu.:4.21e+10   1st Qu.: -394.50   1st Qu.:2721   1st Qu.: 2329
##  Median :4.21e+10   Median :  -85.00   Median :3774   Median : 4008
##  Mean   :4.21e+10   Mean   :  -43.79   Mean   :3919   Mean   : 3983
##  3rd Qu.:4.21e+10   3rd Qu.:  309.50   3rd Qu.:4987   3rd Qu.: 5630
##  Max.   :4.21e+10   Max.   : 2311.00   Max.   :9022   Max.   :10479
##                     NA's   :215        NA's   :106    NA's   :109
##   pop_density     square_miles      pop_white_nonhispanic   pop_black
##  Min.   :    0   Min.   :0.06491   Min.   :   0          Min.   :   0
##  1st Qu.: 8462   1st Qu.:0.18056   1st Qu.: 114          1st Qu.: 166
##  Median :15692   Median :0.25657   Median : 821          Median : 903
##  Mean   :16120   Mean   :0.35457   Mean   :1695          Mean   :1721
##  3rd Qu.:22543   3rd Qu.:0.37694   3rd Qu.:2901          3rd Qu.:2959
##  Max.   :54824   Max.   :3.26893   Max.   :8948          Max.   :8468
##  NA's   :109     NA's   :109       NA's   :109           NA's   :109
##    pop_asian       pop_hispanic      households    households_family
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0
##  1st Qu.:  13.0   1st Qu.:  33.0   1st Qu.: 852   1st Qu.: 428.0
##  Median :  46.0   Median :  89.0   Median :1502   Median : 899.0
##  Mean   : 177.6   Mean   : 336.7   Mean   :1549   Mean   : 924.8
##  3rd Qu.: 168.0   3rd Qu.: 230.0   3rd Qu.:2187   3rd Qu.:1323.0
##  Max.   :2625.0   Max.   :6802.0   Max.   :6115   Max.   :2738.0
##  NA's   :109      NA's   :109      NA's   :109    NA's   :109
##  households_nonfamily   avg_hh_size     pop_25_plus   edu_less_highschool
##  Min.   :   0         Min.   :0.000   Min.   :   0   Min.   :   0.0
##  1st Qu.: 303         1st Qu.:2.220   1st Qu.:1366   1st Qu.: 286.0
##  Median : 529         Median :2.520   Median :2422   Median : 661.0
##  Mean   : 624         Mean   :2.424   Mean   :2536   Mean   : 729.9
##  3rd Qu.: 796         3rd Qu.:2.760   3rd Qu.:3624   3rd Qu.:1023.0
##  Max.   :4806         Max.   :3.600   Max.   :7155   Max.   :2733.0
##  NA's   :109          NA's   :109     NA's   :109    NA's   :109
##  edu_highschool   edu_collegeplus   pop_civ_employed   pop_civ_unemployed
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0       Min.   :   0.0
##  1st Qu.: 362.0   1st Qu.: 119.0   1st Qu.: 780       1st Qu.:  75.0
##  Median : 789.0   Median : 286.0   Median :1433       Median : 158.0
##  Mean   : 845.3   Mean   : 453.1   Mean   :1535       Mean   : 187.9
##  3rd Qu.:1260.0   3rd Qu.: 557.0   3rd Qu.:2225       3rd Qu.: 266.0
##  Max.   :2809.0   Max.   :5735.0   Max.   :5692       Max.   :1678.0
##  NA's   :109      NA's   :109      NA's   :109        NA's   :109
##  median_hh_income   per_capita_income   housing_vacant   housing_occupied
##  Min.   :     0    Min.   :     0      Min.   :  0.0    Min.   :   0
##  1st Qu.: 21500    1st Qu.: 10817      1st Qu.: 55.0    1st Qu.: 852
##  Median : 29839    Median : 15615      Median :149.0    Median :1502
##  Mean   : 32002    Mean   : 17604      Mean   :188.7    Mean   :1549
##  3rd Qu.: 39528    3rd Qu.: 20613      3rd Qu.:279.0    3rd Qu.:2187
##  Max.   :200001    Max.   :109633      Max.   :834.0    Max.   :6115
##  NA's   :109       NA's   :109         NA's   :109      NA's   :109
##  housing_median_age   percent_poverty
##  Min.   :   0       Min.   :0.00000
##  1st Qu.:1939       1st Qu.:0.09552
##  Median :1942       Median :0.18480
##  Mean   :1885       Mean   :0.21014
##  3rd Qu.:1952       3rd Qu.:0.30020
##  Max.   :1983       Max.   :0.84337
##  NA's   :109        NA's   :120

Notice how there are a number of missing data entries. This is because Census Tract numbers and borders change over time, sometimes combining and sometimes splitting. We have data for 106 2000 Census Tracts that are missing in 2010, and 109 2010 Census Tracts that are missing in 2000. There are several ways to combine Census Tracts to make them consistent over time. This website provides tools to estimate Census data from 1970 to 2000 within the 2010 boundaries. For the purposes of this exercise, we will exclude the non-matching tracts.
Look at the relationship between tract populations in 2000 and 2010.

plot(dat2000$pop, dat2000$pop_plus10)

The density and histogram functions are convenient ways to understand the distribution of the data (in this case the 2010 population in each Census Tract). The population is roughly normally distributed and ranges from around zero to 10,000.

plot(density(dat2000$pop_plus10, na.rm = T))

hist(dat2000$pop_plus10)

Before proceeding, compare the Census Tracts that are consistent between 2000 and 2010 with those that are not. As you would expect, low population and high population Census Tracts were the most likely to change boundaries.

hist(dat2000$pop[is.na(dat2000$pop_plus10)==T])

hist(dat2000$pop[is.na(dat2000$pop_plus10)==F])

How well can you predict the 2010 Census Tract population from the 2000 Census Tract population?

reg1 <- lm(pop_plus10~pop, dat2000)
summary(reg1)

##
## Call:
## lm(formula = pop_plus10 ~ pop, data = dat2000)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1271.96  -357.73   -45.54   355.51  2334.68
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -92.07364   88.72101  -1.038      0.3
## pop           1.01166    0.01986  50.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 552 on 273 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9045
## F-statistic:  2595 on 1 and 273 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that the 2000 population explains about 90% of the variation in the 2010 population. The following shows how well the regression line fits the data.

plot(dat2000$pop, dat2000$pop_plus10)
abline(reg1)

Furthermore, each person in 2000 correlates with 1.01 people in 2010, though with a standard error of 0.02, this is hardly a ringing endorsement of population growth. Plotting the population change from 2000 to 2010 shows that almost the same number of Census Tracts gained and lost population.

plot(density(dat2000$pop_change, na.rm = T))


Plotting the error term shows that the regression line fits the low population tracts a bit better than the high population tracts. (Note how closely this tracks with the population change number, since the prediction of the 2010 population relies entirely on the 2000 population.)

plot(density(resid(reg1)))


Plotting the predicted population against the error terms reveals some additional patterns to help understand the regression. Ideally this pattern would look much more random with points equally and randomly distributed around the zero line.

plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)


One way to try to deal with under-predicting the more populous Census Tracts is to introduce a quadratic term.

reg2 <- lm(pop_plus10~pop + I(pop^2), dat2000)
summary(reg2)

##
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2), data = dat2000)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1228.67  -338.37   -48.65   365.88  2334.17
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.354e+02  1.750e+02   2.488 0.013460 *
## pop         7.114e-01  8.863e-02   8.027 3.02e-14 ***
## I(pop^2)    3.587e-05  1.033e-05   3.473 0.000599 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 541.2 on 272 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.9082
## F-statistic:  1356 on 2 and 272 DF,  p-value: < 2.2e-16


Does this fit the data better? A full-model vs. reduced-model ANOVA test suggests that it does. So does the small improvement in the adjusted R-squared.

anova(reg1,reg2)

## Analysis of Variance Table
##
## Model 1: pop_plus10 ~ pop
## Model 2: pop_plus10 ~ pop + I(pop^2)
##   Res.Df      RSS Df Sum of Sq     F    Pr(>F)
## 1    273 83185653
## 2    272 79653918  1   3531735 12.06 0.0005991 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The residual plot looks better…slightly…perhaps…

plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)


Try testing some more interesting hypotheses. For example, do you expect higher income neighborhoods to grow more or less quickly than lower income neighborhoods? You could probably construct a reasonable theory in either direction, but my guess is that wealthier Census Tracts in Philadelphia were the most likely to grow.

reg3 <- lm(pop_plus10 ~ pop + I(pop^2) + I(dat2000$median_hh_inc/1000), dat2000)
summary(reg3)

##
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(dat2000$median_hh_inc/1000),
##     data = dat2000)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1167.48  -361.12   -48.03   332.20  2313.85
##
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   2.836e+02  1.959e+02   1.448 0.148838
## pop                           7.188e-01  8.843e-02   8.128 1.56e-14 ***
## I(pop^2)                      3.512e-05  1.030e-05   3.408 0.000753 ***
## I(dat2000$median_hh_inc/1000) 4.375e+00  2.570e+00   1.702 0.089840 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 539.3 on 271 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9098, Adjusted R-squared:  0.9088
## F-statistic: 911.2 on 3 and 271 DF,  p-value: < 2.2e-16

Every $1,000 more in median income correlates with about 4 more people in 2010, not a tremendously strong relationship. Furthermore, the coefficient is not statistically different from 0 with 95% confidence (though it is with 90% confidence).

It also may be correlated with other variables that are associated with increasing or decreasing population, such as race.

reg4 <-  lm(pop_plus10~ pop + I(pop^2)+ I(median_hh_income/1000) + I(100*pop_black/pop), dat2000)
summary(reg4)

##
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(median_hh_income/1000) +
##     I(100 * pop_black/pop), data = dat2000)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1306.97  -327.63    -4.05   289.08  2213.82
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               6.204e+02  2.224e+02   2.789 0.005667 **
## pop                       7.208e-01  9.168e-02   7.862 9.19e-14 ***
## I(pop^2)                  3.365e-05  1.062e-05   3.170 0.001703 **
## I(median_hh_income/1000) -4.582e-01  2.871e+00  -0.160 0.873310
## I(100 * pop_black/pop)   -3.354e+00  9.524e-01  -3.521 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 528.9 on 269 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.9122, Adjusted R-squared:  0.9109
## F-statistic: 698.6 on 4 and 269 DF,  p-value: < 2.2e-16


Each percentage point higher concentration of black residents in a Census Tract in 2000 correlates with about 3 fewer people in 2010. Next, use the predict function with this regression to estimate Philadelphia's 2010 population:

sum(predict(reg4))

## [1] 1126700


This is quite a bit lower than the actual 2010 population count:

sum(dat2000$pop_plus10, na.rm=T)

## [1] 1504950

The regression, however, excludes the hundred or so Census Tracts that changed between 2000 and 2010. Adding these back makes the prediction much closer to the actual population.

sum(predict(reg4, newdata = dat2000), na.rm =T)

## [1] 1543117

Finally, use a similar command to estimate the 2020 population by applying our regression equation to the 2010 Census data.

sum(predict(reg4, newdata = dat2010), na.rm =T)

## [1] 1488331

EXERCISE

1. Despite recent population growth, the model predicts that population will decline between 2010 and 2020. Do you agree with this prediction? What do you think is causing it?
2. Replace median income with a measure of the proportion of the population over 25 with a college degree or higher (edu_collegeplus). Do you expect neighborhoods with a higher proportion of well-educated residents to have grown or lost population between 2000 and 2010? Describe the results of the model in relation to your expectations.
3. Does this change improve the model? Be sure to consider statistical fit, residual plots, and theory in providing your answer.
4. Develop a different set of models predicting population change (pop_change) instead of the population ten years later (pop_plus10). What factors do you expect to be good predictors of total population change? Describe the model, model fits, and residual plots.

Posted in Planning Methods

## Planning Methods: working with Zip Business Patterns data

In this lab, you will download data on jobs by industry from the Zip Business Patterns (ZBP) data. You can find the data on the US Census website. Be sure to download the data as a CSV file. For now, download the full 2002 and 2012 data. The links are: 2002 & 2012. You want the Complete ZIP Code Industry Detail File. Be sure to read the descriptions of the data and data collection methods on the website. The FAQ may also prove useful.
The objectives of the lab are to learn to work with ZBP data using the midpoint method and to learn a few data processing commands in R. Specifically, we will take subsets of data and do a simple data merge. Although this may seem like a complicated way to do some simple file management and calculations, the payoffs are big. By the end of the lab, you should be able to create midpoint estimates of jobs for all US Zip Codes by any available NAICS code or subset of NAICS codes with just a few changes to your code.

To load the data:

1) Set your working directory to wherever you saved your files. Note that the data are stored a bit differently across the two years.
2) Run the following two lines of code:

zbp02 <- read.csv("zbp02detail.txt")
zbp12 <- read.csv("zbp12detail.txt")

This will take a moment to load, particularly if you're working on a home computer. To make sure that your data loaded properly, check that they have the same dimensions as below,

dim(zbp02)

## [1] 3251827      12

dim(zbp12)

## [1] 3179204      12

look the same,

head(zbp02)

##   ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499 N500_999
## 1 501 ------  11    9    1      0      1      0        0        0        0
## 2 501 23----   1    1    0      0      0      0        0        0        0
## 3 501 235310   1    1    0      0      0      0        0        0        0
## 4 501 42----   3    3    0      0      0      0        0        0        0
## 5 501 421990   1    1    0      0      0      0        0        0        0
## 6 501 422920   1    1    0      0      0      0        0        0        0
##   N1000
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0

head(zbp12)

##    zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1  501 ------   2    2    0      0      0      0        0        0
## 2  501 81----   2    2    0      0      0      0        0        0
## 3  501 813110   2    2    0      0      0      0        0        0
## 4 1001 ------ 453  206   77     75     55     22       14        3
## 5 1001 22----   2    1    0      0      1      0        0        0
## 6 1001 221310   1    1    0      0      0      0        0        0
##   n500_999 n1000
## 1        0     0
## 2        0     0
## 3        0     0
## 4        1     0
## 5        0     0
## 6        0     0

and have the same column types. Most importantly, only NAICS should be a character or factor.

str(zbp02)

## 'data.frame':    3251827 obs. of  12 variables:
##  $ ZIP     : int  501 501 501 501 501 501 501 501 501 501 ...
##  $ NAICS   : Factor w/ 1104 levels "------","11----",..: 1 59 73 562 599 627 631 632 655 859 ...
##  $ EST     : int  11 1 1 3 1 1 1 1 1 2 ...
##  $ N1_4    : int  9 1 1 3 1 1 1 1 1 2 ...
##  $ N5_9    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N10_19  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N20_49  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ N50_99  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N100_249: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N250_499: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N500_999: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...

str(zbp12)

## 'data.frame':    3179204 obs. of  12 variables:
##  $ zip     : int  501 501 501 1001 1001 1001 1001 1001 1001 1001 ...
##  $ naics   : Factor w/ 999 levels "------","11----",..: 1 950 986 1 48 60 61 63 65 67 ...
##  $ est     : int  2 2 2 453 2 1 1 47 1 2 ...
##  $ n1_4    : int  2 2 2 206 1 1 0 30 1 1 ...
##  $ n5_9    : int  0 0 0 77 0 0 0 7 0 0 ...
##  $ n10_19  : int  0 0 0 75 0 0 0 3 0 0 ...
##  $ n20_49  : int  0 0 0 55 1 0 1 7 0 1 ...
##  $ n50_99  : int  0 0 0 22 0 0 0 0 0 0 ...
##  $ n100_249: int  0 0 0 14 0 0 0 0 0 0 ...
##  $ n250_499: int  0 0 0 3 0 0 0 0 0 0 ...
##  $ n500_999: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ n1000   : int  0 0 0 0 0 0 0 0 0 0 ...

If, for some reason, one of the other variables did not load as you want, you can change it like so:

zbp02$EST <- as.numeric(zbp02$EST) ##as.integer() is also fine.

Note that the 2002 data include the same Zip Codes multiple times. These are the counts by different sectors. (The 2012 data only include totals.) To better understand this structure, look at the first Zip Code from the 2002 data:

subset(zbp02, ZIP == 501)

##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 3  501 235310   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 5  501 421990   1    1    0      0      0      0        0        0
## 6  501 422920   1    1    0      0      0      0        0        0
## 7  501 422990   1    1    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 9  501 445110   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 11 501 541110   1    1    0      0      0      0        0        0
## 12 501 541211   1    1    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
## 14 501 621111   1    1    0      0      0      0        0        0
## 15 501 81----   2    0    1      0      1      0        0        0
## 16 501 813110   2    0    1      0      1      0        0        0
## 17 501 99----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 3         0     0
## 4         0     0
## 5         0     0
## 6         0     0
## 7         0     0
## 8         0     0
## 9         0     0
## 10        0     0
## 11        0     0
## 12        0     0
## 13        0     0
## 14        0     0
## 15        0     0
## 16        0     0
## 17        0     0

The NAICS column includes the total number of establishments across all sectors [NAICS == "------"], by two-digit industry code, and by six-digit code. EST is the total number of establishments, and the columns to the right are the number of establishments by number of employees (1 to 4, 5 to 9, etc.). To check that the numbers add up, try adding the establishments by sector.

zbp02$EST[zbp02$NAICS == "------" & zbp02$ZIP == 501]

## [1] 11


And check it against the sum of the other columns.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501] )

## [1] 21

Whoops! What happened? The data include both six-digit and two-digit counts, which led to some double counting. Try including only NAICS codes that end in "----" with the substring command.

sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501 & substring(zbp02$NAICS,3,6)== "----"] )

## [1] 11

To clean up the data a bit, create a new file that includes only the 2-digit codes and totals. If you want to save a more complete file for future use, you can skip this step. Note that no matter what you do, the original CSV files remain unchanged unless you specifically overwrite them. Thus you can run and rerun your code without ever changing the underlying data.

zbp02 <- subset(zbp02, substring(zbp02$NAICS,3,6)=="----")
zbp12 <- subset(zbp12, substring(zbp12$naics,3,6)=="----")

Again, check your dimensions against the below to make sure that the commands worked.

dim(zbp02)

## [1] 478497     12

dim(zbp12)

## [1] 454477     12

And look at the data.

head(zbp02)

##    ZIP  NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1  501 ------  11    9    1      0      1      0        0        0
## 2  501 23----   1    1    0      0      0      0        0        0
## 4  501 42----   3    3    0      0      0      0        0        0
## 8  501 44----   1    1    0      0      0      0        0        0
## 10 501 54----   2    2    0      0      0      0        0        0
## 13 501 62----   1    1    0      0      0      0        0        0
##    N500_999 N1000
## 1         0     0
## 2         0     0
## 4         0     0
## 8         0     0
## 10        0     0
## 13        0     0

head(zbp12)

##     zip  naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1   501 ------   2    2    0      0      0      0        0        0
## 2   501 81----   2    2    0      0      0      0        0        0
## 4  1001 ------ 453  206   77     75     55     22       14        3
## 5  1001 22----   2    1    0      0      1      0        0        0
## 8  1001 23----  47   30    7      3      7      0        0        0
## 27 1001 31----  48   10    8      9      9      6        3        3
##    n500_999 n1000
## 1         0     0
## 2         0     0
## 4         1     0
## 5         0     0
## 8         0     0
## 27        0     0

Now, use the midpoint method to estimate the number of jobs by establishment size: multiply the number of establishments in each size class by the midpoint of that class. Note that the assumed average size of a 1-to-4 employee establishment is 2.5 employees. For establishments with 1,000 or more employees, the midpoint method sets the number of employees to 1,000. (Do the 2002 data in steps and then 2012 all at once.) The within() command is one of several nifty ways to avoid writing zbp02$ before each variable.

zbp02 <- within(zbp02, jobs <- N1_4 *2.5 + N5_9 *7 + N10_19 * 14.5 +
N20_49 * 34.5 +  N50_99 *74.5 +
N100_249 * 174.5 + N250_499 *374.5 +
N500_999 * 749.5 +  N1000 *1000)
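As a quick sanity check on the midpoint arithmetic, consider the ZIP 501 totals row shown earlier: 9 establishments with 1-4 employees, 1 with 5-9, and 1 with 20-49.

```r
## Jobs from the three occupied size classes: 9*2.5 + 1*7 + 1*34.5
9 * 2.5 + 1 * 7 + 1 * 34.5
## [1] 64
```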


For this lab, drop all non-PA Zip Codes to conserve space. (You may want to go back and keep the other states at some point.) PA's Zip Codes run between 15000 and 19999.

zbp02<- subset(zbp02, as.numeric(zbp02$ZIP) >= 15000 & as.numeric(zbp02$ZIP) <= 19999)


And get rid of the variables that are not needed.

keepers <- c("ZIP", "NAICS", "jobs")
zbp02 <- zbp02[keepers]


And finally clean up the labeling of the NAICS codes.

zbp02$NAICS <- substring(zbp02$NAICS,1,2)
zbp02$NAICS[zbp02$NAICS=="--"] <- "tot"


Have a look

head(zbp02)

##          ZIP NAICS   jobs
## 452595 14001   tot 3180.5
## 452596 14001    23  224.0
## 452610 14001    31 1383.0
## 452629 14001    42  820.5
## 452638 14001    44  189.0
## 452654 14001    48   20.0


This type of data is often referred to as long data. There are multiple entries for each Zip Code and the NAICS column indicates the industry. For future work, it is going to be a little easier to have the data in a wide format, which you can do with ease using the reshape() command.

zbp02 <- reshape(zbp02,
timevar = "NAICS",
idvar = c("ZIP"),
direction = "wide")


Have a look

head(zbp02)

##          ZIP jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48 jobs.51
## 452595 14001   3180.5   224.0  1383.0   820.5   189.0    20.0       5
## 452714 14003     12.0      NA      NA      NA     7.0      NA      NA
## 452721 14004   2270.5   292.0   467.0    39.0   527.5    60.5      14
## 452854 14005    167.0     7.5    34.5     9.5    31.0    37.0      NA
## 452887 14006   1898.5    92.0   489.5    71.5   295.0    96.5      19
## 452991 14008     51.5    10.0      NA      NA    12.5      NA      NA
##        jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 452595   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 452714      NA      NA      NA      NA      NA      NA      NA     2.5
## 452721    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 452854     2.5      NA      NA     2.5      NA     2.5      NA      NA
## 452887    85.0       7    31.5    22.0      NA    97.0    39.5   440.0
## 452991      NA      NA      NA     2.5      NA    14.5      NA     7.0
##        jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 452595    56.0      49     7.5      NA      NA      NA      NA
## 452714     2.5      NA      NA      NA      NA      NA      NA
## 452721   169.5      NA      NA     9.5      NA      NA      NA
## 452854    19.0      NA      NA    14.0       7      NA      NA
## 452887   106.0      NA     7.0      NA      NA      NA      NA
## 452991     5.0      NA      NA      NA      NA      NA      NA


The default setting sets missing entries to NA, but you should convert these to zero, since they represent zero reported establishments rather than unknown values.

Note that zbp02[is.na(zbp02)==T] <- 0 is a logical statement that replaces every value in the dataset that is currently NA with zero. If you only wanted to do this for the job totals, you could write: zbp02$jobs.tot[is.na(zbp02$jobs.tot)==T] <- 0

zbp02[is.na(zbp02)==T] <- 0


The following goes through the same steps for the 2012 data. Try to go through it slowly and understand each step of the process. Use ?command to get a description of functions if needed. Use internet searches as well.

zbp12<- subset(zbp12, as.numeric(zbp12$zip) >= 14000 & as.numeric(zbp12$zip) <= 19999)

#calculate job totals
zbp12 <- within(zbp12, jobs <- n1_4 *2.5 + n5_9 *7 + n10_19 * 14.5 +
n20_49 * 34.5 +  n50_99 *74.5 +
n100_249 * 174.5 + n250_499 *374.5 +
n500_999 * 749.5 +  n1000 *1000)

zbp12$ZIP <- zbp12$zip ##to make consistent with the 02 data
zbp12$NAICS <- zbp12$naics
##long to wide

zbp12 <- zbp12[keepers] #Keepers should still exist
zbp12$NAICS <- substring(zbp12$NAICS,1,2)
zbp12$NAICS[zbp12$NAICS=="--"] <- "tot"

#long to wide
zbp12 <- reshape(zbp12,
timevar = "NAICS",
idvar = c("ZIP"),
direction = "wide")

#switch NA to zero
zbp12[is.na(zbp12)==T] <- 0


Finally, add the total number of jobs from 2010 to the 2000 dataset to use for making predictions in a later lab. Create a new dataset so that zbp12 does not change.

zipmerge <- zbp12[c("ZIP", "jobs.tot")]
zipmerge$jobs_plus10 <- zipmerge$jobs.tot
zipmerge <- zipmerge[c("ZIP", "jobs_plus10")]

zbp02 <- merge(zipmerge, zbp02, by ="ZIP", all.x = T, all.y =T)
rm(zipmerge, keepers)
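A quick check on the merge (a sketch, not part of the original lab): counting the NAs in jobs_plus10 shows how many 2002 Zip Codes have no matching 2012 record.

```r
## Zip Codes present in 2002 but absent in 2012 get NA for jobs_plus10
sum(is.na(zbp02$jobs_plus10))
```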


Look at the file. This time, keep the NAs, since they do in fact represent missing data. For example, there was no Zip Code 14003 in 2012.

head(zbp02)

##     ZIP jobs_plus10 jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48
## 1 14001      3015.0   3180.5   224.0  1383.0   820.5   189.0    20.0
## 2 14003          NA     12.0     0.0     0.0     0.0     7.0     0.0
## 3 14004      2368.0   2270.5   292.0   467.0    39.0   527.5    60.5
## 4 14005       182.5    167.0     7.5    34.5     9.5    31.0    37.0
## 5 14006      1370.0   1898.5    92.0   489.5    71.5   295.0    96.5
## 6 14008        73.0     51.5    10.0     0.0     0.0    12.5     0.0
##   jobs.51 jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 1       5   103.5      27    72.5    56.5     5.0    41.5     7.5   113.0
## 2       0     0.0       0     0.0     0.0     0.0     0.0     0.0     2.5
## 3      14    82.0      12   107.0    36.5    39.5   155.0     2.5   257.0
## 4       0     2.5       0     0.0     2.5     0.0     2.5     0.0     0.0
## 5      19    85.0       7    31.5    22.0     0.0    97.0    39.5   440.0
## 6       0     0.0       0     0.0     2.5     0.0    14.5     0.0     7.0
##   jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 1    56.0      49     7.5     0.0       0       0       0
## 2     2.5       0     0.0     0.0       0       0       0
## 3   169.5       0     0.0     9.5       0       0       0
## 4    19.0       0     0.0    14.0       7       0       0
## 5   106.0       0     7.0     0.0       0       0       0
## 6     5.0       0     0.0     0.0       0       0       0


If desired, save this as an R data file. In the future, you can double click on the file to open the 2002 and 2012 job estimates by 2-digit NAICS for PA.

save(zbp02, zbp12,file = "zip_jobs.Rda")


As you process your own data in the future, you will learn a number of different commands and packages to help you along the way. There are always dozens of ways to do the same thing. If you are having trouble getting the program to do what you want, it often makes sense to think about the problem differently. For example, maybe you want a nice summary table of jobs by industry and by state, but you are having trouble getting a single command to produce it. Try instead to do each operation separately, and then combine the results.
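For instance, a by-group total can be built from two small steps. The sketch below sums the wide-format manufacturing column (jobs.31, as created above) within 3-digit Zip Code prefixes; the prefix grouping is purely illustrative:

```r
## Step 1: a grouping variable -- the first 3 digits of the (zero-padded) Zip Code
zbp02$zip3 <- substring(formatC(zbp02$ZIP, width = 5, flag = "0"), 1, 3)

## Step 2: sum manufacturing jobs within each group
tapply(zbp02$jobs.31, zbp02$zip3, sum, na.rm = TRUE)
```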

EXERCISE

1. Look up the 2-digit NAICS code for Manufacturing. Which Zip Codes had the largest gains and losses (both total and proportional)? Now make a summary table by state that includes total estimated manufacturing jobs in 2002 and 2012 and the total and proportional change.
2. Make the same table as above but only include Zip Codes that have 100 or more jobs. Describe differences between the tables.
3. Make a table of the number of establishments that have 1000+ employees and the total number of establishments by 2-digit NAICS in 2012. (Hint: either create two new datasets–one for 1000+ establishments by 2-digit NAICS, one for all establishments by 2-digit NAICS–and then merge them, or use the sum() command on subsets of the data.) What industries have the highest proportion of 1000+ establishments?
4. Make the same table as in question 3, but list the estimated number of jobs instead of establishments. Describe similarities and differences between the two tables.

Posted in Planning Methods

This page describes how to load and look at data in R. First, we will download a CSV of UN population estimates for central African cities to import into R: Central Africa West.

The first thing to do is to let R know where to find the CSV. You can do this by navigating through the bottom right window in R Studio (Files/More/Set As Working Directory) or setting your working directory using the setwd command.

Note how, if you run read.csv without assigning the result to an object, the data are read and printed directly to the command console. Next we will create a new object called pop so that we can call on the data more easily and interact with it.

Try: pop <- read.csv("Central Africa West.csv")

NB you can use = instead of <- but it is better coding practice to save the = sign for other uses.

You can also load data in Rstudio using a point-and-click interface. To set your working directory, click on Session at the top of the window and then click Set Working Directory/Choose Directory. To import a data file, click on File/Import Dataset.

To look at the data, try cutting and pasting the following commands into the command console and hitting enter.

str(pop)

## 'data.frame':    17 obs. of  12 variables:
##  $ Year                   : int  1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 ...
##  $ Period                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Douala                 : int  95 117 153 205 298 433 571 740 940 1184 ...
##  $ Yaounde                : int  32 49 75 112 183 292 415 578 777 1025 ...
##  $ Libreville             : int  15 17 29 49 77 155 234 293 366 439 ...
##  $ Brazzaville            : int  83 92 124 172 238 329 446 596 704 830 ...
##  $ Pointe_Noire           : int  16 35 64 89 116 154 217 300 363 439 ...
##  $ Sub_Saharan_300kplus   : int  3923 4909 7083 10779 16335 24143 34813 47767 62327 75996 ...
##  $ Central_Africa_300kplus: int  3660 4521 5651 7047 8921 11495 14465 18043 22566 28525 ...
##  $ Cameroon_urban         : int  417 557 747 1011 1375 2112 2851 3761 4787 5930 ...
##  $ Congo_urban            : int  201 253 320 408 522 672 860 1086 1295 1535 ...
##  $ Gabon_urban            : int  54 68 87 126 189 279 397 515 655 814 ...


This tells you the structure of the data. In this case, all of the variables are integers, but it is also common to see characters, factors, and other types of data.

names(pop)

##  [1] "Year"                    "Period"
##  [3] "Douala"                  "Yaounde"
##  [5] "Libreville"              "Brazzaville"
##  [7] "Pointe_Noire"            "Sub_Saharan_300kplus"
##  [9] "Central_Africa_300kplus" "Cameroon_urban"
## [11] "Congo_urban"             "Gabon_urban"


This command gives the names of all the variables in the data. The city, country, and region columns contain population estimates in thousands.

summary(pop)

##       Year          Period       Douala          Yaounde
##  Min.   :1950   Min.   : 1   Min.   :  95.0   Min.   :  32.0
##  1st Qu.:1970   1st Qu.: 5   1st Qu.: 205.0   1st Qu.: 112.0
##  Median :1990   Median : 9   Median : 571.0   Median : 415.0
##  Mean   :1990   Mean   : 9   Mean   : 804.8   Mean   : 693.8
##  3rd Qu.:2010   3rd Qu.:13   3rd Qu.:1184.0   3rd Qu.:1025.0
##  Max.   :2030   Max.   :17   Max.   :2361.0   Max.   :2349.0
##                              NA's   :4        NA's   :4
##    Libreville     Brazzaville      Pointe_Noire   Sub_Saharan_300kplus
##  Min.   : 15.0   Min.   :  83.0   Min.   : 16.0   Min.   :  3923
##  1st Qu.: 49.0   1st Qu.: 172.0   1st Qu.: 89.0   1st Qu.: 16335
##  Median :234.0   Median : 446.0   Median :217.0   Median : 62327
##  Mean   :258.5   Mean   : 575.3   Mean   :293.1   Mean   : 91980
##  3rd Qu.:439.0   3rd Qu.: 830.0   3rd Qu.:439.0   3rd Qu.:137283
##  Max.   :631.0   Max.   :1574.0   Max.   :815.0   Max.   :300153
##  NA's   :4       NA's   :4        NA's   :4
##  Central_Africa_300kplus Cameroon_urban   Congo_urban    Gabon_urban
##  Min.   :  3660          Min.   :  417   Min.   : 201   Min.   :  54.0
##  1st Qu.:  8921          1st Qu.: 1375   1st Qu.: 522   1st Qu.: 189.0
##  Median : 22566          Median : 4787   Median :1295   Median : 655.0
##  Mean   : 34795          Mean   : 6835   Mean   :1723   Mean   : 819.6
##  3rd Qu.: 51883          3rd Qu.:10625   3rd Qu.:2600   3rd Qu.:1334.0
##  Max.   :107747          Max.   :20492   Max.   :4804   Max.   :2122.0
##


Another good way to get a sense for the data is to look at the first or last entries, using the head or tail commands.

head(pop)

##   Year Period Douala Yaounde Libreville Brazzaville Pointe_Noire
## 1 1950      1     95      32         15          83           16
## 2 1955      2    117      49         17          92           35
## 3 1960      3    153      75         29         124           64
## 4 1965      4    205     112         49         172           89
## 5 1970      5    298     183         77         238          116
## 6 1975      6    433     292        155         329          154
##   Sub_Saharan_300kplus Central_Africa_300kplus Cameroon_urban Congo_urban
## 1                 3923                    3660            417         201
## 2                 4909                    4521            557         253
## 3                 7083                    5651            747         320
## 4                10779                    7047           1011         408
## 5                16335                    8921           1375         522
## 6                24143                   11495           2112         672
##   Gabon_urban
## 1          54
## 2          68
## 3          87
## 4         126
## 5         189
## 6         279
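Beyond tables, a quick plot is often the fastest way to see the data. For example (a suggestion, not part of the original text), plot one of the city series over time:

```r
## Line plot of Douala's population (in thousands) by year
plot(pop$Year, pop$Douala, type = "l",
     xlab = "Year", ylab = "Population (thousands)")
```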


For help with any of these commands, use the help function by typing ? before the command name. For example, try typing ?head into the command console.

EXERCISE

Walk through this brief introduction to R and R Studio.

## Exploring US bus ridership with the National Transit Database

I frequently use the National Transit Database for analysis or quick summaries of key information about public transportation in the US. I also have my students use it to perform a back-of-the-envelope cost-benefit analysis for a proposed light rail system in one of my transportation planning courses. In both instances, I find the downloadable Excel sheets to be a bit clunky and slow to manipulate. Most recently I wanted a quick comparison of bus ridership per route kilometer for a system in Indonesia. As usual, I downloaded the Excel sheets and spent about an hour playing with the data until I got a single reasonable answer. As usual, I'd have to do another bit of work if I wanted to change anything. Instead, I decided to clean up a subset of the data and rearrange it into a long-format panel. For a description of the data and some of the choices made (like when to make an entry a 0 or an NA), see here.

I’m also hoping that the clean data set will encourage my students to choose to work with the data in R. I use R in my Planning Methods and Transportation Planning courses because:

1. Data sets are ever larger and more readily available. Many are useful to planners and for planning. The ability to manage and work with large amounts of data is an important skill for planners, particularly transportation planners.
2. It’s a great program for managing, visualizing, and analyzing data. Many of its primary competitors are only really good at the third. Poor data management can be embarrassing.
3. It has a strong user community that builds add-on packages to accomplish a range of tasks, provides tutorials, and answers questions online. It’s also frequently used in free, online courses and is a program that grows with users as they learn more. Here are some links to great free courses from professors at Johns Hopkins and Stanford.
4. It’s free, so even the most cash-strapped municipality has access.

Unfortunately, the learning curve can be steep for new users. Processing data, in particular, can be slow and is rarely immediately rewarding. New users are unlikely to see much benefit to downloading, cleaning, and formatting data just to use it in a class assignment. I wouldn’t either.

Now back to my original question: how does the number of passengers per kilometer in my Indonesian city stack up against ridership in US cities?

After loading the file, this code does the trick. I only include directly operated routes (Service == "DO") because privately operated routes (Service == "PT") are less accurately reported. Unlinked passenger trips (UPT) aren't the same thing as total bus passengers (riders who transfer once count as two unlinked trips), but they're collected in much the same way that corridor bus counts in Indonesia would likely have been. I multiply directional route miles (DRM) by 1.61 to convert to kilometers and divide by two because DRM counts one mile of bidirectional service as two miles; the result is closer to what the Indonesian counts measure.
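The unit arithmetic is easy to sanity-check with made-up numbers (these are illustrative, not from the NTD):

```r
# Toy example: a system reporting 3,650,000 unlinked trips per year
# over 100 directional route miles.
UPT <- 3650000   # unlinked passenger trips per year
DRM <- 100       # directional route miles

daily.trips <- UPT / 365         # 10,000 trips per day
route.km    <- (1.61 * DRM) / 2  # 100 directional mi = 50 route mi = 80.5 route km
daily.trips / route.km           # roughly 124 daily trips per route-km
```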

summary(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
##    0.2915   14.9500   32.9000   67.7300   64.3100 1582.0000       162

And this provides even more information about the distribution of ridership (unlinked passenger trips, anyway) per kilometer of bus route:

plot(density(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ), na.rm=T))

And this tells me how many systems perform better than my Indonesian system, which carries only 16 passengers per kilometer.

table(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ) < 16)

##
## FALSE  TRUE
##   317   158

Before going any further, I recommend trying these three commands to get a better feel for the data:

head(NTD.ts)
str(NTD.ts)
summary(NTD.ts)

In the interest of space, I'm excluding the outputs.

One of the handiest features of R is the ease with which it handles multiple data sets, so I'm going to add a couple of variables and then create a new data set that only looks at buses in 2013 that are directly operated by public agencies. I switched from Stata to R halfway into my dissertation (an otherwise foolish decision) because this made it so much easier to interact with and combine several large data sets.

NTD.ts$passenger.km <- (NTD.ts$UPT/365) / ((1.61*NTD.ts$DRM)/2)
NTD.ts$fare.recovery <- NTD.ts$FARES / NTD.ts$OPEXP_TOTAL
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0 & Service == "DO")

Now let’s plot passengers per kilometer against fare recovery.

plot(new.dat$fare.recovery, new.dat$passenger.km)

I’m a bit suspicious of the systems drawing more fare revenue than they spend on operations, so I want to look at them more closely: two University systems, an airport shuttle, and one I’m not sure about.

subset(new.dat, new.dat$fare.recovery > 1)

##       Year TRS.ID Mode Service Last.Report.Year
## 58601 2013   2166   MB      DO             2013
## 59799 2013   2132   MB      DO             2013
## 59988 2013   4180   MB      DO             2013
## 60326 2013   8107   MB      DO             2013
##                                                    System.Name
## 58601                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799               New Jersey Transit Corporation-45(NJTC-45)
## 59988                University of Georgia Transit System(UGA)
## 60326 The University of Montana - ASUM Transportation(ASUM OT)
##       Small.Systems.Waiver      City State Census.Year
## 58601                    N Elizabeth    NJ        2010
## 59799                    N    Newark    NJ        2010
## 59988                    N    Athens    GA        2010
## 60326                    N  Missoula    MT        2010
##                        UZA.Name UZA UZA.Area.SQ.Miles UZA.Population
## 58601 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59799 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59988  Athens-Clarke County, GA 249                98         128754
## 60326              Missoula, MT 348                45          82157
##       OPEXP_TOTAL OPEXP_VO OPEXP_VM OPEXP_NVM OPEXP_GA    FARES VOMS
## 58601    15077606  9014072  4403391    708772   951371 16883604   52
## 59799     8299771  5067517  2285398    217667   729189 10034922   41
## 59988     5355484  4052395   791332     52669   459088  6281720   47
## 60326      598190   327957   164685     13613    91935   601851    6
##           VRM    VRH  DRM      UPT      PMT CPI passenger.km fare.recovery
## 58601 1750683 191541 69.8 10294583 32942666   1     501.9548      1.119780
## 59799 1079382 135429 68.9  5504389  8461884   1     271.8950      1.209060
## 59988  853644 114959 96.0 11058944  4202399   1     392.0610      1.172951
## 60326  153163  11578  6.3   421694   737965   1     227.8076      1.006120

In case I want to look at only a few items:

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28, 29)], new.dat$fare.recovery > 1)
##       TRS.ID                                              System.Name
## 58601   2166                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799   2132               New Jersey Transit Corporation-45(NJTC-45)
## 59988   4180                University of Georgia Transit System(UGA)
## 60326   8107 The University of Montana - ASUM Transportation(ASUM OT)
##       OPEXP_TOTAL  DRM      UPT      PMT passenger.km fare.recovery
## 58601    15077606 69.8 10294583 32942666     501.9548      1.119780
## 59799     8299771 68.9  5504389  8461884     271.8950      1.209060
## 59988     5355484 96.0 11058944  4202399     392.0610      1.172951
## 60326      598190  6.3   421694   737965     227.8076      1.006120

Which systems are busiest?

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28, 29)], new.dat$passenger.km > 750)

##       TRS.ID                                                      System.Name
## 58641   2008                                  MTA New York City Transit(NYCT)
## 59050   5066                                   Chicago Transit Authority(CTA)
## 59876   5158 University of Michigan Parking and Transportation Services(UMTS)
##       OPEXP_TOTAL    DRM       UPT        PMT passenger.km fare.recovery
## 58641  2487134393 1659.0 770962014 1614997081    1581.6043     0.3417458
## 59050   764280757 1311.2 300116357  728561319     778.9902     0.3909879
## 59876     6995443   20.7   7263234   20293476    1194.1832     0.2824377

I might also want to see how DRM and UPT look against each other, since these two numbers form the basis of the calculation. It looks an awful lot like fare recovery against productivity.

plot(new.dat$DRM, new.dat$UPT)

Lastly, I want to look at how passengers per kilometer has changed over time. First I'll edit new.dat to include the other years. Then I'll do a box plot by year.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F)
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")

Clearly there's some bad data. This merits further investigation, but for now I'll just remove the worst offenders.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 )
boxplot(passenger.km ~ Year, data=new.dat, main="Bus ridership density over time", na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")

There's so much variation that it's hard to read, so I'll make two plots: one for high-productivity systems and one for low.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 & passenger.km > 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 200 )
boxplot(passenger.km ~ Year, data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")

## National Transit Database panel by system, mode, and service

This page contains links to a long-formatted panel of the National Transit Database's TS2.1 – Service Data and Operating Expenses Time-Series by Mode. I maintain the NTD data conventions, so that the column names match the NTD's data glossary. I also added a column labeled CPI that is generated from the Bureau of Labor Statistics' Inflation Calculator and provides the 2013 value of one dollar in the reported year.

I have not reformatted the other time-series NTD data, which I use less frequently and which are not available at the same level of disaggregation. For example, the capital expenditures data include system, mode, and year, but not service type. Some data products have also been discontinued, such as the detailed sources of local revenues data.

The reformatted panel is available in the following formats: R (password: NTD1), Stata (password: NTD2), and CSV (password: NTD3). I haven't looked closely at the Stata file but exported it using the foreign package. The code used to format the data is available here (password: NTD4), if you want to make any changes.

Below, I describe three important choices about when to treat missing values as an NA or a 0. The NTD data do not take a particularly clear approach to the problem. For example, entries for missing cost data vary between a null entry and $0 with no apparent pattern. For additional information about how the data were constructed, see the code above.

1. Keep zeros. In 1992, Denver did not have light rail. Instead of replacing entries of zero with NAs, I keep the data. This will make it easier to analyze the influence of new light rail in the future. It is, however, worth noting that this is somewhat inconsistent and introduces some problems. It is inconsistent because Honolulu, which has never had a light rail system, is excluded from the panel, so its data are effectively treated as NAs, not zeros. It is potentially problematic because I will miscalculate average light rail ridership or operating expenses in 1992 if I'm not careful:
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992), summary(OPEXP_TOTAL) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##        0        0        0  6334000  5050000 62260000
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992 & NTD.ts\$OPEXP_TOTAL > 0), summary(OPEXP_TOTAL) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   261200  4800000 11440000 17730000 21360000 62260000
2. Replace known missing data with an NA. For example, the NTD does not report fare revenues earlier than 2002, so I replaced all earlier fare entries with NAs.
3. Systematically coerce likely incorrect zeros into NAs. The NTD data are frequently messy, particularly for small systems and privately operated systems. In one year, an agency might report unlinked passenger trips but fail to report any operating costs or passenger miles. Since ridership is particularly important, I used PMT and UPT to identify unreported data entered as zeros. There were only 12 instances where an agency reported positive PMT but no UPT. By contrast, there were numerous instances where an agency reported positive UPT but no PMT. I first coerced these 12 cases of UPT to be NAs. I then replaced zeros in the remaining columns (including PMT) with NAs whenever the agency reported positive UPT. Note that this does not remove any data, but it makes it much less likely that an analysis generates bad numbers (if, for example, dividing total fare revenues or operating expenses by unlinked passenger trips or passenger miles).
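The coercion logic described above can be sketched on a toy data frame. This is a simplified illustration of the rules, not the actual formatting code linked above:

```r
# Toy data: each row is an agency-year report.
d <- data.frame(UPT         = c(1000,    0, 2000),
                PMT         = c(5000, 3000,    0),
                OPEXP_TOTAL = c( 100,    0,    0))

# Rule 1: positive PMT but zero UPT -> UPT was likely unreported, not zero.
d$UPT[d$UPT == 0 & d$PMT > 0] <- NA

# Rule 2: positive UPT but zeros in other columns -> treat those zeros as NAs.
for (col in c("PMT", "OPEXP_TOTAL")) {
  d[[col]][!is.na(d$UPT) & d$UPT > 0 & d[[col]] == 0] <- NA
}
d
```

Row 2 keeps its zero operating expense because its UPT is now NA, so the second rule cannot vouch for the agency having reported anything that year; row 3's zeros become NAs because the agency clearly carried passengers.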