Last fall, I took 8 graduate students to Queretaro, Mexico, for a bicycle planning studio. The final planning document is available here: CYCLE_QRO_Small
A few group images from the trip:
Driverless cars are coming. Nearly all major car manufacturers and several large technology firms are racing to develop vehicles that can operate with limited-to-no human interaction. Despite the probability of technological, legal, and political pitfalls along the way, fully autonomous vehicles (buses and cars) will likely be commercially available and driving themselves on city streets and highways in the next 20 years, well within a long-range planning horizon.
The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.
Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value is aggregated at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and city hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.
You can also find a description of the underlying tax assessor data here.
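If the download is a CSV, reading it in takes one line. A minimal sketch, assuming the file is saved in your working directory as phila_tracts.csv (a placeholder name; substitute whatever name the download actually uses):
dat <- read.csv("phila_tracts.csv") # hypothetical file name
dim(dat) # expect 77 columns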
It is a fairly wide dataset (77 columns), so start by looking at the variable names and numbers.
names(dat)
## [1] "fips" "tract"
## [3] "sfr.built.sf" "mfr.built.sf"
## [5] "retail.built.sf" "comm.built.sf"
## [7] "indust.built.sf" "vacant.built.sf"
## [9] "exempt.built.sf" "sfr.lot.sf"
## [11] "mfr.lot.sf" "retail.lot.sf"
## [13] "comm.lot.sf" "indust.lot.sf"
## [15] "vacant.lot.sf" "exempt.lot.sf"
## [17] "sfr.mark.val" "mfr.mark.val"
## [19] "retail.mark.val" "comm.mark.val"
## [21] "indust.mark.val" "vacant.mark.val"
## [23] "exempt.mark.val" "total.built.sf"
## [25] "total.lot.sf" "total.mark.val"
## [27] "ft.2.city.hall" "ft.2.train.st"
## [29] "ft.2.bus.st" "pop.2010"
## [31] "pop_density.2010" "square_miles.2010"
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"
## [35] "pop_asian.2010" "pop_hispanic.2010"
## [37] "households.2010" "households_family.2010"
## [39] "households_nonfamily.2010" "avg_hh_size.2010"
## [41] "pop_25_plus.2010" "edu_less_highschool.2010"
## [43] "edu_highschool.2010" "edu_collegeplus.2010"
## [45] "pop_civ_employed.2010" "pop_civ_unemployed.2010"
## [47] "median_hh_income.2010" "per_capita_income.2010"
## [49] "housing_vacant.2010" "housing_occupied.2010"
## [51] "housing_median_age.2010" "percent_poverty.2010"
## [53] "pop_change.2000" "pop_plus10.2000"
## [55] "pop.2000" "pop_density.2000"
## [57] "square_miles.2000" "pop_white_nonhispanic.2000"
## [59] "pop_black.2000" "pop_asian.2000"
## [61] "pop_hispanic.2000" "households.2000"
## [63] "households_family.2000" "households_nonfamily.2000"
## [65] "avg_hh_size.2000" "pop_25_plus.2000"
## [67] "edu_less_highschool.2000" "edu_highschool.2000"
## [69] "edu_collegeplus.2000" "pop_civ_employed.2000"
## [71] "pop_civ_unemployed.2000" "median_hh_income.2000"
## [73] "per_capita_income.2000" "housing_vacant.2000"
## [75] "housing_occupied.2000" "housing_median_age.2000"
## [77] "percent_poverty.2000"
Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.
summary(dat[c(3:26)])
## sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 848570 1st Qu.: 144324 1st Qu.: 22700 1st Qu.: 52053
## Median :1429116 Median : 270901 Median : 71180 Median : 120155
## Mean :1432991 Mean : 395834 Mean : 93474 Mean : 357440
## 3rd Qu.:1970797 3rd Qu.: 461973 3rd Qu.:133074 3rd Qu.: 239220
## Max. :4357472 Max. :3099629 Max. :780672 Max. :19805858
## indust.built.sf vacant.built.sf exempt.built.sf
## Min. : 0 Min. : 0.000 Min. : 3771
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.: 178846
## Median : 32572 Median : 0.000 Median : 319638
## Mean : 272199 Mean : 8.906 Mean : 712816
## 3rd Qu.: 198236 3rd Qu.: 0.000 3rd Qu.: 683018
## Max. :6372334 Max. :1520.000 Max. :28206707
## sfr.lot.sf mfr.lot.sf retail.lot.sf comm.lot.sf
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 955739 1st Qu.: 115597 1st Qu.: 23083 1st Qu.: 79158
## Median : 1845223 Median : 251603 Median : 54802 Median : 168255
## Mean : 2476646 Mean : 400486 Mean : 66522 Mean : 497081
## 3rd Qu.: 3139156 3rd Qu.: 516614 3rd Qu.: 89680 3rd Qu.: 455709
## Max. :33126247 Max. :3026616 Max. :633754 Max. :17741673
## indust.lot.sf vacant.lot.sf exempt.lot.sf
## Min. : 0 Min. : 0 Min. : 29248
## 1st Qu.: 0 1st Qu.: 68008 1st Qu.: 348156
## Median : 31122 Median : 167437 Median : 746535
## Mean : 548498 Mean : 448650 Mean : 1803958
## 3rd Qu.: 212020 3rd Qu.: 415549 3rd Qu.: 1628277
## Max. :26713019 Max. :13823323 Max. :40611433
## sfr.mark.val mfr.mark.val retail.mark.val
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 17434950 1st Qu.: 2777300 1st Qu.: 522825
## Median : 40370150 Median : 6816500 Median : 1359150
## Mean : 48218979 Mean : 13362086 Mean : 2320078
## 3rd Qu.: 70492725 3rd Qu.: 14893450 3rd Qu.: 2729175
## Max. :249615900 Max. :248868900 Max. :32522100
## comm.mark.val indust.mark.val vacant.mark.val
## Min. :0.000e+00 Min. : 0 Min. : 0
## 1st Qu.:1.369e+06 1st Qu.: 0 1st Qu.: 214075
## Median :3.658e+06 Median : 393350 Median : 481950
## Mean :1.887e+07 Mean : 4313410 Mean : 1652835
## 3rd Qu.:9.132e+06 3rd Qu.: 2400625 3rd Qu.: 1139750
## Max. :1.671e+09 Max. :154563400 Max. :36376000
## exempt.mark.val total.built.sf total.lot.sf
## Min. :5.657e+05 Min. : 17886 Min. : 605741
## 1st Qu.:6.623e+06 1st Qu.: 2190500 1st Qu.: 2539507
## Median :1.379e+07 Median : 2830458 Median : 4162716
## Mean :4.344e+07 Mean : 3264763 Mean : 6241842
## 3rd Qu.:3.132e+07 3rd Qu.: 3534784 3rd Qu.: 6811800
## Max. :1.502e+09 Max. :28206707 Max. :91325669
## total.mark.val
## Min. :4.546e+06
## 1st Qu.:5.499e+07
## Median :9.375e+07
## Mean :1.322e+08
## 3rd Qu.:1.392e+08
## Max. :2.286e+09
There are also four variables related to geographic location. The fips codes can be matched to a census tract shapefile; the others measure the number of feet between each census tract centroid and city hall, the closest train station, and the closest bus stop.
summary(dat[c(1,27:29)])
## fips ft.2.city.hall ft.2.train.st ft.2.bus.st
## Min. :4.21e+10 Min. : 2881 Min. : 258.3 Min. : 6.36
## 1st Qu.:4.21e+10 1st Qu.:16596 1st Qu.: 4087.1 1st Qu.: 265.63
## Median :4.21e+10 Median :28692 Median : 6777.2 Median : 628.73
## Mean :4.21e+10 Mean :31776 Mean : 7443.2 Mean : 781.29
## 3rd Qu.:4.21e+10 3rd Qu.:43484 3rd Qu.: 9832.9 3rd Qu.:1083.03
## Max. :4.21e+10 Max. :87673 Max. :26020.5 Max. :7140.68
The remaining data are taken from the 2000 Census and 2010 5-year ACS. Note that not all of the 2000 data match and there are several null values in different categories.
Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.
sum(dat$square_miles.2010)
## [1] 134.1014
sum(dat$total.lot.sf) / 27.8784e6 # divided by the number of square feet in a square mile
## [1] 85.97578
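In other words, parcels account for roughly 64% of the city's land area (85.98 of 134.10 square miles), with the remainder presumably in streets, water, and other unparceled land. A quick check:
(sum(dat$total.lot.sf) / 27.8784e6) / sum(dat$square_miles.2010) # roughly 0.64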
Now start creating and examining some key ratios, like total floor area ratio (FAR). There is significant variation, with some tracts having almost no built area and others having a high net FAR.
dat$far <- dat$total.built.sf / dat$total.lot.sf
summary(dat$far)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005557 0.419600 0.696300 0.875800 1.052000 13.330000
Try excluding vacant land from the FAR calculation.
summary(dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005557 0.423300 0.748200 0.943900 1.165000 14.650000
And look at how vacancy varies across census tracts.
hist(dat$vacant.lot.sf/ dat$total.lot.sf )
Now, create some net FAR calculations by land use type.
summary(dat$comm.built.sf/ dat$comm.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00376 0.33490 0.61070 1.09600 1.16400 24.78000 9
summary(dat$indust.built.sf/ dat$indust.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00711 0.59420 1.04900 1.44700 1.66800 16.04000 104
summary(dat$retail.built.sf/ dat$retail.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0355 0.9365 1.3690 1.3830 1.8030 4.3190 29
summary(dat$exempt.built.sf/ dat$exempt.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001703 0.209100 0.522600 0.783200 1.033000 7.203000
It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understanding how much space new households are likely to consume. Note the substantial variation.
dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010
dat$res.sf.per.hh[dat$res.sf.per.hh == Inf] <- NA ## tracts with zero households produce n/0 == Inf; recode these to NA
summary( dat$res.sf.per.hh )
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 104.1 1092.0 1220.0 1237.0 1367.0 4538.0 8
This looks about right with most households having between 1000 and 1400 square feet.
hist( dat$res.sf.per.hh )
And per capita
dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010
dat$res.sf.per.person[dat$res.sf.per.person == Inf] <- NA
hist( dat$res.sf.per.person )
Look at the relationship between the population and square footage of housing.
plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )
Perhaps surprisingly, there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood.
plot(dat$res.sf.per.person , dat$ft.2.city.hall)
There is a bit more of a pattern between location and the proportion of housing that is multifamily.
dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf )
plot(dat$per.mf , dat$ft.2.city.hall)
Which land uses tend to go together? Use a correlation table to examine total square footage.
cor(dat[c(3:7,9)])
## sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## sfr.built.sf 1.0000000 -0.227687793 0.208504423 -0.21049419
## mfr.built.sf -0.2276878 1.000000000 0.009489829 0.45513983
## retail.built.sf 0.2085044 0.009489829 1.000000000 0.04310543
## comm.built.sf -0.2104942 0.455139826 0.043105428 1.00000000
## indust.built.sf -0.1466598 -0.135759385 -0.038551580 0.04383247
## exempt.built.sf -0.2613878 0.103988183 -0.049405209 0.16346712
## indust.built.sf exempt.built.sf
## sfr.built.sf -0.14665984 -0.26138775
## mfr.built.sf -0.13575939 0.10398818
## retail.built.sf -0.03855158 -0.04940521
## comm.built.sf 0.04383247 0.16346712
## indust.built.sf 1.00000000 0.06878225
## exempt.built.sf 0.06878225 1.00000000
And land area
cor(dat[c(10:16)])
## sfr.lot.sf mfr.lot.sf retail.lot.sf comm.lot.sf
## sfr.lot.sf 1.000000000 0.174062104 -0.01660241 0.03719301
## mfr.lot.sf 0.174062104 1.000000000 -0.06027767 0.12670109
## retail.lot.sf -0.016602407 -0.060277674 1.00000000 -0.01112620
## comm.lot.sf 0.037193013 0.126701092 -0.01112620 1.00000000
## indust.lot.sf -0.071661971 -0.009073780 -0.10092578 0.39273649
## vacant.lot.sf -0.013189068 0.009647197 -0.05241089 0.71002277
## exempt.lot.sf 0.000523062 0.011982805 -0.16823137 0.37194754
## indust.lot.sf vacant.lot.sf exempt.lot.sf
## sfr.lot.sf -0.07166197 -0.013189068 0.000523062
## mfr.lot.sf -0.00907378 0.009647197 0.011982805
## retail.lot.sf -0.10092578 -0.052410890 -0.168231370
## comm.lot.sf 0.39273649 0.710022767 0.371947536
## indust.lot.sf 1.00000000 0.643286320 0.489299242
## vacant.lot.sf 0.64328632 1.000000000 0.531508105
## exempt.lot.sf 0.48929924 0.531508105 1.000000000
Neither of these measures normalizes the data. Now look at how the proportion of built area in a given land use correlates with the proportions in the other land uses.
cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))
## mfr.built.sf
## sfr.built.sf -0.2779901
cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))
## retail.built.sf
## sfr.built.sf 0.1160276
cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))
## comm.built.sf
## sfr.built.sf -0.4443231
cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))
## indust.built.sf
## sfr.built.sf -0.3305134
cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))
## vacant.built.sf
## sfr.built.sf 0.008721208
cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))
## exempt.built.sf
## sfr.built.sf -0.6608287
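Rather than repeating cor() for each pair, you can compute all of these share-on-share correlations at once by dividing the built-area columns (3 through 9) by total built square footage first:
shares <- dat[, 3:9] / dat$total.built.sf # each use's share of total built sf
cor(shares)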
Try using regressions to better understand the partial correlations between land uses.
summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf
+ exempt.built.sf + vacant.lot.sf , dat))
##
## Call:
## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf +
## comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1548586 -500194 -17272 498795 2940319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.550e+06 6.833e+04 22.686 < 2e-16 ***
## mfr.built.sf -3.486e-01 1.005e-01 -3.469 0.000582 ***
## retail.built.sf 1.605e+00 3.848e-01 4.172 3.76e-05 ***
## comm.built.sf -5.509e-02 3.127e-02 -1.762 0.078859 .
## indust.built.sf -1.539e-01 6.400e-02 -2.405 0.016662 *
## exempt.built.sf -8.591e-02 1.978e-02 -4.344 1.80e-05 ***
## vacant.lot.sf -1.419e-02 3.871e-02 -0.367 0.714091
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 728300 on 377 degrees of freedom
## Multiple R-squared: 0.1791, Adjusted R-squared: 0.1661
## F-statistic: 13.71 on 6 and 377 DF, p-value: 4.216e-14
EXERCISE
Estimate a gross measure of total FAR using Census land area and plot this against net FAR, as sketched after this list. Describe any important variation.
Try developing a model to predict neighborhood FAR. What are the important and statistically significant predictors?
Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses?
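A starting sketch for the first item (gross.far is just a suggested name; 27.8784e6 is again the number of square feet in a square mile):
dat$gross.far <- dat$total.built.sf / (dat$square_miles.2010 * 27.8784e6) # built sf over ALL land, not just parcel area
plot(dat$far, dat$gross.far)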
For this lab, load the Zip Code data from the previous lab. This should include the following files with the following dimensions. You can also download the data here.
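If you saved your work at the end of that lab with save(zbp02, zbp12, file = "zip_jobs.Rda"), restoring both data frames takes one line:
load("zip_jobs.Rda") # restores zbp02 and zbp12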
dim(zbp02)
## [1] 2207 24
dim(zbp12)
## [1] 2125 22
The purpose of this lab is to reinforce the regression and visualization skills learned in lab 2 and lab 3 using the Zip Business Patterns data.
Start by summarizing the 2002 data, which you will use to predict job changes in 2012.
summary(zbp02)
## ZIP jobs_plus10 jobs.tot jobs.23
## Min. :15001 Min. : 2.5 Min. : 2.50 Min. : 0.0
## 1st Qu.:16010 1st Qu.: 71.0 1st Qu.: 72.25 1st Qu.: 2.5
## Median :17235 Median : 411.0 Median : 387.50 Median : 22.0
## Mean :17318 Mean : 2654.3 Mean : 2597.33 Mean : 138.6
## 3rd Qu.:18618 3rd Qu.: 2327.0 3rd Qu.: 2234.25 3rd Qu.: 126.5
## Max. :19980 Max. :79883.5 Max. :77848.50 Max. :4476.5
## NA's :82 NA's :40 NA's :40
## jobs.31 jobs.42 jobs.44 jobs.48
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 2.5 1st Qu.: 0.00
## Median : 34.5 Median : 7.5 Median : 31.0 Median : 7.00
## Mean : 357.2 Mean : 128.8 Mean : 377.0 Mean : 73.68
## 3rd Qu.: 341.5 3rd Qu.: 74.5 3rd Qu.: 273.2 3rd Qu.: 47.25
## Max. :10076.0 Max. :5025.5 Max. :10491.0 Max. :5435.00
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.51 jobs.52 jobs.53 jobs.54
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0.00 Median : 7.00 Median : 0.00 Median : 5.0
## Mean : 72.96 Mean : 155.58 Mean : 39.46 Mean : 158.4
## 3rd Qu.: 14.50 3rd Qu.: 50.25 3rd Qu.: 17.50 3rd Qu.: 57.0
## Max. :4203.50 Max. :15671.50 Max. :1739.50 Max. :22421.0
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.56 jobs.61 jobs.62 jobs.71
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 5.0 Median : 0.00 Median : 16.5 Median : 0.00
## Mean : 145.6 Mean : 66.51 Mean : 362.8 Mean : 41.55
## 3rd Qu.: 56.0 3rd Qu.: 8.50 3rd Qu.: 208.2 3rd Qu.: 17.00
## Max. :9158.5 Max. :3734.50 Max. :8389.5 Max. :2858.00
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.72 jobs.81 jobs.95 jobs.99
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.: 2.5 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 19.5 Median : 19.5 Median : 0.00 Median : 0.000
## Mean : 213.5 Mean : 139.4 Mean : 26.79 Mean : 1.355
## 3rd Qu.: 154.2 3rd Qu.: 114.8 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :5445.5 Max. :4624.5 Max. :1961.00 Max. :214.500
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.21 jobs.11 jobs.22 jobs.55
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0.000 Median : 0.00 Median : 0.00 Median : 0.0
## Mean : 8.461 Mean : 2.05 Mean : 18.45 Mean : 69.4
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 2.5
## Max. :1538.500 Max. :194.00 Max. :1785.50 Max. :6602.0
## NA's :40 NA's :40 NA's :40 NA's :40
As with the census tract data, not all of the Zip Codes match across the two years, resulting in missing data.
Plot the relationship between the number of jobs in 2002 and 2012.
plot(zbp02$jobs.tot, zbp02$jobs_plus10)
And look at the distribution of jobs across Zip Codes. Most Zip Codes have very few jobs, with a long tail of job-rich Zip Codes.
plot(density(zbp02$jobs_plus10, na.rm = T))
hist(zbp02$jobs_plus10)
Look at a specific Zip Code (19104).
plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP == 19104], zbp02$jobs.tot[zbp02$ZIP==19104], col="red")
And all Philadelphia Zip Codes, plus a few others. These Zip Codes do not look particularly different from others in PA.
plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP > 19018 & zbp02$ZIP < 19256],
zbp02$jobs.tot[zbp02$ZIP> 19018 & zbp02$ZIP < 19256], col="red")
Now, predict the 2012 jobs as a function of the number of jobs in 2002. The regression line fits the data almost perfectly.
reg1 <- lm(jobs_plus10 ~ jobs.tot, zbp02)
summary(reg1)
##
## Call:
## lm(formula = jobs_plus10 ~ jobs.tot, data = zbp02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7887.2 -158.2 -82.6 26.7 10332.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.743743 26.051260 3.176 0.00151 **
## jobs.tot 0.967799 0.004028 240.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1081 on 2083 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.9652, Adjusted R-squared: 0.9652
## F-statistic: 5.774e+04 on 1 and 2083 DF, p-value: < 2.2e-16
The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.
plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
abline(reg1)
It also looks like there might be some systematic differences in prediction quality by the number of jobs.
plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)
Try to see which jobs are more or less likely to predict job losses or job increases.
summary(
lm(jobs_plus10 ~jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
)
##
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
## jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
## jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
## jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705.6 -159.0 -51.4 39.6 9224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265 25.06739 1.905 0.056923 .
## jobs.23 1.48264 0.11445 12.954 < 2e-16 ***
## jobs.31 0.86957 0.03842 22.633 < 2e-16 ***
## jobs.42 1.75718 0.11463 15.329 < 2e-16 ***
## jobs.44 0.90575 0.05998 15.100 < 2e-16 ***
## jobs.48 0.66433 0.10652 6.237 5.41e-10 ***
## jobs.51 0.68900 0.13370 5.153 2.80e-07 ***
## jobs.52 0.59305 0.06731 8.811 < 2e-16 ***
## jobs.53 2.82325 0.31606 8.933 < 2e-16 ***
## jobs.54 1.51710 0.07039 21.552 < 2e-16 ***
## jobs.56 0.17040 0.08420 2.024 0.043112 *
## jobs.61 1.47967 0.11901 12.433 < 2e-16 ***
## jobs.62 0.85000 0.04409 19.281 < 2e-16 ***
## jobs.71 0.75343 0.20036 3.760 0.000174 ***
## jobs.72 1.52875 0.12252 12.478 < 2e-16 ***
## jobs.81 0.44943 0.18831 2.387 0.017095 *
## jobs.95 0.74847 0.19179 3.903 9.82e-05 ***
## jobs.99 5.25699 4.88099 1.077 0.281591
## jobs.21 1.13877 0.36736 3.100 0.001962 **
## jobs.11 3.63400 2.12568 1.710 0.087495 .
## jobs.22 0.88365 0.22376 3.949 8.11e-05 ***
## jobs.55 0.11145 0.11360 0.981 0.326664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.9714, Adjusted R-squared: 0.9711
## F-statistic: 3339 on 21 and 2063 DF, p-value: < 2.2e-16
Most of the parameter estimates are statistically different from zero with a high degree of confidence, but that is really not a very useful finding. You really want to know whether job types are statistically different from 1 (i.e., are certain job types more or less likely to be decreasing over time?). You can roughly approximate this by looking at the coefficient estimate and the standard error. If the coefficient estimate plus or minus two standard errors crosses the number 1, then the estimate is not statistically different from one with 95% confidence.
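A rough sketch of that check in code, refitting the model above as a saved object (reg.jobs is just a suggested name):
reg.jobs <- lm(jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 + jobs.48 + jobs.51 + jobs.52 +
jobs.53 + jobs.54 + jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
cf <- summary(reg.jobs)$coefficients
ci.lo <- cf[, "Estimate"] - 2 * cf[, "Std. Error"]
ci.hi <- cf[, "Estimate"] + 2 * cf[, "Std. Error"]
cbind(ci.lo, ci.hi, diff.from.1 = ci.lo > 1 | ci.hi < 1) # 1 = different from one with roughly 95% confidence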
You can also set up the regression to test the coefficient estimates against 1 directly by including the total number of 2002 jobs as an offset.
summary(
lm(jobs_plus10 ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02, offset= 1.00*jobs.tot)
)
##
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
## jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
## jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
## jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02,
## offset = 1 * jobs.tot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705.6 -159.0 -51.4 39.6 9224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265 25.06739 1.905 0.056923 .
## jobs.23 0.48264 0.11445 4.217 2.58e-05 ***
## jobs.31 -0.13043 0.03842 -3.395 0.000699 ***
## jobs.42 0.75718 0.11463 6.605 5.03e-11 ***
## jobs.44 -0.09425 0.05998 -1.571 0.116262
## jobs.48 -0.33567 0.10652 -3.151 0.001650 **
## jobs.51 -0.31100 0.13370 -2.326 0.020107 *
## jobs.52 -0.40695 0.06731 -6.046 1.76e-09 ***
## jobs.53 1.82325 0.31606 5.769 9.20e-09 ***
## jobs.54 0.51710 0.07039 7.346 2.93e-13 ***
## jobs.56 -0.82960 0.08420 -9.853 < 2e-16 ***
## jobs.61 0.47967 0.11901 4.031 5.77e-05 ***
## jobs.62 -0.15000 0.04409 -3.402 0.000681 ***
## jobs.71 -0.24657 0.20036 -1.231 0.218608
## jobs.72 0.52875 0.12252 4.316 1.67e-05 ***
## jobs.81 -0.55057 0.18831 -2.924 0.003497 **
## jobs.95 -0.25153 0.19179 -1.311 0.189836
## jobs.99 4.25699 4.88099 0.872 0.383223
## jobs.21 0.13877 0.36736 0.378 0.705657
## jobs.11 2.63400 2.12568 1.239 0.215437
## jobs.22 -0.11635 0.22376 -0.520 0.603127
## jobs.55 -0.88855 0.11360 -7.822 8.22e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.9714, Adjusted R-squared: 0.9711
## F-statistic: 3339 on 21 and 2063 DF, p-value: < 2.2e-16
Each job in Real Estate (sector 53) in a Zip Code in 2002 correlates with another 1.82 jobs in 2012. Better yet, use the types of jobs in 2002 to predict the net change in jobs from 2002 to 2012.
reg2 <- lm(jobs_plus10-jobs.tot ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
summary(reg2)
##
## Call:
## lm(formula = jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 +
## jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 +
## jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 +
## jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55,
## data = zbp02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705.6 -159.0 -51.4 39.6 9224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265 25.06739 1.905 0.056923 .
## jobs.23 0.48264 0.11445 4.217 2.58e-05 ***
## jobs.31 -0.13043 0.03842 -3.395 0.000699 ***
## jobs.42 0.75718 0.11463 6.605 5.03e-11 ***
## jobs.44 -0.09425 0.05998 -1.571 0.116262
## jobs.48 -0.33567 0.10652 -3.151 0.001650 **
## jobs.51 -0.31100 0.13370 -2.326 0.020107 *
## jobs.52 -0.40695 0.06731 -6.046 1.76e-09 ***
## jobs.53 1.82325 0.31606 5.769 9.20e-09 ***
## jobs.54 0.51710 0.07039 7.346 2.93e-13 ***
## jobs.56 -0.82960 0.08420 -9.853 < 2e-16 ***
## jobs.61 0.47967 0.11901 4.031 5.77e-05 ***
## jobs.62 -0.15000 0.04409 -3.402 0.000681 ***
## jobs.71 -0.24657 0.20036 -1.231 0.218608
## jobs.72 0.52875 0.12252 4.316 1.67e-05 ***
## jobs.81 -0.55057 0.18831 -2.924 0.003497 **
## jobs.95 -0.25153 0.19179 -1.311 0.189836
## jobs.99 4.25699 4.88099 0.872 0.383223
## jobs.21 0.13877 0.36736 0.378 0.705657
## jobs.11 2.63400 2.12568 1.239 0.215437
## jobs.22 -0.11635 0.22376 -0.520 0.603127
## jobs.55 -0.88855 0.11360 -7.822 8.22e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.2035, Adjusted R-squared: 0.1954
## F-statistic: 25.1 on 21 and 2063 DF, p-value: < 2.2e-16
Note how all the parameter estimates are the same, but there is now a much lower and much more useful R-squared. Instead of showing that Zip Codes with a lot of jobs in 2002 tend to have a lot of jobs in 2012, the model now shows that the number and types of jobs in each Zip Code can be used to predict whether a Zip Code gained or lost jobs. Instead of explaining 97% of the variation in the relationship, the model now claims to explain about 20%.
The distribution of this dependent variable also looks a lot more normal with most Zip Codes not having changed very much.
plot(density(zbp02$jobs_plus10- zbp02$jobs.tot, na.rm=T))
The residual plot also looks a bit more homoscedastic.
plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)
EXERCISE
Try to make the most parsimonious model that does a good job of predicting the change in jobs from 2002 to 2012. Use a full model vs. reduced model ANOVA test to compare this model to the fully specified model (with all job types) and describe the results.
Try predicting the percent change in jobs instead of the absolute change. Compare and contrast the two models. Which model do you prefer and why? Hint: use the identity function to generate the percent change variable inside of your regression: I(jobs_plus10/jobs.tot-1). Input industry codes as a percentage of total jobs: I(jobs.23/jobs.tot). (A sketch of this setup follows the list.)
What types of jobs are the best indicators of job growth and job losses? Describe the quantitative relationship as expressed by the two models in question 2. Does this contradict or concur with your expectations?
Look at the differences in jobs in sector 99 across the two time periods. What do you think is happening?
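For the percent-change item above, a minimal sketch of the setup (only two sectors shown; extend the right-hand side to the other job types as needed):
reg.pct <- lm(I(jobs_plus10/jobs.tot - 1) ~ I(jobs.23/jobs.tot) + I(jobs.31/jobs.tot), zbp02)
summary(reg.pct)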
In this lab, you will use Philadelphia's 2000 Census Tract data to make predictions about the 2010 and 2020 population, and to understand what types of Census Tracts have been growing or shrinking. Download the data here (password is PennDesign). You can load it by double clicking on the file or using the load command.
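If you use the load command, the call looks like this (dat2000.Rda is a placeholder; use the actual name of the downloaded file):
load("dat2000.Rda") # creates the dat2000 data frame in your workspace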
Start by summarizing the data:
summary(dat2000)
## Geo_FIPS pop_change pop_plus10 pop
## Min. :4.21e+10 Min. :-1310.00 Min. : 0 Min. : 0
## 1st Qu.:4.21e+10 1st Qu.: -394.50 1st Qu.:2721 1st Qu.: 2329
## Median :4.21e+10 Median : -85.00 Median :3774 Median : 4008
## Mean :4.21e+10 Mean : -43.79 Mean :3919 Mean : 3983
## 3rd Qu.:4.21e+10 3rd Qu.: 309.50 3rd Qu.:4987 3rd Qu.: 5630
## Max. :4.21e+10 Max. : 2311.00 Max. :9022 Max. :10479
## NA's :215 NA's :106 NA's :109
## pop_density square_miles pop_white_nonhispanic pop_black
## Min. : 0 Min. :0.06491 Min. : 0 Min. : 0
## 1st Qu.: 8462 1st Qu.:0.18056 1st Qu.: 114 1st Qu.: 166
## Median :15692 Median :0.25657 Median : 821 Median : 903
## Mean :16120 Mean :0.35457 Mean :1695 Mean :1721
## 3rd Qu.:22543 3rd Qu.:0.37694 3rd Qu.:2901 3rd Qu.:2959
## Max. :54824 Max. :3.26893 Max. :8948 Max. :8468
## NA's :109 NA's :109 NA's :109 NA's :109
## pop_asian pop_hispanic households households_family
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 13.0 1st Qu.: 33.0 1st Qu.: 852 1st Qu.: 428.0
## Median : 46.0 Median : 89.0 Median :1502 Median : 899.0
## Mean : 177.6 Mean : 336.7 Mean :1549 Mean : 924.8
## 3rd Qu.: 168.0 3rd Qu.: 230.0 3rd Qu.:2187 3rd Qu.:1323.0
## Max. :2625.0 Max. :6802.0 Max. :6115 Max. :2738.0
## NA's :109 NA's :109 NA's :109 NA's :109
## households_nonfamily avg_hh_size pop_25_plus edu_less_highschool
## Min. : 0 Min. :0.000 Min. : 0 Min. : 0.0
## 1st Qu.: 303 1st Qu.:2.220 1st Qu.:1366 1st Qu.: 286.0
## Median : 529 Median :2.520 Median :2422 Median : 661.0
## Mean : 624 Mean :2.424 Mean :2536 Mean : 729.9
## 3rd Qu.: 796 3rd Qu.:2.760 3rd Qu.:3624 3rd Qu.:1023.0
## Max. :4806 Max. :3.600 Max. :7155 Max. :2733.0
## NA's :109 NA's :109 NA's :109 NA's :109
## edu_highschool edu_collegeplus pop_civ_employed pop_civ_unemployed
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 362.0 1st Qu.: 119.0 1st Qu.: 780 1st Qu.: 75.0
## Median : 789.0 Median : 286.0 Median :1433 Median : 158.0
## Mean : 845.3 Mean : 453.1 Mean :1535 Mean : 187.9
## 3rd Qu.:1260.0 3rd Qu.: 557.0 3rd Qu.:2225 3rd Qu.: 266.0
## Max. :2809.0 Max. :5735.0 Max. :5692 Max. :1678.0
## NA's :109 NA's :109 NA's :109 NA's :109
## median_hh_income per_capita_income housing_vacant housing_occupied
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 21500 1st Qu.: 10817 1st Qu.: 55.0 1st Qu.: 852
## Median : 29839 Median : 15615 Median :149.0 Median :1502
## Mean : 32002 Mean : 17604 Mean :188.7 Mean :1549
## 3rd Qu.: 39528 3rd Qu.: 20613 3rd Qu.:279.0 3rd Qu.:2187
## Max. :200001 Max. :109633 Max. :834.0 Max. :6115
## NA's :109 NA's :109 NA's :109 NA's :109
## housing_median_age percent_poverty
## Min. : 0 Min. :0.00000
## 1st Qu.:1939 1st Qu.:0.09552
## Median :1942 Median :0.18480
## Mean :1885 Mean :0.21014
## 3rd Qu.:1952 3rd Qu.:0.30020
## Max. :1983 Max. :0.84337
## NA's :109 NA's :120
Notice how there are a number of missing data entries. This is because Census Tract numbers and borders change over time; sometimes combining, sometimes splitting. We have data for 106 2000 Census Tracts that are missing in 2010, and 109 2010 Census Tracts that are missing in 2000. There are several ways to combine Census Tracts to make them consistent over time. This website provides tools to estimate Census data from 1970 to 2000 within the 2010 boundaries.
For the purposes of this exercise, we will exclude the non-matching tracts. Look at the relationship between tract populations in 2000 and in 2010.
plot(dat2000$pop, dat2000$pop_plus10)
The density and histogram functions are convenient ways to understand the distribution of the data (in this case the 2010 population in each Census Tract). The population is somewhat normally distributed and ranges from around zero to 10,000.
plot(density(dat2000$pop_plus10, na.rm = T))
hist(dat2000$pop_plus10)
Before proceeding, compare the Census Tracts that are consistent between 2000 and 2010 with those that are not. As you would expect, low population and high population Census Tracts were most likely to change boundaries.
hist(dat2000$pop[is.na(dat2000$pop_plus10)==T])
hist(dat2000$pop[is.na(dat2000$pop_plus10)==F])
How well can you predict the 2010 Census Tract population from the 2000 Census Tract population?
reg1 <- lm(pop_plus10~pop, dat2000)
summary(reg1)
##
## Call:
## lm(formula = pop_plus10 ~ pop, data = dat2000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1271.96 -357.73 -45.54 355.51 2334.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -92.07364 88.72101 -1.038 0.3
## pop 1.01166 0.01986 50.938 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 552 on 273 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.9048, Adjusted R-squared: 0.9045
## F-statistic: 2595 on 1 and 273 DF, p-value: < 2.2e-16
The R-squared statistic indicates that the 2000 population explains about 90% of the variation in the 2010 population. The following shows how well the regression line fits the data.
plot(dat2000$pop, dat2000$pop_plus10)
abline(reg1)
Furthermore, each person in 2000 correlates with 1.01 persons in 2010, though with a standard error of 0.02, this is hardly a ringing endorsement of population growth. Plotting the population change from 2000 to 2010 shows that almost the same number of Census Tracts gained and lost population.
plot(density(dat2000$pop_change, na.rm = T))
Plotting the error term shows that the regression line fits the low population tracts a bit better than the high population tracts. (Note how closely this tracks with the population change number, since the prediction of the 2010 population relies entirely on the 2000 population.)
plot(density(resid(reg1)))
Plotting the predicted population against the error terms reveals some additional patterns to help understand the regression. Ideally this pattern would look much more random with points equally and randomly distributed around the zero line.
plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)
One way to try to deal with under-predicting the more populous Census Tracts is to introduce a quadratic term.
reg2 <- lm(pop_plus10~pop + I(pop^2), dat2000)
summary(reg2)
##
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2), data = dat2000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1228.67 -338.37 -48.65 365.88 2334.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.354e+02 1.750e+02 2.488 0.013460 *
## pop 7.114e-01 8.863e-02 8.027 3.02e-14 ***
## I(pop^2) 3.587e-05 1.033e-05 3.473 0.000599 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 541.2 on 272 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.9082
## F-statistic: 1356 on 2 and 272 DF, p-value: < 2.2e-16
Does this fit the data better? A full model vs. reduced model Anova test suggests that it does. So does the small improvement to the adjusted R-Squared.
anova(reg1,reg2)
## Analysis of Variance Table
##
## Model 1: pop_plus10 ~ pop
## Model 2: pop_plus10 ~ pop + I(pop^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 273 83185653
## 2 272 79653918 1 3531735 12.06 0.0005991 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual plot looks better…slightly…perhaps…
plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)
Try testing some more interesting hypotheses. For example, do you expect higher income neighborhoods to grow more or less quickly than lower income neighborhoods? You could probably construct a reasonable theory in either direction, but my guess is that wealthier Census Tracts in Philadelphia were the most likely to grow.
reg3 <- lm(pop_plus10 ~ pop + I(pop^2)+ I(dat2000$median_hh_inc/1000), dat2000)
summary(reg3)
##
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(dat2000$median_hh_inc/1000),
## data = dat2000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1167.48 -361.12 -48.03 332.20 2313.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.836e+02 1.959e+02 1.448 0.148838
## pop 7.188e-01 8.843e-02 8.128 1.56e-14 ***
## I(pop^2) 3.512e-05 1.030e-05 3.408 0.000753 ***
## I(dat2000$median_hh_inc/1000) 4.375e+00 2.570e+00 1.702 0.089840 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 539.3 on 271 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.9098, Adjusted R-squared: 0.9088
## F-statistic: 911.2 on 3 and 271 DF, p-value: < 2.2e-16
Every $1,000 more in median income correlates with about 4 more people in 2010, not a tremendously strong relationship. Furthermore, the coefficient is not statistically different from 0 with 95% confidence (though it is with 90% confidence).
It also may be correlated with other variables that are associated with increasing or decreasing population, such as race.
reg4 <- lm(pop_plus10~ pop + I(pop^2)+ I(median_hh_income/1000) + I(100*pop_black/pop), dat2000)
summary(reg4)
##
## Call:
## lm(formula = pop_plus10 ~ pop + I(pop^2) + I(median_hh_income/1000) +
## I(100 * pop_black/pop), data = dat2000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1306.97 -327.63 -4.05 289.08 2213.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.204e+02 2.224e+02 2.789 0.005667 **
## pop 7.208e-01 9.168e-02 7.862 9.19e-14 ***
## I(pop^2) 3.365e-05 1.062e-05 3.170 0.001703 **
## I(median_hh_income/1000) -4.582e-01 2.871e+00 -0.160 0.873310
## I(100 * pop_black/pop) -3.354e+00 9.524e-01 -3.521 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 528.9 on 269 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.9122, Adjusted R-squared: 0.9109
## F-statistic: 698.6 on 4 and 269 DF, p-value: < 2.2e-16
Each percentage point higher concentration of black residents in Census Tracts in 2000 correlates with 3 fewer people in 2010. Using this regression, let's use the predict function to estimate Philadelphia's 2010 population:
sum(predict(reg4))
## [1] 1126700
This is quite a bit lower than the actual 2010 population count:
sum(dat2000$pop_plus10, na.rm=T)
## [1] 1504950
The regression, however, excludes the hundred or so Census Tracts that changed between 2000 and 2010. Adding these back makes the prediction much closer to the actual population.
sum(predict(reg4, newdata = dat2000), na.rm =T)
## [1] 1543117
Finally, use a similar command to estimate the 2020 population by applying our regression equation to the 2010 Census data.
sum(predict(reg4, newdata = dat2010), na.rm =T)
## [1] 1488331
EXERCISE
Despite recent population growth, the model predicts that population will decline between 2010 and 2020. Do you agree with this prediction? What do you think is causing it?
Replace median income with a measure of the proportion of the population over 25 with a college degree or higher (edu_collegeplus). Do you expect neighborhoods with a higher proportion of well-educated residents to have grown or lost population between 2000 and 2010? Describe the results of the model in relation to your expectations.
Does this change improve the model? Be sure to consider statistical fit, residual plots, and theory in providing your answer.
Develop a different set of models predicting population change (pop_change) instead of the population ten years later (pop_plus10). What factors do you expect to be good predictors of total population change? Describe the model, model fits, and residual plots. (A starter sketch follows.)
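A starter sketch for that last item; the predictors here are illustrative only, not a recommended specification:
reg.change <- lm(pop_change ~ pop + I(median_hh_income/1000) + I(100*pop_black/pop), dat2000)
summary(reg.change)
plot(predict(reg.change), resid(reg.change))
abline(h=0,col=3,lty=3)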
In this lab, you will download data on jobs by industry from the Zip Business Patterns data. You can find the data on the US Census website. Be sure to download the data as a CSV file. For now, download the full 2002 and 2012 data. The links are: 2002 & 2012
You want the Complete ZIP Code Industry Detail File.
Be sure to read the descriptions of the data and data collection methods on the website. The FAQ may also prove useful.
The objectives of the lab are to learn to work with ZBP data using the midpoint method and to learn a few data processing commands in R. Specifically, we will take subsets of data and do a simple data merge. Although this may seem like a complicated way to do some simple file management and calculations, the payoffs are big. By the end of the lab, you should be able to create midpoint estimates of jobs for all US Zip Codes by any available NAICS code or subsets of NAICS codes with just a few changes to your code.
To load the data:
1) Set your working directory to wherever you saved your files. Note that the data are stored a bit differently across the two years.
2) Run the following two lines of code:
zbp02 <- read.csv("zbp02detail.txt")
zbp12 <- read.csv("zbp12detail.txt")
This may take a moment to load if you're working on a home computer. To make sure that your data loaded properly, check that the files have the same dimensions as below,
dim(zbp02)
## [1] 3251827 12
dim(zbp12)
## [1] 3179204 12
look the same,
head(zbp02)
## ZIP NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499 N500_999
## 1 501 ------ 11 9 1 0 1 0 0 0 0
## 2 501 23---- 1 1 0 0 0 0 0 0 0
## 3 501 235310 1 1 0 0 0 0 0 0 0
## 4 501 42---- 3 3 0 0 0 0 0 0 0
## 5 501 421990 1 1 0 0 0 0 0 0 0
## 6 501 422920 1 1 0 0 0 0 0 0 0
## N1000
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
head(zbp12)
## zip naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1 501 ------ 2 2 0 0 0 0 0 0
## 2 501 81---- 2 2 0 0 0 0 0 0
## 3 501 813110 2 2 0 0 0 0 0 0
## 4 1001 ------ 453 206 77 75 55 22 14 3
## 5 1001 22---- 2 1 0 0 1 0 0 0
## 6 1001 221310 1 1 0 0 0 0 0 0
## n500_999 n1000
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 0
## 5 0 0
## 6 0 0
and have the same column types. Most importantly, only NAICS should be a character or factor.
str(zbp02)
## 'data.frame': 3251827 obs. of 12 variables:
## $ ZIP : int 501 501 501 501 501 501 501 501 501 501 ...
## $ NAICS : Factor w/ 1104 levels "------","11----",..: 1 59 73 562 599 627 631 632 655 859 ...
## $ EST : int 11 1 1 3 1 1 1 1 1 2 ...
## $ N1_4 : int 9 1 1 3 1 1 1 1 1 2 ...
## $ N5_9 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ N10_19 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ N20_49 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ N50_99 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ N100_249: int 0 0 0 0 0 0 0 0 0 0 ...
## $ N250_499: int 0 0 0 0 0 0 0 0 0 0 ...
## $ N500_999: int 0 0 0 0 0 0 0 0 0 0 ...
## $ N1000 : int 0 0 0 0 0 0 0 0 0 0 ...
str(zbp12)
## 'data.frame': 3179204 obs. of 12 variables:
## $ zip : int 501 501 501 1001 1001 1001 1001 1001 1001 1001 ...
## $ naics : Factor w/ 999 levels "------","11----",..: 1 950 986 1 48 60 61 63 65 67 ...
## $ est : int 2 2 2 453 2 1 1 47 1 2 ...
## $ n1_4 : int 2 2 2 206 1 1 0 30 1 1 ...
## $ n5_9 : int 0 0 0 77 0 0 0 7 0 0 ...
## $ n10_19 : int 0 0 0 75 0 0 0 3 0 0 ...
## $ n20_49 : int 0 0 0 55 1 0 1 7 0 1 ...
## $ n50_99 : int 0 0 0 22 0 0 0 0 0 0 ...
## $ n100_249: int 0 0 0 14 0 0 0 0 0 0 ...
## $ n250_499: int 0 0 0 3 0 0 0 0 0 0 ...
## $ n500_999: int 0 0 0 1 0 0 0 0 0 0 ...
## $ n1000 : int 0 0 0 0 0 0 0 0 0 0 ...
If, for some reason, one of the other variables did not load as you want, you can change it like so:
zbp02$EST <- as.numeric(zbp02$EST) ##as.integer() is also fine.
Note that the 2002 data include the same Zip Codes multiple times. These are the counts by different sectors. (The 2012 data only includes totals.) To better understand this structure, look at the first Zip Code from the 2002 data:
subset(zbp02, ZIP == 501)
## ZIP NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1 501 ------ 11 9 1 0 1 0 0 0
## 2 501 23---- 1 1 0 0 0 0 0 0
## 3 501 235310 1 1 0 0 0 0 0 0
## 4 501 42---- 3 3 0 0 0 0 0 0
## 5 501 421990 1 1 0 0 0 0 0 0
## 6 501 422920 1 1 0 0 0 0 0 0
## 7 501 422990 1 1 0 0 0 0 0 0
## 8 501 44---- 1 1 0 0 0 0 0 0
## 9 501 445110 1 1 0 0 0 0 0 0
## 10 501 54---- 2 2 0 0 0 0 0 0
## 11 501 541110 1 1 0 0 0 0 0 0
## 12 501 541211 1 1 0 0 0 0 0 0
## 13 501 62---- 1 1 0 0 0 0 0 0
## 14 501 621111 1 1 0 0 0 0 0 0
## 15 501 81---- 2 0 1 0 1 0 0 0
## 16 501 813110 2 0 1 0 1 0 0 0
## 17 501 99---- 1 1 0 0 0 0 0 0
## N500_999 N1000
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 0
The NAICS column includes the total number of establishments across all sectors [NAICS == "------"], by two-digit industry code, and by six-digit code. EST is the total number of establishments, and the columns to the right are the number of establishments by number of employees (1 to 4, 5 to 9, etc.). To check that the numbers add up, try adding the establishments by sector.
zbp02$EST[zbp02$NAICS == "------" & zbp02$ZIP == 501]
## [1] 11
And check it against the sum of the other columns.
sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501] )
## [1] 21
Whoops! What happened? The data include both six-digit and two-digit counts, which leads to double counting. Try including only NAICS codes that end in "----" with the substring command.
sum(zbp02$EST[zbp02$NAICS != "------" & zbp02$ZIP == 501 & substring(zbp02$NAICS,3,6)== "----"] )
## [1] 11
To clean up the data a bit, create a new file that includes only the 2-digit codes and totals. If you want to save a more complete file for future use, you can skip this step. Note that no matter what you do, the original CSV files remain unchanged unless you specifically overwrite them. Thus you can run and rerun your code without ever changing the underlying data.
zbp02 <- subset(zbp02, substring(zbp02$NAICS,3,6)=="----")
zbp12 <- subset(zbp12, substring(zbp12$naics,3,6)=="----")
Again check your dimensions against the below to make sure that the commands worked.
dim(zbp02)
## [1] 478497 12
dim(zbp12)
## [1] 454477 12
And look at the data.
head(zbp02)
## ZIP NAICS EST N1_4 N5_9 N10_19 N20_49 N50_99 N100_249 N250_499
## 1 501 ------ 11 9 1 0 1 0 0 0
## 2 501 23---- 1 1 0 0 0 0 0 0
## 4 501 42---- 3 3 0 0 0 0 0 0
## 8 501 44---- 1 1 0 0 0 0 0 0
## 10 501 54---- 2 2 0 0 0 0 0 0
## 13 501 62---- 1 1 0 0 0 0 0 0
## N500_999 N1000
## 1 0 0
## 2 0 0
## 4 0 0
## 8 0 0
## 10 0 0
## 13 0 0
head(zbp12)
## zip naics est n1_4 n5_9 n10_19 n20_49 n50_99 n100_249 n250_499
## 1 501 ------ 2 2 0 0 0 0 0 0
## 2 501 81---- 2 2 0 0 0 0 0 0
## 4 1001 ------ 453 206 77 75 55 22 14 3
## 5 1001 22---- 2 1 0 0 1 0 0 0
## 8 1001 23---- 47 30 7 3 7 0 0 0
## 27 1001 31---- 48 10 8 9 9 6 3 3
## n500_999 n1000
## 1 0 0
## 2 0 0
## 4 1 0
## 5 0 0
## 8 0 0
## 27 0 0
Now, use the midpoint method to estimate the number of jobs by establishment size. Note that the midpoint of a 1-to-4 employee establishment is 2.5 employees. For establishments with 1000 or more employees, the midpoint method sets the number of employees to 1000. (Do the 2002 data in steps and then 2012 all at once.) The within() command is one of several nifty ways to avoid writing zbp02$ before each variable.
zbp02 <- within(zbp02, jobs <- N1_4 *2.5 + N5_9 *7 + N10_19 * 14.5 +
N20_49 * 34.5 + N50_99 *74.5 +
N100_249 * 174.5 + N250_499 *374.5 +
N500_999 * 749.5 + N1000 *1000)
For this lab, drop all non-PA Zip Codes to conserve space. You may want to go back and keep the other states as well at some point. PA's Zip Codes run between 15000 and 19999.
zbp02 <- subset(zbp02, as.numeric(zbp02$ZIP) >= 15000 & as.numeric(zbp02$ZIP) <= 19999)
And get rid of the variables that are not needed.
keepers <- c("ZIP", "NAICS", "jobs")
zbp02 <- zbp02[keepers]
And finally clean up the labeling of the NAICS codes.
zbp02$NAICS <- substring(zbp02$NAICS,1,2)
zbp02$NAICS[zbp02$NAICS=="--"] <- "tot"
Have a look
head(zbp02)
## ZIP NAICS jobs
## 452595 14001 tot 3180.5
## 452596 14001 23 224.0
## 452610 14001 31 1383.0
## 452629 14001 42 820.5
## 452638 14001 44 189.0
## 452654 14001 48 20.0
This type of data is often referred to as long data. There are multiple entries for each Zip Code and the NAICS column indicates the industry. For future work, it is going to be a little easier to have the data in a wide format, which you can do with ease using the reshape() command.
zbp02 <- reshape(zbp02,
timevar = "NAICS",
idvar = c("ZIP"),
direction = "wide")
Have a look
head(zbp02)
## ZIP jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48 jobs.51
## 452595 14001 3180.5 224.0 1383.0 820.5 189.0 20.0 5
## 452714 14003 12.0 NA NA NA 7.0 NA NA
## 452721 14004 2270.5 292.0 467.0 39.0 527.5 60.5 14
## 452854 14005 167.0 7.5 34.5 9.5 31.0 37.0 NA
## 452887 14006 1898.5 92.0 489.5 71.5 295.0 96.5 19
## 452991 14008 51.5 10.0 NA NA 12.5 NA NA
## jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 452595 103.5 27 72.5 56.5 5.0 41.5 7.5 113.0
## 452714 NA NA NA NA NA NA NA 2.5
## 452721 82.0 12 107.0 36.5 39.5 155.0 2.5 257.0
## 452854 2.5 NA NA 2.5 NA 2.5 NA NA
## 452887 85.0 7 31.5 22.0 NA 97.0 39.5 440.0
## 452991 NA NA NA 2.5 NA 14.5 NA 7.0
## jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 452595 56.0 49 7.5 NA NA NA NA
## 452714 2.5 NA NA NA NA NA NA
## 452721 169.5 NA NA 9.5 NA NA NA
## 452854 19.0 NA NA 14.0 7 NA NA
## 452887 106.0 NA 7.0 NA NA NA NA
## 452991 5.0 NA NA NA NA NA NA
The default setting sets missing entries to NA, but you should convert these to zero, since a missing entry here means no reported establishments rather than an unknown number of establishments.
Note that zbp02[is.na(zbp02)==T] <- 0 is a logical statement that replaces all values in the dataset with zeros if they currently equal NA. If you only wanted to do this for job totals, you could write: zbp02$jobs.tot[is.na(zbp02$jobs.tot)==T] <- 0
zbp02[is.na(zbp02)==T] <- 0
The following goes through the same steps for the 2012 data. Try to go through it slowly and understand each step of the process. Use ?command to get a description of functions if needed. Use internet searches as well.
zbp12 <- subset(zbp12, as.numeric(zbp12$zip) >= 14000 & as.numeric(zbp12$zip) <= 19999)
#calculate job totals
zbp12 <- within(zbp12, jobs <- n1_4 *2.5 + n5_9 *7 + n10_19 * 14.5 +
n20_49 * 34.5 + n50_99 *74.5 +
n100_249 * 174.5 + n250_499 *374.5 +
n500_999 * 749.5 + n1000 *1000)
zbp12$ZIP <- zbp12$zip ##to make consistent with the 02 data
zbp12$NAICS <- zbp12$naics
##keep only the needed columns
zbp12 <- zbp12[keepers] #keepers should still exist from above
zbp12$NAICS <- substring(zbp12$NAICS,1,2)
zbp12$NAICS[zbp12$NAICS=="--"] <- "tot"
#long to wide
zbp12 <- reshape(zbp12,
timevar = "NAICS",
idvar = c("ZIP"),
direction = "wide")
#switch NA to zero
zbp12[is.na(zbp12)==T] <- 0
Finally, add the total number of jobs from 2012 to the 2002 dataset to use for making predictions in a later lab. Create a new dataset so that zbp12 does not change.
zipmerge <- zbp12[c("ZIP", "jobs.tot")]
zipmerge$jobs_plus10 <- zipmerge$jobs.tot
zipmerge <- zipmerge[c("ZIP", "jobs_plus10")]
zbp02 <- merge(zipmerge, zbp02, by ="ZIP", all.x = T, all.y =T)
rm(zipmerge, keepers)
Look at the file. This time keep the NAs since they do in fact represent missing data. For example, there was no Zip Code 14003 in 2012.
head(zbp02)
## ZIP jobs_plus10 jobs.tot jobs.23 jobs.31 jobs.42 jobs.44 jobs.48
## 1 14001 3015.0 3180.5 224.0 1383.0 820.5 189.0 20.0
## 2 14003 NA 12.0 0.0 0.0 0.0 7.0 0.0
## 3 14004 2368.0 2270.5 292.0 467.0 39.0 527.5 60.5
## 4 14005 182.5 167.0 7.5 34.5 9.5 31.0 37.0
## 5 14006 1370.0 1898.5 92.0 489.5 71.5 295.0 96.5
## 6 14008 73.0 51.5 10.0 0.0 0.0 12.5 0.0
## jobs.51 jobs.52 jobs.53 jobs.54 jobs.56 jobs.61 jobs.62 jobs.71 jobs.72
## 1 5 103.5 27 72.5 56.5 5.0 41.5 7.5 113.0
## 2 0 0.0 0 0.0 0.0 0.0 0.0 0.0 2.5
## 3 14 82.0 12 107.0 36.5 39.5 155.0 2.5 257.0
## 4 0 2.5 0 0.0 2.5 0.0 2.5 0.0 0.0
## 5 19 85.0 7 31.5 22.0 0.0 97.0 39.5 440.0
## 6 0 0.0 0 0.0 2.5 0.0 14.5 0.0 7.0
## jobs.81 jobs.95 jobs.99 jobs.21 jobs.11 jobs.22 jobs.55
## 1 56.0 49 7.5 0.0 0 0 0
## 2 2.5 0 0.0 0.0 0 0 0
## 3 169.5 0 0.0 9.5 0 0 0
## 4 19.0 0 0.0 14.0 7 0 0
## 5 106.0 0 7.0 0.0 0 0 0
## 6 5.0 0 0.0 0.0 0 0 0
If desired, save this as an R data file. In the future, you can double click on this file to load the PA 2002 and 2012 job estimates by 2-digit NAICS.
save(zbp02, zbp12,file = "zip_jobs.Rda")
As you process your own data in the future, you will learn a number of different commands and packages to help you along the way. There are always dozens of different ways to do the same thing. If you are having trouble getting the program to do what you want it to do, it often makes sense to think about the problem differently. For example, maybe you want a nice summary table of jobs by industry and by state, but you are having trouble getting a single command to do it. Try instead to do each operation separately, and then combine them, as in the sketch below.
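For instance, a sketch of that stepwise approach, assuming you kept a long-format copy of the 2002 data (called zbp02.long here, a hypothetical object with ZIP, NAICS, and jobs columns from before the reshape step):
jobs.by.industry <- tapply(zbp02.long$jobs, zbp02.long$NAICS, sum, na.rm = TRUE) # step 1: sum jobs by 2-digit code
sort(jobs.by.industry, decreasing = TRUE) # step 2: rank the industries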
EXERCISE
This page describes how to load and look at data in R. First, we will download a CSV of UN population estimates for central African cities to import into R: Central Africa West.
The first thing to do is to let R know where to find the CSV. You can do this by navigating through the bottom right window in R Studio (Files/More/Set As Working Directory) or setting your working directory using the setwd command.
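For example (the path is a placeholder for wherever you saved the file):
setwd("~/Documents/planning_methods/data") # hypothetical path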
Then read the CSV file: read.csv("Central Africa West.csv")
Note how the data are printed directly to the command console. Next we will create a new object called pop so that we can call on the data more easily and interact with it.
Try: pop <- read.csv("Central Africa West.csv")
N.B. you can use = instead of <-, but it is better coding practice to save the = sign for other uses.
You can also load data in Rstudio using a point-and-click interface. To set your working directory, click on Session at the top of the window and then click Set Working Directory/Choose Directory. To import a data file, click on File/Import Dataset.
To look at the data, try cutting and pasting the following commands into the command console and hitting enter.
str(pop)
## 'data.frame': 17 obs. of 12 variables:
## $ Year : int 1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 ...
## $ Period : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Douala : int 95 117 153 205 298 433 571 740 940 1184 ...
## $ Yaounde : int 32 49 75 112 183 292 415 578 777 1025 ...
## $ Libreville : int 15 17 29 49 77 155 234 293 366 439 ...
## $ Brazzaville : int 83 92 124 172 238 329 446 596 704 830 ...
## $ Pointe_Noire : int 16 35 64 89 116 154 217 300 363 439 ...
## $ Sub_Saharan_300kplus : int 3923 4909 7083 10779 16335 24143 34813 47767 62327 75996 ...
## $ Central_Africa_300kplus: int 3660 4521 5651 7047 8921 11495 14465 18043 22566 28525 ...
## $ Cameroon_urban : int 417 557 747 1011 1375 2112 2851 3761 4787 5930 ...
## $ Congo_urban : int 201 253 320 408 522 672 860 1086 1295 1535 ...
## $ Gabon_urban : int 54 68 87 126 189 279 397 515 655 814 ...
This tells you the structure of the data. In this case, all of the variables are integers, but it is also common to see characters, factors, and other types of data.
names(pop)
## [1] "Year" "Period"
## [3] "Douala" "Yaounde"
## [5] "Libreville" "Brazzaville"
## [7] "Pointe_Noire" "Sub_Saharan_300kplus"
## [9] "Central_Africa_300kplus" "Cameroon_urban"
## [11] "Congo_urban" "Gabon_urban"
This command gives the names of all the variables in the data. The country, city, and regions contain population estimates in thousands.
summary(pop)
## Year Period Douala Yaounde
## Min. :1950 Min. : 1 Min. : 95.0 Min. : 32.0
## 1st Qu.:1970 1st Qu.: 5 1st Qu.: 205.0 1st Qu.: 112.0
## Median :1990 Median : 9 Median : 571.0 Median : 415.0
## Mean :1990 Mean : 9 Mean : 804.8 Mean : 693.8
## 3rd Qu.:2010 3rd Qu.:13 3rd Qu.:1184.0 3rd Qu.:1025.0
## Max. :2030 Max. :17 Max. :2361.0 Max. :2349.0
## NA's :4 NA's :4
## Libreville Brazzaville Pointe_Noire Sub_Saharan_300kplus
## Min. : 15.0 Min. : 83.0 Min. : 16.0 Min. : 3923
## 1st Qu.: 49.0 1st Qu.: 172.0 1st Qu.: 89.0 1st Qu.: 16335
## Median :234.0 Median : 446.0 Median :217.0 Median : 62327
## Mean :258.5 Mean : 575.3 Mean :293.1 Mean : 91980
## 3rd Qu.:439.0 3rd Qu.: 830.0 3rd Qu.:439.0 3rd Qu.:137283
## Max. :631.0 Max. :1574.0 Max. :815.0 Max. :300153
## NA's :4 NA's :4 NA's :4
## Central_Africa_300kplus Cameroon_urban Congo_urban Gabon_urban
## Min. : 3660 Min. : 417 Min. : 201 Min. : 54.0
## 1st Qu.: 8921 1st Qu.: 1375 1st Qu.: 522 1st Qu.: 189.0
## Median : 22566 Median : 4787 Median :1295 Median : 655.0
## Mean : 34795 Mean : 6835 Mean :1723 Mean : 819.6
## 3rd Qu.: 51883 3rd Qu.:10625 3rd Qu.:2600 3rd Qu.:1334.0
## Max. :107747 Max. :20492 Max. :4804 Max. :2122.0
##
Another good way to get a sense for the data is to look at the first or last entries, using the head or tail commands.
head(pop)
## Year Period Douala Yaounde Libreville Brazzaville Pointe_Noire
## 1 1950 1 95 32 15 83 16
## 2 1955 2 117 49 17 92 35
## 3 1960 3 153 75 29 124 64
## 4 1965 4 205 112 49 172 89
## 5 1970 5 298 183 77 238 116
## 6 1975 6 433 292 155 329 154
## Sub_Saharan_300kplus Central_Africa_300kplus Cameroon_urban Congo_urban
## 1 3923 3660 417 201
## 2 4909 4521 557 253
## 3 7083 5651 747 320
## 4 10779 7047 1011 408
## 5 16335 8921 1375 522
## 6 24143 11495 2112 672
## Gabon_urban
## 1 54
## 2 68
## 3 87
## 4 126
## 5 189
## 6 279
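To see the bottom of the data instead, swap in tail; the output (omitted here) has the same shape:
tail(pop)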
For help with any of these commands, use the help function by typing ? before the command name. For example, try typing ?head into the command console.
EXERCISE
Walk through this brief introduction to R and R Studio.
I frequently use the National Transit Database for analysis or quick summaries of key information about public transportation in the US. I also have my students use it to perform a back-of-the-envelope cost-benefit analysis for a proposed light rail system in one of my transportation planning courses. In both instances, I find the downloadable Excel sheets a bit clunky and slow to manipulate. Most recently, I wanted a quick comparison of bus ridership per route kilometer for a system in Indonesia. As usual, I downloaded the Excel sheets and spent about an hour playing with the data until I got a single reasonable answer. And, as usual, I would have had to do another bit of work if I wanted to change anything. Instead, I decided to clean up a subset of the data and rearrange it into a long-format panel. For a description of the data and some of the choices made (like when to make an entry a 0 or an NA), see here.
I’m also hoping that the clean data set will encourage my students to choose to work with the data in R, which I use in my Planning Methods and Transportation Planning courses.
Unfortunately, the learning curve can be steep for new users. Processing data, in particular, can be slow and is rarely immediately rewarding. New users are unlikely to see much benefit to downloading, cleaning, and formatting data just to use it in a class assignment. I wouldn’t either.
Now back to my original question: how does the number of passengers per kilometer in my Indonesian city stack up against ridership in US cities?
After loading the file, the code below does the trick. I only include directly operated routes (Service == "DO") because the privately operated routes (Service == "PT") are less accurately reported. Unlinked passenger trips (UPT) aren't the same thing as total bus passengers (riders who transfer once count as two unlinked passenger trips), but they're collected in much the same way that the corridor bus counts in Indonesia would likely have occurred. I divide directional route miles (DRM) by two because DRM counts one mile of bidirectional service as two miles; halving it is closer to what the Indonesian counts measure.
summary(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.2915 14.9500 32.9000 67.7300 64.3100 1582.0000 162
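As a quick sanity check on the unit conversion, here is the same formula applied to a hypothetical system with 10 million annual unlinked trips and 100 directional route miles:
upt <- 10000000  # annual unlinked passenger trips (hypothetical)
drm <- 100  # directional route miles (hypothetical)
(upt/365) / ((1.61*drm)/2)  # daily boardings per route kilometer; roughly 340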
And this provides even more information about the distribution of ridership (Unlinked Passenger Trips anyway) per kilometer of bus route.
plot(density(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ), na.rm=T))
And this tells me how many systems perform better than my Indonesian system, which carries only 16 passengers per kilometer.
table(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )<16)
##
## FALSE TRUE
## 317 158
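A small variation I find handy: taking the mean of the same logical vector returns the share of systems below 16 rather than the counts.
mean(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ) < 16, na.rm=TRUE)  # about 0.33, i.e., 158 of the 475 reporting systems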
Before going any further, I recommend trying these three commands to get a better feel for the data; in the interest of space, I'm excluding the outputs.
head(NTD.ts)
str(NTD.ts)
summary(NTD.ts)
One of the handiest features of R is the ease with which it handles multiple data sets, so I'm going to add a couple of variables and then create a new data set that only looks at buses in 2013 that are directly operated by public agencies. I switched from Stata to R halfway into my dissertation (an otherwise foolish decision) because this made it so much easier to interact with and combine several large data sets.
NTD.ts$passenger.km <- (NTD.ts$UPT/365) / ((1.61*NTD.ts$DRM)/2)  # average daily UPT per route kilometer
NTD.ts$fare.recovery <- NTD.ts$FARES / NTD.ts$OPEXP_TOTAL  # fares as a share of operating expenses
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0 & Service == "DO")  # directly operated buses, 2013
Now let’s plot passengers per kilometer against fare recovery.
plot(new.dat$fare.recovery, new.dat$passenger.km)
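The same plot with axis labels, if you prefer (purely cosmetic):
plot(new.dat$fare.recovery, new.dat$passenger.km,
xlab="Fare recovery ratio", ylab="UPT per route kilometer")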
I’m a bit suspicious of the systems drawing more fare revenue than they spend on operations, so I want to look at them more closely: two university systems, an airport shuttle, and one I’m not sure about.
subset(new.dat, new.dat$fare.recovery>1)
## Year TRS.ID Mode Service Last.Report.Year
## 58601 2013 2166 MB DO 2013
## 59799 2013 2132 MB DO 2013
## 59988 2013 4180 MB DO 2013
## 60326 2013 8107 MB DO 2013
## System.Name
## 58601 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799 New Jersey Transit Corporation-45(NJTC-45)
## 59988 University of Georgia Transit System(UGA)
## 60326 The University of Montana - ASUM Transportation(ASUM OT)
## Small.Systems.Waiver City State Census.Year
## 58601 N Elizabeth NJ 2010
## 59799 N Newark NJ 2010
## 59988 N Athens GA 2010
## 60326 N Missoula MT 2010
## UZA.Name UZA UZA.Area.SQ.Miles UZA.Population
## 58601 New York-Newark, NY-NJ-CT 1 3450 18351295
## 59799 New York-Newark, NY-NJ-CT 1 3450 18351295
## 59988 Athens-Clarke County, GA 249 98 128754
## 60326 Missoula, MT 348 45 82157
## OPEXP_TOTAL OPEXP_VO OPEXP_VM OPEXP_NVM OPEXP_GA FARES VOMS
## 58601 15077606 9014072 4403391 708772 951371 16883604 52
## 59799 8299771 5067517 2285398 217667 729189 10034922 41
## 59988 5355484 4052395 791332 52669 459088 6281720 47
## 60326 598190 327957 164685 13613 91935 601851 6
## VRM VRH DRM UPT PMT CPI passenger.km fare.recovery
## 58601 1750683 191541 69.8 10294583 32942666 1 501.9548 1.119780
## 59799 1079382 135429 68.9 5504389 8461884 1 271.8950 1.209060
## 59988 853644 114959 96.0 11058944 4202399 1 392.0610 1.172951
## 60326 153163 11578 6.3 421694 737965 1 227.8076 1.006120
And in case I want to look at only a few columns:
subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$fare.recovery>1)
## TRS.ID System.Name
## 58601 2166 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799 2132 New Jersey Transit Corporation-45(NJTC-45)
## 59988 4180 University of Georgia Transit System(UGA)
## 60326 8107 The University of Montana - ASUM Transportation(ASUM OT)
## OPEXP_TOTAL DRM UPT PMT passenger.km fare.recovery
## 58601 15077606 69.8 10294583 32942666 501.9548 1.119780
## 59799 8299771 68.9 5504389 8461884 271.8950 1.209060
## 59988 5355484 96.0 11058944 4202399 392.0610 1.172951
## 60326 598190 6.3 421694 737965 227.8076 1.006120
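Selecting the columns by name rather than position does the same thing and won't break if the column order ever changes:
subset(new.dat[c("TRS.ID", "System.Name", "OPEXP_TOTAL", "DRM", "UPT", "PMT", "passenger.km", "fare.recovery")], new.dat$fare.recovery>1)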
Which systems are busiest?
subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$passenger.km > 750)
## TRS.ID
## 58641 2008
## 59050 5066
## 59876 5158
## System.Name
## 58641 MTA New York City Transit(NYCT)
## 59050 Chicago Transit Authority(CTA)
## 59876 University of Michigan Parking and Transportation Services(UMTS)
## OPEXP_TOTAL DRM UPT PMT passenger.km fare.recovery
## 58641 2487134393 1659.0 770962014 1614997081 1581.6043 0.3417458
## 59050 764280757 1311.2 300116357 728561319 778.9902 0.3909879
## 59876 6995443 20.7 7263234 20293476 1194.1832 0.2824377
I might also want to see how DRM and UPT look against each other, since these two numbers form the basis of the calculation. It looks an awful lot like fare recovery against productivity.
plot(new.dat$DRM, new.dat$UPT)
Lastly, I want to look at how passengers per kilometer have changed over time. First I'll redefine new.dat to include the other years. Then I'll do a box plot by year.
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F)
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
Clearly there's some bad data. This merits further investigation, but for now I'll just remove the worst offenders.
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 )
boxplot(passenger.km ~ Year,data=new.dat, main="Bus ridership density over time", na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
There's so much variation that it's hard to read, so I'll make two plots: one for high-productivity systems and one for low.
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 & passenger.km > 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
This page contains links to a long-format panel of the National Transit Database's TS2.1 – Service Data and Operating Expenses Time-Series by Mode. I maintain the NTD data conventions, so that the column names match the NTD's data glossary. I also added a column labeled CPI that is generated from the Bureau of Labor Statistics' Inflation Calculator and provides the 2013 value of one dollar in the reported year. I have not reformatted the other time series NTD data, which I use less frequently and which are not available at the same level of disaggregation. For example, the capital expenditures data include system, mode, and year, but not service type. Some data products have also been discontinued, such as the detailed sources of local revenues data. The reformatted panel is available in the following formats:
and as a CSV (password: NTD3).
I haven’t looked closely at the Stata file but exported it using the foreign package. The code used to format the data is available here (password: NTD4), if you want to make any changes. Below, I describe three important choices about when to treat missing values as an NA or a 0. The NTD data do not take a particularly clear approach to the problem. For example, entries for missing cost data vary between a null entry and $0 with no apparent pattern. For additional information about how the data were constructed, see the code above.
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992), summary(OPEXP_TOTAL) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 6334000 5050000 62260000
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992 & NTD.ts$OPEXP_TOTAL > 0), summary(OPEXP_TOTAL) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 261200 4800000 11440000 17730000 21360000 62260000
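If you decide the zeros really represent missing reports rather than zero-cost operations, one way to recode them is below; note that this overwrites the column, so consider working on a copy:
NTD.ts$OPEXP_TOTAL[NTD.ts$OPEXP_TOTAL == 0] <- NA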