The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.
Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value is aggregated at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and city hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.
You can also find a description of the underlying tax assessor data here.
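The lab does not show the import step. A minimal sketch, assuming the download is a CSV file: the file name in the comment and the two-row stand-in file are made up so the snippet runs on its own, and the real download may use a different name or format.

```r
# Stand-in for the downloaded file: a tiny CSV with made-up values,
# written to a temporary path so this sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("fips,tract,pop.2010",
             "42101000100,1,3000",
             "42101000200,2,2500"), csv_path)

# With the real download, point read.csv() at the saved file instead,
# e.g. read.csv("philadelphia_tracts.csv") (hypothetical file name).
dat_demo <- read.csv(csv_path)

dim(dat_demo)    # number of rows (tracts) and columns (variables)
names(dat_demo)  # variable names
```

The object is called `dat_demo` here only to avoid overwriting the real `dat` used in the rest of the lab.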
It is a fairly wide dataset (77 columns), so start by looking at the variable names and numbers.
names(dat)
## [1] "fips" "tract"
## [3] "sfr.built.sf" "mfr.built.sf"
## [5] "retail.built.sf" "comm.built.sf"
## [7] "indust.built.sf" "vacant.built.sf"
## [9] "exempt.built.sf" "sfr.lot.sf"
## [11] "mfr.lot.sf" "retail.lot.sf"
## [13] "comm.lot.sf" "indust.lot.sf"
## [15] "vacant.lot.sf" "exempt.lot.sf"
## [17] "sfr.mark.val" "mfr.mark.val"
## [19] "retail.mark.val" "comm.mark.val"
## [21] "indust.mark.val" "vacant.mark.val"
## [23] "exempt.mark.val" "total.built.sf"
## [25] "total.lot.sf" "total.mark.val"
## [27] "ft.2.city.hall" "ft.2.train.st"
## [29] "ft.2.bus.st" "pop.2010"
## [31] "pop_density.2010" "square_miles.2010"
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"
## [35] "pop_asian.2010" "pop_hispanic.2010"
## [37] "households.2010" "households_family.2010"
## [39] "households_nonfamily.2010" "avg_hh_size.2010"
## [41] "pop_25_plus.2010" "edu_less_highschool.2010"
## [43] "edu_highschool.2010" "edu_collegeplus.2010"
## [45] "pop_civ_employed.2010" "pop_civ_unemployed.2010"
## [47] "median_hh_income.2010" "per_capita_income.2010"
## [49] "housing_vacant.2010" "housing_occupied.2010"
## [51] "housing_median_age.2010" "percent_poverty.2010"
## [53] "pop_change.2000" "pop_plus10.2000"
## [55] "pop.2000" "pop_density.2000"
## [57] "square_miles.2000" "pop_white_nonhispanic.2000"
## [59] "pop_black.2000" "pop_asian.2000"
## [61] "pop_hispanic.2000" "households.2000"
## [63] "households_family.2000" "households_nonfamily.2000"
## [65] "avg_hh_size.2000" "pop_25_plus.2000"
## [67] "edu_less_highschool.2000" "edu_highschool.2000"
## [69] "edu_collegeplus.2000" "pop_civ_employed.2000"
## [71] "pop_civ_unemployed.2000" "median_hh_income.2000"
## [73] "per_capita_income.2000" "housing_vacant.2000"
## [75] "housing_occupied.2000" "housing_median_age.2000"
## [77] "percent_poverty.2000"
Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.
summary(dat[c(3:26)])
## sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 848570 1st Qu.: 144324 1st Qu.: 22700 1st Qu.: 52053
## Median :1429116 Median : 270901 Median : 71180 Median : 120155
## Mean :1432991 Mean : 395834 Mean : 93474 Mean : 357440
## 3rd Qu.:1970797 3rd Qu.: 461973 3rd Qu.:133074 3rd Qu.: 239220
## Max. :4357472 Max. :3099629 Max. :780672 Max. :19805858
## indust.built.sf vacant.built.sf exempt.built.sf
## Min. : 0 Min. : 0.000 Min. : 3771
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.: 178846
## Median : 32572 Median : 0.000 Median : 319638
## Mean : 272199 Mean : 8.906 Mean : 712816
## 3rd Qu.: 198236 3rd Qu.: 0.000 3rd Qu.: 683018
## Max. :6372334 Max. :1520.000 Max. :28206707
## sfr.lot.sf mfr.lot.sf retail.lot.sf comm.lot.sf
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 955739 1st Qu.: 115597 1st Qu.: 23083 1st Qu.: 79158
## Median : 1845223 Median : 251603 Median : 54802 Median : 168255
## Mean : 2476646 Mean : 400486 Mean : 66522 Mean : 497081
## 3rd Qu.: 3139156 3rd Qu.: 516614 3rd Qu.: 89680 3rd Qu.: 455709
## Max. :33126247 Max. :3026616 Max. :633754 Max. :17741673
## indust.lot.sf vacant.lot.sf exempt.lot.sf
## Min. : 0 Min. : 0 Min. : 29248
## 1st Qu.: 0 1st Qu.: 68008 1st Qu.: 348156
## Median : 31122 Median : 167437 Median : 746535
## Mean : 548498 Mean : 448650 Mean : 1803958
## 3rd Qu.: 212020 3rd Qu.: 415549 3rd Qu.: 1628277
## Max. :26713019 Max. :13823323 Max. :40611433
## sfr.mark.val mfr.mark.val retail.mark.val
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 17434950 1st Qu.: 2777300 1st Qu.: 522825
## Median : 40370150 Median : 6816500 Median : 1359150
## Mean : 48218979 Mean : 13362086 Mean : 2320078
## 3rd Qu.: 70492725 3rd Qu.: 14893450 3rd Qu.: 2729175
## Max. :249615900 Max. :248868900 Max. :32522100
## comm.mark.val indust.mark.val vacant.mark.val
## Min. :0.000e+00 Min. : 0 Min. : 0
## 1st Qu.:1.369e+06 1st Qu.: 0 1st Qu.: 214075
## Median :3.658e+06 Median : 393350 Median : 481950
## Mean :1.887e+07 Mean : 4313410 Mean : 1652835
## 3rd Qu.:9.132e+06 3rd Qu.: 2400625 3rd Qu.: 1139750
## Max. :1.671e+09 Max. :154563400 Max. :36376000
## exempt.mark.val total.built.sf total.lot.sf
## Min. :5.657e+05 Min. : 17886 Min. : 605741
## 1st Qu.:6.623e+06 1st Qu.: 2190500 1st Qu.: 2539507
## Median :1.379e+07 Median : 2830458 Median : 4162716
## Mean :4.344e+07 Mean : 3264763 Mean : 6241842
## 3rd Qu.:3.132e+07 3rd Qu.: 3534784 3rd Qu.: 6811800
## Max. :1.502e+09 Max. :28206707 Max. :91325669
## total.mark.val
## Min. :4.546e+06
## 1st Qu.:5.499e+07
## Median :9.375e+07
## Mean :1.322e+08
## 3rd Qu.:1.392e+08
## Max. :2.286e+09
There are also four variables related to geographic location. The fips codes could be matched to a census tract shapefile; the others measure the distance in feet from each census tract centroid to city hall, the closest train station, and the closest bus stop.
summary(dat[c(1,27:29)])
## fips ft.2.city.hall ft.2.train.st ft.2.bus.st
## Min. :4.21e+10 Min. : 2881 Min. : 258.3 Min. : 6.36
## 1st Qu.:4.21e+10 1st Qu.:16596 1st Qu.: 4087.1 1st Qu.: 265.63
## Median :4.21e+10 Median :28692 Median : 6777.2 Median : 628.73
## Mean :4.21e+10 Mean :31776 Mean : 7443.2 Mean : 781.29
## 3rd Qu.:4.21e+10 3rd Qu.:43484 3rd Qu.: 9832.9 3rd Qu.:1083.03
## Max. :4.21e+10 Max. :87673 Max. :26020.5 Max. :7140.68
The remaining data are taken from the 2000 Census and the 2010 5-year ACS. Note that not all of the 2000 data match, and there are several null values in different categories.
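One way to see where the null values sit is to count them column by column. The sketch below uses a small made-up data frame in place of `dat`; only the column names are borrowed from the dataset.

```r
# Toy stand-in for dat; the values are made up for illustration.
toy <- data.frame(
  pop.2010 = c(3000, 2500, 4100),
  pop.2000 = c(2800, NA, 3900),
  median_hh_income.2000 = c(NA, NA, 41000)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Show only the columns that have at least one missing value
na_counts[na_counts > 0]
```

On the lab data, the same idea would be `colSums(is.na(dat[30:77]))`, assuming the Census variables sit in columns 30 through 77 as in the listing above.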
Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.
sum(dat$square_miles.2010)
## [1] 134.1014
sum(dat$total.lot.sf) / 27.8784e6 # divide by the number of square feet in a square mile (5280^2)
## [1] 85.97578
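The two totals can be compared directly. This sketch only reuses the figures printed above; the variable names are illustrative.

```r
sq_ft_per_sq_mile <- 5280^2   # 27,878,400, the 27.8784e6 used above

tract_sq_miles  <- 134.1014   # sum of square_miles.2010, from the output above
parcel_sq_miles <- 85.97578   # total lot area in square miles, from the output above

# Share of tract area that is covered by assessed parcels
parcel_share <- parcel_sq_miles / tract_sq_miles
round(parcel_share, 2)
## [1] 0.64
```

Roughly 64 percent of tract area is in parcels. The remainder is presumably streets, water, and other land outside the assessor's parcel records, though the source does not say so explicitly.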
Now start creating and examining some key ratios, like total floor area ratio (FAR): built square footage divided by lot square footage. There is significant variation, with some tracts having almost no built area and others having a high net FAR.
dat$far <- dat$total.built.sf / dat$total.lot.sf
summary(dat$far)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005557 0.419600 0.696300 0.875800 1.052000 13.330000
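The mean in this summary is an unweighted average of tract-level ratios, which is not the same thing as the city-wide net FAR. As a rough check, the city-wide figure can be recovered from the means in the earlier summary output, since the ratio of two means over the same tracts equals the ratio of the totals.

```r
mean_built <- 3264763   # mean of total.built.sf, from the summary above
mean_lot   <- 6241842   # mean of total.lot.sf, from the summary above

# Ratio of means = ratio of sums, because both are divided by the same tract count
citywide_far <- mean_built / mean_lot
round(citywide_far, 3)
## [1] 0.523
```

With the full data, `sum(dat$total.built.sf) / sum(dat$total.lot.sf)` should give the same figure. It is well below the tract mean of 0.8758, which suggests that tracts with a lot of land tend to be built less intensively.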
Try excluding vacant land from the FAR calculation.
summary(dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005557 0.423300 0.748200 0.943900 1.165000 14.650000
Next, look at how the vacant share of lot area varies across census tracts.
hist(dat$vacant.lot.sf / dat$total.lot.sf)
Now, create some net FAR calculations by land use type.
summary(dat$comm.built.sf/ dat$comm.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00376 0.33490 0.61070 1.09600 1.16400 24.78000 9
summary(dat$indust.built.sf/ dat$indust.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00711 0.59420 1.04900 1.44700 1.66800 16.04000 104
summary(dat$retail.built.sf/ dat$retail.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0355 0.9365 1.3690 1.3830 1.8030 4.3190 29
summary(dat$exempt.built.sf/ dat$exempt.lot.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001703 0.209100 0.522600 0.783200 1.033000 7.203000
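The NA's in these summaries come from tracts with no land in a given use: 0/0 evaluates to NaN in R, and summary() counts NaN among the NA's. A small sketch with made-up numbers shows the two cases that can arise when the denominator is zero; `safe_ratio` is a hypothetical helper, not part of the lab.

```r
# Toy built and lot areas (made up). The second tract has neither built
# area nor lot area in this use; the third has built area but no lot area.
built <- c(5000, 0, 1200)
lot   <- c(10000, 0, 0)

raw <- built / lot   # 0.5, NaN (0/0), Inf (1200/0)

# Hypothetical helper: return NA whenever the ratio is not a finite number
safe_ratio <- function(num, den) {
  r <- num / den
  r[!is.finite(r)] <- NA
  r
}

safe_ratio(built, lot)
```

Using `is.finite()` catches both NaN and Inf in one step, so downstream summaries and plots only see real ratios or NA.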
It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understanding how much space new households are likely to consume. Note the substantial variation.
dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010
dat$res.sf.per.hh[is.infinite(dat$res.sf.per.hh)] <- NA # tracts with zero households give n/0 = Inf; set these to NA
summary( dat$res.sf.per.hh )
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 104.1 1092.0 1220.0 1237.0 1367.0 4538.0 8
This looks about right: in the middle half of tracts, households average between roughly 1,100 and 1,400 square feet of housing.
hist( dat$res.sf.per.hh )
Now calculate the same ratio per capita.
dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010
dat$res.sf.per.person[is.infinite(dat$res.sf.per.person)] <- NA # tracts with zero population give n/0 = Inf
hist( dat$res.sf.per.person )
Look at the relationship between the population and square footage of housing.
plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )
Perhaps surprisingly, there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood.
plot(dat$res.sf.per.person , dat$ft.2.city.hall)
There is a bit more of a pattern between location and the proportion of housing that is multifamily.
dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf )
plot(dat$per.mf , dat$ft.2.city.hall)
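A scatterplot pattern can be backed up with a correlation coefficient. The sketch below uses made-up values in place of `dat$per.mf` and `dat$ft.2.city.hall`, including one NaN of the kind produced by a tract with no residential floor area.

```r
# Toy stand-in values (made up): multifamily share and distance to city hall in feet
per_mf  <- c(0.80, 0.55, NaN, 0.30, 0.10)
ft_hall <- c(3000, 12000, 20000, 35000, 60000)

# use = "complete.obs" drops pairs where either value is missing (NaN counts as missing)
r <- cor(per_mf, ft_hall, use = "complete.obs")
r
```

Without the `use` argument, a single missing value makes `cor()` return NA. On the lab data the equivalent call would be `cor(dat$per.mf, dat$ft.2.city.hall, use = "complete.obs")`.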
Which land uses tend to go together? Use a correlation table to examine total square footage.
cor(dat[c(3:7,9)])
## sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## sfr.built.sf 1.0000000 -0.227687793 0.208504423 -0.21049419
## mfr.built.sf -0.2276878 1.000000000 0.009489829 0.45513983
## retail.built.sf 0.2085044 0.009489829 1.000000000 0.04310543
## comm.built.sf -0.2104942 0.455139826 0.043105428 1.00000000
## indust.built.sf -0.1466598 -0.135759385 -0.038551580 0.04383247
## exempt.built.sf -0.2613878 0.103988183 -0.049405209 0.16346712
## indust.built.sf exempt.built.sf
## sfr.built.sf -0.14665984 -0.26138775
## mfr.built.sf -0.13575939 0.10398818
## retail.built.sf -0.03855158 -0.04940521
## comm.built.sf 0.04383247 0.16346712
## indust.built.sf 1.00000000 0.06878225
## exempt.built.sf 0.06878225 1.00000000
Now do the same for land area.
cor(dat[c(10:16)])
## sfr.lot.sf mfr.lot.sf retail.lot.sf comm.lot.sf
## sfr.lot.sf 1.000000000 0.174062104 -0.01660241 0.03719301
## mfr.lot.sf 0.174062104 1.000000000 -0.06027767 0.12670109
## retail.lot.sf -0.016602407 -0.060277674 1.00000000 -0.01112620
## comm.lot.sf 0.037193013 0.126701092 -0.01112620 1.00000000
## indust.lot.sf -0.071661971 -0.009073780 -0.10092578 0.39273649
## vacant.lot.sf -0.013189068 0.009647197 -0.05241089 0.71002277
## exempt.lot.sf 0.000523062 0.011982805 -0.16823137 0.37194754
## indust.lot.sf vacant.lot.sf exempt.lot.sf
## sfr.lot.sf -0.07166197 -0.013189068 0.000523062
## mfr.lot.sf -0.00907378 0.009647197 0.011982805
## retail.lot.sf -0.10092578 -0.052410890 -0.168231370
## comm.lot.sf 0.39273649 0.710022767 0.371947536
## indust.lot.sf 1.00000000 0.643286320 0.489299242
## vacant.lot.sf 0.64328632 1.000000000 0.531508105
## exempt.lot.sf 0.48929924 0.531508105 1.000000000
Neither of these measures normalizes the data. Now examine how the share of a tract's built area in one land use correlates with the shares in other land uses.
cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))
## mfr.built.sf
## sfr.built.sf -0.2779901
cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))
## retail.built.sf
## sfr.built.sf 0.1160276
cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))
## comm.built.sf
## sfr.built.sf -0.4443231
cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))
## indust.built.sf
## sfr.built.sf -0.3305134
cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))
## vacant.built.sf
## sfr.built.sf 0.008721208
cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))
## exempt.built.sf
## sfr.built.sf -0.6608287
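The six calls above can be collapsed into a single correlation matrix by computing all the shares first. This is a sketch on a made-up data frame; only the column names follow the dataset.

```r
# Toy stand-in for dat (values made up); column names follow the lab's conventions.
toy <- data.frame(
  sfr.built.sf  = c(900, 800, 100, 350),
  mfr.built.sf  = c(50, 600, 500, 50),
  comm.built.sf = c(50, 600, 400, 100)
)
toy$total.built.sf <- rowSums(toy)

# Divide each land-use column by the tract total to get shares of built area
use_cols <- c("sfr.built.sf", "mfr.built.sf", "comm.built.sf")
shares <- toy[use_cols] / toy$total.built.sf

# One matrix holds every pairwise correlation between the shares
share_cor <- cor(shares)
round(share_cor, 2)
```

On the lab data the shares would be `dat[3:9] / dat$total.built.sf`. Because shares within a tract sum to one, a larger share in one use mechanically leaves less for the others, so some negative correlation is expected even without any substantive relationship.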
Try using regressions to better understand partial correlations between land uses.
summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf
+ exempt.built.sf + vacant.lot.sf , dat))
##
## Call:
## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf +
## comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1548586 -500194 -17272 498795 2940319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.550e+06 6.833e+04 22.686 < 2e-16 ***
## mfr.built.sf -3.486e-01 1.005e-01 -3.469 0.000582 ***
## retail.built.sf 1.605e+00 3.848e-01 4.172 3.76e-05 ***
## comm.built.sf -5.509e-02 3.127e-02 -1.762 0.078859 .
## indust.built.sf -1.539e-01 6.400e-02 -2.405 0.016662 *
## exempt.built.sf -8.591e-02 1.978e-02 -4.344 1.80e-05 ***
## vacant.lot.sf -1.419e-02 3.871e-02 -0.367 0.714091
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 728300 on 377 degrees of freedom
## Multiple R-squared: 0.1791, Adjusted R-squared: 0.1661
## F-statistic: 13.71 on 6 and 377 DF, p-value: 4.216e-14
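Each coefficient is the change in single-family square footage associated with one more square foot of the other use, holding the rest constant. For example, the multifamily estimate of -3.486e-01 means about 0.35 fewer square feet of single-family space per additional square foot of multifamily space. The coefficient table can also be pulled out as a matrix and filtered. The sketch below does this on simulated data, not the lab data.

```r
set.seed(1)  # simulated data, for illustration only

# Simulated tracts in which single-family area falls as multifamily area rises
n <- 50
toy <- data.frame(mfr.built.sf    = runif(n, 0, 1e6),
                  retail.built.sf = runif(n, 0, 2e5))
toy$sfr.built.sf <- 1.5e6 - 0.4 * toy$mfr.built.sf + rnorm(n, sd = 1e5)

fit <- lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf, data = toy)

# Coefficient table as a matrix: estimate, standard error, t value, p-value
coefs <- coef(summary(fit))

# Keep predictors (not the intercept) with p < 0.05
preds <- coefs[-1, , drop = FALSE]
preds[preds[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
```

The 0.05 cutoff is a convention rather than a rule, and the simulated slope of -0.4 was chosen arbitrarily.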
EXERCISE
- Estimate a gross measure of total FAR using Census land area and plot this against net FAR. Describe any important variation.
- Try developing a model to predict neighborhood FAR. What are the important and statistically significant predictors?
- Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses?