Examining Philadelphia Land Use Data by Census Tract





The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.

Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value is aggregated at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and City Hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.

You can also find a description of the underlying tax assessor data here.
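
Reading the file might look something like the sketch below. The file contents here are a made-up three-column stand-in written to a temporary path so that the snippet runs on its own; with the real download, point read.csv() at wherever you saved the file. The name dat_demo is hypothetical (the rest of this lab calls the data frame dat).

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a temp path
csv_path <- tempfile(fileext = ".csv")
writeLines(c("fips,tract,sfr.built.sf,pop.2010",
             "42101000100,1,848570,3478",
             "42101000200,2,1429116,2937"), csv_path)

# read.csv() returns a data frame; the header row becomes the column names
dat_demo <- read.csv(csv_path)

dim(dat_demo)    # number of rows and columns
names(dat_demo)  # column names
```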

It is a fairly wide dataset (77 columns), so start by looking at the variable names and their column positions.

names(dat)
##  [1] "fips"                       "tract"                     
##  [3] "sfr.built.sf"               "mfr.built.sf"              
##  [5] "retail.built.sf"            "comm.built.sf"             
##  [7] "indust.built.sf"            "vacant.built.sf"           
##  [9] "exempt.built.sf"            "sfr.lot.sf"                
## [11] "mfr.lot.sf"                 "retail.lot.sf"             
## [13] "comm.lot.sf"                "indust.lot.sf"             
## [15] "vacant.lot.sf"              "exempt.lot.sf"             
## [17] "sfr.mark.val"               "mfr.mark.val"              
## [19] "retail.mark.val"            "comm.mark.val"             
## [21] "indust.mark.val"            "vacant.mark.val"           
## [23] "exempt.mark.val"            "total.built.sf"            
## [25] "total.lot.sf"               "total.mark.val"            
## [27] "ft.2.city.hall"             "ft.2.train.st"             
## [29] "ft.2.bus.st"                "pop.2010"                  
## [31] "pop_density.2010"           "square_miles.2010"         
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"            
## [35] "pop_asian.2010"             "pop_hispanic.2010"         
## [37] "households.2010"            "households_family.2010"    
## [39] "households_nonfamily.2010"  "avg_hh_size.2010"          
## [41] "pop_25_plus.2010"           "edu_less_highschool.2010"  
## [43] "edu_highschool.2010"        "edu_collegeplus.2010"      
## [45] "pop_civ_employed.2010"      "pop_civ_unemployed.2010"   
## [47] "median_hh_income.2010"      "per_capita_income.2010"    
## [49] "housing_vacant.2010"        "housing_occupied.2010"     
## [51] "housing_median_age.2010"    "percent_poverty.2010"      
## [53] "pop_change.2000"            "pop_plus10.2000"           
## [55] "pop.2000"                   "pop_density.2000"          
## [57] "square_miles.2000"          "pop_white_nonhispanic.2000"
## [59] "pop_black.2000"             "pop_asian.2000"            
## [61] "pop_hispanic.2000"          "households.2000"           
## [63] "households_family.2000"     "households_nonfamily.2000" 
## [65] "avg_hh_size.2000"           "pop_25_plus.2000"          
## [67] "edu_less_highschool.2000"   "edu_highschool.2000"       
## [69] "edu_collegeplus.2000"       "pop_civ_employed.2000"     
## [71] "pop_civ_unemployed.2000"    "median_hh_income.2000"     
## [73] "per_capita_income.2000"     "housing_vacant.2000"       
## [75] "housing_occupied.2000"      "housing_median_age.2000"   
## [77] "percent_poverty.2000"

Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.

summary(dat[c(3:26)])
##   sfr.built.sf      mfr.built.sf     retail.built.sf  comm.built.sf     
##  Min.   :      0   Min.   :      0   Min.   :     0   Min.   :       0  
##  1st Qu.: 848570   1st Qu.: 144324   1st Qu.: 22700   1st Qu.:   52053  
##  Median :1429116   Median : 270901   Median : 71180   Median :  120155  
##  Mean   :1432991   Mean   : 395834   Mean   : 93474   Mean   :  357440  
##  3rd Qu.:1970797   3rd Qu.: 461973   3rd Qu.:133074   3rd Qu.:  239220  
##  Max.   :4357472   Max.   :3099629   Max.   :780672   Max.   :19805858  
##  indust.built.sf   vacant.built.sf    exempt.built.sf   
##  Min.   :      0   Min.   :   0.000   Min.   :    3771  
##  1st Qu.:      0   1st Qu.:   0.000   1st Qu.:  178846  
##  Median :  32572   Median :   0.000   Median :  319638  
##  Mean   : 272199   Mean   :   8.906   Mean   :  712816  
##  3rd Qu.: 198236   3rd Qu.:   0.000   3rd Qu.:  683018  
##  Max.   :6372334   Max.   :1520.000   Max.   :28206707  
##    sfr.lot.sf         mfr.lot.sf      retail.lot.sf     comm.lot.sf      
##  Min.   :       0   Min.   :      0   Min.   :     0   Min.   :       0  
##  1st Qu.:  955739   1st Qu.: 115597   1st Qu.: 23083   1st Qu.:   79158  
##  Median : 1845223   Median : 251603   Median : 54802   Median :  168255  
##  Mean   : 2476646   Mean   : 400486   Mean   : 66522   Mean   :  497081  
##  3rd Qu.: 3139156   3rd Qu.: 516614   3rd Qu.: 89680   3rd Qu.:  455709  
##  Max.   :33126247   Max.   :3026616   Max.   :633754   Max.   :17741673  
##  indust.lot.sf      vacant.lot.sf      exempt.lot.sf     
##  Min.   :       0   Min.   :       0   Min.   :   29248  
##  1st Qu.:       0   1st Qu.:   68008   1st Qu.:  348156  
##  Median :   31122   Median :  167437   Median :  746535  
##  Mean   :  548498   Mean   :  448650   Mean   : 1803958  
##  3rd Qu.:  212020   3rd Qu.:  415549   3rd Qu.: 1628277  
##  Max.   :26713019   Max.   :13823323   Max.   :40611433  
##   sfr.mark.val        mfr.mark.val       retail.mark.val   
##  Min.   :        0   Min.   :        0   Min.   :       0  
##  1st Qu.: 17434950   1st Qu.:  2777300   1st Qu.:  522825  
##  Median : 40370150   Median :  6816500   Median : 1359150  
##  Mean   : 48218979   Mean   : 13362086   Mean   : 2320078  
##  3rd Qu.: 70492725   3rd Qu.: 14893450   3rd Qu.: 2729175  
##  Max.   :249615900   Max.   :248868900   Max.   :32522100  
##  comm.mark.val       indust.mark.val     vacant.mark.val   
##  Min.   :0.000e+00   Min.   :        0   Min.   :       0  
##  1st Qu.:1.369e+06   1st Qu.:        0   1st Qu.:  214075  
##  Median :3.658e+06   Median :   393350   Median :  481950  
##  Mean   :1.887e+07   Mean   :  4313410   Mean   : 1652835  
##  3rd Qu.:9.132e+06   3rd Qu.:  2400625   3rd Qu.: 1139750  
##  Max.   :1.671e+09   Max.   :154563400   Max.   :36376000  
##  exempt.mark.val     total.built.sf      total.lot.sf     
##  Min.   :5.657e+05   Min.   :   17886   Min.   :  605741  
##  1st Qu.:6.623e+06   1st Qu.: 2190500   1st Qu.: 2539507  
##  Median :1.379e+07   Median : 2830458   Median : 4162716  
##  Mean   :4.344e+07   Mean   : 3264763   Mean   : 6241842  
##  3rd Qu.:3.132e+07   3rd Qu.: 3534784   3rd Qu.: 6811800  
##  Max.   :1.502e+09   Max.   :28206707   Max.   :91325669  
##  total.mark.val     
##  Min.   :4.546e+06  
##  1st Qu.:5.499e+07  
##  Median :9.375e+07  
##  Mean   :1.322e+08  
##  3rd Qu.:1.392e+08  
##  Max.   :2.286e+09
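
Positional indices like 3:26 break if the column order ever changes. As an alternative sketch, the assessor columns can be selected by the suffixes they share. The vector below is a hand-copied subset of the names listed above, not the full dataset; on the full data the same pattern should pick out the same 24 columns, but check this yourself.

```r
# A few column names copied from the listing above (illustrative subset only)
col_names <- c("fips", "tract", "sfr.built.sf", "mfr.built.sf",
               "sfr.lot.sf", "mfr.lot.sf", "sfr.mark.val",
               "total.built.sf", "pop.2010")

# Keep names ending in ".built.sf", ".lot.sf", or ".mark.val"
assessor_cols <- grep("\\.(built\\.sf|lot\\.sf|mark\\.val)$", col_names, value = TRUE)
assessor_cols
```

With the real data frame, summary(dat[assessor_cols]) would then replace summary(dat[c(3:26)]).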

There are also four variables related to geographic location. The fips codes could be matched to a census tract shapefile. The other three measure the number of feet between a census tract centroid and City Hall, the closest train station, and the closest bus stop.

summary(dat[c(1,27:29)])
##       fips          ft.2.city.hall  ft.2.train.st      ft.2.bus.st     
##  Min.   :4.21e+10   Min.   : 2881   Min.   :  258.3   Min.   :   6.36  
##  1st Qu.:4.21e+10   1st Qu.:16596   1st Qu.: 4087.1   1st Qu.: 265.63  
##  Median :4.21e+10   Median :28692   Median : 6777.2   Median : 628.73  
##  Mean   :4.21e+10   Mean   :31776   Mean   : 7443.2   Mean   : 781.29  
##  3rd Qu.:4.21e+10   3rd Qu.:43484   3rd Qu.: 9832.9   3rd Qu.:1083.03  
##  Max.   :4.21e+10   Max.   :87673   Max.   :26020.5   Max.   :7140.68
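
The fips column prints in scientific notation because it is stored as a number, which hides the digits that identify each tract. Before matching to a shapefile it is usually safer to convert it to text. The sketch below uses two illustrative 11-digit codes typed in by hand rather than values read from the dataset, and assumes the usual layout of 2 state digits, 3 county digits, and 6 tract digits.

```r
# Illustrative 11-digit tract codes (state 42, county 101); typed in, not read from dat
fips <- c(42101000100, 42101000200)

# Fixed notation with no decimals, so every digit is visible and usable as a join key
fips_chr <- sprintf("%.0f", fips)
fips_chr

# Split into state (2 digits), county (3 digits), and tract (6 digits)
data.frame(state  = substr(fips_chr, 1, 2),
           county = substr(fips_chr, 3, 5),
           tract  = substr(fips_chr, 6, 11))
```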

The remaining data are taken from the 2000 Census and 2010 5-year ACS. Note that not all of the 2000 data match and there are several null values in different categories.

Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.

sum (dat$square_miles.2010)
## [1] 134.1014
sum(dat$total.lot.sf) / 27.8784e6  # divide by the number of square feet in a square mile (5280^2)
## [1] 85.97578
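
As a quick check on the conversion and on what the two totals imply, the sketch below recomputes the divisor and the share of tract area covered by parcels, using the two sums printed above. The remainder is presumably land outside assessed parcels, such as streets and other rights of way, though the data do not say so directly.

```r
sq_ft_per_sq_mile <- 5280^2   # 27,878,400 square feet in a square mile

tract_sq_miles  <- 134.1014   # sum of square_miles.2010, copied from the output above
parcel_sq_miles <- 85.97578   # total lot area in square miles, copied from the output above

# Share of Census tract area that falls inside assessed parcels
parcel_share <- parcel_sq_miles / tract_sq_miles
round(parcel_share, 2)
```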

Now start creating and examining some key ratios, like the total floor area ratio (FAR), which is built area divided by lot area. There is significant variation: some tracts have almost no built area, while others have a high net FAR.

dat$far <- dat$total.built.sf / dat$total.lot.sf
summary(dat$far)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.005557  0.419600  0.696300  0.875800  1.052000 13.330000

Try excluding vacant land from the FAR calculation.

summary (dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.005557  0.423300  0.748200  0.943900  1.165000 14.650000
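
To see why removing vacant land raises FAR only in some tracts, here is a worked example on two made-up tracts (the numbers are invented for illustration). The column names match the dataset, but toy, far_all, and far_no_vacant are hypothetical names.

```r
# Two invented tracts: one with 20% vacant land, one with none
toy <- data.frame(
  total.built.sf = c(500000, 1200000),
  total.lot.sf   = c(1000000, 1000000),
  vacant.lot.sf  = c(200000, 0)
)

# FAR over all parcel land
far_all <- toy$total.built.sf / toy$total.lot.sf

# FAR over non-vacant parcel land only
far_no_vacant <- toy$total.built.sf / (toy$total.lot.sf - toy$vacant.lot.sf)

cbind(toy, far_all, far_no_vacant)
```

The first tract's FAR rises from 0.5 to 0.625 once its vacant land is dropped from the denominator; the second tract has no vacant land, so its FAR stays at 1.2.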

Next, look at how the share of vacant land varies across census tracts.

hist(dat$vacant.lot.sf/ dat$total.lot.sf )


Now, create some net FAR calculations by land use type.

summary(dat$comm.built.sf/ dat$comm.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00376  0.33490  0.61070  1.09600  1.16400 24.78000        9
summary(dat$indust.built.sf/ dat$indust.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00711  0.59420  1.04900  1.44700  1.66800 16.04000      104
summary(dat$retail.built.sf/ dat$retail.lot.sf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0355  0.9365  1.3690  1.3830  1.8030  4.3190      29
summary(dat$exempt.built.sf/ dat$exempt.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001703 0.209100 0.522600 0.783200 1.033000 7.203000
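
The NA's in these summaries most likely come from tracts with no land in a given use: in R, 0/0 is NaN, which summary() counts as NA, and a nonzero built area over a zero lot area would give Inf. The sketch below computes FAR for several uses in one pass and sets any non-finite result to NA. The data are invented and only two uses are shown; toy and far_by_use are hypothetical names.

```r
# Three invented tracts; some have no commercial or industrial land at all
toy <- data.frame(
  comm.built.sf   = c(50000, 0, 120000),
  comm.lot.sf     = c(100000, 0, 60000),
  indust.built.sf = c(0, 0, 30000),
  indust.lot.sf   = c(0, 20000, 15000)
)

uses <- c("comm", "indust")

# One FAR column per use; NaN (0/0) and Inf (n/0) are replaced with NA
far_by_use <- sapply(uses, function(u) {
  far <- toy[[paste0(u, ".built.sf")]] / toy[[paste0(u, ".lot.sf")]]
  far[!is.finite(far)] <- NA
  far
})
far_by_use
```

Note that a tract with industrial land but no industrial buildings keeps a FAR of 0 rather than NA, which is a different situation from having no industrial land.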

It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understanding how much space new households are likely to consume. Note the substantial variation.

dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010
dat$res.sf.per.hh[dat$res.sf.per.hh == Inf] <- NA  ## set to NA where households == 0, since n/0 is Inf in R
summary( dat$res.sf.per.hh )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   104.1  1092.0  1220.0  1237.0  1367.0  4538.0       8

This looks about right: in the middle half of tracts, the average household has between roughly 1,100 and 1,400 square feet.

hist( dat$res.sf.per.hh )


And per capita:

dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010
dat$res.sf.per.person[dat$res.sf.per.person == Inf] <- NA
hist( dat$res.sf.per.person )


Look at the relationship between the population and square footage of housing.

plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )


Perhaps surprisingly, there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood.

plot(dat$res.sf.per.person , dat$ft.2.city.hall)


There is a bit more of a pattern between location and the proportion of housing that is multifamily.

dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf )
plot(dat$per.mf , dat$ft.2.city.hall)
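
A correlation coefficient can back up what the scatterplots suggest. Because ratios like these may contain missing values, cor() needs to be told to drop incomplete pairs. The sketch below runs on simulated data with a negative relationship built in, so the result says nothing about Philadelphia; it only shows the call.

```r
set.seed(1)
n <- 50

# Simulated tracts: multifamily share falls with distance from City Hall (by construction)
toy <- data.frame(ft.2.city.hall = runif(n, 3000, 90000))
toy$per.mf <- pmin(1, pmax(0, 0.6 - toy$ft.2.city.hall / 150000 + rnorm(n, sd = 0.1)))

# Mimic tracts with no residential floor area, where the share is undefined
toy$per.mf[c(5, 17)] <- NA

# With the default settings cor() returns NA if any value is missing;
# use = "complete.obs" drops the incomplete pairs first
r <- cor(toy$per.mf, toy$ft.2.city.hall, use = "complete.obs")
r
```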


Which land uses tend to go together? Use a correlation table to examine total square footage.

cor(dat[c(3:7,9)])
##                 sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## sfr.built.sf       1.0000000 -0.227687793     0.208504423   -0.21049419
## mfr.built.sf      -0.2276878  1.000000000     0.009489829    0.45513983
## retail.built.sf    0.2085044  0.009489829     1.000000000    0.04310543
## comm.built.sf     -0.2104942  0.455139826     0.043105428    1.00000000
## indust.built.sf   -0.1466598 -0.135759385    -0.038551580    0.04383247
## exempt.built.sf   -0.2613878  0.103988183    -0.049405209    0.16346712
##                 indust.built.sf exempt.built.sf
## sfr.built.sf        -0.14665984     -0.26138775
## mfr.built.sf        -0.13575939      0.10398818
## retail.built.sf     -0.03855158     -0.04940521
## comm.built.sf        0.04383247      0.16346712
## indust.built.sf      1.00000000      0.06878225
## exempt.built.sf      0.06878225      1.00000000

And land area:

cor(dat[c(10:16)])
##                 sfr.lot.sf   mfr.lot.sf retail.lot.sf comm.lot.sf
## sfr.lot.sf     1.000000000  0.174062104   -0.01660241  0.03719301
## mfr.lot.sf     0.174062104  1.000000000   -0.06027767  0.12670109
## retail.lot.sf -0.016602407 -0.060277674    1.00000000 -0.01112620
## comm.lot.sf    0.037193013  0.126701092   -0.01112620  1.00000000
## indust.lot.sf -0.071661971 -0.009073780   -0.10092578  0.39273649
## vacant.lot.sf -0.013189068  0.009647197   -0.05241089  0.71002277
## exempt.lot.sf  0.000523062  0.011982805   -0.16823137  0.37194754
##               indust.lot.sf vacant.lot.sf exempt.lot.sf
## sfr.lot.sf      -0.07166197  -0.013189068   0.000523062
## mfr.lot.sf      -0.00907378   0.009647197   0.011982805
## retail.lot.sf   -0.10092578  -0.052410890  -0.168231370
## comm.lot.sf      0.39273649   0.710022767   0.371947536
## indust.lot.sf    1.00000000   0.643286320   0.489299242
## vacant.lot.sf    0.64328632   1.000000000   0.531508105
## exempt.lot.sf    0.48929924   0.531508105   1.000000000

Neither of these measures normalizes the data. Now look at how the proportion of built area in single-family housing correlates with the proportion in each of the other land uses.

cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))
##              mfr.built.sf
## sfr.built.sf   -0.2779901
cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))
##              retail.built.sf
## sfr.built.sf       0.1160276
cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))
##              comm.built.sf
## sfr.built.sf    -0.4443231
cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))
##              indust.built.sf
## sfr.built.sf      -0.3305134
cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))
##              vacant.built.sf
## sfr.built.sf     0.008721208
cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))
##              exempt.built.sf
## sfr.built.sf      -0.6608287
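
The six calls above differ only in one column, so they could be collapsed into a loop over column names. The sketch below shows the idea on simulated data with four uses (the values are random, so the correlations themselves are meaningless); toy, use_cols, and share_cor are hypothetical names. Keep in mind that shares of a fixed total tend to be negatively correlated with one another partly by construction.

```r
set.seed(42)
n <- 40

# Simulated built area by use for 40 made-up tracts
toy <- data.frame(
  sfr.built.sf    = runif(n, 1e5, 2e6),
  mfr.built.sf    = runif(n, 0, 1e6),
  retail.built.sf = runif(n, 0, 2e5),
  comm.built.sf   = runif(n, 0, 5e5)
)
toy$total.built.sf <- rowSums(toy)

use_cols  <- c("mfr.built.sf", "retail.built.sf", "comm.built.sf")
sfr_share <- toy$sfr.built.sf / toy$total.built.sf

# Correlation of the single-family share with each other use's share
share_cor <- sapply(use_cols, function(col) {
  cor(sfr_share, toy[[col]] / toy$total.built.sf)
})
round(share_cor, 3)
```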

Try using regressions to better understand partial correlations between land uses.

summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf 
           +  exempt.built.sf +  vacant.lot.sf  , dat))
## 
## Call:
## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf + 
##     comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1548586  -500194   -17272   498795  2940319 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.550e+06  6.833e+04  22.686  < 2e-16 ***
## mfr.built.sf    -3.486e-01  1.005e-01  -3.469 0.000582 ***
## retail.built.sf  1.605e+00  3.848e-01   4.172 3.76e-05 ***
## comm.built.sf   -5.509e-02  3.127e-02  -1.762 0.078859 .  
## indust.built.sf -1.539e-01  6.400e-02  -2.405 0.016662 *  
## exempt.built.sf -8.591e-02  1.978e-02  -4.344 1.80e-05 ***
## vacant.lot.sf   -1.419e-02  3.871e-02  -0.367 0.714091    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728300 on 377 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1661 
## F-statistic: 13.71 on 6 and 377 DF,  p-value: 4.216e-14
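
To read the output above: the coefficient on mfr.built.sf of about -0.35 means that each additional square foot of multifamily built area is associated with roughly 0.35 fewer square feet of single-family built area, holding the other uses constant. When comparing several models it can help to pull the coefficient table out as a matrix rather than reading the printout. The sketch below does this on simulated data with made-up coefficients, so its estimates are not results for Philadelphia.

```r
set.seed(7)
n <- 100

# Simulated tracts with a known (invented) relationship between uses
toy <- data.frame(
  mfr.built.sf    = runif(n, 0, 1e6),
  retail.built.sf = runif(n, 0, 2e5)
)
toy$sfr.built.sf <- 1.5e6 - 0.35 * toy$mfr.built.sf +
  1.6 * toy$retail.built.sf + rnorm(n, sd = 2e5)

fit <- lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf, data = toy)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Keep only the terms significant at the 5% level
coefs[coefs[, "Pr(>|t|)"] < 0.05, c("Estimate", "Pr(>|t|)")]
```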

EXERCISE

  1. Estimate a gross measure of total FAR using Census land area and plot this against net FAR. Describe any important variation.

  2. Try developing a model to predict neighborhood FAR. What are the important and statistically significant predictors?

  3. Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses?
