## Moving to access report

Brookings Institution report with Gilles Duranton:

Developing a common narrative on urban accessibility: An urban planning perspective

## Beyond Mobility book cover

Sneak peak at the cover of my forthcoming Island Press book with Robert Cervero and Stefan Al:

Traditionally, transport planning has sought to enhance mobility. Shifting the focus from movement to accessing destinations leads to an entirely different framework for planning and designing cities and the infrastructure that serves them. This book illustrate principles and practices of linking sustainable transport programs with place-making.

## Queretaro Bike Planning Studio

Last fall, I took 8 graduate students to Queretaro, Mexico, for a bicycle planning studio. The final final planning document is available here: CYCLE_QRO_Small

A few group images from the trip:

## Planning for cars that drive themselves

Driverless cars are coming. Nearly all major car manufactures and several large technology firms are racing to develop vehicles that can operate with limited-to-no human interaction. Despite the probability of technological, legal, and political pitfalls along the way, fully autonomous vehicles (buses and cars) will likely be commercially available and driving themselves on city streets and highways in the next 20 years—well within a long-range planning horizon.

## Examining Philadelphia Land Use Data by Census Tract

The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.

Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value are aggregates at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and city hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.

You can also find a description of the underlying tax assessor data here.

It is a fairly wide dataset (77 columns) so start by looking at the variable names and numbers.

names(dat)

##  [1] "fips"                       "tract"
##  [3] "sfr.built.sf"               "mfr.built.sf"
##  [5] "retail.built.sf"            "comm.built.sf"
##  [7] "indust.built.sf"            "vacant.built.sf"
##  [9] "exempt.built.sf"            "sfr.lot.sf"
## [11] "mfr.lot.sf"                 "retail.lot.sf"
## [13] "comm.lot.sf"                "indust.lot.sf"
## [15] "vacant.lot.sf"              "exempt.lot.sf"
## [17] "sfr.mark.val"               "mfr.mark.val"
## [19] "retail.mark.val"            "comm.mark.val"
## [21] "indust.mark.val"            "vacant.mark.val"
## [23] "exempt.mark.val"            "total.built.sf"
## [25] "total.lot.sf"               "total.mark.val"
## [27] "ft.2.city.hall"             "ft.2.train.st"
## [29] "ft.2.bus.st"                "pop.2010"
## [31] "pop_density.2010"           "square_miles.2010"
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"
## [35] "pop_asian.2010"             "pop_hispanic.2010"
## [37] "households.2010"            "households_family.2010"
## [39] "households_nonfamily.2010"  "avg_hh_size.2010"
## [41] "pop_25_plus.2010"           "edu_less_highschool.2010"
## [43] "edu_highschool.2010"        "edu_collegeplus.2010"
## [45] "pop_civ_employed.2010"      "pop_civ_unemployed.2010"
## [47] "median_hh_income.2010"      "per_capita_income.2010"
## [49] "housing_vacant.2010"        "housing_occupied.2010"
## [51] "housing_median_age.2010"    "percent_poverty.2010"
## [53] "pop_change.2000"            "pop_plus10.2000"
## [55] "pop.2000"                   "pop_density.2000"
## [57] "square_miles.2000"          "pop_white_nonhispanic.2000"
## [59] "pop_black.2000"             "pop_asian.2000"
## [61] "pop_hispanic.2000"          "households.2000"
## [63] "households_family.2000"     "households_nonfamily.2000"
## [65] "avg_hh_size.2000"           "pop_25_plus.2000"
## [67] "edu_less_highschool.2000"   "edu_highschool.2000"
## [69] "edu_collegeplus.2000"       "pop_civ_employed.2000"
## [71] "pop_civ_unemployed.2000"    "median_hh_income.2000"
## [73] "per_capita_income.2000"     "housing_vacant.2000"
## [75] "housing_occupied.2000"      "housing_median_age.2000"
## [77] "percent_poverty.2000"


Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.

summary(dat[c(3:26)])

##   sfr.built.sf      mfr.built.sf     retail.built.sf  comm.built.sf
##  Min.   :      0   Min.   :      0   Min.   :     0   Min.   :       0
##  1st Qu.: 848570   1st Qu.: 144324   1st Qu.: 22700   1st Qu.:   52053
##  Median :1429116   Median : 270901   Median : 71180   Median :  120155
##  Mean   :1432991   Mean   : 395834   Mean   : 93474   Mean   :  357440
##  3rd Qu.:1970797   3rd Qu.: 461973   3rd Qu.:133074   3rd Qu.:  239220
##  Max.   :4357472   Max.   :3099629   Max.   :780672   Max.   :19805858
##  indust.built.sf   vacant.built.sf    exempt.built.sf
##  Min.   :      0   Min.   :   0.000   Min.   :    3771
##  1st Qu.:      0   1st Qu.:   0.000   1st Qu.:  178846
##  Median :  32572   Median :   0.000   Median :  319638
##  Mean   : 272199   Mean   :   8.906   Mean   :  712816
##  3rd Qu.: 198236   3rd Qu.:   0.000   3rd Qu.:  683018
##  Max.   :6372334   Max.   :1520.000   Max.   :28206707
##    sfr.lot.sf         mfr.lot.sf      retail.lot.sf     comm.lot.sf
##  Min.   :       0   Min.   :      0   Min.   :     0   Min.   :       0
##  1st Qu.:  955739   1st Qu.: 115597   1st Qu.: 23083   1st Qu.:   79158
##  Median : 1845223   Median : 251603   Median : 54802   Median :  168255
##  Mean   : 2476646   Mean   : 400486   Mean   : 66522   Mean   :  497081
##  3rd Qu.: 3139156   3rd Qu.: 516614   3rd Qu.: 89680   3rd Qu.:  455709
##  Max.   :33126247   Max.   :3026616   Max.   :633754   Max.   :17741673
##  indust.lot.sf      vacant.lot.sf      exempt.lot.sf
##  Min.   :       0   Min.   :       0   Min.   :   29248
##  1st Qu.:       0   1st Qu.:   68008   1st Qu.:  348156
##  Median :   31122   Median :  167437   Median :  746535
##  Mean   :  548498   Mean   :  448650   Mean   : 1803958
##  3rd Qu.:  212020   3rd Qu.:  415549   3rd Qu.: 1628277
##  Max.   :26713019   Max.   :13823323   Max.   :40611433
##   sfr.mark.val        mfr.mark.val       retail.mark.val
##  Min.   :        0   Min.   :        0   Min.   :       0
##  1st Qu.: 17434950   1st Qu.:  2777300   1st Qu.:  522825
##  Median : 40370150   Median :  6816500   Median : 1359150
##  Mean   : 48218979   Mean   : 13362086   Mean   : 2320078
##  3rd Qu.: 70492725   3rd Qu.: 14893450   3rd Qu.: 2729175
##  Max.   :249615900   Max.   :248868900   Max.   :32522100
##  comm.mark.val       indust.mark.val     vacant.mark.val
##  Min.   :0.000e+00   Min.   :        0   Min.   :       0
##  1st Qu.:1.369e+06   1st Qu.:        0   1st Qu.:  214075
##  Median :3.658e+06   Median :   393350   Median :  481950
##  Mean   :1.887e+07   Mean   :  4313410   Mean   : 1652835
##  3rd Qu.:9.132e+06   3rd Qu.:  2400625   3rd Qu.: 1139750
##  Max.   :1.671e+09   Max.   :154563400   Max.   :36376000
##  exempt.mark.val     total.built.sf      total.lot.sf
##  Min.   :5.657e+05   Min.   :   17886   Min.   :  605741
##  1st Qu.:6.623e+06   1st Qu.: 2190500   1st Qu.: 2539507
##  Median :1.379e+07   Median : 2830458   Median : 4162716
##  Mean   :4.344e+07   Mean   : 3264763   Mean   : 6241842
##  3rd Qu.:3.132e+07   3rd Qu.: 3534784   3rd Qu.: 6811800
##  Max.   :1.502e+09   Max.   :28206707   Max.   :91325669
##  total.mark.val
##  Min.   :4.546e+06
##  1st Qu.:5.499e+07
##  Median :9.375e+07
##  Mean   :1.322e+08
##  3rd Qu.:1.392e+08
##  Max.   :2.286e+09


There are also four vaiables related to geographic location. The fips codes could be matched to a census tract shape file, the others measure the number of feet between a census tract centroid and city hall, the closest train station, and the closest bus stop.

summary(dat[c(1,27:29)])

##       fips          ft.2.city.hall  ft.2.train.st      ft.2.bus.st
##  Min.   :4.21e+10   Min.   : 2881   Min.   :  258.3   Min.   :   6.36
##  1st Qu.:4.21e+10   1st Qu.:16596   1st Qu.: 4087.1   1st Qu.: 265.63
##  Median :4.21e+10   Median :28692   Median : 6777.2   Median : 628.73
##  Mean   :4.21e+10   Mean   :31776   Mean   : 7443.2   Mean   : 781.29
##  3rd Qu.:4.21e+10   3rd Qu.:43484   3rd Qu.: 9832.9   3rd Qu.:1083.03
##  Max.   :4.21e+10   Max.   :87673   Max.   :26020.5   Max.   :7140.68


The remaining data are taken from the 2000 Census and 2010 5-year ACS. Note that not all of the 2000 data match and there are several null values in different categories.

Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.

sum (dat$square_miles.2010)  ## [1] 134.1014  sum(dat$total.lot.sf)/  27.8784e6 #divided by the number of square feet in a square miles

## [1] 85.97578


Now start creating and examining some key ratios, like total FAR. There is significant variation with some tracts having almost no built areas and others with a high net FAR.

dat$far <- dat$total.built.sf / dat$total.lot.sf summary(dat$far)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##  0.005557  0.419600  0.696300  0.875800  1.052000 13.330000


Try exluding vacant land from the FAR calculation.

summary (dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))  ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.005557 0.423300 0.748200 0.943900 1.165000 14.650000  And looking at how vacancy varies across census tracts hist(dat$vacant.lot.sf/ dat$total.lot.sf )  Now, create some net FAR calculations by land use type. summary(dat$comm.built.sf/ dat$comm.lot.sf)  ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.00376 0.33490 0.61070 1.09600 1.16400 24.78000 9  summary(dat$indust.built.sf/ dat$indust.lot.sf)  ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.00711 0.59420 1.04900 1.44700 1.66800 16.04000 104  summary(dat$retail.built.sf/ dat$retail.lot.sf)  ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.0355 0.9365 1.3690 1.3830 1.8030 4.3190 29  summary(dat$exempt.built.sf/ dat$exempt.lot.sf)  ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.001703 0.209100 0.522600 0.783200 1.033000 7.203000  It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understadning how much space new households are likely to consume. Note the substantial variation. dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010 dat$res.sf.per.hh[dat$res.sf.per.hh == Inf] <- NA ##get rid of zeroes since n/0 == Infinity summary( dat$res.sf.per.hh )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   104.1  1092.0  1220.0  1237.0  1367.0  4538.0       8


This looks about right with most households having between 1000 and 1400 square feet.

hist( dat$res.sf.per.hh )  And per capita dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010 dat$res.sf.per.person[dat$res.sf.per.person == Inf] <- NA hist( dat$res.sf.per.person )


Look at the relationship between the population and square footage of housing.

plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )  Perhaps surprisingly there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood. plot(dat$res.sf.per.person , dat$ft.2.city.hall)  There is a bit more of a pattern between location and the proportion of housing that is multifamily. dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf ) plot(dat$per.mf , dat$ft.2.city.hall)  Which land uses tend to go together? Use a correlation table to examine total square footage. cor(dat[c(3:7,9)])  ## sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf ## sfr.built.sf 1.0000000 -0.227687793 0.208504423 -0.21049419 ## mfr.built.sf -0.2276878 1.000000000 0.009489829 0.45513983 ## retail.built.sf 0.2085044 0.009489829 1.000000000 0.04310543 ## comm.built.sf -0.2104942 0.455139826 0.043105428 1.00000000 ## indust.built.sf -0.1466598 -0.135759385 -0.038551580 0.04383247 ## exempt.built.sf -0.2613878 0.103988183 -0.049405209 0.16346712 ## indust.built.sf exempt.built.sf ## sfr.built.sf -0.14665984 -0.26138775 ## mfr.built.sf -0.13575939 0.10398818 ## retail.built.sf -0.03855158 -0.04940521 ## comm.built.sf 0.04383247 0.16346712 ## indust.built.sf 1.00000000 0.06878225 ## exempt.built.sf 0.06878225 1.00000000  And land area cor(dat[c(10:16)])  ## sfr.lot.sf mfr.lot.sf retail.lot.sf comm.lot.sf ## sfr.lot.sf 1.000000000 0.174062104 -0.01660241 0.03719301 ## mfr.lot.sf 0.174062104 1.000000000 -0.06027767 0.12670109 ## retail.lot.sf -0.016602407 -0.060277674 1.00000000 -0.01112620 ## comm.lot.sf 0.037193013 0.126701092 -0.01112620 1.00000000 ## indust.lot.sf -0.071661971 -0.009073780 -0.10092578 0.39273649 ## vacant.lot.sf -0.013189068 0.009647197 -0.05241089 0.71002277 ## exempt.lot.sf 0.000523062 0.011982805 -0.16823137 0.37194754 ## indust.lot.sf vacant.lot.sf exempt.lot.sf ## sfr.lot.sf -0.07166197 -0.013189068 0.000523062 ## mfr.lot.sf -0.00907378 0.009647197 0.011982805 ## retail.lot.sf -0.10092578 -0.052410890 -0.168231370 ## comm.lot.sf 0.39273649 0.710022767 0.371947536 ## indust.lot.sf 1.00000000 0.643286320 0.489299242 ## vacant.lot.sf 0.64328632 1.000000000 0.531508105 ## exempt.lot.sf 0.48929924 0.531508105 1.000000000  Neither of these measures normalizes the data. Now compare how the proportion of area in a given land use compares with the proportion in other land uses. cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))  ## mfr.built.sf ## sfr.built.sf -0.2779901  cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))  ## retail.built.sf ## sfr.built.sf 0.1160276  cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))  ## comm.built.sf ## sfr.built.sf -0.4443231  cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))  ## indust.built.sf ## sfr.built.sf -0.3305134  cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))  ## vacant.built.sf ## sfr.built.sf 0.008721208  cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))  ## exempt.built.sf ## sfr.built.sf -0.6608287  Try using regressions to better understanding partial correlations between land uses. summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf , dat))  ## ## Call: ## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf + ## comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf, ## data = dat) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1548586 -500194 -17272 498795 2940319 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.550e+06 6.833e+04 22.686 < 2e-16 *** ## mfr.built.sf -3.486e-01 1.005e-01 -3.469 0.000582 *** ## retail.built.sf 1.605e+00 3.848e-01 4.172 3.76e-05 *** ## comm.built.sf -5.509e-02 3.127e-02 -1.762 0.078859 . ## indust.built.sf -1.539e-01 6.400e-02 -2.405 0.016662 * ## exempt.built.sf -8.591e-02 1.978e-02 -4.344 1.80e-05 *** ## vacant.lot.sf -1.419e-02 3.871e-02 -0.367 0.714091 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 728300 on 377 degrees of freedom ## Multiple R-squared: 0.1791, Adjusted R-squared: 0.1661 ## F-statistic: 13.71 on 6 and 377 DF, p-value: 4.216e-14  EXERCISE 1. Estimate a gross measure of total FAR using Census land area and plot this against net FAR. Describe any important variation. 2. Try developing a model to predict meighborhood FAR. What are the important and statistically significant predictors? 3. Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses? Posted in Planning Methods | Comments Off on Examining Philadelphia Land Use Data by Census Tract ## Working with the Philadelphia Region Household Travel Survey The purpose of this lab is to learn to work with the Philadelphia region 2012 household travel survey. You will learn how to upload data into R, summarize information from the data, combine datasheets, and examine subsets of the data. By the end of the lab, you should feel comfortable looking at the travel survey and be able to complete the relevant questions from assignment 1. Many large cities collect important individual-level data to understand and model regional travel behavior and predict the impacts of transportation investments and new commercial and housing developments. The United States also collects household travel data for the entire country. The data for surveys from 1983 to 2009 can be found here. To work with the Philadelphia household travel survey, download the data from the Delaware Valley Regional Planning Commission website. You can find links to the 2002 data here as well. The 2012 files come as an Access database or in excel files. For this exercise, you need to open the excel files and save the sheets as a CSV. In Excel, click on File/Save as and then scroll down to CSV (comma delimited). Save or move the files to the folder where you want to work. First, clear objects from the memory with rm() command, and set up the working directory to the folder containing your files. Note that if you are using a PC, you need to change all back slashes to forward slashes. For a brief tutorial, click here. Next, upload the data into your work session with the read.csv() command. This will likely work with no additional options, but to be safe, use the following: Strip.white parameter gets rid of blank spaces in the table, sep parameter sets the separator, blank.lines.skip skips the blank lines, na.strings specifies the symbol for missing observations. Any of the parameters can be changed. Depending on how your data are saved to a CSV, you may need to change the seperator. Note that you may have to change the names of the sub-folder and files, depending on how you saved the CSVs. If you are having trouble reading the data into R, there is a handy interface in RStudio. Click on Tools in the top right, then Import Dataset, then From a Text File. hh <- read.csv("DVRPC HTS Database Files/1_Household_Public.csv", sep=";", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA") per <- read.csv("DVRPC HTS Database Files/2_Person_Public.csv", sep=";", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA") veh <- read.csv("DVRPC HTS Database Files/3_Vehicle_Public.csv", sep=";", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA") trip <- read.csv("DVRPC HTS Database Files/4_Trip_Public.csv", sep=";", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA") Consult the documentation about the function parameters when you load the files. ?read.csv You should now have four objects in the top right window of R Studio. You can also check that these loaded properly using the ls() command. ls() ## [1] "hh" "per" "trip" "veh" Again, type ?command to understand the function. ?ls Finally, check your dimension to make sure that you have the right number of observations and variables. dim(hh) ## [1] 9235 38 dim(per) ## [1] 20216 72 dim(trip) ## [1] 81940 81 dim(veh) ## [1] 14903 13 Str() uncovers the structure of the data, as well as head(). Tail() takes the last 6 and is particularly useful for making sure that your data loaded properly. The summary() function provides descriptive statistics such as mean or range, depending on the data type (for example, factors and characters do not have mean values). The brackets used with the summary command (hh[,1:5]), indicate that you only want to look at the first five variables. If you move the comma after 1:5, then you will see a summary of the first five entries for all of the variables. str(hh)  ## 'data.frame': 9235 obs. of 38 variables: ##$ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $HH_ID : int 100140 100206 100291 100355 100374 100449 100503 100601 100663 100677 ... ##$ HH_WEIGHT     : num  133 204 288 222 365 ...
##  $HH_GEO_ACC : int 1 1 1 1 1 1 1 1 1 1 ... ##$ H_STATE       : int  34 34 34 34 34 34 34 34 34 34 ...
##  $H_COUNTY : int 34005 34005 34005 34005 34005 34005 34005 34005 34005 34005 ... ##$ H_CPA         : Factor w/ 84 levels "005_701","005_702",..: 2 2 1 1 1 1 1 1 4 4 ...
##  $H_MCD : num 3.4e+09 3.4e+09 3.4e+09 3.4e+09 3.4e+09 ... ##$ H_TRACT       : num  3.4e+10 3.4e+10 3.4e+10 3.4e+10 3.4e+10 ...
##  $H_TAZ : int 20225 20249 20048 20042 20002 20020 20026 20011 20637 20629 ... ##$ A_TYPE        : int  4 4 4 4 4 4 4 5 5 5 ...
##  $HH_TOT_TRIPS : int 8 11 9 5 24 9 9 18 4 5 ... ##$ HH_MO_TRIPS   : int  8 5 9 4 24 9 7 16 4 5 ...
##  $HH_NM_TRIPS : int 0 6 0 1 0 0 2 2 0 0 ... ##$ GPS_Factor    : num  1.12 1.17 1.12 1.17 1.17 1.17 1.12 1.12 1.13 1.17 ...
##  $F_HH_TOT_TRIPS: num 8.96 12.87 10.08 5.85 28.08 ... ##$ F_HH_MO_TRIPS : num  8.96 5.85 10.08 4.68 28.08 ...
##  $F_HH_NM_TRIPS : num 0 7.02 0 1.17 0 0 2.24 2.24 0 0 ... ##$ HH_SIZE       : int  2 3 5 1 4 2 2 4 1 2 ...
##  $HH_WORK : int 2 2 2 1 1 0 1 2 0 2 ... ##$ TOT_VEH       : int  2 2 2 1 2 2 2 2 1 1 ...
##  $OP_VEH : int 2 2 2 1 2 2 2 2 1 1 ... ##$ VEH_OUT       : int  20 9000 11 9000 3 20 50 15 1 30 ...
##  $TOT_BIKE : int 1 3 7 0 4 0 2 4 0 0 ... ##$ TOLL_ACCNT    : int  1 1 1 2 1 1 1 1 2 2 ...
##  $CAR_SHARE : int 2 2 2 2 2 2 2 2 2 2 ... ##$ RES_TYPE      : int  1 1 1 1 1 1 2 1 1 1 ...
##  $INCOME : int 6 5 7 4 5 5 6 7 99 4 ... ##$ TRAV_DATE     : Factor w/ 271 levels "1/1/2013","1/10/2013",..: 236 230 215 243 230 213 243 213 215 213 ...
##  $TRAV_DOW : int 5 4 2 3 4 1 3 1 2 1 ... ##$ HOLIDAY       : Factor w/ 21 levels "","Birthday of Martin Luther King, Jr.",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $HOL_TYPE : int 0 0 0 0 0 0 0 0 0 0 ... ##$ OV_SAMPLE     : int  1 0 0 0 1 0 1 0 0 0 ...
##  $STUDY : int 1 1 1 1 1 1 1 1 1 1 ... ##$ RECRUIT       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $RETRIEVE : int 3 3 3 2 3 3 2 3 3 2 ... ##$ R_LANG        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $SURV_LANG : int 1 1 1 1 1 1 1 1 1 1 ... head(hh) ## ID HH_ID HH_WEIGHT HH_GEO_ACC H_STATE H_COUNTY H_CPA H_MCD ## 1 1 100140 133.46032 1 34 34005 005_702 3400543740 ## 2 2 100206 204.10617 1 34 34005 005_702 3400547880 ## 3 3 100291 287.91055 1 34 34005 005_701 3400517440 ## 4 4 100355 222.44414 1 34 34005 005_701 3400517080 ## 5 5 100374 365.14646 1 34 34005 005_701 3400505740 ## 6 6 100449 74.05932 1 34 34005 005_701 3400508950 ## H_TRACT H_TAZ A_TYPE HH_TOT_TRIPS HH_MO_TRIPS HH_NM_TRIPS GPS_Factor ## 1 34005700405 20225 4 8 8 0 1.12 ## 2 34005700501 20249 4 11 5 6 1.17 ## 3 34005700603 20048 4 9 9 0 1.12 ## 4 34005700800 20042 4 5 4 1 1.17 ## 5 34005700900 20002 4 24 24 0 1.17 ## 6 34005701104 20020 4 9 9 0 1.17 ## F_HH_TOT_TRIPS F_HH_MO_TRIPS F_HH_NM_TRIPS HH_SIZE HH_WORK TOT_VEH ## 1 8.96 8.96 0.00 2 2 2 ## 2 12.87 5.85 7.02 3 2 2 ## 3 10.08 10.08 0.00 5 2 2 ## 4 5.85 4.68 1.17 1 1 1 ## 5 28.08 28.08 0.00 4 1 2 ## 6 10.53 10.53 0.00 2 0 2 ## OP_VEH VEH_OUT TOT_BIKE TOLL_ACCNT CAR_SHARE RES_TYPE INCOME TRAV_DATE ## 1 2 20 1 1 2 1 6 8/3/2012 ## 2 2 9000 3 1 2 1 5 8/2/2012 ## 3 2 11 7 1 2 1 7 7/31/2012 ## 4 1 9000 0 2 2 1 4 8/8/2012 ## 5 2 3 4 1 2 1 5 8/2/2012 ## 6 2 20 0 1 2 1 5 7/30/2012 ## TRAV_DOW HOLIDAY HOL_TYPE OV_SAMPLE STUDY RECRUIT RETRIEVE R_LANG ## 1 5 0 1 1 1 3 1 ## 2 4 0 0 1 1 3 1 ## 3 2 0 0 1 1 3 1 ## 4 3 0 0 1 1 2 1 ## 5 4 0 1 1 1 3 1 ## 6 1 0 0 1 1 3 1 ## SURV_LANG ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 summary(hh[,1:5]) ## ID HH_ID HH_WEIGHT HH_GEO_ACC ## Min. : 1 Min. : 100140 Min. : 40.32 Min. :1 ## 1st Qu.:2361 1st Qu.: 397484 1st Qu.: 109.56 1st Qu.:1 ## Median :4746 Median : 510518 Median : 165.11 Median :1 ## Mean :4781 Mean : 629041 Mean : 227.09 Mean :1 ## 3rd Qu.:7198 3rd Qu.: 703038 3rd Qu.: 279.67 3rd Qu.:1 ## Max. :9628 Max. :1736693 Max. :1086.64 Max. :1 ## H_STATE ## Min. :34.00 ## 1st Qu.:42.00 ## Median :42.00 ## Mean :40.04 ## 3rd Qu.:42.00 ## Max. :42.00 For this purpose functions names() and colnames() are almost identical – they provide the names of all columns (variables) in the data-set. Rownames provides the list of names of rows 100 to 1005. Be sure to check the data dictionary provided with the data to understand each variable. names(hh) ## [1] "ID" "HH_ID" "HH_WEIGHT" "HH_GEO_ACC" ## [5] "H_STATE" "H_COUNTY" "H_CPA" "H_MCD" ## [9] "H_TRACT" "H_TAZ" "A_TYPE" "HH_TOT_TRIPS" ## [13] "HH_MO_TRIPS" "HH_NM_TRIPS" "GPS_Factor" "F_HH_TOT_TRIPS" ## [17] "F_HH_MO_TRIPS" "F_HH_NM_TRIPS" "HH_SIZE" "HH_WORK" ## [21] "TOT_VEH" "OP_VEH" "VEH_OUT" "TOT_BIKE" ## [25] "TOLL_ACCNT" "CAR_SHARE" "RES_TYPE" "INCOME" ## [29] "TRAV_DATE" "TRAV_DOW" "HOLIDAY" "HOL_TYPE" ## [33] "OV_SAMPLE" "STUDY" "RECRUIT" "RETRIEVE" ## [37] "R_LANG" "SURV_LANG" colnames(hh) ## [1] "ID" "HH_ID" "HH_WEIGHT" "HH_GEO_ACC" ## [5] "H_STATE" "H_COUNTY" "H_CPA" "H_MCD" ## [9] "H_TRACT" "H_TAZ" "A_TYPE" "HH_TOT_TRIPS" ## [13] "HH_MO_TRIPS" "HH_NM_TRIPS" "GPS_Factor" "F_HH_TOT_TRIPS" ## [17] "F_HH_MO_TRIPS" "F_HH_NM_TRIPS" "HH_SIZE" "HH_WORK" ## [21] "TOT_VEH" "OP_VEH" "VEH_OUT" "TOT_BIKE" ## [25] "TOLL_ACCNT" "CAR_SHARE" "RES_TYPE" "INCOME" ## [29] "TRAV_DATE" "TRAV_DOW" "HOLIDAY" "HOL_TYPE" ## [33] "OV_SAMPLE" "STUDY" "RECRUIT" "RETRIEVE" ## [37] "R_LANG" "SURV_LANG" rownames(hh[100:105,]) ## [1] "100" "101" "102" "103" "104" "105" Functions View() and fix() provide a useful tabular view of the data. View() opens a separate tab in RStudio. It is important to remember to capitalize the first letter of this function, otherwise it will not run. Fix() (not capitalized) opens a new window and allows for editing, just like Excel. In order to reuse the edits the R object has to be saved as a new file. View(hh) fix(hh) Function table() creates useful summary table and multidimensional cross-tables. In this case, passing the TOT_VEH variable to the function creates a one-dimensional table of the number of households by the number of vehicles. Dividing it by the length of the variable produces the percentages. table(hh$TOT_VEH)
##
##    0    1    2    3    4    5    6    7    8   10   15
##  833 3109 3726 1120  318   93   22    8    4    1    1
table(hh$TOT_VEH)/length(hh$TOT_VEH)
##
##            0            1            2            3            4
## 0.0902003249 0.3366540336 0.4034650785 0.1212777477 0.0344342177
##            5            6            7            8           10
## 0.0100703844 0.0023822415 0.0008662696 0.0004331348 0.0001082837
##           15
## 0.0001082837

Using the round() function we rounded the values and multiplied them by 100 to make them more understandable. 40.4% of the households have two cars.

round(table(hh$TOT_VEH)/length(hh$TOT_VEH)*100, digits=3)
##
##      0      1      2      3      4      5      6      7      8     10
##  9.020 33.665 40.347 12.128  3.443  1.007  0.238  0.087  0.043  0.011
##     15
##  0.011

Passing two arguments creates a two-dimensional table of the number of households by the number of cars and the household income. Note that the brackets are removing data where the income is unknown (98) or not given (99). You will also need to consult the data dictionary to understand the income levels represented by the numbers 1 through 10.

table(hh$TOT_VEH[hh$INCOME < 97], hh$INCOME[hh$INCOME < 97])
##
##        1   2   3   4   5   6   7   8   9  10
##   0  179 240  93  90  95  33  30  17  15   0
##   1  114 504 405 510 587 335 236  84  82  27
##   2   19  90 140 333 646 685 812 343 210 121
##   3    3  14  25  57 147 187 295 152  78  64
##   4    2   4   6  15  47  53  70  47  23  19
##   5    1   1   2   3  10  10  23  16   7   6
##   6    0   0   0   0   3   5   5   5   1   3
##   7    0   0   0   0   1   4   1   0   1   0
##   8    0   0   0   0   2   1   0   0   0   1
##   10   0   0   0   0   0   0   0   1   0   0
##   15   0   0   0   0   1   0   0   0   0   0

Functions hist() and plot() are used for plotting and let us see how the data is distributed. Hist() creates a histogram, embedding the function density() in plot() will plot the density of the observed distribution. In this case the density plot is not particularly useful, but it frequently is quite helpful with larger, more continuous data.

hist(hh$TOT_VEH) plot(density(hh$TOT_VEH))

More on functions table() and hist()

?table
?hist

You see that many of the observations have a value of 98 or 99. This is the value used for those households that refused to report their income. Using the square brackets or the function subset() allows us to only select valid observations. In general, use brackets for a single variable, but subset for multiple variables (including the entire data frame).

table(hh$INCOME) ## ## 1 2 3 4 5 6 7 8 9 10 98 99 ## 318 853 671 1008 1539 1313 1472 665 417 241 13 725 hist(hh$INCOM[hh$INCOME < 98]) table(hh$TOT_VEH[hh$INCOME < 98]) ## ## 0 1 2 3 4 5 6 7 8 10 15 ## 792 2884 3399 1022 286 79 22 7 4 1 1 summary(subset(hh[,1:5], hh$INCOME < 98))
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1
##  1st Qu.:2366   1st Qu.: 397693   1st Qu.: 113.42   1st Qu.:1
##  Median :4783   Median : 512032   Median : 169.94   Median :1
##  Mean   :4815   Mean   : 636321   Mean   : 233.20   Mean   :1
##  3rd Qu.:7288   3rd Qu.: 708975   3rd Qu.: 291.01   3rd Qu.:1
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1
##     H_STATE
##  Min.   :34.00
##  1st Qu.:42.00
##  Median :42.00
##  Mean   :40.03
##  3rd Qu.:42.00
##  Max.   :42.00

More on function subset()

?subset

You can also recode the missing observations, so that they can be displayed nicely in the histogram.

hh$inc <- hh$INCOM #same variable
hh$inc[hh$inc==98]<- 11  #recoding missing from 98 to 11
hh$inc[hh$inc==99]<- 12 #recoding refused from 99 to 12
hist(hh$inc) #looks much better now There are also 3 other data-frames from DVRPC. Often you will want to combine information. In order to combine the data, you first need to understand the different data values. Notice that the person-level file has a column called HH_ID. This will be the variable to use to join the person data and the household data names(per) ## [1] "ID" "HH_ID" "PERSON_NUM" "PERSON_ID" ## [5] "HH_WEIGHT" "P_WEIGHT" "P_TOT_TRIPS" "P_MO_TRIPS" ## [9] "P_NM_TRIPS" "GPS_Factor" "F_P_TOT_TRIPS" "F_P_MO_TRIPS" ## [13] "F_P_NM_TRIPS" "GEND" "AGECAT" "RACE" ## [17] "EDUCA" "RELAT" "LIC" "EMPLY" ## [21] "WK_STAT" "SEMPLY" "JOBS" "WK_LOC" ## [25] "W_GEO_ACC" "W_STATE" "W_REGION" "W_COUNTY" ## [29] "W_CPA" "W_MCD" "W_TRACT" "W_TAZ" ## [33] "HOURS" "OCCUP" "WK_MODE" "WK_MODETMS" ## [37] "ARRV_WRK" "ARRV_TMS" "LV_WRK" "LV_TMS" ## [41] "TCOMM" "PARK_SUB" "PARK_SUB_AMT" "TRAN_SUB" ## [45] "TRAN_SUB_AMT" "WDAY_1" "WDAY_2" "WDAY_3" ## [49] "WDAY_4" "WDAY_5" "WDAY_6" "WDAY_7" ## [53] "STUDE" "SCHOL" "S_HOME" "S_ONLN" ## [57] "S_GEO_ACC" "S_STATE" "S_REGION" "S_COUNTY" ## [61] "S_CPA" "S_MCD" "S_TRACT" "S_TAZ" ## [65] "PRESCH_1" "PRESCH_2" "PRESCH_3" "T_TRIP" ## [69] "PAY_TYPE1" "PAY_TYPE2" "PAY_TYPE3" "TAP" Use the length() and unique() commands to check how well the data will combine. The function length() allows you to extract the length of any vector; unique() creates a vector of all unique observations within another vector or a variable. Note that this should be equal to the number of observations in the household data. By contrast, the length of the person id should be equal to the number of observations in the person data. length(unique(per$HH_ID))
## [1] 9235
length(unique(per$PERSON_ID)) ## [1] 20216 You should also notice that the person id is the same as the household id with the person number added to it. head(per[1:4]) ## ID HH_ID PERSON_NUM PERSON_ID ## 1 279 100140 1 10014001 ## 2 280 100140 2 10014002 ## 3 281 100206 1 10020601 ## 4 282 100206 2 10020602 ## 5 283 100206 3 10020603 ## 6 284 100291 1 10029101 Now that you have found the correct identification numbers, you can merge the files using the merge command. Add all of the household data to the person file and create a new data-frame. In this case, there is a perfect merger, but if there were some people with missing household data, you could keep all entries by setting all.x = True instead of False. per_hh <- merge(per, hh, by = "HH_ID", all.x = FALSE, all.y=FALSE, sort = FALSE) Always be careful to check your data before and after a merge to make sure that you are getting what you wants. In this case, we want the dimensions of the data-frame to have the same number of observations as the person data. The next data-set contains data about individual trips that people made during one day. Note that you made the head() command show 10 entries instead of the default of 6. The trips data-frame has an id to match both the household and person data. head(trip[,1:12], 10) Try making a subset of all trips made by members of one household using the subset() function. The output of the function is a data frame, which can be checked using class() function. Some functions return objects in matrix form, which makes the data processing a little bit different. test <- subset(trip, trip$HH_ID == 100374 )
class(test)
## [1] "data.frame"

Now look at two variables of this subset. The variable MODE contains information about the mode that was used in each trip. MODE_AGG combines some of the modes into larger categories. Again refer to the data dictionary.

test$MODE ## [1] 5 5 5 5 NA 5 5 5 5 5 5 5 5 5 5 NA 5 5 5 5 NA 5 5 ## [24] 5 5 5 5 NA test$MODE_AGG
##  [1]  3  3  3  3 NA  3  3  3  3  3  3  3  3  3  3 NA  3  3  3  3 NA  3  3
## [24]  3  3  3  3 NA

You can find out how many trips do not have any information about mode using is.na function. It produces a vector of the same length as trip$mode, and only contain TRUE or FALSE values. Summary() function counts how many different values are contained in the logical vector. The number of TRUEs is about 20,000, therefore there are 20,000 trips with no information about mode. summary(is.na(trip$MODE))
##    Mode   FALSE    TRUE    NA's
## logical   61739   20201       0

Then you can select only the observations that have information about mode.

head(subset(trip,is.na(trip$MODE)==FALSE))[,5:10] ## HH_WEIGHT P_WEIGHT TOUR_NUM WKSUB_NUM WKSUB_ID TRIP_NUM ## 1 133.4603 132.5431 1 NA 1.0014e+11 1 ## 2 133.4603 132.5431 1 NA 1.0014e+11 2 ## 3 133.4603 132.5431 1 NA 1.0014e+11 3 ## 4 133.4603 132.5431 1 NA 1.0014e+11 4 ## 5 133.4603 132.5431 1 NA 1.0014e+11 5 ## 7 133.4603 115.1988 1 NA 1.0014e+11 1 You may also wish to have a better idea of what is happening with the data with no reported mode. Notice that the trip number is equal to 0 (no trips) or 97 (final destination). table(trip$TRIP_NUM[is.na(trip$MODE)==TRUE]) ## ## 0 97 ## 3611 16590 Finally, look at the vehicle data-set. head(veh) ## ID HH_ID HH_WEIGHT TOT_VEH OP_VEH VEH_NUM VEH_ID VEH_OWN YEAR MAKE ## 1 79 100140 133.4603 2 2 1 10014001 1 2009 15 ## 2 80 100140 133.4603 2 2 2 10014002 1 2010 8 ## 3 81 100206 204.1062 2 2 1 10020601 1 1999 14 ## 4 82 100206 204.1062 2 2 2 10020602 1 2009 37 ## 5 83 100291 287.9105 2 2 1 10029101 2 2010 39 ## 6 84 100291 287.9105 2 2 2 10029102 2 2011 39 ## MODEL BODY VEH_TYPE ## 1 SONATA 1 2 ## 2 GRAND CARAVAN 8 2 ## 3 CRV 2 2 ## 4 FORRESTER 2 2 ## 5 PRIUS 1 1 ## 6 SEIANNA 8 2 Notice that there are more vehicles than households. And that there are fewer unique household ids in the vehicle data than there are households. This is because not all households own cars, but some own more than 1. dim(veh) ## [1] 14903 13 length(unique(veh$HH_ID))
## [1] 8307

Now that you have looked at all the data, try to answer these two questions: How many trips do people take on average? And how does this vary by income?

The first should be relatively straightforward, since the person data already contains a variable on each persons total trips.

table(per$P_TOT_TRIPS) ## ## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 3625 629 6025 2467 2880 1544 1251 698 478 292 182 76 33 21 6 ## 15 16 19 20 ## 4 2 1 2 mean(per$P_TOT_TRIPS)
## [1] 3.053225

This, however, is not entirely correct. The data come from a sample of households that vary systematically from households in Philadelphia. This is represented by weights. Look at the value of weights and then use the weighted.mean() function to apply them.

summary(per$P_WEIGHT) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 73.82 88.67 117.50 278.30 233.10 3665.00 weighted.mean(per$P_TOT_TRIPS, per$P_WEIGHT) ## [1] 2.902891 It is often interesting to look at who gets under-sampled in these types of surveys. First, plot the density of person weights. Notice how there is a long tail of people with very high weights. plot(density(per$P_WEIGHT))

plot(density(per$P_WEIGHT[per$P_WEIGHT<500]))

Now compare the racial differences across the high and low weights. Notice how the under-sampled population (those given a high weight) is less white than the over-sampled population. In general, these types of surveys tend to over-sample white households, households that own there own homes, large households, and households in single-family homes. Without applying the weights, the summaries are more likely to reflect the sample than the population of the Philadelphia region.

round(table(per$RACE[per$P_WEIGHT<=278.3])/length(per$RACE[per$P_WEIGHT<=278.3]),3)
##
##     1     2     3     4     5     6    97    98    99   100   988
## 0.872 0.052 0.008 0.001 0.020 0.000 0.002 0.001 0.011 0.008 0.024
round(table(per$RACE[per$P_WEIGHT>278.3])/length(per$RACE[per$P_WEIGHT>278.3]),3)
##
##     1     2     3     4     5    97    98    99   100   988
## 0.712 0.157 0.039 0.003 0.027 0.004 0.001 0.011 0.018 0.027

Note that each weight is equal to the number of people or households that the observation is supposed to represent. So that you can sum them for an estimate of the number of people, households, or vehicles in the region.

sum(per$P_WEIGHT) ## [1] 5627010 sum(hh$HH_WEIGHT)
## [1] 2097203
sum(hh$HH_WEIGHT*hh$TOT_VEH)
## [1] 3330849

Now look at how the number of trips vary by income. This is a little bit trickier. First, you have to have merge the person and household data (which we already did). Second, you need to make the estimate for each income group. There are a few ways to automate this (if you are interested look at the apply functions and the doBy package), but for now do it manually for the highest and lowest income groups.

weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==1], per_hh$P_WEIGHT[per_hh$INCOME==1]) 
## [1] 2.489796
weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==10], per_hh$P_WEIGHT[per_hh$INCOME==10])
## [1] 3.339281

Finally, you may want to create and save a new data-frame that only contains what you want. First create a new data-frame that contains the variables you want to keep.

dat <- per_hh[c("HH_ID", "PERSON_ID", "P_TOT_TRIPS",  "GEND" , "AGECAT", "RACE", "INCOME", "TOT_VEH")]

You can save it as a CSV or an R data file. A CSV is particularly useful if you have created a summary file that you want to open to make a nicely formatted table in Excel or send your file to someone who does not use R.

The first argument specifies the data frame that needs to be saved, argument specifies the file name. By default, R will write the files to your current working directory.

write.csv(dat, file="per_hh.csv")
save(dat, file="per_hh.Rda")

EXERCISE

1. Describe in detail the daily activities and trips of the first person in the trips data.
4. Give the weighted value too. (Hint: use the trip files with the person weights).
5. Compare the regional mode share for work and non-work trips. (For convenience, use the TOUR_TYPE variable instead of the activities.)
6. Compare the mode share for households with income under $35,000 with those above or equal to$35,000.
7. Plot the number of trips against income and describe any patterns that you see.
8. What percent of driving trips had to pay for parking?
9. What was the average hourly parking rate? (Hint: not all prices are presented in hours, so convert these)

Posted in Transportation Planning | Comments Off on Working with the Philadelphia Region Household Travel Survey

## Planning Methods: Predicting changes in jobs by Zip Code

For this lab, upload the Zip Code data from the previous lab. This should include the following files wih the following dimensions. You can also download the data here.

dim(zbp02)

## [1] 2207   24

dim(zbp12)

## [1] 2125   22


The purposes of this lab is to reinforce the regression and visualization skills learned in lab 2 and lab 3 using the Zip Business Patterns data.

Start by summarzing the 2002 data, which you will use to preidct job changes in 2012.

summary(zbp02)

##       ZIP         jobs_plus10         jobs.tot           jobs.23
##  Min.   :15001   Min.   :    2.5   Min.   :    2.50   Min.   :   0.0
##  1st Qu.:16010   1st Qu.:   71.0   1st Qu.:   72.25   1st Qu.:   2.5
##  Median :17235   Median :  411.0   Median :  387.50   Median :  22.0
##  Mean   :17318   Mean   : 2654.3   Mean   : 2597.33   Mean   : 138.6
##  3rd Qu.:18618   3rd Qu.: 2327.0   3rd Qu.: 2234.25   3rd Qu.: 126.5
##  Max.   :19980   Max.   :79883.5   Max.   :77848.50   Max.   :4476.5
##                  NA's   :82        NA's   :40         NA's   :40
##     jobs.31           jobs.42          jobs.44           jobs.48
##  Min.   :    0.0   Min.   :   0.0   Min.   :    0.0   Min.   :   0.00
##  1st Qu.:    0.0   1st Qu.:   0.0   1st Qu.:    2.5   1st Qu.:   0.00
##  Median :   34.5   Median :   7.5   Median :   31.0   Median :   7.00
##  Mean   :  357.2   Mean   : 128.8   Mean   :  377.0   Mean   :  73.68
##  3rd Qu.:  341.5   3rd Qu.:  74.5   3rd Qu.:  273.2   3rd Qu.:  47.25
##  Max.   :10076.0   Max.   :5025.5   Max.   :10491.0   Max.   :5435.00
##  NA's   :40        NA's   :40       NA's   :40        NA's   :40
##     jobs.51           jobs.52            jobs.53           jobs.54
##  Min.   :   0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.0
##  1st Qu.:   0.00   1st Qu.:    0.00   1st Qu.:   0.00   1st Qu.:    0.0
##  Median :   0.00   Median :    7.00   Median :   0.00   Median :    5.0
##  Mean   :  72.96   Mean   :  155.58   Mean   :  39.46   Mean   :  158.4
##  3rd Qu.:  14.50   3rd Qu.:   50.25   3rd Qu.:  17.50   3rd Qu.:   57.0
##  Max.   :4203.50   Max.   :15671.50   Max.   :1739.50   Max.   :22421.0
##  NA's   :40        NA's   :40         NA's   :40        NA's   :40
##     jobs.56          jobs.61           jobs.62          jobs.71
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.00
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.:   0.0   1st Qu.:   0.00
##  Median :   5.0   Median :   0.00   Median :  16.5   Median :   0.00
##  Mean   : 145.6   Mean   :  66.51   Mean   : 362.8   Mean   :  41.55
##  3rd Qu.:  56.0   3rd Qu.:   8.50   3rd Qu.: 208.2   3rd Qu.:  17.00
##  Max.   :9158.5   Max.   :3734.50   Max.   :8389.5   Max.   :2858.00
##  NA's   :40       NA's   :40        NA's   :40       NA's   :40
##     jobs.72          jobs.81          jobs.95           jobs.99
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :  0.000
##  1st Qu.:   0.0   1st Qu.:   2.5   1st Qu.:   0.00   1st Qu.:  0.000
##  Median :  19.5   Median :  19.5   Median :   0.00   Median :  0.000
##  Mean   : 213.5   Mean   : 139.4   Mean   :  26.79   Mean   :  1.355
##  3rd Qu.: 154.2   3rd Qu.: 114.8   3rd Qu.:   0.00   3rd Qu.:  0.000
##  Max.   :5445.5   Max.   :4624.5   Max.   :1961.00   Max.   :214.500
##  NA's   :40       NA's   :40       NA's   :40        NA's   :40
##     jobs.21            jobs.11          jobs.22           jobs.55
##  Min.   :   0.000   Min.   :  0.00   Min.   :   0.00   Min.   :   0.0
##  1st Qu.:   0.000   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:   0.0
##  Median :   0.000   Median :  0.00   Median :   0.00   Median :   0.0
##  Mean   :   8.461   Mean   :  2.05   Mean   :  18.45   Mean   :  69.4
##  3rd Qu.:   0.000   3rd Qu.:  0.00   3rd Qu.:   0.00   3rd Qu.:   2.5
##  Max.   :1538.500   Max.   :194.00   Max.   :1785.50   Max.   :6602.0
##  NA's   :40         NA's   :40       NA's   :40        NA's   :40


Like with the census tract data, not all of the zip coned match across the two years, resulting in missing data.

Plot the relationship between the number of jobs in 2002 and 2012.

plot(zbp02$jobs.tot, zbp02$jobs_plus10)


And look at the distribution of jobs across Zip Codes. Most Zip Codes have very few jobs, with a long tail of job-rich Zip Codes.

plot(density(zbp02$jobs_plus10, na.rm = T))  hist(zbp02$jobs_plus10)


Look at a specific Zip Code (19104).

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP == 19104], zbp02$jobs.tot[zbp02$ZIP==19104], col="red")


And all Philadelphia tracts, plus a few others. These tracts do not look particularly different from others in PA.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP > 19018 & zbp02$ZIP < 19256], zbp02$jobs.tot[zbp02$ZIP> 19018 & zbp02$ZIP < 19256], col="red")


Now, predict the 2012 jobs as a function of the number of jobs in 2002. The regression line fits the data almost perfectly.

reg1 <- lm(jobs_plus10 ~ jobs.tot, zbp02)
summary(reg1)

##
## Call:
## lm(formula = jobs_plus10 ~ jobs.tot, data = zbp02)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7887.2  -158.2   -82.6    26.7 10332.6
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.743743  26.051260   3.176  0.00151 **
## jobs.tot     0.967799   0.004028 240.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1081 on 2083 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9652, Adjusted R-squared:  0.9652
## F-statistic: 5.774e+04 on 1 and 2083 DF,  p-value: < 2.2e-16


The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
abline(reg1)


The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.

plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
abline(reg1)


It also looks like there might be some systematic differences in prediction quality by the number of jobs.

plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)


Try to see which jobs are more or less likely to predict job losses or job increases.

summary(
lm(jobs_plus10 ~jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56  + jobs.61 + jobs.62 + jobs.71  + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22   + jobs.55, zbp02)
)

##
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7705.6  -159.0   -51.4    39.6  9224.1
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265   25.06739   1.905 0.056923 .
## jobs.23      1.48264    0.11445  12.954  < 2e-16 ***
## jobs.31      0.86957    0.03842  22.633  < 2e-16 ***
## jobs.42      1.75718    0.11463  15.329  < 2e-16 ***
## jobs.44      0.90575    0.05998  15.100  < 2e-16 ***
## jobs.48      0.66433    0.10652   6.237 5.41e-10 ***
## jobs.51      0.68900    0.13370   5.153 2.80e-07 ***
## jobs.52      0.59305    0.06731   8.811  < 2e-16 ***
## jobs.53      2.82325    0.31606   8.933  < 2e-16 ***
## jobs.54      1.51710    0.07039  21.552  < 2e-16 ***
## jobs.56      0.17040    0.08420   2.024 0.043112 *
## jobs.61      1.47967    0.11901  12.433  < 2e-16 ***
## jobs.62      0.85000    0.04409  19.281  < 2e-16 ***
## jobs.71      0.75343    0.20036   3.760 0.000174 ***
## jobs.72      1.52875    0.12252  12.478  < 2e-16 ***
## jobs.81      0.44943    0.18831   2.387 0.017095 *
## jobs.95      0.74847    0.19179   3.903 9.82e-05 ***
## jobs.99      5.25699    4.88099   1.077 0.281591
## jobs.21      1.13877    0.36736   3.100 0.001962 **
## jobs.11      3.63400    2.12568   1.710 0.087495 .
## jobs.22      0.88365    0.22376   3.949 8.11e-05 ***
## jobs.55      0.11145    0.11360   0.981 0.326664
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16


Most of the paramter estimates are statistically different from zero with a high degree of confidence, but that is really not a very useful finding. You really want to know whether job types are statisticallyt different from 1 (i.e., are certain job types more or less likely to be decreasing over time?). You can roughly approximate this by looking at the coefficent estimate and the standard error. If the coefficient estimate plus or minus two standard errors crosses the number 1, than the estimate is not statistically different from one with 95% confidence.

You can aslo set up the regression to compare the coefficient estimate against the total number of jobs in 2002.

summary(
lm(jobs_plus10 ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56  + jobs.61 + jobs.62 + jobs.71  + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22   + jobs.55, zbp02, offset= 1.00*jobs.tot)
)

##
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
##     jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
##     jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
##     jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02,
##     offset = 1 * jobs.tot)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7705.6  -159.0   -51.4    39.6  9224.1
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265   25.06739   1.905 0.056923 .
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262
## jobs.48     -0.33567    0.10652  -3.151 0.001650 **
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 **
## jobs.95     -0.25153    0.19179  -1.311 0.189836
## jobs.99      4.25699    4.88099   0.872 0.383223
## jobs.21      0.13877    0.36736   0.378 0.705657
## jobs.11      2.63400    2.12568   1.239 0.215437
## jobs.22     -0.11635    0.22376  -0.520 0.603127
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9711
## F-statistic:  3339 on 21 and 2063 DF,  p-value: < 2.2e-16


Each job in Real Estate in a Zip Code in 2002 (sector 53) correlates with another 1.62 jobs in 2012. Better, yet use the types of jobs in 2002 to predict the net change in jobs from 2002 to 2012.

reg2 <- lm(jobs_plus10-jobs.tot ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56  + jobs.61 + jobs.62 + jobs.71  + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22   + jobs.55, zbp02)
summary(reg2)

##
## Call:
## lm(formula = jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 +
##     jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 +
##     jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 +
##     jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55,
##     data = zbp02)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7705.6  -159.0   -51.4    39.6  9224.1
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265   25.06739   1.905 0.056923 .
## jobs.23      0.48264    0.11445   4.217 2.58e-05 ***
## jobs.31     -0.13043    0.03842  -3.395 0.000699 ***
## jobs.42      0.75718    0.11463   6.605 5.03e-11 ***
## jobs.44     -0.09425    0.05998  -1.571 0.116262
## jobs.48     -0.33567    0.10652  -3.151 0.001650 **
## jobs.51     -0.31100    0.13370  -2.326 0.020107 *
## jobs.52     -0.40695    0.06731  -6.046 1.76e-09 ***
## jobs.53      1.82325    0.31606   5.769 9.20e-09 ***
## jobs.54      0.51710    0.07039   7.346 2.93e-13 ***
## jobs.56     -0.82960    0.08420  -9.853  < 2e-16 ***
## jobs.61      0.47967    0.11901   4.031 5.77e-05 ***
## jobs.62     -0.15000    0.04409  -3.402 0.000681 ***
## jobs.71     -0.24657    0.20036  -1.231 0.218608
## jobs.72      0.52875    0.12252   4.316 1.67e-05 ***
## jobs.81     -0.55057    0.18831  -2.924 0.003497 **
## jobs.95     -0.25153    0.19179  -1.311 0.189836
## jobs.99      4.25699    4.88099   0.872 0.383223
## jobs.21      0.13877    0.36736   0.378 0.705657
## jobs.11      2.63400    2.12568   1.239 0.215437
## jobs.22     -0.11635    0.22376  -0.520 0.603127
## jobs.55     -0.88855    0.11360  -7.822 8.22e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
##   (122 observations deleted due to missingness)
## Multiple R-squared:  0.2035, Adjusted R-squared:  0.1954
## F-statistic:  25.1 on 21 and 2063 DF,  p-value: < 2.2e-16


Note how all the parameter estimates are the same, but there is now a much lower and much more useful R-squared. Instead of showing that Zip Codes with a lot of jobs in 2002 tend to have a lot of jobs in 2012, the model now shows that the number and types of jobs in each Zip Code can be used to predict whether a Zip Code gained or lost jobs. Instead of explaining 97% of the variation in the relationship, the model now claims to explain 17%.

The distribution of this dependent variable also looks a lot more normal with most Zip Codes not having changed very much.

plot(density(zbp02$jobs_plus10- zbp02$jobs.tot, na.rm=T))


The residual plit also looks a bit more homoschedastic.

plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)


EXERCISE

1. Try to make the most parsimonious model that does a good job of predicting the change in jobs from 2002 to 2012. Use a A full model vs. reduced model Anova test to compare this model to the fully specific model (with all job types) and descibe the results.

2. Try predicting the percent change in jobs instead of the absolute change. Compare and contrast the two models. Which model do you prefer and why? Hint: use the identity function to generate the percent change variable inside of your regression: I(jobs_plus10/jobs.tot-1). Input industry codes as a percentage of total jobs: I(jobs.23/jobs.tot).

3. What types of jobs are the best indicators of job growth and job losses? Desribe the quantitative relationship as expressed by the two models in question 2. Does this contradict or concur with your expectations?

4. Look at the differences in jobs in sector 99 across the two time periods. What do you think is happening?

## Planning Methods: Predicting Census Tract change

In this lab, you will use Philadelphia's 2000 Census Tract data to make predictions about the 2010 and 2020 population, and to understand what types of Census Tracts have been growing or shrinking. Download the data here (password is PennDesign). You can load it by double clicking on the file or using the load command.

Start by summarizing the data:

summary(dat2000)

##     Geo_FIPS          pop_change         pop_plus10        pop
##  Min.   :4.21e+10   Min.   :-1310.00   Min.   :   0   Min.   :    0
##  1st Qu.:4.21e+10   1st Qu.: -394.50   1st Qu.:2721   1st Qu.: 2329
##  Median :4.21e+10   Median :  -85.00   Median :3774   Median : 4008
##  Mean   :4.21e+10   Mean   :  -43.79   Mean   :3919   Mean   : 3983
##  3rd Qu.:4.21e+10   3rd Qu.:  309.50   3rd Qu.:4987   3rd Qu.: 5630
##  Max.   :4.21e+10   Max.   : 2311.00   Max.   :9022   Max.   :10479
##                     NA's   :215        NA's   :106    NA's   :109
##   pop_density     square_miles     pop_white_nonhispanic   pop_black
##  Min.   :    0   Min.   :0.06491   Min.   :   0          Min.   :   0
##  1st Qu.: 8462   1st Qu.:0.18056   1st Qu.: 114          1st Qu.: 166
##  Median :15692   Median :0.25657   Median : 821          Median : 903
##  Mean   :16120   Mean   :0.35457   Mean   :1695          Mean   :1721
##  3rd Qu.:22543   3rd Qu.:0.37694   3rd Qu.:2901          3rd Qu.:2959
##  Max.   :54824   Max.   :3.26893   Max.   :8948          Max.   :8468
##  NA's   :109     NA's   :109       NA's   :109           NA's   :109
##    pop_asian       pop_hispanic      households   households_family
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0
##  1st Qu.:  13.0   1st Qu.:  33.0   1st Qu.: 852   1st Qu.: 428.0
##  Median :  46.0   Median :  89.0   Median :1502   Median : 899.0
##  Mean   : 177.6   Mean   : 336.7   Mean   :1549   Mean   : 924.8
##  3rd Qu.: 168.0   3rd Qu.: 230.0   3rd Qu.:2187   3rd Qu.:1323.0
##  Max.   :2625.0   Max.   :6802.0   Max.   :6115   Max.   :2738.0
##  NA's   :109      NA's   :109      NA's   :109    NA's   :109
##  households_nonfamily  avg_hh_size     pop_25_plus   edu_less_highschool
##  Min.   :   0         Min.   :0.000   Min.   :   0   Min.   :   0.0
##  1st Qu.: 303         1st Qu.:2.220   1st Qu.:1366   1st Qu.: 286.0
##  Median : 529         Median :2.520   Median :2422   Median : 661.0
##  Mean   : 624         Mean   :2.424   Mean   :2536   Mean   : 729.9
##  3rd Qu.: 796         3rd Qu.:2.760   3rd Qu.:3624   3rd Qu.:1023.0
##  Max.   :4806         Max.   :3.600   Max.   :7155   Max.   :2733.0
##  NA's   :109          NA's   :109     NA's   :109    NA's   :109
##  edu_highschool   edu_collegeplus  pop_civ_employed pop_civ_unemployed
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0     Min.   :   0.0
##  1st Qu.: 362.0   1st Qu.: 119.0   1st Qu.: 780     1st Qu.:  75.0
##  Median : 789.0   Median : 286.0   Median :1433     Median : 158.0
##  Mean   : 845.3   Mean   : 453.1   Mean   :1535     Mean   : 187.9
##  3rd Qu.:1260.0   3rd Qu.: 557.0   3rd Qu.:2225     3rd Qu.: 266.0
##  Max.   :2809.0   Max.   :5735.0   Max.   :5692     Max.   :1678.0
##  NA's   :109      NA's   :109      NA's   :109      NA's   :109
##  median_hh_income per_capita_income housing_vacant  housing_occupied
##  Min.   :     0   Min.   :     0    Min.   :  0.0   Min.   :   0
##  1st Qu.: 21500   1st Qu.: 10817    1st Qu.: 55.0   1st Qu.: 852
##  Median : 29839   Median : 15615    Median :149.0   Median :1502
##  Mean   : 32002   Mean   : 17604    Mean   :188.7   Mean   :1549
##  3rd Qu.: 39528   3rd Qu.: 20613    3rd Qu.:279.0   3rd Qu.:2187
##  Max.   :200001   Max.   :109633    Max.   :834.0   Max.   :6115
##  NA's   :109      NA's   :109       NA's   :109     NA's   :109
##  housing_median_age percent_poverty
##  Min.   :   0       Min.   :0.00000
##  1st Qu.:1939       1st Qu.:0.09552
##  Median :1942       Median :0.18480
##  Mean   :1885       Mean   :0.21014
##  3rd Qu.:1952       3rd Qu.:0.30020
##  Max.   :1983       Max.   :0.84337
##  NA's   :109        NA's   :120


Notice how there are a number of missing data entries. This is because Census Tract numbers and borders change over time; sometimes combining, sometimes splitting. We have data for 106 2000 Census Tracts that are missing in 2010, and 109 2010 Census Tracts that are missing in 2000. There are several ways to combine Census Tracts to make them consistent over time. This website provides tools to estimate Census data from 1970 to 2000 within the 2010 boundaries.

For the purposes of this exercise, we will exclude the non-matching tracts. Look at the relationship between tracts population in 2000 and in 2010.

plot(dat2000$pop, dat2000$pop_plus10)


The density and histogram functions are convenient ways to understand the distribution of the data (in this case the 2010 population in each Census Tract). The population is somewhat normally distributed and ranges from around zero to 10,000.

plot(density(dat2000$pop_plus10, na.rm = T))  hist(dat2000$pop_plus10)


Before proceeding, compare the Census Tracts that are consistent between 2000 and 2010 with those that are not. As you would expect, low population and high population Census Tracts were most likely to change boundaries.

hist(dat2000$pop[is.na(dat2000$pop_plus10)==T])


hist(dat2000$pop[is.na(dat2000$pop_plus10)==F])


How well can you predict the 2010 Census Tract population from the 2000 Census Tract population?

reg1 <- lm(pop_plus10~pop, dat2000)
summary(reg1)

##
## Call:
## lm(formula = pop_plus10 ~ pop, data = dat2000)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1271.96  -357.73   -45.54   355.51  2334.68
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -92.07364   88.72101  -1.038      0.3
## pop           1.01166    0.01986  50.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 552 on 273 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9045
## F-statistic:  2595 on 1 and 273 DF,  p-value: < 2.2e-16


The R-squared statistic indicates that the 2000 population explains about 90% of the variation in the 2000 population. The following shows how well the regression line fits the data.

plot(dat2000$pop, dat2000$pop_plus10)
abline(reg1)


Furthermore, each person in 2000 correlates with 1.01 person in 2010, though with a standard error of 0.02, this is hardly a ringing endorsement of population growth. Plotting the population change from 2000 to 2010 shows that almost the same number of Census Tracts gained and lost population.

plot(density(dat2000\$pop_change, na.rm = T))


Plotting the error term shows that the regression line fits the low population tracts a bit better than the high population tracts. (Note how closely this tracks with the population change number, since the prediction of the 2010 population relies entirely on the 2000 population.)

plot(density(resid(reg1)))