Working with the Philadelphia Region Household Travel Survey

The purpose of this lab is to learn to work with the 2012 Philadelphia region household travel survey. You will learn how to read data into R, summarize information from the data, combine the separate data files, and examine subsets of the data. By the end of the lab, you should feel comfortable working with the travel survey and be able to complete the relevant questions from Assignment 1.

Many large cities collect important individual-level data to understand and model regional travel behavior and predict the impacts of transportation investments and new commercial and housing developments. The United States also collects household travel data for the entire country. The data for surveys from 1983 to 2009 can be found here.

To work with the Philadelphia household travel survey, download the data from the Delaware Valley Regional Planning Commission (DVRPC) website. You can find links to the 2002 data here as well. The 2012 files come as an Access database or as Excel files. For this exercise, you need to open the Excel files and save each sheet as a CSV. In Excel, click File/Save As and then scroll down to CSV (comma delimited). Save or move the files to the folder where you want to work.

First, clear objects from memory with the rm() command and set the working directory to the folder containing your files. Note that if you are using a PC, you need to change all back slashes to forward slashes. For a brief tutorial, click here.
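
For example, a minimal setup might look like the following (the folder path is just a placeholder; substitute your own):

rm(list = ls())                     # remove all objects from the current workspace
setwd("C:/Users/yourname/HTS_lab")  # placeholder path; note the forward slashes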

Next, read the data into your work session with the read.csv() command. This will likely work with no additional options, but to be safe, use the following: the strip.white parameter removes blank spaces in the table, sep sets the separator, blank.lines.skip skips blank lines, and na.strings specifies the symbol for missing observations. Any of the parameters can be changed. Depending on how your data are saved to CSV, you may need to change the separator.

Note that you may have to change the names of the sub-folder and files, depending on how you saved the CSVs.

If you are having trouble reading the data into R, there is a handy interface in RStudio. Click on Tools in the top right, then Import Dataset, then From a Text File.

hh <- read.csv("DVRPC HTS Database Files/1_Household_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

per <- read.csv("DVRPC HTS Database Files/2_Person_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

veh <- read.csv("DVRPC HTS Database Files/3_Vehicle_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

trip <- read.csv("DVRPC HTS Database Files/4_Trip_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

Consult the documentation about the function parameters when you load the files.

?read.csv

You should now have four objects in the top-right pane of RStudio. You can also check that these loaded properly using the ls() command.

ls()
## [1] "hh"   "per"  "trip" "veh"

Again, type ?command to understand the function.

?ls

Finally, check the dimensions to make sure that you have the right number of observations and variables.

dim(hh)
## [1] 9235   38
dim(per)
## [1] 20216    72
dim(trip)
## [1] 81940    81
dim(veh)
## [1] 14903    13

The str() function reveals the structure of the data; head() shows the first six rows and tail() shows the last six, which is particularly useful for making sure that your data loaded properly. The summary() function provides descriptive statistics such as the mean or range, depending on the data type (for example, factors and characters do not have mean values). The brackets used with the summary command (hh[,1:5]) indicate that you only want to look at the first five variables. If you instead place the 1:5 before the comma (hh[1:5,]), you will see a summary of the first five rows for all of the variables.

str(hh) 
## 'data.frame':    9235 obs. of  38 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ HH_ID         : int  100140 100206 100291 100355 100374 100449 100503 100601 100663 100677 ...
##  $ HH_WEIGHT     : num  133 204 288 222 365 ...
##  $ HH_GEO_ACC    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ H_STATE       : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ H_COUNTY      : int  34005 34005 34005 34005 34005 34005 34005 34005 34005 34005 ...
##  $ H_CPA         : Factor w/ 84 levels "005_701","005_702",..: 2 2 1 1 1 1 1 1 4 4 ...
##  $ H_MCD         : num  3.4e+09 3.4e+09 3.4e+09 3.4e+09 3.4e+09 ...
##  $ H_TRACT       : num  3.4e+10 3.4e+10 3.4e+10 3.4e+10 3.4e+10 ...
##  $ H_TAZ         : int  20225 20249 20048 20042 20002 20020 20026 20011 20637 20629 ...
##  $ A_TYPE        : int  4 4 4 4 4 4 4 5 5 5 ...
##  $ HH_TOT_TRIPS  : int  8 11 9 5 24 9 9 18 4 5 ...
##  $ HH_MO_TRIPS   : int  8 5 9 4 24 9 7 16 4 5 ...
##  $ HH_NM_TRIPS   : int  0 6 0 1 0 0 2 2 0 0 ...
##  $ GPS_Factor    : num  1.12 1.17 1.12 1.17 1.17 1.17 1.12 1.12 1.13 1.17 ...
##  $ F_HH_TOT_TRIPS: num  8.96 12.87 10.08 5.85 28.08 ...
##  $ F_HH_MO_TRIPS : num  8.96 5.85 10.08 4.68 28.08 ...
##  $ F_HH_NM_TRIPS : num  0 7.02 0 1.17 0 0 2.24 2.24 0 0 ...
##  $ HH_SIZE       : int  2 3 5 1 4 2 2 4 1 2 ...
##  $ HH_WORK       : int  2 2 2 1 1 0 1 2 0 2 ...
##  $ TOT_VEH       : int  2 2 2 1 2 2 2 2 1 1 ...
##  $ OP_VEH        : int  2 2 2 1 2 2 2 2 1 1 ...
##  $ VEH_OUT       : int  20 9000 11 9000 3 20 50 15 1 30 ...
##  $ TOT_BIKE      : int  1 3 7 0 4 0 2 4 0 0 ...
##  $ TOLL_ACCNT    : int  1 1 1 2 1 1 1 1 2 2 ...
##  $ CAR_SHARE     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ RES_TYPE      : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ INCOME        : int  6 5 7 4 5 5 6 7 99 4 ...
##  $ TRAV_DATE     : Factor w/ 271 levels "1/1/2013","1/10/2013",..: 236 230 215 243 230 213 243 213 215 213 ...
##  $ TRAV_DOW      : int  5 4 2 3 4 1 3 1 2 1 ...
##  $ HOLIDAY       : Factor w/ 21 levels "","Birthday of Martin Luther King, Jr.",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HOL_TYPE      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OV_SAMPLE     : int  1 0 0 0 1 0 1 0 0 0 ...
##  $ STUDY         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RECRUIT       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RETRIEVE      : int  3 3 3 2 3 3 2 3 3 2 ...
##  $ R_LANG        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SURV_LANG     : int  1 1 1 1 1 1 1 1 1 1 ...
head(hh)
##   ID  HH_ID HH_WEIGHT HH_GEO_ACC H_STATE H_COUNTY   H_CPA      H_MCD
## 1  1 100140 133.46032          1      34    34005 005_702 3400543740
## 2  2 100206 204.10617          1      34    34005 005_702 3400547880
## 3  3 100291 287.91055          1      34    34005 005_701 3400517440
## 4  4 100355 222.44414          1      34    34005 005_701 3400517080
## 5  5 100374 365.14646          1      34    34005 005_701 3400505740
## 6  6 100449  74.05932          1      34    34005 005_701 3400508950
##       H_TRACT H_TAZ A_TYPE HH_TOT_TRIPS HH_MO_TRIPS HH_NM_TRIPS GPS_Factor
## 1 34005700405 20225      4            8           8           0       1.12
## 2 34005700501 20249      4           11           5           6       1.17
## 3 34005700603 20048      4            9           9           0       1.12
## 4 34005700800 20042      4            5           4           1       1.17
## 5 34005700900 20002      4           24          24           0       1.17
## 6 34005701104 20020      4            9           9           0       1.17
##   F_HH_TOT_TRIPS F_HH_MO_TRIPS F_HH_NM_TRIPS HH_SIZE HH_WORK TOT_VEH
## 1           8.96          8.96          0.00       2       2       2
## 2          12.87          5.85          7.02       3       2       2
## 3          10.08         10.08          0.00       5       2       2
## 4           5.85          4.68          1.17       1       1       1
## 5          28.08         28.08          0.00       4       1       2
## 6          10.53         10.53          0.00       2       0       2
##   OP_VEH VEH_OUT TOT_BIKE TOLL_ACCNT CAR_SHARE RES_TYPE INCOME TRAV_DATE
## 1      2      20        1          1         2        1      6  8/3/2012
## 2      2    9000        3          1         2        1      5  8/2/2012
## 3      2      11        7          1         2        1      7 7/31/2012
## 4      1    9000        0          2         2        1      4  8/8/2012
## 5      2       3        4          1         2        1      5  8/2/2012
## 6      2      20        0          1         2        1      5 7/30/2012
##   TRAV_DOW HOLIDAY HOL_TYPE OV_SAMPLE STUDY RECRUIT RETRIEVE R_LANG
## 1        5                0         1     1       1        3      1
## 2        4                0         0     1       1        3      1
## 3        2                0         0     1       1        3      1
## 4        3                0         0     1       1        2      1
## 5        4                0         1     1       1        3      1
## 6        1                0         0     1       1        3      1
##   SURV_LANG
## 1         1
## 2         1
## 3         1
## 4         1
## 5         1
## 6         1
summary(hh[,1:5])
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1   
##  1st Qu.:2361   1st Qu.: 397484   1st Qu.: 109.56   1st Qu.:1   
##  Median :4746   Median : 510518   Median : 165.11   Median :1   
##  Mean   :4781   Mean   : 629041   Mean   : 227.09   Mean   :1   
##  3rd Qu.:7198   3rd Qu.: 703038   3rd Qu.: 279.67   3rd Qu.:1   
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1   
##     H_STATE     
##  Min.   :34.00  
##  1st Qu.:42.00  
##  Median :42.00  
##  Mean   :40.04  
##  3rd Qu.:42.00  
##  Max.   :42.00

For this purpose, the functions names() and colnames() are almost identical – they provide the names of all columns (variables) in the dataset. rownames() provides the row names; here it lists the names of rows 100 to 105. Be sure to check the data dictionary provided with the data to understand each variable.

names(hh)
##  [1] "ID"             "HH_ID"          "HH_WEIGHT"      "HH_GEO_ACC"    
##  [5] "H_STATE"        "H_COUNTY"       "H_CPA"          "H_MCD"         
##  [9] "H_TRACT"        "H_TAZ"          "A_TYPE"         "HH_TOT_TRIPS"  
## [13] "HH_MO_TRIPS"    "HH_NM_TRIPS"    "GPS_Factor"     "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS"  "F_HH_NM_TRIPS"  "HH_SIZE"        "HH_WORK"       
## [21] "TOT_VEH"        "OP_VEH"         "VEH_OUT"        "TOT_BIKE"      
## [25] "TOLL_ACCNT"     "CAR_SHARE"      "RES_TYPE"       "INCOME"        
## [29] "TRAV_DATE"      "TRAV_DOW"       "HOLIDAY"        "HOL_TYPE"      
## [33] "OV_SAMPLE"      "STUDY"          "RECRUIT"        "RETRIEVE"      
## [37] "R_LANG"         "SURV_LANG"
colnames(hh)
##  [1] "ID"             "HH_ID"          "HH_WEIGHT"      "HH_GEO_ACC"    
##  [5] "H_STATE"        "H_COUNTY"       "H_CPA"          "H_MCD"         
##  [9] "H_TRACT"        "H_TAZ"          "A_TYPE"         "HH_TOT_TRIPS"  
## [13] "HH_MO_TRIPS"    "HH_NM_TRIPS"    "GPS_Factor"     "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS"  "F_HH_NM_TRIPS"  "HH_SIZE"        "HH_WORK"       
## [21] "TOT_VEH"        "OP_VEH"         "VEH_OUT"        "TOT_BIKE"      
## [25] "TOLL_ACCNT"     "CAR_SHARE"      "RES_TYPE"       "INCOME"        
## [29] "TRAV_DATE"      "TRAV_DOW"       "HOLIDAY"        "HOL_TYPE"      
## [33] "OV_SAMPLE"      "STUDY"          "RECRUIT"        "RETRIEVE"      
## [37] "R_LANG"         "SURV_LANG"
rownames(hh[100:105,])
## [1] "100" "101" "102" "103" "104" "105"

The functions View() and fix() provide a useful tabular view of the data. View() opens a separate tab in RStudio; it is important to capitalize the first letter of this function, otherwise it will not run. fix() (all lowercase) opens a new window and allows editing, much like Excel. In order to reuse the edits, the R object has to be saved as a new file.

View(hh)
fix(hh)

The table() function creates useful summary tables and multidimensional cross-tabulations. In this case, passing the TOT_VEH variable to the function creates a one-dimensional table of the number of households by the number of vehicles. Dividing it by the length of the variable produces the proportions.

table(hh$TOT_VEH)
## 
##    0    1    2    3    4    5    6    7    8   10   15 
##  833 3109 3726 1120  318   93   22    8    4    1    1
table(hh$TOT_VEH)/length(hh$TOT_VEH)
## 
##            0            1            2            3            4 
## 0.0902003249 0.3366540336 0.4034650785 0.1212777477 0.0344342177 
##            5            6            7            8           10 
## 0.0100703844 0.0023822415 0.0008662696 0.0004331348 0.0001082837 
##           15 
## 0.0001082837

Using the round() function, we multiplied the proportions by 100 and rounded them to make them easier to read. About 40.3% of the households have two cars.

round(table(hh$TOT_VEH)/length(hh$TOT_VEH)*100, digits=3)
## 
##      0      1      2      3      4      5      6      7      8     10 
##  9.020 33.665 40.347 12.128  3.443  1.007  0.238  0.087  0.043  0.011 
##     15 
##  0.011

Passing two arguments creates a two-dimensional table of the number of households by the number of cars and the household income. Note that the brackets are removing data where the income is unknown (98) or not given (99). You will also need to consult the data dictionary to understand the income levels represented by the numbers 1 through 10.

table(hh$TOT_VEH[hh$INCOME < 97], hh$INCOME[hh$INCOME < 97])
##     
##        1   2   3   4   5   6   7   8   9  10
##   0  179 240  93  90  95  33  30  17  15   0
##   1  114 504 405 510 587 335 236  84  82  27
##   2   19  90 140 333 646 685 812 343 210 121
##   3    3  14  25  57 147 187 295 152  78  64
##   4    2   4   6  15  47  53  70  47  23  19
##   5    1   1   2   3  10  10  23  16   7   6
##   6    0   0   0   0   3   5   5   5   1   3
##   7    0   0   0   0   1   4   1   0   1   0
##   8    0   0   0   0   2   1   0   0   0   1
##   10   0   0   0   0   0   0   0   1   0   0
##   15   0   0   0   0   1   0   0   0   0   0

The hist() and plot() functions are used for plotting and let us see how the data are distributed. hist() creates a histogram; embedding the density() function inside plot() plots the density of the observed distribution. In this case the density plot is not particularly useful, but it is frequently quite helpful with larger, more continuous data.

hist(hh$TOT_VEH)

plot(density(hh$TOT_VEH))

More on functions table() and hist()

?table
?hist

You can see that many of the observations have a value of 98 or 99. These are the values used for households that did not know or refused to report their income. Using square brackets or the subset() function allows us to select only valid observations. In general, use brackets for a single variable, but subset() for multiple variables (including the entire data frame).

table(hh$INCOME)
## 
##    1    2    3    4    5    6    7    8    9   10   98   99 
##  318  853  671 1008 1539 1313 1472  665  417  241   13  725
hist(hh$INCOME[hh$INCOME < 98])

table(hh$TOT_VEH[hh$INCOME < 98])
## 
##    0    1    2    3    4    5    6    7    8   10   15 
##  792 2884 3399 1022  286   79   22    7    4    1    1
summary(subset(hh[,1:5], hh$INCOME < 98))
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1   
##  1st Qu.:2366   1st Qu.: 397693   1st Qu.: 113.42   1st Qu.:1   
##  Median :4783   Median : 512032   Median : 169.94   Median :1   
##  Mean   :4815   Mean   : 636321   Mean   : 233.20   Mean   :1   
##  3rd Qu.:7288   3rd Qu.: 708975   3rd Qu.: 291.01   3rd Qu.:1   
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1   
##     H_STATE     
##  Min.   :34.00  
##  1st Qu.:42.00  
##  Median :42.00  
##  Mean   :40.03  
##  3rd Qu.:42.00  
##  Max.   :42.00

More on function subset()

?subset

You can also recode the missing observations, so that they can be displayed nicely in the histogram.

hh$inc <- hh$INCOME #copy the income variable
hh$inc[hh$inc==98]<- 11  #recoding "don't know" from 98 to 11
hh$inc[hh$inc==99]<- 12 #recoding "refused" from 99 to 12
hist(hh$inc) #looks much better now

There are also three other data frames from DVRPC, and you will often want to combine information across them. In order to combine the data, you first need to understand the identifier variables. Notice that the person-level file has a column called HH_ID. This is the variable to use to join the person data to the household data.

names(per)
##  [1] "ID"            "HH_ID"         "PERSON_NUM"    "PERSON_ID"    
##  [5] "HH_WEIGHT"     "P_WEIGHT"      "P_TOT_TRIPS"   "P_MO_TRIPS"   
##  [9] "P_NM_TRIPS"    "GPS_Factor"    "F_P_TOT_TRIPS" "F_P_MO_TRIPS" 
## [13] "F_P_NM_TRIPS"  "GEND"          "AGECAT"        "RACE"         
## [17] "EDUCA"         "RELAT"         "LIC"           "EMPLY"        
## [21] "WK_STAT"       "SEMPLY"        "JOBS"          "WK_LOC"       
## [25] "W_GEO_ACC"     "W_STATE"       "W_REGION"      "W_COUNTY"     
## [29] "W_CPA"         "W_MCD"         "W_TRACT"       "W_TAZ"        
## [33] "HOURS"         "OCCUP"         "WK_MODE"       "WK_MODETMS"   
## [37] "ARRV_WRK"      "ARRV_TMS"      "LV_WRK"        "LV_TMS"       
## [41] "TCOMM"         "PARK_SUB"      "PARK_SUB_AMT"  "TRAN_SUB"     
## [45] "TRAN_SUB_AMT"  "WDAY_1"        "WDAY_2"        "WDAY_3"       
## [49] "WDAY_4"        "WDAY_5"        "WDAY_6"        "WDAY_7"       
## [53] "STUDE"         "SCHOL"         "S_HOME"        "S_ONLN"       
## [57] "S_GEO_ACC"     "S_STATE"       "S_REGION"      "S_COUNTY"     
## [61] "S_CPA"         "S_MCD"         "S_TRACT"       "S_TAZ"        
## [65] "PRESCH_1"      "PRESCH_2"      "PRESCH_3"      "T_TRIP"       
## [69] "PAY_TYPE1"     "PAY_TYPE2"     "PAY_TYPE3"     "TAP"

Use the length() and unique() commands to check how well the data will combine. The length() function returns the length of any vector; unique() creates a vector of all unique values within another vector or variable. Note that the number of unique household IDs in the person data should equal the number of observations in the household data. By contrast, the number of unique person IDs should equal the number of observations in the person data.

length(unique(per$HH_ID))
## [1] 9235
length(unique(per$PERSON_ID))
## [1] 20216

You should also notice that the person id is the same as the household id with the person number added to it.

head(per[1:4])
##    ID  HH_ID PERSON_NUM PERSON_ID
## 1 279 100140          1  10014001
## 2 280 100140          2  10014002
## 3 281 100206          1  10020601
## 4 282 100206          2  10020602
## 5 283 100206          3  10020603
## 6 284 100291          1  10029101

Now that you have found the correct identification numbers, you can merge the files using the merge() command. Add all of the household data to the person file and create a new data frame. In this case the merge is one-to-one with no unmatched records, but if there were some people with missing household data, you could keep all person entries by setting all.x = TRUE instead of FALSE.

per_hh <- merge(per, hh, by = "HH_ID", all.x = FALSE, all.y=FALSE, sort = FALSE)

Always check your data before and after a merge to make sure that you are getting what you want. In this case, we want the merged data frame to have the same number of observations as the person data.
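
For example, a quick check of the merge above might look like this:

dim(per)     # number of rows in the person file
dim(per_hh)  # the merged file should have the same number of rows, with columns from both files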

The next dataset contains data about the individual trips that people made during one day. Note that the head() command below shows 10 entries instead of the default of 6. The trips data frame has IDs to match both the household and person data.

head(trip[,1:12], 10)

Try making a subset of all trips made by members of one household using the subset() function. The output of the function is a data frame, which can be checked using class() function. Some functions return objects in matrix form, which makes the data processing a little bit different.

test <- subset(trip, trip$HH_ID == 100374 )
class(test)
## [1] "data.frame"

Now look at two variables of this subset. The variable MODE contains information about the mode that was used in each trip. MODE_AGG combines some of the modes into larger categories. Again refer to the data dictionary.

test$MODE
##  [1]  5  5  5  5 NA  5  5  5  5  5  5  5  5  5  5 NA  5  5  5  5 NA  5  5
## [24]  5  5  5  5 NA
test$MODE_AGG
##  [1]  3  3  3  3 NA  3  3  3  3  3  3  3  3  3  3 NA  3  3  3  3 NA  3  3
## [24]  3  3  3  3 NA

You can find out how many trips do not have any information about mode using the is.na() function. It produces a vector of the same length as trip$MODE containing only TRUE or FALSE values. The summary() function counts how many of each value the logical vector contains. The number of TRUEs is about 20,000; therefore, about 20,000 trips have no information about mode.

summary(is.na(trip$MODE))
##    Mode   FALSE    TRUE    NA's 
## logical   61739   20201       0

Then you can select only the observations that have information about mode.

head(subset(trip,is.na(trip$MODE)==FALSE))[,5:10]
##   HH_WEIGHT P_WEIGHT TOUR_NUM WKSUB_NUM   WKSUB_ID TRIP_NUM
## 1  133.4603 132.5431        1        NA 1.0014e+11        1
## 2  133.4603 132.5431        1        NA 1.0014e+11        2
## 3  133.4603 132.5431        1        NA 1.0014e+11        3
## 4  133.4603 132.5431        1        NA 1.0014e+11        4
## 5  133.4603 132.5431        1        NA 1.0014e+11        5
## 7  133.4603 115.1988        1        NA 1.0014e+11        1

You may also wish to get a better sense of what is happening with the trips that have no reported mode. Notice that the trip number is equal to 0 (no trips) or 97 (final destination).

table(trip$TRIP_NUM[is.na(trip$MODE)==TRUE])
## 
##     0    97 
##  3611 16590

Finally, look at the vehicle data-set.

head(veh)
##   ID  HH_ID HH_WEIGHT TOT_VEH OP_VEH VEH_NUM   VEH_ID VEH_OWN YEAR MAKE
## 1 79 100140  133.4603       2      2       1 10014001       1 2009   15
## 2 80 100140  133.4603       2      2       2 10014002       1 2010    8
## 3 81 100206  204.1062       2      2       1 10020601       1 1999   14
## 4 82 100206  204.1062       2      2       2 10020602       1 2009   37
## 5 83 100291  287.9105       2      2       1 10029101       2 2010   39
## 6 84 100291  287.9105       2      2       2 10029102       2 2011   39
##           MODEL BODY VEH_TYPE
## 1        SONATA    1        2
## 2 GRAND CARAVAN    8        2
## 3           CRV    2        2
## 4     FORRESTER    2        2
## 5         PRIUS    1        1
## 6       SEIANNA    8        2

Notice that there are more vehicles than households, and that there are fewer unique household IDs in the vehicle data than there are households. This is because not all households own cars, while some own more than one.

dim(veh)
## [1] 14903    13
length(unique(veh$HH_ID))
## [1] 8307

Now that you have looked at all the data, try to answer these two questions: How many trips do people take on average? And how does this vary by income?

The first should be relatively straightforward, since the person data already contain a variable with each person's total trips.

table(per$P_TOT_TRIPS)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 3625  629 6025 2467 2880 1544 1251  698  478  292  182   76   33   21    6 
##   15   16   19   20 
##    4    2    1    2
mean(per$P_TOT_TRIPS)
## [1] 3.053225

This, however, is not entirely correct. The data come from a sample of households that varies systematically from the population of households in the Philadelphia region. This is accounted for by weights. Look at the values of the weights and then use the weighted.mean() function to apply them.

summary(per$P_WEIGHT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73.82   88.67  117.50  278.30  233.10 3665.00
weighted.mean(per$P_TOT_TRIPS, per$P_WEIGHT)
## [1] 2.902891

It is often interesting to look at who gets under-sampled in these types of surveys. First, plot the density of person weights. Notice how there is a long tail of people with very high weights.

plot(density(per$P_WEIGHT))

plot(density(per$P_WEIGHT[per$P_WEIGHT<500]))

Now compare the racial differences across the high and low weights. Notice how the under-sampled population (those given a high weight) is less white than the over-sampled population. In general, these types of surveys tend to over-sample white households, households that own their own homes, large households, and households in single-family homes. Without applying the weights, the summaries are more likely to reflect the sample than the population of the Philadelphia region.

round(table(per$RACE[per$P_WEIGHT<=278.3])/length(per$RACE[per$P_WEIGHT<=278.3]),3)
## 
##     1     2     3     4     5     6    97    98    99   100   988 
## 0.872 0.052 0.008 0.001 0.020 0.000 0.002 0.001 0.011 0.008 0.024
round(table(per$RACE[per$P_WEIGHT>278.3])/length(per$RACE[per$P_WEIGHT>278.3]),3)
## 
##     1     2     3     4     5    97    98    99   100   988 
## 0.712 0.157 0.039 0.003 0.027 0.004 0.001 0.011 0.018 0.027
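
As a rough check, you can also compute the weighted race shares, which should look closer to the regional population than either unweighted split. A sketch using tapply() (consult the data dictionary for the race codes):

round(tapply(per$P_WEIGHT, per$RACE, sum) / sum(per$P_WEIGHT), 3)  # weighted share of each race code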

Note that each weight is equal to the number of people or households that the observation is supposed to represent, so summing the weights gives an estimate of the number of people, households, or vehicles in the region.

sum(per$P_WEIGHT)
## [1] 5627010
sum(hh$HH_WEIGHT)
## [1] 2097203
sum(hh$HH_WEIGHT*hh$TOT_VEH)
## [1] 3330849

Now look at how the number of trips varies by income. This is a little bit trickier. First, you have to merge the person and household data (which we already did). Second, you need to make the estimate for each income group. There are a few ways to automate this (if you are interested, look at the apply functions and the doBy package; a short sketch follows the output below), but for now do it manually for the highest and lowest income groups.

weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==1], per_hh$P_WEIGHT[per_hh$INCOME==1]) 
## [1] 2.489796
weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==10], per_hh$P_WEIGHT[per_hh$INCOME==10])
## [1] 3.339281
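
If you would rather automate the calculation for every income group, one possible approach is a short sapply() sketch like the one below (the doBy package mentioned above offers similar functionality):

inc_levels <- sort(unique(per_hh$INCOME[per_hh$INCOME < 98]))  # valid income codes only
sapply(inc_levels, function(i)
  weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME == i],
                per_hh$P_WEIGHT[per_hh$INCOME == i]))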

Finally, you may want to create and save a new data-frame that only contains what you want. First create a new data-frame that contains the variables you want to keep.

dat <- per_hh[c("HH_ID", "PERSON_ID", "P_TOT_TRIPS",  "GEND" , "AGECAT", "RACE", "INCOME", "TOT_VEH")]

You can save it as a CSV or an R data file. A CSV is particularly useful if you have created a summary file that you want to open to make a nicely formatted table in Excel or send your file to someone who does not use R.

The first argument specifies the data frame to be saved, and the file argument specifies the file name. By default, R will write the files to your current working directory.

write.csv(dat, file="per_hh.csv")
save(dat, file="per_hh.Rda")
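
To pick these files up again in a later session, a quick sketch: read.csv() re-reads the CSV, and load() restores the R data file.

dat2 <- read.csv("per_hh.csv")  # re-reads the CSV (write.csv added the row names as a first column)
load("per_hh.Rda")              # restores the saved object under its original name, dat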

EXERCISE

  1. Describe in detail the daily activities and trips of the first person in the trips data.
  2. What is the mode share for the Philadelphia region? (Hint: Start with MODE_AGG instead of MODE)
  3. And for Philadelphia itself?
  4. Give the weighted value too. (Hint: use the trip files with the person weights).
  5. Compare the regional mode share for work and non-work trips. (For convenience, use the TOUR_TYPE variable instead of the activities.)
  6. Compare the mode share for households with income under $35,000 with those above or equal to $35,000.
  7. Plot the number of trips against income and describe any patterns that you see.
  8. What percent of driving trips involved paying for parking?
  9. What was the average hourly parking rate? (Hint: not all prices are presented in hours, so convert these)


 

Posted in Transportation Planning

Transit user perceptions of driverless buses

PhD student Xiaoxia (Summer) Dong presents our paper with Matt DiScenna on Transit user perceptions of driverless buses at the 2017 PA Automated Vehicle Summit in State College, PA.

The paper reports the results of a stated preference survey of regular transit users’ willingness to ride and concerns about driverless buses in the Philadelphia region. As automated technologies advance, driverless buses may offer significant efficiency, safety, and operational improvements over traditional bus services. However, unfamiliarity with automated vehicle technology may challenge its acceptance among the general public and slow the adoption of new technologies. Using a mixed logit modeling framework, this research examines which types of transit users are most willing to ride in driverless buses and whether having a transit employee on board to monitor the vehicle operations and/or provide customer service matters. Of the 891 surveyed members of University of Pennsylvania’s transit pass benefit program, two-thirds express a willingness to ride in a driverless bus when a transit employee is on board to monitor vehicle operations and provide customer service. By contrast, only 13% would agree to ride a bus without an employee on board. Males and those in younger age groups (18–34) are more willing to ride in driverless buses than females and those in older age groups. Findings suggest that, so long as a transit employee is onboard, many transit passengers will willingly board early generation automated buses. An abrupt shift to buses without employees on board, by contrast, will likely alienate many transit users.

Posted in Uncategorized

Getting started with American Factfinder

This lab provides some tips for using American Factfinder to complete a neighborhood analysis over time in Philadelphia.

First, you will need to identify the location of your neighborhood and the appropriate Census Tracts in 2000 and 2010. Note that tract numbers can change over time, as can the underlying geographies. It is extremely important that the boundaries used in 2000 match the boundaries used in 2010; otherwise, you are not comparing the same neighborhoods.
This website is a helpful starting point, but there are no official neighborhood designations in Philadelphia. To find your Census boundaries, refer to these or other resources:
Tiger Web Apps
2000 Census Tract Maps (note that you can change the location of the map based on the number at the end of the URL and the map key in the bottom right.)
2010 Maps (See note above about moving the map)

Once you have written down the appropriate Census Tract numbers, get started with American Factfinder. In the left drop-down menu, start with Advanced Search. I recommend starting with the dataset in the Topics section and entering one of the data sets you plan to use (e.g., 2000 SF3, the 2000 long-form data). After you click on the dataset, you will see it added to the selection criteria in the top left.

Next, click Geographies, manually enter Census Tracts, Pennsylvania, Philadelphia, and the desired Census Tract numbers.

Next, start with the quick tables to begin your database on the neighborhood.

From this base, you can add more specific data that are relevant to your neighborhood analysis. I recommend using the search function above. Be sure to remove the Product Type: Quick Tables filter from the top-left selections before searching, so that you see a complete list. Do, however, keep the dataset selected; otherwise you will be searching through an unmanageable number of tables.

Posted in Planning Methods

Planning Methods: Population trend model

Now that you have loaded and examined the UN data on central African cities, we are going to do some ordinary least squares (OLS) regressions to predict Libreville's population. For a review of OLS, please refer to the class readings:

An Introduction to Statistical Learning, chapter 3

Video lectures available here (also, chapter 3)

Tufte’s Data Analysis for Politics and Policy ($2 for ebook)

Let’s start by looking at the population over time:

plot(pop$Year, pop$Libreville)

It looks like we have a pretty linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict the population growth by year using linear OLS and the full set of data, we’ll get a line that minimizes the squared distance between all the points in each year. In R, we can use the lm function (short for linear model) to estimate the slope and intercept of the line.

lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
## 
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.81 -24.46  -4.11  21.42  91.48 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.185e+04  1.314e+03  -16.62 3.84e-09 ***
## pop$Year     1.116e+01  6.637e-01   16.82 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.77 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9626, Adjusted R-squared:  0.9592 
## F-statistic:   283 on 1 and 11 DF,  p-value: 3.389e-09

What do the estimates mean?

The predicted population (the line that minimizes the squared distance between the points) is equal to the intercept plus the year times the estimated coefficient for year.

So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16 (using the rounded coefficients). We can also use the outputs from the regression directly, which saves the trouble of entering the numbers manually and avoids rounding error.

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000
## [1] 481.7582

Or we could predict each year in the data:

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year
##  [1] -76.48352 -20.65934  35.16484  90.98901 146.81319 202.63736 258.46154
##  [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330

Or look at the predictions next to the data (note how I use the predict() function instead of the manual estimation from above).

cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop))#NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind
##    [,1] [,2]      [,3]
## 1  1950   15 -76.48352
## 2  1955   17 -20.65934
## 3  1960   29  35.16484
## 4  1965   49  90.98901
## 5  1970   77 146.81319
## 6  1975  155 202.63736
## 7  1980  234 258.46154
## 8  1985  293 314.28571
## 9  1990  366 370.10989
## 10 1995  439 425.93407
## 11 2000  496 481.75824
## 12 2005  559 537.58242
## 13 2010  631 593.40659
## 14 2015   NA 649.23077
## 15 2020   NA 705.05495
## 16 2025   NA 760.87912
## 17 2030   NA 816.70330

Is this a good predictor of population from 2015 to 2030?

The R2 from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To try to answer that question, let's start by looking at the regression line:

plot(pop$Year, pop$Libreville)
abline(lbv)

The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1965 and 1990, and underpredicting again after 1990. We also predict negative population before 1960. This is quite apparent when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)

plot(predict(lbv), resid(lbv))
abline(h=0)

Do you expect a population prediction for 2020 based on the trend line from 1950 to 2010 to overestimate or underestimate the population? Why?


If you guessed underpredict, then I agree. While we don't necessarily know that the ongoing trend will continue, we're making increasingly bad estimates from 1995 onward. If you knew Libreville, you would also probably not guess that population growth is going to slow in the near future.

Another way that we might test our model is to see how well our general modeling framework predicts values outside of the sample used in the regression. For cross-sectional data, we might fit our model on a random sample of our data and then see how well it predicts the rest.

Since we are predicting growth trends, however, it makes more sense to intentionally remove the most recent datapoints and see how well a model using older data predicts them.

lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))

plot(pop$Year, pop$Libreville)
abline(lbv_back)

This does not look nearly as good. The R2 of lbv_back suggests that we explain 90% of the variance in the data used to fit the model, but the prediction error for the held-out years is much worse. We can compare the mean squared error of the new data to the old data.

mean(resid(lbv_back)^2)
## [1] 1438.128
mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)
## [1] 9702.728

Our average squared prediction error for 1995 to 2010 is almost seven times higher than the prediction error for 1950 to 1990. In fact, we would have better predictions of the population from 1995 to 2010 just by taking the average population between 1995 and 2010 and calling it a day.
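
If you want to check that claim yourself, a short sketch of the comparison over the same 1995 to 2010 observations is:

obs <- pop$Libreville[pop$Year > 1994 & pop$Year < 2015]  # observed 1995-2010 populations
mean((obs - mean(obs))^2)  # squared error from using the period average as the prediction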

There are a few ways that we might try to get better predictions.

The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:

plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
summary(lbv)
## 
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year > 
##     1970))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6667 -3.5595 -0.9524  3.5060  8.8095 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.636e+04  3.534e+02  -74.58 3.91e-10 ***
## Year         1.343e+01  1.774e-01   75.70 3.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.747 on 6 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.9988 
## F-statistic:  5731 on 1 and 6 DF,  p-value: 3.576e-10
abline(lbv)

Though not necessarily well-suited to these data, adding a quadratic term can often improve model fit:

lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
## 
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.701 -11.787   6.736  15.260  29.637 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.190e+05  8.743e+04   4.792 0.000733 ***
## pop$Year      -4.341e+02  8.832e+01  -4.915 0.000609 ***
## I(pop$Year^2)  1.124e-01  2.230e-02   5.042 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.95 on 10 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9873 
## F-statistic: 468.3 on 2 and 10 DF,  p-value: 1.315e-10
plot(pop$Year, pop$Libreville)

lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)

Or we could try predicting Libreville’s population as a function of a larger geography, such as all Sub-Saharan African cities over 300,000.

plot(pop$Sub_Saharan_300kplus, pop$Libreville)

summary(lm(Libreville~Sub_Saharan_300kplus,pop))
## 
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.44 -27.04 -23.48  41.54  51.95 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          22.016224  15.931117   1.382    0.194    
## Sub_Saharan_300kplus  0.004803   0.000242  19.850 5.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.14 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9704 
## F-statistic:   394 on 1 and 11 DF,  p-value: 5.795e-10
abline(lm(Libreville~Sub_Saharan_300kplus,pop))

Or we could take advantage of knowing the population in the period immediately before our prediction. Our previous models ignored the fact that the 2010 data are a lot more important than any of the other years of data for predicting the population in the future.

First, we need to create a new time-lagged population variable. Walk through the code slowly, using the command head(pop_n1) after each line, to see exactly what you are doing.

pop_n1 <- pop[c("Libreville", "Year")]        # copy just the Libreville and Year columns
pop_n1$Year <- pop_n1$Year + 5                # shift the years forward by 5
pop_n1$Libreville_m5 <- pop_n1$Libreville     # this is now the population 5 years earlier
pop_n1 <- pop_n1[c(2:3)]                      # keep only Year and Libreville_m5
pop_n1 <- subset(pop_n1, pop_n1$Year < 2015)  # keep lagged values for 1955 through 2010 only
pop <- merge(pop, pop_n1, by = "Year", all.x = T)  # merge the lag back onto the main data

Now predict Libreville’s population using the population 5 years earlier and the year.

lbv3 <- lm(Libreville ~ Libreville_m5 + Year, pop)
summary(lbv3)
## 
## Call:
## lm(formula = Libreville ~ Libreville_m5 + Year, data = pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.877  -6.747  -1.215   8.818  15.653 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -9.271e+03  1.961e+03  -4.726  0.00108 ** 
## Libreville_m5  6.705e-01  9.021e-02   7.432 3.96e-05 ***
## Year           4.740e+00  9.995e-01   4.742  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.79 on 9 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9972, Adjusted R-squared:  0.9966 
## F-statistic:  1601 on 2 and 9 DF,  p-value: 3.271e-12

This model explains 99.7% of the variance in the data and indicates that the predicted population in 2015 equals the intercept (about -9,271) plus 0.67 times the 2010 population plus 4.74 times 2015 (the year). To estimate the population in 2020, you will need to plug in the 2015 population estimate first.
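
A sketch of that two-step calculation, pulling the coefficients directly from the model object (note that the 2020 figure is built on the predicted, not an observed, 2015 population):

b <- coef(lbv3)  # intercept, Libreville_m5, and Year coefficients
pop_2015 <- b[1] + b[2] * pop$Libreville[pop$Year == 2010] + b[3] * 2015  # predicted 2015 population
pop_2020 <- b[1] + b[2] * pop_2015 + b[3] * 2020                          # uses the predicted 2015 value
c(pop_2015, pop_2020)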

plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014 & pop$Year > 1950], predict(lbv3), col=3)

EXERCISE Download population data for New York City and read it into R.

  1. Plot New York City’s population from 1790 to 2010 and describe the major trends.
  2. Based on these trends and your knowledge of NYC, describe what you think NYC’s population will be in 2020 and why.
  3. Use a linear regression to make a prediction of New York’s 2020 population. Is the prediction any good? Why?
  4. Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).
  5. Try a quadratic regression.
  6. Now try a cubic function (hint: add a term like I(Year^3) to the quadratic regression).
  7. Use earlier data (i.e., excluding the prediction years) to predict NYC’s 2000 and 2010 population. Describe the predictions in relationship to the earlier estimates.
  8. Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?
  9. Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.
  10. Create a time-lagged population variable and use it to predict New York’s population. What is the expected population in 2020?
Posted in Planning Methods

Moving to access report

Brookings Institution report with Gilles Duranton:

Developing a common narrative on urban accessibility: An urban planning perspective

Posted in Uncategorized

Beyond Mobility book cover

Sneak peek at the cover of my forthcoming Island Press book with Robert Cervero and Stefan Al:

Traditionally, transport planning has sought to enhance mobility. Shifting the focus from movement to accessing destinations leads to an entirely different framework for planning and designing cities and the infrastructure that serves them. This book illustrates principles and practices of linking sustainable transport programs with place-making.

 

Posted in Uncategorized

Queretaro Bike Planning Studio

Last fall, I took 8 graduate students to Queretaro, Mexico, for a bicycle planning studio. The final planning document is available here: CYCLE_QRO_Small

A few group images from the trip:

Posted in Uncategorized

Consumer preference survey on electric motorcycles in Solo, Indonesia

https://kotakita-journal.tumblr.com/post/137522690302/what-would-it-take-for-you-to-give-up-your

Posted in Uncategorized

Planning for cars that drive themselves

Driverless cars are coming. Nearly all major car manufacturers and several large technology firms are racing to develop vehicles that can operate with limited-to-no human interaction. Despite the probability of technological, legal, and political pitfalls along the way, fully autonomous vehicles (buses and cars) will likely be commercially available and driving themselves on city streets and highways in the next 20 years—well within a long-range planning horizon.

Read more.

 

Posted in Uncategorized

Examining Philadelphia land use data by Census Tract





The focus of this lab is to examine and learn about a dataset that matches tax assessor and Census data by tract in Philadelphia. The tax assessor data were recently made freely available online. This dataset will be used to help develop a land use program and land allocation plan for Philadelphia in 2030.

Land uses are divided into seven categories: single-family housing, multifamily housing, commercial, industrial, retail, vacant, and exempt. Information on lot sizes, amount of built area, and market value is aggregated at the Census tract level and matched with Census and spatial information, such as the distance to a train station, bus stop, and City Hall. Download the final dataset here and read it into RStudio. David Hsu and Paul Amos helped assemble this dataset.
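
Reading it in might look something like the following; the file name here is only a placeholder for whatever the downloaded file is called, and the rest of the lab assumes the object is named dat:

dat <- read.csv("phila_tract_landuse.csv", strip.white = TRUE)  # placeholder file name
dim(dat)  # check the dimensions (the dataset has 77 columns)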

You can also find a description of the underlying tax assessor data here.

It is a fairly wide dataset (77 columns), so start by looking at the variable names and numbers.

names(dat)
##  [1] "fips"                       "tract"                     
##  [3] "sfr.built.sf"               "mfr.built.sf"              
##  [5] "retail.built.sf"            "comm.built.sf"             
##  [7] "indust.built.sf"            "vacant.built.sf"           
##  [9] "exempt.built.sf"            "sfr.lot.sf"                
## [11] "mfr.lot.sf"                 "retail.lot.sf"             
## [13] "comm.lot.sf"                "indust.lot.sf"             
## [15] "vacant.lot.sf"              "exempt.lot.sf"             
## [17] "sfr.mark.val"               "mfr.mark.val"              
## [19] "retail.mark.val"            "comm.mark.val"             
## [21] "indust.mark.val"            "vacant.mark.val"           
## [23] "exempt.mark.val"            "total.built.sf"            
## [25] "total.lot.sf"               "total.mark.val"            
## [27] "ft.2.city.hall"             "ft.2.train.st"             
## [29] "ft.2.bus.st"                "pop.2010"                  
## [31] "pop_density.2010"           "square_miles.2010"         
## [33] "pop_white_nonhispanic.2010" "pop_black.2010"            
## [35] "pop_asian.2010"             "pop_hispanic.2010"         
## [37] "households.2010"            "households_family.2010"    
## [39] "households_nonfamily.2010"  "avg_hh_size.2010"          
## [41] "pop_25_plus.2010"           "edu_less_highschool.2010"  
## [43] "edu_highschool.2010"        "edu_collegeplus.2010"      
## [45] "pop_civ_employed.2010"      "pop_civ_unemployed.2010"   
## [47] "median_hh_income.2010"      "per_capita_income.2010"    
## [49] "housing_vacant.2010"        "housing_occupied.2010"     
## [51] "housing_median_age.2010"    "percent_poverty.2010"      
## [53] "pop_change.2000"            "pop_plus10.2000"           
## [55] "pop.2000"                   "pop_density.2000"          
## [57] "square_miles.2000"          "pop_white_nonhispanic.2000"
## [59] "pop_black.2000"             "pop_asian.2000"            
## [61] "pop_hispanic.2000"          "households.2000"           
## [63] "households_family.2000"     "households_nonfamily.2000" 
## [65] "avg_hh_size.2000"           "pop_25_plus.2000"          
## [67] "edu_less_highschool.2000"   "edu_highschool.2000"       
## [69] "edu_collegeplus.2000"       "pop_civ_employed.2000"     
## [71] "pop_civ_unemployed.2000"    "median_hh_income.2000"     
## [73] "per_capita_income.2000"     "housing_vacant.2000"       
## [75] "housing_occupied.2000"      "housing_median_age.2000"   
## [77] "percent_poverty.2000"

Variables 3 through 26 are from the tax assessor. Summarize these variables to get a sense of how the built environment varies across census tracts.

summary(dat[c(3:26)])
##   sfr.built.sf      mfr.built.sf     retail.built.sf  comm.built.sf     
##  Min.   :      0   Min.   :      0   Min.   :     0   Min.   :       0  
##  1st Qu.: 848570   1st Qu.: 144324   1st Qu.: 22700   1st Qu.:   52053  
##  Median :1429116   Median : 270901   Median : 71180   Median :  120155  
##  Mean   :1432991   Mean   : 395834   Mean   : 93474   Mean   :  357440  
##  3rd Qu.:1970797   3rd Qu.: 461973   3rd Qu.:133074   3rd Qu.:  239220  
##  Max.   :4357472   Max.   :3099629   Max.   :780672   Max.   :19805858  
##  indust.built.sf   vacant.built.sf    exempt.built.sf   
##  Min.   :      0   Min.   :   0.000   Min.   :    3771  
##  1st Qu.:      0   1st Qu.:   0.000   1st Qu.:  178846  
##  Median :  32572   Median :   0.000   Median :  319638  
##  Mean   : 272199   Mean   :   8.906   Mean   :  712816  
##  3rd Qu.: 198236   3rd Qu.:   0.000   3rd Qu.:  683018  
##  Max.   :6372334   Max.   :1520.000   Max.   :28206707  
##    sfr.lot.sf         mfr.lot.sf      retail.lot.sf     comm.lot.sf      
##  Min.   :       0   Min.   :      0   Min.   :     0   Min.   :       0  
##  1st Qu.:  955739   1st Qu.: 115597   1st Qu.: 23083   1st Qu.:   79158  
##  Median : 1845223   Median : 251603   Median : 54802   Median :  168255  
##  Mean   : 2476646   Mean   : 400486   Mean   : 66522   Mean   :  497081  
##  3rd Qu.: 3139156   3rd Qu.: 516614   3rd Qu.: 89680   3rd Qu.:  455709  
##  Max.   :33126247   Max.   :3026616   Max.   :633754   Max.   :17741673  
##  indust.lot.sf      vacant.lot.sf      exempt.lot.sf     
##  Min.   :       0   Min.   :       0   Min.   :   29248  
##  1st Qu.:       0   1st Qu.:   68008   1st Qu.:  348156  
##  Median :   31122   Median :  167437   Median :  746535  
##  Mean   :  548498   Mean   :  448650   Mean   : 1803958  
##  3rd Qu.:  212020   3rd Qu.:  415549   3rd Qu.: 1628277  
##  Max.   :26713019   Max.   :13823323   Max.   :40611433  
##   sfr.mark.val        mfr.mark.val       retail.mark.val   
##  Min.   :        0   Min.   :        0   Min.   :       0  
##  1st Qu.: 17434950   1st Qu.:  2777300   1st Qu.:  522825  
##  Median : 40370150   Median :  6816500   Median : 1359150  
##  Mean   : 48218979   Mean   : 13362086   Mean   : 2320078  
##  3rd Qu.: 70492725   3rd Qu.: 14893450   3rd Qu.: 2729175  
##  Max.   :249615900   Max.   :248868900   Max.   :32522100  
##  comm.mark.val       indust.mark.val     vacant.mark.val   
##  Min.   :0.000e+00   Min.   :        0   Min.   :       0  
##  1st Qu.:1.369e+06   1st Qu.:        0   1st Qu.:  214075  
##  Median :3.658e+06   Median :   393350   Median :  481950  
##  Mean   :1.887e+07   Mean   :  4313410   Mean   : 1652835  
##  3rd Qu.:9.132e+06   3rd Qu.:  2400625   3rd Qu.: 1139750  
##  Max.   :1.671e+09   Max.   :154563400   Max.   :36376000  
##  exempt.mark.val     total.built.sf      total.lot.sf     
##  Min.   :5.657e+05   Min.   :   17886   Min.   :  605741  
##  1st Qu.:6.623e+06   1st Qu.: 2190500   1st Qu.: 2539507  
##  Median :1.379e+07   Median : 2830458   Median : 4162716  
##  Mean   :4.344e+07   Mean   : 3264763   Mean   : 6241842  
##  3rd Qu.:3.132e+07   3rd Qu.: 3534784   3rd Qu.: 6811800  
##  Max.   :1.502e+09   Max.   :28206707   Max.   :91325669  
##  total.mark.val     
##  Min.   :4.546e+06  
##  1st Qu.:5.499e+07  
##  Median :9.375e+07  
##  Mean   :1.322e+08  
##  3rd Qu.:1.392e+08  
##  Max.   :2.286e+09

There are also four variables related to geographic location. The FIPS codes could be matched to a census tract shapefile; the others measure the number of feet between a census tract centroid and City Hall, the closest train station, and the closest bus stop.

summary(dat[c(1,27:29)])
##       fips          ft.2.city.hall  ft.2.train.st      ft.2.bus.st     
##  Min.   :4.21e+10   Min.   : 2881   Min.   :  258.3   Min.   :   6.36  
##  1st Qu.:4.21e+10   1st Qu.:16596   1st Qu.: 4087.1   1st Qu.: 265.63  
##  Median :4.21e+10   Median :28692   Median : 6777.2   Median : 628.73  
##  Mean   :4.21e+10   Mean   :31776   Mean   : 7443.2   Mean   : 781.29  
##  3rd Qu.:4.21e+10   3rd Qu.:43484   3rd Qu.: 9832.9   3rd Qu.:1083.03  
##  Max.   :4.21e+10   Max.   :87673   Max.   :26020.5   Max.   :7140.68

The remaining data are taken from the 2000 Census and the 2010 five-year ACS. Note that not all of the 2000 data match, and there are several null values in different categories.
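
One quick way to see how much is missing is to count the NAs in each of the 2000 columns (variables 53 through 77):

colSums(is.na(dat[53:77]))  # number of missing values in each of the 2000 Census variables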

Look at the total number of square miles of the census tracts and the total square miles of parcel area as well.

sum (dat$square_miles.2010)
## [1] 134.1014
sum(dat$total.lot.sf)/  27.8784e6 # divided by the number of square feet in a square mile 
## [1] 85.97578

Now start creating and examining some key ratios, like total FAR (floor area ratio). There is significant variation, with some tracts having almost no built area and others having a high net FAR.

dat$far <- dat$total.built.sf / dat$total.lot.sf
summary(dat$far)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.005557  0.419600  0.696300  0.875800  1.052000 13.330000

Try excluding vacant land from the FAR calculation.

summary (dat$total.built.sf / (dat$total.lot.sf - dat$vacant.lot.sf))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.005557  0.423300  0.748200  0.943900  1.165000 14.650000

And look at how vacancy varies across census tracts.

hist(dat$vacant.lot.sf/ dat$total.lot.sf )


Now, create some net FAR calculations by land use type.

summary(dat$comm.built.sf/ dat$comm.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00376  0.33490  0.61070  1.09600  1.16400 24.78000        9
summary(dat$indust.built.sf/ dat$indust.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##  0.00711  0.59420  1.04900  1.44700  1.66800 16.04000      104
summary(dat$retail.built.sf/ dat$retail.lot.sf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0355  0.9365  1.3690  1.3830  1.8030  4.3190      29
summary(dat$exempt.built.sf/ dat$exempt.lot.sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001703 0.209100 0.522600 0.783200 1.033000 7.203000

It is often useful to know how much housing households and individuals consume on average and across neighborhoods. These ratios will also be useful in understanding how much space new households are likely to consume. Note the substantial variation.

dat$res.sf.per.hh <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$households.2010
dat$res.sf.per.hh[dat$res.sf.per.hh == Inf] <- NA ## recode Inf to NA for tracts with zero households, since n/0 == Inf
summary( dat$res.sf.per.hh )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   104.1  1092.0  1220.0  1237.0  1367.0  4538.0       8

This looks about right, with most households having between 1,000 and 1,400 square feet.

hist( dat$res.sf.per.hh )


And per capita

dat$res.sf.per.person <- (dat$sfr.built.sf + dat$mfr.built.sf) / dat$pop.2010
dat$res.sf.per.person[dat$res.sf.per.person == Inf] <- NA
hist( dat$res.sf.per.person )


Look at the relationship between the population and square footage of housing.

plot( I(dat$sfr.built.sf + dat$mfr.built.sf), dat$pop.2010 )


Perhaps surprisingly, there does not seem to be a strong relationship between housing per capita and the centrality of a neighborhood.

plot(dat$res.sf.per.person , dat$ft.2.city.hall)


There is a bit more of a pattern between location and the proportion of housing that is multifamily.

dat$per.mf <- dat$mfr.built.sf / (dat$sfr.built.sf + dat$mfr.built.sf )
plot(dat$per.mf , dat$ft.2.city.hall)


Which land uses tend to go together? Use a correlation table to examine total square footage.

cor(dat[c(3:7,9)])
##                 sfr.built.sf mfr.built.sf retail.built.sf comm.built.sf
## sfr.built.sf       1.0000000 -0.227687793     0.208504423   -0.21049419
## mfr.built.sf      -0.2276878  1.000000000     0.009489829    0.45513983
## retail.built.sf    0.2085044  0.009489829     1.000000000    0.04310543
## comm.built.sf     -0.2104942  0.455139826     0.043105428    1.00000000
## indust.built.sf   -0.1466598 -0.135759385    -0.038551580    0.04383247
## exempt.built.sf   -0.2613878  0.103988183    -0.049405209    0.16346712
##                 indust.built.sf exempt.built.sf
## sfr.built.sf        -0.14665984     -0.26138775
## mfr.built.sf        -0.13575939      0.10398818
## retail.built.sf     -0.03855158     -0.04940521
## comm.built.sf        0.04383247      0.16346712
## indust.built.sf      1.00000000      0.06878225
## exempt.built.sf      0.06878225      1.00000000

And land area

cor(dat[c(10:16)])
##                 sfr.lot.sf   mfr.lot.sf retail.lot.sf comm.lot.sf
## sfr.lot.sf     1.000000000  0.174062104   -0.01660241  0.03719301
## mfr.lot.sf     0.174062104  1.000000000   -0.06027767  0.12670109
## retail.lot.sf -0.016602407 -0.060277674    1.00000000 -0.01112620
## comm.lot.sf    0.037193013  0.126701092   -0.01112620  1.00000000
## indust.lot.sf -0.071661971 -0.009073780   -0.10092578  0.39273649
## vacant.lot.sf -0.013189068  0.009647197   -0.05241089  0.71002277
## exempt.lot.sf  0.000523062  0.011982805   -0.16823137  0.37194754
##               indust.lot.sf vacant.lot.sf exempt.lot.sf
## sfr.lot.sf      -0.07166197  -0.013189068   0.000523062
## mfr.lot.sf      -0.00907378   0.009647197   0.011982805
## retail.lot.sf   -0.10092578  -0.052410890  -0.168231370
## comm.lot.sf      0.39273649   0.710022767   0.371947536
## indust.lot.sf    1.00000000   0.643286320   0.489299242
## vacant.lot.sf    0.64328632   1.000000000   0.531508105
## exempt.lot.sf    0.48929924   0.531508105   1.000000000

Neither of these measures normalizes the data. Now compare how the proportion of area in a given land use compares with the proportion in other land uses.

cor(I(dat[3]/dat[24]), I(dat[4]/dat[24]))
##              mfr.built.sf
## sfr.built.sf   -0.2779901
cor(I(dat[3]/dat[24]), I(dat[5]/dat[24]))
##              retail.built.sf
## sfr.built.sf       0.1160276
cor(I(dat[3]/dat[24]), I(dat[6]/dat[24]))
##              comm.built.sf
## sfr.built.sf    -0.4443231
cor(I(dat[3]/dat[24]), I(dat[7]/dat[24]))
##              indust.built.sf
## sfr.built.sf      -0.3305134
cor(I(dat[3]/dat[24]), I(dat[8]/dat[24]))
##              vacant.built.sf
## sfr.built.sf     0.008721208
cor(I(dat[3]/dat[24]), I(dat[9]/dat[24]))
##              exempt.built.sf
## sfr.built.sf      -0.6608287
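
A more compact alternative is to build the shares once and take the whole correlation matrix in a single call (a sketch):

shares <- as.matrix(dat[3:9]) / dat$total.built.sf  # each built-area column as a share of total built square footage
round(cor(shares), 2)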

Try using regressions to better understand the partial correlations between land uses.

summary(lm(sfr.built.sf ~ mfr.built.sf + retail.built.sf + comm.built.sf + indust.built.sf 
           +  exempt.built.sf +  vacant.lot.sf  , dat))
## 
## Call:
## lm(formula = sfr.built.sf ~ mfr.built.sf + retail.built.sf + 
##     comm.built.sf + indust.built.sf + exempt.built.sf + vacant.lot.sf, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1548586  -500194   -17272   498795  2940319 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.550e+06  6.833e+04  22.686  < 2e-16 ***
## mfr.built.sf    -3.486e-01  1.005e-01  -3.469 0.000582 ***
## retail.built.sf  1.605e+00  3.848e-01   4.172 3.76e-05 ***
## comm.built.sf   -5.509e-02  3.127e-02  -1.762 0.078859 .  
## indust.built.sf -1.539e-01  6.400e-02  -2.405 0.016662 *  
## exempt.built.sf -8.591e-02  1.978e-02  -4.344 1.80e-05 ***
## vacant.lot.sf   -1.419e-02  3.871e-02  -0.367 0.714091    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728300 on 377 degrees of freedom
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1661 
## F-statistic: 13.71 on 6 and 377 DF,  p-value: 4.216e-14

EXERCISE

  1. Estimate a gross measure of total FAR using Census land area and plot this against net FAR. Describe any important variation.

  2. Try developing a model to predict neighborhood FAR. What are the important and statistically significant predictors?

  3. Try predicting the FAR of each land use. Which are the easiest to predict? How do the predictors vary across uses?

Posted in Planning Methods