
Grupo Milenio forum on transportation and planning in Mexico City
Posted in Uncategorized
2018 Transit Trends
http://www.trapezegroup.com/blogentry/2018transittrendsfromaroundtheglobenorthamerica
Posted in Uncategorized
Where do bike lanes work best? A Bayesian spatial model of bicycle lanes and bicycle crashes
This link should work for 50 days: https://authors.elsevier.com/c/1WBX23IVV9Z7m9
Posted in Uncategorized
Working with the Philadelphia Region Household Travel Survey
The purpose of this lab is to learn to work with the Philadelphia region 2012 household travel survey. You will learn how to upload data into R, summarize information from the data, combine datasheets, and examine subsets of the data. By the end of the lab, you should feel comfortable looking at the travel survey and be able to complete the relevant questions from assignment 1.
Many large cities collect important individual-level data to understand and model regional travel behavior and to predict the impacts of transportation investments and new commercial and housing developments. The United States also collects household travel data for the entire country. The data for surveys from 1983 to 2009 can be found here.
To work with the Philadelphia household travel survey, download the data from the Delaware Valley Regional Planning Commission website. You can find links to the 2002 data here as well. The 2012 files come as an Access database or in Excel files. For this exercise, you need to open the Excel files and save the sheets as CSVs. In Excel, click on File/Save as and then scroll down to CSV (comma delimited). Save or move the files to the folder where you want to work.
First, clear objects from memory with the rm() command and set the working directory to the folder containing your files. Note that if you are using a PC, you need to change all back slashes to forward slashes. For a brief tutorial, click here.
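For instance, a minimal setup block might look like the following. The path shown is only a placeholder for your own folder, so the setwd() line is commented out here:

```r
# Clear all objects from the current R session
rm(list = ls())

# Point R at the folder holding your CSV files.
# Replace the placeholder path with your own; on a PC, use forward slashes.
# setwd("C:/Users/yourname/Documents/DVRPC")
```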
Next, upload the data into your work session with the read.csv() command. This will likely work with no additional options, but to be safe, use the following: the strip.white parameter removes blank spaces from the table, sep sets the field separator, blank.lines.skip skips blank lines, and na.strings specifies the symbol for missing observations. Any of these parameters can be changed. Depending on how your data are saved to a CSV, you may need to change the separator.
Note that you may have to change the names of the subfolder and files, depending on how you saved the CSVs.
If you are having trouble reading the data into R, there is a handy interface in RStudio. Click on Tools in the top right, then Import Dataset, then From a Text File.
hh <- read.csv("DVRPC HTS Database Files/1_Household_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")
per <- read.csv("DVRPC HTS Database Files/2_Person_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")
veh <- read.csv("DVRPC HTS Database Files/3_Vehicle_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")
trip <- read.csv("DVRPC HTS Database Files/4_Trip_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")
Consult the documentation about the function parameters when you load the files.
?read.csv
You should now have four objects in the top right window of R Studio. You can also check that these loaded properly using the ls() command.
ls()
## [1] "hh" "per" "trip" "veh"
Again, type ?command to understand the function.
?ls
Finally, check the dimensions to make sure that you have the right number of observations and variables.
dim(hh)
## [1] 9235 38
dim(per)
## [1] 20216 72
dim(trip)
## [1] 81940 81
dim(veh)
## [1] 14903 13
The str() function uncovers the structure of the data, as does head(); tail() shows the last six rows and is particularly useful for making sure that your data loaded properly. The summary() function provides descriptive statistics such as the mean or range, depending on the data type (for example, factors and characters do not have mean values). The brackets used with the summary command (hh[,1:5]) indicate that you only want to look at the first five variables. If you move the comma after 1:5, then you will see a summary of the first five entries for all of the variables.
str(hh)
## 'data.frame': 9235 obs. of 38 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ HH_ID : int 100140 100206 100291 100355 100374 100449 100503 100601 100663 100677 ...
## $ HH_WEIGHT : num 133 204 288 222 365 ...
## $ HH_GEO_ACC : int 1 1 1 1 1 1 1 1 1 1 ...
## $ H_STATE : int 34 34 34 34 34 34 34 34 34 34 ...
## $ H_COUNTY : int 34005 34005 34005 34005 34005 34005 34005 34005 34005 34005 ...
## $ H_CPA : Factor w/ 84 levels "005_701","005_702",..: 2 2 1 1 1 1 1 1 4 4 ...
## $ H_MCD : num 3.4e+09 3.4e+09 3.4e+09 3.4e+09 3.4e+09 ...
## $ H_TRACT : num 3.4e+10 3.4e+10 3.4e+10 3.4e+10 3.4e+10 ...
## $ H_TAZ : int 20225 20249 20048 20042 20002 20020 20026 20011 20637 20629 ...
## $ A_TYPE : int 4 4 4 4 4 4 4 5 5 5 ...
## $ HH_TOT_TRIPS : int 8 11 9 5 24 9 9 18 4 5 ...
## $ HH_MO_TRIPS : int 8 5 9 4 24 9 7 16 4 5 ...
## $ HH_NM_TRIPS : int 0 6 0 1 0 0 2 2 0 0 ...
## $ GPS_Factor : num 1.12 1.17 1.12 1.17 1.17 1.17 1.12 1.12 1.13 1.17 ...
## $ F_HH_TOT_TRIPS: num 8.96 12.87 10.08 5.85 28.08 ...
## $ F_HH_MO_TRIPS : num 8.96 5.85 10.08 4.68 28.08 ...
## $ F_HH_NM_TRIPS : num 0 7.02 0 1.17 0 0 2.24 2.24 0 0 ...
## $ HH_SIZE : int 2 3 5 1 4 2 2 4 1 2 ...
## $ HH_WORK : int 2 2 2 1 1 0 1 2 0 2 ...
## $ TOT_VEH : int 2 2 2 1 2 2 2 2 1 1 ...
## $ OP_VEH : int 2 2 2 1 2 2 2 2 1 1 ...
## $ VEH_OUT : int 20 9000 11 9000 3 20 50 15 1 30 ...
## $ TOT_BIKE : int 1 3 7 0 4 0 2 4 0 0 ...
## $ TOLL_ACCNT : int 1 1 1 2 1 1 1 1 2 2 ...
## $ CAR_SHARE : int 2 2 2 2 2 2 2 2 2 2 ...
## $ RES_TYPE : int 1 1 1 1 1 1 2 1 1 1 ...
## $ INCOME : int 6 5 7 4 5 5 6 7 99 4 ...
## $ TRAV_DATE : Factor w/ 271 levels "1/1/2013","1/10/2013",..: 236 230 215 243 230 213 243 213 215 213 ...
## $ TRAV_DOW : int 5 4 2 3 4 1 3 1 2 1 ...
## $ HOLIDAY : Factor w/ 21 levels "","Birthday of Martin Luther King, Jr.",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HOL_TYPE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OV_SAMPLE : int 1 0 0 0 1 0 1 0 0 0 ...
## $ STUDY : int 1 1 1 1 1 1 1 1 1 1 ...
## $ RECRUIT : int 1 1 1 1 1 1 1 1 1 1 ...
## $ RETRIEVE : int 3 3 3 2 3 3 2 3 3 2 ...
## $ R_LANG : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SURV_LANG : int 1 1 1 1 1 1 1 1 1 1 ...
head(hh)
## ID HH_ID HH_WEIGHT HH_GEO_ACC H_STATE H_COUNTY H_CPA H_MCD
## 1 1 100140 133.46032 1 34 34005 005_702 3400543740
## 2 2 100206 204.10617 1 34 34005 005_702 3400547880
## 3 3 100291 287.91055 1 34 34005 005_701 3400517440
## 4 4 100355 222.44414 1 34 34005 005_701 3400517080
## 5 5 100374 365.14646 1 34 34005 005_701 3400505740
## 6 6 100449 74.05932 1 34 34005 005_701 3400508950
## H_TRACT H_TAZ A_TYPE HH_TOT_TRIPS HH_MO_TRIPS HH_NM_TRIPS GPS_Factor
## 1 34005700405 20225 4 8 8 0 1.12
## 2 34005700501 20249 4 11 5 6 1.17
## 3 34005700603 20048 4 9 9 0 1.12
## 4 34005700800 20042 4 5 4 1 1.17
## 5 34005700900 20002 4 24 24 0 1.17
## 6 34005701104 20020 4 9 9 0 1.17
## F_HH_TOT_TRIPS F_HH_MO_TRIPS F_HH_NM_TRIPS HH_SIZE HH_WORK TOT_VEH
## 1 8.96 8.96 0.00 2 2 2
## 2 12.87 5.85 7.02 3 2 2
## 3 10.08 10.08 0.00 5 2 2
## 4 5.85 4.68 1.17 1 1 1
## 5 28.08 28.08 0.00 4 1 2
## 6 10.53 10.53 0.00 2 0 2
## OP_VEH VEH_OUT TOT_BIKE TOLL_ACCNT CAR_SHARE RES_TYPE INCOME TRAV_DATE
## 1 2 20 1 1 2 1 6 8/3/2012
## 2 2 9000 3 1 2 1 5 8/2/2012
## 3 2 11 7 1 2 1 7 7/31/2012
## 4 1 9000 0 2 2 1 4 8/8/2012
## 5 2 3 4 1 2 1 5 8/2/2012
## 6 2 20 0 1 2 1 5 7/30/2012
## TRAV_DOW HOLIDAY HOL_TYPE OV_SAMPLE STUDY RECRUIT RETRIEVE R_LANG
## 1 5 0 1 1 1 3 1
## 2 4 0 0 1 1 3 1
## 3 2 0 0 1 1 3 1
## 4 3 0 0 1 1 2 1
## 5 4 0 1 1 1 3 1
## 6 1 0 0 1 1 3 1
## SURV_LANG
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
summary(hh[,1:5])
## ID HH_ID HH_WEIGHT HH_GEO_ACC
## Min. : 1 Min. : 100140 Min. : 40.32 Min. :1
## 1st Qu.:2361 1st Qu.: 397484 1st Qu.: 109.56 1st Qu.:1
## Median :4746 Median : 510518 Median : 165.11 Median :1
## Mean :4781 Mean : 629041 Mean : 227.09 Mean :1
## 3rd Qu.:7198 3rd Qu.: 703038 3rd Qu.: 279.67 3rd Qu.:1
## Max. :9628 Max. :1736693 Max. :1086.64 Max. :1
## H_STATE
## Min. :34.00
## 1st Qu.:42.00
## Median :42.00
## Mean :40.04
## 3rd Qu.:42.00
## Max. :42.00
For this purpose, the functions names() and colnames() are almost identical: both provide the names of all columns (variables) in the dataset. The rownames() function provides the names of rows; here we look at rows 100 to 105. Be sure to check the data dictionary provided with the data to understand each variable.
names(hh)
## [1] "ID" "HH_ID" "HH_WEIGHT" "HH_GEO_ACC"
## [5] "H_STATE" "H_COUNTY" "H_CPA" "H_MCD"
## [9] "H_TRACT" "H_TAZ" "A_TYPE" "HH_TOT_TRIPS"
## [13] "HH_MO_TRIPS" "HH_NM_TRIPS" "GPS_Factor" "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS" "F_HH_NM_TRIPS" "HH_SIZE" "HH_WORK"
## [21] "TOT_VEH" "OP_VEH" "VEH_OUT" "TOT_BIKE"
## [25] "TOLL_ACCNT" "CAR_SHARE" "RES_TYPE" "INCOME"
## [29] "TRAV_DATE" "TRAV_DOW" "HOLIDAY" "HOL_TYPE"
## [33] "OV_SAMPLE" "STUDY" "RECRUIT" "RETRIEVE"
## [37] "R_LANG" "SURV_LANG"
colnames(hh)
## [1] "ID" "HH_ID" "HH_WEIGHT" "HH_GEO_ACC"
## [5] "H_STATE" "H_COUNTY" "H_CPA" "H_MCD"
## [9] "H_TRACT" "H_TAZ" "A_TYPE" "HH_TOT_TRIPS"
## [13] "HH_MO_TRIPS" "HH_NM_TRIPS" "GPS_Factor" "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS" "F_HH_NM_TRIPS" "HH_SIZE" "HH_WORK"
## [21] "TOT_VEH" "OP_VEH" "VEH_OUT" "TOT_BIKE"
## [25] "TOLL_ACCNT" "CAR_SHARE" "RES_TYPE" "INCOME"
## [29] "TRAV_DATE" "TRAV_DOW" "HOLIDAY" "HOL_TYPE"
## [33] "OV_SAMPLE" "STUDY" "RECRUIT" "RETRIEVE"
## [37] "R_LANG" "SURV_LANG"
rownames(hh[100:105,])
## [1] "100" "101" "102" "103" "104" "105"
The View() and fix() functions provide a useful tabular view of the data. View() opens a separate tab in RStudio; it is important to capitalize the first letter of this function, otherwise it will not run. fix() (not capitalized) opens a new window and allows editing, just like Excel. To reuse the edits, the R object has to be saved as a new file.
View(hh)
fix(hh)
The table() function creates useful summary tables and multidimensional cross-tables. In this case, passing the TOT_VEH variable to the function creates a one-dimensional table of the number of households by the number of vehicles. Dividing it by the length of the variable produces the proportions.
table(hh$TOT_VEH)
##
## 0 1 2 3 4 5 6 7 8 10 15
## 833 3109 3726 1120 318 93 22 8 4 1 1
table(hh$TOT_VEH)/length(hh$TOT_VEH)
##
## 0 1 2 3 4
## 0.0902003249 0.3366540336 0.4034650785 0.1212777477 0.0344342177
## 5 6 7 8 10
## 0.0100703844 0.0023822415 0.0008662696 0.0004331348 0.0001082837
## 15
## 0.0001082837
Using the round() function, we round the values and multiply them by 100 to make them easier to read: 40.3% of the households have two cars.
round(table(hh$TOT_VEH)/length(hh$TOT_VEH)*100, digits=3)
##
## 0 1 2 3 4 5 6 7 8 10
## 9.020 33.665 40.347 12.128 3.443 1.007 0.238 0.087 0.043 0.011
## 15
## 0.011
Passing two arguments creates a two-dimensional table of the number of households by the number of cars and household income. Note that the brackets remove data where the income is unknown (98) or not given (99). You will also need to consult the data dictionary to understand the income levels represented by the numbers 1 through 10.
table(hh$TOT_VEH[hh$INCOME < 97], hh$INCOME[hh$INCOME < 97])
##
## 1 2 3 4 5 6 7 8 9 10
## 0 179 240 93 90 95 33 30 17 15 0
## 1 114 504 405 510 587 335 236 84 82 27
## 2 19 90 140 333 646 685 812 343 210 121
## 3 3 14 25 57 147 187 295 152 78 64
## 4 2 4 6 15 47 53 70 47 23 19
## 5 1 1 2 3 10 10 23 16 7 6
## 6 0 0 0 0 3 5 5 5 1 3
## 7 0 0 0 0 1 4 1 0 1 0
## 8 0 0 0 0 2 1 0 0 0 1
## 10 0 0 0 0 0 0 0 1 0 0
## 15 0 0 0 0 1 0 0 0 0 0
The hist() and plot() functions are used for plotting and let us see how the data are distributed. hist() creates a histogram; embedding the density() function in plot() plots the density of the observed distribution. In this case the density plot is not particularly useful, but it is frequently quite helpful with larger, more continuous data.
hist(hh$TOT_VEH)
plot(density(hh$TOT_VEH))
More on functions table() and hist()
?table
?hist
You see that many of the observations have a value of 98 or 99. These are the values used for households whose income is unknown (98) or that refused to report it (99). Using square brackets or the subset() function allows us to select only valid observations. In general, use brackets for a single variable, but subset() for multiple variables (including the entire data frame).
table(hh$INCOME)
##
## 1 2 3 4 5 6 7 8 9 10 98 99
## 318 853 671 1008 1539 1313 1472 665 417 241 13 725
hist(hh$INCOME[hh$INCOME < 98])
table(hh$TOT_VEH[hh$INCOME < 98])
##
## 0 1 2 3 4 5 6 7 8 10 15
## 792 2884 3399 1022 286 79 22 7 4 1 1
summary(subset(hh[,1:5], hh$INCOME < 98))
## ID HH_ID HH_WEIGHT HH_GEO_ACC
## Min. : 1 Min. : 100140 Min. : 40.32 Min. :1
## 1st Qu.:2366 1st Qu.: 397693 1st Qu.: 113.42 1st Qu.:1
## Median :4783 Median : 512032 Median : 169.94 Median :1
## Mean :4815 Mean : 636321 Mean : 233.20 Mean :1
## 3rd Qu.:7288 3rd Qu.: 708975 3rd Qu.: 291.01 3rd Qu.:1
## Max. :9628 Max. :1736693 Max. :1086.64 Max. :1
## H_STATE
## Min. :34.00
## 1st Qu.:42.00
## Median :42.00
## Mean :40.03
## 3rd Qu.:42.00
## Max. :42.00
More on function subset()
?subset
You can also recode the missing observations so that they display nicely in the histogram.
hh$inc <- hh$INCOME # same variable
hh$inc[hh$inc==98] <- 11 # recoding missing from 98 to 11
hh$inc[hh$inc==99] <- 12 # recoding refused from 99 to 12
hist(hh$inc) #looks much better now
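An alternative, sketched here on a small stand-in vector rather than the real hh$INCOME so the snippet runs on its own, is to recode the 98 and 99 codes to NA and let R's missing-value handling take over:

```r
# Small stand-in for hh$INCOME (98 = unknown, 99 = refused)
income <- c(4, 6, 98, 2, 99, 7)

# Recode both missing-data codes to NA in one step
income[income %in% c(98, 99)] <- NA

mean(income, na.rm = TRUE)  # most functions can then drop NAs explicitly
hist(income)                # hist() drops NAs automatically
```

With the real data, you would assign the recoded values to a new column (for example hh$inc2) so that the original INCOME variable stays intact.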
There are also three other data frames from DVRPC, and you will often want to combine their information. In order to combine the data, you first need to understand the identification variables. Notice that the person-level file has a column called HH_ID. This is the variable to use to join the person data and the household data.
names(per)
## [1] "ID" "HH_ID" "PERSON_NUM" "PERSON_ID"
## [5] "HH_WEIGHT" "P_WEIGHT" "P_TOT_TRIPS" "P_MO_TRIPS"
## [9] "P_NM_TRIPS" "GPS_Factor" "F_P_TOT_TRIPS" "F_P_MO_TRIPS"
## [13] "F_P_NM_TRIPS" "GEND" "AGECAT" "RACE"
## [17] "EDUCA" "RELAT" "LIC" "EMPLY"
## [21] "WK_STAT" "SEMPLY" "JOBS" "WK_LOC"
## [25] "W_GEO_ACC" "W_STATE" "W_REGION" "W_COUNTY"
## [29] "W_CPA" "W_MCD" "W_TRACT" "W_TAZ"
## [33] "HOURS" "OCCUP" "WK_MODE" "WK_MODETMS"
## [37] "ARRV_WRK" "ARRV_TMS" "LV_WRK" "LV_TMS"
## [41] "TCOMM" "PARK_SUB" "PARK_SUB_AMT" "TRAN_SUB"
## [45] "TRAN_SUB_AMT" "WDAY_1" "WDAY_2" "WDAY_3"
## [49] "WDAY_4" "WDAY_5" "WDAY_6" "WDAY_7"
## [53] "STUDE" "SCHOL" "S_HOME" "S_ONLN"
## [57] "S_GEO_ACC" "S_STATE" "S_REGION" "S_COUNTY"
## [61] "S_CPA" "S_MCD" "S_TRACT" "S_TAZ"
## [65] "PRESCH_1" "PRESCH_2" "PRESCH_3" "T_TRIP"
## [69] "PAY_TYPE1" "PAY_TYPE2" "PAY_TYPE3" "TAP"
Use the length() and unique() commands to check how well the data will combine. The function length() extracts the length of any vector; unique() creates a vector of all unique values within another vector or variable. The number of unique household ids in the person data should equal the number of observations in the household data. By contrast, the number of unique person ids should equal the number of observations in the person data.
length(unique(per$HH_ID))
## [1] 9235
length(unique(per$PERSON_ID))
## [1] 20216
You should also notice that the person id is the same as the household id with the person number added to it.
head(per[1:4])
## ID HH_ID PERSON_NUM PERSON_ID
## 1 279 100140 1 10014001
## 2 280 100140 2 10014002
## 3 281 100206 1 10020601
## 4 282 100206 2 10020602
## 5 283 100206 3 10020603
## 6 284 100291 1 10029101
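To confirm that relationship programmatically rather than by eye, you can test whether every person id equals the household id with the two-digit person number appended. The check below runs on a few stand-in rows copied from the output above; with the real data, you would use per in place of per_ids:

```r
# A few rows of the person file's ID columns
per_ids <- data.frame(
  HH_ID      = c(100140, 100140, 100206),
  PERSON_NUM = c(1, 2, 1),
  PERSON_ID  = c(10014001, 10014002, 10020601)
)

# TRUE if PERSON_ID is HH_ID followed by the two-digit person number
all(per_ids$PERSON_ID == per_ids$HH_ID * 100 + per_ids$PERSON_NUM)
```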
Now that you have found the correct identification numbers, you can merge the files using the merge() command. Add all of the household data to the person file and create a new data frame. In this case, there is a perfect match, but if there were some people with missing household data, you could keep all person entries by setting all.x = TRUE instead of FALSE.
per_hh <- merge(per, hh, by = "HH_ID", all.x = FALSE, all.y=FALSE, sort = FALSE)
Always check your data before and after a merge to make sure that you are getting what you want. In this case, we want the merged data frame to have the same number of observations as the person data.
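A before-and-after check might be sketched like this, using toy household and person tables so that the snippet stands on its own; with the real data you would compare against dim(per):

```r
# Toy stand-ins for the household and person tables
hh_toy  <- data.frame(HH_ID = c(1, 2, 3), INCOME = c(5, 7, 4))
per_toy <- data.frame(HH_ID = c(1, 1, 2, 3), AGECAT = c(3, 1, 4, 2))

merged <- merge(per_toy, hh_toy, by = "HH_ID", all.x = FALSE, all.y = FALSE)

# A clean person-to-household merge keeps one row per person and
# adds the household columns without duplicating the key
nrow(merged) == nrow(per_toy)                      # TRUE
ncol(merged) == ncol(per_toy) + ncol(hh_toy) - 1   # TRUE
```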
The next dataset contains data about the individual trips that people made during one day. Note that you can make the head() command show 10 entries instead of the default of 6. The trip data frame has ids to match both the household and person data.
head(trip[,1:12], 10)
Try making a subset of all trips made by members of one household using the subset() function. The output of the function is a data frame, which can be checked using class() function. Some functions return objects in matrix form, which makes the data processing a little bit different.
test <- subset(trip, trip$HH_ID == 100374)
class(test)
## [1] "data.frame"
Now look at two variables of this subset. The variable MODE contains information about the mode that was used in each trip. MODE_AGG combines some of the modes into larger categories. Again refer to the data dictionary.
test$MODE
## [1] 5 5 5 5 NA 5 5 5 5 5 5 5 5 5 5 NA 5 5 5 5 NA 5 5
## [24] 5 5 5 5 NA
test$MODE_AGG
## [1] 3 3 3 3 NA 3 3 3 3 3 3 3 3 3 3 NA 3 3 3 3 NA 3 3
## [24] 3 3 3 3 NA
You can find out how many trips lack information about mode using the is.na() function. It produces a vector of the same length as trip$MODE that contains only TRUE or FALSE values. The summary() function then counts how many of each value the logical vector contains. The number of TRUEs is about 20,000; therefore, there are about 20,000 trips with no information about mode.
summary(is.na(trip$MODE))
## Mode FALSE TRUE NA's
## logical 61739 20201 0
Then you can select only the observations that have information about mode.
head(subset(trip,is.na(trip$MODE)==FALSE))[,5:10]
## HH_WEIGHT P_WEIGHT TOUR_NUM WKSUB_NUM WKSUB_ID TRIP_NUM
## 1 133.4603 132.5431 1 NA 1.0014e+11 1
## 2 133.4603 132.5431 1 NA 1.0014e+11 2
## 3 133.4603 132.5431 1 NA 1.0014e+11 3
## 4 133.4603 132.5431 1 NA 1.0014e+11 4
## 5 133.4603 132.5431 1 NA 1.0014e+11 5
## 7 133.4603 115.1988 1 NA 1.0014e+11 1
You may also wish to get a better idea of what is happening with the trips that have no reported mode. Notice that the trip number is equal to 0 (no trips) or 97 (final destination).
table(trip$TRIP_NUM[is.na(trip$MODE)==TRUE])
##
## 0 97
## 3611 16590
Finally, look at the vehicle dataset.
head(veh)
## ID HH_ID HH_WEIGHT TOT_VEH OP_VEH VEH_NUM VEH_ID VEH_OWN YEAR MAKE
## 1 79 100140 133.4603 2 2 1 10014001 1 2009 15
## 2 80 100140 133.4603 2 2 2 10014002 1 2010 8
## 3 81 100206 204.1062 2 2 1 10020601 1 1999 14
## 4 82 100206 204.1062 2 2 2 10020602 1 2009 37
## 5 83 100291 287.9105 2 2 1 10029101 2 2010 39
## 6 84 100291 287.9105 2 2 2 10029102 2 2011 39
## MODEL BODY VEH_TYPE
## 1 SONATA 1 2
## 2 GRAND CARAVAN 8 2
## 3 CRV 2 2
## 4 FORRESTER 2 2
## 5 PRIUS 1 1
## 6 SEIANNA 8 2
Notice that there are more vehicles than households, and that there are fewer unique household ids in the vehicle data than there are households. This is because not all households own cars, while some own more than one.
dim(veh)
## [1] 14903 13
length(unique(veh$HH_ID))
## [1] 8307
Now that you have looked at all the data, try to answer these two questions: How many trips do people take on average? And how does this vary by income?
The first should be relatively straightforward, since the person data already contain a variable with each person's total trips.
table(per$P_TOT_TRIPS)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 3625 629 6025 2467 2880 1544 1251 698 478 292 182 76 33 21 6
## 15 16 19 20
## 4 2 1 2
mean(per$P_TOT_TRIPS)
## [1] 3.053225
This, however, is not entirely correct. The data come from a sample of households that varies systematically from the households of the Philadelphia region. This is accounted for by the weights. Look at the values of the weights and then use the weighted.mean() function to apply them.
summary(per$P_WEIGHT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 73.82 88.67 117.50 278.30 233.10 3665.00
weighted.mean(per$P_TOT_TRIPS, per$P_WEIGHT)
## [1] 2.902891
It is often interesting to look at who gets undersampled in these types of surveys. First, plot the density of person weights. Notice how there is a long tail of people with very high weights.
plot(density(per$P_WEIGHT))
plot(density(per$P_WEIGHT[per$P_WEIGHT<500]))
Now compare the racial differences across the high and low weights. Notice how the undersampled population (those given a high weight) is less white than the oversampled population. In general, these types of surveys tend to oversample white households, households that own their own homes, large households, and households in single-family homes. Without applying the weights, the summaries are more likely to reflect the sample than the population of the Philadelphia region.
round(table(per$RACE[per$P_WEIGHT<=278.3])/length(per$RACE[per$P_WEIGHT<=278.3]),3)
##
## 1 2 3 4 5 6 97 98 99 100 988
## 0.872 0.052 0.008 0.001 0.020 0.000 0.002 0.001 0.011 0.008 0.024
round(table(per$RACE[per$P_WEIGHT>278.3])/length(per$RACE[per$P_WEIGHT>278.3]),3)
##
## 1 2 3 4 5 97 98 99 100 988
## 0.712 0.157 0.039 0.003 0.027 0.004 0.001 0.011 0.018 0.027
Note that each weight is equal to the number of people or households that the observation is supposed to represent, so you can sum the weights for an estimate of the number of people, households, or vehicles in the region.
sum(per$P_WEIGHT)
## [1] 5627010
sum(hh$HH_WEIGHT)
## [1] 2097203
sum(hh$HH_WEIGHT*hh$TOT_VEH)
## [1] 3330849
Now look at how the number of trips varies by income. This is a little trickier. First, you have to merge the person and household data (which we already did). Second, you need to make the estimate for each income group. There are a few ways to automate this (if you are interested, look at the apply functions and the doBy package), but for now do it manually for the highest and lowest income groups.
weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==1], per_hh$P_WEIGHT[per_hh$INCOME==1])
## [1] 2.489796
weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==10], per_hh$P_WEIGHT[per_hh$INCOME==10])
## [1] 3.339281
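One way to automate the calculation for all ten income groups is an sapply() loop over the group codes. The sketch below uses a toy stand-in for the merged data so that it runs on its own; the same pattern works with per_hh and its P_TOT_TRIPS, P_WEIGHT, and INCOME columns:

```r
# Toy stand-in for the merged person-household data
demo <- data.frame(
  P_TOT_TRIPS = c(2, 4, 3, 5, 1, 6),
  P_WEIGHT    = c(100, 200, 150, 120, 300, 80),
  INCOME      = c(1, 1, 2, 2, 3, 3)
)

# Weighted mean trips for every income group in one call
wm_by_income <- sapply(sort(unique(demo$INCOME)), function(i) {
  rows <- demo$INCOME == i
  weighted.mean(demo$P_TOT_TRIPS[rows], demo$P_WEIGHT[rows])
})
wm_by_income
```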
Finally, you may want to create and save a new dataframe that only contains what you want. First create a new dataframe that contains the variables you want to keep.
dat <- per_hh[c("HH_ID", "PERSON_ID", "P_TOT_TRIPS", "GEND", "AGECAT", "RACE", "INCOME", "TOT_VEH")]
You can save it as a CSV or an R data file. A CSV is particularly useful if you have created a summary file that you want to open to make a nicely formatted table in Excel or send your file to someone who does not use R.
The first argument specifies the data frame to be saved; the file argument specifies the file name. By default, R will write the files to your current working directory.
write.csv(dat, file="per_hh.csv")
save(dat, file="per_hh.Rda")
EXERCISE
- Describe in detail the daily activities and trips of the first person in the trips data.
- What is the mode share for the Philadelphia region? (Hint: start with MODE_AGG instead of MODE.)
- For Philadelphia?
- Give the weighted value too. (Hint: use the trip files with the person weights.)
- Compare the regional mode share for work and non-work trips. (For convenience, use the TOUR_TYPE variable instead of the activities.)
- Compare the mode share for households with income under $35,000 with those at or above $35,000.
- Plot the number of trips against income and describe any patterns that you see.
- What percent of driving trips paid for parking?
- What was the average hourly parking rate? (Hint: not all prices are given per hour, so convert these.)
Posted in Transportation Planning
Transit user perceptions of driverless buses
PhD student Xiaoxia (Summer) Dong presents our paper with Matt DiScenna on Transit user perceptions of driverless buses at the 2017 PA Automated Vehicle Summit in State College, PA.
The paper reports the results of a stated preference survey of regular transit users’ willingness to ride and concerns about driverless buses in the Philadelphia region. As automated technologies advance, driverless buses may offer significant efficiency, safety, and operational improvements over traditional bus services. However, unfamiliarity with automated vehicle technology may challenge its acceptance among the general public and slow the adoption of new technologies. Using a mixed logit modeling framework, this research examines which types of transit users are most willing to ride in driverless buses and whether having a transit employee on board to monitor the vehicle operations and/or provide customer service matters. Of the 891 surveyed members of University of Pennsylvania’s transit pass benefit program, two-thirds express a willingness to ride in a driverless bus when a transit employee is on board to monitor vehicle operations and provide customer service. By contrast, only 13% would agree to ride a bus without an employee on board. Males and those in younger age groups (18–34) are more willing to ride in driverless buses than females and those in older age groups. Findings suggest that, so long as a transit employee is onboard, many transit passengers will willingly board early generation automated buses. An abrupt shift to buses without employees on board, by contrast, will likely alienate many transit users.
Posted in Uncategorized
Getting started with American FactFinder
This lab provides some tips for using American FactFinder to complete a neighborhood analysis over time in Philadelphia.
First, you will need to identify the location of your neighborhood and identify the appropriate Census Tracts in 2000 and 2010. Note that these numbers can change over time, as can the basic geographies. It is extremely important that the boundaries used in 2000 match the boundaries used in 2010. Otherwise, you are not comparing the same neighborhoods.
This website is a helpful starting point, but there are no official neighborhood designations in Philadelphia. To find your Census boundaries, refer to these or other resources:
Tiger Web Apps
2000 Census Tract Maps (note that you can change the location of the map based on the number at the end of the URL and the map key in the bottom right.)
2010 Maps (See note above about moving the map)
Once you have written down the appropriate Census Tract numbers, get started with American FactFinder. In the left drop-down menu, start with Advanced Search. I recommend starting with the dataset in the Topics section and entering one of the data sets you plan to use (e.g., 2000 SF3, the 2000 long-form data). After you click on the dataset, you will see it added to the selection criteria in the top left.
Next, click Geographies, manually enter Census Tracts, Pennsylvania, Philadelphia, and the desired Census Tract numbers.
Next, start with the quick tables to begin your database on the neighborhood.
From this base, you can add more specific data that are relevant to your neighborhood analysis. I recommend using the search function above. Be sure to remove the Product Type: Quick Tables filter from the top-left selection before searching, so that you see a complete list. Do, however, keep the data type selected; otherwise you will be searching through an unmanageable number of tables.
Posted in Planning Methods
Planning Methods: Population trend model
Now that you have uploaded and examined the UN data on central African cities, we are going to do some ordinary least squares regressions (OLS) to predict Libreville’s population. For a review of OLS, please refer to class readings:
An Introduction to Statistical Learning, chapter 3
Video lectures available here (also, chapter 3)
Tufte’s Data Analysis for Politics and Policy ($2 for ebook)
Let’s start by looking at the population over time:
plot(pop$Year, pop$Libreville)
It looks like we have a fairly linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict population by year using linear OLS and the full set of data, we’ll get the line that minimizes the sum of squared distances to the points. In R, we can use the lm() function (short for linear model) to estimate the slope and intercept of the line.
lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
##
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
##
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.81 -24.46  -4.11  21.42  91.48 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.185e+04  1.314e+03  -16.62 3.84e-09 ***
## pop$Year     1.116e+01  6.637e-01   16.82 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.77 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9626, Adjusted R-squared:  0.9592
## F-statistic:   283 on 1 and 11 DF,  p-value: 3.389e-09
What do the estimates mean?
The predicted population (the line that minimizes the squared distance between the points) is equal to the intercept plus the year times the estimated coefficient for year.
So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16. We can also use the outputs from the regression to save the trouble of entering the numbers manually.
summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000
## [1] 481.7582
Or we could predict each year in the data:
summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year
##  [1] -76.48352 -20.65934  35.16484  90.98901 146.81319 202.63736 258.46154
##  [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330
Or look at the predictions next to the data (note how I use the predict function instead of the manual estimation from above).
cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop))#NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind
## [,1] [,2] [,3]
## 1 1950 15 -76.48352
## 2 1955 17 -20.65934
## 3 1960 29 35.16484
## 4 1965 49 90.98901
## 5 1970 77 146.81319
## 6 1975 155 202.63736
## 7 1980 234 258.46154
## 8 1985 293 314.28571
## 9 1990 366 370.10989
## 10 1995 439 425.93407
## 11 2000 496 481.75824
## 12 2005 559 537.58242
## 13 2010 631 593.40659
## 14 2015 NA 649.23077
## 15 2020 NA 705.05495
## 16 2025 NA 760.87912
## 17 2030 NA 816.70330
Is this a good predictor of population from 2015 to 2030?
The R2 from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To try to answer that question, let’s start by looking at the regression line.
plot(pop$Year, pop$Libreville)
abline(lbv)
The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1965 and 1990, and underpredicting again after 1990. We also predict negative population before 1960. This is quite apparent when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)
plot(predict(lbv), resid(lbv))
abline(h=0)
Do you expect a population prediction for 2020 based on the trend line from 1955 to 2010 to overestimate or underestimate population? Why?
If you guessed underpredict, then I agree. While we don’t necessarily know that the ongoing trend will continue, we’re making increasingly bad estimates from 1995 on. If you knew Libreville, you would also probably not guess that population growth is going to slow in the near future.
Another way that we might test our model is to see how well our general modeling framework predicts values outside of the sample used in the regression. For cross-sectional data, we might fit our model on a random sample of our data and then see how well it predicts the remaining data.
Since we are predicting growth trends, however, it makes more sense to intentionally remove the most recent data points and see how well a model using older data predicts them.
lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))
plot(pop$Year, pop$Libreville)
abline(lbv_back)
This does not look nearly so good. The R² of lbv_back suggests that we explain 90% of the variance in the data, but the prediction error for the held-out years is much worse. We can compare the mean squared error of the new data to the old data.
mean(resid(lbv_back)^2)
## [1] 1438.128
mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)
## [1] 9702.728
Our average squared prediction error for the 1995 to 2010 period is nearly seven times higher than the prediction error between 1950 and 1990. In fact, we would have better predictions of the population from 1995 to 2010 just by taking the average population between 1995 and 2010 and calling it a day.
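You can check that last claim directly. A minimal sketch, again rebuilding the series from the printed values above (so the data frame is a reconstruction, not the original file):

```r
# Reconstructed from the printed table above (thousands)
pop <- data.frame(Year = seq(1950, 2010, by = 5),
                  Libreville = c(15, 17, 29, 49, 77, 155, 234,
                                 293, 366, 439, 496, 559, 631))
lbv_back <- lm(Libreville ~ Year, data = subset(pop, Year < 1995))
hold <- subset(pop, Year > 1994)

# Mean squared error of the backcast trend model on the held-out years
mse_model <- mean((hold$Libreville - predict(lbv_back, newdata = hold))^2)
# Mean squared error of guessing the held-out average every time
mse_mean <- mean((hold$Libreville - mean(hold$Libreville))^2)

c(model = mse_model, mean_only = mse_mean)
mse_mean < mse_model  # TRUE: the simple average wins
```

The "model" that ignores the trend entirely beats the extrapolated trend line on 1995 to 2010, which is exactly the point: a high in-sample R² says little about out-of-sample forecasting skill.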
There are a few ways that we might try to get better predictions.
The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:
plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
summary(lbv)
##
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year >
## 1970))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6667 -3.5595 -0.9524 3.5060 8.8095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.636e+04 3.534e+02 -74.58 3.91e-10 ***
## Year 1.343e+01 1.774e-01 75.70 3.58e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.747 on 6 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.999, Adjusted R-squared: 0.9988
## F-statistic: 5731 on 1 and 6 DF, p-value: 3.576e-10
abline(lbv)
Though not necessarily well-suited to these data, adding a quadratic term can often improve model fit:
lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
##
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.701 -11.787 6.736 15.260 29.637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.190e+05 8.743e+04 4.792 0.000733 ***
## pop$Year -4.341e+02 8.832e+01 -4.915 0.000609 ***
## I(pop$Year^2) 1.124e-01 2.230e-02 5.042 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.95 on 10 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9873
## F-statistic: 468.3 on 2 and 10 DF, p-value: 1.315e-10
plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)
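One caveat: because lbv2 was specified with pop$Year inside the formula rather than with a data argument, predict(lbv2, newdata = ...) cannot be used to extend the curve past the observed years. Refitting with data = pop avoids this. A sketch, with the data frame reconstructed from the printed table above:

```r
# Reconstructed from the printed table above (thousands)
pop <- data.frame(Year = seq(1950, 2030, by = 5),
                  Libreville = c(15, 17, 29, 49, 77, 155, 234, 293, 366, 439,
                                 496, 559, 631, NA, NA, NA, NA))
# Same quadratic model, but written so that newdata works in predict()
lbv2 <- lm(Libreville ~ Year + I(Year^2), data = pop)
round(predict(lbv2, newdata = subset(pop, Year >= 2015)))
```

The quadratic extrapolation grows faster than the straight line, which fits the intuition that Libreville's growth has been accelerating, though extrapolating a polynomial far beyond the data is its own kind of risky.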
Or we could try predicting Libreville's population as a function of a larger geography, such as all Sub-Saharan African cities with populations over 300,000.
plot(pop$Sub_Saharan_300kplus, pop$Libreville)
summary(lm(Libreville~Sub_Saharan_300kplus,pop))
##
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.44 -27.04 23.48 41.54 51.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.016224 15.931117 -1.382 0.194
## Sub_Saharan_300kplus 0.004803 0.000242 19.850 5.79e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.14 on 11 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9728, Adjusted R-squared: 0.9704
## F-statistic: 394 on 1 and 11 DF, p-value: 5.795e-10
abline(lm(Libreville~Sub_Saharan_300kplus,pop))
Or we could take advantage of knowing the population in the year immediately before our prediction. Our previous models ignore the fact that the 2010 data point matters far more than any other year for predicting the population in the near future.
First, we need to create a new time-lagged population variable. Walk through the code slowly, running head(pop_n1) after each line, to see exactly what each step does.
pop_n1 <- pop[c("Libreville", "Year")]
pop_n1$Year <- pop_n1$Year + 5
pop_n1$Libreville_m5 <- pop_n1$Libreville
pop_n1 <- pop_n1[c(2:3)]
pop_n1 <- subset(pop_n1, pop_n1$Year < 2015)
pop <- merge(pop, pop_n1, by = "Year", all.x = T)
Now predict Libreville’s population using the population 5 years earlier and the year.
lbv3 <- lm(Libreville ~ Libreville_m5 + Year, pop)
summary(lbv3)
##
## Call:
## lm(formula = Libreville ~ Libreville_m5 + Year, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.877 -6.747 -1.215 8.818 15.653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.271e+03 1.961e+03 -4.726 0.00108 **
## Libreville_m5 6.705e-01 9.021e-02 7.432 3.96e-05 ***
## Year 4.740e+00 9.995e-01 4.742 0.00106 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.79 on 9 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9972, Adjusted R-squared: 0.9966
## F-statistic: 1601 on 2 and 9 DF, p-value: 3.271e-12
This model explains 99.7% of the variance in the data and indicates that the predicted population in 2015 equals -9,271 + 0.67 × the 2010 population + 4.74 × 2015 (the year). To estimate the population in 2020, you will need to use the 2015 population estimate as the lagged value.
plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014 & pop$Year > 1950], predict(lbv3), col=3)
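To carry lbv3 past 2015, each forecast has to be fed back in as the lag for the next five-year step. A sketch of that recursion, with the data frame reconstructed from the printed table above (the exact numbers will differ slightly from hand calculations with the rounded coefficients):

```r
# Reconstructed from the printed table above (thousands)
pop <- data.frame(Year = seq(1950, 2010, by = 5),
                  Libreville = c(15, 17, 29, 49, 77, 155, 234,
                                 293, 366, 439, 496, 559, 631))
# Build the five-year lag and refit the model, as above
pop_n1 <- data.frame(Year = pop$Year + 5, Libreville_m5 = pop$Libreville)
pop <- merge(pop, pop_n1, by = "Year", all.x = TRUE)
lbv3 <- lm(Libreville ~ Libreville_m5 + Year, data = pop)

# Roll the model forward: each estimate becomes the lag for the next step
last <- 631  # observed 2010 population (thousands)
for (yr in seq(2015, 2030, by = 5)) {
  last <- unname(predict(lbv3,
                         newdata = data.frame(Libreville_m5 = last, Year = yr)))
  cat(yr, round(last), "\n")
}
```

Note that the uncertainty compounds as you iterate: the 2030 figure is built on three earlier estimates, so treat the later steps of the recursion with increasing skepticism.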
EXERCISE: Download population data for New York City and read it into R.
1. Plot New York City's population from 1790 to 2010 and describe the major trends.
2. Based on these trends and your knowledge of NYC, describe what you think NYC's population will be in 2020 and why.
3. Use a linear regression to make a prediction of New York's 2020 population. Is the prediction any good? Why?
4. Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).
5. Try a quadratic regression.
6. Now try a cubic function (hint: add the term I(pop^3) to the quadratic regression).
7. Use earlier data (i.e., excluding the prediction years) to predict NYC's 2000 and 2010 population. Describe the predictions in relationship to the earlier estimates.
8. Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?
9. Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.
10. Create a time-lagged population variable and use it to predict New York's population. What is the expected population in 2020?
Posted in Planning Methods
Comments Off on Planning Methods: Population trend model
Moving to access report
Brookings Institution report with Gilles Duranton:
Developing a common narrative on urban accessibility: An urban planning perspective
Posted in Uncategorized
Beyond Mobility book cover
Sneak peek at the cover of my forthcoming Island Press book with Robert Cervero and Stefan Al:
Traditionally, transport planning has sought to enhance mobility. Shifting the focus from movement to accessing destinations leads to an entirely different framework for planning and designing cities and the infrastructure that serves them. This book illustrates principles and practices of linking sustainable transport programs with placemaking.
Posted in Uncategorized