## 2018 Transit Trends

http://www.trapezegroup.com/blog-entry/2018-transit-trends-from-around-the-globe-north-america

## Where do bike lanes work best? A Bayesian spatial model of bicycle lanes and bicycle crashes

This link should work for 50 days: https://authors.elsevier.com/c/1WBX23IVV9Z7m9

## Working with the Philadelphia Region Household Travel Survey

The purpose of this lab is to learn to work with the Philadelphia region 2012 household travel survey. You will learn how to load data into R, summarize information from the data, combine datasheets, and examine subsets of the data. By the end of the lab, you should feel comfortable working with the travel survey and be able to complete the relevant questions from assignment 1.

Many large cities collect important individual-level data to understand and model regional travel behavior and to predict the impacts of transportation investments and new commercial and housing developments. The United States also collects household travel data for the entire country. The data for the national surveys from 1983 to 2009 can be found here.

To work with the Philadelphia household travel survey, download the data from the Delaware Valley Regional Planning Commission website. You can find links to the 2002 data here as well. The 2012 files come as an Access database or as Excel files. For this exercise, you need to open the Excel files and save each sheet as a CSV: in Excel, click File/Save As and scroll down to CSV (comma delimited). Save or move the files to the folder where you want to work.

First, clear objects from memory with the rm() command, and set the working directory to the folder containing your files. Note that if you are using a PC, you need to change all backslashes in the path to forward slashes. For a brief tutorial, click here.
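A minimal sketch of that setup (the path in the comment is hypothetical; substitute your own folder):

```r
rm(list = ls())   # clear every object from the current workspace
# setwd("C:/Users/you/Documents/HTS")  # hypothetical path; note the forward slashes
getwd()           # confirm which folder R will read from and write to
```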

Next, load the data into your work session with the read.csv() command. This will likely work with no additional options, but to be safe, use the following: the strip.white parameter removes blank spaces in the table, the sep parameter sets the separator, blank.lines.skip skips blank lines, and na.strings specifies the symbol for missing observations. Any of these parameters can be changed. Depending on how your data were saved to CSV, you may need to change the separator.

Note that you may have to change the names of the sub-folder and files, depending on how you saved the CSVs.

If you are having trouble reading the data into R, there is a handy interface in RStudio. Click on Tools in the top right, then Import Dataset, then From a Text File.

hh <- read.csv("DVRPC HTS Database Files/1_Household_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

per <- read.csv("DVRPC HTS Database Files/2_Person_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

veh <- read.csv("DVRPC HTS Database Files/3_Vehicle_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

trip <- read.csv("DVRPC HTS Database Files/4_Trip_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")
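If you want to confirm that these read.csv() options behave as expected before pointing them at the full survey files, here is a self-contained toy round-trip (the file name and columns below are invented for illustration):

```r
# Write a tiny two-row table to a temporary folder, then read it back
# with the same options used for the DVRPC files above.
dir <- tempdir()
write.csv(data.frame(HH_ID = c(100140, 100206), TOT_VEH = c(2, 2)),
          file.path(dir, "toy_hh.csv"), row.names = FALSE)

toy <- read.csv(file.path(dir, "toy_hh.csv"), sep = ",", strip.white = TRUE,
                blank.lines.skip = TRUE, na.strings = "NA")
dim(toy)
## [1] 2 2
```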

Consult the documentation about the function parameters when you load the files.

?read.csv

You should now have four objects in the top right window of RStudio. You can also check that these loaded properly using the ls() command.

ls()
## [1] "hh"   "per"  "trip" "veh"

Again, type ? followed by a function name to read its documentation.

?ls

Finally, check the dimensions to make sure that you have the right number of observations and variables.

dim(hh)
## [1] 9235   38
dim(per)
## [1] 20216    72
dim(trip)
## [1] 81940    81
dim(veh)
## [1] 14903    13

The function str() reveals the structure of the data; head() shows the first six rows, and tail() shows the last six, which is particularly useful for making sure that your data loaded completely. The summary() function provides descriptive statistics such as the mean or range, depending on the data type (for example, factors and characters do not have mean values). The brackets used with the summary command, hh[,1:5], indicate that you only want to look at the first five variables. If you instead move the 1:5 before the comma, you will see the first five rows for all of the variables.

str(hh) 
## 'data.frame':    9235 obs. of  38 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ HH_ID         : int  100140 100206 100291 100355 100374 100449 100503 100601 100663 100677 ...
##  $ HH_WEIGHT     : num  133 204 288 222 365 ...
##  $ HH_GEO_ACC    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ H_STATE       : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ H_COUNTY      : int  34005 34005 34005 34005 34005 34005 34005 34005 34005 34005 ...
##  $ H_CPA         : Factor w/ 84 levels "005_701","005_702",..: 2 2 1 1 1 1 1 1 4 4 ...
##  $ H_MCD         : num  3.4e+09 3.4e+09 3.4e+09 3.4e+09 3.4e+09 ...
##  $ H_TRACT       : num  3.4e+10 3.4e+10 3.4e+10 3.4e+10 3.4e+10 ...
##  $ H_TAZ         : int  20225 20249 20048 20042 20002 20020 20026 20011 20637 20629 ...
##  $ A_TYPE        : int  4 4 4 4 4 4 4 5 5 5 ...
##  $ HH_TOT_TRIPS  : int  8 11 9 5 24 9 9 18 4 5 ...
##  $ HH_MO_TRIPS   : int  8 5 9 4 24 9 7 16 4 5 ...
##  $ HH_NM_TRIPS   : int  0 6 0 1 0 0 2 2 0 0 ...
##  $ GPS_Factor    : num  1.12 1.17 1.12 1.17 1.17 1.17 1.12 1.12 1.13 1.17 ...
##  $ F_HH_TOT_TRIPS: num  8.96 12.87 10.08 5.85 28.08 ...
##  $ F_HH_MO_TRIPS : num  8.96 5.85 10.08 4.68 28.08 ...
##  $ F_HH_NM_TRIPS : num  0 7.02 0 1.17 0 0 2.24 2.24 0 0 ...
##  $ HH_SIZE       : int  2 3 5 1 4 2 2 4 1 2 ...
##  $ HH_WORK       : int  2 2 2 1 1 0 1 2 0 2 ...
##  $ TOT_VEH       : int  2 2 2 1 2 2 2 2 1 1 ...
##  $ OP_VEH        : int  2 2 2 1 2 2 2 2 1 1 ...
##  $ VEH_OUT       : int  20 9000 11 9000 3 20 50 15 1 30 ...
##  $ TOT_BIKE      : int  1 3 7 0 4 0 2 4 0 0 ...
##  $ TOLL_ACCNT    : int  1 1 1 2 1 1 1 1 2 2 ...
##  $ CAR_SHARE     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ RES_TYPE      : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ INCOME        : int  6 5 7 4 5 5 6 7 99 4 ...
##  $ TRAV_DATE     : Factor w/ 271 levels "1/1/2013","1/10/2013",..: 236 230 215 243 230 213 243 213 215 213 ...
##  $ TRAV_DOW      : int  5 4 2 3 4 1 3 1 2 1 ...
##  $ HOLIDAY       : Factor w/ 21 levels "","Birthday of Martin Luther King, Jr.",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HOL_TYPE      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OV_SAMPLE     : int  1 0 0 0 1 0 1 0 0 0 ...
##  $ STUDY         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RECRUIT       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RETRIEVE      : int  3 3 3 2 3 3 2 3 3 2 ...
##  $ R_LANG        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SURV_LANG     : int  1 1 1 1 1 1 1 1 1 1 ...
head(hh)
##   ID  HH_ID HH_WEIGHT HH_GEO_ACC H_STATE H_COUNTY   H_CPA      H_MCD
## 1  1 100140 133.46032          1      34    34005 005_702 3400543740
## 2  2 100206 204.10617          1      34    34005 005_702 3400547880
## 3  3 100291 287.91055          1      34    34005 005_701 3400517440
## 4  4 100355 222.44414          1      34    34005 005_701 3400517080
## 5  5 100374 365.14646          1      34    34005 005_701 3400505740
## 6  6 100449  74.05932          1      34    34005 005_701 3400508950
##       H_TRACT H_TAZ A_TYPE HH_TOT_TRIPS HH_MO_TRIPS HH_NM_TRIPS GPS_Factor
## 1 34005700405 20225      4            8           8           0       1.12
## 2 34005700501 20249      4           11           5           6       1.17
## 3 34005700603 20048      4            9           9           0       1.12
## 4 34005700800 20042      4            5           4           1       1.17
## 5 34005700900 20002      4           24          24           0       1.17
## 6 34005701104 20020      4            9           9           0       1.17
##   F_HH_TOT_TRIPS F_HH_MO_TRIPS F_HH_NM_TRIPS HH_SIZE HH_WORK TOT_VEH
## 1           8.96          8.96          0.00       2       2       2
## 2          12.87          5.85          7.02       3       2       2
## 3          10.08         10.08          0.00       5       2       2
## 4           5.85          4.68          1.17       1       1       1
## 5          28.08         28.08          0.00       4       1       2
## 6          10.53         10.53          0.00       2       0       2
##   OP_VEH VEH_OUT TOT_BIKE TOLL_ACCNT CAR_SHARE RES_TYPE INCOME TRAV_DATE
## 1      2      20        1          1         2        1      6  8/3/2012
## 2      2    9000        3          1         2        1      5  8/2/2012
## 3      2      11        7          1         2        1      7 7/31/2012
## 4      1    9000        0          2         2        1      4  8/8/2012
## 5      2       3        4          1         2        1      5  8/2/2012
## 6      2      20        0          1         2        1      5 7/30/2012
##   TRAV_DOW HOLIDAY HOL_TYPE OV_SAMPLE STUDY RECRUIT RETRIEVE R_LANG
## 1        5                0         1     1       1        3      1
## 2        4                0         0     1       1        3      1
## 3        2                0         0     1       1        3      1
## 4        3                0         0     1       1        2      1
## 5        4                0         1     1       1        3      1
## 6        1                0         0     1       1        3      1
##   SURV_LANG
## 1         1
## 2         1
## 3         1
## 4         1
## 5         1
## 6         1
summary(hh[,1:5])
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1
##  1st Qu.:2361   1st Qu.: 397484   1st Qu.: 109.56   1st Qu.:1
##  Median :4746   Median : 510518   Median : 165.11   Median :1
##  Mean   :4781   Mean   : 629041   Mean   : 227.09   Mean   :1
##  3rd Qu.:7198   3rd Qu.: 703038   3rd Qu.: 279.67   3rd Qu.:1
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1
##     H_STATE
##  Min.   :34.00
##  1st Qu.:42.00
##  Median :42.00
##  Mean   :40.04
##  3rd Qu.:42.00
##  Max.   :42.00

For this purpose, the functions names() and colnames() are almost identical – both return the names of all columns (variables) in the data set. rownames() returns row names; below, it lists the names of rows 100 to 105. Be sure to check the data dictionary provided with the data to understand each variable.

names(hh)
##  [1] "ID"             "HH_ID"          "HH_WEIGHT"      "HH_GEO_ACC"
##  [5] "H_STATE"        "H_COUNTY"       "H_CPA"          "H_MCD"
##  [9] "H_TRACT"        "H_TAZ"          "A_TYPE"         "HH_TOT_TRIPS"
## [13] "HH_MO_TRIPS"    "HH_NM_TRIPS"    "GPS_Factor"     "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS"  "F_HH_NM_TRIPS"  "HH_SIZE"        "HH_WORK"
## [21] "TOT_VEH"        "OP_VEH"         "VEH_OUT"        "TOT_BIKE"
## [25] "TOLL_ACCNT"     "CAR_SHARE"      "RES_TYPE"       "INCOME"
## [29] "TRAV_DATE"      "TRAV_DOW"       "HOLIDAY"        "HOL_TYPE"
## [33] "OV_SAMPLE"      "STUDY"          "RECRUIT"        "RETRIEVE"
## [37] "R_LANG"         "SURV_LANG"
colnames(hh)
##  [1] "ID"             "HH_ID"          "HH_WEIGHT"      "HH_GEO_ACC"
##  [5] "H_STATE"        "H_COUNTY"       "H_CPA"          "H_MCD"
##  [9] "H_TRACT"        "H_TAZ"          "A_TYPE"         "HH_TOT_TRIPS"
## [13] "HH_MO_TRIPS"    "HH_NM_TRIPS"    "GPS_Factor"     "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS"  "F_HH_NM_TRIPS"  "HH_SIZE"        "HH_WORK"
## [21] "TOT_VEH"        "OP_VEH"         "VEH_OUT"        "TOT_BIKE"
## [25] "TOLL_ACCNT"     "CAR_SHARE"      "RES_TYPE"       "INCOME"
## [29] "TRAV_DATE"      "TRAV_DOW"       "HOLIDAY"        "HOL_TYPE"
## [33] "OV_SAMPLE"      "STUDY"          "RECRUIT"        "RETRIEVE"
## [37] "R_LANG"         "SURV_LANG"
rownames(hh[100:105,])
## [1] "100" "101" "102" "103" "104" "105"

The functions View() and fix() provide a useful tabular view of the data. View() opens a separate tab in RStudio; remember to capitalize the first letter, otherwise it will not run. fix() (not capitalized) opens a new window and allows editing, much like Excel. To reuse the edits, the R object has to be saved to a new file.

View(hh)
fix(hh)

The function table() creates useful summary tables and multidimensional cross-tables. In this case, passing the TOT_VEH variable to the function creates a one-dimensional table of the number of households by number of vehicles. Dividing it by the length of the variable produces proportions.

table(hh$TOT_VEH)
##
##    0    1    2    3    4    5    6    7    8   10   15
##  833 3109 3726 1120  318   93   22    8    4    1    1
table(hh$TOT_VEH)/length(hh$TOT_VEH)
##
##            0            1            2            3            4
## 0.0902003249 0.3366540336 0.4034650785 0.1212777477 0.0344342177
##            5            6            7            8           10
## 0.0100703844 0.0023822415 0.0008662696 0.0004331348 0.0001082837
##           15
## 0.0001082837

Using the round() function, we round the values and multiply them by 100 to make them easier to read. 40.3% of the households have two cars.

round(table(hh$TOT_VEH)/length(hh$TOT_VEH)*100, digits=3)
##
##      0      1      2      3      4      5      6      7      8     10
##  9.020 33.665 40.347 12.128  3.443  1.007  0.238  0.087  0.043  0.011
##     15
##  0.011

Passing two arguments creates a two-dimensional table of the number of households by the number of cars and the household income. Note that the brackets remove observations where the income is unknown (98) or refused (99); the filter keeps values below 97. You will also need to consult the data dictionary to understand the income levels represented by the numbers 1 through 10.

table(hh$TOT_VEH[hh$INCOME < 97], hh$INCOME[hh$INCOME < 97])
##
##         1    2    3    4    5    6    7    8    9   10
##   0   179  240   93   90   95   33   30   17   15    0
##   1   114  504  405  510  587  335  236   84   82   27
##   2    19   90  140  333  646  685  812  343  210  121
##   3     3   14   25   57  147  187  295  152   78   64
##   4     2    4    6   15   47   53   70   47   23   19
##   5     1    1    2    3   10   10   23   16    7    6
##   6     0    0    0    0    3    5    5    5    1    3
##   7     0    0    0    0    1    4    1    0    1    0
##   8     0    0    0    0    2    1    0    0    0    1
##   10    0    0    0    0    0    0    0    1    0    0
##   15    0    0    0    0    1    0    0    0    0    0

The functions hist() and plot() are used for plotting and let us see how the data are distributed. hist() creates a histogram; embedding the function density() in plot() plots the density of the observed distribution. In this case the density plot is not particularly useful, but it is frequently quite helpful with larger, more continuous data.

hist(hh$TOT_VEH)

plot(density(hh$TOT_VEH))

More on the functions table() and hist():

?table
?hist

You will see that many of the observations have a value of 98 or 99. These are the codes used for households that did not know or refused to report their income. Using square brackets or the function subset() allows us to select only valid observations. In general, use brackets for a single variable and subset() for multiple variables (including the entire data frame).

table(hh$INCOME)
##
##    1    2    3    4    5    6    7    8    9   10   98   99
##  318  853  671 1008 1539 1313 1472  665  417  241   13  725
hist(hh$INCOME[hh$INCOME < 98])

table(hh$TOT_VEH[hh$INCOME < 98])
##
##    0    1    2    3    4    5    6    7    8   10   15
##  792 2884 3399 1022  286   79   22    7    4    1    1
summary(subset(hh[,1:5], hh$INCOME < 98))
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1
##  1st Qu.:2366   1st Qu.: 397693   1st Qu.: 113.42   1st Qu.:1
##  Median :4783   Median : 512032   Median : 169.94   Median :1
##  Mean   :4815   Mean   : 636321   Mean   : 233.20   Mean   :1
##  3rd Qu.:7288   3rd Qu.: 708975   3rd Qu.: 291.01   3rd Qu.:1
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1
##     H_STATE
##  Min.   :34.00
##  1st Qu.:42.00
##  Median :42.00
##  Mean   :40.03
##  3rd Qu.:42.00
##  Max.   :42.00

More on the function subset():

?subset

You can also recode the missing observations so that they display nicely in the histogram.

hh$inc <- hh$INCOME          #copy the variable
hh$inc[hh$inc==98] <- 11     #recode unknown from 98 to 11
hh$inc[hh$inc==99] <- 12     #recode refused from 99 to 12
hist(hh$inc)                 #looks much better now

There are also three other data frames from DVRPC, and you will often want to combine their information. To combine the data, you first need to understand the different data values. Notice that the person-level file has a column called HH_ID. This is the variable to use to join the person data to the household data.

names(per)
##  [1] "ID"            "HH_ID"         "PERSON_NUM"    "PERSON_ID"
##  [5] "HH_WEIGHT"     "P_WEIGHT"      "P_TOT_TRIPS"   "P_MO_TRIPS"
##  [9] "P_NM_TRIPS"    "GPS_Factor"    "F_P_TOT_TRIPS" "F_P_MO_TRIPS"
## [13] "F_P_NM_TRIPS"  "GEND"          "AGECAT"        "RACE"
## [17] "EDUCA"         "RELAT"         "LIC"           "EMPLY"
## [21] "WK_STAT"       "SEMPLY"        "JOBS"          "WK_LOC"
## [25] "W_GEO_ACC"     "W_STATE"       "W_REGION"      "W_COUNTY"
## [29] "W_CPA"         "W_MCD"         "W_TRACT"       "W_TAZ"
## [33] "HOURS"         "OCCUP"         "WK_MODE"       "WK_MODETMS"
## [37] "ARRV_WRK"      "ARRV_TMS"      "LV_WRK"        "LV_TMS"
## [41] "TCOMM"         "PARK_SUB"      "PARK_SUB_AMT"  "TRAN_SUB"
## [45] "TRAN_SUB_AMT"  "WDAY_1"        "WDAY_2"        "WDAY_3"
## [49] "WDAY_4"        "WDAY_5"        "WDAY_6"        "WDAY_7"
## [53] "STUDE"         "SCHOL"         "S_HOME"        "S_ONLN"
## [57] "S_GEO_ACC"     "S_STATE"       "S_REGION"      "S_COUNTY"
## [61] "S_CPA"         "S_MCD"         "S_TRACT"       "S_TAZ"
## [65] "PRESCH_1"      "PRESCH_2"      "PRESCH_3"      "T_TRIP"
## [69] "PAY_TYPE1"     "PAY_TYPE2"     "PAY_TYPE3"     "TAP"

Use the length() and unique() commands to check how well the data will combine. The function length() returns the length of any vector; unique() creates a vector of the unique values within another vector or variable. Note that the number of unique household ids in the person data should equal the number of observations in the household data. By contrast, the number of unique person ids should equal the number of observations in the person data.

length(unique(per$HH_ID))
## [1] 9235
length(unique(per$PERSON_ID))
## [1] 20216

You should also notice that the person id is the household id with the person number appended to it.

head(per[1:4])
##    ID  HH_ID PERSON_NUM PERSON_ID
## 1 279 100140          1  10014001
## 2 280 100140          2  10014002
## 3 281 100206          1  10020601
## 4 282 100206          2  10020602
## 5 283 100206          3  10020603
## 6 284 100291          1  10029101
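A quick way to verify that pattern, shown here on the six rows printed above (on the full data, you would run the same check on per directly):

```r
# The six rows of per printed above, re-entered as a toy data frame
toy <- data.frame(HH_ID      = c(100140, 100140, 100206, 100206, 100206, 100291),
                  PERSON_NUM = c(1, 2, 1, 2, 3, 1),
                  PERSON_ID  = c(10014001, 10014002, 10020601, 10020602,
                                 10020603, 10029101))
# PERSON_ID should equal HH_ID shifted two digits left plus the person number
all(toy$PERSON_ID == toy$HH_ID * 100 + toy$PERSON_NUM)
## [1] TRUE
```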

Now that you have found the correct identification numbers, you can merge the files using the merge() command. Add all of the household data to the person file and create a new data frame. In this case, the merge is perfect, but if some people were missing household data, you could keep all person entries by setting all.x = TRUE instead of FALSE.

per_hh <- merge(per, hh, by = "HH_ID", all.x = FALSE, all.y=FALSE, sort = FALSE)

Always be careful to check your data before and after a merge to make sure that you are getting what you want. In this case, we want the merged data frame to have the same number of observations as the person data.
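The same checks on a pair of toy tables (the names here are illustrative, not from the survey): after merging on the shared key, the result should have one row per person and the columns of both tables, with the key counted only once.

```r
hh_toy  <- data.frame(HH_ID = c(1, 2), TOT_VEH = c(2, 0))
per_toy <- data.frame(HH_ID = c(1, 1, 2), PERSON_NUM = c(1, 2, 1))
m <- merge(per_toy, hh_toy, by = "HH_ID", all.x = FALSE, all.y = FALSE)

nrow(m) == nrow(per_toy)                       # every person matched a household
## [1] TRUE
ncol(m) == ncol(per_toy) + ncol(hh_toy) - 1    # key column appears only once
## [1] TRUE
```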

The next data set contains data about the individual trips that people made during one day. Note that you can make the head() command show 10 entries instead of the default of 6. The trips data frame has ids to match both the household and person data.

head(trip[,1:12], 10)

Try making a subset of all trips made by members of one household using the subset() function. The output of the function is a data frame, which you can check using the class() function. Some functions return objects in matrix form, which makes the data processing slightly different.

test <- subset(trip, trip$HH_ID == 100374)
class(test)
## [1] "data.frame"

Now look at two variables of this subset. The variable MODE contains information about the mode used for each trip. MODE_AGG combines some of the modes into larger categories. Again, refer to the data dictionary.

test$MODE
##  [1]  5  5  5  5 NA  5  5  5  5  5  5  5  5  5  5 NA  5  5  5  5 NA  5  5
## [24]  5  5  5  5 NA
test$MODE_AGG
##  [1]  3  3  3  3 NA  3  3  3  3  3  3  3  3  3  3 NA  3  3  3  3 NA  3  3
## [24]  3  3  3  3 NA

You can find out how many trips have no information about mode using the is.na() function. It produces a logical vector of the same length as trip$MODE that contains only TRUE and FALSE values. The summary() function counts how many of each value the logical vector contains. The number of TRUEs is about 20,000; therefore roughly 20,000 trips have no information about mode.

summary(is.na(trip$MODE))
##    Mode   FALSE    TRUE    NA's
## logical   61739   20201       0

Then you can select only the observations that have information about mode.

head(subset(trip, is.na(trip$MODE)==FALSE))[,5:10]
##   HH_WEIGHT P_WEIGHT TOUR_NUM WKSUB_NUM   WKSUB_ID TRIP_NUM
## 1  133.4603 132.5431        1        NA 1.0014e+11        1
## 2  133.4603 132.5431        1        NA 1.0014e+11        2
## 3  133.4603 132.5431        1        NA 1.0014e+11        3
## 4  133.4603 132.5431        1        NA 1.0014e+11        4
## 5  133.4603 132.5431        1        NA 1.0014e+11        5
## 7  133.4603 115.1988        1        NA 1.0014e+11        1

You may also wish to get a better idea of what is happening with the trips that have no reported mode. Notice that for these records the trip number is equal to 0 (no trips) or 97 (final destination).

table(trip$TRIP_NUM[is.na(trip$MODE)==TRUE])
##
##     0    97
##  3611 16590

Finally, look at the vehicle data-set.

head(veh)
##   ID  HH_ID HH_WEIGHT TOT_VEH OP_VEH VEH_NUM   VEH_ID VEH_OWN YEAR MAKE
## 1 79 100140  133.4603       2      2       1 10014001       1 2009   15
## 2 80 100140  133.4603       2      2       2 10014002       1 2010    8
## 3 81 100206  204.1062       2      2       1 10020601       1 1999   14
## 4 82 100206  204.1062       2      2       2 10020602       1 2009   37
## 5 83 100291  287.9105       2      2       1 10029101       2 2010   39
## 6 84 100291  287.9105       2      2       2 10029102       2 2011   39
##           MODEL BODY VEH_TYPE
## 1        SONATA    1        2
## 2 GRAND CARAVAN    8        2
## 3           CRV    2        2
## 4     FORRESTER    2        2
## 5         PRIUS    1        1
## 6       SEIANNA    8        2

Notice that there are more vehicles than households, and that there are fewer unique household ids in the vehicle data than there are households. This is because not all households own cars, while some own more than one.

dim(veh)
## [1] 14903    13
length(unique(veh$HH_ID))
## [1] 8307

Now that you have looked at all the data, try to answer these two questions: How many trips do people take on average? And how does this vary by income? The first should be relatively straightforward, since the person data already contain a variable with each person's total trips.

table(per$P_TOT_TRIPS)
##
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
## 3625  629 6025 2467 2880 1544 1251  698  478  292  182   76   33   21    6
##   15   16   19   20
##    4    2    1    2
mean(per$P_TOT_TRIPS)
## [1] 3.053225

This, however, is not entirely correct. The data come from a sample of households that varies systematically from the households of the Philadelphia region. This is accounted for by weights. Look at the distribution of the weights and then use the weighted.mean() function to apply them.

summary(per$P_WEIGHT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   73.82   88.67  117.50  278.30  233.10 3665.00
weighted.mean(per$P_TOT_TRIPS, per$P_WEIGHT)
## [1] 2.902891
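To see what the weights do, here is a toy example (the numbers are invented): three people reporting 2, 4, and 6 trips, where the third person represents twice as many residents as each of the others.

```r
trips   <- c(2, 4, 6)
weights <- c(1, 1, 2)
mean(trips)                    # unweighted: (2 + 4 + 6) / 3
## [1] 4
weighted.mean(trips, weights)  # weighted: (2*1 + 4*1 + 6*2) / 4
## [1] 4.5
```

Because the heavier-weighted person takes more trips, the weighted mean is pulled above the unweighted one; in the survey the effect runs the other way, since under-sampled groups take fewer trips.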

It is often interesting to look at who gets under-sampled in these types of surveys. First, plot the density of person weights. Notice how there is a long tail of people with very high weights.

plot(density(per$P_WEIGHT))
plot(density(per$P_WEIGHT[per$P_WEIGHT<500]))

Now compare the racial differences across the high and low weights. Notice how the under-sampled population (those given a high weight) is less white than the over-sampled population. In general, these types of surveys tend to over-sample white households, households that own their own homes, large households, and households in single-family homes. Without applying the weights, the summaries are more likely to reflect the sample than the population of the Philadelphia region.

round(table(per$RACE[per$P_WEIGHT<=278.3])/length(per$RACE[per$P_WEIGHT<=278.3]),3)
##
##     1     2     3     4     5     6    97    98    99   100   988
## 0.872 0.052 0.008 0.001 0.020 0.000 0.002 0.001 0.011 0.008 0.024
round(table(per$RACE[per$P_WEIGHT>278.3])/length(per$RACE[per$P_WEIGHT>278.3]),3)
##
##     1     2     3     4     5    97    98    99   100   988
## 0.712 0.157 0.039 0.003 0.027 0.004 0.001 0.011 0.018 0.027

Note that each weight is equal to the number of people or households that the observation is supposed to represent, so you can sum the weights for an estimate of the number of people, households, or vehicles in the region.

sum(per$P_WEIGHT)
## [1] 5627010
sum(hh$HH_WEIGHT)
## [1] 2097203
sum(hh$HH_WEIGHT*hh$TOT_VEH)
## [1] 3330849

Now look at how the number of trips varies by income. This is a little trickier. First, you have to merge the person and household data (which we already did). Second, you need to make the estimate for each income group. There are a few ways to automate this (if you are interested, look at the apply functions and the doBy package), but for now do it manually for the highest and lowest income groups.

weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==1], per_hh$P_WEIGHT[per_hh$INCOME==1])
## [1] 2.489796
weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==10], per_hh$P_WEIGHT[per_hh$INCOME==10])
## [1] 3.339281

Finally, you may want to create and save a new data frame that contains only what you want. First create a new data frame with the variables you want to keep.

dat <- per_hh[c("HH_ID", "PERSON_ID", "P_TOT_TRIPS", "GEND", "AGECAT", "RACE", "INCOME", "TOT_VEH")]

You can save it as a CSV or an R data file. A CSV is particularly useful if you have created a summary file that you want to open in Excel to make a nicely formatted table, or if you want to send your file to someone who does not use R. The first argument specifies the data frame to be saved; the file argument specifies the file name. By default, R will write the files to your current working directory.

write.csv(dat, file="per_hh.csv")
save(dat, file="per_hh.Rda")

EXERCISE

1. Describe in detail the daily activities and trips of the first person in the trips data.
2. What is the mode share for the Philadelphia region? (Hint: Start with MODE_AGG instead of MODE.)
3. For Philadelphia?
4. Give the weighted value too. (Hint: use the trip files with the person weights.)
5. Compare the regional mode share for work and non-work trips. (For convenience, use the TOUR_TYPE variable instead of the activities.)
6. Compare the mode share for households with income under $35,000 with those at or above $35,000.
7. Plot the number of trips against income and describe any patterns that you see.
8. What percent of driving trips had to pay for parking?
9. What was the average hourly parking rate? (Hint: not all prices are presented in hours, so convert these.)

## Transit user perceptions of driverless buses

PhD student Xiaoxia (Summer) Dong presents our paper with Matt DiScenna on transit user perceptions of driverless buses at the 2017 PA Automated Vehicle Summit in State College, PA. The paper reports the results of a stated preference survey of regular transit users' willingness to ride and concerns about driverless buses in the Philadelphia region.

As automated technologies advance, driverless buses may offer significant efficiency, safety, and operational improvements over traditional bus services. However, unfamiliarity with automated vehicle technology may challenge its acceptance among the general public and slow the adoption of new technologies. Using a mixed logit modeling framework, this research examines which types of transit users are most willing to ride in driverless buses and whether having a transit employee on board to monitor the vehicle operations and/or provide customer service matters. Of the 891 surveyed members of the University of Pennsylvania's transit pass benefit program, two-thirds express a willingness to ride in a driverless bus when a transit employee is on board to monitor vehicle operations and provide customer service. By contrast, only 13% would agree to ride a bus without an employee on board. Males and those in younger age groups (18–34) are more willing to ride in driverless buses than females and those in older age groups. Findings suggest that, so long as a transit employee is on board, many transit passengers will willingly board early generation automated buses.
An abrupt shift to buses without employees on board, by contrast, will likely alienate many transit users.

## Getting started with American Factfinder

This lab provides some tips for using American Factfinder to complete a neighborhood analysis over time in Philadelphia.

First, you will need to identify the location of your neighborhood and identify the appropriate Census Tracts in 2000 and 2010. Note that these numbers can change over time, as can the basic geographies. It is extremely important that the boundaries used in 2000 match the boundaries used in 2010; otherwise, you are not comparing the same neighborhoods. This website is a helpful starting point, but there are no official neighborhood designations in Philadelphia.

To find your Census boundaries, refer to these or other resources:

- Tiger Web Apps
- 2000 Census Tract Maps (note that you can change the location of the map based on the number at the end of the URL and the map key in the bottom right)
- 2010 Maps (see note above about moving the map)

Once you have written down the appropriate Census Tract numbers, get started with American Factfinder. In the left drag-down menu, start with Advanced Search. I recommend starting with the dataset in the Topics section and entering one of the data sets you plan to use (e.g., 2000 SF3, the 2000 long form data). After you click on the dataset, you will see it added to the selection criteria in the top left. Next, click Geographies and manually enter Census Tracts, Pennsylvania, Philadelphia, and the desired Census Tract numbers.

Next, start with the quick tables to begin your database on the neighborhood. From this base, you can add more specific data that are relevant to your neighborhood analysis. I recommend using the search function above. Be sure to remove the Product Type: Quick Tables from the top left selection before searching, so that you see a complete list. Do, however, keep the data type selected.
Otherwise, you will be searching through an unmanageable number of products.

## Planning Methods: Population trend model

Now that you have uploaded and examined the UN data on central African cities, we are going to use ordinary least squares (OLS) regression to predict Libreville's population. For a review of OLS, please refer to the class readings:

- An Introduction to Statistical Learning, chapter 3 (video lectures available here, also chapter 3)
- Tufte's Data Analysis for Politics and Policy ($2 for the ebook)

Let’s start by looking at the population over time:

plot(pop$Year, pop$Libreville)

It looks like we have a fairly linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict the population by year using OLS and the full set of data, we get the line that minimizes the squared distance to all of the points. In R, we can use the lm() function (short for linear model) to estimate the slope and intercept of that line.

lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
##
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -69.81 -24.46  -4.11  21.42  91.48
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.185e+04  1.314e+03  -16.62 3.84e-09 ***
## pop$Year     1.116e+01  6.637e-01   16.82 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.77 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9626, Adjusted R-squared:  0.9592
## F-statistic:   283 on 1 and 11 DF,  p-value: 3.389e-09

What do the estimates mean? The predicted population (the line that minimizes the squared distance to the points) is equal to the intercept plus the year times the estimated coefficient for year. So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16. We can also use the outputs from the regression to save the trouble of entering the numbers by hand.

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000
## [1] 481.7582

Or we could predict each year in the data:

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year
##  [1] -76.48352 -20.65934  35.16484  90.98901 146.81319 202.63736 258.46154
##  [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330

Or look at the predictions next to the data (note how I use the predict function instead of the manual estimation above).

cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop))#NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind
##    [,1] [,2]      [,3]
## 1  1950   15 -76.48352
## 2  1955   17 -20.65934
## 3  1960   29  35.16484
## 4  1965   49  90.98901
## 5  1970   77 146.81319
## 6  1975  155 202.63736
## 7  1980  234 258.46154
## 8  1985  293 314.28571
## 9  1990  366 370.10989
## 10 1995  439 425.93407
## 11 2000  496 481.75824
## 12 2005  559 537.58242
## 13 2010  631 593.40659
## 14 2015   NA 649.23077
## 15 2020   NA 705.05495
## 16 2025   NA 760.87912
## 17 2030   NA 816.70330

Is this a good predictor of population from 2015 to 2030?

The R2 from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To answer that question, let's start by plotting the regression line:

plot(pop$Year, pop$Libreville)
abline(lbv)

The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1965 and 1990, and underpredicting again after 1990. We also predict negative populations before 1960. This is quite apparent when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)

plot(predict(lbv), resid(lbv))
abline(h=0)

Do you expect a population prediction for 2020 based on the trend line from 1950 to 2010 to overestimate or underestimate the population? Why?

If you guessed underpredict, then I agree. While we don't know for certain that the ongoing trend will continue, the model makes increasingly bad underestimates from 1995 on. If you know Libreville, you would also probably not guess that its population growth is going to slow in the near future.

Another way to test our model is to see how well our general modeling framework predicts values outside of the sample used in the regression. For cross-sectional data, we might fit the model on a random sample of observations and then see how well it predicts the held-out data.
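As a sketch of what that random-sample check could look like for these data (hypothetical code, not part of the lab):

```r
# Hypothetical illustration: fit the trend on a random half of the
# observed years and measure prediction error on the held-out half.
set.seed(1)                                   # make the random split repeatable
obs <- subset(pop, !is.na(pop$Libreville))    # keep only years with data
train_rows <- sample(nrow(obs), nrow(obs) %/% 2)
fit <- lm(Libreville ~ Year, obs[train_rows, ])
test <- obs[-train_rows, ]
mean((test$Libreville - predict(fit, newdata = test))^2)  # out-of-sample MSE
```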

Since we are predicting growth trends, however, it makes more sense to intentionally remove the most recent datapoints and see how well a model using older data predicts them.

lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))
plot(pop$Year, pop$Libreville)
abline(lbv_back)

This does not look nearly so good. The R2 of lbv_back suggests that we still explain 90% of the variance in the data, but the error for the held-out years is much worse. We can compare the mean squared error of the new data to the old data.

mean(resid(lbv_back)^2)
## [1] 1438.128
mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)
## [1] 9702.728

Our average squared prediction error for 1995 to 2010 is roughly seven times higher than the in-sample error for 1950 to 1990. In fact, we would have better predictions of the population from 1995 to 2010 just by taking the average population over those years and calling it a day.
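To see why, a quick check (hypothetical code, using the population figures printed above):

```r
# Naive baseline: predict every year from 1995-2010 with the period mean.
recent <- subset(pop, Year > 1994 & Year < 2015)
baseline <- mean(recent$Libreville)      # mean of 439, 496, 559, 631
mean((recent$Libreville - baseline)^2)   # about 5118, vs. roughly 9703 above
```

Even this no-model baseline beats the backward-fitted trend line for those years.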

There are a few ways that we might try to get better predictions.

The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:

plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
summary(lbv)
##
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year >
##     1970))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.6667 -3.5595 -0.9524  3.5060  8.8095
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.636e+04  3.534e+02  -74.58 3.91e-10 ***
## Year         1.343e+01  1.774e-01   75.70 3.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.747 on 6 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.9988
## F-statistic:  5731 on 1 and 6 DF,  p-value: 3.576e-10
abline(lbv)

Though not necessarily well-suited to these data, adding a quadratic term can often improve model fit:

lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
##
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -41.701 -11.787   6.736  15.260  29.637
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    4.190e+05  8.743e+04   4.792 0.000733 ***
## pop$Year      -4.341e+02  8.832e+01  -4.915 0.000609 ***
## I(pop$Year^2)  1.124e-01  2.230e-02   5.042 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.95 on 10 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9873
## F-statistic: 468.3 on 2 and 10 DF,  p-value: 1.315e-10
plot(pop$Year, pop$Libreville)

lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)

Or we could try predicting Libreville’s population as a function of a larger geography, such as all Sub-Saharan African cities over 300,000.

plot(pop$Sub_Saharan_300kplus, pop$Libreville)

summary(lm(Libreville~Sub_Saharan_300kplus,pop))
##
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -50.44 -27.04 -23.48  41.54  51.95
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          22.016224  15.931117   1.382    0.194
## Sub_Saharan_300kplus  0.004803   0.000242  19.850 5.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.14 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9704
## F-statistic:   394 on 1 and 11 DF,  p-value: 5.795e-10
abline(lm(Libreville~Sub_Saharan_300kplus,pop))

Or we could take advantage of knowing the population in the year immediately before our prediction. Our previous models treated every year equally, ignoring the fact that the 2010 observation is far more informative than any earlier year for predicting the population in the near future.

First, we need to create a new time-lagged population variable. Walk through the code slowly, running head(pop_n1) after each line, to see exactly what you are doing.

pop_n1 <- pop[c("Libreville", "Year")]
pop_n1$Year <- pop_n1$Year + 5
pop_n1$Libreville_m5 <- pop_n1$Libreville
pop_n1 <- pop_n1[c(2:3)]
pop_n1 <- subset(pop_n1, pop_n1$Year < 2015)
pop <- merge(pop, pop_n1, by = "Year", all.x = T)

Now predict Libreville's population using the population 5 years earlier and the year.

lbv3 <- lm(Libreville ~ Libreville_m5 + Year, pop)
summary(lbv3)
##
## Call:
## lm(formula = Libreville ~ Libreville_m5 + Year, data = pop)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -22.877  -6.747  -1.215   8.818  15.653
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -9.271e+03  1.961e+03  -4.726  0.00108 **
## Libreville_m5  6.705e-01  9.021e-02   7.432 3.96e-05 ***
## Year           4.740e+00  9.995e-01   4.742  0.00106 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.79 on 9 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9972, Adjusted R-squared:  0.9966
## F-statistic:  1601 on 2 and 9 DF,  p-value: 3.271e-12

This model explains 99.7% of the variance in the data and indicates that the predicted 2015 population equals the intercept plus 0.67 times the 2010 population plus 4.74 times the year (2015). To estimate the population in 2020, you will first need the 2015 population estimate.

plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014 & pop$Year > 1950], predict(lbv3), col=3)
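Since the model needs the population five years prior, forecasts beyond 2015 have to be built recursively: each estimate becomes the Libreville_m5 input for the next step. A sketch of that recursion (assuming lbv3 as fitted above):

```r
# Roll the lag model forward one step at a time.
b <- coef(lbv3)                           # intercept, Libreville_m5, Year
p2015 <- b[1] + b[2]*631 + b[3]*2015      # 631 is the observed 2010 population
p2020 <- b[1] + b[2]*p2015 + b[3]*2020    # feeds the 2015 *estimate* back in
c(p2015, p2020)
```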

1. Plot New York City’s population from 1790 to 2010 and describe the major trends.
2. Based on these trends and your knowledge of NYC, describe what you think NYC’s population will be in 2020 and why.
3. Use a linear regression to make a prediction of New York’s 2020 population. Is the prediction any good? Why?
4. Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).
5. Now try a quadratic function.
6. Now try a cubic function (hint: add the term I(pop^3) to the quadratic regression).
7. Use earlier data (i.e., excluding the prediction years) to predict NYC’s 2000 and 2010 population. Describe the predictions in relationship to the earlier estimates.
8. Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?
9. Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.
10. Create a time-lagged population variable and use it to predict New York’s population. What is the expected population in 2020?

## Moving to access report

Brookings Institution report with Gilles Duranton:

Developing a common narrative on urban accessibility: An urban planning perspective

## Beyond Mobility book cover

Sneak peek at the cover of my forthcoming Island Press book with Robert Cervero and Stefan Al:

Traditionally, transport planning has sought to enhance mobility. Shifting the focus from movement to accessing destinations leads to an entirely different framework for planning and designing cities and the infrastructure that serves them. This book illustrates principles and practices for linking sustainable transport programs with place-making.