Working with the Philadelphia Region Household Travel Survey

The purpose of this lab is to learn to work with the Philadelphia region 2012 household travel survey. You will learn how to load data into R, summarize information from the data, combine data files, and examine subsets of the data. By the end of the lab, you should feel comfortable looking at the travel survey and be able to complete the relevant questions from assignment 1.

Many large cities collect important individual-level data to understand and model regional travel behavior and predict the impacts of transportation investments and new commercial and housing developments. The United States also collects household travel data for the entire country. The data for surveys from 1983 to 2009 can be found here.

To work with the Philadelphia household travel survey, download the data from the Delaware Valley Regional Planning Commission website. You can find links to the 2002 data here as well. The 2012 files come as an Access database or as Excel files. For this exercise, you need to open the Excel files and save each sheet as a CSV: in Excel, click File > Save As and choose CSV (Comma delimited) from the file type list. Save or move the files to the folder where you want to work.

First, clear any existing objects from memory with the rm() command, and set the working directory to the folder containing your files with setwd(). Note that if you are using a PC, you need to change all backslashes in the path to forward slashes. For a brief tutorial, click here.
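For example, a minimal setup might look like the following; the folder path is only a placeholder, so substitute the location of your own files.

rm(list = ls())                           # remove all objects from the workspace
setwd("C:/Users/yourname/Documents/HTS")  # note the forward slashes, even on a PC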

Next, read the data into your work session with the read.csv() command. This will likely work with no additional options, but to be safe, use the following: the strip.white parameter removes stray blank spaces in the table, sep sets the field separator, blank.lines.skip skips blank lines, and na.strings specifies the symbol used for missing observations. Any of these parameters can be changed. Depending on how your data were saved to CSV, you may need to change the separator.

Note that you may have to change the names of the sub-folder and files, depending on how you saved the CSVs.

If you are having trouble reading the data into R, there is a handy interface in RStudio. Click on Tools in the top right, then Import Dataset, then From a Text File.

hh <- read.csv("DVRPC HTS Database Files/1_Household_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

per <- read.csv("DVRPC HTS Database Files/2_Person_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

veh <- read.csv("DVRPC HTS Database Files/3_Vehicle_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

trip <- read.csv("DVRPC HTS Database Files/4_Trip_Public.csv", sep=",", strip.white = TRUE, blank.lines.skip = T, na.strings = "NA")

Consult the documentation about the function parameters when you load the files.

?read.csv

You should now have four objects in the top-right window of RStudio. You can also check that they loaded properly using the ls() command.

ls()
## [1] "hh"   "per"  "trip" "veh"

Again, type ? followed by the function name to open its help page.

?ls

Finally, check the dimensions of each data frame to make sure that you have the right number of observations and variables.

dim(hh)
## [1] 9235   38
dim(per)
## [1] 20216    72
dim(trip)
## [1] 81940    81
dim(veh)
## [1] 14903    13

str() reveals the structure of the data, as does head(), which shows the first six rows. tail() shows the last six rows and is particularly useful for making sure that your data loaded completely (an example follows the summary output below). The summary() function provides descriptive statistics such as the mean or range, depending on the data type (for example, factors and characters do not have mean values). The brackets used with the summary command (hh[,1:5]) indicate that you only want to look at the first five variables. If you instead place the comma after 1:5 (hh[1:5,]), you will see a summary of the first five rows for all of the variables.

str(hh) 
## 'data.frame':    9235 obs. of  38 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ HH_ID         : int  100140 100206 100291 100355 100374 100449 100503 100601 100663 100677 ...
##  $ HH_WEIGHT     : num  133 204 288 222 365 ...
##  $ HH_GEO_ACC    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ H_STATE       : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ H_COUNTY      : int  34005 34005 34005 34005 34005 34005 34005 34005 34005 34005 ...
##  $ H_CPA         : Factor w/ 84 levels "005_701","005_702",..: 2 2 1 1 1 1 1 1 4 4 ...
##  $ H_MCD         : num  3.4e+09 3.4e+09 3.4e+09 3.4e+09 3.4e+09 ...
##  $ H_TRACT       : num  3.4e+10 3.4e+10 3.4e+10 3.4e+10 3.4e+10 ...
##  $ H_TAZ         : int  20225 20249 20048 20042 20002 20020 20026 20011 20637 20629 ...
##  $ A_TYPE        : int  4 4 4 4 4 4 4 5 5 5 ...
##  $ HH_TOT_TRIPS  : int  8 11 9 5 24 9 9 18 4 5 ...
##  $ HH_MO_TRIPS   : int  8 5 9 4 24 9 7 16 4 5 ...
##  $ HH_NM_TRIPS   : int  0 6 0 1 0 0 2 2 0 0 ...
##  $ GPS_Factor    : num  1.12 1.17 1.12 1.17 1.17 1.17 1.12 1.12 1.13 1.17 ...
##  $ F_HH_TOT_TRIPS: num  8.96 12.87 10.08 5.85 28.08 ...
##  $ F_HH_MO_TRIPS : num  8.96 5.85 10.08 4.68 28.08 ...
##  $ F_HH_NM_TRIPS : num  0 7.02 0 1.17 0 0 2.24 2.24 0 0 ...
##  $ HH_SIZE       : int  2 3 5 1 4 2 2 4 1 2 ...
##  $ HH_WORK       : int  2 2 2 1 1 0 1 2 0 2 ...
##  $ TOT_VEH       : int  2 2 2 1 2 2 2 2 1 1 ...
##  $ OP_VEH        : int  2 2 2 1 2 2 2 2 1 1 ...
##  $ VEH_OUT       : int  20 9000 11 9000 3 20 50 15 1 30 ...
##  $ TOT_BIKE      : int  1 3 7 0 4 0 2 4 0 0 ...
##  $ TOLL_ACCNT    : int  1 1 1 2 1 1 1 1 2 2 ...
##  $ CAR_SHARE     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ RES_TYPE      : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ INCOME        : int  6 5 7 4 5 5 6 7 99 4 ...
##  $ TRAV_DATE     : Factor w/ 271 levels "1/1/2013","1/10/2013",..: 236 230 215 243 230 213 243 213 215 213 ...
##  $ TRAV_DOW      : int  5 4 2 3 4 1 3 1 2 1 ...
##  $ HOLIDAY       : Factor w/ 21 levels "","Birthday of Martin Luther King, Jr.",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HOL_TYPE      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OV_SAMPLE     : int  1 0 0 0 1 0 1 0 0 0 ...
##  $ STUDY         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RECRUIT       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RETRIEVE      : int  3 3 3 2 3 3 2 3 3 2 ...
##  $ R_LANG        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SURV_LANG     : int  1 1 1 1 1 1 1 1 1 1 ...
head(hh)
##   ID  HH_ID HH_WEIGHT HH_GEO_ACC H_STATE H_COUNTY   H_CPA      H_MCD
## 1  1 100140 133.46032          1      34    34005 005_702 3400543740
## 2  2 100206 204.10617          1      34    34005 005_702 3400547880
## 3  3 100291 287.91055          1      34    34005 005_701 3400517440
## 4  4 100355 222.44414          1      34    34005 005_701 3400517080
## 5  5 100374 365.14646          1      34    34005 005_701 3400505740
## 6  6 100449  74.05932          1      34    34005 005_701 3400508950
##       H_TRACT H_TAZ A_TYPE HH_TOT_TRIPS HH_MO_TRIPS HH_NM_TRIPS GPS_Factor
## 1 34005700405 20225      4            8           8           0       1.12
## 2 34005700501 20249      4           11           5           6       1.17
## 3 34005700603 20048      4            9           9           0       1.12
## 4 34005700800 20042      4            5           4           1       1.17
## 5 34005700900 20002      4           24          24           0       1.17
## 6 34005701104 20020      4            9           9           0       1.17
##   F_HH_TOT_TRIPS F_HH_MO_TRIPS F_HH_NM_TRIPS HH_SIZE HH_WORK TOT_VEH
## 1           8.96          8.96          0.00       2       2       2
## 2          12.87          5.85          7.02       3       2       2
## 3          10.08         10.08          0.00       5       2       2
## 4           5.85          4.68          1.17       1       1       1
## 5          28.08         28.08          0.00       4       1       2
## 6          10.53         10.53          0.00       2       0       2
##   OP_VEH VEH_OUT TOT_BIKE TOLL_ACCNT CAR_SHARE RES_TYPE INCOME TRAV_DATE
## 1      2      20        1          1         2        1      6  8/3/2012
## 2      2    9000        3          1         2        1      5  8/2/2012
## 3      2      11        7          1         2        1      7 7/31/2012
## 4      1    9000        0          2         2        1      4  8/8/2012
## 5      2       3        4          1         2        1      5  8/2/2012
## 6      2      20        0          1         2        1      5 7/30/2012
##   TRAV_DOW HOLIDAY HOL_TYPE OV_SAMPLE STUDY RECRUIT RETRIEVE R_LANG
## 1        5                0         1     1       1        3      1
## 2        4                0         0     1       1        3      1
## 3        2                0         0     1       1        3      1
## 4        3                0         0     1       1        2      1
## 5        4                0         1     1       1        3      1
## 6        1                0         0     1       1        3      1
##   SURV_LANG
## 1         1
## 2         1
## 3         1
## 4         1
## 5         1
## 6         1
summary(hh[,1:5])
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1   
##  1st Qu.:2361   1st Qu.: 397484   1st Qu.: 109.56   1st Qu.:1   
##  Median :4746   Median : 510518   Median : 165.11   Median :1   
##  Mean   :4781   Mean   : 629041   Mean   : 227.09   Mean   :1   
##  3rd Qu.:7198   3rd Qu.: 703038   3rd Qu.: 279.67   3rd Qu.:1   
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1   
##     H_STATE     
##  Min.   :34.00  
##  1st Qu.:42.00  
##  Median :42.00  
##  Mean   :40.04  
##  3rd Qu.:42.00  
##  Max.   :42.00
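The tail() function mentioned above works the same way as head(); for example, a quick check of the last rows of the first five variables (output not shown):

tail(hh[,1:5])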

For this purpose the functions names() and colnames() are almost identical: both return the names of all columns (variables) in the data set. rownames() returns row names; here it is used to list the names of rows 100 to 105. Be sure to check the data dictionary provided with the data to understand each variable.

names(hh)
##  [1] "ID"             "HH_ID"          "HH_WEIGHT"      "HH_GEO_ACC"    
##  [5] "H_STATE"        "H_COUNTY"       "H_CPA"          "H_MCD"         
##  [9] "H_TRACT"        "H_TAZ"          "A_TYPE"         "HH_TOT_TRIPS"  
## [13] "HH_MO_TRIPS"    "HH_NM_TRIPS"    "GPS_Factor"     "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS"  "F_HH_NM_TRIPS"  "HH_SIZE"        "HH_WORK"       
## [21] "TOT_VEH"        "OP_VEH"         "VEH_OUT"        "TOT_BIKE"      
## [25] "TOLL_ACCNT"     "CAR_SHARE"      "RES_TYPE"       "INCOME"        
## [29] "TRAV_DATE"      "TRAV_DOW"       "HOLIDAY"        "HOL_TYPE"      
## [33] "OV_SAMPLE"      "STUDY"          "RECRUIT"        "RETRIEVE"      
## [37] "R_LANG"         "SURV_LANG"
colnames(hh)
##  [1] "ID"             "HH_ID"          "HH_WEIGHT"      "HH_GEO_ACC"    
##  [5] "H_STATE"        "H_COUNTY"       "H_CPA"          "H_MCD"         
##  [9] "H_TRACT"        "H_TAZ"          "A_TYPE"         "HH_TOT_TRIPS"  
## [13] "HH_MO_TRIPS"    "HH_NM_TRIPS"    "GPS_Factor"     "F_HH_TOT_TRIPS"
## [17] "F_HH_MO_TRIPS"  "F_HH_NM_TRIPS"  "HH_SIZE"        "HH_WORK"       
## [21] "TOT_VEH"        "OP_VEH"         "VEH_OUT"        "TOT_BIKE"      
## [25] "TOLL_ACCNT"     "CAR_SHARE"      "RES_TYPE"       "INCOME"        
## [29] "TRAV_DATE"      "TRAV_DOW"       "HOLIDAY"        "HOL_TYPE"      
## [33] "OV_SAMPLE"      "STUDY"          "RECRUIT"        "RETRIEVE"      
## [37] "R_LANG"         "SURV_LANG"
rownames(hh[100:105,])
## [1] "100" "101" "102" "103" "104" "105"

The functions View() and fix() provide a useful tabular view of the data. View() opens a separate tab in RStudio; remember to capitalize its first letter, otherwise it will not run. fix() (not capitalized) opens a new window and allows editing, much like Excel. To reuse the edits, the R object has to be saved to a new file.

View(hh)
fix(hh)

The table() function creates useful summary tables and multidimensional cross-tabulations. In this case, passing the TOT_VEH variable to the function creates a one-dimensional table of the number of households by the number of vehicles they own. Dividing it by the length of the variable produces the proportions.

table(hh$TOT_VEH)
## 
##    0    1    2    3    4    5    6    7    8   10   15 
##  833 3109 3726 1120  318   93   22    8    4    1    1
table(hh$TOT_VEH)/length(hh$TOT_VEH)
## 
##            0            1            2            3            4 
## 0.0902003249 0.3366540336 0.4034650785 0.1212777477 0.0344342177 
##            5            6            7            8           10 
## 0.0100703844 0.0023822415 0.0008662696 0.0004331348 0.0001082837 
##           15 
## 0.0001082837

Multiplying the proportions by 100 and rounding them with the round() function makes them easier to read: about 40.3% of the households have two cars.

round(table(hh$TOT_VEH)/length(hh$TOT_VEH)*100, digits=3)
## 
##      0      1      2      3      4      5      6      7      8     10 
##  9.020 33.665 40.347 12.128  3.443  1.007  0.238  0.087  0.043  0.011 
##     15 
##  0.011

Passing two arguments creates a two-dimensional table of the number of households by the number of cars and the household income. Note that the brackets remove households where the income is unknown (98) or refused (99). You will also need to consult the data dictionary to understand the income levels represented by the numbers 1 through 10.

table(hh$TOT_VEH[hh$INCOME < 97], hh$INCOME[hh$INCOME < 97])
##     
##        1   2   3   4   5   6   7   8   9  10
##   0  179 240  93  90  95  33  30  17  15   0
##   1  114 504 405 510 587 335 236  84  82  27
##   2   19  90 140 333 646 685 812 343 210 121
##   3    3  14  25  57 147 187 295 152  78  64
##   4    2   4   6  15  47  53  70  47  23  19
##   5    1   1   2   3  10  10  23  16   7   6
##   6    0   0   0   0   3   5   5   5   1   3
##   7    0   0   0   0   1   4   1   0   1   0
##   8    0   0   0   0   2   1   0   0   0   1
##   10   0   0   0   0   0   0   0   1   0   0
##   15   0   0   0   0   1   0   0   0   0   0

The hist() and plot() functions let us see how the data are distributed. hist() creates a histogram, while wrapping density() inside plot() plots the density of the observed distribution. In this case the density plot is not particularly useful, but it is often quite helpful with larger, more continuous data.

hist(hh$TOT_VEH)

plot(density(hh$TOT_VEH))

More on functions table() and hist()

?table
?hist

You will see that many of the observations have a value of 98 or 99. These are the codes used for households that did not know or refused to report their income. Using square brackets or the subset() function allows us to select only the valid observations. In general, use brackets for a single variable and subset() for multiple variables (including the entire data frame).

table(hh$INCOME)
## 
##    1    2    3    4    5    6    7    8    9   10   98   99 
##  318  853  671 1008 1539 1313 1472  665  417  241   13  725
hist(hh$INCOME[hh$INCOME < 98])

table(hh$TOT_VEH[hh$INCOME < 98])
## 
##    0    1    2    3    4    5    6    7    8   10   15 
##  792 2884 3399 1022  286   79   22    7    4    1    1
summary(subset(hh[,1:5], hh$INCOME < 98))
##        ID           HH_ID           HH_WEIGHT         HH_GEO_ACC
##  Min.   :   1   Min.   : 100140   Min.   :  40.32   Min.   :1   
##  1st Qu.:2366   1st Qu.: 397693   1st Qu.: 113.42   1st Qu.:1   
##  Median :4783   Median : 512032   Median : 169.94   Median :1   
##  Mean   :4815   Mean   : 636321   Mean   : 233.20   Mean   :1   
##  3rd Qu.:7288   3rd Qu.: 708975   3rd Qu.: 291.01   3rd Qu.:1   
##  Max.   :9628   Max.   :1736693   Max.   :1086.64   Max.   :1   
##     H_STATE     
##  Min.   :34.00  
##  1st Qu.:42.00  
##  Median :42.00  
##  Mean   :40.03  
##  3rd Qu.:42.00  
##  Max.   :42.00

More on function subset()

?subset

You can also recode the missing observations, so that they can be displayed nicely in the histogram.

hh$inc <- hh$INCOME          # copy the income variable
hh$inc[hh$inc == 98] <- 11   # recode missing (98) to 11
hh$inc[hh$inc == 99] <- 12   # recode refused (99) to 12
hist(hh$inc)                 # looks much better now

There are also three other data frames from DVRPC, and often you will want to combine information across them. In order to combine the data, you first need to understand the identification variables. Notice that the person-level file has a column called HH_ID. This is the variable to use to join the person data to the household data.

names(per)
##  [1] "ID"            "HH_ID"         "PERSON_NUM"    "PERSON_ID"    
##  [5] "HH_WEIGHT"     "P_WEIGHT"      "P_TOT_TRIPS"   "P_MO_TRIPS"   
##  [9] "P_NM_TRIPS"    "GPS_Factor"    "F_P_TOT_TRIPS" "F_P_MO_TRIPS" 
## [13] "F_P_NM_TRIPS"  "GEND"          "AGECAT"        "RACE"         
## [17] "EDUCA"         "RELAT"         "LIC"           "EMPLY"        
## [21] "WK_STAT"       "SEMPLY"        "JOBS"          "WK_LOC"       
## [25] "W_GEO_ACC"     "W_STATE"       "W_REGION"      "W_COUNTY"     
## [29] "W_CPA"         "W_MCD"         "W_TRACT"       "W_TAZ"        
## [33] "HOURS"         "OCCUP"         "WK_MODE"       "WK_MODETMS"   
## [37] "ARRV_WRK"      "ARRV_TMS"      "LV_WRK"        "LV_TMS"       
## [41] "TCOMM"         "PARK_SUB"      "PARK_SUB_AMT"  "TRAN_SUB"     
## [45] "TRAN_SUB_AMT"  "WDAY_1"        "WDAY_2"        "WDAY_3"       
## [49] "WDAY_4"        "WDAY_5"        "WDAY_6"        "WDAY_7"       
## [53] "STUDE"         "SCHOL"         "S_HOME"        "S_ONLN"       
## [57] "S_GEO_ACC"     "S_STATE"       "S_REGION"      "S_COUNTY"     
## [61] "S_CPA"         "S_MCD"         "S_TRACT"       "S_TAZ"        
## [65] "PRESCH_1"      "PRESCH_2"      "PRESCH_3"      "T_TRIP"       
## [69] "PAY_TYPE1"     "PAY_TYPE2"     "PAY_TYPE3"     "TAP"

Use the length() and unique() commands to check how well the data will combine. length() returns the length of any vector; unique() creates a vector of the unique values within another vector or variable. The number of unique household IDs in the person file should equal the number of observations in the household data. By contrast, the number of unique person IDs should equal the number of observations in the person data.

length(unique(per$HH_ID))
## [1] 9235
length(unique(per$PERSON_ID))
## [1] 20216

You should also notice that the person ID is the household ID with the two-digit person number appended to it.

head(per[1:4])
##    ID  HH_ID PERSON_NUM PERSON_ID
## 1 279 100140          1  10014001
## 2 280 100140          2  10014002
## 3 281 100206          1  10020601
## 4 282 100206          2  10020602
## 5 283 100206          3  10020603
## 6 284 100291          1  10029101
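If you want to confirm that this pattern holds for every record rather than just the first few, a quick check like the following should return TRUE (this assumes the person ID is always the household ID followed by a two-digit person number, which is what the rows above suggest):

all(per$PERSON_ID == per$HH_ID * 100 + per$PERSON_NUM)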

Now that you have found the correct identification numbers, you can merge the files using the merge() command. Add all of the household data to the person file and create a new data frame. In this case, the merge is perfect, but if there were some people with missing household data, you could keep all person records by setting all.x = TRUE instead of FALSE.

per_hh <- merge(per, hh, by = "HH_ID", all.x = FALSE, all.y=FALSE, sort = FALSE)

Always be careful to check your data before and after a merge to make sure that you are getting what you want. In this case, the merged data frame should have the same number of observations as the person data.
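A quick sanity check along these lines (output not shown) compares the row counts before and after the merge; the merged frame should also contain the person columns plus the household columns, with the shared HH_ID column counted only once.

nrow(per)     # observations in the person data
nrow(per_hh)  # should match the person data after the merge
ncol(per_hh)  # person columns + household columns - 1 shared HH_ID column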

The next data set contains information about the individual trips that people made during one day. Note that the head() command below shows 10 rows instead of the default of 6. The trip data frame has IDs to match both the household and the person data.

head(trip[,1:12], 10)

Try making a subset of all trips made by members of one household using the subset() function. The output of the function is a data frame, which can be checked with the class() function. Some functions return objects in matrix form, which makes processing the data a little bit different.

test <- subset(trip, trip$HH_ID == 100374 )
class(test)
## [1] "data.frame"

Now look at two variables of this subset. The variable MODE contains information about the mode that was used in each trip. MODE_AGG combines some of the modes into larger categories. Again refer to the data dictionary.

test$MODE
##  [1]  5  5  5  5 NA  5  5  5  5  5  5  5  5  5  5 NA  5  5  5  5 NA  5  5
## [24]  5  5  5  5 NA
test$MODE_AGG
##  [1]  3  3  3  3 NA  3  3  3  3  3  3  3  3  3  3 NA  3  3  3  3 NA  3  3
## [24]  3  3  3  3 NA

You can find out how many trips do not have any information about mode using the is.na() function. It produces a vector of the same length as trip$MODE that contains only TRUE or FALSE values. The summary() function counts how many of each value the logical vector contains. There are roughly 20,000 TRUEs, so about 20,000 trips have no information about mode.

summary(is.na(trip$MODE))
##    Mode   FALSE    TRUE    NA's 
## logical   61739   20201       0

Then you can select only the observations that have information about mode.

head(subset(trip,is.na(trip$MODE)==FALSE))[,5:10]
##   HH_WEIGHT P_WEIGHT TOUR_NUM WKSUB_NUM   WKSUB_ID TRIP_NUM
## 1  133.4603 132.5431        1        NA 1.0014e+11        1
## 2  133.4603 132.5431        1        NA 1.0014e+11        2
## 3  133.4603 132.5431        1        NA 1.0014e+11        3
## 4  133.4603 132.5431        1        NA 1.0014e+11        4
## 5  133.4603 132.5431        1        NA 1.0014e+11        5
## 7  133.4603 115.1988        1        NA 1.0014e+11        1

You may also want a better idea of what is happening with the records that have no reported mode. Notice that the trip number is equal to 0 (no trips) or 97 (final destination).

table(trip$TRIP_NUM[is.na(trip$MODE)==TRUE])
## 
##     0    97 
##  3611 16590

Finally, look at the vehicle data set.

head(veh)
##   ID  HH_ID HH_WEIGHT TOT_VEH OP_VEH VEH_NUM   VEH_ID VEH_OWN YEAR MAKE
## 1 79 100140  133.4603       2      2       1 10014001       1 2009   15
## 2 80 100140  133.4603       2      2       2 10014002       1 2010    8
## 3 81 100206  204.1062       2      2       1 10020601       1 1999   14
## 4 82 100206  204.1062       2      2       2 10020602       1 2009   37
## 5 83 100291  287.9105       2      2       1 10029101       2 2010   39
## 6 84 100291  287.9105       2      2       2 10029102       2 2011   39
##           MODEL BODY VEH_TYPE
## 1        SONATA    1        2
## 2 GRAND CARAVAN    8        2
## 3           CRV    2        2
## 4     FORRESTER    2        2
## 5         PRIUS    1        1
## 6       SEIANNA    8        2

Notice that there are more vehicles than households, and that there are fewer unique household IDs in the vehicle data than there are households. This is because not all households own cars, while some own more than one.

dim(veh)
## [1] 14903    13
length(unique(veh$HH_ID))
## [1] 8307

Now that you have looked at all the data, try to answer these two questions: How many trips do people take on average? And how does this vary by income?

The first should be relatively straightforward, since the person data already contain a variable recording each person's total trips.

table(per$P_TOT_TRIPS)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 3625  629 6025 2467 2880 1544 1251  698  478  292  182   76   33   21    6 
##   15   16   19   20 
##    4    2    1    2
mean(per$P_TOT_TRIPS)
## [1] 3.053225

This, however, is not entirely correct. The data come from a sample of households that differs systematically from the population of households in the Philadelphia region. This is accounted for by the weights. Look at the distribution of the weights and then use the weighted.mean() function to apply them.

summary(per$P_WEIGHT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73.82   88.67  117.50  278.30  233.10 3665.00
weighted.mean(per$P_TOT_TRIPS, per$P_WEIGHT)
## [1] 2.902891

It is often interesting to look at who gets under-sampled in these types of surveys. First, plot the density of person weights. Notice how there is a long tail of people with very high weights.

plot(density(per$P_WEIGHT))

plot(density(per$P_WEIGHT[per$P_WEIGHT<500]))

Now compare the racial differences across the high and low weights. Notice how the under-sampled population (those given a high weight) is less white than the over-sampled population. In general, these types of surveys tend to over-sample white households, households that own their own homes, large households, and households in single-family homes. Without applying the weights, the summaries are more likely to reflect the sample than the population of the Philadelphia region.

round(table(per$RACE[per$P_WEIGHT<=278.3])/length(per$RACE[per$P_WEIGHT<=278.3]),3)
## 
##     1     2     3     4     5     6    97    98    99   100   988 
## 0.872 0.052 0.008 0.001 0.020 0.000 0.002 0.001 0.011 0.008 0.024
round(table(per$RACE[per$P_WEIGHT>278.3])/length(per$RACE[per$P_WEIGHT>278.3]),3)
## 
##     1     2     3     4     5    97    98    99   100   988 
## 0.712 0.157 0.039 0.003 0.027 0.004 0.001 0.011 0.018 0.027

Note that each weight is equal to the number of people or households that the observation is supposed to represent, so you can sum the weights for an estimate of the number of people, households, or vehicles in the region.

sum(per$P_WEIGHT)
## [1] 5627010
sum(hh$HH_WEIGHT)
## [1] 2097203
sum(hh$HH_WEIGHT*hh$TOT_VEH)
## [1] 3330849

Now look at how the number of trips varies by income. This is a little trickier. First, you have to merge the person and household data (which we already did). Second, you need to make the estimate for each income group. There are a few ways to automate this (if you are interested, look at the apply functions and the doBy package; a sketch follows the manual version below), but for now do it manually for the highest and lowest income groups.

weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==1], per_hh$P_WEIGHT[per_hh$INCOME==1]) 
## [1] 2.489796
weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME==10], per_hh$P_WEIGHT[per_hh$INCOME==10])
## [1] 3.339281
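If you would rather automate the calculation for all ten income groups, one possibility is to loop over the group codes with sapply(); this is just a sketch of the idea in base R (the apply-family or doBy approaches mentioned above would work as well):

trips_by_income <- sapply(1:10, function(i) {
  weighted.mean(per_hh$P_TOT_TRIPS[per_hh$INCOME == i],
                per_hh$P_WEIGHT[per_hh$INCOME == i])   # weighted mean for income group i
})
names(trips_by_income) <- 1:10
trips_by_income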

Finally, you may want to create and save a new data frame that contains only the variables of interest. First create the new data frame with the variables you want to keep.

dat <- per_hh[c("HH_ID", "PERSON_ID", "P_TOT_TRIPS",  "GEND" , "AGECAT", "RACE", "INCOME", "TOT_VEH")]

You can save it as a CSV or as an R data file. A CSV is particularly useful if you have created a summary that you want to open in Excel to make a nicely formatted table, or if you want to send your file to someone who does not use R.

The first argument specifies the data frame to be saved, and the file argument specifies the file name. By default, R will write the files to your current working directory.

write.csv(dat, file="per_hh.csv")
save(dat, file="per_hh.Rda")
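If you want to pick up where you left off in a later session, you can read the saved files back in: load() restores the object under its original name (dat), while read.csv() returns a new data frame under whatever name you assign (dat2 here is just an example name).

load("per_hh.Rda")
dat2 <- read.csv("per_hh.csv")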

EXERCISE

  1. Describe in detail the daily activities and trips of the first person in the trips data.
  2. What is the mode share for the Philadelphia region? (Hint: Start with MODE_AGG instead of MODE)
  3. For Philadelphia?
  4. Give the weighted value too. (Hint: use the trip files with the person weights).
  5. Compare the regional mode share for work and non-work trips. (For convenience, use the TOUR_TYPE variable instead of the activities.)
  6. Compare the mode share for households with income under $35,000 with those above or equal to $35,000.
  7. Plot the number of trips against income and describe any patterns that you see.
  8. What percent of driving trips had to pay for parking?
  9. What was the average hourly parking rate? (Hint: not all prices are presented in hours, so convert these)


