I frequently use the National Transit Database for analysis or quick summaries of key information about public transportation in the US. I also have my students use it to perform a back-of-the-envelope cost benefit analysis for a proposed light rail system in one of my transportation planning courses. In both instances, I find the downloadable excel sheets to be a bit clunky and slow to manipulate. Most recently I wanted a quick comparison of bus ridership per route kilometer for a system in Indonesia. As usual, I downloaded the excel sheets and spent about an hour playing with the data until I got a single reasonable answer. As usual, I’d have to do another bit of work if I wanted to change anything. Instead, I decided to clean up a subset of the data and rearrange it into a long-format panel. For a description of the data and some of the choices made (like when to make an entry a 0 or an NA), see here.
I’m also hoping that the clean data set will encourage my students to choose to work with the data in R. I use R in my Planning Methods and Transportation Planning courses because:
- Data sets are ever larger and more readily available. Many are useful to planners and for planning. The ability to manage and work with large amounts of data is an important skill for planners, particularly transportation planners.
- It’s a great program for managing, visualizing, and analyzing data. Many of its primary competitors are only really good at the third. Poor data management can be embarrassing.
- It has a strong user community that builds add-on packages to accomplish a range of tasks, provides tutorials, and answers questions online. It’s also frequently used in free, online courses and is a program that grows with users as they learn more. Here are some links to great free courses from professors at Johns Hopkins and Stanford.
- It’s free, so even the most cash-strapped municipality has access.
Unfortunately, the learning curve can be steep for new users. Processing data, in particular, can be slow and is rarely immediately rewarding. New users are unlikely to see much benefit to downloading, cleaning, and formatting data just to use it in a class assignment. I wouldn’t either.
Now back to my original question: how does the number of passengers per kilometer on my Indonesian city stack up to ridership in US cities?
After loading the file, this code does the trick. I only include directly operated routes (Service == “DO”) because the privately operated routes (Service == “PT”) are less accurately reported. Unlinked passenger trips (UPT) aren’t the same thing as total bus passengers–riders who transfer once count as two unlinked passenger trips–but they’re collected in the same way that corridor bus counts in Indonesia would likely have occurred. I divide directional route miles (DRM) by two because DRM counts 1 mile of bidirectional service as 2 miles and is closer to what the Indonesian counts measure.
summary(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.2915 14.9500 32.9000 67.7300 64.3100 1582.0000 162
And this provides even more information about the distribution of ridership (Unlinked Passenger Trips anyway) per kilometer of bus route.
plot(density(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ), na.rm=T))
And this tells me how many systems perform better than my Indonesian system, which carries only 16 passengers per kilometer.
table(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )<16)
##
## FALSE TRUE
## 317 158
Before going any further, I recommend trying these three commands to get a better feel for the data: head(NTD.ts) str(NTD.ts) summary(NTD.ts) In the interest of space, I’m excluding the outputs.
One of the handiest features of R is the ease with which it handles multiple data sets, so I’m going to add a couple variables and then create a new new data set that only looks at buses in 2013 that are directly operated by public agencies. I switched from Stata to R halfway into my dissertation (an otherwise foolish decision) because this made it so much easier to interact with and combine several large data sets.
NTD.ts$passenger.km <- (NTD.ts$UPT/365) / ((1.61*NTD.ts$DRM)/2)
NTD.ts$fare.recovery <- NTD.ts$FARES / NTD.ts$OPEXP_TOTAL
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0 & Service == "DO")
Now let’s plot passengers per kilometer against fare recovery.
plot(new.dat$fare.recovery, new.dat$passenger.km)
I’m a bit suspicious of the systems drawing more fare revenue than they spend on operations, so I want to look at them more closely: two University systems, an airport shuttle, and one I’m not sure about.
subset(new.dat, new.dat$fare.recovery>1)
## Year TRS.ID Mode Service Last.Report.Year
## 58601 2013 2166 MB DO 2013
## 59799 2013 2132 MB DO 2013
## 59988 2013 4180 MB DO 2013
## 60326 2013 8107 MB DO 2013
## System.Name
## 58601 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799 New Jersey Transit Corporation-45(NJTC-45)
## 59988 University of Georgia Transit System(UGA)
## 60326 The University of Montana - ASUM Transportation(ASUM OT)
## Small.Systems.Waiver City State Census.Year
## 58601 N Elizabeth NJ 2010
## 59799 N Newark NJ 2010
## 59988 N Athens GA 2010
## 60326 N Missoula MT 2010
## UZA.Name UZA UZA.Area.SQ.Miles UZA.Population
## 58601 New York-Newark, NY-NJ-CT 1 3450 18351295
## 59799 New York-Newark, NY-NJ-CT 1 3450 18351295
## 59988 Athens-Clarke County, GA 249 98 128754
## 60326 Missoula, MT 348 45 82157
## OPEXP_TOTAL OPEXP_VO OPEXP_VM OPEXP_NVM OPEXP_GA FARES VOMS
## 58601 15077606 9014072 4403391 708772 951371 16883604 52
## 59799 8299771 5067517 2285398 217667 729189 10034922 41
## 59988 5355484 4052395 791332 52669 459088 6281720 47
## 60326 598190 327957 164685 13613 91935 601851 6
## VRM VRH DRM UPT PMT CPI passenger.km fare.recovery
## 58601 1750683 191541 69.8 10294583 32942666 1 501.9548 1.119780
## 59799 1079382 135429 68.9 5504389 8461884 1 271.8950 1.209060
## 59988 853644 114959 96.0 11058944 4202399 1 392.0610 1.172951
## 60326 153163 11578 6.3 421694 737965 1 227.8076 1.006120
In case I want to only look at a few items.
subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$fare.recovery>1)
## TRS.ID System.Name
## 58601 2166 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799 2132 New Jersey Transit Corporation-45(NJTC-45)
## 59988 4180 University of Georgia Transit System(UGA)
## 60326 8107 The University of Montana - ASUM Transportation(ASUM OT)
## OPEXP_TOTAL DRM UPT PMT passenger.km fare.recovery
## 58601 15077606 69.8 10294583 32942666 501.9548 1.119780
## 59799 8299771 68.9 5504389 8461884 271.8950 1.209060
## 59988 5355484 96.0 11058944 4202399 392.0610 1.172951
## 60326 598190 6.3 421694 737965 227.8076 1.006120
Which systems are busiest?
subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$passenger.km > 750)
## TRS.ID
## 58641 2008
## 59050 5066
## 59876 5158
## System.Name
## 58641 MTA New York City Transit(NYCT)
## 59050 Chicago Transit Authority(CTA)
## 59876 University of Michigan Parking and Transportation Services(UMTS)
## OPEXP_TOTAL DRM UPT PMT passenger.km fare.recovery
## 58641 2487134393 1659.0 770962014 1614997081 1581.6043 0.3417458
## 59050 764280757 1311.2 300116357 728561319 778.9902 0.3909879
## 59876 6995443 20.7 7263234 20293476 1194.1832 0.2824377
I might also want to see how DRM and UPT look against eachother, since these two numbers for them basis of the calculation. It looks an awful lot like fare recovery against productivity.
plot(new.dat$DRM, new.dat$UPT)
Lastly I want to look at how passengers for kilometer has changed over time. First I’ll edit new.dat to include the other years. Then I’ll do a box plot by year.
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F)
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
Clearly there’s some bad data. This merits further investigation, but for now I’ll just remove the worst offenders.
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 )
boxplot(passenger.km ~ Year,data=new.dat, main="Bus ridership desnity over time", na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
There’s so much variation that it’s hard to read so I’ll make two plots, one for high productivity systems and one for low.
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 & passenger.km > 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
xlab="Year", ylab="UPT per kilometer of busway")