Exploring US bus ridership with the National Transit Database

I frequently use the National Transit Database for analysis or quick summaries of key information about public transportation in the US. I also have my students use it to perform a back-of-the-envelope cost benefit analysis for a proposed light rail system in one of my transportation planning courses. In both instances, I find the downloadable excel sheets to be a bit clunky and slow to manipulate. Most recently I wanted a quick comparison of bus ridership per route kilometer for a system in Indonesia. As usual, I downloaded the excel sheets and spent about an hour playing with the data until I got a single reasonable answer. As usual, I’d have to do another bit of work if I wanted to change anything. Instead, I decided to clean up a subset of the data and rearrange it into a long-format panel. For a description of the data and some of the choices made (like when to make an entry a 0 or an NA), see here.

I’m also hoping that the clean data set will encourage my students to choose to work with the data in R. I use R in my Planning Methods and Transportation Planning courses because:

  1. Data sets are ever larger and more readily available. Many are useful to planners and for planning. The ability to manage and work with large amounts of data is an important skill for planners, particularly transportation planners.
  2. It’s a great program for managing, visualizing, and analyzing data. Many of its primary competitors are only really good at the third. Poor data management can be embarrassing.
  3. It has a strong user community that builds add-on packages to accomplish a range of tasks, provides tutorials, and answers questions online. It’s also frequently used in free, online courses and is a program that grows with users as they learn more. Here are some links to great free courses from professors at Johns Hopkins and Stanford.
  4. It’s free, so even the most cash-strapped municipality has access.

Unfortunately, the learning curve can be steep for new users. Processing data, in particular, can be slow and is rarely immediately rewarding. New users are unlikely to see much benefit to downloading, cleaning, and formatting data just to use it in a class assignment. I wouldn’t either.

Now back to my original question: how does the number of passengers per kilometer on my Indonesian city stack up to ridership in US cities?

After loading the file, this code does the trick. I only include directly operated routes (Service == “DO”) because the privately operated routes (Service == “PT”) are less accurately reported. Unlinked passenger trips (UPT) aren’t the same thing as total bus passengers–riders who transfer once count as two unlinked passenger trips–but they’re collected in the same way that corridor bus counts in Indonesia would likely have occurred. I divide directional route miles (DRM) by two because DRM counts 1 mile of bidirectional service as 2 miles and is closer to what the Indonesian counts measure.

summary(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    0.2915   14.9500   32.9000   67.7300   64.3100 1582.0000       162

And this provides even more information about the distribution of ridership (Unlinked Passenger Trips anyway) per kilometer of bus route.

plot(density(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ), na.rm=T))

And this tells me how many systems perform better than my Indonesian system, which carries only 16 passengers per kilometer.

table(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )<16)
## 
## FALSE  TRUE 
##   317   158

Before going any further, I recommend trying these three commands to get a better feel for the data: head(NTD.ts) str(NTD.ts) summary(NTD.ts) In the interest of space, I’m excluding the outputs.

One of the handiest features of R is the ease with which it handles multiple data sets, so I’m going to add a couple variables and then create a new new data set that only looks at buses in 2013 that are directly operated by public agencies. I switched from Stata to R halfway into my dissertation (an otherwise foolish decision) because this made it so much easier to interact with and combine several large data sets.

NTD.ts$passenger.km <- (NTD.ts$UPT/365) / ((1.61*NTD.ts$DRM)/2)
NTD.ts$fare.recovery <- NTD.ts$FARES / NTD.ts$OPEXP_TOTAL
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0 & Service == "DO")

Now let’s plot passengers per kilometer against fare recovery.

plot(new.dat$fare.recovery, new.dat$passenger.km)

I’m a bit suspicious of the systems drawing more fare revenue than they spend on operations, so I want to look at them more closely: two University systems, an airport shuttle, and one I’m not sure about.

subset(new.dat, new.dat$fare.recovery>1)
##       Year TRS.ID Mode Service Last.Report.Year
## 58601 2013   2166   MB      DO             2013
## 59799 2013   2132   MB      DO             2013
## 59988 2013   4180   MB      DO             2013
## 60326 2013   8107   MB      DO             2013
##                                                    System.Name
## 58601                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799               New Jersey Transit Corporation-45(NJTC-45)
## 59988                University of Georgia Transit System(UGA)
## 60326 The University of Montana - ASUM Transportation(ASUM OT)
##       Small.Systems.Waiver      City State Census.Year
## 58601                    N Elizabeth    NJ        2010
## 59799                    N    Newark    NJ        2010
## 59988                    N    Athens    GA        2010
## 60326                    N  Missoula    MT        2010
##                        UZA.Name UZA UZA.Area.SQ.Miles UZA.Population
## 58601 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59799 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59988  Athens-Clarke County, GA 249                98         128754
## 60326              Missoula, MT 348                45          82157
##       OPEXP_TOTAL OPEXP_VO OPEXP_VM OPEXP_NVM OPEXP_GA    FARES VOMS
## 58601    15077606  9014072  4403391    708772   951371 16883604   52
## 59799     8299771  5067517  2285398    217667   729189 10034922   41
## 59988     5355484  4052395   791332     52669   459088  6281720   47
## 60326      598190   327957   164685     13613    91935   601851    6
##           VRM    VRH  DRM      UPT      PMT CPI passenger.km fare.recovery
## 58601 1750683 191541 69.8 10294583 32942666   1     501.9548      1.119780
## 59799 1079382 135429 68.9  5504389  8461884   1     271.8950      1.209060
## 59988  853644 114959 96.0 11058944  4202399   1     392.0610      1.172951
## 60326  153163  11578  6.3   421694   737965   1     227.8076      1.006120

In case I want to only look at a few items.

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$fare.recovery>1)
##       TRS.ID                                              System.Name
## 58601   2166                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799   2132               New Jersey Transit Corporation-45(NJTC-45)
## 59988   4180                University of Georgia Transit System(UGA)
## 60326   8107 The University of Montana - ASUM Transportation(ASUM OT)
##       OPEXP_TOTAL  DRM      UPT      PMT passenger.km fare.recovery
## 58601    15077606 69.8 10294583 32942666     501.9548      1.119780
## 59799     8299771 68.9  5504389  8461884     271.8950      1.209060
## 59988     5355484 96.0 11058944  4202399     392.0610      1.172951
## 60326      598190  6.3   421694   737965     227.8076      1.006120

Which systems are busiest?

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$passenger.km > 750)
##       TRS.ID
## 58641   2008
## 59050   5066
## 59876   5158
##                                                            System.Name
## 58641                                  MTA New York City Transit(NYCT)
## 59050                                   Chicago Transit Authority(CTA)
## 59876 University of Michigan Parking and Transportation Services(UMTS)
##       OPEXP_TOTAL    DRM       UPT        PMT passenger.km fare.recovery
## 58641  2487134393 1659.0 770962014 1614997081    1581.6043     0.3417458
## 59050   764280757 1311.2 300116357  728561319     778.9902     0.3909879
## 59876     6995443   20.7   7263234   20293476    1194.1832     0.2824377

I might also want to see how DRM and UPT look against eachother, since these two numbers for them basis of the calculation. It looks an awful lot like fare recovery against productivity.

plot(new.dat$DRM, new.dat$UPT)

Lastly I want to look at how passengers for kilometer has changed over time. First I’ll edit new.dat to include the other years. Then I’ll do a box plot by year.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F)
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

Clearly there’s some bad data. This merits further investigation, but for now I’ll just remove the worst offenders.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 )
boxplot(passenger.km ~ Year,data=new.dat, main="Bus ridership desnity over time", na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

There’s so much variation that it’s hard to read so I’ll make two plots, one for high productivity systems and one for low.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 1000 & passenger.km > 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & is.na(passenger.km)==F & passenger.km < 200 )
boxplot(passenger.km ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")


 

This entry was posted in Transportation Planning and tagged . Bookmark the permalink.