This page contains links to a long-formatted panel of the National Transit Database’s TS2.1 – Service Data and Operating Expenses Time-Series by Mode. I maintain the NTD data conventions, so that the column names match the NTD’s data glossary. I also added a column labeled CPI that is generated from the Bureau of Labor Statistics’ Inflation Calculator and provides the 2013 value of one dollar in the reported year. I have not reformatted the other time series NTD data, which I use less frequently and which are not available at the same level of disaggregation. For example, the capital expenditures data include system, mode, and year, but not service type. Some data products have also been discontinued, such as the detailed sources of local revenues data. The reformatted panel is available in the following formats:
and as a CSV (password: NTD3).
I haven’t looked closely at the Stata file but exported it using the foreign package. The code used to format the data is available here (password: NTD4), if you want to make any changes. Below, I describe three important choices about when to treat missing values as an NA or a 0. The NTD data do not take a particularly clear approach to the problem. For example, entries for missing cost data vary between a null entry and $0 with no apparent pattern. For additional information about how the data were constructed, see the code above.
- Keep zeros. In 1992, Denver did not have a light rail. Instead of replacing entries of zero with NAs, I keep the data. This will make it easier to analyze influence of new light rail in the future. It is, however, worth noting that this is somewhat inconsistent and introduces some problems. It is inconsistent because Honolulu, which as never had a light rail system, is excluded from the panel and effectively the data are treated as NAs not zeros. It is potentially problematic, because if I will miscalculate average light rail ridership or operating expenses in 1992, if I’m not careful:
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992), summary(OPEXP_TOTAL) )
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 0 0 6334000 5050000 62260000
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992 & NTD.ts$OPEXP_TOTAL > 0), summary(OPEXP_TOTAL) )
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 261200 4800000 11440000 17730000 21360000 62260000
- Replace known missing data with an NA. For example, the NTD does not report fare revenues earlier than 2002, so I replaced all fare entries with NAs.
- Systematically coerce likely incorrect zeros into NAs. The NTD are frequently messy, particularly for small systems and privately operated systems. In one year, an agency might report unlinked passenger trips, but fail to report any operating costs or passenger miles. Since ridership is particularly important, I used PMT and UPT to identify unreported data treated as zeros. There were only 12 instances where an agency reported positive PMT but no UPT. By contrast there were numerous instances where an agency reported positive UPT, but no PMT. I first coerced these 12 cases of UPT to be NAs. I then replaced zeros in remaining columns (including PMT) with NAs whenever the agency reported positive UPT. Note that this does not remove any data, but makes it a lot less likely to generate bad data (if for example dividing total fare revenues or operating expenses by unlinked passenger trips or passenger miles).