Planning Methods: Population trend model

Now that you have uploaded and examined the UN data on central African cities, we are going to do some ordinary least squares regressions (OLS) to predict Libreville's population. For a review of OLS, please refer to class readings:

An Introduction to Statistical Learning, chapter 3

Video lectures available here (also, chapter 3)

Tufte's Data Analysis for Politics and Policy ($2 for ebook)

Let's start by looking at the population over time:

plot(pop$Year, pop$Libreville)

It looks like we have a fairly linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict population by year using linear OLS and the full set of data, we'll get the line that minimizes the sum of squared vertical distances between the line and the observed population in each year. In R, we can use the lm function (short for linear model) to estimate the slope and intercept of the line.

lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.81 -24.46  -4.11  21.42  91.48 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.185e+04  1.314e+03  -16.62 3.84e-09 ***
## pop$Year     1.116e+01  6.637e-01   16.82 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 44.77 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9626, Adjusted R-squared:  0.9592 
## F-statistic:   283 on 1 and 11 DF,  p-value: 3.389e-09

What do the estimates mean?

The predicted population (a point on the line that minimizes the sum of squared distances to the observed points) is equal to the intercept plus the year times the estimated coefficient for year.

So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16, which gives about 472 because the printed coefficients are rounded. We can also use the outputs from the regression, which avoids both the rounding and the trouble of entering the numbers manually.

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000
## [1] 481.7582
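
The same calculation can be sketched with a small self-contained data frame, which also makes it easier to predict for years that are not in the data. This is only an illustration: lbv_data and fit are names I made up for this example, and the values are the 1950 to 2010 Libreville estimates (in thousands) from the UN data used on this page.

```r
# Libreville population (thousands), 1950-2010, copied from the UN data on this page
lbv_data <- data.frame(
  Year = seq(1950, 2010, by = 5),
  Libreville = c(15, 17, 29, 49, 77, 155, 234, 293, 366, 439, 496, 559, 631)
)

# Passing data = ... (rather than pop$...) lets predict() accept new years
fit <- lm(Libreville ~ Year, data = lbv_data)

# Intercept and slope at full precision
print(coef(fit))

# Predicted population in 2000: intercept + slope * 2000
manual_2000 <- unname(coef(fit)[1] + coef(fit)[2] * 2000)
predict_2000 <- unname(predict(fit, newdata = data.frame(Year = 2000)))
print(c(manual = manual_2000, predict = predict_2000))
```

Both numbers should agree with each other, since predict is doing the same intercept-plus-slope arithmetic for us.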

Or we could predict each year in the data:

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year
##  [1] -76.48352 -20.65934  35.16484  90.98901 146.81319 202.63736 258.46154
##  [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330

Or look at the predictions next to the data (note how I use the predict function instead of the manual estimation from above).

cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop)) # NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind
##    [,1] [,2]      [,3]
## 1  1950   15 -76.48352
## 2  1955   17 -20.65934
## 3  1960   29  35.16484
## 4  1965   49  90.98901
## 5  1970   77 146.81319
## 6  1975  155 202.63736
## 7  1980  234 258.46154
## 8  1985  293 314.28571
## 9  1990  366 370.10989
## 10 1995  439 425.93407
## 11 2000  496 481.75824
## 12 2005  559 537.58242
## 13 2010  631 593.40659
## 14 2015   NA 649.23077
## 15 2020   NA 705.05495
## 16 2025   NA 760.87912
## 17 2030   NA 816.70330

Is this a good predictor of population from 2015 to 2030?

The R-squared from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To try to answer that question, let's start by looking at the regression line:

plot(pop$Year, pop$Libreville)
abline(lbv)

The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1960 and 1990, and underpredicting again after 1990. We also predict negative populations for 1950 and 1955. This is quite apparent when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)

plot(predict(lbv), resid(lbv))

Do you expect a population prediction for 2020 based on the trend line from 1950 to 2010 to overestimate or underestimate population? Why?

If you guessed underestimate, then I agree. While we don't necessarily know that the ongoing trend will continue, we're making increasingly bad estimates from 1995 onward. If you knew Libreville, you would also probably not guess that population growth is going to slow in the near future.

Another way that we might test our model is to see how well it predicts population in the most recent years without using the population from those years (i.e., back-casting).

lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))

plot(pop$Year, pop$Libreville)
abline(lbv_back)

This does not look nearly so good. The R-squared of lbv_back suggests that we explain 90% of the variance in the data used to fit it, but the prediction error for the held-out years (1995 to 2010) is much worse. We can compare the mean squared error of the new data to the old data.

mean(resid(lbv_back)^2)
## [1] 1438.128
mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)
## [1] 9702.728

Our average squared prediction error for 1995 to 2010 is about seven times higher than the prediction error between 1950 and 1990. In fact, we would have better predictions of the population from 1995 to 2010 just by taking the average population between 1995 and 2010 and calling it a day.
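
Here is a minimal sketch of that comparison, including the "average and call it a day" baseline. The names lbv_data, train, holdout, and back are ones I made up for this example, and the data are the same 1950 to 2010 Libreville estimates used above. Note that the baseline uses the held-out values themselves, so it is a yardstick rather than a real forecast.

```r
# Libreville population (thousands), 1950-2010, copied from the UN data on this page
lbv_data <- data.frame(
  Year = seq(1950, 2010, by = 5),
  Libreville = c(15, 17, 29, 49, 77, 155, 234, 293, 366, 439, 496, 559, 631)
)

# Fit on 1950-1990 only and hold out 1995-2010
train <- subset(lbv_data, Year < 1995)
holdout <- subset(lbv_data, Year >= 1995)
back <- lm(Libreville ~ Year, data = train)

# Mean squared error on the years used to fit the model
mse_train <- mean(resid(back)^2)

# Mean squared error on the held-out years
mse_holdout <- mean((holdout$Libreville - predict(back, newdata = holdout))^2)

# Baseline: predict every held-out year with the held-out average
mse_mean <- mean((holdout$Libreville - mean(holdout$Libreville))^2)

print(round(c(train = mse_train, holdout = mse_holdout, mean_baseline = mse_mean), 1))
```

If the baseline error comes out smaller than the held-out error, the trend line fitted through 1990 is doing worse than a flat line through the later data.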

There are a few ways that we might try to get better predictions.

The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:

plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
abline(lbv)
summary(lbv)
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year > 
##     1970))
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6667 -3.5595 -0.9524  3.5060  8.8095 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.636e+04  3.534e+02  -74.58 3.91e-10 ***
## Year         1.343e+01  1.774e-01   75.70 3.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 5.747 on 6 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.9988 
## F-statistic:  5731 on 1 and 6 DF,  p-value: 3.576e-10
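
To turn the recent-data model into a forecast, we can hand predict a data frame of future years. This is a sketch under the same assumptions as before: lbv_data, recent, and future are names I made up for the example, and the population values are the UN estimates used on this page.

```r
# Libreville population (thousands), 1950-2010, copied from the UN data on this page
lbv_data <- data.frame(
  Year = seq(1950, 2010, by = 5),
  Libreville = c(15, 17, 29, 49, 77, 155, 234, 293, 366, 439, 496, 559, 631)
)

# Same model as above: only the years after 1970
recent <- lm(Libreville ~ Year, data = subset(lbv_data, Year > 1970))

# Years we want to forecast
future <- data.frame(Year = seq(2015, 2030, by = 5))
future$predicted <- predict(recent, newdata = future)

print(future)
```

A very high R-squared on eight points does not guarantee good forecasts, so it is still worth back-casting this model the same way we did with the full trend line.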

Though not necessarily well-suited to these data, adding a quadratic term can often improve model fit:

lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.701 -11.787   6.736  15.260  29.637 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.190e+05  8.743e+04   4.792 0.000733 ***
## pop$Year      -4.341e+02  8.832e+01  -4.915 0.000609 ***
## I(pop$Year^2)  1.124e-01  2.230e-02   5.042 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 24.95 on 10 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9873 
## F-statistic: 468.3 on 2 and 10 DF,  p-value: 1.315e-10
plot(pop$Year, pop$Libreville)

lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)
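
A quick way to see what the quadratic term does to a forecast is to compare the linear and quadratic predictions for a year outside the data. This is a minimal sketch, again with made-up names (lbv_data, lin, quad) and the UN estimates used on this page.

```r
# Libreville population (thousands), 1950-2010, copied from the UN data on this page
lbv_data <- data.frame(
  Year = seq(1950, 2010, by = 5),
  Libreville = c(15, 17, 29, 49, 77, 155, 234, 293, 366, 439, 496, 559, 631)
)

# Linear and quadratic trend models on the full data
lin <- lm(Libreville ~ Year, data = lbv_data)
quad <- lm(Libreville ~ Year + I(Year^2), data = lbv_data)

# Forecast 2020 with both models
new_year <- data.frame(Year = 2020)
pred_lin <- unname(predict(lin, newdata = new_year))
pred_quad <- unname(predict(quad, newdata = new_year))

print(c(linear = pred_lin, quadratic = pred_quad))
print(c(r2_linear = summary(lin)$r.squared, r2_quadratic = summary(quad)$r.squared))
```

Because the squared term keeps bending the curve upward, the gap between the two forecasts widens the further out we predict, which is one reason to be cautious with polynomial trends.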

Or we could try predicting Libreville's population as a function of a larger geography, such as all Sub-Saharan African cities over 300,000.

plot(pop$Sub_Saharan_300kplus, pop$Libreville)
summary(lm(Libreville ~ Sub_Saharan_300kplus, data = pop))
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.44 -27.04 -23.48  41.54  51.95 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          22.016224  15.931117   1.382    0.194    
## Sub_Saharan_300kplus  0.004803   0.000242  19.850 5.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 38.14 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9704 
## F-statistic:   394 on 1 and 11 DF,  p-value: 5.795e-10


Download population data for New York City and read it into R.

  1. Plot New York City's population from 1790 to 2010 and describe the major trends.

  2. Based on these trends and your knowledge of NYC, describe what you think NYC's population will be in 2020 and why.

  3. Use a linear regression to make a prediction of New York's 2020 population. Is the prediction any good?

  4. Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).

  5. Try a quadratic regression.

  6. Now try a cubic function (hint: add the term I(pop^3) to the quadratic regression).

  7. Use backcasting to predict NYC's 2000 and 2010 population.

  8. Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?

  9. Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.

Planning Methods: Loading and examining data

This page describes how to upload and look at data in R. First, we will download a CSV of UN population estimates on central African cities to import into R: Central Africa West.

The first thing to do is to let R know where to find the CSV. You can do this by navigating through the bottom right window in R Studio (Files/More/Set As Working Directory) or setting your working directory using the setwd command.

Then read the CSV file: read.csv("Central Africa West.csv")
Note how the data are printed directly on the command console. Next we will create a new object called pop so that we can call on the data more easily and interact with it.

Try: pop <- read.csv("Central Africa West.csv")

NB you can use = instead of <- but it is better coding practice to save the = sign for other uses.
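
If you want to see the difference between printing and assigning without downloading anything, here is a small sketch. It writes a three-row CSV to a temporary file first; csv_path and pop_small are names I made up for the example, and the three rows are the first Libreville values from the UN file.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,Libreville", "1950,15", "1955,17", "1960,29"), csv_path)

# Without assignment, the data are only printed on the console
print(read.csv(csv_path))

# With assignment, the data are kept as an object we can reuse
pop_small <- read.csv(csv_path)
str(pop_small)
```

With your own file, you would replace csv_path with the file name in quotes, after setting the working directory as described above.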

To look at the data, try cutting and pasting the following commands into the command console and hitting enter.

str(pop)
## 'data.frame':    17 obs. of  12 variables:
##  $ Year                   : int  1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 ...
##  $ Period                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Douala                 : int  95 117 153 205 298 433 571 740 940 1184 ...
##  $ Yaounde                : int  32 49 75 112 183 292 415 578 777 1025 ...
##  $ Libreville             : int  15 17 29 49 77 155 234 293 366 439 ...
##  $ Brazzaville            : int  83 92 124 172 238 329 446 596 704 830 ...
##  $ Pointe_Noire           : int  16 35 64 89 116 154 217 300 363 439 ...
##  $ Sub_Saharan_300kplus   : int  3923 4909 7083 10779 16335 24143 34813 47767 62327 75996 ...
##  $ Central_Africa_300kplus: int  3660 4521 5651 7047 8921 11495 14465 18043 22566 28525 ...
##  $ Cameroon_urban         : int  417 557 747 1011 1375 2112 2851 3761 4787 5930 ...
##  $ Congo_urban            : int  201 253 320 408 522 672 860 1086 1295 1535 ...
##  $ Gabon_urban            : int  54 68 87 126 189 279 397 515 655 814 ...

This tells you the structure of the data. In this case, all of the variables are integers, but it is also common to see characters, factors, and other types of data.

names(pop)
##  [1] "Year"                    "Period"                 
##  [3] "Douala"                  "Yaounde"                
##  [5] "Libreville"              "Brazzaville"            
##  [7] "Pointe_Noire"            "Sub_Saharan_300kplus"   
##  [9] "Central_Africa_300kplus" "Cameroon_urban"         
## [11] "Congo_urban"             "Gabon_urban"

This command gives the names of all the variables in the data. The country, city, and region columns contain population estimates in thousands.

summary(pop)
##       Year          Period       Douala          Yaounde      
##  Min.   :1950   Min.   : 1   Min.   :  95.0   Min.   :  32.0  
##  1st Qu.:1970   1st Qu.: 5   1st Qu.: 205.0   1st Qu.: 112.0  
##  Median :1990   Median : 9   Median : 571.0   Median : 415.0  
##  Mean   :1990   Mean   : 9   Mean   : 804.8   Mean   : 693.8  
##  3rd Qu.:2010   3rd Qu.:13   3rd Qu.:1184.0   3rd Qu.:1025.0  
##  Max.   :2030   Max.   :17   Max.   :2361.0   Max.   :2349.0  
##                              NA's   :4        NA's   :4       
##    Libreville     Brazzaville      Pointe_Noire   Sub_Saharan_300kplus
##  Min.   : 15.0   Min.   :  83.0   Min.   : 16.0   Min.   :  3923      
##  1st Qu.: 49.0   1st Qu.: 172.0   1st Qu.: 89.0   1st Qu.: 16335      
##  Median :234.0   Median : 446.0   Median :217.0   Median : 62327      
##  Mean   :258.5   Mean   : 575.3   Mean   :293.1   Mean   : 91980      
##  3rd Qu.:439.0   3rd Qu.: 830.0   3rd Qu.:439.0   3rd Qu.:137283      
##  Max.   :631.0   Max.   :1574.0   Max.   :815.0   Max.   :300153      
##  NA's   :4       NA's   :4        NA's   :4                           
##  Central_Africa_300kplus Cameroon_urban   Congo_urban    Gabon_urban    
##  Min.   :  3660          Min.   :  417   Min.   : 201   Min.   :  54.0  
##  1st Qu.:  8921          1st Qu.: 1375   1st Qu.: 522   1st Qu.: 189.0  
##  Median : 22566          Median : 4787   Median :1295   Median : 655.0  
##  Mean   : 34795          Mean   : 6835   Mean   :1723   Mean   : 819.6  
##  3rd Qu.: 51883          3rd Qu.:10625   3rd Qu.:2600   3rd Qu.:1334.0  
##  Max.   :107747          Max.   :20492   Max.   :4804   Max.   :2122.0  

Another good way to get a sense for the data is to look at the first or last entries, using the head or tail commands.

head(pop)
##   Year Period Douala Yaounde Libreville Brazzaville Pointe_Noire
## 1 1950      1     95      32         15          83           16
## 2 1955      2    117      49         17          92           35
## 3 1960      3    153      75         29         124           64
## 4 1965      4    205     112         49         172           89
## 5 1970      5    298     183         77         238          116
## 6 1975      6    433     292        155         329          154
##   Sub_Saharan_300kplus Central_Africa_300kplus Cameroon_urban Congo_urban
## 1                 3923                    3660            417         201
## 2                 4909                    4521            557         253
## 3                 7083                    5651            747         320
## 4                10779                    7047           1011         408
## 5                16335                    8921           1375         522
## 6                24143                   11495           2112         672
##   Gabon_urban
## 1          54
## 2          68
## 3          87
## 4         126
## 5         189
## 6         279

For help with any of these commands, use the help function by typing ? before the command name. For example, try typing ?head into the command console.


Walk through this brief introduction to R and R Studio.



Exploring US bus ridership with the National Transit Database

I frequently use the National Transit Database for analysis or quick summaries of key information about public transportation in the US. I also have my students use it to perform a back-of-the-envelope cost-benefit analysis for a proposed light rail system in one of my transportation planning courses. In both instances, I find the downloadable Excel sheets to be a bit clunky and slow to manipulate. Most recently I wanted a quick comparison of bus ridership per route kilometer for a system in Indonesia. As usual, I downloaded the Excel sheets and spent about an hour playing with the data until I got a single reasonable answer. As usual, I’d have to do another bit of work if I wanted to change anything. Instead, I decided to clean up a subset of the data and rearrange it into a long-format panel. For a description of the data and some of the choices made (like when to make an entry a 0 or an NA), see here.

I’m also hoping that the clean data set will encourage my students to choose to work with the data in R. I use R in my Planning Methods and Transportation Planning courses because:

  1. Data sets are ever larger and more readily available. Many are useful to planners and for planning. The ability to manage and work with large amounts of data is an important skill for planners, particularly transportation planners.
  2. It’s a great program for managing, visualizing, and analyzing data. Many of its primary competitors are only really good at the third. Poor data management can be embarrassing.
  3. It has a strong user community that builds add-on packages to accomplish a range of tasks, provides tutorials, and answers questions online. It’s also frequently used in free, online courses and is a program that grows with users as they learn more. Here are some links to great free courses from professors at Johns Hopkins and Stanford.
  4. It’s free, so even the most cash-strapped municipality has access.

Unfortunately, the learning curve can be steep for new users. Processing data, in particular, can be slow and is rarely immediately rewarding. New users are unlikely to see much benefit to downloading, cleaning, and formatting data just to use it in a class assignment. I wouldn’t either.

Now back to my original question: how does the number of passengers per kilometer on my Indonesian city stack up to ridership in US cities?

After loading the file, this code does the trick. I only include directly operated routes (Service == "DO") because the privately operated routes (Service == "PT") are less accurately reported. Unlinked passenger trips (UPT) aren’t the same thing as total bus passengers (riders who transfer once count as two unlinked passenger trips), but they’re collected in the same way that corridor bus counts in Indonesia would likely have occurred. I multiply directional route miles (DRM) by 1.61 to convert miles to kilometers and divide by two because DRM counts 1 mile of bidirectional service as 2 miles; the halved figure is closer to what the Indonesian counts measure.

summary(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    0.2915   14.9500   32.9000   67.7300   64.3100 1582.0000       162
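
To make the arithmetic concrete, here is a small self-contained sketch of the same calculation for four systems. The DRM and UPT values are copied from the 2013 output printed further down this page; ntd_small and the column name upt.per.km are names I made up for this example and are not part of the NTD data.

```r
# DRM (directional route miles) and UPT (annual unlinked passenger trips) for
# four systems, copied from the 2013 output shown further down this page
ntd_small <- data.frame(
  System = c("NYCT", "CTA", "UGA", "ASUM OT"),
  DRM = c(1659.0, 1311.2, 96.0, 6.3),
  UPT = c(770962014, 300116357, 11058944, 421694)
)

# Average daily trips per kilometer of two-way route
daily_upt <- ntd_small$UPT / 365          # annual trips -> average daily trips
route_km <- (1.61 * ntd_small$DRM) / 2    # miles -> km, then halve directional miles
ntd_small$upt.per.km <- daily_upt / route_km   # upt.per.km is a hypothetical name

print(ntd_small)

# Which of these systems carry more than 16 daily trips per route kilometer?
print(ntd_small$System[ntd_small$upt.per.km > 16])
```

These four are all among the busier directly operated systems, so they sit well above the median reported in the summary above.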

And this provides even more information about the distribution of ridership (Unlinked Passenger Trips anyway) per kilometer of bus route.

plot(density(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & Service == "DO" & UPT != 0), (UPT/365) / ((1.61*DRM)/2) ), na.rm=T))

And this tells me how many systems perform better than my Indonesian system, which carries only 16 passengers per kilometer.

table(with(subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0), (UPT/365) / ((1.61*DRM)/2) )<16)
## FALSE  TRUE 
##   317   158

Before going any further, I recommend trying these three commands to get a better feel for the data:

head(NTD.ts)
str(NTD.ts)
summary(NTD.ts)

In the interest of space, I’m excluding the outputs.

One of the handiest features of R is the ease with which it handles multiple data sets, so I’m going to add a couple of variables and then create a new data set that only looks at buses in 2013 that are directly operated by public agencies. I switched from Stata to R halfway into my dissertation (an otherwise foolish decision) because this made it so much easier to interact with and combine several large data sets.

NTD.ts$ <- (NTD.ts$UPT/365) / ((1.61*NTD.ts$DRM)/2)
NTD.ts$fare.recovery <- NTD.ts$FARES / NTD.ts$OPEXP_TOTAL
new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB" & Year == 2013 & UPT != 0 & Service == "DO")

Now let’s plot passengers per kilometer against fare recovery.

plot(new.dat$fare.recovery, new.dat$

I’m a bit suspicious of the systems drawing more fare revenue than they spend on operations, so I want to look at them more closely: two University systems, an airport shuttle, and one I’m not sure about.

subset(new.dat, new.dat$fare.recovery>1)
##       Year TRS.ID Mode Service Last.Report.Year
## 58601 2013   2166   MB      DO             2013
## 59799 2013   2132   MB      DO             2013
## 59988 2013   4180   MB      DO             2013
## 60326 2013   8107   MB      DO             2013
##                                                    System.Name
## 58601                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799               New Jersey Transit Corporation-45(NJTC-45)
## 59988                University of Georgia Transit System(UGA)
## 60326 The University of Montana - ASUM Transportation(ASUM OT)
##       Small.Systems.Waiver      City State Census.Year
## 58601                    N Elizabeth    NJ        2010
## 59799                    N    Newark    NJ        2010
## 59988                    N    Athens    GA        2010
## 60326                    N  Missoula    MT        2010
##                        UZA.Name UZA UZA.Area.SQ.Miles UZA.Population
## 58601 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59799 New York-Newark, NY-NJ-CT   1              3450       18351295
## 59988  Athens-Clarke County, GA 249                98         128754
## 60326              Missoula, MT 348                45          82157
## 58601    15077606  9014072  4403391    708772   951371 16883604   52
## 59799     8299771  5067517  2285398    217667   729189 10034922   41
## 59988     5355484  4052395   791332     52669   459088  6281720   47
## 60326      598190   327957   164685     13613    91935   601851    6
##           VRM    VRH  DRM      UPT      PMT CPI fare.recovery
## 58601 1750683 191541 69.8 10294583 32942666   1     501.9548      1.119780
## 59799 1079382 135429 68.9  5504389  8461884   1     271.8950      1.209060
## 59988  853644 114959 96.0 11058944  4202399   1     392.0610      1.172951
## 60326  153163  11578  6.3   421694   737965   1     227.8076      1.006120

In case I want to only look at a few items.

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$fare.recovery>1)
##       TRS.ID                                              System.Name
## 58601   2166                 Orange-Newark-Elizabeth, Inc.(Coach USA)
## 59799   2132               New Jersey Transit Corporation-45(NJTC-45)
## 59988   4180                University of Georgia Transit System(UGA)
## 60326   8107 The University of Montana - ASUM Transportation(ASUM OT)
##       OPEXP_TOTAL  DRM      UPT      PMT fare.recovery
## 58601    15077606 69.8 10294583 32942666     501.9548      1.119780
## 59799     8299771 68.9  5504389  8461884     271.8950      1.209060
## 59988     5355484 96.0 11058944  4202399     392.0610      1.172951
## 60326      598190  6.3   421694   737965     227.8076      1.006120

Which systems are busiest?

subset(new.dat[c(2, 6, 15, 24, 25, 26, 28,29)], new.dat$ > 750)
##       TRS.ID
## 58641   2008
## 59050   5066
## 59876   5158
##                                                            System.Name
## 58641                                  MTA New York City Transit(NYCT)
## 59050                                   Chicago Transit Authority(CTA)
## 59876 University of Michigan Parking and Transportation Services(UMTS)
##       OPEXP_TOTAL    DRM       UPT        PMT fare.recovery
## 58641  2487134393 1659.0 770962014 1614997081    1581.6043     0.3417458
## 59050   764280757 1311.2 300116357  728561319     778.9902     0.3909879
## 59876     6995443   20.7   7263234   20293476    1194.1832     0.2824377

I might also want to see how DRM and UPT look against each other, since these two numbers form the basis of the calculation. It looks an awful lot like fare recovery against productivity.

plot(new.dat$DRM, new.dat$UPT)

Lastly, I want to look at how passengers per kilometer has changed over time. First I’ll edit new.dat to include the other years. Then I’ll do a box plot by year.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" &
boxplot( ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

Clearly there’s some bad data. This merits further investigation, but for now I’ll just remove the worst offenders.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & & < 1000 )
boxplot( ~ Year,data=new.dat, main="Bus ridership density over time", na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

There’s so much variation that it’s hard to read, so I’ll make two plots, one for high productivity systems and one for low.

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & & < 1000 & > 200 )
boxplot( ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")

new.dat <- subset(NTD.ts, NTD.ts$Mode == "MB"  & UPT != 0 & Service == "DO" & & < 200 )
boxplot( ~ Year,data=new.dat, na.rm=T,
    xlab="Year", ylab="UPT per kilometer of busway")


National Transit Database panel by system, mode, and service

This page contains links to a long-formatted panel of the National Transit Database’s TS2.1 – Service Data and Operating Expenses Time-Series by Mode. I maintain the NTD data conventions, so that the column names match the NTD’s data glossary. I also added a column labeled CPI that is generated from the Bureau of Labor Statistics’ Inflation Calculator and provides the 2013 value of one dollar in the reported year. I have not reformatted the other time series NTD data, which I use less frequently and which are not available at the same level of disaggregation. For example, the capital expenditures data include system, mode, and year, but not service type. Some data products have also been discontinued, such as the detailed sources of local revenues data. The reformatted panel is available in the following formats:

R (password: NTD1),

Stata (password: NTD2),

and as a CSV (password: NTD3).

I haven’t looked closely at the Stata file but exported it using the foreign package. The code used to format the data is available here (password: NTD4), if you want to make any changes. Below, I describe three important choices about when to treat missing values as an NA or a 0. The NTD data do not take a particularly clear approach to the problem. For example, entries for missing cost data vary between a null entry and $0 with no apparent pattern. For additional information about how the data were constructed, see the code above.

  1. Keep zeros. In 1992, Denver did not have a light rail. Instead of replacing entries of zero with NAs, I keep the data. This will make it easier to analyze the influence of new light rail in the future. It is, however, worth noting that this is somewhat inconsistent and introduces some problems. It is inconsistent because Honolulu, which has never had a light rail system, is excluded from the panel and effectively the data are treated as NAs not zeros. It is potentially problematic because I will miscalculate average light rail ridership or operating expenses in 1992 if I’m not careful:
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992), summary(OPEXP_TOTAL) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0        0        0  6334000  5050000 62260000
with(subset(NTD.ts, NTD.ts$Mode == "LR" & NTD.ts$Year == 1992 & NTD.ts$OPEXP_TOTAL > 0), summary(OPEXP_TOTAL) )
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   261200  4800000 11440000 17730000 21360000 62260000
  2. Replace known missing data with an NA. For example, the NTD does not report fare revenues earlier than 2002, so I replaced all fare entries before 2002 with NAs.
  3. Systematically coerce likely incorrect zeros into NAs. The NTD data are frequently messy, particularly for small systems and privately operated systems. In one year, an agency might report unlinked passenger trips, but fail to report any operating costs or passenger miles. Since ridership is particularly important, I used PMT and UPT to identify unreported data treated as zeros. There were only 12 instances where an agency reported positive PMT but no UPT. By contrast, there were numerous instances where an agency reported positive UPT, but no PMT. I first coerced these 12 cases of UPT to be NAs. I then replaced zeros in remaining columns (including PMT) with NAs whenever the agency reported positive UPT. Note that this does not remove any data, but makes it a lot less likely to generate bad data (for example, when dividing total fare revenues or operating expenses by unlinked passenger trips or passenger miles).
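
The third rule can be sketched on a toy data frame. This is only an illustration of the logic, not the code I used to build the panel (that is in the linked file): the four rows below are invented values, not real NTD entries, and toy, reported, and col are names made up for the example.

```r
# Invented toy rows (not real NTD data) covering the cases described above
toy <- data.frame(
  UPT = c(1000, 0, 500, 0),
  PMT = c(4000, 250, 0, 0),
  OPEXP_TOTAL = c(9000, 700, 0, 0)
)

# Step 1: positive PMT but zero UPT means UPT was probably not reported
toy$UPT[toy$PMT > 0 & toy$UPT == 0] <- NA

# Step 2: where UPT is positive, zeros in the other columns become NA
reported <- !is.na(toy$UPT) & toy$UPT > 0
for (col in c("PMT", "OPEXP_TOTAL")) {
  toy[[col]][reported & toy[[col]] == 0] <- NA
}

print(toy)
```

The last row, where everything is zero, is left alone. That matches the first rule: a system with no service in a given year keeps its zeros.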

The New Suburbs: Evolving travel behavior, the built environment, and subway investments in Mexico City

Mexico City is a suburban metropolis, yet most of its suburbs would be unfamiliar to urbanists accustomed to thinking about US metropolitan regions. Mexico City’s suburbs are densely populated—not thinly settled—and its residents rely primarily on informal transit rather than privately-owned automobiles for their daily transportation. These types of dense and transit-dependent suburbs have emerged as the fastest-growing form of human settlement in cities throughout Latin America, Asia, and Africa. Wealthier and at a later stage in its economic development than other developing-world metropolises, Mexico City is a compelling place to investigate the effects of rising incomes, increased car ownership, and transit investments in the dense, peripheral areas that have grown rapidly around informal transit in the past decades, and is a bellwether for cities like Dakar, Cairo, Lima, and Jakarta.

I begin this dissertation with a historical overview of the demographic, economic, and political trends that have helped shape existing urban form, transportation infrastructure, and travel behavior in Mexico City. Despite an uptick in car ownership and use, most households—both urban and suburban—continue to rely on public transportation. Furthermore, suburban Mexico City has lower rates of car ownership and use than its central areas. In subsequent chapters, I frame, pose, and investigate three interrelated questions about Mexico City’s evolving suburban landscape, the nature of households’ travel decisions, and the relationship between the built environment and travel behavior. Together, these inquiries tell a story that differs significantly from narratives about US suburbs, and provide insight into the future transportation needs and likely effects of land and transportation policy in these communities and others like them in Mexico and throughout the developing world.

First, how has the influence of the built environment on travel behavior changed as more households have moved into the suburbs and aggregate car use has increased? Using two large metropolitan household travel surveys from 1994 and 2007, I model two related-but-distinct household travel decisions: whether to drive on an average weekday, and if so, how far to drive. After controlling for income and other household attributes, I find that the influence of population and job density on whether a household undertakes any daily car trips is strong and has increased marginally over time. By contrast, high job and population densities have a much smaller influence on the total distance of weekday car travel that a household generates. For the subset of households whose members drive on a given weekday, job and population densities have no statistical effect at all. Contrary to expectations, a household’s distance from the urban center is strongly correlated with a lower probability of driving, even after controlling for income. This effect, however, appears to be diminishing over time, and when members of a household drive, they drive significantly more if they live farther from the urban center. The combination of informal transit, public buses, and the Metro has provided sufficient transit service to constrain car use in the densely populated suburban environments of Mexico City. Once suburban residents drive, however, they tend to drive a lot regardless of transit or the features of the built environment.

Second, how closely interrelated are the recent trends of increased suburbanization, rising car ownership, and the proliferation of massive, commercially built peripheral housing developments? To investigate this question, I first disentangle urban growth and car ownership trends by geographic area. The fastest-growing areas tend to be poorer and have had a much smaller impact on the size of the metropolitan car fleet than wealthier, more established neighborhoods in the center and western half of the metropolis. I then zoom in to examine several recent commercial housing developments. These developments, supported by publicly subsidized mortgages, contain thousands of densely packed, small, and modestly priced housing units. Their residents remain highly reliant on public transportation, particularly informal transit, and the neighborhoods become less homogeneous over time as homeowners convert units and parking spaces to shops and offices. Finally, I use the 2007 household travel survey to model households’ intertwined decisions of where to live and whether to own a car. As expected, wealthier and smaller households are more likely to purchase vehicles. However, they prefer to live in more central areas where households with cars tend to drive shorter distances. If housing policy and production cannot adapt to provide more centrally located housing, growing incomes will tend to increase car ownership but concentrate more of it in areas where car-owning households drive much farther.

Third, how has the Metro’s Line B, one of the first and only suburban high-capacity transit investments, influenced local and regional travel behavior and land use? To explore this question, I compare travel behavior and land use measures at six geographic scales, including the investment’s immediate catchment area, across two time periods: six years before and seven years after the investment opened. Line B, which opened in stages in 1999 and 2000, significantly expanded Metro coverage into the densely populated and fast-growing suburban municipality of Ecatepec. While the investment sparked a significant increase in local Metro use, most of this increase came from people who had relied on informal transit, rather than cars. Although this shift reduced transit fares and increased transit speeds for local residents, it also increased government subsidies for the Metro and had no apparent effect on road speeds. Furthermore, the Metro remains highly dependent on informal transit to provide feeder service even within Ecatepec. In terms of land use, the investment increased density around the stations but appears to have had little to no effect on downtown commercial development, where it might have been expected to have a significant influence. In short, the effects of Line B demonstrate much of the promise and many of the problems of expanding high-capacity transit service into the suburbs. Ridership is likely to be high, but so too will be the costs and subsidies, while the effects on car ownership and urban form are likely to be modest.

View of Ecatepec from Line B (2012)

Individually, each chapter contributes to a specific body of transportation and planning literature drawn from the US as well as developing countries. Collectively, they point to a connection between land use and transportation in Mexico City that is different from the connection in the US and other rich-world cities. In particular, there is a physical disconnect between the generally suburban homes of transit users and the generally central location of high-capacity public transit. Addressing this disconnect by shifting housing production from the periphery to the center or by expanding high-capacity transit to the periphery would require significant amounts of time and public subsidy. Thus, contemporary policies to reduce car use or increase accessibility for the poor in the short and medium term would do well to focus on improving the flexible, medium-capacity informal transit around which the city’s dense and transit-dependent suburbs have grown and continue to grow.


Is a Half-Mile Circle the Right Standard for TODs?

Planners and researchers use transit catchment areas—the land around stations—as geographic units for predicting ridership, assessing the impacts of transit investments, and, recently, designing transit-oriented developments (TODs). In the US, a half-mile-radius circle has become the de facto standard for rail-transit catchment areas.

There is surprisingly little evidence to justify any particular catchment area. Why a half mile? Why not a quarter mile or two-fifths of a mile? Is there anything special about a half mile or is this simply a convenient figure that has become an industry standard? A half mile roughly corresponds to the distance someone can walk in 10 minutes at 3 miles per hour and is a common estimate for the distance people will walk to get to a rail station. The half-mile ring is a little more than 500 acres in size.
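The figures in the paragraph above are easy to check. Here is a minimal R sketch of the arithmetic; the variable and function names are my own illustrative choices, and the quarter-mile and two-fifths-mile radii are included only because they are the alternatives raised above.

```r
# Distance walked in 10 minutes at 3 miles per hour
walk_speed_mph <- 3
walk_minutes   <- 10
walk_miles     <- walk_speed_mph * walk_minutes / 60

# Area of a circular catchment in acres (1 square mile = 640 acres)
catchment_acres <- function(radius_miles) {
  pi * radius_miles^2 * 640
}

# Compare the half-mile standard with the two alternatives mentioned above
radii <- c(quarter_mile = 0.25, two_fifths_mile = 0.4, half_mile = 0.5)
areas <- catchment_acres(radii)

print(walk_miles)
print(round(areas, 1))
```

Because area grows with the square of the radius, the choice of radius matters a great deal: a quarter-mile circle covers only one quarter of the land in a half-mile circle.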


Accessibility, it’s not just a wonkish term for being near stuff

I was recently at a transportation conference, where one of the speakers admonished academics for not doing more to get their research out of the academy and into the hands of policy makers and the public. He compared this with think tanks, which spend around a quarter of their budgets on marketing.

Perhaps even more than marketing budgets, inaccessible language, research methods, and publication practices help keep too many books and articles locked away in the ivory tower. While simplifying methods may not always be desirable (though it often is), we could certainly do a lot more to make our writing and publications more accessible.

I recently had the opportunity to rewrite an article with Robert Cervero in the Journal of the American Planning Association into an ACCESS article. If you click on the former, you will likely be directed to an abstract and the opportunity to purchase the article. Unless you have an institutional proxy server that grants you free access to the PDF, you may look for a free draft working paper that’s available through the University of California Transportation Center. This working paper, however, did not benefit from JAPA’s peer review process (many thanks to David Sawicki, Randy Crane, and three anonymous reviewers), which led us to make serious modifications to our data collection and analysis methods. In short, it shows a much earlier iteration of a work in progress.

If you click on the ACCESS link, by contrast, you will be taken directly to a 2,000-word rewrite and summary of the full JAPA article. You can view this as HTML, download a PDF, or order a paper subscription. If you have additional questions about the findings or how the study was conducted, the article provides the JAPA citation. ACCESS’s goal “…is to translate academic research into readable prose that is useful for policymakers and practitioners. Articles in ACCESS are intended to catapult academic research into debates about public policy, and convert knowledge into action.”

Since the ACCESS piece came out, I have received far more inquiries about the work, Robert Cervero was on the local news, and a friend working in New Zealand pointed me to this discussion. This kind of feedback has been important to me, as a young, and hopefully budding, academic. I’m grateful to Don Shoup and the rest of the ACCESS editing team for their hard work. I’ll strive to remember that accessibility is a lot more than a measure of desired destinations, opportunities, and experiences that can be reached in a given amount of time.




Informal Public Transportation Networks in Three Indonesian Cities

“As Indonesian cities today become more prosperous, the demand for mobility among the urban poor is rapidly growing. This is nowhere more the case than in Jakarta – each day city streets become frozen with congestion. Government efforts to supply public transportation are insufficient at best. Informal transportation providers driving ojeks are a common sight weaving among unmoving traffic, simultaneously offering a faster route and contributing to congestion. Growth is also coming to other medium-sized Indonesian cities like Jogja and Solo in Central Java and Palembang in Sumatra—and so is the traffic. Yet while motorized transport is increasing in these cities, the environmental problems of congestion and pollution have not reached the scale of Indonesia’s biggest cities. In fact, as this report describes, informal public transportation offers potential alternatives to the negative stresses of growth on urban transportation systems as well as innovative approaches to provide service to people in poverty.

This report looks at informal public transportation (IPT) from different perspectives and reconsiders its value not just in improving urban mobility, but also as a provider of employment and backbone of the informal economy.”

2011 Transportation Research Board Presentation: Valuing Transit

A preview of the presentation is available here.

I’ve also uploaded a low-resolution pdf.


Kill a camel. Save the world.

I recently read this AP story in the Boston Globe. Apparently, the Australian government is considering awarding carbon credits for killing wild Australian camels. Each year, every six camels produce about the same amount of carbon as an average American car. Given the 1.2 million camels, that’s the equivalent of 200,000 cars. The article goes on to quote a government official about the threat that non-native camels pose to the Australian ecosystem and brands the animals a national menace. Even killing the creatures from helicopters, not a carbon-light mode of transportation, would qualify for credits, but only if the killing were humane.
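For anyone who wants to check the car-equivalent figure, here is a minimal R sketch of the arithmetic. It uses only the two numbers reported in the story; the variable names are my own.

```r
# Figures as reported in the AP story
camels         <- 1.2e6   # wild camels in Australia
camels_per_car <- 6       # camels emitting about as much carbon as one average American car

# Number of average American cars with equivalent annual emissions
car_equivalents <- camels / camels_per_car
print(car_equivalents)
```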

The story reminded me a lot of how we’re approaching climate change and transportation in the US.

Whatever the merits of culling Australian camels or building a high speed rail network, carbon emissions play a small role in the costs and benefits. Yet, quantifiable, concrete, and topical, they often seem to be a central focus.

