Planning Methods: Population trend model





Now that you have uploaded and examined the UN data on central African cities, we are going to do some ordinary least squares regressions (OLS) to predict Libreville's population. For a review of OLS, please refer to class readings:

An Introduction to Statistical Learning, chapter 3

Video lectures available here (also, chapter 3)

Tufte's Data Analysis for Politics and Policy ($2 for ebook)

Let's start by looking at the population over time:

plot(pop$Year, pop$Libreville)

plot of chunk unnamed-chunk-2

It looks like we have a pretty linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict the population growth by year using linear OLS and the full set of data, we'll get a line that minimizes the squared distance between all the points in each year. In R, we can use the lm function (short for linear model) to estimate the slope and intercept of the line.

lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
## 
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.81 -24.46  -4.11  21.42  91.48 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.185e+04  1.314e+03  -16.62 3.84e-09 ***
## pop$Year     1.116e+01  6.637e-01   16.82 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.77 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9626, Adjusted R-squared:  0.9592 
## F-statistic:   283 on 1 and 11 DF,  p-value: 3.389e-09

What do the estimates mean?

The predicted population (the line that minimizes the squared distance between the points) is equal to the intercept plus the year times the estimated coefficent for year.

So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16. We can also use the outputs from the regression to save the trouble of entering the numbers manually.

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000
## [1] 481.7582

Or we could predict each year in the data:

summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year
##  [1] -76.48352 -20.65934  35.16484  90.98901 146.81319 202.63736 258.46154
##  [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330

Or look at the predictions next to the data (note how I use the predict funtion instead of the manual estimation from above).

cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop))#NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind
##    [,1] [,2]      [,3]
## 1  1950   15 -76.48352
## 2  1955   17 -20.65934
## 3  1960   29  35.16484
## 4  1965   49  90.98901
## 5  1970   77 146.81319
## 6  1975  155 202.63736
## 7  1980  234 258.46154
## 8  1985  293 314.28571
## 9  1990  366 370.10989
## 10 1995  439 425.93407
## 11 2000  496 481.75824
## 12 2005  559 537.58242
## 13 2010  631 593.40659
## 14 2015   NA 649.23077
## 15 2020   NA 705.05495
## 16 2025   NA 760.87912
## 17 2030   NA 816.70330

Is this a good predictor of population from 2015 to 2030?

The R2 from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To try to answer that question, let's start by looking at the regression line?

plot(pop$Year, pop$Libreville)
abline(lbv)

plot of chunk unnamed-chunk-7

The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1965 and 1990, and underpredicting again after 1990. We also predict negative population before 1995. This is quite apparent, when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)

plot(predict(lbv), resid(lbv))
abline(h=0)

plot of chunk unnamed-chunk-8

Do you expcet a population prediction for 2020 based on the trend line from 1955 to 2010 to overestimate or underestimate population? Why?


If you guessed underpredict, then I agree. While we don't necessarily know that the ongoing trend will continue, we're making increasinly bad estimates from 1995 and on. If you knew Libreville, you would also probably not guess that population growth is going to slow in the near future.


Another way that we might test our model is to see how well it predicts population in the most recent years without using population from those year (i.e., back-casting).

lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))

plot(pop$Year, pop$Libreville)
abline(lbv_back)

plot of chunk unnamed-chunk-9

This does not look nearly so good. The R2 of lbv_back suggests that we explain 90% of the variance in the data, but the prediction error for the predictor years is much worse. We can compare the mean squared error of the new data to the old data.

mean(resid(lbv_back)^2)
## [1] 1438.128
mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)
## [1] 9702.728

Our average squared prediction error for the 1995 to 2010 is seven times higher than the prediction error between 1955 and 1990. In fact, we would have better predictions of the population from 1995 to 2010, just by taking the average population between 1995 and 2010 and calling it a day.

There are a few ways that we might try to get better predictions.

The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:

plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
summary(lbv)
## 
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year > 
##     1970))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6667 -3.5595 -0.9524  3.5060  8.8095 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.636e+04  3.534e+02  -74.58 3.91e-10 ***
## Year         1.343e+01  1.774e-01   75.70 3.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.747 on 6 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.9988 
## F-statistic:  5731 on 1 and 6 DF,  p-value: 3.576e-10
abline(lbv)

plot of chunk unnamed-chunk-11

Though not necessarily well-suited to these data, adding a quadratic terms can often improve model fit:

lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
## 
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.701 -11.787   6.736  15.260  29.637 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.190e+05  8.743e+04   4.792 0.000733 ***
## pop$Year      -4.341e+02  8.832e+01  -4.915 0.000609 ***
## I(pop$Year^2)  1.124e-01  2.230e-02   5.042 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.95 on 10 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9873 
## F-statistic: 468.3 on 2 and 10 DF,  p-value: 1.315e-10
plot(pop$Year, pop$Libreville)

lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)

plot of chunk unnamed-chunk-12

Or we could try predicting Libreville's population as a function of a larger geography, such as all Sub-Saharan African cities over 300,000.

plot(pop$Sub_Saharan_300kplus, pop$Libreville)

summary(lm(Libreville~Sub_Saharan_300kplus,pop))
## 
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.44 -27.04 -23.48  41.54  51.95 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          22.016224  15.931117   1.382    0.194    
## Sub_Saharan_300kplus  0.004803   0.000242  19.850 5.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.14 on 11 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9704 
## F-statistic:   394 on 1 and 11 DF,  p-value: 5.795e-10
abline(lm(Libreville~Sub_Saharan_300kplus,pop))

plot of chunk unnamed-chunk-13

EXERCISE

Download population data for New York City and read it into R.

  1. Plot New York City's population from 1790 to 2010 and describe the major trends.

  2. Based on these trends and your knowledge of NYC, describe what you think NYC's population will be in 2020 and why.

  3. Use a linear regression to make a prediction of New York's 2020 population. Is the prediction any good?

  4. Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).

  5. Try a quadratic regression.

  6. Now try a cubic function (hint: add the term I(pop^3) to the quadratic regression).

  7. Use backcasting to predict NYC's 2000 and 2010 population.

  8. Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?

  9. Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.

This entry was posted in Planning Methods. Bookmark the permalink.