Now that you have uploaded and examined the UN data on central African cities, we are going to do some ordinary least squares regressions (OLS) to predict Libreville’s population. For a review of OLS, please refer to class readings:

An Introduction to Statistical Learning, chapter 3

Video lectures available here (also, chapter 3)

Tufte’s Data Analysis for Politics and Policy ($2 for ebook)

Let’s start by looking at the population over time:

`plot(pop$Year, pop$Libreville)`

It looks like we have a pretty linear trend from 1950 to 1975, with a steeper but still linear trend after 1975. If we predict the population growth by year using linear OLS and the full set of data, we’ll get a line that minimizes the squared distance between all the points in each year. In R, we can use the lm function (short for linear model) to estimate the slope and intercept of the line.

```
lbv <- lm(pop$Libreville ~ pop$Year)
summary(lbv)
```

```
##
## Call:
## lm(formula = pop$Libreville ~ pop$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.81 -24.46 -4.11 21.42 91.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.185e+04 1.314e+03 -16.62 3.84e-09 ***
## pop$Year 1.116e+01 6.637e-01 16.82 3.39e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.77 on 11 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9626, Adjusted R-squared: 0.9592
## F-statistic: 283 on 1 and 11 DF, p-value: 3.389e-09
```

What do the estimates mean?

The predicted population (the line that minimizes the squared distance between the points) is equal to the intercept plus the year times the estimated coefficent for year.

So if we want to know the estimated population in 2000, we could estimate it manually as -21848 + 2000 * 11.16. We can also use the outputs from the regression to save the trouble of entering the numbers manually.

`summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*2000`

`## [1] 481.7582`

Or we could predict each year in the data:

`summary(lbv)$coefficients[1] + summary(lbv)$coefficients[2]*pop$Year`

```
## [1] -76.48352 -20.65934 35.16484 90.98901 146.81319 202.63736 258.46154
## [8] 314.28571 370.10989 425.93407 481.75824 537.58242 593.40659 649.23077
## [15] 705.05495 760.87912 816.70330
```

Or look at the predictions next to the data (note how I use the predict funtion instead of the manual estimation from above).

`cbind(pop$Year, pop$Libreville, predict(lbv, newdata = pop))#NB when you see a new command like cbind or predict, don't forget to learn more about it: ?cbind`

```
## [,1] [,2] [,3]
## 1 1950 15 -76.48352
## 2 1955 17 -20.65934
## 3 1960 29 35.16484
## 4 1965 49 90.98901
## 5 1970 77 146.81319
## 6 1975 155 202.63736
## 7 1980 234 258.46154
## 8 1985 293 314.28571
## 9 1990 366 370.10989
## 10 1995 439 425.93407
## 11 2000 496 481.75824
## 12 2005 559 537.58242
## 13 2010 631 593.40659
## 14 2015 NA 649.23077
## 15 2020 NA 705.05495
## 16 2025 NA 760.87912
## 17 2030 NA 816.70330
```

Is this a good predictor of population from 2015 to 2030?

The R2 from the regression indicates that our model explains 96% of the variation in population. That sounds pretty good, but is it really a good predictor of future population? To try to answer that question, let’s start by looking at the regression line?

```
plot(pop$Year, pop$Libreville)
abline(lbv)
```

The fit looks decent, but we also appear to be systematically underpredicting the population before 1960, overpredicting between 1965 and 1990, and underpredicting again after 1990. We also predict negative population before 1995. This is quite apparent, when we plot the predicted values against the error terms. (Remember that we want our error terms to be uncorrelated and normally distributed around a mean of zero.)

```
plot(predict(lbv), resid(lbv))
abline(h=0)
```

Do you expcet a population prediction for 2020 based on the trend line from 1955 to 2010 to overestimate or underestimate population? Why?

If you guessed underpredict, then I agree. While we don’t necessarily know that the ongoing trend will continue, we’re making increasinly bad estimates from 1995 and on. If you knew Libreville, you would also probably not guess that population growth is going to slow in the near future. *****

Another way that we might test our model is to see how well our general modeling framework predicts values outside of the sample used in the regression. For cross-sectional, we might fit our model on a random sample of our data and then see how well it predicts other data.

Since we are predicting growth trends, however, it makes more sense to intentionally remove the most recent datapoints and see how well a model using older data predicts them.

```
lbv_back <- lm(Libreville ~ Year, subset(pop, pop$Year < 1995))
plot(pop$Year, pop$Libreville)
abline(lbv_back)
```

This does not look nearly so good. The R2 of lbv_back suggests that we explain 90% of the variance in the data, but the prediction error for the predictor years is much worse. We can compare the mean squared error of the new data to the old data.

`mean(resid(lbv_back)^2)`

`## [1] 1438.128`

`mean((pop$Libreville[pop$Year > 1994 & pop$Year < 2015] - predict(lbv_back, newdata = subset(pop, pop$Year > 1994 & pop$Year < 2015)))^2)`

`## [1] 9702.728`

Our average squared prediction error for the 1995 to 2010 is seven times higher than the prediction error between 1955 and 1990. In fact, we would have better predictions of the population from 1995 to 2010, just by taking the average population between 1995 and 2010 and calling it a day.

There are a few ways that we might try to get better predictions.

The data appear to show two distinct trends, one before and one after 1970. We could estimate a regression model using just the more recent data:

```
plot(pop$Year, pop$Libreville)
lbv <- lm(Libreville ~ Year, subset(pop, pop$Year > 1970))
summary(lbv)
```

```
##
## Call:
## lm(formula = Libreville ~ Year, data = subset(pop, pop$Year >
## 1970))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6667 -3.5595 -0.9524 3.5060 8.8095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.636e+04 3.534e+02 -74.58 3.91e-10 ***
## Year 1.343e+01 1.774e-01 75.70 3.58e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.747 on 6 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.999, Adjusted R-squared: 0.9988
## F-statistic: 5731 on 1 and 6 DF, p-value: 3.576e-10
```

`abline(lbv)`

Though not necessarily well-suited to these data, adding a quadratic terms can often improve model fit:

```
lbv2 <- lm(pop$Libreville ~ pop$Year + I(pop$Year^2))
summary(lbv2)
```

```
##
## Call:
## lm(formula = pop$Libreville ~ pop$Year + I(pop$Year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.701 -11.787 6.736 15.260 29.637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.190e+05 8.743e+04 4.792 0.000733 ***
## pop$Year -4.341e+02 8.832e+01 -4.915 0.000609 ***
## I(pop$Year^2) 1.124e-01 2.230e-02 5.042 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.95 on 10 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9873
## F-statistic: 468.3 on 2 and 10 DF, p-value: 1.315e-10
```

```
plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014], predict(lbv2), col=3)
```

Or we could try predicting Libreville’s population as a function of a larger geography, such as all Sub-Saharan African cities over 300,000.

```
plot(pop$Sub_Saharan_300kplus, pop$Libreville)
summary(lm(Libreville~Sub_Saharan_300kplus,pop))
```

```
##
## Call:
## lm(formula = Libreville ~ Sub_Saharan_300kplus, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.44 -27.04 -23.48 41.54 51.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.016224 15.931117 1.382 0.194
## Sub_Saharan_300kplus 0.004803 0.000242 19.850 5.79e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.14 on 11 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9728, Adjusted R-squared: 0.9704
## F-statistic: 394 on 1 and 11 DF, p-value: 5.795e-10
```

`abline(lm(Libreville~Sub_Saharan_300kplus,pop))`

Or we could take advantage of the knowledge of the population in the year immediately before our prediction. Our previous models ignored the fact that the 2010 data is a lot more important than any of the other years of data for predicting the population in the future.

First, we need to create a new time-lagged population variable. Walk through the code slowly using the command head(pop_n1) between each line of code to see exactly what you are doing.

```
pop_n1 <- pop[c("Libreville", "Year")]
pop_n1$Year <- pop_n1$Year + 5
pop_n1$Libreville_m5 <- pop_n1$Libreville
pop_n1 <- pop_n1[c(2:3)]
pop_n1 <- subset(pop_n1, pop_n1$Year < 2015)
pop <- merge(pop, pop_n1, by = "Year", all.x = T)
```

Now predict Libreville’s population using the population 5 years earlier and the year.

```
lbv3 <- lm(Libreville ~ Libreville_m5 + Year, pop)
summary(lbv3)
```

```
##
## Call:
## lm(formula = Libreville ~ Libreville_m5 + Year, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.877 -6.747 -1.215 8.818 15.653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.271e+03 1.961e+03 -4.726 0.00108 **
## Libreville_m5 6.705e-01 9.021e-02 7.432 3.96e-05 ***
## Year 4.740e+00 9.995e-01 4.742 0.00106 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.79 on 9 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9972, Adjusted R-squared: 0.9966
## F-statistic: 1601 on 2 and 9 DF, p-value: 3.271e-12
```

This model explains 99.7% of the variance in the data an indicates that the population in 2015 is equal to the population in 2010 x .67 + 2010 (the year) x 4.74. To estimate the population in 2020, you will need to add the 2015 population estimate.

```
plot(pop$Year, pop$Libreville)
lines(pop$Year[pop$Year < 2014 & pop$Year > 1950], predict(lbv3), col=3)
```

EXERCISE Download population data for New York City and read it into R.

- Plot New York City’s population from 1790 to 2010 and describe the major trends.
- Based on these trends and your knowledge of NYC, describe what you think NYC’s population will be in 2020 and why.
- Use a linear regression to make a prediction of New York’s 2020 population. Is the prediction any good? Why?
- Now try a linear regression using your preferred subset of the New York population data (e.g., 1970 to 2010).
- Try a quadratic regression.
- Now try a cubic function (hint: add the term I(pop^3) to the quadratic regression).
- Use earlier data (i.e., excluding the prediction years) to predict NYC’s 2000 and 2010 population. Describe the predictions in relationship to the earlier estimates.
- Compare your predicted 2020 populations from 4, 5, and 6. Which do you think is the best predictor? Why?
- Plot the predicted values from your preferred model against the error. Do the errors appear to be random? Describe any patterns that you see.
- Create a time-lagged population variable and use it to predict New York’s population. What is the expected population in 2020?