For this lab, upload the Zip Code data from the previous lab. This should include the following files with the following dimensions. You can also download the data here.
dim(zbp02)
## [1] 2207 24
dim(zbp12)
## [1] 2125 22
The purpose of this lab is to reinforce the regression and visualization skills learned in lab 2 and lab 3 using the Zip Business Patterns data.
Start by summarizing the 2002 data, which you will use to predict job changes in 2012.
summary(zbp02)
## ZIP jobs_plus10 jobs.tot jobs.23
## Min. :15001 Min. : 2.5 Min. : 2.50 Min. : 0.0
## 1st Qu.:16010 1st Qu.: 71.0 1st Qu.: 72.25 1st Qu.: 2.5
## Median :17235 Median : 411.0 Median : 387.50 Median : 22.0
## Mean :17318 Mean : 2654.3 Mean : 2597.33 Mean : 138.6
## 3rd Qu.:18618 3rd Qu.: 2327.0 3rd Qu.: 2234.25 3rd Qu.: 126.5
## Max. :19980 Max. :79883.5 Max. :77848.50 Max. :4476.5
## NA's :82 NA's :40 NA's :40
## jobs.31 jobs.42 jobs.44 jobs.48
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 2.5 1st Qu.: 0.00
## Median : 34.5 Median : 7.5 Median : 31.0 Median : 7.00
## Mean : 357.2 Mean : 128.8 Mean : 377.0 Mean : 73.68
## 3rd Qu.: 341.5 3rd Qu.: 74.5 3rd Qu.: 273.2 3rd Qu.: 47.25
## Max. :10076.0 Max. :5025.5 Max. :10491.0 Max. :5435.00
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.51 jobs.52 jobs.53 jobs.54
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0.00 Median : 7.00 Median : 0.00 Median : 5.0
## Mean : 72.96 Mean : 155.58 Mean : 39.46 Mean : 158.4
## 3rd Qu.: 14.50 3rd Qu.: 50.25 3rd Qu.: 17.50 3rd Qu.: 57.0
## Max. :4203.50 Max. :15671.50 Max. :1739.50 Max. :22421.0
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.56 jobs.61 jobs.62 jobs.71
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 5.0 Median : 0.00 Median : 16.5 Median : 0.00
## Mean : 145.6 Mean : 66.51 Mean : 362.8 Mean : 41.55
## 3rd Qu.: 56.0 3rd Qu.: 8.50 3rd Qu.: 208.2 3rd Qu.: 17.00
## Max. :9158.5 Max. :3734.50 Max. :8389.5 Max. :2858.00
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.72 jobs.81 jobs.95 jobs.99
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.: 2.5 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 19.5 Median : 19.5 Median : 0.00 Median : 0.000
## Mean : 213.5 Mean : 139.4 Mean : 26.79 Mean : 1.355
## 3rd Qu.: 154.2 3rd Qu.: 114.8 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :5445.5 Max. :4624.5 Max. :1961.00 Max. :214.500
## NA's :40 NA's :40 NA's :40 NA's :40
## jobs.21 jobs.11 jobs.22 jobs.55
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0.000 Median : 0.00 Median : 0.00 Median : 0.0
## Mean : 8.461 Mean : 2.05 Mean : 18.45 Mean : 69.4
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 2.5
## Max. :1538.500 Max. :194.00 Max. :1785.50 Max. :6602.0
## NA's :40 NA's :40 NA's :40 NA's :40
As with the census tract data, not all of the Zip Codes match across the two years, resulting in missing data.
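A minimal sketch of how unmatched Zip Codes turn into missing values. The two small data frames (`y02`, `y12`) and their values are made up for illustration, and the left join with `merge()` is an assumption about how the years were combined, not necessarily how the lab data were built.

```r
# Hypothetical toy data: three Zip Codes per year, only two in common
y02 <- data.frame(ZIP = c(15001, 15003, 19104), jobs.tot = c(120, 45, 30000))
y12 <- data.frame(ZIP = c(15001, 19104, 19112), jobs_plus10 = c(110, 33000, 800))

# Left join: keep every 2002 Zip Code; unmatched ones get NA for 2012
both <- merge(y02, y12, by = "ZIP", all.x = TRUE)
both

# Count the Zip Codes with no 2012 match
sum(is.na(both$jobs_plus10))
```

Because `lm()` drops rows with missing values by default, rows like these account for the "observations deleted due to missingness" lines in the regression output.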
Plot the relationship between the number of jobs in 2002 and 2012.
plot(zbp02$jobs.tot, zbp02$jobs_plus10)
And look at the distribution of jobs across Zip Codes. Most Zip Codes have very few jobs, with a long tail of job-rich Zip Codes.
plot(density(zbp02$jobs_plus10, na.rm = T))
hist(zbp02$jobs_plus10)
Look at a specific Zip Code (19104).
plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP == 19104], zbp02$jobs.tot[zbp02$ZIP==19104], col="red")
And all Philadelphia Zip Codes, plus a few others. These Zip Codes do not look particularly different from others in PA.
plot(zbp02$jobs_plus10, zbp02$jobs.tot, col = "tan")
points(zbp02$jobs_plus10[zbp02$ZIP > 19018 & zbp02$ZIP < 19256],
zbp02$jobs.tot[zbp02$ZIP> 19018 & zbp02$ZIP < 19256], col="red")
Now, predict the 2012 jobs as a function of the number of jobs in 2002. The regression line fits the data almost perfectly.
reg1 <- lm(jobs_plus10 ~ jobs.tot, zbp02)
summary(reg1)
##
## Call:
## lm(formula = jobs_plus10 ~ jobs.tot, data = zbp02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7887.2 -158.2 -82.6 26.7 10332.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.743743 26.051260 3.176 0.00151 **
## jobs.tot 0.967799 0.004028 240.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1081 on 2083 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.9652, Adjusted R-squared: 0.9652
## F-statistic: 5.774e+04 on 1 and 2083 DF, p-value: < 2.2e-16
The R-squared statistic indicates that the number of jobs in 2002 predicts the number of jobs in 2012 quite well.
# Put the predictor (jobs.tot) on the x-axis so that abline(reg1) matches the fitted model
plot(zbp02$jobs.tot, zbp02$jobs_plus10, col = "tan")
abline(reg1)
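In a regression with a single predictor, R-squared is simply the squared correlation between the predictor and the outcome. The sketch below checks this on simulated data; the data frame `toy` and its parameters are invented stand-ins for `zbp02`, not the lab's values.

```r
# Simulated stand-in for zbp02 (hypothetical values)
set.seed(4)
n <- 300
toy <- data.frame(jobs.tot = rexp(n, 1 / 2500))
toy$jobs_plus10 <- 80 + 0.97 * toy$jobs.tot + rnorm(n, sd = 1000)

fit <- lm(jobs_plus10 ~ jobs.tot, data = toy)

# R-squared reported by lm() versus the squared correlation
r2 <- summary(fit)$r.squared
r2_from_cor <- cor(toy$jobs.tot, toy$jobs_plus10)^2
c(r2 = r2, r2_from_cor = r2_from_cor)
```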
It also looks like there might be some systematic differences in prediction quality by the number of jobs.
plot(predict(reg1), resid(reg1))
abline(h=0,col=3,lty=3)
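One rough way to put a number on what the residual plot suggests is to compare the spread of the residuals across groups of predicted values. This is only a sketch on simulated data in which the noise is made to grow with the number of jobs; `toy`, `size_group`, and `resid_sd` are hypothetical names.

```r
# Simulated stand-in for zbp02 with noise that grows with size (hypothetical)
set.seed(2)
n <- 1000
toy <- data.frame(jobs.tot = rexp(n, 1 / 2500))
toy$jobs_plus10 <- 80 + 0.97 * toy$jobs.tot +
  rnorm(n, sd = 50 + 0.2 * toy$jobs.tot)

fit <- lm(jobs_plus10 ~ jobs.tot, data = toy)

# Split observations into quartiles of predicted jobs
size_group <- cut(fitted(fit), quantile(fitted(fit), 0:4 / 4),
                  include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))

# Standard deviation of the residuals within each quartile
resid_sd <- tapply(resid(fit), size_group, sd)
resid_sd
```

If the residual standard deviation rises steadily from the first to the fourth quartile, the model predicts small Zip Codes more tightly than large ones.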
Next, try to see which types of jobs in 2002 are associated with job losses or job gains by 2012.
summary(
lm(jobs_plus10 ~jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
)
##
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
## jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
## jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
## jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705.6 -159.0 -51.4 39.6 9224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265 25.06739 1.905 0.056923 .
## jobs.23 1.48264 0.11445 12.954 < 2e-16 ***
## jobs.31 0.86957 0.03842 22.633 < 2e-16 ***
## jobs.42 1.75718 0.11463 15.329 < 2e-16 ***
## jobs.44 0.90575 0.05998 15.100 < 2e-16 ***
## jobs.48 0.66433 0.10652 6.237 5.41e-10 ***
## jobs.51 0.68900 0.13370 5.153 2.80e-07 ***
## jobs.52 0.59305 0.06731 8.811 < 2e-16 ***
## jobs.53 2.82325 0.31606 8.933 < 2e-16 ***
## jobs.54 1.51710 0.07039 21.552 < 2e-16 ***
## jobs.56 0.17040 0.08420 2.024 0.043112 *
## jobs.61 1.47967 0.11901 12.433 < 2e-16 ***
## jobs.62 0.85000 0.04409 19.281 < 2e-16 ***
## jobs.71 0.75343 0.20036 3.760 0.000174 ***
## jobs.72 1.52875 0.12252 12.478 < 2e-16 ***
## jobs.81 0.44943 0.18831 2.387 0.017095 *
## jobs.95 0.74847 0.19179 3.903 9.82e-05 ***
## jobs.99 5.25699 4.88099 1.077 0.281591
## jobs.21 1.13877 0.36736 3.100 0.001962 **
## jobs.11 3.63400 2.12568 1.710 0.087495 .
## jobs.22 0.88365 0.22376 3.949 8.11e-05 ***
## jobs.55 0.11145 0.11360 0.981 0.326664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.9714, Adjusted R-squared: 0.9711
## F-statistic: 3339 on 21 and 2063 DF, p-value: < 2.2e-16
Most of the parameter estimates are statistically different from zero with a high degree of confidence, but that is really not a very useful finding. You really want to know whether the estimates for each job type are statistically different from 1 (i.e., are certain job types more or less likely to be decreasing over time?). You can roughly approximate this by looking at the coefficient estimate and the standard error. If the coefficient estimate plus or minus two standard errors crosses the number 1, then the estimate is not statistically different from one at roughly the 95% confidence level.
You can also set up the regression with the total number of jobs in 2002 as an offset, so that each coefficient estimate is tested against 1 rather than 0.
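The rule of thumb above can be made exact with a t test against 1 and with `confint()`. The sketch below uses simulated data with two made-up sector columns; `toy`, `t_vs_1`, and `p_vs_1` are hypothetical names, and the true slopes (1.5 and 0.9) are chosen only for illustration.

```r
# Simulated stand-in for zbp02 (hypothetical values)
set.seed(1)
n <- 500
toy <- data.frame(jobs.23 = rexp(n, 1 / 100), jobs.31 = rexp(n, 1 / 300))
toy$jobs_plus10 <- 50 + 1.5 * toy$jobs.23 + 0.9 * toy$jobs.31 +
  rnorm(n, sd = 100)

fit <- lm(jobs_plus10 ~ jobs.23 + jobs.31, data = toy)
est <- coef(summary(fit))[-1, "Estimate"]    # drop the intercept row
se  <- coef(summary(fit))[-1, "Std. Error"]

# t statistic and two-sided p-value for H0: coefficient equals 1
t_vs_1 <- (est - 1) / se
p_vs_1 <- 2 * pt(abs(t_vs_1), df = df.residual(fit), lower.tail = FALSE)

# 95% confidence intervals; an interval that excludes 1 rejects H0 at the 5% level
ci <- confint(fit)[-1, ]
data.frame(est, se, t_vs_1, p_vs_1, lower = ci[, 1], upper = ci[, 2],
           differs_from_1 = ci[, 1] > 1 | ci[, 2] < 1)
```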
summary(
lm(jobs_plus10 ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02, offset= 1.00*jobs.tot)
)
##
## Call:
## lm(formula = jobs_plus10 ~ jobs.23 + jobs.31 + jobs.42 + jobs.44 +
## jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 + jobs.56 +
## jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 + jobs.95 +
## jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, data = zbp02,
## offset = 1 * jobs.tot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705.6 -159.0 -51.4 39.6 9224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265 25.06739 1.905 0.056923 .
## jobs.23 0.48264 0.11445 4.217 2.58e-05 ***
## jobs.31 -0.13043 0.03842 -3.395 0.000699 ***
## jobs.42 0.75718 0.11463 6.605 5.03e-11 ***
## jobs.44 -0.09425 0.05998 -1.571 0.116262
## jobs.48 -0.33567 0.10652 -3.151 0.001650 **
## jobs.51 -0.31100 0.13370 -2.326 0.020107 *
## jobs.52 -0.40695 0.06731 -6.046 1.76e-09 ***
## jobs.53 1.82325 0.31606 5.769 9.20e-09 ***
## jobs.54 0.51710 0.07039 7.346 2.93e-13 ***
## jobs.56 -0.82960 0.08420 -9.853 < 2e-16 ***
## jobs.61 0.47967 0.11901 4.031 5.77e-05 ***
## jobs.62 -0.15000 0.04409 -3.402 0.000681 ***
## jobs.71 -0.24657 0.20036 -1.231 0.218608
## jobs.72 0.52875 0.12252 4.316 1.67e-05 ***
## jobs.81 -0.55057 0.18831 -2.924 0.003497 **
## jobs.95 -0.25153 0.19179 -1.311 0.189836
## jobs.99 4.25699 4.88099 0.872 0.383223
## jobs.21 0.13877 0.36736 0.378 0.705657
## jobs.11 2.63400 2.12568 1.239 0.215437
## jobs.22 -0.11635 0.22376 -0.520 0.603127
## jobs.55 -0.88855 0.11360 -7.822 8.22e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.9714, Adjusted R-squared: 0.9711
## F-statistic: 3339 on 21 and 2063 DF, p-value: < 2.2e-16
Each job in Real Estate in a Zip Code in 2002 (sector 53) correlates with another 1.82 jobs in 2012. Better yet, use the types of jobs in 2002 to predict the net change in jobs from 2002 to 2012.
reg2 <- lm(jobs_plus10-jobs.tot ~ jobs.23 +jobs.31 +jobs.42 +jobs.44 + jobs.48+ jobs.51 + jobs.52 +
jobs.53 + jobs.54 +jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 +
jobs.81 + jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55, zbp02)
summary(reg2)
##
## Call:
## lm(formula = jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.42 +
## jobs.44 + jobs.48 + jobs.51 + jobs.52 + jobs.53 + jobs.54 +
## jobs.56 + jobs.61 + jobs.62 + jobs.71 + jobs.72 + jobs.81 +
## jobs.95 + jobs.99 + jobs.21 + jobs.11 + jobs.22 + jobs.55,
## data = zbp02)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705.6 -159.0 -51.4 39.6 9224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.75265 25.06739 1.905 0.056923 .
## jobs.23 0.48264 0.11445 4.217 2.58e-05 ***
## jobs.31 -0.13043 0.03842 -3.395 0.000699 ***
## jobs.42 0.75718 0.11463 6.605 5.03e-11 ***
## jobs.44 -0.09425 0.05998 -1.571 0.116262
## jobs.48 -0.33567 0.10652 -3.151 0.001650 **
## jobs.51 -0.31100 0.13370 -2.326 0.020107 *
## jobs.52 -0.40695 0.06731 -6.046 1.76e-09 ***
## jobs.53 1.82325 0.31606 5.769 9.20e-09 ***
## jobs.54 0.51710 0.07039 7.346 2.93e-13 ***
## jobs.56 -0.82960 0.08420 -9.853 < 2e-16 ***
## jobs.61 0.47967 0.11901 4.031 5.77e-05 ***
## jobs.62 -0.15000 0.04409 -3.402 0.000681 ***
## jobs.71 -0.24657 0.20036 -1.231 0.218608
## jobs.72 0.52875 0.12252 4.316 1.67e-05 ***
## jobs.81 -0.55057 0.18831 -2.924 0.003497 **
## jobs.95 -0.25153 0.19179 -1.311 0.189836
## jobs.99 4.25699 4.88099 0.872 0.383223
## jobs.21 0.13877 0.36736 0.378 0.705657
## jobs.11 2.63400 2.12568 1.239 0.215437
## jobs.22 -0.11635 0.22376 -0.520 0.603127
## jobs.55 -0.88855 0.11360 -7.822 8.22e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 984.3 on 2063 degrees of freedom
## (122 observations deleted due to missingness)
## Multiple R-squared: 0.2035, Adjusted R-squared: 0.1954
## F-statistic: 25.1 on 21 and 2063 DF, p-value: < 2.2e-16
Note how all the parameter estimates are the same as in the offset model, but there is now a much lower and much more useful R-squared. Instead of showing that Zip Codes with a lot of jobs in 2002 tend to have a lot of jobs in 2012, the model now shows that the number and types of jobs in each Zip Code can be used to predict whether a Zip Code gained or lost jobs. Instead of explaining 97% of the variation in the relationship, the model now claims to explain about 20%.
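The reason the estimates line up is that total jobs is the sum of the sector jobs, so subtracting the total from the outcome lowers every slope by exactly 1 while leaving the intercept and the residuals unchanged. The sketch below shows this on simulated data with two made-up sectors; `toy`, `level`, and `change` are hypothetical names.

```r
# Simulated stand-in for zbp02 (hypothetical values)
set.seed(5)
n <- 500
toy <- data.frame(jobs.23 = rexp(n, 1 / 100), jobs.31 = rexp(n, 1 / 300))
toy$jobs.tot <- toy$jobs.23 + toy$jobs.31   # total is the sum of the sectors
toy$jobs_plus10 <- 50 + 1.5 * toy$jobs.23 + 0.9 * toy$jobs.31 +
  rnorm(n, sd = 100)

level  <- lm(jobs_plus10 ~ jobs.23 + jobs.31, data = toy)
change <- lm(jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31, data = toy)

# Intercepts match; each slope in the level model is larger by exactly 1
coef(level) - coef(change)

# Identical residuals, so identical standard errors
all.equal(resid(level), resid(change))

# R-squared differs because the outcome being explained has changed
c(level = summary(level)$r.squared, change = summary(change)$r.squared)
```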
The distribution of this dependent variable also looks a lot more normal with most Zip Codes not having changed very much.
plot(density(zbp02$jobs_plus10- zbp02$jobs.tot, na.rm=T))
The residual plot also looks a bit more homoscedastic.
plot(predict(reg2), resid(reg2))
abline(h=0,col=3,lty=3)
EXERCISE
1. Try to make the most parsimonious model that does a good job of predicting the change in jobs from 2002 to 2012. Use a full model vs. reduced model ANOVA test to compare this model to the fully specified model (with all job types) and describe the results.
2. Try predicting the percent change in jobs instead of the absolute change. Compare and contrast the two models. Which model do you prefer and why? Hint: use the identity function, I(), to generate the percent change variable inside of your regression: I(jobs_plus10/jobs.tot-1). Input industry codes as a percentage of total jobs: I(jobs.23/jobs.tot).
3. What types of jobs are the best indicators of job growth and job losses? Describe the quantitative relationship as expressed by the two models in question 2. Does this contradict or concur with your expectations?
4. Look at the differences in jobs in sector 99 across the two time periods. What do you think is happening?
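For the first two exercises, the syntax might look like the following sketch. It runs on simulated data with three made-up sectors, so the results say nothing about the real Zip Codes; `toy`, `full`, `reduced`, and `pct` are hypothetical names.

```r
# Simulated stand-in for zbp02 (hypothetical values; +1 keeps totals above zero)
set.seed(3)
n <- 400
toy <- data.frame(jobs.23 = rexp(n, 1 / 100) + 1,
                  jobs.31 = rexp(n, 1 / 300) + 1,
                  jobs.99 = rexp(n, 1 / 5) + 1)
toy$jobs.tot <- toy$jobs.23 + toy$jobs.31 + toy$jobs.99
toy$jobs_plus10 <- 40 + 1.5 * toy$jobs.23 + 0.9 * toy$jobs.31 +
  1.0 * toy$jobs.99 + rnorm(n, sd = 80)

# Full vs. reduced model F test; both models must use the same rows
full    <- lm(jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31 + jobs.99, data = toy)
reduced <- lm(jobs_plus10 - jobs.tot ~ jobs.23 + jobs.31, data = toy)
cmp <- anova(reduced, full)
cmp

# Percent change on sector shares; one share is left out because
# the shares sum to 1 and would be collinear with the intercept
pct <- lm(I(jobs_plus10 / jobs.tot - 1) ~ I(jobs.23 / jobs.tot) +
            I(jobs.31 / jobs.tot), data = toy)
coef(pct)
```

A large p-value in the `anova()` table suggests the dropped variables add little, which supports the more parsimonious model. With the real data, Zip Codes where jobs.tot is zero would need separate handling before computing percent changes.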