6. Multivariate Regression: Introduction
Introduction
A simple regression allows us to study the relationship between one independent variable and a dependent variable.
Simple regression: Y = β₀ + β₁X
Where:
- β₀ = Intercept
- β₁ = Coefficient (slope) of X
- Y = Dependent variable
- X = Independent variable
With multivariate regression, we can use more than one variable to model the dependent variable:
Multivariate Regression: Y = β₀ + β₁X₁ + β₂X₂
Where:
- β₀ = Intercept
- β₁ = Coefficient (slope) of X₁
- β₂ = Coefficient (slope) of X₂
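To make the equation concrete, here is a minimal sketch of fitting a two-variable regression by ordinary least squares with NumPy. The data and the "true" coefficients (2, 0.5, and -1.5) are made up for illustration and have nothing to do with the course dataset; Y is generated without noise so the fit should recover those coefficients.

```python
import numpy as np

# Made-up data for illustration only (not from the course dataset).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Assumed "true" coefficients, used to generate Y without any noise.
TRUE_B0, TRUE_B1, TRUE_B2 = 2.0, 0.5, -1.5
y = TRUE_B0 + TRUE_B1 * x1 + TRUE_B2 * x2

# Design matrix: a column of ones for the intercept, then X1 and X2.
design = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: find the coefficients that minimize squared error.
coefs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coefs
print(f"intercept={b0:.2f}, slope_x1={b1:.2f}, slope_x2={b2:.2f}")
```

With real data the fit would not be exact, and statistical software would also report p-values and confidence intervals alongside each coefficient.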
Example: Support for Trump in California Counties, 2016
So far, we have been using the example of a regression with data from California at the county level, with support for Trump in 2016 as the dependent variable and percent white as the independent variable. From now on, let us add one more variable: the change in household income between 2012 and 2016. The intuition is simple: in counties where household income decreased from 2012 to 2016, voters are more likely to support Trump than in counties where household income increased over the same period. After all, Barack Obama (a Democrat) was president during those years. In theory, we should expect people who lived in counties that lost income to be more likely to support a Republican candidate.
The regression equation then becomes:
Pct Trump = β₀ + β₁ * Pct White + β₂ * Inc Change
Where income change is measured in percentage points.
After running the regression, we get these results:
| Variable | Coefficient | P-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Percent White | .3 | .002 | .11 | .48 |
| Income Change | -1.56 | .000 | -2.43 | -.787 |
| Intercept (constant) | 22 | .000 | 11.4 | 32 |
Notice the p-values: they tell us that we can be confident that there is a negative correlation between income change and voting for Trump (where income increased more, support for Trump was lower). They also tell us that we can be confident that there is a positive correlation between how white a county is and how much it supports Trump (Trump received a larger share of the vote in counties with a higher percentage of white residents).
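As a rough sketch of how one might read these results programmatically, the snippet below copies the numbers from the table and checks the direction of each coefficient, whether its p-value falls below a threshold, and whether its confidence interval excludes zero. The 0.05 threshold and the helper name `summarize` are assumptions made for illustration, not something the regression output specifies.

```python
# Regression output copied from the results table.
# A p-value shown as .000 rounds to zero at three decimals; it is not exactly zero.
results = {
    "Percent White": {"coef": 0.3, "p": 0.002, "ci": (0.11, 0.48)},
    "Income Change": {"coef": -1.56, "p": 0.000, "ci": (-2.43, -0.787)},
    "Intercept (constant)": {"coef": 22.0, "p": 0.000, "ci": (11.4, 32.0)},
}

ALPHA = 0.05  # assumed conventional significance threshold


def summarize(row):
    """Describe one row: sign of the estimate and two significance checks."""
    lower, upper = row["ci"]
    return {
        "direction": "positive" if row["coef"] > 0 else "negative",
        "significant": row["p"] < ALPHA,
        "ci_excludes_zero": lower > 0 or upper < 0,
    }


summary = {name: summarize(row) for name, row in results.items()}
for name, info in summary.items():
    print(name, info)
```

The two checks agree here: a 95% confidence interval that excludes zero goes together with a p-value below .05.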
A regression output like this one tells us that our regression equation is:
Pct Trump = 22 + .3*Pct White - 1.56*Inc Change
This equation allows us to go beyond being confident that there are correlations between race, income change, and voting for Trump. It allows us to quantify these correlations and make predictions for the Trump vote based on data on race and income change. A few examples:
- If a county in California is 50% white and income increased by 1% between 2012 and 2016, we should expect support for Trump in this county to be...
Pct Trump = 22 + .3*50 - 1.56*1 = 35.44
- If a county in California is 90% white and income decreased by 2% between 2012 and 2016, we should expect support for Trump in this county to be...
Pct Trump = 22 + .3*90 - 1.56*(-2) = 52.12
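The two predictions above can be reproduced with a small function. This is only a sketch: the function and constant names are made up for illustration, while the coefficients come from the regression output.

```python
# Coefficients taken from the regression output in this section.
INTERCEPT = 22.0
SLOPE_PCT_WHITE = 0.3
SLOPE_INC_CHANGE = -1.56


def predict_pct_trump(pct_white, inc_change):
    """Predicted 2016 Trump vote share (in percent) for a California county."""
    return INTERCEPT + SLOPE_PCT_WHITE * pct_white + SLOPE_INC_CHANGE * inc_change


# County that is 50% white, with income up by 1 point.
print(round(predict_pct_trump(50, 1), 2))   # 35.44

# County that is 90% white, with income down by 2 points.
print(round(predict_pct_trump(90, -2), 2))  # 52.12
```

Note that a decrease in income enters as a negative number, so the negative slope turns it into a positive contribution to the prediction.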
Interpreting Regression Slopes
Remember how in a simple regression, we interpreted the slope by simply stating that "for an increase of __ in X, we should expect an increase of __ in Y." In a multivariate regression, it's a bit more complicated.
Our results are telling us that for an increase of one percentage point in income change, we should expect support for Trump to decrease by 1.56 percentage points. However, this prediction only holds if we are comparing between counties with the same percentage of white residents.
Because of this, when interpreting slopes in multivariate regression, we have to be more careful. This is how we interpret the slopes for percent White and income change:
- In California counties, for each increase of 1 percentage point in the White population, we should expect the 2016 Trump vote to increase by .3 percentage points, holding constant the income change.
- In California counties, for each increase of 1 percentage point in income change, we should expect the 2016 Trump vote to decrease by 1.56 percentage points, holding constant the percentage of White residents.
In a multivariate regression, the estimated coefficient of a variable represents the effect of that variable, holding constant all other variables in the regression. We can also express this by saying "all else equal", as in "all else equal, the effect of X on Y is ______."
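One way to see what "holding constant" means is to compare predictions for two hypothetical counties that differ in only one variable. The sketch below uses the estimated equation from this section; the specific county values are arbitrary and chosen only for illustration.

```python
def predict_pct_trump(pct_white, inc_change):
    """Estimated equation: Pct Trump = 22 + .3*Pct White - 1.56*Inc Change."""
    return 22 + 0.3 * pct_white - 1.56 * inc_change


# Raise percent white by 1 point while income change stays fixed.
# The gap is the same whatever the fixed level of income change is.
white_effects = [
    round(predict_pct_trump(61, inc) - predict_pct_trump(60, inc), 2)
    for inc in (-2, 0, 3)
]
print(white_effects)

# Raise income change by 1 point while percent white stays fixed.
income_effects = [
    round(predict_pct_trump(white, 1) - predict_pct_trump(white, 0), 2)
    for white in (30, 60, 90)
]
print(income_effects)
```

Each list repeats a single value, the slope of the variable that changed, which is exactly the "all else equal" reading of a coefficient.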
Implications for Understanding Confounding Variables
We can use multivariate regressions to approximate experiments by taking confounding variables into consideration, which strengthens our results. Consider a regression that estimates the association between Clean Energy Research and CO2 Emissions, holding wealth constant:
CO2 Emissions = β₀ + β₁ * Clean Energy Research + β₂ * Wealth
If a regression like this one were to give us a positive, statistically significant slope for Clean Energy Research, we would be able to brush aside the concern that Wealth is acting as a confounding variable. A result like this would be saying that clean energy research is associated with more CO2 emissions, holding constant the level of wealth. In other words, it would mean that more clean energy research means more CO2 emissions, even when we compare between countries with similar levels of wealth. However, it would still be difficult to make the causal claim that investing in clean energy research leads to more CO2 emissions. After all, wealth is not the only variable that could be acting as a confound.
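To illustrate why holding a possible confounder constant can matter, here is a small simulation on entirely synthetic data. Everything in it is an assumption made for the sake of the example: the data-generating process (wealth raises both research and emissions, while research itself lowers emissions), the sample size, and the helper name `ols`. It says nothing about what real country data would show.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 2000                        # number of simulated "countries"

# Invented data-generating process (an assumption, not real data):
# wealth drives both research and emissions; research lowers emissions.
wealth = rng.normal(50, 10, n)
research = 0.5 * wealth + rng.normal(0, 5, n)
emissions = 2.0 * wealth - 1.0 * research + rng.normal(0, 5, n)


def ols(y, *predictors):
    """Return OLS coefficients: intercept first, then one slope per predictor."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coefs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


# Simple regression: emissions on research only (wealth is left out).
simple_slope = ols(emissions, research)[1]

# Multivariate regression: emissions on research, holding wealth constant.
controlled_slope = ols(emissions, research, wealth)[1]

print(f"slope without wealth: {simple_slope:.2f}")
print(f"slope holding wealth constant: {controlled_slope:.2f}")
```

In this made-up scenario the research slope is positive when wealth is left out and negative once wealth is included, because the simple regression attributes the effect of wealth to research. A multivariate regression can only adjust for the confounders that are actually in the equation.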
Alex and Leo discuss this in the Observational Research Design video. In the next video, we discuss it again. We think this repetition is worth it: understanding how multivariate regression relates to confounding variables is a crucial skill. We hope that this module will help you consolidate this knowledge.