Multiple Regression
Data-Based Economics, ESCP, 2024-2025
The problem
Remember dataset from last time
| | type | income | education | prestige |
|---|---|---|---|---|
| accountant | prof | 62 | 86 | 82 |
| pilot | prof | 72 | 76 | 83 |
| architect | prof | 75 | 92 | 90 |
| author | prof | 55 | 90 | 76 |
| chemist | prof | 64 | 86 | 90 |
- Last week we “ran” a linear regression: $\text{income}_i = \alpha + \beta\, \text{education}_i + \epsilon_i$
- Result: $\widehat{\text{income}}_i = 10.60 + 0.59\, \text{education}_i$
- Should we have looked at “prestige” instead?
- Which one is better?
Prestige or Education
- if the goal is to predict: use the one with the higher explained variance
  - prestige has the higher $R^2$ ($0.702$ vs $0.525$)
- unless we are interested in the effect of education itself
Multiple regression
- What about using both?
- 2 variables model: $\text{income}_i = \alpha + \beta_1\, \text{education}_i + \beta_2\, \text{prestige}_i + \epsilon_i$
  - will probably improve prediction power (explained variance)
  - each coefficient might not be meaningful on its own anymore (education and prestige are correlated)
Fitting a model
Now we are trying to fit a plane to a cloud of points.
Minimization Criterion
- Take all observations: $(x_{1,i}, x_{2,i}, y_i)_{i=1,\dots,n}$
- Objective: sum of squares $L(\alpha, \beta_1, \beta_2) = \sum_{i=1}^n \left( y_i - (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}) \right)^2$
- Minimize the loss function in $\alpha$, $\beta_1$, $\beta_2$
- Again, we can perform numerical optimization (machine learning approach)
- … but there is an explicit formula
Ordinary Least Squares
- Matrix version (look for $\beta$): $Y = X\beta + \epsilon$
  - Note that the constant can be interpreted as a “variable” (a column of ones in $X$)
- Loss function: $L(\beta) = (Y - X\beta)'(Y - X\beta)$
- Result of minimization: $\hat{\beta} = (X'X)^{-1} X'Y$
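A minimal numpy sketch of this closed-form formula, assuming a pandas DataFrame `df` holding the occupations data above, with columns `income`, `education` and `prestige`:

```python
import numpy as np

# build the regressor matrix X and the target Y
X = np.column_stack([
    np.ones(len(df)),            # the constant, treated as a "variable"
    df['education'].to_numpy(),
    df['prestige'].to_numpy(),
])
Y = df['income'].to_numpy()

# beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # [intercept, coefficient on education, coefficient on prestige]
```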
Solution
- Result: estimates $\hat{\alpha}$, $\hat{\beta}_1$, $\hat{\beta}_2$ of the two-variable model
- Questions:
  - is it a better regression than the other ones?
  - is the coefficient in front of education significant?
  - how do we interpret it?
  - can we build confidence intervals?
Explained Variance
Explained Variance
As in the 1d case we can compare:
- the variability of the model predictions ($Var(\hat{y}_i)$) to the variability of the data ($Var(y_i)$)
- Coefficient of determination (same formula): $R^2 = \frac{Var(\hat{y}_i)}{Var(y_i)}$
- Or: $R^2 = 1 - \frac{Var(y_i - \hat{y}_i)}{Var(y_i)}$
- where $\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1,i} + \hat{\beta}_2 x_{2,i}$ is the model prediction
Adjusted R squared
Fact:
- adding more regressors always improves $R^2$
- why not throw everything in? (kitchen sink regressions)
- too many regressors: overfitting the data

Penalise additional regressors: adjusted $R^2$

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

Where:
- $n$: number of observations
- $p$: number of variables (regressors, excluding the constant)
In our example:
| Regression | $R^2$ | $\bar{R}^2$ |
|---|---|---|
| education | 0.525 | 0.514 |
| prestige | 0.702 | 0.695 |
| education + prestige | 0.7022 | 0.688 |
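A sketch reproducing this comparison with statsmodels; it assumes the Duncan occupational prestige data can be downloaded from the R `carData` package (an assumption about the data source, and it requires an internet connection):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Duncan occupational prestige data (occupation, type, income, education, prestige)
df = sm.datasets.get_rdataset("Duncan", "carData").data

for formula in ["income ~ education",
                "income ~ prestige",
                "income ~ education + prestige"]:
    res = smf.ols(formula, df).fit()
    print(f"{formula:30s} R2 = {res.rsquared:.3f}   adj. R2 = {res.rsquared_adj:.3f}")
```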
Interpretation and variable change
Making a regression with statsmodels
import statsmodels
We use a special API inspired by R:
import statsmodels.formula.api as smf
Performing a regression
- Running a regression with statsmodels:

model = smf.ols('income ~ education', df)   # define the model
res = model.fit()                           # perform the regression
res.summary()                               # display the regression table
- ‘income ~ education’ is the model formula
OLS Regression Results
==============================================================================
Dep. Variable: income R-squared: 0.525
Model: OLS Adj. R-squared: 0.514
Method: Least Squares F-statistic: 47.51
Date: Tue, 02 Feb 2021 Prob (F-statistic): 1.84e-08
Time: 05:21:25 Log-Likelihood: -190.42
No. Observations: 45 AIC: 384.8
Df Residuals: 43 BIC: 388.5
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
==============================================================================
Intercept 10.6035 5.198 2.040 0.048 0.120 21.087
education 0.5949 0.086 6.893 0.000 0.421 0.769
==============================================================================
Omnibus: 9.841 Durbin-Watson: 1.736
Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.609
Skew: 0.776 Prob(JB): 0.00497
Kurtosis: 4.802 Cond. No. 123.
==============================================================================
Formula mini-language
- With statsmodels, formulas can be supplied with R-style syntax
- Examples:

| Formula | Model |
|---|---|
| income ~ education | $\text{income}_i = \alpha + \beta\, \text{education}_i + \epsilon_i$ |
| income ~ prestige | $\text{income}_i = \alpha + \beta\, \text{prestige}_i + \epsilon_i$ |
| income ~ prestige - 1 | $\text{income}_i = \beta\, \text{prestige}_i + \epsilon_i$ (no intercept) |
| income ~ education + prestige | $\text{income}_i = \alpha + \beta_1\, \text{education}_i + \beta_2\, \text{prestige}_i + \epsilon_i$ |
Formula mini-language
- One can use formulas to apply transformations to variables
| Formula | Model |
|---|---|
| log(P) ~ log(M) + log(Y) | $\log P_t = \alpha + \beta_1 \log M_t + \beta_2 \log Y_t + \epsilon_t$ |
| log(Y) ~ i | $\log Y_t = \alpha + \beta\, i_t + \epsilon_t$ |

- This is useful if the true relationship is nonlinear
- Also useful to interpret the coefficients
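A sketch of a transformation applied directly inside the formula, assuming a DataFrame `df` with (positive) `income` and `prestige` columns; the formula interface evaluates `np.log` for us:

```python
import numpy as np
import statsmodels.formula.api as smf

res = smf.ols('np.log(income) ~ np.log(prestige)', df).fit()
print(res.params)   # the slope can now be read as an elasticity
```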
Coefficients interpretation
Example: $\text{crime\_rate}_i = \alpha + \beta_1\, \text{police\_spending}_i + \beta_2\, \text{prevention\_policies}_i + \epsilon_i$

(police_spending and prevention_policies in million dollars)
- reads: holding other variables constant, a 0.1 million dollar increase in police spending changes the crime rate by $0.1\, \hat{\beta}_1$ (in the example: a reduction of 0.001%)

Interpretation?
- comparing coefficients directly is problematic when variables have different units
- here both spending variables are in million dollars, so if $|\hat{\beta}_2| > |\hat{\beta}_1|$ we can say that prevention policies are more efficient than police spending, ceteris paribus

Take logs: $\log(\text{crimes}_i) = \alpha + \gamma_1 \log(\text{police\_spending}_i) + \gamma_2 \log(\text{prevention\_policies}_i) + \epsilon_i$
- now we have an estimate of elasticities
- a 1% increase in police spending leads to a $|\gamma_1|$% decrease in the number of crimes
Statistical Inference
Hypotheses
- Recall what we do:
  - we have the data $(x_{1,i}, x_{2,i}, y_i)_{i=1,\dots,n}$
  - we choose a model: $y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$, or in matrix form $Y = X\beta + \epsilon$
  - from the data we compute estimates: $\hat{\beta} = (X'X)^{-1} X'Y$
    - estimates are a precise function of the data
    - exact formula not important here
- We need some hypotheses on the data generation process:
  - the error term $\epsilon$ is multivariate normal with covariance matrix $\sigma^2 I$ (independent errors, all with the same variance)
- Under these hypotheses:
  - $\hat{\beta}$ is an unbiased estimate of the true parameter $\beta$
    - i.e. $E[\hat{\beta}] = \beta$
  - one can prove $Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}$
    - $\sigma^2$ can be estimated by $\hat{\sigma}^2 = \frac{RSS}{n - p - 1}$ ($n - p - 1$: degrees of freedom)
  - one can estimate the variance $\hat{\sigma}^2_{\hat{\beta}_k}$ of each coefficient
    - it is the $k$-th diagonal element of $\hat{\sigma}^2 (X'X)^{-1}$
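A sketch estimating $\hat{\sigma}^2$ and the standard errors by hand, assuming the matrices `X`, `Y` and the estimate `beta_hat` from the earlier numpy sketch (here `k` counts the constant, so `n - k` equals $n - p - 1$):

```python
import numpy as np

n, k = X.shape                                   # k regressors, including the constant
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)     # RSS / degrees of freedom
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # estimate of sigma^2 (X'X)^{-1}
std_err = np.sqrt(np.diag(cov_beta))             # k-th diagonal element -> std error of beta_k
print(std_err)
```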
Is the regression significant?
- Approach is very similar to the one-dimensional case
- $H_0$: all coefficients are 0
  - i.e. the true model is $y_i = \alpha + \epsilon_i$
- $H_a$: some coefficients are not 0
- Statistics: $F = \frac{(MSE_0 - MSE)/p}{MSE/(n - p - 1)}$
  - $MSE_0$: mean squared error of the constant model
  - $MSE$: mean squared error of the full model
- Under:
  - the model assumptions about the data generation process
  - the $H_0$ hypothesis
- … the distribution of $F$ is known: a Fisher distribution $F(p, n - p - 1)$
  - It is remarkable that it doesn’t depend on $\sigma$
- One can produce a p-value:
  - probability to obtain such a high $F$ statistic under hypothesis $H_0$
  - if it is very low, $H_0$ is rejected
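A sketch of where this statistic appears in statsmodels, assuming `res` is a fitted result (e.g. `res = smf.ols('income ~ education + prestige', df).fit()`):

```python
print(res.fvalue)    # F statistic for H0: all slope coefficients are 0
print(res.f_pvalue)  # probability of such a large F under H0
```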
Is each coefficient significant?
- Given a coefficient $\beta_k$:
  - $H_0$: the true coefficient is 0
  - $H_a$: the true coefficient is not zero
- Statistics (Student-t): $t = \frac{\hat{\beta}_k}{\hat{\sigma}_{\hat{\beta}_k}}$
  - where $\hat{\sigma}^2_{\hat{\beta}_k}$ is the $k$-th diagonal element of $\hat{\sigma}^2 (X'X)^{-1}$
  - it compares the estimated value of a coefficient to its estimated standard deviation
- Under the inference hypotheses, the distribution of $t$ is known
  - it is a Student distribution with $n - p - 1$ degrees of freedom
- Procedure:
  - Compute $t$. Check the acceptance threshold $t^\star_\eta$ at probability $\eta$ (ex: 5%)
  - The coefficient is significant with probability $1 - \eta$ if $|t| > t^\star_\eta$
- Or just look at the p-value:
  - probability that $|t|$ would be as high as it is, assuming $H_0$
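A sketch of the same procedure with statsmodels, assuming a fitted result `res`; the 5% two-sided threshold comes from the Student distribution with $n - p - 1$ degrees of freedom (`res.df_resid`):

```python
from scipy.stats import t as student_t

print(res.tvalues)   # beta_hat_k divided by its estimated standard deviation
print(res.pvalues)   # p-value of each coefficient under H0: beta_k = 0

t_star = student_t.ppf(1 - 0.05 / 2, df=res.df_resid)
print(abs(res.tvalues) > t_star)   # True -> significant at the 5% level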
Confidence intervals
Same as in the 1d case.
- Take the estimate $\hat{\beta}_k$ with an estimate of its standard deviation $\hat{\sigma}_{\hat{\beta}_k}$
- Compute the Student threshold $t^\star_\eta$ at confidence level $\eta$ (ex: $\eta = 5\%$) such that: $P(|t| > t^\star_\eta) = \eta$
- Produce the confidence interval at confidence level $\eta$: $\left[ \hat{\beta}_k - t^\star_\eta \hat{\sigma}_{\hat{\beta}_k},\ \hat{\beta}_k + t^\star_\eta \hat{\sigma}_{\hat{\beta}_k} \right]$
- Interpretation:
  - for a given confidence interval at confidence level $\eta$ …
  - … the probability that our coefficient was obtained, if the true coefficient were outside of it, is smaller than $\eta$
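A sketch of how to obtain these intervals directly, assuming a fitted result `res`; statsmodels applies the formula above for us:

```python
print(res.conf_int(alpha=0.05))   # one [lower, upper] row per coefficient
```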
Other tests
- The tests seen so far rely on strong statistical assumptions (normality, homoscedasticity, etc.)
- Some tests can be used to check these assumptions:
  - Jarque-Bera: is the distribution (of the residuals) truly normal?
  - Durbin-Watson: are the residuals autocorrelated? (makes sense for time series)
- …
- In case assumptions are not met…
- … still possible to do econometrics
- … but beyond the scope of this course
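A sketch of two of these diagnostics, applied to the residuals of a fitted result `res` (the same numbers appear at the bottom of `res.summary()`):

```python
from statsmodels.stats.stattools import jarque_bera, durbin_watson

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb_pvalue)                 # small p-value -> residuals are likely not normal
print(durbin_watson(res.resid))  # values far from 2 suggest autocorrelated residuals
```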
Variable selection
Variable selection
- I’ve got plenty of data:
  - $y$: gdp
  - $x_1$: investment
  - $x_2$: inflation
  - $x_3$: education
  - $x_4$: unemployment
  - …
- Many possible regressions:
  - $y \sim x_1$
  - $y \sim x_1 + x_2$
  - $y \sim x_1 + x_3 + x_4$
  - …
- Which one do I choose?
- Putting everything together is not an option (kitchen sink regression)
Not enough coefficients
Suppose you run a regression:

$$y_i = \alpha + \beta_1 x_{1,i} + \epsilon_i$$

But unbeknownst to you, the actual model is:

$$y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \eta_i$$

- The residual $\epsilon_i = \beta_2 x_{2,i} + \eta_i$ is then correlated with $x_{1,i}$ whenever $x_1$ and $x_2$ are correlated
  - specification hypotheses are violated
  - the estimated $\hat{\beta}_1$ will have a bias (omitted variable bias)
- To correct the bias we add $x_2$ to the regression
  - even though we are not interested in $x_2$ by itself
  - we control for $x_2$ (see the simulation sketch below)
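A sketch simulating omitted variable bias on made-up data (all names and numbers here are illustrative): `x1` and `x2` are correlated, so leaving `x2` out biases the coefficient on `x1`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                  # correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true model uses both
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})

print(smf.ols('y ~ x1', data).fit().params['x1'])        # biased: close to 2 + 3 * 0.8
print(smf.ols('y ~ x1 + x2', data).fit().params['x1'])   # close to the true value 2
```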
Example
- Suppose I want to check Okun’s law. I consider the following model: $\Delta\text{unemployment}_t = \alpha + \beta\, \text{gdp\_growth}_t + \epsilon_t$
- I obtain estimates $\hat{\alpha}$ and $\hat{\beta}$
- Then I inspect the residuals visually: not normal at all!
- Conclusion: my regression is misspecified, $\hat{\beta}$ is a biased (useless) estimate
- I need to control for additional variables, for instance other variables from the list above (inflation, investment, …)
  - until the residuals are actually white noise
Collinear regressors
- What happens if two regressors are (almost) collinear? $x_{2,i} \approx \lambda x_{1,i}$ for some $\lambda$
- Intuitively: parameters are not unique
  - if $y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$ is the right model…
  - … then $y_i = \alpha + (\beta_1 + \lambda \beta_2) x_{1,i} + 0 \times x_{2,i} + \epsilon_i$ is exactly as good…
- Mathematically: $X'X$ is not invertible (or nearly so)
- When regressors are almost collinear, coefficients can have a lot of variability
- Test:
  - correlation statistics
  - correlation plot
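A quick collinearity check on the Duncan data, assuming `df` has the `education` and `prestige` columns:

```python
print(df[['education', 'prestige']].corr())   # a correlation close to 1 is a warning sign
```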
Choosing regressors
Which regressors to choose?

Method 1: remove the regressors with the lowest t (least significant) to maximize the adjusted R-squared
- remove the regressor with the lowest t
  - not the one you are interested in ;)
- regress again
- check whether the adjusted $\bar{R}^2$ has increased
  - if so, continue
  - otherwise, cancel the last step and stop

Method 2: choose the combination that minimizes the Akaike Information Criterion
- AIC: $AIC = 2k - 2\log(\hat{L})$, where $k$ is the number of parameters and $\hat{L}$ is the maximized likelihood
- computed by all good econometric software packages
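A sketch comparing candidate specifications by their AIC (lower is better), assuming the Duncan DataFrame `df` from the earlier sketches:

```python
import statsmodels.formula.api as smf

for formula in ["income ~ education",
                "income ~ prestige",
                "income ~ education + prestige"]:
    res = smf.ols(formula, df).fit()
    print(f"{formula:30s} AIC = {res.aic:.1f}")
```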