Multiple Regression

Data-Based Economics, ESCP, 2024-2025

Author

Pablo Winant

Published

February 5, 2025

The problem

Remember the dataset from last time:

| occupation | type | income | education | prestige |
|------------|------|--------|-----------|----------|
| accountant | prof | 62     | 86        | 82       |
| pilot      | prof | 72     | 76        | 83       |
| architect  | prof | 75     | 92        | 90       |
| author     | prof | 55     | 90        | 76       |
| chemist    | prof | 64     | 86        | 90       |
  • Last week we “ran” a linear regression: $y = \alpha + \beta x$. Result: $\text{income} = xx + 0.72\, \text{education}$
  • Should we have looked at “prestige” instead? $\text{income} = xx + 0.83\, \text{prestige}$
  • Which one is better?

Prestige or Education

  • if the goal is to predict: the one with the higher explained variance
    • prestige has the higher $R^2$ ($0.83^2$)
  • unless we are interested in the effect of education

Multiple regression

  • What about using both?
    • two-variable model: $\text{income} = \alpha + \beta_1\, \text{education} + \beta_2\, \text{prestige}$
    • will probably improve predictive power (explained variance)
    • $\beta_1$ might not be meaningful on its own anymore (education and prestige are correlated)

Fitting a model

Now we are trying to fit a plane to a cloud of points.


Minimization Criterion

  • Take all observations: $(\text{income}_n, \text{education}_n, \text{prestige}_n)_{n \in [0,N]}$
  • Objective: the sum of squared prediction errors
$$L(\alpha, \beta_1, \beta_2) = \sum_n \Big( \underbrace{\alpha + \beta_1\, \text{education}_n + \beta_2\, \text{prestige}_n - \text{income}_n}_{e_n = \text{prediction error}} \Big)^2$$
  • Minimize the loss function in $\alpha$, $\beta_1$, $\beta_2$
  • Again, we can perform numerical optimization (the machine learning approach, sketched below)
    • … but there is an explicit formula
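A minimal sketch of the numerical route, using only the five rows shown above (the full dataset has 45 observations); the scipy call is illustrative, not the course's code:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Tiny extract of the dataset shown above
df = pd.DataFrame({
    "income":    [62, 72, 75, 55, 64],
    "education": [86, 76, 92, 90, 86],
    "prestige":  [82, 83, 90, 76, 90],
})

def loss(theta):
    alpha, beta1, beta2 = theta
    pred = alpha + beta1 * df["education"] + beta2 * df["prestige"]
    return np.sum((pred - df["income"]) ** 2)   # sum of squared prediction errors

opt = minimize(loss, x0=[0.0, 0.0, 0.0])
print(opt.x)   # numerical estimates of (alpha, beta_1, beta_2)
```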

Ordinary Least Square

$$Y = \begin{bmatrix} \text{income}_1 \\ \vdots \\ \text{income}_N \end{bmatrix} \qquad X = \begin{bmatrix} 1 & \text{education}_1 & \text{prestige}_1 \\ \vdots & \vdots & \vdots \\ 1 & \text{education}_N & \text{prestige}_N \end{bmatrix}$$

  • Matrix version (look for $B = (\alpha, \beta_1, \beta_2)'$): $Y = XB + E$
  • Note that the constant can be interpreted as a “variable” (the column of ones in $X$)
  • Loss function: $L(B) = (Y - XB)'(Y - XB)$
  • Result of the minimization $\min_B L(B)$ (see the sketch below):
$$\begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \end{bmatrix} = (X'X)^{-1} X' Y$$
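The closed-form solution is a one-liner in numpy; this sketch reuses the small `df` defined above (illustrative only):

```python
import numpy as np

# Design matrix: a column of ones (the constant), then the regressors
X = np.column_stack([np.ones(len(df)), df["education"], df["prestige"]])
Y = df["income"].to_numpy()

# Ordinary least squares: B = (X'X)^{-1} X'Y
B = np.linalg.solve(X.T @ X, X.T @ Y)
print(B)   # [alpha, beta_1, beta_2]
```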

Solution

  • Result: $\text{income} = 10.43 + 0.03 \times \text{education} + 0.62 \times \text{prestige}$
  • Questions:
    • is it a better regression than the other?
    • is the coefficient in front of education significant?
    • how do we interpret it?
    • can we build confidence intervals?

Explained Variance

Explained Variance

As in the 1d case, we can compare:

  • the variability of the model predictions (MSS)
  • the variance of the data (TSS, T for total)

Coefficient of determination (same formula):

$$R^2 = \frac{MSS}{TSS}$$

Or:

$$R^2 = 1 - \frac{RSS}{TSS}$$

where RSS is the residual sum of squares (the unexplained variance).

Adjusted R squared

Fact:

  • adding more regressors always improves $R^2$
  • why not throw everything in? (kitchen-sink regressions)
  • too many regressors: overfitting the data

Penalise additional regressors: the adjusted $R^2$

$$R^2_{adj} = 1 - (1 - R^2)\frac{N-1}{N-p-1}$$

Where:

  • $N$: number of observations
  • $p$: number of regressors (excluding the constant)

In our example:

| Regression           | $R^2$  | $R^2_{adj}$ |
|----------------------|--------|-------------|
| education            | 0.525  | 0.514       |
| prestige             | 0.702  | 0.695       |
| education + prestige | 0.7022 | 0.688       |
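A quick arithmetic check of the adjusted $R^2$ formula on the last line of the table (with $N = 45$ observations, as in the regression output below):

```python
def adjusted_r2(r2, N, p):
    """Adjusted R^2 with N observations and p regressors (constant excluded)."""
    return 1 - (1 - r2) * (N - 1) / (N - p - 1)

print(adjusted_r2(0.7022, N=45, p=2))   # ~0.688, as in the table
```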

Interpretation and variable change

Making a regression with statsmodels

import statsmodels

We use a special API inspired by R:

import statsmodels.formula.api as smf

Performing a regression

  • Running a regression with statsmodels
model = smf.ols('income ~ education', df)  # specify the model
res = model.fit()                          # perform the regression
res.summary()                              # display the results table below
  • ‘income ~ education’ is the model formula
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 income   R-squared:                       0.525
Model:                            OLS   Adj. R-squared:                  0.514
Method:                 Least Squares   F-statistic:                     47.51
Date:                Tue, 02 Feb 2021   Prob (F-statistic):           1.84e-08
Time:                        05:21:25   Log-Likelihood:                -190.42
No. Observations:                  45   AIC:                             384.8
Df Residuals:                      43   BIC:                             388.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
==============================================================================
Intercept     10.6035      5.198      2.040      0.048       0.120      21.087
education      0.5949      0.086      6.893      0.000       0.421       0.769
==============================================================================
Omnibus:                        9.841   Durbin-Watson:                   1.736
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.609
Skew:                           0.776   Prob(JB):                      0.00497
Kurtosis:                       4.802   Cond. No.                         123.
==============================================================================

Formula mini-language

  • With statsmodels, model formulas can be supplied using an R-style syntax
  • Examples:
| Formula                       | Model |
|-------------------------------|-------|
| income ~ education            | $\text{income}_i = \alpha + \beta\, \text{education}_i$ |
| income ~ prestige             | $\text{income}_i = \alpha + \beta\, \text{prestige}_i$ |
| income ~ prestige - 1         | $\text{income}_i = \beta\, \text{prestige}_i$ (no intercept) |
| income ~ education + prestige | $\text{income}_i = \alpha + \beta_1\, \text{education}_i + \beta_2\, \text{prestige}_i$ |
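Each line of the table is a one-line model specification; a sketch, assuming `df` holds the occupational dataset:

```python
import statsmodels.formula.api as smf

res_both  = smf.ols('income ~ education + prestige', df).fit()
res_noint = smf.ols('income ~ prestige - 1', df).fit()   # no intercept

print(res_both.params)    # Intercept, education, prestige coefficients
print(res_noint.params)   # prestige coefficient only
```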

Formula mini-language

  • One can use formulas to apply transformations to variables
| Formula                  | Model |
|--------------------------|-------|
| log(P) ~ log(M) + log(Y) | $\log(P_i) = \alpha + \alpha_1 \log(M_i) + \alpha_2 \log(Y_i)$ (log-log) |
| log(Y) ~ i               | $\log(Y_i) = \alpha + \alpha_1\, i_i$ (semi-log) |
  • This is useful if the true relationship is nonlinear
  • Also useful to interpret the coefficients
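Transformations can be written directly inside a statsmodels formula, e.g. with numpy functions; the log-log specification below, on the occupational dataset, is purely illustrative:

```python
import numpy as np
import statsmodels.formula.api as smf

# Log-log specification: the coefficient reads as an elasticity
res_log = smf.ols('np.log(income) ~ np.log(prestige)', df).fit()
print(res_log.params)
```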

Coefficient interpretation

Example:

  • (police_spending and prevention_policies in million dollars)
$$\text{number\_of\_crimes} = 0.005\% - 0.001\, \text{pol\_spend} - 0.005\, \text{prev\_pol} + 0.002\, \text{population\_density}$$

  • reads: holding the other variables constant, a 1 million increase in police spending reduces the crime rate by 0.001%

Interpretation?

  • comparing coefficients directly is problematic because the variables have different units
  • since both spending variables are measured in million dollars, we can say that prevention policies are more efficient than police spending, ceteris paribus

Take logs:
$$\log(\text{number\_of\_crimes}) = 0.005\% - 0.15 \log(\text{pol\_spend}) - 0.4 \log(\text{prev\_pol}) + 0.2 \log(\text{population\_density})$$

  • now we have an estimate of elasticities
  • a 1% increase in police spending leads to a 0.15% decrease in the number of crimes

Statistical Inference

Hypotheses

  • Recall what we do:
    • we have the data $X$, $Y$
    • we choose a model: $Y = \alpha + X\beta$
    • from the data we compute estimates: $\hat{\beta} = (X'X)^{-1}X'Y$, $\hat{\alpha} = \overline{Y} - \overline{X}\hat{\beta}$
    • estimates are a precise function of data
      • exact formula not important here

We need some hypotheses on the data generating process:

  • $Y = X\beta + \epsilon$
  • $E[\epsilon] = 0$
  • $\epsilon$ is multivariate normal with covariance matrix $\sigma^2 I_N$
    • $\forall i,\ \sigma(\epsilon_i) = \sigma$
    • $\forall i \neq j,\ cov(\epsilon_i, \epsilon_j) = 0$

Under these hypotheses:

  • $\hat{\beta}$ is an unbiased estimate of the true parameter $\beta$
    • i.e. $E[\hat{\beta}] = \beta$
  • one can prove $Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}$
  • $\sigma$ can be estimated by $\hat{\sigma} = \sqrt{\frac{\sum_i (y_i - \text{pred}_i)^2}{N-p}}$
    • $N - p$: degrees of freedom
  • one can estimate $\sigma(\hat{\beta_k})$ (see the sketch below)
    • its square is the $k$-th diagonal element of $\hat{\sigma}^2 (X'X)^{-1}$
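A numpy sketch of these formulas, reusing the `X` and `Y` built earlier (statsmodels reports the same standard errors as `bse` on its results object):

```python
import numpy as np

N, p = X.shape                            # here p counts the constant as a column
B = np.linalg.solve(X.T @ X, X.T @ Y)     # OLS estimates
resid = Y - X @ B                         # prediction errors
sigma_hat = np.sqrt(resid @ resid / (N - p))

# Standard errors: square roots of the diagonal of sigma_hat^2 (X'X)^{-1}
cov_B = sigma_hat**2 * np.linalg.inv(X.T @ X)
print(np.sqrt(np.diag(cov_B)))
```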

Is the regression significant?

  • Approach is very similar to the one-dimensional case
Fisher criterion (F-test)
  • $H_0$: all slope coefficients are 0
    • i.e. the true model is $y = \alpha + \epsilon$
  • $H_1$: some coefficients are not 0

Statistic: $F = \frac{MSR}{MSE}$ (same as in the 1d case)

  • MSR: mean square of the regression (explained variance)
  • MSE: mean square of the residuals (unexplained variance)

Under:

  • the model assumptions about the data generation process
  • the H0 hypothesis

… the distribution of F is known

It is remarkable that it doesn’t depend on $\sigma$!

One can produce a p-value.

  • probability to obtain such a statistic under hypothesis $H_0$
  • if very low, H0 is rejected
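With statsmodels, the F-statistic and its p-value are reported in the summary and exposed on the results object; a sketch, assuming `df` holds the occupational dataset:

```python
res = smf.ols('income ~ education + prestige', df).fit()

print(res.fvalue, res.f_pvalue)        # F-statistic and its p-value
print(res.mse_model / res.mse_resid)   # the same ratio MSR / MSE computed by hand
```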

Is each coefficient significant ?

Student Test

Given a coefficient $\beta_k$:

  • $H_0$: the true coefficient is 0
  • $H_1$: the true coefficient is not zero

Statistic (Student t): $t = \frac{\hat{\beta_k}}{\hat{\sigma}(\hat{\beta_k})}$

  • where $\hat{\sigma}(\hat{\beta_k})^2$ is the $k$-th diagonal element of $\hat{\sigma}^2 (X'X)^{-1}$
  • it compares the estimated value of a coefficient to its estimated standard deviation

Under the inference hypotheses, the distribution of $t$ is known.

  • it follows a Student distribution

Procedure:

  • Compute $t$. Look up the acceptance threshold $t^\star$ at significance level $\alpha$ (e.g. 5%)
  • The coefficient is significant with probability $1 - \alpha$ if $|t| > t^\star$

Or just look at the p-value:

  • the probability that $|t|$ would be as large as observed, assuming $H_0$
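These are the `t` and `P>|t|` columns of the summary table; for any fitted statsmodels results object `res` they can be retrieved directly:

```python
print(res.params / res.bse)   # t-statistics (identical to res.tvalues)
print(res.pvalues)            # p-value of each coefficient
```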

Confidence intervals

Same as in the 1d case.

  • Take the estimate $\beta_i$ with an estimate of its standard deviation $\hat{\sigma}(\beta_i)$
  • Compute the Student threshold $t^\star$ at significance level $\alpha$ (e.g. $\alpha = 5\%$) such that:
    • $P(|t| > t^\star) < \alpha$

. . .

Produce the confidence interval at confidence level $1 - \alpha$:

  • $\left[\beta_i - t^\star \hat{\sigma}(\beta_i),\ \beta_i + t^\star \hat{\sigma}(\beta_i)\right]$
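statsmodels computes these intervals directly (the `[0.025  0.975]` columns of the summary table above):

```python
print(res.conf_int(alpha=0.05))   # 95% confidence interval for each coefficient
```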

Interpretation:

  • for a confidence interval built at confidence level $1 - \alpha$
  • the probability that our estimate would have been obtained, if the true coefficient were outside of the interval, is smaller than $\alpha$

Other tests

  • The tests seen so far rely on strong statistical assumptions (normality, homoscedasticity, etc.)
  • Some tests can be used to check these assumptions (sketched below):
    • Jarque-Bera: are the residuals normally distributed?
    • Durbin-Watson: are the residuals autocorrelated? (makes sense for time series)
  • In case assumptions are not met…
    • … still possible to do econometrics
    • … but beyond the scope of this course
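Both statistics appear at the bottom of the regression summary; they can also be computed from the residuals with statsmodels helpers:

```python
from statsmodels.stats.stattools import durbin_watson, jarque_bera

print(durbin_watson(res.resid))            # close to 2: no autocorrelation
jb, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb, jb_pvalue)                       # small p-value: reject normality
```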

Variable selection

Variable selection

  • I’ve got plenty of data:
    • y: gdp
    • x1: investment
    • x2: inflation
    • x3: education
    • x4: unemployment
  • Many possible regressions:
    • $y = \alpha + \beta_1 x_1$
    • $y = \alpha + \beta_2 x_2 + \beta_3 x_4$

. . .

  • Which one do I choose?
    • putting everything together is not an option (kitchen-sink regression)

Not enough coefficients

Suppose you run a regression $y = \alpha + \beta_1 x_1 + \epsilon$ and are genuinely interested in the coefficient $\beta_1$.

. . .

But unknown to you, the actual model is $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \eta$

. . .

The residual $y - \alpha - \beta_1 x_1$ is not white noise

  • the specification hypotheses are violated
  • the estimated $\hat{\beta_1}$ will be biased (omitted variable bias)
  • to correct the bias we add $x_2$ (see the simulated sketch below)
    • even though we are not interested in $x_2$ by itself
    • we control for $x_2$
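A small simulation of the phenomenon (entirely illustrative, with made-up coefficients): when a regressor correlated with $x_1$ is omitted, the estimate of $\beta_1$ drifts away from its true value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + rng.normal(size=N)              # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=N)
sim = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

print(smf.ols('y ~ x1', sim).fit().params["x1"])        # omitted x2: ~4.4, biased
print(smf.ols('y ~ x1 + x2', sim).fit().params["x1"])   # controlling for x2: ~2.0
```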

Example

  • Suppose I want to check Okun’s law. I consider the following model: $\text{gdp\_growth} = \alpha + \beta \times \text{unemployment}$
  • I obtain: $\text{gdp\_growth} = 0.01 - 0.1 \times \text{unemployment} + e_i$
  • Then I inspect the residuals visually: not normal at all!
  • Conclusion: my regression is misspecified, and $-0.1$ is a biased (hence useless) estimate
  • I need to control for additional variables, for instance: $\text{gdp\_growth} = \alpha + \beta_1\, \text{unemployment} + \beta_2\, \text{interest\_rate}$
  • … until the residuals are actually white noise

Colinear regressors

  • What happens if two regressors are (almost) colinear? $y = \alpha + \beta_1 x_1 + \beta_2 x_2$ where $x_2 = \kappa x_1$
  • Intuitively: the parameters are not unique
    • if $y = \alpha + \beta_1 x_1$ is the right model…
    • then $y = \alpha + \beta_1 \lambda x_1 + \beta_1 (1 - \lambda) \frac{1}{\kappa} x_2$ is exactly as good…
  • Mathematically: $(X'X)$ is not invertible.
  • When regressors are almost colinear, coefficients can have a lot of variability.
  • Test (as sketched below):
    • correlation statistics
    • correlation plot
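For instance with pandas (and optionally a heatmap for the visual check), on the occupational dataset used above:

```python
corr = df[["education", "prestige", "income"]].corr()
print(corr)                        # pairwise correlation statistics

import seaborn as sns
sns.heatmap(corr, annot=True)      # correlation plot
```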

Choosing regressors

$$y = \alpha + \beta_1 x_1 + \dots + \beta_n x_n$$

Which regressors to choose?

Method 1: remove the regressors with the lowest $t$ (the least significant) so as to maximize the adjusted R-squared

  • remove the regressor with the lowest $t$
    • not the one you are interested in ;)
  • regress again
  • check whether the adjusted $R^2$ has improved
    • if so, continue
    • otherwise cancel the last step and stop

Method 2: choose the combination of regressors that minimizes the Akaike Information Criterion (AIC, compared below)

  • $AIC = 2p - 2\log(L)$
  • $L$ is the likelihood
  • computed by all good econometric software
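With statsmodels, every fitted model exposes its AIC, so candidate specifications can be compared directly; a sketch on the occupational dataset:

```python
candidates = ['income ~ education',
              'income ~ prestige',
              'income ~ education + prestige']
for formula in candidates:
    r = smf.ols(formula, df).fit()
    print(f"{formula:35s} AIC = {r.aic:.1f}")   # lower AIC is better
```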

Coming next

Intro to causality