Multiple Regression

Data-Based Economics, ESCP, 2024-2025

Pablo Winant

2025-02-05

The problem

Remember the dataset from last time:

             type    income   education   prestige
accountant   prof        62          86         82
pilot        prof        72          76         83
architect    prof        75          92         90
author       prof        55          90         76
chemist      prof        64          86         90
  • Last week we “ran” a linear regression: \(y = \alpha + \beta x\). Result: \[\text{income} = xx + 0.72 \text{education}\]
  • Should we have looked at “prestige” instead? \[\text{income} = xx + 0.83 \text{prestige}\]
  • Which one is better?

Prestige or Education

  • if the goal is to predict: the one with higher explained variance
    • prestige has higher \(R^2\) (\(0.83^2\))
  • unless we are interested in the effect of education

Multiple regression

  • What about using both?
    • 2 variables model: \[\text{income} = \alpha + \beta_1 \text{education} + \beta_2 \text{prestige}\]
    • will probably improve prediction power (explained variance)
    • \(\beta_1\) might not be meaningful on its own anymore (education and prestige are correlated)

Fitting a model

Now we are trying to fit a plane to a cloud of points.

Minimization Criterion

  • Take all observations: \((\text{income}_n,\text{education}_n,\text{prestige}_n)_{n\in[1,N]}\)
  • Objective: sum of squares \[ L(\alpha, \beta_1, \beta_2) = \sum_n \left( \underbrace{ \alpha + \beta_1 \text{education}_n + \beta_2 \text{prestige}_n - \text{income}_n }_{e_n=\text{prediction error} }\right)^2 \]
  • Minimize loss function in \(\alpha\), \(\beta_1\), \(\beta_2\)
  • Again, we can perform numerical optimization (machine learning approach)
    • … but there is an explicit formula
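
As a minimal sketch of this numerical approach (assuming the DataFrame df with the income, education and prestige columns seen above, and scipy being available):

import numpy as np
from scipy.optimize import minimize

def loss(theta, df):
    # sum of squared prediction errors for theta = (alpha, beta_1, beta_2)
    alpha, beta_1, beta_2 = theta
    errors = alpha + beta_1 * df['education'] + beta_2 * df['prestige'] - df['income']
    return (errors ** 2).sum()

result = minimize(loss, x0=np.zeros(3), args=(df,))
alpha_hat, beta_1_hat, beta_2_hat = result.x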

Ordinary Least Squares

\[Y = \begin{bmatrix} \text{income}_1 \\\\ \vdots \\\\ \text{income}_N \end{bmatrix}\] \[X = \begin{bmatrix} 1 & \text{education}_1 & \text{prestige}_1 \\\\ \vdots & \vdots & \vdots \\\\ 1 &\text{education}_N & \text{prestige}_N \end{bmatrix}\]

  • Matrix Version (look for \(B = \left( \alpha, \beta_1 , \beta_2 \right)\)): \[Y = X B + E\]
  • Note that constant can be interpreted as a “variable”
  • Loss function \[L(B) = (Y - X B)' (Y - X B)\]
  • Result of minimization \(\min_{B} L(B)\): \[B = \begin{bmatrix}\alpha \\\\ \beta_1 \\\\ \beta_2 \end{bmatrix} = (X'X)^{-1} X' Y \]
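
As a sketch, the formula can be applied directly with numpy (assuming the same DataFrame df):

import numpy as np

Y = df['income'].values
X = np.column_stack([np.ones(len(df)),        # constant column
                     df['education'].values,
                     df['prestige'].values])

# B = (X'X)^{-1} X'Y, computed with a linear solve rather than an explicit inverse
B = np.linalg.solve(X.T @ X, X.T @ Y)
alpha, beta_1, beta_2 = B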

Solution

  • Result: \[\text{income} = 10.43 + 0.03 \times \text{education} + 0.62 \times \text{prestige}\]
  • Questions:
    • is it a better regression than the other?
    • is the coefficient in front of education significant?
    • how do we interpret it?
    • can we build confidence intervals?

Explained Variance

Explained Variance

As in the 1d case, we can compare:

  • the variability of the model predictions (\(MSS\))
  • the variance of the data (\(TSS\), T for total)

Coefficient of determination (same formula):

\[R^2 = \frac{MSS}{TSS}\]

Or:

\[R^2 = 1-\frac{RSS}{TSS}\]

where \(RSS\) is the unexplained (residual) variance

Adjusted R squared

Fact:

  • adding more regressors always improves \(R^2\)
  • why not throw everything in? (kitchen-sink regressions)
  • too many regressors: overfitting the data

Penalise additional regressors: adjusted \(R^2\)

\[R^2_{adj} = 1-(1-R^2)\frac{N-1}{N-p-1}\]

Where:

  • \(N\): number of observations
  • \(p\): number of regressors (excluding the constant)

In our example:

Regression              \(R^2\)    \(R^2_{adj}\)
education               0.525      0.514
prestige                0.702      0.695
education + prestige    0.7022     0.688
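
These values can be reproduced with statsmodels (sketch, assuming the DataFrame df from before); fitted results expose rsquared and rsquared_adj:

import statsmodels.formula.api as smf

for formula in ['income ~ education',
                'income ~ prestige',
                'income ~ education + prestige']:
    res = smf.ols(formula, df).fit()
    print(formula, round(res.rsquared, 3), round(res.rsquared_adj, 3))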

Interpretation and variable change

Making a regression with statsmodels

import statsmodels

We use a special API inspired by R:

import statsmodels.formula.api as smf

Performing a regression

  • Running a regression with statsmodels
model = smf.ols('income ~ education', df)  # define the model
res = model.fit()                          # run the regression
print(res.summary())                       # display the regression results
  • ‘income ~ education’ is the model formula
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 income   R-squared:                       0.525
Model:                            OLS   Adj. R-squared:                  0.514
Method:                 Least Squares   F-statistic:                     47.51
Date:                Tue, 02 Feb 2021   Prob (F-statistic):           1.84e-08
Time:                        05:21:25   Log-Likelihood:                -190.42
No. Observations:                  45   AIC:                             384.8
Df Residuals:                      43   BIC:                             388.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
==============================================================================
Intercept     10.6035      5.198      2.040      0.048       0.120      21.087
education      0.5949      0.086      6.893      0.000       0.421       0.769
==============================================================================
Omnibus:                        9.841   Durbin-Watson:                   1.736
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.609
Skew:                           0.776   Prob(JB):                      0.00497
Kurtosis:                       4.802   Cond. No.                         123.
==============================================================================

Formula mini-language

  • With statsmodels, formulas can be supplied using an R-style syntax
  • Examples:
Formula                          Model
income ~ education               \(\text{income}_i = \alpha + \beta \text{education}_i\)
income ~ prestige                \(\text{income}_i = \alpha + \beta \text{prestige}_i\)
income ~ prestige - 1            \(\text{income}_i = \beta \text{prestige}_i\) (no intercept)
income ~ education + prestige    \(\text{income}_i = \alpha + \beta_1 \text{education}_i + \beta_2 \text{prestige}_i\)
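
For instance, a sketch fitting two of the formulas above (same df as before):

import statsmodels.formula.api as smf

res_full = smf.ols('income ~ education + prestige', df).fit()
res_noint = smf.ols('income ~ prestige - 1', df).fit()   # '- 1' removes the intercept

print(res_full.params)    # Intercept, education, prestige
print(res_noint.params)   # prestige only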

Formula mini-language

  • One can use formulas to apply transformations to variables
Formula                      Model
log(P) ~ log(M) + log(Y)     \(\log(P_i) = \alpha + \beta_1 \log(M_i) + \beta_2 \log(Y_i)\) (log-log)
log(Y) ~ i                   \(\log(Y_i) = \alpha + \beta i_i\) (semi-log)
  • This is useful if the true relationship is nonlinear
  • It is also useful for interpreting the coefficients
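
In Python, transformations are written directly inside the formula; a minimal sketch (np must be importable from the calling namespace):

import numpy as np
import statsmodels.formula.api as smf

# log-log specification: the coefficient reads as an elasticity
res_loglog = smf.ols('np.log(income) ~ np.log(prestige)', df).fit()
print(res_loglog.params)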

Coefficient interpretation

Example:

  • (police_spending and prevention_policies in million dollars) \[\text{number_of_crimes} = 0.005\% - 0.001 \text{pol_spend} - 0.005 \text{prev_pol} + 0.002 \text{population density}\]

  • reads: holding other variables constant, a 1 million dollar increase in police spending reduces the crime rate by 0.001%

Interpretation?

  • problematic because variables have different units
  • we can still say that prevention policies are more efficient than police spending, ceteris paribus (both are measured in million dollars)

Take logs: \[\log(\text{number_of_crimes}) = 0.005\% - 0.15 \log(\text{pol_spend}) - 0.4 \log(\text{prev_pol}) + 0.2 \log(\text{population density})\]

  • now we have an estimate of elasticities
  • a \(1\%\) increase in police spending leads to a \(0.15\%\) decrease in the number of crimes

Statistical Inference

Hypotheses

  • Recall what we do:
    • we have the data \(X,Y\)
    • we choose a model: \[ Y = \alpha + X \beta \]
    • from the data we compute estimates: \[\hat{\beta} = (X'X)^{-1} X' Y \] \[\hat{\alpha} = \bar{Y} - \bar{X} \hat{\beta} \]
    • estimates are a precise function of data
      • exact formula not important here

We need some hypotheses on the data generation process:

  • \(Y = X \beta + \epsilon\)
  • \(\mathbb{E}\left[ \epsilon \right] = 0\)
  • \(\epsilon\) multivariate normal with covariance matrix \(\sigma^2 I_n\)
    • \(\forall i, \sigma(\epsilon_i) = \sigma\)
    • \(\forall i \neq j, \text{cov}(\epsilon_i, \epsilon_j) = 0\)

Under these hypotheses:

  • \(\hat{\beta}\) is an unbiased estimate of true parameter \(\beta\)
    • i.e. \(\mathbb{E} [\hat{\beta}] = \beta\)
  • one can prove \(Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\)
  • \(\sigma^2\) can be estimated by \(\hat{\sigma}^2=\frac{\sum_i (y_i-{pred}_i)^2}{N-p}\)
    • \(N-p\): degrees of freedom (here \(p\) counts the regressors, including the constant)
  • one can estimate \(\sigma(\hat{\beta_k})\)
    • its square is the \(k\)-th diagonal element of \(\hat{\sigma}^2 (X'X)^{-1}\)
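
A sketch of these formulas with numpy, reusing the matrices X (with its constant column) and Y built earlier:

import numpy as np

N, p = X.shape
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ B_hat

sigma2_hat = (residuals ** 2).sum() / (N - p)   # estimate of sigma^2
cov_B = sigma2_hat * np.linalg.inv(X.T @ X)     # estimated covariance of B_hat
std_errors = np.sqrt(np.diag(cov_B))            # sigma(beta_k), one per coefficient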

Is the regression significant?

  • Approach is very similar to the one-dimensional case

Fisher Criterium (F-test)

  • \(H0\): all coefficients (except the constant) are 0
    • i.e. true model is \(y=\alpha + \epsilon\)
  • \(H1\): some coefficients are not 0

Statistics: \[F=\frac{MSR}{MSE}\] (same as 1d)

  • \(MSR\): mean square of the regression (explained part)
  • \(MSE\): mean squared error of the residuals (unexplained part)

Under:

  • the model assumptions about the data generation process
  • the H0 hypothesis

… the distribution of \(F\) is known

It is remarkable that it doesn’t depend on \(\sigma\) !

One can produce a p-value.

  • probability of obtaining a statistic at least this large, under hypothesis H0
  • if very low, H0 is rejected
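
With statsmodels, the F statistic and its p-value are available on the fitted results (sketch, same df):

import statsmodels.formula.api as smf

res = smf.ols('income ~ education + prestige', df).fit()
print(res.fvalue)     # F statistic
print(res.f_pvalue)   # p-value of the F-test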

Is each coefficient significant?

Student Test

Given a coefficient \(\beta_k\):

  • \(H0\): true coefficient is 0
  • \(H1\): true coefficient is not zero

Statistics (student-t): \[t = \frac{\hat{\beta_k}}{\hat{\sigma}(\hat{\beta_k})}\]

  • where \(\hat{\sigma}(\hat{\beta_k})^2\) is the \(k\)-th diagonal element of \(\hat{\sigma}^2 (X'X)^{-1}\)
  • it compares the estimated value of a coefficient to its estimated standard deviation

Under the inference hypotheses, distribution of \(t\) is known.

  • it is a student distribution

Procedure:

  • Compute \(t\). Find the acceptance threshold \(t^{\star}\) at significance level \(\alpha\) (ex: 5%)
  • The coefficient is significant at confidence level \(1-\alpha\) if \(|t|>t^{\star}\)

Or just look at the p-value:

  • probability that \(|t|\) would be as large as observed, assuming \(H0\)
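
The t statistics and p-values shown in the summary table are also available as attributes (sketch):

import statsmodels.formula.api as smf

res = smf.ols('income ~ education + prestige', df).fit()
print(res.tvalues)   # t statistic for each coefficient
print(res.pvalues)   # corresponding p-values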

Confidence intervals

Same as in the 1d case.

  • Take estimate \(\color{red}{\beta_i}\) with an estimate of its standard deviation \(\color{red}{\hat{\sigma}(\beta_i)}\)
  • Compute student \(\color{red}{t^{\star}}\) at \(\color{red}{\alpha}\) confidence level (ex: \(\alpha=5\%\)) such that:
    • \(P(|t|>t^{\star})<\alpha\)

Produce confidence intervals at \(\alpha\) confidence level:

  • \([\color{red}{\beta_i} - t^{\star} \color{red}{\hat{\sigma}(\beta_i)}, \color{red}{\beta_i} + t^{\star} \color{red}{\hat{\sigma}(\beta_i)}]\)
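
In statsmodels, these intervals are returned by conf_int (sketch, 5% level):

import statsmodels.formula.api as smf

res = smf.ols('income ~ education + prestige', df).fit()
print(res.conf_int(alpha=0.05))   # one [lower, upper] pair per coefficient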

Interpretation:

  • for a given confidence interval at confidence level \(\alpha\)
  • the probability of having obtained our estimate, if the true coefficient were outside of it, is smaller than \(\alpha\)

Other tests

  • The tests seen so far rely on strong statistical assumptions (normality, homoscedasticity, etc.)
  • Some tests can be used to test these assumptions:
    • Jarque-Bera: are the residuals normally distributed?
    • Durbin-Watson: are the residuals autocorrelated? (relevant for time series)
  • In case assumptions are not met…
    • … still possible to do econometrics
    • … but beyond the scope of this course

Variable selection

Variable selection

  • I’ve got plenty of data:
    • \(y\): gdp
    • \(x_1\): investment
    • \(x_2\): inflation
    • \(x_3\): education
    • \(x_4\): unemployment
  • Many possible regressions:
    • \(y = α + \beta_1 x_1\)
    • \(y = α + \beta_2 x_2 + \beta_3 x_4\)
  • Which one do I choose?
    • putting everything together is not an option (kitchen sink regression)

Not enough coefficients

Suppose you run a regression: \[y = \alpha + \beta_1 x_1 + \epsilon\] and are genuinely interested in coefficient \(\beta_1\)

But unbeknownst to you, the actual model is \[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \eta\]

The residual \(y - \alpha - \beta_1 x_1\) is not white noise

  • specification hypotheses are violated
  • estimated \(\hat{\beta_1}\) will have a bias (omitted variable bias)
  • to correct the bias we add \(x_2\)
    • even though we are not interested in \(x_2\) by itself
    • we control for \(x_2\)
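
A small simulation illustrating the bias (a sketch with made-up coefficients): the short regression on \(x_1\) alone picks up part of the effect of the omitted \(x_2\).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=N)      # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=N)

short = sm.OLS(y, sm.add_constant(x1)).fit()                        # omits x2
long = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # controls for x2

print(short.params)  # slope on x1 is biased (roughly 2 + 3 * 0.8)
print(long.params)   # close to the true values (1, 2, 3)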

Example

  • Suppose I want to check Okun’s law. I consider the following model: \[\text{gdp_growth} = \alpha + \beta \times \text{unemployment}\]
  • I obtain: \[\text{gdp_growth} = 0.01 - 0.1 \times \text{unemployment} + e_i\]
  • Then I inspect visually the residuals: not normal at all!
  • Conclusion: my regression is misspecified, \(-0.1\) is a biased (useless) estimate
  • I need to control for additional variables. For instance: \[\text{gdp_growth} = \alpha + \beta_1 \text{unemployment} + \beta_2 \text{interest rate}\]
  • … until the residuals are actually white noise
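
The residuals of a fitted statsmodels result res can be inspected visually, for instance with a histogram (sketch):

import matplotlib.pyplot as plt

plt.hist(res.resid, bins=30)   # residuals of the fitted model
plt.title('Residuals')
plt.show()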

Colinear regressors

  • What happens if two regressors are (almost) colinear? \[y = \alpha + \beta_1 x_1 + \beta_2 x_2\] where \(x_2 = \kappa x_1\)
  • Intuitively: parameters are not unique
    • if \(y = \alpha + \beta_1 x_1\) is the right model…
    • then \(y = \alpha + \lambda \beta_1 x_1 + (1-\lambda) \frac{\beta_1}{\kappa} x_2\) is exactly as good…
  • Mathematically: \((X'X)\) is not invertible.
  • When regressors are almost colinear, coefficients can have a lot of variability.
  • Test:
    • correlation statistics
    • correlation plot
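
A quick check with pandas (sketch, same df as before):

import matplotlib.pyplot as plt

print(df[['education', 'prestige']].corr())    # correlation between the regressors
df.plot.scatter(x='education', y='prestige')   # visual check
plt.show()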

Choosing regressors

\[y = \alpha + \beta_1 x_1 + ... \beta_n x_n\]

Which regressors to choose?

Method 1: remove the least significant regressors (lowest \(t\)) so as to maximize the adjusted R-squared

  • remove regressors with lowest t
    • not the one you are interested in ;)
  • regress again
  • check whether the adjusted \(R^2\) improves
    • if so, continue
    • otherwise, cancel the last step and stop

Method 2: choose the combination that minimizes the Akaike Information Criterion (AIC)

  • AIC: \(2p - 2\log(L)\)
  • \(L\) is the likelihood of the estimated model
  • computed by all good econometrics software
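
With statsmodels, the AIC of a fitted model is an attribute of the results, so specifications can be compared directly (sketch, same df):

import statsmodels.formula.api as smf

for formula in ['income ~ education',
                'income ~ prestige',
                'income ~ education + prestige']:
    res = smf.ols(formula, df).fit()
    print(formula, res.aic)   # lower AIC is better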

Coming next

Intro to causality