| rownames | type | income | education | prestige |
|---|---|---|---|---|
| accountant | prof | 62 | 86 | 82 |
| pilot | prof | 72 | 76 | 83 |
| architect | prof | 75 | 92 | 90 |
| author | prof | 55 | 90 | 76 |
| chemist | prof | 64 | 86 | 90 |
Data-Based Economics, ESCP, 2025-2026
2026-01-21
Duncan’s Occupational Prestige Data
45 occupations from the 1950 US Census.
Education, income, and prestige are associated with each occupation
\(x\): education
\(y\): income
\(z\): Percentage of respondents in a social survey who rated the occupation as “good” or better in prestige
Import the data from statsmodels’ dataset:
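The Duncan data ships with the R package carData and can be fetched through statsmodels' `get_rdataset` helper (network access required). As a self-contained sketch, the first rows shown above can be reconstructed directly with pandas:

```python
import pandas as pd

# With network access, the full 45-row dataset can be fetched via:
#   import statsmodels.api as sm
#   duncan = sm.datasets.get_rdataset("Duncan", "carData").data
# Here we rebuild the five rows shown above by hand.
duncan = pd.DataFrame(
    {
        "type": ["prof"] * 5,
        "income": [62, 72, 75, 55, 64],
        "education": [86, 76, 92, 90, 86],
        "prestige": [82, 83, 90, 76, 90],
    },
    index=["accountant", "pilot", "architect", "author", "chemist"],
)
print(duncan.head())
```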
For any variable \(v\) with \(N\) observations:
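The summary statistics reported below follow the standard definitions (a sketch; pandas' `describe` uses the \(N-1\) denominator for the standard deviation):

```latex
\bar{v} = \frac{1}{N}\sum_{i=1}^{N} v_i
\qquad
\sigma(v) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left(v_i - \bar{v}\right)^2}
```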
| | income | education | prestige |
|---|---|---|---|
| count | 45.000000 | 45.000000 | 45.000000 |
| mean | 41.866667 | 52.555556 | 47.688889 |
| std | 24.435072 | 29.760831 | 31.510332 |
| min | 7.000000 | 7.000000 | 3.000000 |
| 25% | 21.000000 | 26.000000 | 16.000000 |
| 50% | 42.000000 | 45.000000 | 41.000000 |
| 75% | 64.000000 | 84.000000 | 81.000000 |
| max | 81.000000 | 100.000000 | 97.000000 |
Can we visualize correlations?
Using matplotlib (3d)
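A minimal 3-D scatter sketch with matplotlib (the dataframe here is a hypothetical small sample in the shape of the Duncan data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical small sample in the shape of the Duncan data
df = pd.DataFrame({
    "education": [86, 76, 92, 90, 86],
    "income": [62, 72, 75, 55, 64],
    "prestige": [82, 83, 90, 76, 90],
})

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3-D axes
ax.scatter(df["education"], df["income"], df["prestige"])
ax.set_xlabel("education")
ax.set_ylabel("income")
ax.set_zlabel("prestige")
fig.savefig("duncan_3d.png")
```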

The pairplot made with seaborn gives a simple sense of correlations as well as information about the distribution of each variable.
Now we want to build a model to represent the data:
Consider the line: \[y = α + β x\]
Several possibilities. Which one do we choose to represent the model?
We need a criterion.
The standard criterion is the sum of squared errors: \[L(\alpha,\beta) = \sum_{i=1}^{N} \left(y_i - \alpha - \beta x_i\right)^2\]
The mathematical problem \(\min_{\alpha,\beta} L(\alpha,\beta)\) has a unique solution.
The solution is given by explicit formulas: \[\hat{\alpha} = \overline{y} - \hat{\beta} \overline{x}\] \[\hat{\beta} = \frac{Cov(x,y)}{Var(x)} = Cor(x,y)\,\frac{\sigma(y)}{\sigma(x)}\]
\(\hat{\alpha}\) and \(\hat{\beta}\) are estimators.
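A sketch of these closed-form estimators with numpy (sample covariance and variance with matching denominators):

```python
import numpy as np

def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) for the line y = alpha + beta * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # Cov(x,y) / Var(x)
    alpha = y.mean() - beta * x.mean()                      # ybar - beta * xbar
    return alpha, beta

# Sanity check on noiseless data y = 2 + 3x: estimates recover the line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a, b = ols_fit(x, 2 + 3 * x)
```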
In our example we get the result: \[\underbrace{y}_{\text{income}} = 10 + 0.59 \underbrace{x}_{education}\]
We can say:
But:
It is possible to make predictions with the model:
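With the fitted line \(y = 10 + 0.59 x\) from above, a prediction is a straight plug-in (the education value 50 below is just an illustrative input):

```python
# Fitted coefficients from the regression above
alpha_hat, beta_hat = 10.0, 0.59

def predict_income(education):
    """Predicted income for a given education score."""
    return alpha_hat + beta_hat * education

# e.g. an occupation with education score 50
predicted = predict_income(50)
```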

OK, but that seems noisy. How much do I really predict? Can I get a sense of the precision of my prediction?

Imagine the true model is: \[y_i = \alpha + \beta x_i + \epsilon_i\] \[\epsilon_i \sim \mathcal{N}\left(0,\sigma^{2}\right)\]
Using this data-generating process, I have randomly drawn \(N\) data points (a.k.a. gathered the data)
Then computed my estimates \(\hat{\alpha}\), \(\hat{\beta}\)
How confident am I in these estimates?
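This thought experiment can be simulated: draw many datasets from an assumed true model (the values α=1, β=2, σ=1 below are hypothetical) and look at the spread of the resulting slope estimates — a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true, sigma = 1.0, 2.0, 1.0  # assumed true model
N, n_reps = 50, 500

betas = []
for _ in range(n_reps):
    x = rng.uniform(0, 10, size=N)                          # draw regressors
    y = alpha_true + beta_true * x + sigma * rng.normal(size=N)
    beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)       # closed-form slope
    betas.append(beta_hat)

betas = np.array(betas)
# The estimates scatter around the true value beta_true
```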
Given the true model, the estimators are themselves random variables: they depend on the random draws of the data-generating process
Given the values \(\alpha\), \(\beta\), \(\sigma\) of the true model, we can model the distribution of the estimates.
Some closed forms:
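One such closed form (a textbook result under the Gaussian model above, not derived in these slides) is the variance of the slope estimator:

```latex
Var(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^{N} \left(x_i - \bar{x}\right)^2}
```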
These statistics, or any function of them, can be computed exactly, given the data.
Their distribution depends on the data-generating process
Can we produce statistics whose distribution is known given mild assumptions on the data-generating process?
Test
Fisher statistic: \[\boxed{F=\frac{\text{Explained Variance}}{\text{Unexplained Variance}}}\]
Distribution of \(F\) is known theoretically.
In our case, \(F=40.48\).
What was the probability of a value that large under the null hypothesis \(H_0\)?
In social science, the typical required p-value is 5%.
In practice, we abstract from the precise calculation of the Fisher statistic and look only at the p-value.
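The p-value is the probability, under \(H_0\), of an F at least as large as the observed one. A sketch with scipy, assuming 1 and 43 degrees of freedom (one regressor, 45 observations, two estimated parameters):

```python
from scipy.stats import f

F_stat = 40.48          # observed Fisher statistic
df1, df2 = 1, 43        # assumed degrees of freedom: 1 regressor, 45 - 2 residual
p_value = f.sf(F_stat, df1, df2)  # survival function = P(F > F_stat) under H0
# p_value falls far below the 5% threshold: reject H0
```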

The Student test can also be used to construct confidence intervals.
Given an estimate \(\hat{\beta}\) with standard deviation \(\sigma(\hat{\beta})\)
Given a probability threshold \(\alpha\) (for instance \(\alpha=0.05\)), we can compute \(t^{\star}\) such that \(P(|t|>t^{\star})=\alpha\)
We construct the confidence interval: \[I^{\alpha} = [\hat{\beta}-t^{\star}\sigma(\hat{\beta}),\ \hat{\beta}+t^{\star}\sigma(\hat{\beta})]\]
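A sketch of the interval with scipy's Student-t quantiles; the slope \(\hat\beta=0.59\) comes from the regression above, while the standard error and degrees of freedom are illustrative assumptions:

```python
from scipy.stats import t

beta_hat = 0.59     # slope estimate from the regression above
se_beta = 0.093     # hypothetical standard error of the estimate
df = 43             # hypothetical residual degrees of freedom (45 obs - 2 params)
alpha = 0.05

t_star = t.ppf(1 - alpha / 2, df)   # quantile such that P(|t| > t_star) = alpha
interval = (beta_hat - t_star * se_beta, beta_hat + t_star * se_beta)
```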
Interpretation (⚠: subtle)
Claim (common intuition): “More civilian firearms deter crime ⇒ fewer homicides.”
Simple cross-country OLS (26 high-income countries, early 1990s)
\[\text{HomicideRate}_i = \alpha + \beta\,\text{GunAvailability}_i + \varepsilon_i\]
What the bivariate regression finds (Table 2):
Takeaway: The “more guns → less homicide” claim is not supported by this simple cross-country correlation.
⚠️ Caution (OLS lesson): This is not causal (possible confounding, reverse causality, measurement error). Use it to practice interpreting coefficients, p-values, and thinking about omitted variables.
Source: Hemenway, D. & Miller, M. (2000). Firearm Availability and Homicide Rates across 26 High-Income Countries. Journal of Trauma, 49(6), 985–988.