Introduction to Panel Data

Data-Based Economics, ESCP, 2025-2026

Author

Pablo Winant

Published

January 28, 2026

Panels: Longitudinal Data


Motivation


A Simple Question

  • Question: Does drinking coffee improve productivity?
  • Data: We survey 1000 employees over 5 years.
  • Cross-Sectional View: Compare Alice (drinks 5 cups) vs. Bob (drinks 0 cups).
    • Finding: Alice is more stressed and less productive. \(\rightarrow\) Coffee is bad?
  • Time-Series View: Follow Alice when she varies her coffee intake.
    • Finding: On days she drinks coffee, she gets more done. \(\rightarrow\) Coffee is good!
  • The Puzzle: Alice works in a high-pressure trading floor (stress causes coffee). Bob works in a library.
Tip: The Power of Panels

Panel data lets us separate the individual effect (Alice vs Bob) from the causal effect (Coffee vs No Coffee).


Data Dimensions: Time and Space


Two Dimensions

  • Until now, we have been rather loose about where the data comes from.

  • Trying to explain \(N\) observations: \(y_n = a + b x_n, n\in [1,N]\)

  • All these lonely observations, where do they all come from?

  • Individuals (Space) \(\rightarrow\) cross-section
  • Dates (Time) \(\rightarrow\) time-series


Time Series

  • very simple study: structural break
    • does a regression on \([T_1, \overline{T}]\) yield (significantly) different results than on \([\overline{T}, T_2]\)?
  • going further: time series analysis
    • data is typically autocorrelated
    • example (AR1): \(x_t = a + b x_{t-1} + \epsilon_t\), simulated below
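
A minimal simulation of the AR(1) above (illustrative parameters, nothing estimated from data):

    import numpy as np

    # AR(1): x_t = a + b x_{t-1} + eps_t, with |b| < 1 for stationarity
    rng = np.random.default_rng(42)
    a, b, T = 0.5, 0.8, 200
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a + b * x[t - 1] + rng.normal(scale=0.1)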

Longitudinal Data / Panel Data

1. Repeated Cross-Sections

  • Idea: Survey different random people each year.
  • Example: “US Census 2000, 2010, 2020”
  • Structure:
    • Index individual \(i\) and time \(t\).
    • But \(i=1\) in 2000 is NOT the same person as \(i=1\) in 2010.
  • Analysis:
    • We can study trends (e.g. inequality over time).
    • But we cannot see who got richer.

2. Longitudinal (Panel) Data

  • Idea: Follow the same individuals over time.
  • Example: “Tracking the same 5000 families for 20 years (PSID)”
  • Structure:
    • \(y_{i,t}\) is person \(i\) at date \(t\).
    • \(y_{i,t+1}\) is the same person, later.
  • Power:
    • We can control for unobserved characteristics of \(i\).
    • We can study detailed dynamics.


Balanced vs Unbalanced

  • Balanced: all individuals in the sample are followed from \(1\) to \(T\)
  • Unbalanced: some individuals start later or stop earlier

  • Crude solutions (both sketched in pandas below):
    • truncate the dates to \([T_1, T_2]\) so that the dataset is balanced
    • eliminate individuals who are not present over the full period
  • Not very good:
    • can shrink the sample a lot
    • can induce a “selection bias”
  • Real gurus know how to deal with missing values
    • many algorithms can be adapted
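
As a rough sketch, the two crude fixes above might look like this in pandas (toy data, hypothetical column names):

    import pandas as pd

    # Toy unbalanced panel: individual 2 is missing in 2021
    df = pd.DataFrame({
        'id':   [1, 1, 1, 2, 2, 3, 3, 3],
        'year': [2020, 2021, 2022, 2020, 2022, 2020, 2021, 2022],
        'y':    [1.0, 1.2, 1.1, 2.0, 2.3, 0.5, 0.6, 0.4],
    })

    # Fix 1: truncate to the years in which every individual is observed
    common_years = set.intersection(*(set(g) for _, g in df.groupby('id')['year']))
    balanced_years = df[df['year'].isin(common_years)]

    # Fix 2: drop individuals that are not observed in every year
    n_years = df['year'].nunique()
    complete = df.groupby('id')['year'].nunique() == n_years
    balanced_ids = df[df['id'].isin(complete[complete].index)]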

Long and wide format

  • here we tend to prefer the long format (one row per id and date), as in the example below
  • there can be many columns though (one per variable)
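
For instance, with hypothetical id/year/gdp columns, pandas converts between the two layouts:

    import pandas as pd

    # Long format: one row per (id, year) pair
    long = pd.DataFrame({
        'id':   [1, 1, 2, 2],
        'year': [2020, 2021, 2020, 2021],
        'gdp':  [1.0, 1.1, 2.0, 2.1],
    })

    # Wide format: one row per id, one column per year
    wide = long.pivot(index='id', columns='year', values='gdp')

    # ... and back to long
    back = wide.reset_index().melt(id_vars='id', value_name='gdp')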

Micro-Panel vs Macro-Panel

  • micro-panel: \(T \ll N\)
    • Panel Study of Income Dynamics (PSID): 5000 families since 1968 (US)
      • reinterview same individuals from year to year
      • but some go in/out of the panel
    • Survey of Consumer Finances (SCF)
    • …
  • macro-panel: \(T\approx N\)
    • WIIW: 23 countries since the 60s (Central, East and Southeast Europe)

The Curse of Heterogeneity


The Trap of Pooled Regression

Context: We want to explain economic growth (\(y\)) using trade openness (\(x\)) for several countries.

The Mistake: “Pooled Regression”

  • We throw all data points (all countries, all years) into one big pot.
  • Formula: \(y_{i,t} = a + b x_{i,t} + \epsilon_{i,t}\)
  • Result: We find a negative relationship (the pooled regression line)!

The Reality:

  • Look at each country separately (the dots of the same color).
  • Within each country, the relationship is positive!
  • Conclusion: We missed the “Country Effect” (Unobserved Heterogeneity). The simulation below reproduces this reversal.
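
A minimal simulation of the trap (made-up numbers: each country has its own intercept, and countries with high \(x\) happen to have low baseline \(y\)):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    frames = []
    # (intercept, typical x) per country: high-x countries have low intercepts
    for i, (alpha, x_center) in enumerate([(10, 0), (5, 2), (0, 4)]):
        x = x_center + rng.normal(size=50)
        y = alpha + 1.0 * x + rng.normal(scale=0.5, size=50)  # within slope = +1
        frames.append(pd.DataFrame({'country': i, 'x': x, 'y': y}))
    df = pd.concat(frames, ignore_index=True)

    pooled = smf.ols('y ~ x', data=df).fit()           # slope comes out negative
    fe = smf.ols('y ~ x + C(country)', data=df).fit()  # slope close to +1
    print(pooled.params['x'], fe.params['x'])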

The Solution: Individual Fixed Effects

The Intuition

We admit that some individuals are just naturally “higher” or “lower” than others. We give each individual their own starting point (intercept).

The Model

\[ y_{i,t} = \underbrace{\alpha_i}_{\text{Individual Intercept}} + \beta x_{i,t} + \epsilon_{i,t}\]

  • \(\alpha_i\) is the Fixed Effect. It captures EVERYTHING that is:
    • Specific to individual \(i\)
    • Constant over time (time-invariant)
    • Even if we can’t observe it! (e.g., “Culture”, “Talent”)

Fixed effect (implementation)

  • Fixed effect regression: \[ y_{i,t} = a + a_i + b x_{i,t} + \epsilon_{i,t}\] is equivalent to \[ y_{i,t} = a \cdot 1 + a_1 d_{i=1} + \cdots + a_I d_{i=I} + b x_{i,t} + \epsilon_{i,t}\] where \(d\) is a dummy variable such that \(d_{i=j} = \begin{cases}1, & \text{if}\ i=j \\ 0, & \text{otherwise}\end{cases}\)
  • Minor problem: \(1, d_{i=1}, \ldots, d_{i=I}\) are not linearly independent: \(\sum_{j=1}^I d_{i=j}=1\)
    • Solution: drop one of them, exactly as with the dummies for categorical variables
  • Now the regression can be estimated with OLS… (see the sketch below)
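
A sketch of the dummy-variable version, reusing the simulated df from the pooled-regression example above:

    import pandas as pd
    import statsmodels.api as sm

    # One dummy column per country, first level dropped to avoid collinearity
    dummies = pd.get_dummies(df['country'], prefix='d', drop_first=True, dtype=float)
    X = sm.add_constant(pd.concat([df[['x']], dummies], axis=1))
    fe_dummies = sm.OLS(df['y'], X).fit()
    print(fe_dummies.params)  # intercept, b, and one coefficient per kept dummy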

Estimation methods

  • … Now the regression can be estimated with OLS (or other methods)
    • the naive approach fails for big panels (many dummy regressors make \(X^{\prime}X\) costly to invert)
    • a smarter approach decomposes the computation into several steps
      • the “between” and “within” estimators (for an advanced panel course); the within step is sketched below
    • software does it for us…
  • As always, we get estimates and significance numbers / confidence intervals
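
For intuition, here is what the “within” step amounts to on the simulated df (note that the naive standard errors below ignore the degrees of freedom absorbed by the individual means; dedicated panel software corrects for this):

    import statsmodels.api as sm

    # Subtract each country's mean: the fixed effect alpha_i drops out
    demeaned = df[['x', 'y']] - df.groupby('country')[['x', 'y']].transform('mean')
    within = sm.OLS(demeaned['y'], demeaned['x']).fit()
    print(within.params['x'])  # same slope as the dummy-variable regression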

Time Fixed Effects

  • Sometimes, we know the whole dataset is affected by common time-varying shocks
    • assume there isn’t a variable we can use to capture them (unobservable shocks)
  • We can use time-fixed effects to capture them: \[ y_{i,t} = a + a_t + b x_{i,t} + \epsilon_{i,t}\]
  • Analysis is very similar to individual fixed effects

Both Fixed Effects

  • We can capture time heterogeneity and individual heterogeneity at the same time. \[ y_{i,t} = a + a_i + a_t + b x_{i,t} + \epsilon_{i,t}\]
  • More on this soon.

Limitation of fixed effects

  • Problem with the fixed effect model:
    • each individual has a unique, freely estimated fixed effect
    • it is impossible to predict it from other characteristics
      • … and to compare an individual’s fixed effect to a predicted value
  • Solution: the random effects model
    • instead of leaving the specific effect completely free, constrain it to follow a distribution: \[y_{i,t} = \alpha + \beta x_{i,t} + \epsilon_{i,t}\] \[\epsilon_{i,t} = \epsilon_i + \epsilon_t + \eta_{i,t}\]
    • where \(\epsilon_{i}\), \(\epsilon_t\) and \(\eta_{i,t}\) are random variables with the usual normality assumptions
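
In linearmodels, the entity part of this corresponds to the RandomEffects estimator; a minimal sketch, assuming a hypothetical df_panel in long format with a (id, year) MultiIndex:

    from linearmodels import RandomEffects

    # df_panel: hypothetical DataFrame indexed by (id, year);
    # the entity effect is treated as a random draw, not a free dummy
    re_res = RandomEffects.from_formula('y ~ 1 + x', data=df_panel).fit()
    print(re_res)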

Other models

  • Composed coefficients: coefficients can also be heterogeneous in both dimensions

    \[y_{i,t} = \alpha_i + \alpha_t + (\beta_i + \beta_t) x_{i,t} + \epsilon_{i,t}\]

  • Random coefficients …


Diff in Diff


A Quasi-Experiment

The Setup

  • Scenario: A school introduces a tutoring program.
  • Problem: Usually, students who need help sign up.
    • Comparing Tutored vs Non-Tutored students is biased (selection bias).
  • The Diff-in-Diff Trick:
    • Don’t compare levels. Compare changes.
    • Compare the improvement of tutored students vs the improvement of non-tutored students.

Formalization

  • Two groups: Treated (\(T=1\)) and Control (\(T=0\)).
  • Two periods: Pre (\(t=0\)) and Post (\(t=1\)).
  • Goal: Isolate the effect of the Treatment.

Graphical Intuition


The Logic of Diff-in-Diff

  • Goal: Estimate the extra growth of the treated group compared to the control group.
Note: The Calculation

\[ \text{DID} = (\bar{y}_{\text{Treat}, \text{Post}} - \bar{y}_{\text{Treat}, \text{Pre}}) - (\bar{y}_{\text{Control}, \text{Post}} - \bar{y}_{\text{Control}, \text{Pre}}) \]

  • Interpretation:
    • \((\bar{y}_{C,\text{Post}} - \bar{y}_{C,\text{Pre}})\): The “Normal” time trend (what would have happened anyway).
    • \((\bar{y}_{T,\text{Post}} - \bar{y}_{T,\text{Pre}})\): The actual change for the Treated.
    • The difference is the Causal Effect (computed on a toy example below).
  • Key Assumption: Parallel Trends.
    • Without treatment, both groups would have followed the same trend.
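
Computing the formula on a toy dataset (made-up numbers where the treated improve by 4 and the controls by 1, so DID = 3):

    import pandas as pd

    toy = pd.DataFrame({
        'treat': [1, 1, 1, 1, 0, 0, 0, 0],
        'post':  [0, 0, 1, 1, 0, 0, 1, 1],
        'y':     [10, 12, 15, 15, 8, 10, 10, 10],
    })

    means = toy.groupby(['treat', 'post'])['y'].mean()
    did = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
    print(did)  # 3.0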

Implementing Diff-in-Diff

We estimate it using a regression with an Interaction Term:

\[y_{i,t} = \alpha + \beta_1 \underbrace{\text{Treat}_i}_{\text{Group}} + \beta_2 \underbrace{\text{Post}_t}_{\text{Time}} + \delta (\underbrace{\text{Treat}_i \times \text{Post}_t}_{\text{Interaction}}) + \epsilon_{i,t}\]

  • \(\beta_1\): Baseline difference between groups.
  • \(\beta_2\): Common time trend.
  • \(\delta\): The Difference-in-Differences estimator.
Important: Interpretation

\(\delta\) captures the extra jump that the treated group experiences in the second period. This is our causal effect! (Estimated on the toy data below.)
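
The same number comes out of the interaction regression; a sketch on the toy data above:

    import statsmodels.formula.api as smf

    # delta is the coefficient on the interaction term
    did_reg = smf.ols('y ~ treat + post + treat:post', data=toy).fit()
    print(did_reg.params['treat:post'])  # 3.0, matching the groupby computation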


Generalized Diff-in-Diff

  • Scenario:

    • Multiple time periods.
    • Different individuals get treated at different times (Staggered adoption).
  • The Model: Two-Way Fixed Effects (TWFE) \[y_{i,t} = \underbrace{\alpha_{i}}_{\text{Entity FE}} + \underbrace{\alpha_t}_{\text{Time FE}} + \delta D_{i,t} + \beta x_{i,t} + \epsilon_{i,t}\]

  • \(D_{i,t}\): The “Post-Treatment” dummy

    • \(D_{i,t} = 1\) if individual \(i\) is actively treated at time \(t\).
    • \(D_{i,t} = 0\) otherwise.
  • Conclusion: This is the standard workhorse model for policy evaluation in economics (sketched below).
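
With linearmodels, the TWFE regression can be sketched as follows (hypothetical df_panel indexed by (id, year), with a 0/1 treatment column D and a control x):

    from linearmodels import PanelOLS

    # delta is the coefficient on the treatment dummy D
    twfe = PanelOLS.from_formula(
        'y ~ 1 + D + x + EntityEffects + TimeEffects', data=df_panel
    ).fit()
    print(twfe.params['D'])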


In practice: Python

  • Using the linearmodels library (built on top of pandas/statsmodels).

  • Entity Fixed Effects:

    from linearmodels import PanelOLS
    # ... load data ...
    # PanelOLS expects df to carry a MultiIndex (entity, time),
    # e.g. df = df.set_index(['firm', 'year'])
    mod = PanelOLS.from_formula('invest ~ 1 + value + capital + EntityEffects', data=df)
    res = mod.fit()
  • Two-Way Fixed Effects:

    mod = PanelOLS.from_formula('invest ~ 1 + value + capital + EntityEffects + TimeEffects', data=df)
    res = mod.fit()
Tip: Pro Tip

PanelOLS handles the dummy variable math efficiently. Don’t try to create 1000 columns of dummies yourself!


Conclusion


Takeaways

  1. More Data, More Power: Panel data combines the best of cross-sections (“differences between people”) and time-series (“changes over time”).
  2. Control the Unobservable:
     • For simple cases, you can just add dummy variables to control for heterogeneity, or run separate regressions for different groups.
     • Use Fixed Effects to absorb constant individual characteristics (culture, ability) that would otherwise bias your results.
  3. Think Causally: Diff-in-Diff allows for powerful quasi-experimental analysis (Treated vs Control, Before vs After).
  4. Tools: Use linearmodels in Python to handle the complexity automatically.