Dataframes

Data-Based Economics

Year 2024-2025

Published January 21, 2025


Tabular Data


DataFrame

  • A DataFrame (aka a table) is a 2-D labeled data structure with columns
    • each column has a specific type and a column name
    • types: quantitative, qualitative (ordered, non-ordered, …)
  • First column is special: the index
  • first goal of an econometrician: build a good dataframe
    • aka “cleaning the data”

DataFrames are everywhere

  • sometimes data comes from several linked dataframes
    • relational database
    • can still be seen conceptually as one dataframe…
    • … through a join operation
  • dataframes / relational databases are so ubiquitous that a dedicated language was developed for them
    • SQL
    • in the 80s…
  • probably worth looking at if you have some “data” ambitions
  • you will see the shadow of SQL everywhere - words like: join, merge, select, insert…
  • plenty of resources to learn (example: SQLBolt)


Pandas



pandas

  • pandas = panel + data
    • a python library created by Wes McKinney
    • highly optimized
  • essentially a dataframe object
  • many options, but if in doubt:
    • there are tons of online tutorials

creating a dataframe (1)

  • Import pandas
    • preferably with standard alias pd
    import pandas as pd
  • Create a dataframe
    • each column is a different entry in the dictionary
    # from a dictionary
    d = {
      "country": ["USA", "UK", "France"],
      "comics": [13, 10, 12]   
    }
    pd.DataFrame(d)
  country  comics
0     USA      13
1      UK      10
2  France      12

creating a dataframe (2)

  • there are many other ways to create a dataframe
    • for instance using a numpy matrix (numpy is a linear algebra library)
    # from a matrix
    import numpy as np
    M = np.array(
        [[18, 150],
         [21, 200],
         [29, 1500]]
    )   
    df = pd.DataFrame( M, columns=["age", "travel"] )
    df
   age  travel
0   18     150
1   21     200
2   29    1500

File Formats

Common file formats

  • comma separated files: csv file
    • often distributed online
    • can be exported easily from Excel or LibreOffice
  • stata files: use pd.read_stata()
  • excel files: use pd.read_excel() or a lower-level reader like xlrd if unlucky
    • note that excel does not store a dataframe (each cell potentially has a different type)
    • postprocessing is needed
  • see the reader sketch below
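  • a minimal sketch of the readers above (file names are hypothetical, for illustration only):
import pandas as pd

df_stata = pd.read_stata('survey.dta')   # Stata file
df_excel = pd.read_excel('survey.xlsx')  # Excel file; may need the openpyxl package
df_csv   = pd.read_csv('survey.csv')     # comma separated file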

Comma separated file

  • one can actually create a file from python
txt = """year,country,measure
2018,"france",950.0
2019,"france",960.0
2020,"france",1000.0
2018,"usa",2500.0
2019,"usa",2150.0
2020,"usa",2300.0
"""
open('dummy_file.csv','w').write(txt) # we write it to a file
  • and import it
df = pd.read_csv('dummy_file.csv') # what index should we use ?
df
   year country  measure
0  2018  france    950.0
1  2019  france    960.0
2  2020  france   1000.0
3  2018     usa   2500.0
4  2019     usa   2150.0
5  2020     usa   2300.0
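  • one possible answer to the question in the comment, as a sketch: the standard index_col option of pd.read_csv uses an existing column as the index
pd.read_csv('dummy_file.csv', index_col='year') # use the 'year' column as index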

“Annoying” Comma Separated File

  • Sometimes, comma-separated files are not quite comma-separated…
    • inspect the file with a text editor to see what it contains
      • the kind of separator, whether there are quotes…
      txt = """year;country;measure
      2018;"france";950.0
      2019;"france";960.0
      2020;"france";1000.0
      2018;"usa";2500.0
      2019;"usa";2150.0
      2020;"usa";2300.0
      """
      open('annoying_dummy_file.csv','w').write(txt) # we write it to a file
  • add relevant options to pd.read_csv and check result
pd.read_csv("annoying_dummy_file.csv", sep=";")
   year country  measure
0  2018  france    950.0
1  2019  france    960.0
2  2020  france   1000.0
3  2018     usa   2500.0
4  2019     usa   2150.0
5  2020     usa   2300.0

Exporting a DataFrame

  • pandas can export to many formats: df.to_...

  • to (standard) CSV

print( df.to_csv() )
,year,country,measure
0,2018,france,950.0
1,2019,france,960.0
2,2020,france,1000.0
3,2018,usa,2500.0
4,2019,usa,2150.0
5,2020,usa,2300.0
  • or to stata
df.to_stata('dummy_example.dta')
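  • the unnamed first column in the CSV output above is the index; the standard index option drops it
print( df.to_csv(index=False) ) # omit the index column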

Data Sources

Types of Data Sources

  • Where can we get data from?
  • Official websites
    • often in csv form
    • often impractical to use directly
    • sometimes unavoidable
    • open data trend: more unstructured data
  • Data providers
    • supply an API (i.e. easy-to-use functions)

Data providers

  • commercial ones:
    • bloomberg, macrobond, factset, quandl …
  • free ones available as a python library
    • dbnomics: many official time-series
    • qeds: databases used by quantecon
    • vega-datasets: distributed with altair

import vega_datasets
df = vega_datasets.data('iris')
df
     sepalLength  sepalWidth  petalLength  petalWidth    species
0            5.1         3.5          1.4         0.2     setosa
1            4.9         3.0          1.4         0.2     setosa
2            4.7         3.2          1.3         0.2     setosa
3            4.6         3.1          1.5         0.2     setosa
4            5.0         3.6          1.4         0.2     setosa
..           ...         ...          ...         ...        ...
145          6.7         3.0          5.2         2.3  virginica
146          6.3         2.5          5.0         1.9  virginica
147          6.5         3.0          5.2         2.0  virginica
148          6.2         3.4          5.4         2.3  virginica
149          5.9         3.0          5.1         1.8  virginica

150 rows × 5 columns



DBnomics example

  • DBnomics aggregates time series from various public sources
  • data is organized as provider/database/series
  • try to find the identifier of one or several series
import dbnomics
df = dbnomics.fetch_series('AMECO/ZUTN/EA19.1.0.0.0.ZUTN')
  • tip: in case a python package is missing, it can be installed on the fly as in
!pip install dbnomics
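  • to check what came back, one can inspect the fetched frame; treat the column names below as an assumption (dbnomics results typically include 'period' and 'value' columns alongside provider metadata)
df[['period', 'value']].tail() # most recent observations; 'period'/'value' assumed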

Inspect / describe data

Inspecting data

  • once the data is loaded as df, we want to look at some basic properties:
  • general
    • df.head(5) # 5 first lines
    • df.tail(5) # 5 last lines
    • df.describe() # general summary
  • central tendency
    • df.mean() # average
    • df.median() # median
  • spread
    • df.std() # standard deviation
    • df.var() # variance
    • df.min(), df.max() # bounds
  • counts (for categorical variables)
    • df.count() # number of non-missing values

df.head(2)
   sepalLength  sepalWidth  petalLength  petalWidth species
0          5.1         3.5          1.4         0.2  setosa
1          4.9         3.0          1.4         0.2  setosa
df.describe()
       sepalLength  sepalWidth  petalLength  petalWidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.057333     3.758000    1.199333
std       0.828066    0.435866     1.765298    0.762238
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000

Manipulating DataFrames

Changing names of columns

  • Columns are defined by property df.columns
df.columns
Index(['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species'], dtype='object')
  • This property can be set with a list of the right length
df.columns = ['sLength', 'sWidth', 'pLength', 'pWidth', 'species']
df.head(2)
   sLength  sWidth  pLength  pWidth species
0      5.1     3.5      1.4     0.2  setosa
1      4.9     3.0      1.4     0.2  setosa

Indexing a column

  • A column can be extracted using its name as in a dictionary (like df['sLength'])
series = df['sWidth'] # note the resulting object: a series
series
0      3.5
1      3.0
      ... 
148    3.4
149    3.0
Name: sWidth, Length: 150, dtype: float64
  • The result is a series object (typed values with a name and an index)
  • It has its own set of methods
    • try (see the sketch below):
      • series.mean(), series.std()
      • series.plot()
      • series.diff()
        • creates \(y_t = x_t-x_{t-1}\)
      • series.pct_change()
        • creates \(y_t = \frac{x_t-x_{t-1}}{x_{t-1}}\)
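  • a minimal sketch trying these methods on the series above:
series.mean(), series.std()   # central tendency and spread
changes = series.diff()       # absolute change between consecutive rows
growth = series.pct_change()  # relative change between consecutive rows
series.plot()                 # quick line plot (requires matplotlib)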

Creating a new column

  • It is possible to create a new column by combining existing ones
df['totalLength'] = df['pLength'] + df['sLength']
# this would also work (but compute an average rather than a sum):
# df['totalLength'] = 0.5*df['pLength'] + 0.5*df['sLength']
df.head(2)
   sLength  sWidth  pLength  pWidth species  totalLength
0      5.1     3.5      1.4     0.2  setosa          6.5
1      4.9     3.0      1.4     0.2  setosa          6.3

Replacing a column

  • An existing column can be replaced with the same syntax.
df['totalLength'] = df['pLength'] + df['sLength']*0.5
df.head(2)
   sLength  sWidth  pLength  pWidth species  totalLength
0      5.1     3.5      1.4     0.2  setosa         3.95
1      4.9     3.0      1.4     0.2  setosa         3.85

Selecting several columns

  • Index with a list of column names
e = df[ ['pLength', 'sLength'] ]
e.head(3)
   pLength  sLength
0      1.4      5.1
1      1.4      4.9
2      1.3      4.7

Selecting lines (1)

  • use index range
    • ☡: in Python the end of a range is not included !
    df[2:4]
   sLength  sWidth  pLength  pWidth species  totalLength
2      4.7     3.2      1.3     0.2  setosa         3.65
3      4.6     3.1      1.5     0.2  setosa         3.80

Selecting lines (2)

  • let’s look at unique species
df['species'].unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
  • we would like to keep only the lines with virginica
bool_ind = df['species'] == 'virginica' # this is a boolean series
  • the result is a boolean series, where each element tells whether a line should be kept or not
e = df[ bool_ind ]
e.head(4)
     sLength  sWidth  pLength  pWidth    species  totalLength
100      6.3     3.3      6.0     2.5  virginica         9.15
101      5.8     2.7      5.1     1.9  virginica         8.00
102      7.1     3.0      5.9     2.1  virginica         9.45
103      6.3     2.9      5.6     1.8  virginica         8.75
  • if you want, you can just remember the recipe:
df[df['species'] == 'virginica']
  • to keep lines where species is equal to virginica

Selecting lines and columns

  • sometimes, one wants finer control about which lines and columns to select:
    • use df.loc[...] which can be indexed as a matrix
df.loc[0:4, 'species']
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
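  • rows and columns can be selected together; note that, unlike a plain Python range, a .loc label range includes its endpoint (which is why 5 rows are returned above); a short sketch:
df.loc[0:4, ['sLength', 'species']] # rows 0..4 and two columns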

Combine everything

  • Here is an example combining several techniques
    • Let’s change the way totalLength is computed, but only for ‘virginica’
    index = (df['species']=='virginica')
    df.loc[index,'totalLength'] = df.loc[index,'sLength'] + 1.5*df.loc[index,'pLength']

Reshaping DataFrames

The following code creates two example files.

txt_wide = """year,france,usa
2018,950.0,2500.0
2019,960.0,2150.0
2020,1000.0,2300.0
"""
open('dummy_file_wide.csv','w').write(txt_wide) # we write it to a file
txt_long = """year,country,measure
2018,"france",950.0
2019,"france",960.0
2020,"france",1000.0
2018,"usa",2500.0
2019,"usa",2150.0
2020,"usa",2300.0
"""
open('dummy_file_long.csv','w').write(txt_long) # we write it to a file
df_long = pd.read_csv("dummy_file_long.csv")
df_wide = pd.read_csv("dummy_file_wide.csv")

Wide vs Long format (1)

Compare the following tables

df_wide
   year  france     usa
0  2018   950.0  2500.0
1  2019   960.0  2150.0
2  2020  1000.0  2300.0
df_long
   year country  measure
0  2018  france    950.0
1  2019  france    960.0
2  2020  france   1000.0
3  2018     usa   2500.0
4  2019     usa   2150.0
5  2020     usa   2300.0

Wide vs Long format (2)

  • in long format: each line is an independent observation
    • two lines may belong to the same category (year, or country)
    • all values are given in the same column
    • their types/categories are given in another column
  • in wide format: some observations are grouped
    • in the example they are grouped by year
    • values of different kinds are in different columns
    • the types/categories are stored as column names
  • both representations are useful

Tidy data

  • tidy data:
    • every column is a variable.
    • every row is an observation.
    • every cell is a single value.
  • a very good format for:
    • quick visualization
    • data analysis

Converting from Wide to Long

df_wide.melt(id_vars='year')
   year variable   value
0  2018   france   950.0
1  2019   france   960.0
2  2020   france  1000.0
3  2018      usa  2500.0
4  2019      usa  2150.0
5  2020      usa  2300.0
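  • to recover exactly the earlier long table, the generic variable/value column names can be overridden (standard melt options):
df_wide.melt(id_vars='year', var_name='country', value_name='measure')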

Converting from Long to Wide

df_ = df_long.pivot(index='year', columns='country')
df_
        measure
country  france     usa
year
2018      950.0  2500.0
2019      960.0  2150.0
2020     1000.0  2300.0
# the result of pivot has a "hierarchical index"
# let's change columns names
df_.columns = df_.columns.get_level_values(1)
df_
country  france     usa
year
2018      950.0  2500.0
2019      960.0  2150.0
2020     1000.0  2300.0
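  • a possible shortcut, as a sketch: passing values='measure' to pivot selects the column directly and avoids the hierarchical columns
df_long.pivot(index='year', columns='country', values='measure')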

groupby

  • groupby is a very powerful function which can be used to work directly on data in the long format.
    • for instance to compute averages per country
    df_long.groupby("country").mean()
           year      measure
country
france   2019.0   970.000000
usa      2019.0  2316.666667
  • You can perform several aggregations at the same time:
df_long.groupby("country").agg(['mean','std'])
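  • a sketch using named aggregation (supported in recent pandas versions), which also gives readable column names:
df_long.groupby('country').agg(
    avg_measure=('measure', 'mean'), # average measure per country
    std_measure=('measure', 'std'),  # standard deviation per country
)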

Merging

Merging two dataframes

  • Suppose we have two dataframes, with related observations
  • How can we construct one single database with all the information?
  • Answer:
    • concat if long format (see the sketch at the end of this section)
    • merge databases if wide format
  • Lots of subtleties when data gets complicated
    • we’ll see them in due time

txt_long_1 = """year,country,measure
2018,"france",950.0
2019,"france",960.0
2020,"france",1000.0
2018,"usa",2500.0
2019,"usa",2150.0
2020,"usa",2300.0
"""
open("dummy_long_1.csv",'w').write(txt_long_1)
txt_long_2 = """year,country,recipient
2018,"france",maxime
2019,"france",mauricette
2020,"france",mathilde
2018,"usa",sherlock
2019,"usa",watson
2020,"usa",moriarty
"""
open("dummy_long_2.csv",'w').write(txt_long_2)
df_long_1 = pd.read_csv('dummy_long_1.csv')
df_long_2 = pd.read_csv('dummy_long_2.csv')

Merging two DataFrames with pandas

df_long_1.merge(df_long_2)
   year country  measure   recipient
0  2018  france    950.0      maxime
1  2019  france    960.0  mauricette
2  2020  france   1000.0    mathilde
3  2018     usa   2500.0    sherlock
4  2019     usa   2150.0      watson
5  2020     usa   2300.0    moriarty
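  • for the long-format case mentioned at the start of this section, a minimal concat sketch (the extra observations are hypothetical, for illustration only):
# hypothetical additional year, with the same columns as df_long_1
df_long_3 = pd.DataFrame({
    'year': [2021, 2021],
    'country': ['france', 'usa'],
    'measure': [1010.0, 2400.0],
})
pd.concat([df_long_1, df_long_3], ignore_index=True) # stack the rows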