Data-Based Economics
2025-01-21
An example of tabular data:

|   | age | travel |
|---|---|---|
| 0 | 18 | 150 |
| 1 | 21 | 200 |
| 2 | 29 | 1500 |
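Such a table can be built directly in pandas; a minimal sketch using the values above (the construction itself is our assumption):

```python
import pandas as pd

df = pd.DataFrame({'age': [18, 21, 29], 'travel': [150, 200, 1500]})
df
```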
pandas can import data from many formats:

- csv file: `pd.read_csv()`
- Stata file (.dta): `pd.read_stata()`
- Excel file: `pd.read_excel()` (or xlsreader if unlucky)
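A minimal sketch of these calls (the file names are hypothetical):

```python
import pandas as pd

df_csv   = pd.read_csv('data.csv')      # comma-separated text
df_stata = pd.read_stata('data.dta')    # Stata dataset
df_excel = pd.read_excel('data.xlsx')   # Excel spreadsheet
```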
As an example, the following code creates a small CSV file:

```python
txt = """year,country,measure
2018,"france",950.0
2019,"france",960.0
2020,"france",1000.0
2018,"usa",2500.0
2019,"usa",2150.0
2020,"usa",2300.0
"""
open('dummy_file.csv','w').write(txt)  # we write it to a file
```
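We then read the file back with `pd.read_csv`; a minimal sketch (the DataFrame name `df` is ours):

```python
import pandas as pd

df = pd.read_csv('dummy_file.csv')
df
```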
Checking the result:

|   | year | country | measure |
|---|---|---|---|
| 0 | 2018 | france | 950.0 |
| 1 | 2019 | france | 960.0 |
| 2 | 2020 | france | 1000.0 |
| 3 | 2018 | usa | 2500.0 |
| 4 | 2019 | usa | 2150.0 |
| 5 | 2020 | usa | 2300.0 |
pandas can also export to many formats via the `df.to_...` methods. For example, to (standard) CSV.
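A minimal sketch of the export call (standard pandas; the commented path is hypothetical):

```python
print(df.to_csv())          # CSV representation as a string (index included by default)
# df.to_csv('export.csv')   # or write directly to a file
```

This prints: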
```
,year,country,measure
0,2018,france,950.0
1,2019,france,960.0
2,2020,france,1000.0
3,2018,usa,2500.0
4,2019,usa,2150.0
5,2020,usa,2300.0
```
Several Python packages also give direct access to curated databases:

- `dbnomics`: many official time series
- `qeds`: databases used by QuantEcon
- `vega_datasets`: distributed with Altair

As an example, the classical iris dataset (available through `vega_datasets`) looks like this:

|   | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| … | … | … | … | … | … |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |

150 rows × 5 columns
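A minimal sketch of how such a DataFrame can be loaded, assuming the `vega_datasets` package is installed:

```python
from vega_datasets import data

df = data.iris()   # returns a pandas DataFrame with 150 rows
df
```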
Given a DataFrame `df`, we want to look at some basic properties:

- `df.head(5)`: first 5 lines
- `df.tail(5)`: last 5 lines
- `df.describe()`: general summary
- `df.mean()`: average
- `df.median()`: median
- `df.std()`: standard deviation
- `df.var()`: variance
- `df.min()`, `df.max()`: bounds

For example, the first two rows of the iris data:

|   | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
and `df.describe()` gives a general summary:

|   | sepalLength | sepalWidth | petalLength | petalWidth |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Column names are accessible via `df.columns`:

```python
df.columns
```

```
Index(['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species'], dtype='object')
```
Columns can be renamed; with shorter names the table looks like this:

|   | sLength | sWidth | pLength | pWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
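A minimal sketch of one way to obtain these shorter names (the exact renaming used originally is an assumption):

```python
df.columns = ['sLength', 'sWidth', 'pLength', 'pWidth', 'species']
df.head(2)
```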
A single column can be extracted as a pandas Series, for instance `df['sWidth']`:

```
0      3.5
1      3.0
      ...
148    3.4
149    3.0
Name: sWidth, Length: 150, dtype: float64
```
Many operations are available on a Series:

- `series.mean()`, `series.std()`: average and standard deviation
- `series.plot()`: quick plot
- `series.diff()`: difference with the previous observation
- `series.pct_change()`: percentage change from the previous observation
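A minimal sketch of these operations on the column extracted above:

```python
s = df['sWidth']

print(s.mean(), s.std())   # average and standard deviation
s.diff().head()            # difference with the previous row (first value is NaN)
s.pct_change().head()      # percentage change from the previous row
```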
New columns can be created from existing ones:

```python
df['totalLength'] = df['pLength'] + df['sLength']
df.head(2)
```
|   | sLength | sWidth | pLength | pWidth | species | totalLength |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 6.5 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 6.3 |
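This would also work (a weighted combination instead of a sum):

```python
df['totalLength'] = 0.5*df['pLength'] + 0.5*df['sLength']
df.head(2)
```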
|   | sLength | sWidth | pLength | pWidth | species | totalLength |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 3.95 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 3.85 |
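Several columns can be selected at once, which returns a new DataFrame; a minimal sketch of a selection that plausibly produced the table below:

```python
df[['pLength', 'sLength']].head(3)
```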
|   | pLength | sLength |
|---|---|---|
| 0 | 1.4 | 5.1 |
| 1 | 1.4 | 4.9 |
| 2 | 1.3 | 4.7 |
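Rows can also be selected by position; a minimal sketch of a slice that plausibly produced the table below:

```python
df.iloc[2:4]   # rows at positions 2 and 3
```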
|   | sLength | sWidth | pLength | pWidth | species | totalLength |
|---|---|---|---|---|---|---|
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 3.65 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 3.80 |
The `species` column takes three distinct values (as returned by `df['species'].unique()`):

```
array(['setosa', 'versicolor', 'virginica'], dtype=object)
```

We can select the subset of the table where `species` is equal to `virginica`:

```python
df[df['species'] == 'virginica']
```
|   | sLength | sWidth | pLength | pWidth | species | totalLength |
|---|---|---|---|---|---|---|
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 9.15 |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 8.00 |
| 102 | 7.1 | 3.0 | 5.9 | 2.1 | virginica | 9.45 |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | virginica | 8.75 |
There is also `df.loc[...]`, which can be indexed like a matrix (by row and column labels). Selecting the first five entries of the `species` column gives:

```
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
```
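A minimal sketch of this matrix-like indexing (the exact calls are assumptions):

```python
df.loc[0:4, 'species']                                         # rows with labels 0-4, one column (a Series)
df.loc[df['species'] == 'virginica', ['sLength', 'pLength']]   # boolean row selection, column subset
```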
The following code creates two example databases.
```python
txt_wide = """year,france,usa
2018,950.0,2500.0
2019,960.0,2150.0
2020,1000.0,2300.0
"""
open('dummy_file_wide.csv','w').write(txt_wide)  # we write it to a file
```
```python
txt_long = """year,country,measure
2018,"france",950.0
2019,"france",960.0
2020,"france",1000.0
2018,"usa",2500.0
2019,"usa",2150.0
2020,"usa",2300.0
"""
open('dummy_file_long.csv','w').write(txt_long)  # we write it to a file
```
Compare the following tables (long format vs wide format):
|   | year | variable | value |
|---|---|---|---|
| 0 | 2018 | france | 950.0 |
| 1 | 2019 | france | 960.0 |
| 2 | 2020 | france | 1000.0 |
| 3 | 2018 | usa | 2500.0 |
| 4 | 2019 | usa | 2150.0 |
| 5 | 2020 | usa | 2300.0 |
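A minimal sketch of how one converts between the two representations (the names `df_wide`, `df_long` and `df_` are ours; note that the long table above uses pandas' default `melt` column names `variable` and `value`):

```python
import pandas as pd

df_wide = pd.read_csv('dummy_file_wide.csv')   # columns: year, france, usa
df_long = pd.read_csv('dummy_file_long.csv')   # columns: year, country, measure

# wide -> long: one row per (year, country) observation
pd.melt(df_wide, id_vars='year')

# long -> wide: one column per country (columns get a hierarchical index)
df_ = df_long.pivot(index='year', columns='country')
df_
```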
```
        measure
country  france     usa
year
2018      950.0  2500.0
2019      960.0  2150.0
2020     1000.0  2300.0
```
```python
# the result of pivot has a "hierarchical index"
# let's change the column names
df_.columns = df_.columns.get_level_values(1)
df_
```
```
country  france     usa
year
2018      950.0  2500.0
2019      960.0  2150.0
2020     1000.0  2300.0
```
`groupby` is a very powerful function which can be used to work directly on data in the long format.
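For instance, the table below is plausibly obtained by grouping the long data by country and averaging (a sketch, with `df_long` as above):

```python
df_long.groupby('country').mean()
```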
| country | year | measure |
|---|---|---|
| france | 2019.0 | 970.000000 |
| usa | 2019.0 | 2316.666667 |
To combine two databases, one can use:

- `concat` if long format
- `merge` if wide format

```python
txt_long_1 = """year,country,measure
2018,"france",950.0
2019,"france",960.0
2020,"france",1000.0
2018,"usa",2500.0
2019,"usa",2150.0
2020,"usa",2300.0
"""
open("dummy_long_1.csv",'w').write(txt_long_1)
|   | year | country | measure | recipient |
|---|---|---|---|---|
| 0 | 2018 | france | 950.0 | maxime |
| 1 | 2019 | france | 960.0 | mauricette |
| 2 | 2020 | france | 1000.0 | mathilde |
| 3 | 2018 | usa | 2500.0 | sherlock |
| 4 | 2019 | usa | 2150.0 | watson |
| 5 | 2020 | usa | 2300.0 | moriarty |