Data-Based Economics, ESCP, 2024-2025
2025-02-19
Definition Candidates:
Arthur Samuel: Field of study that gives computers the ability to learn without being explicitly programmed
Tom Mitchell: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
tabular data
supervised: regression:
Age | Activity | Salary |
---|---|---|
23 | Explorer | 1200 |
40 | Mortician | 2000 |
45 | Mortician | 2500 |
33 | Movie Star | 3000 |
35 | Explorer | ??? |
supervised: classification
Age | Salary | Activity |
---|---|---|
23 | 1200 | Explorer |
40 | 2000 | Mortician |
45 | 2500 | Mortician |
33 | 3000 | Movie Star |
35 | 3000 | ??? |
unsupervised
Age | Salary | Activity |
---|---|---|
23 | 1200 | Explorer |
40 | 2000 | Mortician |
45 | 2500 | Mortician |
33 | 3000 | Movie Star |
35 | 3000 | Explorer |
unsupervised: clustering
Women buying dresses during the year:
\[\underbrace{y}_{\text{explained variable}} = a \underbrace{x}_{\text{explanatory variable}} + b\]
\[\underbrace{y}_{\text{labels}} = a \underbrace{x}_{\text{features}} + b\]
Econometrics | Machine learning |
---|---|
Regressor / independent variable / explanatory variable | Features |
Regressand / dependent variable / explained variable | Labels |
Regression | Model Training |
Long data is characterized by a high number of observations.
We need a way to fit a model on a subset of the data at a time.
Traditional regression: the whole dataset is loaded in memory and all observations are used at once.
Incremental learning: the fit is updated step by step, using only a batch of observations at a time.
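A minimal sketch of incremental learning with scikit-learn's `SGDRegressor` (the batch sizes, coefficients, and number of batches below are illustrative assumptions, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Sketch: the full dataset never sits in memory; the model is
# updated one small batch of observations at a time.
rng = np.random.default_rng(0)
true_coef = np.array([1.0, 2.0, 3.0])

model = SGDRegressor(random_state=0)
for _ in range(200):                # 200 batches of 20 observations each
    X = rng.normal(size=(20, 3))
    y = X @ true_coef + 0.1 * rng.normal(size=20)
    model.partial_fit(X, y)         # update the fit with this batch only

print(model.coef_)  # close to [1, 2, 3]
```

`partial_fit` is what distinguishes incremental estimators from the usual `fit`: it refines the current coefficients instead of re-estimating from scratch.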
How do we minimize a function \(f(a,b)\)?
Gradient descent: starting from an initial guess \((a_0, b_0)\), take repeated steps in the direction opposite the gradient: \[(a_{k+1}, b_{k+1}) = (a_k, b_k) - \lambda \nabla f(a_k, b_k)\] where the learning rate \(\lambda > 0\) controls the step size.
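A minimal sketch of gradient descent, assuming \(f(a,b)\) is the mean squared error of the linear model \(y = ax + b\) (the simulated data and learning rate are illustrative):

```python
import numpy as np

# Minimize f(a, b) = mean squared error of y ≈ a*x + b
# by following the negative gradient with a fixed learning rate.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

a, b = 0.0, 0.0
lr = 0.1                          # learning rate (step size)
for _ in range(500):
    e = y - (a * x + b)           # residuals at the current (a, b)
    grad_a = -2 * np.mean(e * x)  # ∂f/∂a
    grad_b = -2 * np.mean(e)      # ∂f/∂b
    a -= lr * grad_a              # step opposite the gradient
    b -= lr * grad_b

print(a, b)  # approaches the true values a = 2, b = 1
```

Each iteration only needs the gradient at the current point, which is why the same idea scales to models far more complex than a line.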
Wide Data is characterized by a high number of features compared to the number of observations.
Problem: when there are more features than observations, ordinary least squares overfits (many coefficient vectors fit the data equally well).
Main Idea: penalize non-zero coefficients to encourage sparsity
Remarks:
To perform Lasso and ridge regression:
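A hypothetical wide-data sketch contrasting Lasso and ridge with scikit-learn (the dimensions, penalties `alpha`, and simulated coefficients are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Wide data: 20 observations, 50 features,
# only the first 3 features actually matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))
beta = np.zeros(50)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + 0.1 * rng.normal(size=20)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso sets most coefficients exactly to zero (sparsity);
# ridge only shrinks them toward zero.
print((lasso.coef_ == 0).sum())   # many exact zeros
print((ridge.coef_ == 0).sum())   # typically none
```

The exact-zero coefficients are why Lasso doubles as a feature-selection device in wide datasets.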
In machine learning we can’t perform statistical inference easily. How do we assess the validity of a model?
Basic idea (independent of how complex the algorithm is): estimate the model on one part of the data and measure its performance on data it has never seen.
Performance can be: the mean squared error or \(R^2\) for a regression, the accuracy for a classification, …
In case the training method itself depends on many parameters (the hyperparameters), we make three samples instead: a training set, a validation set to choose the hyperparameters, and a test set.
Golden Rule: the test set should not be used to estimate the model, and should not affect the choice of any training parameter (hyperparameter).
The test set reveals that the orange model is overfitting.
Holdout validation approach:
How to choose the sizes of the subsets?
A more robust solution: \(k\)-fold validation
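A minimal sketch of \(k\)-fold validation with scikit-learn, using the diabetes dataset that appears later in these notes (the choice of 5 folds and of a plain linear regression is illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds; each fold
# serves once as the test set while the other 4 are used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(scores)         # one R² score per fold
print(scores.mean())  # a more robust performance estimate
```

Averaging over the folds reduces the dependence of the estimate on one particular holdout split.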
Python libraries for regression:
- statsmodels: econometrics toolbox (model.fit() returns another object holding the estimation results)
- linearmodels: extensions of statsmodels, e.g. for panel data (very similar interface)
- sklearn: machine-learning toolbox (in-place .fit operation)

Basic sklearn workflow:
# Load an example dataset
from sklearn.datasets import load_diabetes
dataset = load_diabetes()
X = dataset['data']    # features
y = dataset['target']  # labels

# Split into training and test sets (10% held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Feature scaling: standardize the features (fit the scaler on the training set only)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train a linear regression and evaluate it on the test set
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)   # R² on the test set

# Predict labels for new observations X_new (scaled the same way)
model.predict(X_new)
The workflow is always the same, no matter what the model is: to run a Lasso regression, simply use sklearn.linear_model.Lasso instead of LinearRegression.