The first artificial neuron, the perceptron, was invented by the psychologist Frank Rosenblatt (1957):
Evolution of the model:
Multilayer perceptron:
\[A_0\,\tau_1\big(A_1\,\tau_2\big(A_2\,(\cdots) + B_2\big) + B_1\big)\]
where the \(\tau_i\) are activation functions, the \(A_i\) are weight matrices and the \(B_i\) are bias vectors.
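As a concrete illustration, here is a minimal sketch of this composition in NumPy; the tanh activation, the layer sizes, and the random weights are illustrative assumptions, not part of the slide.

```python
import numpy as np

def mlp_forward(x, layers, activation=np.tanh):
    """Alternate affine maps A_i h + B_i with activations tau_i, as in the formula above."""
    h = x
    for A, B in layers[:-1]:      # hidden layers: affine map followed by an activation
        h = activation(A @ h + B)
    A, B = layers[-1]             # output layer: affine map only
    return A @ h + B

# Example: 2 inputs -> two hidden layers of 16 units -> 1 output, random weights
rng = np.random.default_rng(0)
sizes = [2, 16, 16, 1]
layers = [(0.1 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(mlp_forward(np.array([0.5, -1.0]), layers))
```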
Recurrent neural network
Overall, a neural network is a (very) nonlinear approximating function \(f(x; \theta)\) that depends on a large number of parameters \(\theta\).
It can be trained to achieve a specific objective, for instance to fit some data \((x_n,y_n)\):
\[\min_{\theta} \sum_i (f(x_i; \theta) - y_i)^2\]
More accurately, if we have access to a random subset \(\epsilon \in \mathcal{D}\) of the data (or to the generating process), we want to solve
\[\min_{\theta} \Xi(\theta) = \min_{\theta} E_{\epsilon} \Big[ \underbrace{\sum_{(x_i,y_i)\in \epsilon} (f(x_i; \theta) - y_i)^2}_{\xi(\epsilon, \theta)} \Big]\]
\(\xi(\epsilon,\theta)\) is called the “empirical risk”, while \(\Xi(\theta)\) is called the “theoretical risk”.
This can be done for regression or classification tasks. Check the TensorFlow Playground.
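A small sketch of the empirical risk \(\xi(\epsilon,\theta)\) on a random subset of a synthetic dataset; the generating process \(y = 2x + \text{noise}\), the linear stand-in model \(f(x;\theta)=\theta x\), and the subset size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2.0 * X + 0.1 * rng.normal(size=1000)    # assumed toy generating process

def xi(theta, idx):
    """Empirical risk xi(epsilon, theta) on the random subset epsilon indexed by idx."""
    return np.sum((theta * X[idx] - Y[idx]) ** 2)

epsilon = rng.choice(len(X), size=32, replace=False)   # one random draw of the subset
print(xi(2.0, epsilon), xi(0.0, epsilon))              # the risk is much lower near the true theta
```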
How do we optimize the function \(\Xi(\theta)\)?
Consider the scalar function \(\Xi(\theta)\) where \(\theta\) is a vector. How do we optimize it?
Denote the gradient of the objective by:
\[\nabla_{\theta}\Xi(\theta) = \begin{bmatrix} \frac{\partial \Xi}{\partial \theta_1} \\\\ \vdots \\\\ \frac{\partial \Xi}{\partial \theta_n} \end{bmatrix}\]
Gradient descent: follow the steepest slope, scaled by a learning rate \(\gamma\).
\[\theta \leftarrow \theta - \gamma \nabla_{\theta}\Xi(\theta)\]
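A sketch of this update rule on a toy quadratic objective; the objective, the starting point, and the step size are illustrative choices.

```python
import numpy as np

def Xi(theta):                 # toy objective: a quadratic bowl with minimum at 0
    return 0.5 * theta @ theta

def grad_Xi(theta):            # its gradient, known in closed form here
    return theta

theta = np.array([3.0, -2.0])
gamma = 0.1                    # learning rate
for _ in range(100):
    theta = theta - gamma * grad_Xi(theta)
print(theta, Xi(theta))        # converges towards the minimizer at 0
```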
Momentum (a ball rolls down the hill, \(\gamma\) acts as air resistance; here \(J\) denotes the objective and \(\eta\) the learning rate); a code sketch covering both momentum variants follows the Nesterov update below.
\[v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)\] \[\theta \leftarrow \theta - v_t\]
Nesterov Momentum: (slow down before going up…)
\[v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J\left(\theta-\gamma v_{t-1}\right)\] \[\theta \leftarrow \theta - v_t\]
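Both momentum variants on the same toy quadratic, following the two update rules above; \(\gamma\), \(\eta\), and the objective are illustrative choices.

```python
import numpy as np

def grad_J(theta):                 # gradient of a toy quadratic J(theta) = 0.5 ||theta||^2
    return theta

gamma, eta = 0.9, 0.05             # momentum coefficient and learning rate
theta_mom = np.array([3.0, -2.0])
theta_nes = theta_mom.copy()
v_mom = np.zeros_like(theta_mom)
v_nes = np.zeros_like(theta_nes)

for _ in range(200):
    # classical momentum
    v_mom = gamma * v_mom + eta * grad_J(theta_mom)
    theta_mom = theta_mom - v_mom
    # Nesterov momentum: evaluate the gradient at the look-ahead point
    v_nes = gamma * v_nes + eta * grad_J(theta_nes - gamma * v_nes)
    theta_nes = theta_nes - v_nes

print(theta_mom, theta_nes)        # both approach the minimizer at 0
```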
Learning rate annealing
\[\eta_t = \eta_0 / ({1+\kappa t})\]
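The same schedule in code; \(\eta_0\) and \(\kappa\) are illustrative values.

```python
eta0, kappa = 0.1, 0.01
etas = [eta0 / (1 + kappa * t) for t in range(5)]
print(etas)   # the step size decays hyperbolically in t
```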
Parameter-specific updates (Adam)
\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\] \[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]
\[\theta_{t+1} \leftarrow \theta_t-\frac{\eta}{\sqrt{\frac{v_t}{1-\beta_2^t}}+\epsilon}\,\frac{m_t}{1-\beta_1^t}\]
where \(g_t = \nabla_{\theta} J(\theta_t)\) and \(\epsilon\) is a small constant for numerical stability.
See also AdaGrad, AdaMax, RMSProp.
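A sketch of the Adam update loop on the same toy quadratic, using the bias-corrected moments above; the hyperparameter values are illustrative (close to the common defaults).

```python
import numpy as np

def grad_J(theta):                            # gradient of a toy quadratic objective
    return theta

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2        # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)                                  # ends up close to the minimizer at 0
```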
Given \(\epsilon \sim \mathcal{D}\), minimize \[\Xi(\theta)=E_{\epsilon}[ J(\theta,\epsilon)]\]
Idea: draw a random \(\epsilon_t\) at each step and do: \[\theta\leftarrow \theta - \gamma \nabla_{\theta} J(\theta,\epsilon_t)\]
It works!
Can escape local minima (with annealing)
The gradient is estimated from a random mini-batch \((\epsilon_1, \ldots, \epsilon_{N_m})\):
\[\theta\leftarrow \theta - \frac{\gamma}{N_m} \sum_{i=1}^{N_m} \nabla_{\theta} J(\theta,\epsilon_i)\]
Common case: the dataset is finite, \((\epsilon_1, \ldots, \epsilon_N)\).
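A sketch of mini-batch SGD on a toy least-squares problem; the generating process, the linear model, the batch size, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2.0 * X + 0.1 * rng.normal(size=1000)     # assumed toy generating process

def grad_J(theta, idx):
    """Mini-batch estimate of the gradient of the mean squared error."""
    return np.mean(2.0 * (theta * X[idx] - Y[idx]) * X[idx])

theta, gamma, batch_size = 0.0, 0.05, 32
for _ in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw a random mini-batch
    theta = theta - gamma * grad_J(theta, idx)
print(theta)   # close to the true coefficient 2.0
```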
All that is left is to frame the economic model as an empirical-risk objective!
\[\underset{ c_{t},w_{t+1} }{\max}E_{0}\left[\sum_{t=0}^{\infty }\beta ^{t}e^{{\color{red}\chi_{t}}}u( c_{t})\right]\]
s.t. \(w_{t+1}=re^{{\color{red}\varrho_{t}}}( w_{t}-c_{t})+e^{{\color{red}y_{t}}}e^{{\color{red}p_{t}}}\),
\(c_{t}\leq w_{t}\),
\(( z_{0},w_{0})\) given.
\(c_{t}\) = consumption; \(w_{t}\) = cash-on-hand; \(r\in ( 0,\frac{1}{\beta })\).
Each \(z_t\in \{ y_t,p_t,\varrho_t ,\chi_t \}\) follows an AR(1) process: \(z_{t+1}=\rho z_{t}+\sigma\epsilon_{t+1}\).
Kuhn-Tucker conditions: \(c-w\leq 0\), \(h\geq 0\) and \(( c-w) h=0\), where \[h\equiv u^{\prime }(c)e^{\chi -\varrho }-\beta rE[ u^{\prime}( c^{\prime }) e^{\chi ^{\prime }}]\] is the Lagrange multiplier associated with the constraint \(c\leq w\).
\(V( z,w) =\underset{c,w^{\prime }}{\max } \\\{ u(c)+\beta E_{\epsilon }[ V( z^{\prime },w^{\prime }) ] \\\}\)
\[\Xi (\theta ) \equiv E_{z_{0},w_{0},\epsilon_{1},...,\epsilon_{T} }[ \sum_{t=0}^{T}\beta ^{t}u( c(z_{t},w_{t};\theta ) ) ]\]
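A minimal Monte Carlo sketch of this objective for a neural decision rule \(c(z_t,w_t;\theta)\). The CRRA utility, the sigmoid consumption-share parameterization (which enforces \(0<c_t\leq w_t\)), the parameter values, and the fixed initial condition \((z_0,w_0)\) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (not taken from the slides)
beta, r, rho, sigma = 0.95, 1.02, 0.9, 0.1
T, N = 100, 1000                      # truncation horizon and number of simulated paths
crra = 2.0                            # assumed CRRA utility u(c) = c^(1-crra)/(1-crra)

def u(c):
    return c ** (1 - crra) / (1 - crra)

def policy(z, w, theta):
    """Assumed decision rule c(z, w; theta): a small net mapped to a consumption share in (0, 1)."""
    A, b, a_out = theta
    h = np.tanh(np.column_stack([z, np.log(w)]) @ A.T + b)
    share = 1.0 / (1.0 + np.exp(-(h @ a_out)))
    return share * w                              # guarantees 0 < c <= w

def Xi(theta):
    """Monte Carlo estimate of the lifetime-reward objective Xi(theta)."""
    z = np.zeros((N, 4))                          # shocks (y, p, varrho, chi), started at their mean
    w = np.ones(N)                                # fixed initial cash-on-hand (assumed)
    total = np.zeros(N)
    for t in range(T):
        c = policy(z, w, theta)
        total += beta ** t * np.exp(z[:, 3]) * u(c)                        # discounted utility flow
        w = r * np.exp(z[:, 2]) * (w - c) + np.exp(z[:, 0]) * np.exp(z[:, 1])  # budget constraint
        z = rho * z + sigma * rng.normal(size=(N, 4))                      # AR(1) shock updates
    return total.mean()

n_hidden = 8
theta = (0.1 * rng.normal(size=(n_hidden, 5)), np.zeros(n_hidden),
         0.1 * rng.normal(size=n_hidden))
print(Xi(theta))   # this estimate is what one maximizes over theta with the SGD variants above
```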
Lifetime reward in the baseline model.
Euler residuals in the baseline model.
Consumption decision rule in the baseline model.
\(E_{(z_{0},w_{0})}[ E_{( \epsilon_{1},...,\epsilon_{T})}u( \cdot ) ] =E_{( z_{0},w_{0},\epsilon_{1},...,\epsilon_{T}) }[ u( \cdot ) ]\) because nested expectations collapse into a single expectation over the joint distribution (law of iterated expectations).
\[E_{\eta}\Big[\big(E_{\epsilon}[ f(\eta, \epsilon) ]\big)^{2}\Big]=E_{(\eta, \epsilon_{1},\epsilon_{2}) }[ f( \eta, \epsilon_{1}) f(\eta, \epsilon_{2}) ]\]
where \(\epsilon_{1}\) and \(\epsilon_{2}\) are two independent draws of \(\epsilon\).
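This identity follows by writing the square as a product of two conditional expectations over independent copies:
\[E_{\eta}\Big[\big(E_{\epsilon}[f(\eta,\epsilon)]\big)^{2}\Big]=E_{\eta}\Big[E_{\epsilon_{1}}[f(\eta,\epsilon_{1})]\,E_{\epsilon_{2}}[f(\eta,\epsilon_{2})]\Big]=E_{(\eta,\epsilon_{1},\epsilon_{2})}\big[f(\eta,\epsilon_{1})\,f(\eta,\epsilon_{2})\big],\]
since, conditional on \(\eta\), the product of expectations over independent draws equals the expectation of the product.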
Comments on the method