AI for Research, ESCP, 2025-2026
2026-02-11
A rose is a rose is a rose
Gertrude Stein
Brexit means Brexit means Brexit
John Crace
Elementary, my dear Watson
P.G. Wodehouse
There is an easy way for the government to end the strike without withdrawing the pension reform,
Generative language models perform text completion.
They generate plausible text following a prompt.
The type of answer will depend on the kind of prompt.
To use GPT-4 proficiently, you have to experiment with the prompt.
It is the same as learning how to write Google queries:
+noir +film -"pinot noir"
“Prompting” is becoming a discipline in itself… (or is it?)
By providing enough context, it is possible to perform amazing tasks
The Caesar code
Zodiac 408 Cipher


Later in 2001, in a prison, somewhere in California
Solved by Stanford’s Persi Diaconis and his students using Markov Chain Monte Carlo (MCMC)
Take a letter \(x_n\): what is the probability that the next letter is \(x_{n+1}\)?
\[\pi_{X,Y} = P(x_{n+1}=Y \mid x_{n}=X)\]
for \(X, Y \in \{a, b, \ldots, z\}\)
The language model can be trained on a dataset of English text,
and used to determine whether a given cipher key yields text consistent with the English language.
It yields a very efficient algorithm to decode any Caesar code (even from a very small sample).
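A minimal sketch of this idea in Python (the corpus file name, the toy cipher string and the function names are illustrative, not the actual Stanford code): estimate the letter-transition matrix \(\pi\) from an English corpus, then score every possible Caesar key by how plausible its decryption is under that matrix.

```python
import string
from collections import defaultdict
from math import log

ALPHABET = string.ascii_lowercase

def transition_log_probs(corpus: str):
    """Estimate log P(next letter | current letter) from a text corpus (add-one smoothing)."""
    counts = defaultdict(lambda: defaultdict(lambda: 1))
    letters = [c for c in corpus.lower() if c in ALPHABET]
    for a, b in zip(letters, letters[1:]):
        counts[a][b] += 1
    logp = {}
    for a in ALPHABET:
        total = sum(counts[a][b] for b in ALPHABET)
        logp[a] = {b: log(counts[a][b] / total) for b in ALPHABET}
    return logp

def plausibility(text: str, logp) -> float:
    """Log-likelihood of a string under the transition model: higher = more English-like."""
    letters = [c for c in text.lower() if c in ALPHABET]
    return sum(logp[a][b] for a, b in zip(letters, letters[1:]))

def caesar_decode(cipher: str, shift: int) -> str:
    return "".join(
        ALPHABET[(ALPHABET.index(c) - shift) % 26] if c in ALPHABET else c
        for c in cipher.lower()
    )

# Placeholder corpus file: any large English text works.
logp = transition_log_probs(open("english_corpus.txt").read())
cipher = "wkh vwulnh frqwlqxhv"        # toy cipher, shift 3 of "the strike continues"
best = max(range(26), key=lambda s: plausibility(caesar_decode(cipher, s), logp))
print(best, caesar_decode(cipher, best))
```

For a Caesar code the 26 keys can simply be scored exhaustively; the MCMC machinery becomes necessary for general substitution ciphers, where the key space is far too large to enumerate.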
Markov chains can also be used to generate text:
I think therefore I
I think therefore I I I I I I
Not good, but promising (🤷)
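For illustration, a toy sampler in the same spirit (the one-line corpus and the function names are made up): each next character is drawn from the characters observed after the current one, proportionally to their frequency.

```python
import random
from collections import defaultdict

def train_bigram(corpus: str):
    """nxt[c] is the list of characters observed right after c in the corpus."""
    nxt = defaultdict(list)
    for a, b in zip(corpus, corpus[1:]):
        nxt[a].append(b)
    return nxt

def generate(nxt, prompt: str, length: int = 40) -> str:
    out = prompt
    for _ in range(length):
        candidates = nxt.get(out[-1])
        if not candidates:                    # dead end: no observed successor
            break
        out += random.choice(candidates)      # sample proportionally to observed frequency
    return out

corpus = "i think therefore i am . i think , therefore i think that i am ."
print(generate(train_bigram(corpus), "i think therefore i"))
```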
Going further
…fore I → ???
An example using MCMC
He ha ‘s kill’d me Mother , Run away I pray you Oh this is Counter you false Danish Dogges .
Can we augment memory?
With an alphabet of 26 letters, after 50 letters of context you need to take into account \(26^{50} \approx 5.6 \times 10^{70}\) combinations!
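Where that number comes from:
\[26^{50} = 10^{50\,\log_{10} 26} \approx 10^{70.75} \approx 5.6 \times 10^{70}\]
so the full table of transition probabilities for a 50-letter context is impossible to store, let alone estimate.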
wjai dfni
Despite the constant negative press covfefe 🤔

\[\forall X, P(x_n=X| x_{n-1}, ..., x_{n-k}) = \varphi^{NL}( x_{n-1}, ..., x_{n-k}; \theta )\]
with a vector of parameters \(\theta\) much smaller than the full table of transition probabilities
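A minimal PyTorch sketch of what such a \(\varphi^{NL}\) could look like (the feed-forward architecture, the sizes and the class name are illustrative assumptions, not a specific published model): the \(k\) previous letters are embedded, concatenated, and mapped to a probability distribution over the next letter.

```python
import torch
import torch.nn as nn

class CharNextLetterModel(nn.Module):
    """phi^NL: maps the previous k letters to P(x_n = X | x_{n-1}, ..., x_{n-k})."""
    def __init__(self, vocab_size: int = 26, k: int = 8, emb_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # one vector per letter
        self.mlp = nn.Sequential(
            nn.Linear(k * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),                    # one logit per possible next letter
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, k) integer codes of the k previous letters
        e = self.embed(context)                               # (batch, k, emb_dim)
        logits = self.mlp(e.flatten(start_dim=1))             # (batch, vocab_size)
        return logits.softmax(dim=-1)                         # probabilities over the next letter

# The whole table of 26^k transition probabilities is replaced by the parameters theta of this net.
model = CharNextLetterModel()
context = torch.randint(0, 26, (1, 8))                        # a random 8-letter context, for illustration
print(model(context).shape)                                   # torch.Size([1, 26])
```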
In 2015

speech recognition
LSTM behind “Google Translate”, “Alexa”, …
The Transformer: a special kind of encoder/decoder architecture.
The most successful family of models since 2017.

Take some data \((x_n) \in \mathbb{R}^x\).
Consider two functions: an encoder \(\varphi^E(\,\cdot\,; \theta^E)\) and a decoder \(\varphi^D(\,\cdot\,; \theta^D)\).
What could possibly be the value of training the coefficients with
\[\min_{\theta^E, \theta^D} \sum_n \left( \varphi^D( \varphi^E(x_n; \theta^E); \theta^D) - x_n\right)^2 \ ?\]
i.e., train the nets \(\varphi^E\) and \(\varphi^D\) to predict the “data from the data”? (this is called autoencoding)
The relation \(\varphi^D( \varphi^E(x_n; \theta^E); \theta^D) \approx x_n\) can be rewritten as
\[x_n \xrightarrow{\;\varphi^E(\,\cdot\,;\, \theta^E)\;} h \xrightarrow{\;\varphi^D(\,\cdot\,;\, \theta^D)\;} x_n \]
When that relation is (mostly) satisfied and \(\dim \mathbb{R}^h \ll \dim \mathbb{R}^x\), \(h\) can be viewed as a lower-dimensional representation of \(x\): it encodes the information of \(x\) in a lower-dimensional vector \(h\), called a learned embedding.
This very powerful approach can be applied to combine encoders/decoders from different contexts (e.g. DALL-E).
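A minimal autoencoding sketch in PyTorch, assuming simple fully connected nets and made-up dimensions (the sizes and variable names are illustrative): \(\varphi^E\) compresses \(x\) into a low-dimensional \(h\), \(\varphi^D\) reconstructs \(x\) from \(h\), and both are trained on the reconstruction error of the formula above.

```python
import torch
import torch.nn as nn

x_dim, h_dim = 784, 32          # illustrative sizes: dim(h) much smaller than dim(x)

encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, h_dim))   # phi^E(.; theta^E)
decoder = nn.Sequential(nn.Linear(h_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))   # phi^D(.; theta^D)

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()          # the squared reconstruction error

x = torch.randn(256, x_dim)     # stand-in data; in practice a real dataset (images, text vectors, ...)
for step in range(100):
    h = encoder(x)              # learned embedding: a lower-dimensional representation of x
    x_hat = decoder(h)          # reconstruction of x from h
    loss = loss_fn(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```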
Main flaw of the recurrent approach: all the information about earlier words has to be squeezed through a single hidden state, passed along one step at a time.
With the attention mechanism, each predicted word/embedding is determined by all the preceding words/embeddings, with different weights that are themselves computed from the data (they are “endogenous”).
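A minimal sketch of (causal) scaled dot-product self-attention, the computation behind this mechanism, assuming a single attention head and illustrative sizes:

```python
import torch

def causal_self_attention(X: torch.Tensor, Wq: torch.Tensor, Wk: torch.Tensor, Wv: torch.Tensor) -> torch.Tensor:
    """X: (seq_len, d) word embeddings; returns one updated embedding per position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                        # queries, keys, values
    scores = Q @ K.T / K.shape[-1] ** 0.5                   # how strongly each word attends to each other word
    mask = torch.tril(torch.ones_like(scores)).bool()       # only look at preceding (and current) words
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = scores.softmax(dim=-1)                        # the "endogenous" weights: computed from the embeddings
    return weights @ V                                      # each output is a weighted mix of the preceding words

d = 16
X = torch.randn(10, d)                                      # 10 word embeddings, for illustration
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)           # torch.Size([10, 16])
```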
