Intro to Large Language Models

Data-Based Economics, ESCP, 2024-2025

Pablo Winant

2025-04-01

Language-Based AI

Do you like poetry?

A rose is a rose is a rose

Gertrude Stein

Brexit means Brexit means Brexit

John Crace

Elementary, my dear Watson

P.G. Wodehouse


There is an easy way for Europe to respond to the trade war started by Donald Trump


If we could complete any sentence… would we be able to solve any problem?


⮕ We are witnessing the advent of language-based AI.

Complete Text

All generative language models so far perform text completion.

They generate plausible text following a prompt.

The type of answer will depend on the kind of prompt.

GPT Playground

To use AI effectively, you have to experiment with the prompt.

It is similar to learning how to write good Google queries:

  • altavista: +noir +film -"pinot noir"
  • nowadays: ???

“Prompt engineering” is becoming a discipline in itself…

Some Examples

By providing enough context, it is possible to perform amazing tasks

Look at the demos

Even chat interfaces have a hidden prompt, as illustrated below.
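For instance, a chat interface might silently prepend a system message to every conversation. A minimal sketch in Python, with purely illustrative wording (the real hidden prompts are not public):

    # hypothetical hidden prompt prepended by a chat interface
    messages = [
        {"role": "system",
         "content": "You are a helpful assistant. Answer concisely and politely."},
        {"role": "user",
         "content": "Do you like poetry?"},  # what the user actually typed
    ]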

What is a language model?

Language Models and Cryptography

The Caesar cipher

Zodiac 408 Cipher

Figure 1: Key for the Zodiac 408 cipher, solved in a week by Bettye and Donald Harden using frequency tables.

Later, in 2001, in a prison somewhere in California…

Solved by Stanford’s Persi Diaconis and his students using Markov chain Monte Carlo

Markov Chain Monte Carlo

Take a letter \(x_n\): what is the probability that the next letter is \(x_{n+1}\)?

\[\pi_{X,Y} = P(x_{n+1}=Y | x_{n}=X)\]

for \(X \in \{a, b, \ldots, z\}\), \(Y \in \{a, b, \ldots, z\}\)

This language model can be trained on a dataset of English text.

It can then be used to determine whether a given cipher key yields text consistent with the English language.

It yields a very efficient algorithm to decode any Caesar cipher (even with a very small sample), as sketched below.
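A minimal sketch of the idea in Python, assuming some English training corpus is available as a string: estimate the bigram table \(\pi_{X,Y}\) by counting, then score a candidate decryption by its log-likelihood (an MCMC search over cipher keys would accept moves that raise this score):

    from math import log
    from string import ascii_lowercase

    def train_bigram(corpus):
        # counts[X][Y] starts at 1 (add-one smoothing, avoids log(0))
        counts = {X: {Y: 1 for Y in ascii_lowercase} for X in ascii_lowercase}
        text = [c for c in corpus.lower() if c in ascii_lowercase]
        for X, Y in zip(text, text[1:]):
            counts[X][Y] += 1
        # normalize each row: pi[X][Y] estimates P(x_{n+1}=Y | x_n=X)
        return {X: {Y: n / sum(counts[X].values())
                    for Y, n in counts[X].items()}
                for X in ascii_lowercase}

    def score(text, pi):
        # log-likelihood of a candidate decryption under the English model
        chars = [c for c in text.lower() if c in ascii_lowercase]
        return sum(log(pi[X][Y]) for X, Y in zip(chars, chars[1:]))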

MCMC to generate text

MCMCs can also be used to generate text:

  • take an initial prompt: I think therefore I
    • the last letter is I
    • the most plausible character afterwards is a space
    • the most plausible character after a space is I
  • Result: I think therefore I I I I I I

Not good but promising (🤷)


Going further

  • augment the memory: condition on several characters (see the sketch below)
    • e.g. what follows fore I?
  • change the basic unit (use phonemes or words)
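A minimal sketch of the “augmented memory” idea, assuming a training corpus is available as a string: condition on the last k characters instead of the last one:

    import random

    def generate(corpus, k=6, n=200, seed="I think therefore I"):
        # record which character follows each window of k characters
        table = {}
        for i in range(len(corpus) - k):
            table.setdefault(corpus[i:i + k], []).append(corpus[i + k])
        out = seed
        for _ in range(n):
            followers = table.get(out[-k:])
            if not followers:
                break  # context never seen in the corpus
            out += random.choice(followers)  # sample the next character
        return out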

An example using MCMC

  • using words and 3 states: “He ha ‘s kill’d me Mother , Run away I pray you Oh this is Counter you false Danish Dogges .”

Big MCMC

Can we augment memory?

  • if you want to compute the most frequent letter (among 26) after 50 letters, you need to take into account \(26^{50} \approx 5.6 \times 10^{70}\) combinations!
    • impossible to store, let alone do the training
  • but some combinations are useless:
    • wjai dfni
    • Despite the constant negative press covfefe 🤔

Neural Networks

Neural Network
  • Neural networks make it possible to increase the size of the state space being represented:

\[\forall X, P(x_n=X| x_{n-1}, ..., x_{n-k}) = \varphi^{NL}( x_{n-1}, ..., x_{n-k}; \theta )\]

with a smaller vector of parameters \(\theta\)

  • Neural networks endogenously reduce the dimensionality.
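A minimal sketch of such a parameterization in PyTorch (an illustrative MLP, not the architecture of any actual model): the network maps the last \(k\) one-hot-encoded characters to a distribution over the next character, with far fewer parameters \(\theta\) than the \(26^k\) entries of a full frequency table:

    import torch
    import torch.nn as nn

    k, vocab, hidden = 8, 26, 128

    # phi(x_{n-1}, ..., x_{n-k}; theta)
    phi = nn.Sequential(
        nn.Flatten(),              # (batch, k, vocab) -> (batch, k * vocab)
        nn.Linear(k * vocab, hidden),
        nn.ReLU(),
        nn.Linear(hidden, vocab),  # one logit per possible next character
    )

    x = torch.zeros(1, k, vocab)           # a one-hot-encoded context window
    probs = torch.softmax(phi(x), dim=-1)  # P(x_n = X | last k characters)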

Recurrent Neural Networks

In 2015

  • neural networks reduce the dimensionality of the data by discovering its structure
  • a hidden state encodes the meaning of the text so far

Long Short Term Memory

  • 2000–2019: emergence of Long Short-Term Memory (LSTM) models
    • speech recognition

    • LSTMs were behind “Google Translate”, “Alexa”, …

The Rise of transformers

A special kind of encoder/decoder architecture.

Most successful models since 2017

  1. Position Encodings
    • the model is not sequential anymore
    • it tries to learn the sequence order instead
  2. Attention
  3. Self-Attention


Encoders / Decoders (1/3)

Take some data \((x_n) \in \mathbb{R}^x\).

Consider two functions:

  • an encoder \[\varphi^E(x; \theta^E) = h \in \mathbb{R}^h\]
  • a decoder: \[\varphi^D(h; \theta^D) = x' \in \mathbb{R}^x\]

Train the coefficients with:

\[\min_{\theta^E, \theta^D} \sum_n \left( \varphi^D( \varphi^E(x_n; \theta^E), \theta^D) - x_n\right)^2\]

i.e. train the networks \(\varphi^D\) and \(\varphi^E\) to predict the “data from the data” (this is called autoencoding)
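A minimal autoencoder sketch in PyTorch, with purely illustrative dimensions and linear maps standing in for \(\varphi^E\) and \(\varphi^D\):

    import torch
    import torch.nn as nn

    dim_x, dim_h = 1000, 32                     # dim_h << dim_x

    encoder = nn.Linear(dim_x, dim_h)           # phi^E(.; theta^E)
    decoder = nn.Linear(dim_h, dim_x)           # phi^D(.; theta^D)
    opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])

    def training_step(x_batch):
        h = encoder(x_batch)                    # low-dimensional representation
        x_rec = decoder(h)                      # reconstruction
        loss = ((x_rec - x_batch) ** 2).mean()  # predict the data from the data
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()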

Encoders / Decoders (2/3)

The relation \(\varphi^D( \varphi^E(x_n; \theta^E), \theta^D) \approx x_n\) can be rewritten as

\[x_n \xrightarrow{\varphi^E(; \theta^E)} h \xrightarrow{\varphi^D(; \theta^D)} x_n \]

When that relation is (mostly) satisfied and \(\dim \mathbb{R}^h \ll \dim \mathbb{R}^x\), \(h\) can be viewed as a lower-dimensional representation of \(x\). It encodes the information as a lower-dimensional vector \(h\), which is called a learned embedding.

In particular, words have a vector representation in this space!

Encoders / Decoders (3/3)

  • instead of \(\underbrace{x_n}_{\text{prompt}} \rightarrow \underbrace{y_n}_{\text{text completion}}\)
  • one can learn \(\underbrace{h_n}_{\text{prompt (low dim)}} \xrightarrow{\varphi^C( ; \theta^C)} \underbrace{h_n^c}_{\text{text completion (low dim)}}\)
    • it is easier to learn
  • and perform the original task as \[\underbrace{x_n}_{\text{prompt}} \xrightarrow{\varphi^E} h_n \xrightarrow{\varphi^C} h_n^C \xrightarrow{\varphi^D} \underbrace{y_n}_{\text{text completion}}\]

This very powerful approach can be used to combine encoders/decoders from different contexts (e.g. DALL-E).
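Schematically, with hypothetical function names (any trained encoder, completion model and decoder would do):

    # sketch of the three-stage pipeline
    def complete(x, phi_E, phi_C, phi_D):
        h = phi_E(x)       # encode the prompt into a low-dimensional vector
        h_c = phi_C(h)     # predict the completion in embedding space
        return phi_D(h_c)  # decode back to text space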

Attention

Main flaw of the recurrent approach:

  • the context used to predict the next word/embedding puts a lower weight on words/embeddings further back
  • this is related to the so-called vanishing gradient problem

With the attention mechanism, each predicted word/embedding is determined by all preceding words/embeddings, with different weights that are computed endogenously.
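A minimal sketch of (single-head, unmasked) self-attention with NumPy; the projection matrices Wq, Wk, Wv stand in for learned parameters, and real models add masking, multiple heads, etc.:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (n_tokens, d) embeddings; Wq, Wk, Wv: learned projections
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[1])         # pairwise relevance
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
        # each output mixes ALL inputs, with data-dependent weights
        return weights @ V

    d = 16
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, d))                        # five token embeddings
    out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))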


Quick summary

  • Short history of language models:
    • frequency tables
    • Markov chain Monte Carlo
    • deep learning → recurrent neural networks
    • long short-term memory (>2000)
    • encoders/decoders
    • transformers (>2017)
  • Since 2010, the main breakthroughs have come from the development of deep-learning techniques (software/hardware)
  • Recently, models/algorithms have improved tremendously

Variants of GPT

GPT

Most famous engine, developed by OpenAI: the Generative Pre-trained Transformer (aka GPT)

  • GPT-1 (2018):
    • 0.1 billion parameters
    • had to be fine-tuned for a particular problem
    • transfer learning (few-shot learning)
  • GPT-2:
    • multitask
    • no mandatory fine-tuning
  • GPT-3:
    • bigger: 175 billion parameters
  • GPT-4:
    • even bigger: 1,000 billion parameters???
    • on your hard drive: ~1 TB

Corpus

GPT-3 was trained on

⇒ 45 TB of data

  • curated into a smaller dataset

⇒ size ???

Dataset (mostly) ends in 2021.

How is the model trained?

Several concepts are relevant here:

  • unsupervised learning

    • autoencoding
    • ⇒ build a representation of the text
  • fine tuning

  • reinforcement learning

What is learning?

A machine can perform a task \(f(x; \theta)\) for some input \(x\) drawn from a data-generating process \(\mathcal{X}\) and some parameters \(\theta\).

A typical learning task consists in optimizing a loss function (aka theoretical risk): \[\min_{\theta} \mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{X}} f(x; \theta)\]

The central learning method to minimize the objective is called stochastic gradient descent.
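A minimal NumPy sketch of stochastic gradient descent on a toy problem (fitting a mean, with \(f(x; \theta) = (x - \theta)^2\)):

    import numpy as np

    def sgd(grad_f, theta, data, lr=0.01, epochs=100):
        # update theta with the gradient of the loss at ONE point at a time
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for x in rng.permutation(data):
                theta = theta - lr * grad_f(x, theta)
        return theta

    data = np.array([1.0, 2.0, 3.0, 4.0])
    # f(x; theta) = (x - theta)^2, so grad_f = 2 * (theta - x)
    theta_hat = sgd(lambda x, t: 2 * (t - x), 0.0, data)  # converges near 2.5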


Learning Set

In practice one has access to a dataset \((x_n) \subset \mathcal{X}\) and minimizes the “empirical” risk function

\[L\left( (x_n)_{n=1:N}, \theta \right) = \frac{1}{N} \sum_{n=1}^N f(x_n; \theta)\]

Regular case: the dataset is assumed to be generated by the true model (the data-generating process)

Two important variants:

  • transfer learning:
    • the goal is to use the model on \(\mathcal{X}\), but the training dataset is generated from another data-generating process \(\mathcal{Y}\)
    • \(\mathcal{Y}\) can be a subset of \(\mathcal{X}\) or (partially) disjoint from it
    • do you need some data from \(\mathcal{Y}\) (few-shot learning) or none at all (zero-shot learning)?
  • reinforcement learning
    • the learning algorithm can generate some data to improve learning

Transfer learning

  • GPT is inherently a transfer learning machine
    • why?
  • earlier versions (GPT-1, GPT-2) needed some examples before being able to perform any given task:
    • fine-tuning: retrain some coefficients of the whole NN
  • new versions (>GPT-3) can perform zero-shot tasks just by text completion
    • fine-tuning can be emulated by prompting
    • there is still a fine-tuning API

Reinforcement Learning

A reinforcement learning algorithm can take actions which have two effects:

  • provide some reward to the algorithm
  • generate (more) data to improve the quality of future actions


Reinforcement Learning for GPT-4

The GPT-4 model has been fine-tuned with reinforcement learning: the language model was rewarded for providing the right kind of answer.

  • the feedback came from Kenyan workers (sic!)

Two main variants on top of the foundation model GPT Base:

  • InstructGPT
    • alignment, non-toxicity, …
    • factual correctness
  • ChatGPT
    • follow a conversation
    • organization of answer
    • not just a context on top of GPT

There is information about how GPT-3 was trained (check the technical paper or a summary)

Many models are available

Which of the following models should you use?

Lots of GPT variants:

  • o3-mini
  • gpt-4o
  • gpt-4-turbo-preview
  • text-ada-001

And a lot of competitors:

  • mistral, deepseek, llama, …
  • all with different subvariants
  • They usually provide an openai-compatible API (you can use the openai client library), as sketched below
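A minimal sketch of such an API call with the openai Python client; the base URL, API key and model name below are placeholders to adapt to the provider you actually use:

    from openai import OpenAI

    # point the client at the provider's openai-compatible endpoint
    client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

    response = client.chat.completions.create(
        model="gpt-4o",  # or a mistral / deepseek / llama model id
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Complete: A rose is a rose is a"},
        ],
    )
    print(response.choices[0].message.content)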

The different variants of GPT

What are the differences between the various engines?

  • architecture
  • model size
    • full size
    • quantized model size
  • training set of foundation model (GPT Base)
  • type of fine-tuning (instruct/chat/code)
  • type of interface

Check out the awesome list!

What are the trends?

  • many foundation models
    • a move from open source to closed source
    • but: open source is still very much alive
    • open source is gaining momentum again
      • llama, mistral, deepseek have open-source specifications
      • open weights
  • research to reduce size of models / training time
    • quantized versions
  • many more versions specialized (fine-tuned) to specific tasks
    • example: the various *-code models

Conclusion

One common misconception

  • Language models hallucinate facts…
    • … and are therefore definitely unreliable
  • There are possible workarounds:
    • avoid tasks where hallucinations occur
      • ex: asking for paper citations
    • add more structure to the prompting
      • ex: ask the model to detail its reasoning (aka chain-of-thought prompting; see the example below)
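A hypothetical illustration of the difference, as two prompt strings:

    # plain prompt: the model may answer directly, and wrongly
    naive_prompt = "A shirt costs 97 euros after a 3% discount. What was the original price?"

    # chain-of-thought prompt: ask the model to detail its reasoning first
    cot_prompt = naive_prompt + " Detail your reasoning step by step before giving the final answer."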

It’s all ongoing

  • And research is being done…
    • on using fine-tuning for more correctness (e.g. InstructGPT)
    • on using specific context (Retrieval Augmented Generation)
      • ex: GPTs
    • on developing mixed systems
  • At the same time, lots of products get designed
    • psst: under the hood, they make API calls…
    • … and provide interfaces to other APIs
  • Don’t be fooled by the marketing:
    • do it yourself !