Intro to Large Language Models

Data-Based Economics, ESCP, 2024-2025

Pablo Winant

2025-04-01

Language-Based AI

Do you like poetry?

A rose is a rose is a rose

Gertrude Stein

Brexit means Brexit means Brexit

John Crace

Elementary, my dear Watson

P.G. Wodehouse


There is an easy way for Europe to respond to the trade war started by Donald Trump


If we could complete any sentence… would we be able to solve any problem?


⮕ We are witnessing the advent of language-based AI.

Complete Text

All generative language models so far perform text completion.

They generate plausible text following a prompt.

The type of answer will depend on the kind of prompt.

GPT Playground

To use AI effectively, you have to experiment with the prompt.

It is similar to learning how to write good Google queries:

  • altavista: +noir +film -"pinot noir"
  • nowadays: ???

“Prompt engineering” is becoming a discipline in itself…

Some Examples

By providing enough context, it is possible to perform amazing tasks

Look at the demos

Even chat interfaces have a hidden prompt, as illustrated below.
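For instance, a chat interface might silently prepend a system message to every conversation. A minimal sketch in Python, with purely illustrative wording (the real hidden prompts are not public):

    # hypothetical hidden prompt prepended by a chat interface
    messages = [
        {"role": "system",
         "content": "You are a helpful assistant. Answer concisely and politely."},
        {"role": "user",
         "content": "Do you like poetry?"},  # what the user actually typed
    ]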

What is a language model?

Language Models and Cryptography

The Caesar cipher

Zodiac 408 Cipher

Figure 1: Key for the Zodiac 408 cipher, solved in a week by Bettye and Donald Harden using frequency tables.

Later, in 2001, in a prison somewhere in California…

Solved by Stanford’s Persi Diaconis and his students using Markov chain Monte Carlo

Markov Chain Monte Carlo

Take a letter \(x_n\): what is the probability that the next letter is \(x_{n+1}\)?

\[\pi_{X,Y} = P(x_{n+1}=Y | x_{n}=X)\]

for \(X \in \{a, b, \ldots, z\}\), \(Y \in \{a, b, \ldots, z\}\)

This language model can be trained on a dataset of English text.

It can then be used to determine whether a given cipher key yields text consistent with the English language.

It yields a very efficient algorithm to decode any Caesar cipher (even with a very small sample), as sketched below.
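A minimal sketch of the idea in Python, assuming some English training corpus is available as a string: estimate the bigram table \(\pi_{X,Y}\) by counting, then score a candidate decryption by its log-likelihood (an MCMC search over cipher keys would accept moves that raise this score):

    from math import log
    from string import ascii_lowercase

    def train_bigram(corpus):
        # counts[X][Y] starts at 1 (add-one smoothing, avoids log(0))
        counts = {X: {Y: 1 for Y in ascii_lowercase} for X in ascii_lowercase}
        text = [c for c in corpus.lower() if c in ascii_lowercase]
        for X, Y in zip(text, text[1:]):
            counts[X][Y] += 1
        # normalize each row: pi[X][Y] estimates P(x_{n+1}=Y | x_n=X)
        return {X: {Y: n / sum(counts[X].values())
                    for Y, n in counts[X].items()}
                for X in ascii_lowercase}

    def score(text, pi):
        # log-likelihood of a candidate decryption under the English model
        chars = [c for c in text.lower() if c in ascii_lowercase]
        return sum(log(pi[X][Y]) for X, Y in zip(chars, chars[1:]))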

MCMC to generate text

MCMCs can also be used to generate text:

  • take an initial prompt: I think therefore I
    • the last letter is I
    • the most plausible character afterwards is a space
    • the most plausible character after a space is I
  • Result: I think therefore I I I I I I

Not good but promising (🤷)


Going further

  • augment the memory: condition on several characters (see the sketch below)
    • e.g. what follows fore I?
  • change the basic unit (use phonemes or words)
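A minimal sketch of the “augmented memory” idea, assuming a training corpus is available as a string: condition on the last k characters instead of the last one:

    import random

    def generate(corpus, k=6, n=200, seed="I think therefore I"):
        # record which character follows each window of k characters
        table = {}
        for i in range(len(corpus) - k):
            table.setdefault(corpus[i:i + k], []).append(corpus[i + k])
        out = seed
        for _ in range(n):
            followers = table.get(out[-k:])
            if not followers:
                break  # context never seen in the corpus
            out += random.choice(followers)  # sample the next character
        return out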

An example using MCMC

  • using words and 3 states: “He ha ‘s kill’d me Mother , Run away I pray you Oh this is Counter you false Danish Dogges .”

Big MCMC

Can we augment memory?

  • if you want to compute the most frequent letter (among 26) after 50 letters, you need to take into account \(26^{50} \approx 5.6 \times 10^{70}\) combinations!
    • impossible to store, let alone do the training
  • but some combinations are useless:
    • wjai dfni
    • Despite the constant negative press covfefe 🤔

Neural Networks

Neural Network
  • Neural networks make it possible to increase the size of the state space being represented:

\[\forall X, P(x_n=X| x_{n-1}, ..., x_{n-k}) = \varphi^{NL}( x_{n-1}, ..., x_{n-k}; \theta )\]

with a smaller vector of parameters \(\theta\)

  • Neural networks endogenously reduce the dimensionality.
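A minimal sketch of such a parameterization in PyTorch (an illustrative MLP, not the architecture of any actual model): the network maps the last \(k\) one-hot-encoded characters to a distribution over the next character, with far fewer parameters \(\theta\) than the \(26^k\) entries of a full frequency table:

    import torch
    import torch.nn as nn

    k, vocab, hidden = 8, 26, 128

    # phi(x_{n-1}, ..., x_{n-k}; theta)
    phi = nn.Sequential(
        nn.Flatten(),              # (batch, k, vocab) -> (batch, k * vocab)
        nn.Linear(k * vocab, hidden),
        nn.ReLU(),
        nn.Linear(hidden, vocab),  # one logit per possible next character
    )

    x = torch.zeros(1, k, vocab)           # a one-hot-encoded context window
    probs = torch.softmax(phi(x), dim=-1)  # P(x_n = X | last k characters)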

Recurrent Neural Networks

In 2015

  • neural networks reduce the dimensionality of the data by discovering its structure
  • a hidden state encodes the meaning of the text so far

Long Short Term Memory

  • 2000–2019: emergence of Long Short-Term Memory (LSTM) models
    • speech recognition

    • LSTMs were behind “Google Translate”, “Alexa”, …

The Rise of transformers

A special kind of encoder/decoder architecture.

Most successful models since 2017

  1. Position Encodings
    • the model is not sequential anymore
    • it tries to learn the sequence order instead
  2. Attention
  3. Self-Attention


Encoders / Decoders (1/3)

Take some data \((x_n) \in \mathbb{R}^x\).

Consider two functions:

  • an encoder \[\varphi^E(x; \theta^E) = h \in \mathbb{R}^h\]
  • a decoder: \[\varphi^D(h; \theta^D) = x' \in \mathbb{R}^x\]

Train the coefficients with:

\[\min_{\theta^E, \theta^D} \sum_n \left( \varphi^D( \varphi^E(x_n; \theta^E), \theta^D) - x_n\right)^2\]

i.e. train the networks \(\varphi^D\) and \(\varphi^E\) to predict the “data from the data” (this is called autoencoding)
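A minimal autoencoder sketch in PyTorch, with purely illustrative dimensions and linear maps standing in for \(\varphi^E\) and \(\varphi^D\):

    import torch
    import torch.nn as nn

    dim_x, dim_h = 1000, 32                     # dim_h << dim_x

    encoder = nn.Linear(dim_x, dim_h)           # phi^E(.; theta^E)
    decoder = nn.Linear(dim_h, dim_x)           # phi^D(.; theta^D)
    opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])

    def training_step(x_batch):
        h = encoder(x_batch)                    # low-dimensional representation
        x_rec = decoder(h)                      # reconstruction
        loss = ((x_rec - x_batch) ** 2).mean()  # predict the data from the data
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()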

Encoders / Decoders (2/3)

The relation \(\varphi^D( \varphi^E(x_n; \theta^E), \theta^D) \approx x_n\) can be rewritten as

\[x_n \xrightarrow{\varphi^E(; \theta^E)} h \xrightarrow{\varphi^D(; \theta^D)} x_n \]

When that relation is (mostly) satisfied and \(\dim \mathbb{R}^h \ll \dim \mathbb{R}^x\), \(h\) can be viewed as a lower-dimensional representation of \(x\). It encodes the information as a lower-dimensional vector \(h\), which is called a learned embedding.

In particular, words have a vector representation in this space!

Encoders / Decoders (3/3)

  • instead of \(\underbrace{x_n}_{\text{prompt}} \rightarrow \underbrace{y_n}_{\text{text completion}}\)
  • one can learn \(\underbrace{h_n}_{\text{prompt (low dim)}} \xrightarrow{\varphi^C( ; \theta^C)} \underbrace{h_n^c}_{\text{text completion (low dim)}}\)
    • it is easier to learn
  • and perform the original task as \[\underbrace{x_n}_{\text{prompt}} \xrightarrow{\varphi^E} h_n \xrightarrow{\varphi^C} h_n^C \xrightarrow{\varphi^D} \underbrace{y_n}_{\text{text completion}}\]

This very powerful approach can be used to combine encoders/decoders from different contexts (e.g. DALL-E).
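Schematically, with hypothetical function names (any trained encoder, completion model and decoder would do):

    # sketch of the three-stage pipeline
    def complete(x, phi_E, phi_C, phi_D):
        h = phi_E(x)       # encode the prompt into a low-dimensional vector
        h_c = phi_C(h)     # predict the completion in embedding space
        return phi_D(h_c)  # decode back to text space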

Attention

Main flaw of the recurrent approach:

  • the context used to predict the next word/embedding puts a lower weight on words/embeddings further back
  • this is related to the so-called vanishing gradient problem

With the attention mechanism, each predicted word/embedding is determined by all preceding words/embeddings, with different weights that are computed endogenously.
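A minimal sketch of (single-head, unmasked) self-attention with NumPy; the projection matrices Wq, Wk, Wv stand in for learned parameters, and real models add masking, multiple heads, etc.:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (n_tokens, d) embeddings; Wq, Wk, Wv: learned projections
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[1])         # pairwise relevance
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
        # each output mixes ALL inputs, with data-dependent weights
        return weights @ V

    d = 16
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, d))                        # five token embeddings
    out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))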


Quick summary

  • Short history of language models:
    • frequency tables
    • Markov chain Monte Carlo
    • deep learning → recurrent neural networks
    • long short-term memory (>2000)
    • encoders/decoders
    • transformers (>2017)
  • Since 2010, the main breakthroughs have come from the development of deep-learning techniques (software/hardware)
  • Recently, models/algorithms have improved tremendously

Variants of GPT

GPT

Most famous engine, developed by OpenAI: the Generative Pre-trained Transformer (aka GPT)

  • GPT-1 (2018):
    • 0.1 billion parameters
    • had to be fine-tuned for a particular problem
    • transfer learning (few-shot learning)
  • GPT-2:
    • multitask
    • no mandatory fine-tuning
  • GPT-3:
    • bigger: 175 billion parameters
  • GPT-4:
    • even bigger: 1,000 billion parameters???
    • on your hard drive: ~1 TB

Corpus

GPT-3 was trained on

⇒ 45 TB of data

  • curated into a smaller dataset

⇒ size ???

Dataset (mostly) ends in 2021.

How is the model trained?

Several concepts are relevant here:

  • unsupervised learning

    • autoencoding
    • ⇒ build a representation of the text
  • fine tuning

  • reinforcement learning

What is learning?

A machine can perform a task \(f(x; \theta)\) for some input \(x\) drawn from a data-generating process \(\mathcal{X}\) and some parameters \(\theta\).

A typical learning task consists in optimizing a loss function (aka theoretical risk): \[\min_{\theta} \mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{X}} f(x; \theta)\]

The central learning method to minimize the objective is called stochastic gradient descent.
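A minimal NumPy sketch of stochastic gradient descent on a toy problem (fitting a mean, with \(f(x; \theta) = (x - \theta)^2\)):

    import numpy as np

    def sgd(grad_f, theta, data, lr=0.01, epochs=100):
        # update theta with the gradient of the loss at ONE point at a time
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for x in rng.permutation(data):
                theta = theta - lr * grad_f(x, theta)
        return theta

    data = np.array([1.0, 2.0, 3.0, 4.0])
    # f(x; theta) = (x - theta)^2, so grad_f = 2 * (theta - x)
    theta_hat = sgd(lambda x, t: 2 * (t - x), 0.0, data)  # converges near 2.5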


Learning Set

In practice one has access to a dataset \((x_n) \subset \mathcal{X}\) and minimizes the “empirical” risk function

\[L\left( (x_n)_{n=1:N}, \theta \right) = \frac{1}{N} \sum_{n=1}^N f(x_n; \theta)\]

Regular case: the dataset is assumed to be generated by the true model (the data-generating process)

Two important variants:

  • transfer learning:
    • the goal is to use the model on \(\mathcal{X}\), but the training dataset is generated from another data-generating process \(\mathcal{Y}\)
    • \(\mathcal{Y}\) can be a subset of \(\mathcal{X}\) or (partially) disjoint from it
    • do you need some data from \(\mathcal{Y}\) (few-shot learning) or none at all (zero-shot learning)?
  • reinforcement learning
    • the learning algorithm can generate some data to improve learning

Transfer learning

  • GPT is inherently a transfer learning machine
    • why?
  • earlier versions (GPT-1, GPT-2) needed some examples before being able to perform any given task:
    • fine-tuning: retrain some coefficients of the whole NN
  • new versions (>GPT-3) can perform zero-shot tasks just by text completion
    • fine-tuning can be emulated by prompting
    • there is still a fine-tuning API

Reinforcement Learning

A reinforcement learning algorithm can take actions which have two effects:

  • provide some reward to the algorithm
  • generate (more) data to improve the quality of future actions


Reinforcement Learning for GPT-4

The GPT-4 model has been fine-tuned with reinforcement learning: the language model was rewarded for providing the right kind of answer.

  • the feedback came from Kenyan workers (sic!)

Two main variants on top of the foundation model GPT Base:

  • InstructGPT
    • alignment, non-toxicity, …
    • factual correctness
  • ChatGPT
    • follow a conversation
    • organization of answer
    • not just a context on top of GPT

There is information about how GPT-3 was trained (check the technical paper or a summary)

Many models are available

Which of the following models should you use?

Lots of GPT variants:

  • o3-mini
  • gpt-4o
  • gpt-4-turbo-preview
  • text-ada-001

And a lot of competitors:

  • mistral, deepseek, llama, …
  • all with different subvariants
  • They usually provide an openai-compatible API (you can use the openai client library), as sketched below
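A minimal sketch of such an API call with the openai Python client; the base URL, API key and model name below are placeholders to adapt to the provider you actually use:

    from openai import OpenAI

    # point the client at the provider's openai-compatible endpoint
    client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

    response = client.chat.completions.create(
        model="gpt-4o",  # or a mistral / deepseek / llama model id
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Complete: A rose is a rose is a"},
        ],
    )
    print(response.choices[0].message.content)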

The different variants of GPT

What are the differences between the various engines?

  • architecture
  • model size
    • full size
    • quantized model size
  • training set of foundation model (GPT Base)
  • type of fine-tuning (instruct/chat/code)
  • type of interface

Check out the awesome list!

What are the trends?

  • many foundation models
    • a move from open source to closed source
    • but: open source is still very much alive
    • open source is gaining momentum again
      • llama, mistral, deepseek have open-source specifications
      • open weights
  • research to reduce size of models / training time
    • quantized versions
  • many more versions specialized (fine-tuned) to specific tasks
    • example: the various *-code models

Conclusion

One common misconception

  • Language models hallucinate facts…
    • … and are therefore definitely unreliable
  • There are possible workarounds:
    • avoid tasks where hallucinations occur
      • ex: asking for paper citations
    • add more structure to the prompting
      • ex: ask the model to detail its reasoning (aka chain-of-thought prompting; see the example below)
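A hypothetical illustration of the difference, as two prompt strings:

    # plain prompt: the model may answer directly, and wrongly
    naive_prompt = "A shirt costs 97 euros after a 3% discount. What was the original price?"

    # chain-of-thought prompt: ask the model to detail its reasoning first
    cot_prompt = naive_prompt + " Detail your reasoning step by step before giving the final answer."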

It’s all ongoing

  • And research is being done…
    • on using fine-tuning for more correctness (e.g. InstructGPT)
    • on using specific context (Retrieval Augmented Generation)
      • ex: GPTs
    • on developing mixed systems
  • At the same time, lots of products get designed
    • psst: under the hood, they make API calls…
    • … and provide interfaces to other APIs
  • Don’t be fooled by the marketing:
    • do it yourself !