The transformer was introduced in 2017 in "Attention Is All You Need." It became the foundation for GPT, BERT, and virtually every modern language model.

The problem with RNNs

Before transformers, recurrent sequence models processed text left-to-right, one token at a time. Because information about earlier words had to be carried through every intermediate step, it was hard to relate words that are far apart in a sentence.
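A minimal sketch of that sequential bottleneck, using a toy recurrence (illustrative only, not any particular library's RNN): each token updates a single hidden state, so context from early tokens must survive every later update to matter at the end.

```python
import numpy as np

def rnn_forward(token_embeddings, W_h, W_x):
    # Strictly left-to-right: one hidden-state update per token.
    hidden = np.zeros(W_h.shape[0])
    for x in token_embeddings:
        hidden = np.tanh(W_h @ hidden + W_x @ x)
    # All long-range context has been squeezed through this single vector.
    return hidden

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))        # 6 tokens, embedding size 8
W_h = rng.standard_normal((16, 16)) * 0.1   # hidden-to-hidden weights
W_x = rng.standard_normal((16, 8)) * 0.1    # input-to-hidden weights
print(rnn_forward(tokens, W_h, W_x).shape)  # (16,)
```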

Attention: relating every word to every other word

For every token, attention computes a weighted sum over all tokens in the sequence, with weights determined by how relevant each one is to the token in question. This lets the model directly connect words regardless of the distance between them.
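A minimal NumPy sketch of scaled dot-product attention, the core operation from "Attention Is All You Need". This version omits the learned query/key/value projections and multi-head structure; the shapes and data are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity of every query token with every key token,
    # scaled by sqrt of the key dimension to keep scores well-behaved.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax turns each row of scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted sum of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))              # 5 tokens, model dimension 8
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)                             # (5, 8)
```

Because every token attends to every other token in a single step, the path between two distant words is constant-length, rather than growing with their separation as in the recurrent sketch above.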