Merged
2 changes: 1 addition & 1 deletion episodes/01-introduction.md
@@ -112,7 +112,7 @@

The term LLM is now often (and wrongly) used as a synonym for Artificial Intelligence. We could therefore think that today we just need to learn how to manipulate LLMs in order to fulfill our research goals involving textual data. The truth is that Language Modeling has always been one of the core tasks of NLP; by learning NLP, you will therefore better understand where the main ideas behind LLMs come from.

![NLP is an interdisciplinary field, and LLMs are just a subset of it](fig/intro0_cs_nlp.png)

Check warning on line 115 in episodes/01-introduction.md (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: fig/intro0_cs_nlp.png

LLM is a blanket term for an assembly of large neural networks that are trained on vast amounts of text data with the objective of optimising for language modelling. Generative models are optimised to output human-like text, but can also be used to perform other tasks. Indeed, the surprising and fascinating properties that emerge from training models at this scale allow us to solve different complex tasks such as answering elaborate questions, translating languages, solving complex problems, generating narratives that emulate reasoning, and many more. All of this with a single tool.

@@ -125,13 +125,13 @@

Let's go back to our problem of segmenting text and see what ChatGPT has to say about tokenizing Chinese text:

![ChatGPT Just Works! Does it...?](fig/intro1.png)

Check warning on line 128 in episodes/01-introduction.md (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: fig/intro1.png

We got what sounds like a straightforward, confident answer. However, it is not clear how the model arrived at this solution, and we do not know whether the solution is correct. In this case ChatGPT made some assumptions for us, such as choosing a specific kind of tokenizer to give the answer, and since we do not speak the language, we do not know if this is indeed the best approach to tokenizing Chinese text. If we understand the concept of a token (which we will today!), then we can be more informed about the quality of the answer and whether it is useful to us, and therefore make better use of the model.

And by the way, ChatGPT was **almost** correct: in the specific case of the gpt-4 tokenizer, the model returns 12 tokens (not 11!) for the given Chinese sentence.

![GPT-4 Tokenization Example](fig/intro1b.png)

Check warning on line 134 in episodes/01-introduction.md (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: fig/intro1b.png

We can also argue whether the statement "Chinese is generally tokenized character by character" is an overstatement. In any case, the real question here is: are we ok with *almost correct answers*? Please note that this is not a call to avoid using LLMs, but a call for careful consideration of how they are used and, more importantly, an attempt to explain the mechanisms behind them through NLP concepts.
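
To make the point about tokenizer assumptions concrete, here is a minimal sketch showing that the same text yields different token counts under different tokenization schemes. The sentence below is an illustrative assumption (it is not the sentence from the screenshot), and the two helper functions are hypothetical names written for this sketch; neither reproduces the gpt-4 tokenizer, which is a subword tokenizer and would generally give yet another count.

```python
# Illustrative sentence chosen for this sketch; NOT the sentence from the screenshot.
sentence = "我喜欢学习语言"


def char_tokens(text):
    """Hypothetical helper: one token per Unicode character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def byte_tokens(text):
    """Hypothetical helper: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


char_count = len(char_tokens(sentence))
byte_count = len(byte_tokens(sentence))

print(f"character-level tokens: {char_count}")
print(f"byte-level tokens:      {byte_count}")
```

The two counts differ because each of these characters takes three bytes in UTF-8. An answer like "this sentence has N tokens" is therefore only meaningful once we know which tokenizer was assumed.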

@@ -380,7 +380,7 @@
Lowercasing text, for example to avoid treating "Dog" and "dog" as two different words, can also be useful, for instance when training word vector representations, where we want to merge both occurrences because they represent the same concept. Lowercasing can be done directly in Python:

```python
# Earlier version of this line (removed in this change): lower_text = text_flat.lower()
lower_text = text.lower()
lower_text[:100] # Beware that this is a Python string operation
```
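
The effect of lowercasing on word counts can be sketched as follows. This is a small illustration under stated assumptions: the sample sentence and the `count_words` helper are made up for this example, and the whitespace-and-punctuation splitting is a deliberately naive stand-in for a real tokenizer.

```python
from collections import Counter

# Made-up sample text for this sketch.
sample = "The Dog barked. The dog slept. A DOG ran."


def count_words(s):
    """Hypothetical helper: naive whitespace split, stripping common punctuation."""
    words = [w.strip(".,!?") for w in s.split()]
    return Counter(w for w in words if w)


raw_counts = count_words(sample)            # "Dog", "dog" and "DOG" are separate keys
lower_counts = count_words(sample.lower())  # all three collapse into "dog"

print(raw_counts["Dog"], raw_counts["dog"], raw_counts["DOG"])
print(lower_counts["dog"])
```

After lowercasing, the three surface forms are counted as one word, which is the merging behaviour described above. For caseless matching across languages, Python also offers `str.casefold()`, which may be worth considering instead of `str.lower()`.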

102 changes: 91 additions & 11 deletions learners/reference.md
@@ -2,18 +2,48 @@
title: 'Reference'
---

## Table of Contents

- [Glossary](#glossary)
- [Alphabetical Index](#alphabetical-index)
- [NLP Fundamentals](#nlp-fundamentals)
- [Linguistic Properties of Text](#linguistic-properties-of-text)
- [NLP Tasks](#nlp-tasks)
- [Machine Learning & Deep Learning](#machine-learning--deep-learning)
- [Pre-trained Models & Transfer Learning](#pre-trained-models--transfer-learning)
- [Text Preprocessing & NLP Pipelines](#text-preprocessing--nlp-pipelines)
- [Word Representations](#word-representations)
- [Transformer Architecture](#transformer-architecture)
- [Model Evaluation](#model-evaluation)
- [Large Language Models](#large-language-models)
- [Prompting & Generation Control](#prompting--generation-control)
- [LLM Behavior & Limitations](#llm-behavior--limitations)
- [External References](#external-references)
- [Books](#books)
- [Key Papers](#key-papers)
- [Tools & Libraries](#tools--libraries)
- [Datasets & Linguistic Resources](#datasets--linguistic-resources)
- [To Know More...](#to-know-more)

---

## Glossary

### Index of Terms
### Alphabetical Index

#### A
- [Accuracy](#accuracy)
- [Ambiguity](#ambiguity)
- [Attention Mechanism](#attention-mechanism)
- [Authorship Attribution](#authorship-attribution)

#### B
- [Backpropagation](#backpropagation)
- [Base Model](#base-model)
- [BERT](#bert)
- [Bias & Fairness](#bias-fairness)

#### C
- [Chunking](#chunking)
- [Co-reference Resolution](#coreference-resolution)
- [Compositionality](#compositionality)
@@ -24,6 +54,8 @@ title: 'Reference'
- [Convolutional Neural Network (CNN)](#cnn)
- [Corpus / Corpora](#corpus)
- [Cosine Similarity](#cosine-similarity)

#### D
- [Data Formatting](#data-formatting)
- [Decoder](#decoder)
- [Deep Learning](#deep-learning)
@@ -32,39 +64,61 @@ title: 'Reference'
- [Document Clustering](#document-clustering)
- [Domain-Specific Data](#domain-specific-data)
- [Downstream Task](#downstream-task)

#### E
- [ELMo](#elmo)
- [Embedder LLM vs. Generative LLM](#embedder-vs-generative)
- [Encoder](#encoder)
- [Entity Linking](#entity-linking)

#### F
- [F1-score](#f1-score)
- [FastText](#fasttext)
- [Fine-tuning](#fine-tuning)

#### G
- [Greedy Decoding](#greedy-decoding)
- [Guardrails](#guardrails)

#### H
- [Hallucination / Confabulation](#hallucination)
- [Hyperparameters](#hyperparameters)

#### I
- [Information Retrieval (IR)](#information-retrieval)
- [IOB Notation](#iob-notation)

#### K
- [Knowledge Base](#knowledge-base)

#### L
- [Language Identification](#language-identification)
- [Language Modeling](#language-modeling)
- [Large Language Model (LLM)](#llm)
- [Lemma / Lemmatization / Inflection](#lemmatization)
- [Long Short-Term Memory (LSTM)](#lstm)
- [Lowercasing](#lowercasing)

#### M
- [Machine Translation (MT)](#machine-translation)
- [Masked Language Model (MLM)](#mlm)
- [Morphological Ambiguity](#morphological-ambiguity)
- [Multi-layer Perceptron (MLP)](#mlp)
- [Multi-turn Conversation](#multi-turn)

#### N
- [Named Entity Recognition (NER)](#ner)
- [Natural Language](#natural-language)
- [Natural Language Processing (NLP)](#nlp)
- [Neural Network](#neural-network)
- [NLP Pipeline](#nlp-pipeline)
- [Non-determinism](#non-determinism)

#### O
- [Open-weights Model](#open-weights)
- [Overfitting](#overfitting)

#### P
- [Parameters](#parameters)
- [Paraphrasing](#paraphrasing)
- [Part-of-Speech (POS) Tagging](#pos-tagging)
@@ -74,12 +128,18 @@ title: 'Reference'
- [Pre-trained Model](#pretrained-model)
- [Precision](#precision)
- [Prompt Engineering](#prompt-engineering)

#### Q
- [Question Answering (QA)](#question-answering)

#### R
- [Recall](#recall)
- [Recurrent Neural Network (RNN)](#rnn)
- [Reinforcement Learning from Human Feedback (RLHF)](#rlhf)
- [Relation Extraction](#relation-extraction)
- [RoBERTa](#roberta)

#### S
- [Self-Attention](#self-attention)
- [Semantic Ambiguity](#semantic-ambiguity)
- [Semantic Role Labeling (SRL)](#srl)
@@ -94,6 +154,8 @@ title: 'Reference'
- [Supervised Fine-Tuning (SFT)](#sft)
- [Supervised Learning](#supervised-learning)
- [System / User / Assistant Roles](#conversation-roles)

#### T
- [Temperature](#temperature)
- [Text Classification](#text-classification)
- [Text Generation](#text-generation)
@@ -107,9 +169,15 @@ title: 'Reference'
- [Transfer Learning](#transfer-learning)
- [Transformer](#transformer)
- [True Positives / True Negatives / False Positives / False Negatives](#tp-tn-fp-fn)

#### U
- [ULMFiT](#ulmfit)
- [Unsupervised Learning](#unsupervised-learning)

#### V
- [Vector Space](#vector-space)

#### W
- [Word Embedding](#word-embedding)
- [Word Sense Disambiguation (WSD)](#wsd)
- [Word2Vec](#word2vec)
@@ -491,16 +559,26 @@ title: 'Reference'

## External References

### Books
- Jurafsky, D., & Martin, J. (2026). *Speech and Language Processing (3rd ed. draft)*. [Available online](https://web.stanford.edu/~jurafsky/slp3/)
- Alammar, J., & Grootendorst, M. (2024). *Hands-on large language models*. O'Reilly Media.
- Tunstall, L., von Werra, L., & Wolf, T. (2022). *Natural language processing with transformers*. O'Reilly Media.



### Key Papers

- Mikolov et al. (2013). *Efficient Estimation of Word Representations in Vector Space* (Word2Vec). <https://arxiv.org/pdf/1301.3781>
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1409.0473
- Goldberg, Y. (2016). A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345–420. https://arxiv.org/pdf/1510.00726
- Vaswani et al. (2017). *Attention Is All You Need* (Transformer architecture). <https://arxiv.org/pdf/1706.03762>
- Devlin et al. (2019). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. <https://aclanthology.org/N19-1423.pdf>
- Joulin et al. (2016). *Bag of Tricks for Efficient Text Classification* (FastText). <https://arxiv.org/pdf/1607.01759>
- Peters et al. (2018). *Deep Contextualized Word Representations* (ELMo). <https://aclanthology.org/N18-1202.pdf>
- Howard & Ruder (2018). *Universal Language Model Fine-tuning for Text Classification* (ULMFiT). <https://aclanthology.org/P18-1031.pdf>
- Lenci (2018). *Distributional Models of Word Meaning* (survey on distributional semantics). <https://arxiv.org/pdf/1905.01896>
- Huang et al. (2024). *Survey on Hallucination in Large Language Models* (confabulation). <https://arxiv.org/abs/2406.04175>
- Devlin et al. (2019). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. (BERT) <https://aclanthology.org/N19-1423.pdf>
- Boleda (2020). *Distributional Semantics and Linguistic Theory*. <https://arxiv.org/pdf/1905.01896>
- Emily M. Bender & Alexander Koller. 2020. *Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data*. <https://aclanthology.org/2020.acl-main.463.pdf>
- Emily M. Bender et al. (2021). *On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜*. <https://doi.org/10.1145/3442188.3445922/>

### Tools & Libraries

@@ -530,10 +608,12 @@ title: 'Reference'
- [Wikidata](https://www.wikidata.org/) — A free, open knowledge base with structured data about entities, their properties, and relationships; used to enrich NLP applications.
- [Dolma](https://github.com/allenai/dolma) — An open dataset of 3 trillion tokens from diverse sources (web, books, code, encyclopedic material) used to train English LLMs.

### Further Reading
### To Know More...

- Ruder, S. (2020). *NLP Beyond English* — A blog post surveying challenges and opportunities for NLP in non-English and minority languages. <https://www.ruder.io/nlp-beyond-english/>
- McCormick, C. (2016). *Word2Vec Tutorial: The Skip-Gram Model* — An intuitive walkthrough of how Word2Vec is trained. <https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/>
- HuggingFace. *Fine-tuning a pre-trained model* — Official tutorial for adapting BERT-like models to custom classification tasks. <https://huggingface.co/docs/transformers/v4.57.1/en/training#fine-tuning>
- spaCy. *Models & Languages* — Overview of available pre-trained spaCy models across many languages. <https://spacy.io/models>
- HuggingFace. *Text Generation Strategies* — Documentation on decoding strategies (greedy, sampling, beam search) in the Transformers library. <https://huggingface.co/docs/transformers/generation_strategies>
- **Challenges and opportunities for NLP in non-English and minority languages**. Ruder, S. (2020). *NLP Beyond English*. Blogpost: <https://www.ruder.io/nlp-beyond-english/>
- **How do children start to produce recognisable words**. Deb Roy - *The birth of a word*. Video Talk: <https://www.ted.com/talks/deb_roy_the_birth_of_a_word/>
- **Chomsky's hierarchy of languages**. Wikipedia Article: <https://en.wikipedia.org/wiki/Chomsky_hierarchy/>
- **On the limits of benchmarks as the only source of model evolution:** Raji, I. D., Denton, E., Bender, E. M., Hanna, A., & Paullada, A. (2021). AI and the everything in the whole wide world benchmark. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. https://openreview.net/forum?id=j6NxpQbREA1
- **An intuitive walkthrough of how Word2Vec is trained:** McCormick, C. (2016). *Word2Vec Tutorial: The Skip-Gram Model*. <https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/>
- **Official tutorial for adapting BERT-like models to custom classification tasks:** HuggingFace. *Fine-tuning a pre-trained model*. <https://huggingface.co/docs/transformers/v4.57.1/en/training#fine-tuning>
- **Documentation on decoding strategies (greedy, sampling, beam search) in the Transformers library:** HuggingFace. *Text Generation Strategies*. <https://huggingface.co/docs/transformers/generation_strategies>