
Commit

Add assignments page and improved lectures page. Moved content from the main page to these pages.
yoavartzi committed Aug 18, 2024
1 parent c944262 commit 180e95a
Showing 5 changed files with 56 additions and 41 deletions.
6 changes: 3 additions & 3 deletions _config.yml
@@ -34,9 +34,9 @@ plugins:
# Exclude from processing.
# The following items will not be processed, by default. Create a custom list
# to override the default setting.
# exclude:
# - Gemfile
# - Gemfile.lock
exclude:
- assignments/
- Gemfile.lock
# - node_modules
# - vendor/bundle/
# - vendor/cache/
3 changes: 2 additions & 1 deletion _includes/header.html
@@ -1,7 +1,8 @@
<header>

<!-- <div class="wrapper"> -->
{%- assign default_paths = site.pages | map: "path" -%}
{%- assign sorted_pages = site.pages | sort: "order" -%}
{%- assign default_paths = sorted_pages | map: "path" -%}
{%- assign page_paths = site.header_pages | default: default_paths -%}
{%- assign titles_size = site.pages | map: 'title' | join: '' | size -%}
{% assign current_page_url = page.url | relative_url %}
39 changes: 39 additions & 0 deletions assignments.md
@@ -0,0 +1,39 @@
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: page
title: Assignments
order: 20
---


The class includes four assignments. The assignments are available upon request (just email [Yoav](https://yoavartzi.com/)). The assignments were developed by [Anne Wu](https://annshin.github.io/) and [Omer Gul](https://momergul.github.io/). The descriptions below are rudimentary.

The assignments are all available on the LM-class repository: [lm-class/tree/main/assignments](https://github.com/lil-lab/lm-class/tree/main/assignments). The repository does not include the test data. Please email [Yoav Artzi](mailto:[email protected]) for the test data. It is critical for the function of the class that the test data remains hidden.

You will notice that each assignment includes various comments about pending improvements, some minor, others more critical. We expect to address these issues in the next iteration of the class.

Each assignment includes instructions, a report template, a starter repository, code to manage a leaderboard, and a grading schema. The assignments emphasize building, and provide only basic starter code. Each assignment is designed to compare and contrast different approaches. The assignments are complementary to the class material, and require a significant amount of self-learning and exploration.

The leaderboard for each assignment uses a held-out test set. The students get unlabeled test examples, and submit their labels for evaluation on a pre-determined schedule. We use GitHub Classroom to manage the assignments. Submitting results to the leaderboard requires committing a results file. Because we have access to all repositories, computing the leaderboard simply requires pulling all students' repositories at pre-determined times. GitHub Classroom also allows us to include unit tests that are executed each time the students push their code. Passing the unit tests is a component of the grading.
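
As a rough sketch of how this workflow can be scripted, the snippet below pulls each student repository and scores a committed results file against the hidden test labels. The repository names, results-file path, and accuracy metric are hypothetical stand-ins, not the actual course tooling.

```python
# Hypothetical leaderboard computation; repo names, file layout, and the
# metric are illustrative assumptions, not the course's actual tooling.
import csv
import subprocess
from pathlib import Path

STUDENT_REPOS = ["student-repo-1", "student-repo-2"]  # placeholder clones
RESULTS_FILE = "results/test_predictions.csv"  # assumed committed results file


def pull_repo(repo: str) -> None:
    # Snapshot the repository at the pre-determined evaluation time.
    subprocess.run(["git", "-C", repo, "pull"], check=True)


def score(repo: str, gold: dict[str, str]) -> float:
    # Accuracy of the committed predictions against the hidden test labels.
    with (Path(repo) / RESULTS_FILE).open() as f:
        preds = {row["id"]: row["label"] for row in csv.DictReader(f)}
    return sum(preds.get(i) == y for i, y in gold.items()) / len(gold)


def leaderboard(gold: dict[str, str]) -> list[tuple[str, float]]:
    for repo in STUDENT_REPOS:
        pull_repo(repo)
    scores = [(repo, score(repo, gold)) for repo in STUDENT_REPOS]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```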

The assignments include significant implementation work and experimental development. We mandate a minimal milestone 7-10 days before the assignment deadline. We observed that this is a necessary forcing function to get students to work on the assignment early.

Grading varies between assignments. Generally, though, it consists of three components: automatic unit tests (i.e., auto-grading), a written report, and performance on the test set. Performance is a significant component of the grade and follows a per-assignment formula. We generally do not consider leaderboard placement in grading, except for a tiny bonus for the top performer.

## Assignment 1

This assignment is released promptly at the beginning of the semester. The students are asked to implement, experiment with, and contrast a linear perceptron and a multi-layer perceptron (i.e., a neural network) for text classification tasks. The linear perceptron is implemented from scratch without external frameworks (i.e., no PyTorch, SciPy, or NumPy). The students learn how the properties of language support efficient implementation of methods, how to design features, how to conduct error analysis, and how manual features compare to learned representations. This assignment currently does not have a milestone, but we plan to add one for the next iteration of the course.
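
To make concrete why the from-scratch constraint is workable, here is a minimal sketch of a sparse perceptron over bag-of-words features, using only the standard library; the feature and label conventions are our own illustration, not the assignment's starter code.

```python
# Illustrative sparse perceptron for binary text classification, using only
# the standard library (no PyTorch, SciPy, or NumPy).
from collections import defaultdict


def featurize(text: str) -> dict[str, float]:
    # Bag-of-words counts; real feature design is left to the assignment.
    feats: defaultdict[str, float] = defaultdict(float)
    for token in text.lower().split():
        feats[token] += 1.0
    return feats


def predict(weights: dict[str, float], feats: dict[str, float]) -> int:
    # Only features active in this example contribute, so scoring costs
    # O(active features) rather than O(vocabulary): the sparsity of language
    # is what makes the from-scratch implementation practical.
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score >= 0.0 else -1


def update(weights: dict[str, float], text: str, label: int) -> None:
    # Perceptron rule: on a mistake, move the weights toward the gold label
    # (label is +1 or -1).
    feats = featurize(text)
    if predict(weights, feats) != label:
        for f, v in feats.items():
            weights[f] = weights.get(f, 0.0) + label * v
```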

## Assignment 2

The focus of this assignment is language modeling (i.e., next-word prediction). The students implement, experiment with, and contrast an n-gram LM and a transformer-based LM. We ask the students to experiment with different word-level n-gram LMs by varying the n-gram size, smoothing techniques, and tokenization (words vs. sub-words). We also ask them to contrast n-gram and transformer-based LMs. Because of the computational requirements of neural models, this comparison is done on character-based LMs. We provide a buggy implementation of the transformer block. The students are asked to fix it, and build the LM on top of it. We provide auto-grading unit tests to evaluate progress on fixing the transformer block. We provide two benchmarks to evaluate on: perplexity on a held-out set and word error rate on a speech recognition re-ranking task. The leaderboard uses the recognition task, because perplexity is sensitive to tokenization.
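
For concreteness, the count-based side of the assignment centers on estimates like the add-k smoothed n-gram probability sketched below; add-k is only the simplest of the smoothing techniques the students explore, and the interface here is an illustrative assumption.

```python
# Minimal add-k smoothed n-gram LM; an illustrative sketch, not starter code.
from collections import Counter


class NGramLM:
    def __init__(self, n: int, k: float = 1.0):
        self.n, self.k = n, k
        self.ngram_counts: Counter = Counter()
        self.context_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def train(self, tokens: list[str]) -> None:
        # Pad so the first real token has a full-length context.
        padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
        self.vocab.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = tuple(padded[i : i + self.n])
            self.ngram_counts[gram] += 1
            self.context_counts[gram[:-1]] += 1

    def prob(self, context: tuple[str, ...], word: str) -> float:
        # Add-k smoothing: every vocabulary word receives k pseudo-counts,
        # so unseen n-grams get non-zero probability.
        num = self.ngram_counts[context + (word,)] + self.k
        den = self.context_counts[context] + self.k * len(self.vocab)
        return num / den
```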

## Assignment 3

The focus of this assignment is self-supervised representations. The students compare word2vec, BERT, and GPT-2 representations. They implement and train word2vec, and experimentally compare it to off-the-shelf BERT and GPT-2 models. The goal of implementing word2vec is to develop a deep technical understanding of estimating self-supervised representations and to observe the impact of training design decisions, within the resources available to students. The goal of comparing the three models is to gain insights into the quality of their representations. The experiments use context-independent and context-dependent word similarity datasets, exposing the impact of the different training paradigms.
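
As a rough picture of the word2vec component, a single skip-gram-with-negative-sampling update fits in a few lines of NumPy; the interface and hyperparameters below are illustrative assumptions, not the assignment's actual API.

```python
# One skip-gram negative-sampling (SGNS) gradient step; illustrative only.
import numpy as np


def sgns_step(center_vecs, context_vecs, center, pos, neg_ids, lr=0.025):
    # Minimizes -log sigmoid(u_pos . v_center) - sum_k log sigmoid(-u_k . v_center).
    # Assumes pos and neg_ids are distinct row indices into context_vecs.
    v = center_vecs[center].copy()  # copy: the row is updated in place below
    ids = np.array([pos] + list(neg_ids))
    labels = np.zeros(len(ids))
    labels[0] = 1.0  # the true context word comes first
    u = context_vecs[ids]                    # (1 + K, d)
    probs = 1.0 / (1.0 + np.exp(-(u @ v)))   # sigmoid scores
    grad = probs - labels                    # (1 + K,)
    center_vecs[center] -= lr * (grad @ u)   # gradient w.r.t. v_center
    context_vecs[ids] -= lr * np.outer(grad, v)
    return -np.log(probs[0]) - np.log(1.0 - probs[1:]).sum()  # loss
```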

## Assignment 4

The focus of this assignment is supervised sequence prediction. It uses a language-to-SQL code generation task based on the ATIS benchmark, with evaluation of both surface-form correctness and database execution. The students implement three approaches to the problem: prompting with an LLM (including ICL), fine-tuning a pretrained encoder-decoder model, and training the same model from scratch. The assignment uses a T5 model that is provided to the students, and training from scratch uses the same model but with randomly initialized parameters.
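
The contrast between the last two approaches comes down to initialization: with the Hugging Face transformers library, the same architecture can be loaded with pretrained or random weights (the checkpoint name below is a placeholder; the assignment provides its own model):

```python
# Same architecture, two initializations; "t5-small" is a placeholder.
from transformers import T5Config, T5ForConditionalGeneration

# Fine-tuning: start from the pretrained parameters.
finetuned = T5ForConditionalGeneration.from_pretrained("t5-small")

# From scratch: identical configuration, randomly initialized parameters.
config = T5Config.from_pretrained("t5-small")
scratch = T5ForConditionalGeneration(config)
```

Everything downstream (data, training loop, decoding) stays identical, which isolates the effect of pretraining on the task.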
37 changes: 2 additions & 35 deletions index.md
@@ -4,6 +4,7 @@

layout: page
title: LM-class
order: 0
---

<strong><small>Version: 2024.1beta</small></strong>
@@ -17,47 +18,13 @@ title: LM-class
</div>
</div>

<p>The class emphasizes technical depth rather than coverage. It does not aim to provide a broad overview of everything that is happening in the field. The objective is to give students a strong base and the tools to expand their knowledge and update it on their own.</p>
<p>The materials include <a href="/lectures">lectures</a> and <a href="/assignments">assignments</a>. The class emphasizes technical depth rather than coverage. It does not aim to provide a broad overview of everything that is happening in the field. The objective is to give students a strong base and the tools to expand their knowledge and update it on their own.</p>

<p>LM-class was created by <a href="https://yoavartzi.com/">Yoav Artzi</a>, <a href="https://annshin.github.io/">Anne Wu</a>, and <a href="https://momergul.github.io/">Omer Gul</a>. Much of the material was adapted from or inspired by existing NLP classes. Each lecture and assignment includes a slide at the end with acknowledgements. If I missed any attribution, I am really sorry. Please let me know, so I can correct it. Acknowledgements are listed <a href="#acknowledgements">below</a>.</p>

<p>The materials are distributed under the <a href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a> license, and we hope they will find broad and diverse use.</p>

## Lectures

The [lectures](lectures) are organized into three sections:

1. Warming up: this section quickly brings the students up to speed with the basics. The goal is to prepare the students for the first assignment. Beyond a quick introduction, it includes: data basics, linear perceptron, and multi-layer perceptron.
1. Learning with raw data: this section focuses on representation learning from raw data (i.e., without any annotation or user labor). It is divided into three major parts: word embeddings, next-word-prediction language modeling, and masked language modeling. Through these subsections we introduce many of the fundamental technical concepts and methods of the field.
1. Learning with annotated data: this section focuses on learning with annotated data. It introduces the notion of a task as a framework for structuring solution development, through the review of several prototypical NLP tasks. For each task, we discuss the problem, data, and modeling decisions, and formulate a technical approach to address it. This section takes a broad view of annotated data, including language model alignment (i.e., instruction tuning and RLHF).

## Assignments

The class includes four assignments. The assignments are available upon request (just email [Yoav](https://yoavartzi.com/)). The assignments were developed by [Anne Wu](https://annshin.github.io/) and [Omer Gul](https://momergul.github.io/). The descriptions below are rudimentary. We are still working on documenting the assignments better and figuring out a better release protocol.

Each assignment includes instructions, a report template, a starter repository, code to manage a leaderboard, and a grading schema. The assignments emphasize building, and provide only basic starter code. Each assignment is designed to compare and contrast different approaches. The assignments are complementary to the class material, and require a significant amount of self-learning and exploration.

The leaderboard for each assignment uses a held-out test set. The students get unlabeled test examples, and submit their labels for evaluation on a pre-determined schedule. We use GitHub Classroom to manage the assignments. Submitting results to the leaderboard requires committing a results file. Because we have access to all repositories, computing the leaderboard simply requires pulling all students' repositories at pre-determined times. GitHub Classroom also allows us to include unit tests that are executed each time the students push their code. Passing the unit tests is a component of the grading.

The assignments include significant implementation work and experimental development. We mandate a minimal milestone 7-10 days before the assignment deadline. We observed that this is a necessary forcing function to get students to work on the assignment early.

Grading varies between assignments. Generally, though, it consists of three components: automatic unit tests (i.e., auto-grading), a written report, and performance on the test set. Performance is a significant component of the grade and follows a per-assignment formula. We generally do not consider leaderboard placement in grading, except for a tiny bonus for the top performer.

### Assignment 1

This assignment is released promptly at the beginning of the semester. The students are asked to implement, experiment with, and contrast a linear perceptron and a multi-layer perceptron (i.e., a neural network) for text classification tasks. The linear perceptron is implemented from scratch without external frameworks (i.e., no PyTorch, SciPy, or NumPy). The students learn how the properties of language support efficient implementation of methods, how to design features, how to conduct error analysis, and how manual features compare to learned representations. This assignment currently does not have a milestone, but we plan to add one for the next iteration of the course.

### Assignment 2

The focus of this assignment is language modeling (i.e., next-word prediction). The students implement, experiment with, and contrast an n-gram LM and a transformer-based LM. We ask the students to experiment with different word-level n-gram LMs by varying the n-gram size, smoothing techniques, and tokenization (words vs. sub-words). We also ask them to contrast n-gram and transformer-based LMs. Because of the computational requirements of neural models, this comparison is done on character-based LMs. We provide a buggy implementation of the transformer block. The students are asked to fix it, and build the LM on top of it. We provide auto-grading unit tests to evaluate progress on fixing the transformer block. We provide two benchmarks to evaluate on: perplexity on a held-out set and word error rate on a speech recognition re-ranking task. The leaderboard uses the recognition task, because perplexity is sensitive to tokenization.

### Assignment 3

The focus of this assignment is self-supervised representations. The students compare word2vec, BERT, and GPT-2 representations. They implement and train word2vec, and experimentally compare it to off-the-shelf BERT and GPT-2 models. The goal of implementing word2vec is to develop a deep technical understanding of estimating self-supervised representations and to observe the impact of training design decisions, within the resources available to students. The goal of comparing the three models is to gain insights into the quality of their representations. The experiments use context-independent and context-dependent word similarity datasets, exposing the impact of the different training paradigms.

### Assignment 4

The focus of this assignment is supervised sequence prediction. It uses a language-to-SQL code generation task based on the ATIS benchmark, with evaluation of both surface-form correctness and database execution. The students implement three approaches to the problem: prompting with an LLM (including ICL), fine-tuning a pretrained encoder-decoder model, and training the same model from scratch. The assignment uses a T5 model that is provided to the students, and training from scratch uses the same model but with randomly initialized parameters.

## Issues and Future Plans

12 changes: 10 additions & 2 deletions lectures.md
@@ -3,10 +3,18 @@
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: page
title: Lecture Slides
title: Lectures
order: 10
---

The lectures constitute a coherent sequence, where later sections often assume concepts and material from earlier sections.
The lectures constitute a coherent sequence, where later sections often assume concepts and material from earlier sections. They are organized into three sections:

1. Warming up: this section quickly brings the students up to speed with the basics. The goal is to prepare the students for the first assignment. Beyond a quick introduction, it includes: data basics, linear perceptron, and multi-layer perceptron.
1. Learning with raw data: this section focuses on representation learning from raw data (i.e., without any annotation or user labor). It is divided into three major parts: word embeddings, next-word-prediction language modeling, and masked language modeling. Through these subsections we introduce many of the fundamental technical concepts and methods of the field.
1. Learning with annotated data: this section focuses on learning with annotated data. It introduces the notion of a task as a framework for structuring solution development, through the review of several prototypical NLP tasks. For each task, we discuss the problem, data, and modeling decisions, and formulate a technical approach to address it. This section takes a broad view of annotated data, including language model alignment (i.e., instruction tuning and RLHF).


I am collecting a [document of issues](https://docs.google.com/document/d/1aAYaRvR1BauC4RS5TzCeM4fCbTbnPwQVcjlMAVMlTjU/edit?usp=sharing), including feedback from other researchers and instructors. It is recommended to consult this document when using this material. I have not reviewed the issues document in depth, so I cannot vouch for it. However, I plan to review it and address the issues in the next iteration of the class (usually: next spring).

## Warming Up

