Merge pull request #305 from minemile/unit4_multimodal_models_transfer_learning_fixed_links

Fixed links in multimodal models transfer learning introduction
johko authored Jul 17, 2024
2 parents 80d9152 + 7055621 commit 4890225
Showing 1 changed file with 9 additions and 9 deletions: chapters/en/unit4/multimodal-models/transfer_learning.mdx
@@ -4,7 +4,7 @@ In the preceding sections, we've delved into the fundamental concepts of multimodal models…

There are several approaches to adapting multimodal models to your use case:

1. **Zero/few-shot learning**. Zero/few-shot learning leverages large pretrained models capable of solving problems not present in their training data. These approaches can be useful when there is little labeled data for a task (5-10 examples) or none at all; a zero-shot sketch follows this list. [Unit 11](https://huggingface.co/learn/computer-vision-course/unit11/1) will delve deeper into this topic.

2. **Training the model from scratch**. When pre-trained model weights are unavailable, or when the model's original dataset differs substantially from your own, this method becomes necessary. Here, we initialize the model weights randomly (or via more sophisticated schemes such as [He initialization](https://arxiv.org/abs/1502.01852), sketched below) and proceed with the usual training. However, this approach demands substantial amounts of training data.
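
A minimal sketch of the zero-shot route with a pretrained CLIP checkpoint, as promised above. The image URL and candidate labels are placeholders, not examples from this chapter:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image and candidate labels -- swap in your own task.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmaxed into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```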
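
For the from-scratch route, a small sketch of He (Kaiming) initialization applied to a toy PyTorch model; the architecture here is purely illustrative:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

def he_init(module):
    # He initialization targets layers followed by ReLU nonlinearities.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model.apply(he_init)  # after this, train as usual on your (large) dataset
```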

@@ -38,12 +38,12 @@ However, despite its advantages, transfer learning has some challenges that should…

## Transfer Learning Applications

We'll explore practical applications of transfer learning across various tasks. The table below describes tasks that multimodal models can solve and links examples of fine-tuning each on your own data; short code sketches follow the table.

| Task | Description | Model |
| ----------- | ---------------------------------------------------------------- | ------------------------------------------------- |
| [Fine-tune CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/Clip_finetune.ipynb) | Fine-tuning CLIP on a custom dataset | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| [VQA](https://huggingface.co/docs/transformers/main/en/tasks/visual_question_answering#train-the-model) | Answering a question in natural <br/> language based on an image | [dandelin/vilt-b32-mlm](https://huggingface.co/dandelin/vilt-b32-mlm) |
| [Image-to-Text](https://huggingface.co/docs/transformers/main/en/tasks/image_captioning) | Describing an image in natural language | [microsoft/git-base](https://huggingface.co/microsoft/git-base) |
| [Open-set object detection](https://docs.ultralytics.com/models/yolo-world/) | Detecting objects from natural language input | [YOLO-World](https://huggingface.co/papers/2401.17270) |
| [Assistant (GPT-4V like)](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) | Instruction tuning in the multimodal field | [LLaVA](https://huggingface.co/docs/transformers/model_doc/llava) |
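
To make the first row concrete, here is a minimal sketch of a single CLIP fine-tuning step using the contrastive loss built into `transformers`; the captions, images, and optimizer settings are placeholder assumptions, and the linked notebook remains the fuller reference:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters

# A toy "batch": in practice, iterate over your paired image-caption dataset.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["two cats sleeping on a couch", "a dog catching a frisbee"]
inputs = processor(text=captions, images=[image, image], return_tensors="pt", padding=True)

model.train()
outputs = model(**inputs, return_loss=True)  # symmetric image-text contrastive loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```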
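
For the VQA row, inference with an already fine-tuned ViLT checkpoint looks like this (we use `dandelin/vilt-b32-finetuned-vqa` for illustration; the table's `dandelin/vilt-b32-mlm` is the base model you would fine-tune yourself):

```python
import requests
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"  # placeholder question

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])  # predicted answer
```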
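
The Image-to-Text row condenses to a few lines with the `image-to-text` pipeline; note that the base `microsoft/git-base` checkpoint typically benefits from fine-tuning first, as the linked guide shows:

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="microsoft/git-base")
# Placeholder image URL -- any local path or URL works.
print(captioner("http://images.cocodataset.org/val2017/000000039769.jpg"))
```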
