
Adding OBELICS DataLoader #663


Merged: 11 commits merged into pytorch:main on Mar 31, 2025

Conversation

@TJ-Solergibert (Contributor) commented Oct 30, 2024

Hi,

In this PR I present a first draft of the Multimodal DataLoader. First I will describe how the batches are created and then I will explain the padding problem.


Let's begin by looking at the OBELICS dataset. Every sample in the dataset has 4 keys, but we are only interested in 2 of them:

  • images: A list containing either image URLs or Nones marking the positions of the text.
  • texts: A list containing either text strings or Nones marking the positions of the images.
    Note that len(images)==len(texts) and that, for each index, exactly one of the two entries is not None (see the sketch below).
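
For illustration, a sketch of a single raw sample (the URLs and text are made up; only the images and texts fields are shown):

# Hypothetical OBELICS sample: only "images" and "texts" are shown; the other
# two keys of the dataset are ignored by the dataloader.
sample = {
    "images": ["https://example.com/dog1.jpg", "https://example.com/dog2.jpg", None, None],
    "texts": [None, None, "These are two dogs.", "They were photographed outdoors."],
}

assert len(sample["images"]) == len(sample["texts"])
# For every index, exactly one of the two entries is not None.
for image_url, text in zip(sample["images"], sample["texts"]):
    assert (image_url is None) != (text is None)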

The format_obelics function transforms each sample into a format that can later be fed into the transform block, which prepares the samples in the target format (a sketch of this step follows the list below). Each formatted sample is a dictionary with 2 keys:

  • images: List of PIL Images with the loaded images.
  • text: str with the text of the sample ready to be tokenized, including the image tokens.
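
A minimal sketch of this formatting step, assuming the images are fetched with requests and using a placeholder image token (not the PR's exact code):

from io import BytesIO

import requests
from PIL import Image

IMAGE_TOKEN = "<|image|>"  # assumed placeholder; the real token is defined by the tokenizer


def format_obelics(sample: dict) -> dict:
    """Interleave text and image placeholders into a single string and load the images."""
    images, text_parts = [], []
    for url, text in zip(sample["images"], sample["texts"]):
        if url is not None:
            raw = requests.get(url, timeout=10).content
            images.append(Image.open(BytesIO(raw)).convert("RGB"))
            text_parts.append(IMAGE_TOKEN)
        else:
            text_parts.append(text)
    return {"images": images, "text": "".join(text_parts)}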

Once formatted, we process each sample with the transform block, which is composed of the CLIPPreprocess, TikTokenizer & VisionCrossAttentionMask modules.

CLIPPreprocess


This module prepares the list of images to be fed into the CLIP model. The most relevant steps are resizing the image without distortion, dividing it into tiles, and padding if necessary. Note that it still produces a list of tensors and NOT a single tensor, as every image can have a different number of tiles. This is addressed in the collator, where we pad the image tiles to the largest count in the batch. Also, we keep the maximum number of tiles at 4 and the tile size at 448 for pretraining [1], [2].
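
To make the "list of tensors" point concrete, here is what the output could look like for a 3-image sample (shapes only; hypothetical values, not the module's code):

import torch

TILE_SIZE = 448      # pretraining tile size mentioned above
MAX_NUM_TILES = 4    # pretraining maximum

# Each image yields [num_tiles, channels, tile_size, tile_size]; num_tiles varies per
# image, so CLIPPreprocess returns a list of tensors rather than one stacked tensor.
preprocessed_images = [
    torch.rand(1, 3, TILE_SIZE, TILE_SIZE),  # small image -> 1 tile
    torch.rand(2, 3, TILE_SIZE, TILE_SIZE),  # wide image  -> 2 tiles
    torch.rand(4, 3, TILE_SIZE, TILE_SIZE),  # large image -> 4 tiles (the maximum)
]
# The collator later pads every image to the largest tile count in the batch.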

TikTokenizer

I've included a new method in the tokenizer to encode multimodal text. In short, it encodes the text, adds the special image_id token, and returns both the input_ids & labels, with the bos, eos & image_id tokens masked in the labels.
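
Roughly, the new method does something like the following sketch (the IGNORE_INDEX value and the signature are illustrative assumptions, not the PR's exact API):

IGNORE_INDEX = -100  # assumed masking value for the loss


def encode_multimodal(encode_fn, text: str, bos_id: int, eos_id: int, image_id: int):
    """Encode text (with image placeholders mapped to image_id by encode_fn) and mask
    the special tokens in the labels so they don't contribute to the loss."""
    input_ids = [bos_id] + encode_fn(text) + [eos_id]
    labels = [
        IGNORE_INDEX if token in (bos_id, eos_id, image_id) else token
        for token in input_ids
    ]
    return input_ids, labels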

VisionCrossAttentionMask


This module creates the attention mask for the fused layers. In short, each TILE yields 1025 image_tokens, and this mask specifies, for each text_token, which image_tokens it should attend to. We again return a list of tensors, as the number of image_tokens depends on the number of tiles; once more, this is resolved in the collator.
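
As a sketch of the idea (the function name and signature are hypothetical; this is not the module's implementation):

import torch

TOKENS_PER_TILE = 1025  # image tokens per tile, as described above


def build_cross_attention_masks(num_text_tokens, intervals, tiles_per_image):
    """For each image, text tokens inside its interval may attend to all of that
    image's tile tokens. Returns one mask per image (a list, since tile counts differ)."""
    masks = []
    for (start, end), num_tiles in zip(intervals, tiles_per_image):
        mask = torch.zeros(num_text_tokens, num_tiles * TOKENS_PER_TILE, dtype=torch.bool)
        mask[start:end] = True
        masks.append(mask)
    return masks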

Padding & the collator

As we've seen, the outputs of both CLIPPreprocess & VisionCrossAttentionMask are lists of tensors because of the differing numbers of tiles. Within a single sample we have to pad both artifacts to the maximum number of tiles, but the real issue arises when we run with batch_size > 1: we also need to pad the input_ids (& labels), which is relatively cheap, BUT also the number of images, since the input to the CLIP model is a tensor of shape [Batch size, Number of images, Number of tiles, Channels, Tile size, Tile size]. Padding to the maximum number of tiles is bad, but in the worst case you increase the tensor 4x (from 1 tile to the maximum of 4 tiles). Padding the number of images, however, can blow up much more, as there are samples with 30+ images.
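
A tiny worked example of the blow-up (numbers are hypothetical):

# Two samples: one with 2 images (1 tile each), one with 30 images (up to 4 tiles each).
images_per_sample = [2, 30]
padded_num_images = max(images_per_sample)  # 30
padded_num_tiles = 4                        # maximum number of tiles

# Input to CLIP after padding: [batch, images, tiles, channels, tile, tile]
padded_shape = (len(images_per_sample), padded_num_images, padded_num_tiles, 3, 448, 448)
# -> (2, 30, 4, 3, 448, 448): the 2-image sample contributes 28 all-padding "images",
#    and every 1-tile image carries 3 extra all-padding tiles, so most of the tensor is padding.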

To check this phenomenon I've included scripts/check_padding_mm.py, which computes the % of padding in a batch. Feel free to give it a try; it's very easy to get batches where the majority of the input is padding.

python3 scripts/check_padding_mm.py
Unpadded tokens: 8717, Total tokens in batch: 21728
Padded text tokens: 13011, 59.88%
########################################
Unpadded images: 25, Total images in batch: 64
Padded images: 39, 60.94% (Each image with shape [4, 3, 448, 448])
########################################
Unpadded number of tiles: 61, Total number of tiles: 256
Padded tiles: 195, 68.72% (Each with shape [3, 448, 448])
########################################
Unpadded cross attention mask elements: 545030425, Total cross attention mask elements: 5701427200
Padded cross attention mask elements: 5156396775, 90.44%

That's why I propose continuing to work on a DataLoader & Dataset that can pack multiple samples up to a given input_ids length OR number of images per batch. Packing the input_ids is fairly easy, while packing the cross attention masks will require a bit more effort. Let me know whether you'd be interested in supporting that feature, or whether you just want to include in the repo an example of the multimodal pipeline despite the padding issue described. I also plan to include some unit tests to check the generated samples & the ability to recover from failures.
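
A rough sketch of the greedy packing idea (a hypothetical helper illustrating the proposal, not code from this PR):

def pack_samples(samples, max_tokens: int, max_images: int):
    """Greedily group samples into packs, closing a pack whenever adding the next
    sample would exceed the token budget or the image budget."""
    packs, current, n_tokens, n_images = [], [], 0, 0
    for sample in samples:
        tokens, images = len(sample["tokens"]), len(sample["images"])
        if current and (n_tokens + tokens > max_tokens or n_images + images > max_images):
            packs.append(current)
            current, n_tokens, n_images = [], 0, 0
        current.append(sample)
        n_tokens += tokens
        n_images += images
    if current:
        packs.append(current)
    return packs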

Other comments:

Toni

@facebook-github-bot

Hi @TJ-Solergibert!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot added the CLA Signed label on Oct 30, 2024
BATCH_NUMBER = 4


def main():
Contributor:

Maybe we can make this a unit test? WDYT?

Contributor Author:

I would add as a unit test some checks of shapes & types on the DP axis, rather than this script that just checks the amount of padding in each batch.

@tianyu-l linked an issue on Nov 22, 2024 that may be closed by this pull request
@fduwjj (Contributor) left a comment:

Sorry for the super late review. I finally finished the first round of review; let me know what you think about my questions and comments. I will revisit this PR in the coming week.

Mapping[str, Any]: The sample with an updated "image" field and added
"aspect_ratio" field.
"""
image = sample["image"]
Contributor:

I am wondering: instead of assuming an "image" field, shall we just pass in the image itself so that this part can be generic for other datasets as well?

Contributor Author:

The docs in _process_obelics_sample clearly state the structure that a sample_processor has to generate in order to work with Llama3VisionTransform. I guess we can mark this comment as resolved now that we have the sample_processor & text_processor + datasets.md in docs, right?

max_num_tiles (Optional[int]): Only used if possible_resolutions is NOT given.
Maximum number of tiles to break an image into.
This will be used to generate possible_resolutions,
e.g. [(224, 224), (224, 448), (448, 224)] if max_num_tiles = 2 and tile_size = 224.
Contributor:

Why don't we do (448, 448) as well? Also, shouldn't max_num_tiles be 4?

@TJ-Solergibert (Contributor Author) commented Mar 27, 2025

Tiles have width and height (2D). (448, 448) is 4 tiles, while the 3 resolutions in the example correspond to at most 2 tiles: (224, 224) is a single tile, (224, 448) is one tile next to another, & (448, 224) is one tile on top of another.

For max_num_tiles = 4 we have: [(224, 896), (448, 448), (224, 224), (896, 224), (224, 672), (672, 224), (224, 448), (448, 224)].

We can put this example in the docstring, but it's longer.
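
For reference, a sketch of how such a list could be enumerated (not necessarily the exact implementation in the PR):

def possible_resolutions(max_num_tiles: int, tile_size: int) -> list[tuple[int, int]]:
    """Every (height, width) grid of tiles whose tile count does not exceed max_num_tiles."""
    resolutions = []
    for rows in range(1, max_num_tiles + 1):
        for cols in range(1, max_num_tiles + 1):
            if rows * cols <= max_num_tiles:
                resolutions.append((rows * tile_size, cols * tile_size))
    return resolutions


# possible_resolutions(2, 224) -> [(224, 224), (224, 448), (448, 224)]
# possible_resolutions(4, 224) -> the 8 resolutions listed above, e.g. (448, 448), (224, 896), ...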

].count(self.image_token)
self._sample_idx += 1
# Transform sample
processed_sample = self.transform(processed_sample)
Contributor:

I know you got this from torchtune, but can we give it a name that doesn't sound like "transformer"? Maybe "preproc"? We will have transform or transformer in later stages of the model as well, and this part is not trainable, so a different name would better differentiate them.

Contributor Author:

Changed to self.format(processed_sample) & Llama3VisionFormatter

>>> transform = VisionCrossAttentionMask(tile_size=400, patch_size=40, image_token_id=1)
>>> intervals = transform._get_image_attention_intervals(tokens)
>>> print(intervals)
[[0, 7], [1, 7], [7, 12]]
Contributor:

Hmm, this is slightly different from what I have read about masking... I need more time to think about and validate the logic of this part.

@TJ-Solergibert (Contributor Author) commented Mar 27, 2025

Text tokens only attend to the preceding image, or to multiple images if they are consecutive.

From the example: "Image1Image2 These are two dogs. Image3 This is a cat."

"These are two dogs." will attend to Images 1 & 2, and "This is a cat." only to Image 3.
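
To illustrate with placeholder token IDs (1 stands for image_token_id; the other IDs are made up):

# "<img><img> These are two dogs. <img> This is a cat."
tokens = [1, 1, 10, 11, 12, 13, 14, 1, 20, 21, 22, 23]
# index:  0  1   2   3   4   5   6  7   8   9  10  11

intervals = [[0, 7], [1, 7], [7, 12]]
# Image 1 -> [0, 7):  text tokens 2..6 ("These are two dogs.") attend to it
# Image 2 -> [1, 7):  the same text tokens, because the two images are consecutive
# Image 3 -> [7, 12): text tokens 8..11 ("This is a cat.") attend to it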

batch_size: int,
collator_fn: Callable,
):
super().__init__(dataset=hf_ds, batch_size=batch_size, collate_fn=collator_fn)
Contributor:

where is this collate_fn being used or called?

Contributor Author:

Now we also have #1021, which also adds a collator. For the HF datasets solution already in torchtitan we don't need it, as the samples produced by the Dataset are ready to go. For the multimodal one we need it to pad to the longest/biggest samples in the batch; otherwise, we could pad directly in the dataset to the largest supported shapes (I prefer the collator solution).


# NOTE Inspired from torchtune.data._collate.py
@dataclass
class MultiModalCollator:
Contributor:

IIUC, this generates all the data needed before sending it into the encoder, right? Also, for the MM model we need to feed text tokens into the decoder as well. Shall we just reuse the existing dataloader for llama3?

@TJ-Solergibert (Contributor Author) commented Mar 27, 2025

shall we just reuse the existing dataloader for llama3?

No! Now, with the Dataset & DataLoader in mm_dataset.py & mm_dataloader.py + the collator we prepare all the inputs with a single DataLoader!

The collator returns the prepared batches with:

batch_dict = {
  "tokens": collated_text["tokens"],
  "labels": collated_text["labels"],
  "encoder_input": {
      "images": collated_images,
      "aspect_ratio": collated_aspect_ratios,
  },
  "encoder_mask": concat_masks,
}
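
(A hedged note on the shapes these entries would roughly have under the padding scheme described above; the symbols B, S, N, T are illustrative, not names from the PR.)

# B = batch size, S = padded sequence length, N = padded number of images,
# T = padded number of tiles (illustrative shapes):
#   tokens, labels           -> [B, S]
#   encoder_input["images"]  -> [B, N, T, 3, 448, 448]
#   encoder_mask             -> roughly [B, S, N * T * 1025], after concatenating
#                               the per-image cross-attention masks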

@TJ-Solergibert (Contributor Author) commented Jan 22, 2025

Sorry, I will try to answer all of these comments within the next 10 days.

@fduwjj (Contributor) commented Jan 30, 2025

@TJ-Solergibert I am wondering if we can split this PR and get it merged in pieces?

@fduwjj requested a review from fegin on February 14, 2025
@tianyu-l (Contributor):

Hey @TJ-Solergibert , are you still interested in continuing to work on this PR?

@TJ-Solergibert (Contributor Author):

Updated the PR to the main branch, incorporating new features from torchtitan like the TrainSpec. I'll address the comments in the following 2 days!

@tianyu-l requested review from pbontrager and removed the request for andrewkho on March 27, 2025
@pbontrager left a comment:

Thank you for the hard work here. This looks good, but it includes more Llama 3.2 code than I think we need to enable MM in torchtitan. Since most modern VLMs use early-fusion architectures instead of deep fusion like 3.2, we should choose to support only early-fusion models for now. I left some comments on what could be removed or moved. After that it looks good to go.

self._sample_processor = sample_processor
self.image_token = "<|image|>" # TODO(tj.solergibert) Hardcoded!

self.transform = Llama3VisionTransform(


Leave a todo comment here to make this not hardcoded

Contributor Author:

Left multiple ones. We have to decide which variables we want to expose through JobConfig.

@pbontrager left a comment:

This looks good, thank you for the quick turnaround. I left one additional comment, but I'd be happy to land this now and then iterate on it further in follow-up PRs. If you're ready to land it, just remove your personal TODO comments and remove [WIP], and I'll approve it.

@tianyu-l (Contributor) left a comment:

Thanks for your work. I left some comments.

Please create a new customized tiktoken.py within the experiment folder.


from mm_dataset import build_mm_dataloader

PATH_TO_TOKENIZER = "/iopsstor/scratch/cscs/asolergi/torchtitan/tokenizer.model"
Contributor:

what is this path?

Contributor Author:

It's the path to the llama 3 tokenizer. I've exposed all the args through click, but I can drop this script if you want!

Contributor:

What is this file for? Is it a test, or a sanity check? If so, let's put it into a tests folder.

Contributor Author:

It's just a sanity check. Should I delete it?

@TJ-Solergibert (Contributor Author):

Hi @tianyu-l & @pbontrager,

The PR is ready for re-review! I would say I've addressed all your comments. They've been very helpful, thanks!

In the last push I mainly created a new tiktoken.py under the experiment folder, deleted the Llama3VisionTransform class, and moved all the logic into the dataset itself. I've also refactored the sanity check script a little; it basically checks some shapes and the amount of padding:

python3 torchtitan/experiments/multimodal/check_padding_mm.py --tokenizer-path tokenizer.model
[titan] 2025-03-31 16:07:48,572 - root - INFO - TikTokenizer built: #words 128257, BOS ID 128000, EOS ID 128001, IMAGE ID 128256
[titan] 2025-03-31 16:07:48,572 - root - INFO - Preparing obelics dataset from HuggingFaceM4/OBELICS
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1432/1432 [00:00<00:00, 3711.96it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1432/1432 [00:00<00:00, 191855.98it/s]

Padding tokens in each sample: tensor([ 31, 194, 262,   0])
Unpadded tokens: 1237, Total tokens in batch: 1724
Padded text tokens: 487, 28.25%
################################################################################
Unpadded images: 18, Total images in batch: 44
Padded images: 26, 59.09% (Each image with shape [4, 3, 448, 448])
################################################################################
Unpadded number of tiles: 55, Total number of tiles: 176
Padded tiles: 121, 68.75% (Each with shape [3, 448, 448])
################################################################################

I've left some TODO comments in the code, most of them regarding which arguments we should (or shouldn't) expose to the user:

  • torchtitan/experiments/multimodal/__init__.py: Which cls should we include in the train spec?
  • torchtitan/experiments/multimodal/mm_dataset.py: Expose tile_size, max_num_tiles, image_token, image_mean, image_std, pad_max_tiles & padding_idx through JobConfig?
  • torchtitan/experiments/multimodal/tokenizer/tiktoken.py: Hardcode IMAGE_TOKEN_ID & IGNORE_INDEX?

Toni

@tianyu-l (Contributor) left a comment:

Looks beautiful! Thank you for building the foundation of multimodal training in torchtitan!

Please fix linting before we could merge.

@TJ-Solergibert changed the title from "[WIP] Adding OBELICS DataLoader" to "Adding OBELICS DataLoader" on Mar 31, 2025
@TJ-Solergibert (Contributor Author):

Done! Thanks for your comments! Now that I have a bit of bandwidth, I will see whether I can keep contributing! I'm excited to see how far torchtitan goes with multimodal training!

@tianyu-l merged commit 3e75bae into pytorch:main on Mar 31, 2025
6 checks passed

Successfully merging this pull request may close these issues.

[Multimodal] Adding OBELICS DataLoader
5 participants