
How to Train Model in Distributed Fashion Using DDP/accelerate with Custom Dataset containing Random Augmentations #3002

wolframalpha opened this issue Jan 28, 2025

Description
I have a custom dataset implementation that includes random augmentations for images and returns data in a specific format. The dataset is defined as follows:

import io

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class OCRDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.data = pd.read_csv(csv_file)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_path = self.data.iloc[idx]['image_path']
        transcription = self.data.iloc[idx]['transcription']
        image = Image.open(img_path).convert('RGB')
        # The transform must return a PIL Image, since the result is
        # re-encoded to PNG bytes below.
        if self.transform:
            image = self.transform(image)
        # Serialize the (augmented) image to PNG bytes.
        img_byte_arr = io.BytesIO()
        image.save(img_byte_arr, format='PNG')
        image_bytes = img_byte_arr.getvalue()
        # Sample in the multimodal messages format expected by swift.
        result = {
            'images': [{'bytes': image_bytes}],
            'messages': [
                {'role': 'user', 'content': 'Perform OCR on the image.'},
                {'role': 'assistant', 'content': transcription}
            ]
        }
        return result

Since I am using a custom dataset with on-the-fly random augmentations, I cannot use the swift sft CLI for training.

Question
How can I modify the training pipeline to support distributed training, e.g. using accelerate with DDP?

The dataset returns data in a custom format (as shown above).
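For context, my current assumption is that I would launch the same script with something like torchrun --nproc_per_node=4 train.py or accelerate launch train.py (script name and GPU count are just placeholders) and add roughly the following boilerplate on the Python side. This is a plain-PyTorch sketch, not anything swift-specific, and I do not know how much of it the swift Seq2SeqTrainer already handles internally:

import os

import torch
import torch.distributed as dist

# Plain-PyTorch DDP initialization sketch (not swift-specific).
# torchrun / accelerate launch export RANK, WORLD_SIZE and LOCAL_RANK.
if 'RANK' in os.environ:
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)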

Current Setup
Here is the current training pipeline:

from torchvision import transforms

from swift.llm import get_model_tokenizer, get_template, EncodePreprocessor
from swift.tuners import Swift
from swift.trainers import Seq2SeqTrainer

# Define random augmentations for the images.
# Note: no ToTensor() here -- OCRDataset re-encodes the augmented image to
# PNG bytes, so the transform must return a PIL Image.
transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
])

# Load the dataset
train_dataset = OCRDataset(csv_file='path_to_train.csv', transform=transform)
val_dataset = OCRDataset(csv_file='path_to_val.csv', transform=transform)

# Retrieve the model and template, and add a trainable LoRA module
model, tokenizer = get_model_tokenizer(model_id_or_path, ...)
template = get_template(model.model_meta.template, tokenizer, ...)
model = Swift.prepare_model(model, lora_config)

# Encode the text into tokens
train_dataset = EncodePreprocessor(template=template)(train_dataset, num_proc=num_proc)
val_dataset = EncodePreprocessor(template=template)(val_dataset, num_proc=num_proc)

# Train the model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=template.data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    template=template,
)
trainer.train()

Because the dataset applies random augmentations on the fly, I need to use DDP or another distributed training method to scale training across multiple GPUs/nodes.

Could you provide guidance or an example of modifying this pipeline to support distributed training while keeping the custom dataset and random augmentations? I am also not sure how to apply custom augmentations when training through the CLI.
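For what it is worth, here is the bare-PyTorch version of what I would normally write for DDP, just to make the question concrete (batch size, worker count, num_epochs and the local_rank variable from the sketch above are placeholders). What I am really asking is whether any of this is necessary, or whether Seq2SeqTrainer takes care of the sampler and model wrapping on its own once the script is launched with torchrun/accelerate:

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process sees a different shard of the dataset; augmentations stay
# on-the-fly because __getitem__ runs independently in every process.
sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(
    train_dataset,
    batch_size=8,              # placeholder
    sampler=sampler,
    num_workers=4,             # placeholder
    collate_fn=template.data_collator,
)
ddp_model = DDP(model, device_ids=[local_rank])

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # re-shuffle differently every epoch
    for batch in train_loader:
        ...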

Any suggestions would be highly appreciated. Thanks!
