
How to Train Model in Distributed Fashion Using DDP/accelerate with Custom Dataset containing Random Augmentations #3002

wolframalpha opened this issue Jan 28, 2025

Description
I have a custom dataset implementation that includes random augmentations for images and returns data in a specific format. The dataset is defined as follows:

import io

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class OCRDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.data = pd.read_csv(csv_file)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_path = self.data.iloc[idx]['image_path']
        transcription = self.data.iloc[idx]['transcription']
        image = Image.open(img_path).convert('RGB')
        # The transform must return a PIL Image, since the result is
        # re-encoded to PNG bytes below.
        if self.transform:
            image = self.transform(image)
        # Serialize the (augmented) image to PNG bytes.
        img_byte_arr = io.BytesIO()
        image.save(img_byte_arr, format='PNG')
        image_bytes = img_byte_arr.getvalue()
        # Sample in the multimodal messages format expected by swift.
        result = {
            'images': [{'bytes': image_bytes}],
            'messages': [
                {'role': 'user', 'content': 'Perform OCR on the image.'},
                {'role': 'assistant', 'content': transcription}
            ]
        }
        return result

Since I am using a custom dataset with on-the-fly random augmentations, I cannot use the swift sft CLI for training.

Question
How can I modify the training pipeline to support distributed training, e.g. using accelerate with DDP?

The dataset returns data in a custom format (as shown above).
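For context, my current assumption is that I would launch the same script with something like torchrun --nproc_per_node=4 train.py or accelerate launch train.py (script name and GPU count are just placeholders) and add roughly the following boilerplate on the Python side. This is a plain-PyTorch sketch, not anything swift-specific, and I do not know how much of it the swift Seq2SeqTrainer already handles internally:

import os

import torch
import torch.distributed as dist

# Plain-PyTorch DDP initialization sketch (not swift-specific).
# torchrun / accelerate launch export RANK, WORLD_SIZE and LOCAL_RANK.
if 'RANK' in os.environ:
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)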

Current Setup
Here is the current training pipeline:

from torchvision import transforms

from swift.llm import get_model_tokenizer, get_template, EncodePreprocessor
from swift.tuners import Swift
from swift.trainers import Seq2SeqTrainer

# Define random augmentations for the images.
# Note: no ToTensor() here -- OCRDataset re-encodes the augmented image to
# PNG bytes, so the transform must return a PIL Image.
transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
])

# Load the dataset
train_dataset = OCRDataset(csv_file='path_to_train.csv', transform=transform)
val_dataset = OCRDataset(csv_file='path_to_val.csv', transform=transform)

# Retrieve the model and template, and add a trainable LoRA module
model, tokenizer = get_model_tokenizer(model_id_or_path, ...)
template = get_template(model.model_meta.template, tokenizer, ...)
model = Swift.prepare_model(model, lora_config)

# Encode the text into tokens
train_dataset = EncodePreprocessor(template=template)(train_dataset, num_proc=num_proc)
val_dataset = EncodePreprocessor(template=template)(val_dataset, num_proc=num_proc)

# Train the model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=template.data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    template=template,
)
trainer.train()

Because the dataset applies random augmentations on the fly, I need to use DDP or another distributed training method to scale training across multiple GPUs/nodes.

Could you provide guidance or an example of modifying this pipeline to support distributed training while keeping the custom dataset and random augmentations? I am also not sure how to apply custom augmentations when training through the CLI.
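For what it is worth, here is the bare-PyTorch version of what I would normally write for DDP, just to make the question concrete (batch size, worker count, num_epochs and the local_rank variable from the sketch above are placeholders). What I am really asking is whether any of this is necessary, or whether Seq2SeqTrainer takes care of the sampler and model wrapping on its own once the script is launched with torchrun/accelerate:

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process sees a different shard of the dataset; augmentations stay
# on-the-fly because __getitem__ runs independently in every process.
sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(
    train_dataset,
    batch_size=8,              # placeholder
    sampler=sampler,
    num_workers=4,             # placeholder
    collate_fn=template.data_collator,
)
ddp_model = DDP(model, device_ids=[local_rank])

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # re-shuffle differently every epoch
    for batch in train_loader:
        ...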

Any suggestions would be highly appreciated. Thanks!
