Open
Description
I would like to know how to configure the minhash deduplication pipeline so that, when duplicate datapoints are detected, ALL samples classified as duplicates are dropped instead of one sample being kept.
Is there a config option for this, or somewhere in the code I can modify to achieve it?
Here's some context:
Let's say we have two datasets, A and B. We need to detect duplicates between A and B, but remove the duplicates only from A.
My thinking is that if all detected duplicates can be removed, I will end up with a processed A and can keep using the original B, which achieves a similar effect.
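To make the desired behaviour concrete, here is a minimal sketch in plain Python. It is not the pipeline's API: `drop_all_duplicates`, `docs_a`, `docs_b`, and `duplicate_pairs` are hypothetical names, and `duplicate_pairs` stands in for whatever pairs of matching document ids the minhash stage produces. It also assumes ids are unique across A and B.

```python
from collections import defaultdict


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def drop_all_duplicates(docs_a, docs_b, duplicate_pairs):
    """Return the ids from A that do not belong to any duplicate cluster.

    Hypothetical helper: duplicate_pairs is an iterable of (id, id) tuples
    flagged as near-duplicates by some earlier minhash step.
    """
    # Every document starts in its own cluster.
    parent = {d: d for d in list(docs_a) + list(docs_b)}

    # Merge the clusters of each flagged pair.
    for x, y in duplicate_pairs:
        parent[find(parent, x)] = find(parent, y)

    # Count how many documents ended up in each cluster.
    sizes = defaultdict(int)
    for d in parent:
        sizes[find(parent, d)] += 1

    # Keep only A documents whose cluster contains nothing else.
    return [d for d in docs_a if sizes[find(parent, d)] == 1]


docs_a = ["a1", "a2", "a3", "a4"]
docs_b = ["b1", "b2"]
pairs = [("a1", "b1"), ("a2", "a3")]

kept_a = drop_all_duplicates(docs_a, docs_b, pairs)
print(kept_a)  # ['a4']
```

Note that in this sketch duplicates that occur only within A (here `a2` and `a3`) are also dropped entirely, which is a side effect of removing every member of a cluster rather than keeping one.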
Thank you!