Deduplication Configuration

I would like to know how to configure the minhash deduplication pipeline so that, when duplicate datapoints are detected, it drops ALL samples classified as duplicates instead of keeping one sample per group.

Is there a config option, or somewhere in the code I can modify, to achieve this?
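To make the request concrete, here is a minimal sketch of the two behaviours in plain Python. It assumes the duplicate groups are already available as lists of sample ids; the function names and the `clusters` structure are hypothetical and are not part of the pipeline's API.

```python
from typing import Hashable, Iterable, List, Set


def keep_one_per_cluster(
    ids: Iterable[Hashable], clusters: Iterable[Iterable[Hashable]]
) -> List[Hashable]:
    """Usual behaviour: keep the first member of each duplicate cluster."""
    dropped: Set[Hashable] = set()
    for cluster in clusters:
        members = list(cluster)
        dropped.update(members[1:])  # everything except the first member
    return [i for i in ids if i not in dropped]


def drop_all_in_clusters(
    ids: Iterable[Hashable], clusters: Iterable[Iterable[Hashable]]
) -> List[Hashable]:
    """Requested behaviour: drop every member of every duplicate cluster."""
    dropped: Set[Hashable] = set()
    for cluster in clusters:
        members = list(cluster)
        if len(members) > 1:  # a single-member cluster is not a duplicate
            dropped.update(members)
    return [i for i in ids if i not in dropped]


# Hypothetical data: "a"/"c" are duplicates of each other, as are "d"/"e".
ids = ["a", "b", "c", "d", "e"]
clusters = [["a", "c"], ["d", "e"]]

kept_one = keep_one_per_cluster(ids, clusters)
kept_none = drop_all_in_clusters(ids, clusters)
```

With this toy input, `kept_one` retains one representative per group plus the unique sample, while `kept_none` retains only the sample that had no duplicate at all.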

Here's some context:

Let's say we have two datasets, A and B, and we need to detect duplicates between A and B but remove them only from A.
My thinking is that if all detected duplicates can be removed, I will end up with a processed A, and I can keep using the original B to achieve a similar effect.
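For illustration, the A/B case could look roughly like the sketch below. It assumes deduplication is run over A and B together and yields clusters of sample ids; `remove_cross_duplicates` and the id values are hypothetical names made up for this example, not something the pipeline provides.

```python
from typing import Hashable, Iterable, List, Set


def remove_cross_duplicates(
    a_ids: Iterable[Hashable],
    b_ids: Iterable[Hashable],
    clusters: Iterable[Iterable[Hashable]],
) -> List[Hashable]:
    """Drop from A every sample that shares a duplicate cluster with a B sample."""
    b: Set[Hashable] = set(b_ids)
    dropped: Set[Hashable] = set()
    for cluster in clusters:
        members = set(cluster)
        if members & b:  # the cluster contains at least one sample from B
            dropped |= members - b  # remove only the non-B members
    return [i for i in a_ids if i not in dropped]


# Hypothetical data: "a1" duplicates "b1"; "a2" and "a3" duplicate each other.
a_ids = ["a1", "a2", "a3", "a4"]
b_ids = ["b1", "b2"]
clusters = [{"a1", "b1"}, {"a2", "a3"}]

processed_a = remove_cross_duplicates(a_ids, b_ids, clusters)
```

Note one difference from the drop-all approach: here duplicates that exist only within A (such as `a2` and `a3`) are left alone, whereas dropping every detected duplicate from the combined data would remove them too. Either behaviour would work for my use case, since B itself stays untouched.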

Thank you!

Deduplication Configuration #376
