I've just run a quick filter to find non-English docs and found 5,052 such cases (out of the total 8 million).
It's a fairly crude filter, but I haven't seen any false positives:
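import re
import datasets

# Load the training split of OpenWebText.
ds = datasets.load_dataset("openwebtext", split="train")
# Keep only docs containing none of these common English words (case-insensitive substring match).
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))

Samples of the docs are things like this:
Printed with: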
for doc in ds_filtered:
    # Flatten newlines and show the first 400 characters of each doc.
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")

Feel free to close if you have no plans for future versions of the dataset; I just thought you might like to know.
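
One known weakness of the filter above is that it matches substrings, so a German doc containing "Mathematik" counts as English because of the embedded "the". Here is a minimal sketch of a whole-word variant, in case it is useful. The names `ENGLISH_STOPWORDS` and `looks_non_english` are hypothetical, and this has not been run against the full dataset, so the count may differ from 5,052.

```python
import re

# Hypothetical stricter variant: match whole words only, so that substrings
# such as the "the" inside German "Mathematik" do not count as English.
ENGLISH_STOPWORDS = re.compile(r"\b(?:the|that|and|with|this)\b", re.IGNORECASE)


def looks_non_english(text: str) -> bool:
    """Return True if none of the common English words occur as whole words."""
    return ENGLISH_STOPWORDS.search(text) is None


# Two toy strings (made up for illustration, not taken from the dataset).
samples = [
    "The cat sat on the mat.",
    "Mathematik ist schwer.",
]
for s in samples:
    print(looks_non_english(s), "|", s)
```

It should plug into the same call as before, e.g. `ds.filter(lambda sample: looks_non_english(sample["text"]))`. It is still a crude heuristic: short docs or docs in languages that share one of these words would still be misclassified.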
