Idea for further filtering

I've just run a quick filter to find non-English docs and found 5,052 of them (out of the total 8 million).

It's a fairly crude filter, but I haven't seen any false positives.

```py
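import re
import datasets

ds = datasets.load_dataset("openwebtext", split="train")

# Keep only docs that contain none of these common English words.
# The match is case-insensitive and on substrings, so "the" also matches inside longer words.
ds_filtered = ds.filter(lambda sample: not re.search(r"(?i)the|that|and|with|this", sample["text"]))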
```

The filtered docs look like this:

![image](https://user-images.githubusercontent.com/4443482/222936389-ea18ec6e-1393-4708-b20f-2dba88f23f61.png)

Printed with:
```py
for doc in ds_filtered:
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")
```
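
Because the filter above matches substrings, a non-English doc containing a word like "Mathematik" (which includes "the") would slip through. A possible refinement, sketched here as an assumption rather than something that has been run on the dataset, is to match the same five words only as whole words. The helper name `looks_non_english` is hypothetical, and the 5,052 figure above would not carry over to this variant.

```py
import re

# Hypothetical refinement, not the filter used for the count above:
# the same five words, but matched only as whole words.
STOPWORD_RE = re.compile(r"\b(?:the|that|and|with|this)\b", re.IGNORECASE)


def looks_non_english(text: str) -> bool:
    """Return True when none of the common English words occur as whole words."""
    return STOPWORD_RE.search(text) is None


# With the dataset loaded as in the first snippet, it would be applied as:
# ds_filtered = ds.filter(lambda sample: looks_non_english(sample["text"]))
```

This would likely flag more docs than the substring version, so the results would need the same kind of spot check for false positives.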

Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.
