Commit efe0c3b

clean: pipeline documentation
1 parent 318e049 commit efe0c3b

1 file changed

+19
-0
lines changed

README.md

Lines changed: 19 additions & 0 deletions
@@ -54,6 +54,25 @@ Otherwise tens of terabytes will be needed if a single process indexes all the b
With this approach, each job computes its own Union-Find vector and stores it on disk.
The dedup step is performed the same way, except that all the vectors are read and merged at the beginning.
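
The merge of per-job vectors can be pictured with a minimal sketch. It assumes, since the README does not say so, that every vector has the same length, is indexed by a global document id, and stores the parent index of each document. The names `find`, `union` and `merge_vectors` are hypothetical and are not taken from the pipeline.

```python
from typing import List


def find(parent: List[int], x: int) -> int:
    """Return the root of x and compress the path that was walked."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent: List[int], a: int, b: int) -> None:
    """Join the sets of a and b, keeping the smaller index as the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def merge_vectors(vectors: List[List[int]]) -> List[int]:
    """Merge per-job parent vectors (assumed layout) into a single one."""
    merged = list(range(len(vectors[0])))
    for vec in vectors:
        for doc, par in enumerate(vec):
            if par != doc:
                # Every non-trivial parent link is a duplicate pair.
                union(merged, doc, par)
    return merged
```

With this layout, a document is kept after dedup only when it is the root of its own set, that is, when `find(merged, doc) == doc`.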

### Cleaning
The cleaning step adds a new metadata field (`"filter"`) to each document that indicates whether the document should be kept or discarded and, if discarded, the reason.
Possible values are:
- `keep`: the document does not match any of the filtering criteria.
- `adult_ut1`: the URL of the document matches one of the domains in the UT1 adult list. To match, the full domain is extracted from the URL and looked up in the table. If it does not match, the lookup is retried a couple of times with subdomains removed.
- `length_XX`: the text of the document has fewer than XX characters. Default: 200.
- `lang_ratio_XX`: the ratio of segment languages that match the document language is less than XX. Default: 0.2 (at least 20% of the segment languages must be the same as the document language).
- `word_avg_X`: the average number of words per segment is less than X. Default: 5.
- `char_avg_X`: the average number of characters per segment is less than X. This is used for Chinese, Japanese and Korean. Default: 10.

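
As an illustration of how these labels might be assigned, here is a minimal sketch. It is not the pipeline's code: the function names, the argument layout, the order of the checks, the number of subdomain retries, and the choice to apply `char_avg` instead of `word_avg` for Chinese, Japanese and Korean are all assumptions. The language ratio check is left out of this sketch.

```python
from typing import List, Set
from urllib.parse import urlsplit


def in_domain_list(url: str, blocked_domains: Set[str]) -> bool:
    """Look up the full domain, then retry with leading subdomains removed."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try a.b.example.com, then b.example.com, then example.com.
    # The bare top-level label is never tried on its own (assumption).
    for start in range(max(len(labels) - 1, 1)):
        if ".".join(labels[start:]) in blocked_domains:
            return True
    return False


def filter_label(
    url: str,
    segments: List[str],
    blocked_domains: Set[str],
    min_length: int = 200,
    min_word_avg: int = 5,
    min_char_avg: int = 10,
    cjk: bool = False,
) -> str:
    """Return a value for the "filter" field (hypothetical check order)."""
    if in_domain_list(url, blocked_domains):
        return "adult_ut1"
    text = "\n".join(segments)
    if len(text) < min_length:
        return f"length_{min_length}"
    n_segments = max(len(segments), 1)  # avoid dividing by zero
    if cjk:
        # Chinese, Japanese and Korean: average characters per segment.
        if sum(len(s) for s in segments) / n_segments < min_char_avg:
            return f"char_avg_{min_char_avg}"
    elif sum(len(s.split()) for s in segments) / n_segments < min_word_avg:
        # Other languages: average whitespace-separated words per segment.
        return f"word_avg_{min_word_avg}"
    return "keep"
```

For example, `filter_label("http://a.b.example.com/x", ["hello"], {"example.com"})` returns `"adult_ut1"`, because the lookup succeeds once both subdomains have been removed.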
Some languages are treated as exceptions to the language ratio rule, and the rule is disabled for them.
This is mainly because these languages either have poor language identification at segment level or the majority of their documents have a very high portion of boilerplate and/or English.
Sometimes both apply.
Therefore the language ratio rule ends up being too aggressive.
These language exceptions are:
- Afrikaans, Swahili, Somali and Tagalog, for the reasons explained above.
- Uzbek: segment-level language identification tags all Cyrillic text as other languages.
- Malay and Indonesian, which tend to be confused with each other.
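
A minimal sketch of the language ratio rule with these exceptions could look as follows. It assumes that languages are identified by ISO 639-1 codes and that each segment carries a single detected language. The names `LANG_RATIO_EXCEPTIONS` and `lang_ratio_label` are hypothetical and do not come from the pipeline.

```python
from typing import List, Optional

# ISO 639-1 codes for Afrikaans, Swahili, Somali, Tagalog, Uzbek,
# Malay and Indonesian (the use of these codes is an assumption).
LANG_RATIO_EXCEPTIONS = {"af", "sw", "so", "tl", "uz", "ms", "id"}


def lang_ratio_label(
    doc_lang: str,
    segment_langs: List[str],
    min_ratio: float = 0.2,
) -> Optional[str]:
    """Return a lang_ratio label if the rule fires, otherwise None."""
    # The rule is disabled for the exception languages.
    if doc_lang in LANG_RATIO_EXCEPTIONS or not segment_langs:
        return None
    matching = sum(1 for lang in segment_langs if lang == doc_lang)
    if matching / len(segment_langs) < min_ratio:
        return f"lang_ratio_{min_ratio}"
    return None
```

A Spanish document in which only one segment out of ten is detected as Spanish has a ratio of 0.1 and would be labelled, while an Uzbek document with the same ratio would pass because the rule is skipped.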

## Install
Install requirements inside your virtual environment.
```
