README.md: 19 additions & 0 deletions
@@ -54,6 +54,25 @@ Otherwise tens of terabytes will be needed if a single process indexes all the b
With this approach, each job computes its own Union-Find vector and stores it on disk.
The dedup step is performed the same way, except that all the vectors are read and merged at the beginning.
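As a rough illustration of the merge described above, here is a minimal sketch. It assumes every job's vector is a plain parent array over the same global document indices; the names `find`, `union` and `merge_vectors` are hypothetical and not taken from this repository.

```python
from typing import List


def find(parent: List[int], i: int) -> int:
    """Return the root of i, compressing the path along the way."""
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root


def union(parent: List[int], a: int, b: int) -> None:
    """Join the sets of a and b; the smaller root index becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def merge_vectors(vectors: List[List[int]]) -> List[int]:
    """Merge per-job parent vectors (assumed to cover the same documents)."""
    merged = list(range(len(vectors[0])))
    for vec in vectors:
        for doc, par in enumerate(vec):
            # Replay every link found by a job into the merged structure
            union(merged, doc, par)
    return merged
```

Replaying each `(doc, parent)` pair keeps the sketch independent of how each job built its vector, at the cost of one pass over every vector.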
### Cleaning
The cleaning process adds a new metadata field (`"filter"`) to each document, which indicates whether the document should be kept or discarded and, if discarded, the reason.
Possible values are:
- `keep`: the document does not match any of the filtering criteria.
- `adult_ut1`: the URL of the document matches one of the domains in the UT1 adult list. To match, the full domain is extracted from the URL and looked up in the table. If it does not match, the lookup is retried a couple of times, removing the subdomains.
- `length_XX`: the text of the document has fewer than XX characters. Default: 200.
- `lang_ratio_XX`: the ratio of segment languages that match the document language is less than XX. Default: 0.2 (at least 20% of the segment languages must be the same as the document language).
- `word_avg_X`: the average number of words per segment is less than X. Default: 5.
- `char_avg_X`: the average number of characters per segment is less than X. This is used for Chinese, Japanese and Korean. Default: 10.
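The criteria above could be combined roughly as in the following sketch. It is not the repository's implementation: the function names, the order of the checks, the choice of newline-separated segments, the `zh`/`ja`/`ko` language codes, the use of `char_avg` in place of `word_avg` for those languages, and the number of subdomain retries are all assumptions made for illustration.

```python
from typing import List, Set
from urllib.parse import urlsplit

CJK_LANGS = {"zh", "ja", "ko"}  # assumed language codes


def in_adult_list(url: str, adult_domains: Set[str], retries: int = 2) -> bool:
    """Look up the full domain, then retry with leading subdomains removed."""
    domain = urlsplit(url).hostname or ""
    for _ in range(retries + 1):
        if domain in adult_domains:
            return True
        if domain.count(".") < 2:  # no subdomain left to remove
            break
        domain = domain.split(".", 1)[1]
    return False


def filter_value(
    text: str,
    url: str,
    lang: str,
    segment_langs: List[str],
    adult_domains: Set[str],
    min_length: int = 200,
    min_lang_ratio: float = 0.2,
    min_word_avg: int = 5,
    min_char_avg: int = 10,
) -> str:
    """Return the "filter" value for one document (check order is assumed)."""
    if in_adult_list(url, adult_domains):
        return "adult_ut1"
    if len(text) < min_length:
        return f"length_{min_length}"
    matching = sum(1 for seg_lang in segment_langs if seg_lang == lang)
    if segment_langs and matching / len(segment_langs) < min_lang_ratio:
        return f"lang_ratio_{min_lang_ratio}"
    # Segments are assumed to be non-empty lines of the document text
    segments = [s for s in text.split("\n") if s.strip()]
    if segments:
        if lang in CJK_LANGS:
            if sum(len(s) for s in segments) / len(segments) < min_char_avg:
                return f"char_avg_{min_char_avg}"
        elif sum(len(s.split()) for s in segments) / len(segments) < min_word_avg:
            return f"word_avg_{min_word_avg}"
    return "keep"
```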
Some languages are treated as exceptions, and the language ratio rule is disabled for them.
This is mainly because these languages either have poor language identification at segment level or the majority of their documents contain a very high proportion of boilerplate and/or English.
Sometimes both apply.
In those cases the language ratio rule ends up being too aggressive.
These language exceptions are:
- Afrikaans, Swahili, Somali and Tagalog, for the reasons explained above.
- Uzbek, because segment-level language identification tags all the Cyrillic text as other languages.
- Malay and Indonesian, because they tend to be confused with each other.
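A minimal sketch of how the exceptions listed above might switch off the language ratio rule. The ISO 639-1 codes, the constant `LANG_RATIO_EXCEPTIONS` and the function `lang_ratio_filter` are assumptions for illustration, not names from this repository.

```python
from typing import List, Optional

# ISO 639-1 codes for the exception languages (using these codes is an assumption)
LANG_RATIO_EXCEPTIONS = {"af", "sw", "so", "tl", "uz", "ms", "id"}


def lang_ratio_filter(
    doc_lang: str, segment_langs: List[str], min_ratio: float = 0.2
) -> Optional[str]:
    """Return a lang_ratio value, or None when the rule passes or is skipped."""
    if doc_lang in LANG_RATIO_EXCEPTIONS or not segment_langs:
        return None  # rule disabled for this language, or nothing to measure
    ratio = segment_langs.count(doc_lang) / len(segment_langs)
    return f"lang_ratio_{min_ratio}" if ratio < min_ratio else None
```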
## Install
Install requirements inside your virtual environment.