Commit 69b00d2

Add docs for new command #2402
Signed-off-by: Jono Yang <[email protected]>
1 parent 901bf1d commit 69b00d2

7 files changed

+77
-4
lines changed

CHANGELOG.rst

Lines changed: 8 additions & 0 deletions
@@ -13,6 +13,14 @@ Next release
1313
- Replace unmaintained ``toml`` library with ``tomllib`` / ``tomli``.
1414
https://github.com/aboutcode-org/scancode-toolkit/issues/4532
1515

16+
- Add gibberish detection to copyright scanning. This is done using a
17+
2-character Markov chain. A new CLI command,
18+
``scancode-train-gibberish-model``, has been added to regenerate the model
19+
used by the detector.
20+
https://github.com/aboutcode-org/scancode-toolkit/pull/4610
21+
https://github.com/aboutcode-org/scancode-toolkit/issues/2402
22+
23+
1624
v32.4.1 - 2025-07-23
1725
--------------------
1826

configure

Lines changed: 1 addition & 1 deletion
@@ -319,6 +319,6 @@ find_python
319319
create_virtualenv "$VIRTUALENV_DIR"
320320
install_packages "$CFG_REQUIREMENTS"
321321
. "$CFG_BIN_DIR/activate"
322-
"$CFG_BIN_DIR/train-gibberish-model"
322+
"$CFG_BIN_DIR/scancode-train-gibberish-model"
323323

324324
set +e

configure.bat

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ if %ERRORLEVEL% neq 0 (
161161
%CFG_QUIET% ^
162162
%PIP_EXTRA_ARGS% ^
163163
%CFG_REQUIREMENTS%
164-
164+
"%CFG_BIN_DIR%\scancode-train-gibberish-model"
165165

166166
@rem ################################
167167
:create_bin_junction
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
.. _cli-scancode-train-gibberish-model:
2+
3+
ScanCode train gibberish model
4+
==============================
5+
6+
ScanCode uses a 2-character Markov chain to perform gibberish detection on text.
7+
At a high level, it detects gibberish strings by seeing if a sequence of letters
8+
is part of, or all of, a real word, two letters at a time. It does this by checking how
9+
likely it is to go from one letter to another. The probabilities of going from
10+
one letter to another are determined by a model that has been trained on a large
11+
set of valid text, where it counts each transition between letters and computes
12+
a probability from those counts. These probabilities and thresholds are stored
13+
in a model that is saved to a Python pickle.
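
As a rough illustration of the idea described above, here is a minimal sketch of a 2-character Markov chain scorer. It is not ScanCode's actual implementation: the function names (``train``, ``average_transition_probability``), the 27-symbol alphabet, and the add-one smoothing are all assumptions made for this example.

```python
import math

# Assumption: only lowercase ASCII letters and the space character are modeled.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def bigrams(text):
    """Yield consecutive character pairs, ignoring characters outside ALPHABET."""
    chars = [ch for ch in text.lower() if ch in INDEX]
    return zip(chars, chars[1:])


def train(corpus_lines):
    """Count letter-to-letter transitions and return a table of log probabilities."""
    size = len(ALPHABET)
    # Start every count at 1 so unseen transitions keep a small non-zero probability.
    counts = [[1] * size for _ in range(size)]
    for line in corpus_lines:
        for first, second in bigrams(line):
            counts[INDEX[first]][INDEX[second]] += 1

    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(count / total) for count in row])
    return log_probs


def average_transition_probability(text, log_probs):
    """Return the geometric mean of the transition probabilities found in text."""
    log_total = 0.0
    transitions = 0
    for first, second in bigrams(text):
        log_total += log_probs[INDEX[first]][INDEX[second]]
        transitions += 1
    # A string with no transitions scores 1.0 (nothing to judge it on).
    return math.exp(log_total / (transitions or 1))


if __name__ == "__main__":
    sample_corpus = ["the quick brown fox jumps over the lazy dog"] * 50
    model = train(sample_corpus)
    print(average_transition_probability("the lazy fox", model))
    print(average_transition_probability("zxqj vkqz", model))
```

Strings built from transitions that are common in the training text score higher than strings built from rare or unseen transitions, which is what allows a single threshold to separate the two.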
14+
15+
The training corpus for the gibberish detector can be found in ``src/textcode/data/gibberish/``.
16+
17+
``big.txt`` contains the main source of valid words that the gibberish detector
18+
model is trained on.
19+
20+
``good.txt`` and ``bad.txt`` are used to determine the threshold: any string
21+
whose average letter-transition probability falls below this threshold is
22+
classified as gibberish.
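
One plausible way to derive such a threshold from the two sample files is sketched below. This is an assumption for illustration only: the midpoint rule and the names ``pick_threshold`` and ``is_gibberish`` are hypothetical and are not taken from the ScanCode source. The scores are assumed to be average transition probabilities already computed for each line of the good and bad samples.

```python
def pick_threshold(good_scores, bad_scores):
    """Return a cut-off halfway between the lowest good and highest bad score."""
    lowest_good = min(good_scores)
    highest_bad = max(bad_scores)
    if lowest_good <= highest_bad:
        # The samples overlap, so no single cut-off separates them cleanly.
        raise ValueError("good and bad samples are not separable")
    return (lowest_good + highest_bad) / 2


def is_gibberish(score, threshold):
    """A string is treated as gibberish when its score falls below the threshold."""
    return score < threshold
```

Placing the cut-off between the two groups means that replacing the good or bad sample files shifts the threshold, which matches how the ``--good`` and ``--bad`` options are described as adjusting it.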
23+
24+
25+
Usage: ``scancode-train-gibberish-model [OPTIONS]``
26+
27+
Quick Reference
28+
---------------
29+
30+
--big FILE Text file containing main training corpus for the gibberish
31+
detector
32+
--good FILE Text file containing text considered to be not gibberish (good)
33+
--bad FILE Text file containing text considered to be gibberish (bad)
34+
-h, --help Show this message and exit.
35+
36+
----
37+
38+
.. _cli-scancode-train-gibberish-model-big-option:
39+
40+
``--big`` option
41+
^^^^^^^^^^^^^^^^
42+
43+
The ``--big`` option allows the user to use a different text file to train the
44+
gibberish detector model.
45+
46+
.. _cli-scancode-train-gibberish-model-good-option:
47+
48+
``--good`` option
49+
^^^^^^^^^^^^^^^^^
50+
51+
The ``--good`` option allows the user to use a different text file containing
52+
strings considered to be valid copyrights. This option is used to adjust the
53+
average transition probability threshold that determines whether or not a string
54+
is gibberish.
55+
56+
.. _cli-scancode-train-gibberish-model-bad-option:
57+
58+
``--bad`` option
59+
^^^^^^^^^^^^^^^^
60+
61+
The ``--bad`` option allows the user to use a different text file containing
62+
strings considered to be invalid copyrights. This option is used to adjust the
63+
average transition probability threshold that determines whether or not a string
64+
is gibberish.

docs/source/reference/scancode-cli/index.rst

Lines changed: 1 addition & 0 deletions
@@ -88,3 +88,4 @@ For more details into the post-scan CLI options, see :ref:`cli-post-scan-options
8888
cli-extractcode
8989
cli-scancode-reindex-licenses
9090
cli-scancode-license-data
91+
cli-scancode-train-gibberish-model

setup-mini.cfg

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ console_scripts =
160160
regen-package-docs = packagedcode.regen_package_docs:regen_package_docs
161161
add-required-phrases = licensedcode.required_phrases:add_required_phrases
162162
gen-new-required-phrases-rules = licensedcode.required_phrases:gen_required_phrases_rules
163-
train-gibberish-model = textcode.train_gibberish_model:train_gibberish_model
163+
scancode-train-gibberish-model = textcode.train_gibberish_model:train_gibberish_model
164164

165165
# These are configurations for ScanCode plugins as setuptools entry points.
166166
# Each plugin entry hast this form:

setup.cfg

Lines changed: 1 addition & 1 deletion
@@ -162,7 +162,7 @@ console_scripts =
162162
regen-package-docs = packagedcode.regen_package_docs:regen_package_docs
163163
add-required-phrases = licensedcode.required_phrases:add_required_phrases
164164
gen-new-required-phrases-rules = licensedcode.required_phrases:gen_required_phrases_rules
165-
train-gibberish-model = textcode.train_gibberish_model:train_gibberish_model
165+
scancode-train-gibberish-model = textcode.train_gibberish_model:train_gibberish_model
166166

167167
# These are configurations for ScanCode plugins as setuptools entry points.
168168
# Each plugin entry hast this form:
