
Commit d076d86

Update to 0.1.0-rc5

1 parent cc5ef49 commit d076d86

3 files changed: +4 −4 lines changed


Diff for: README.md (+2 −2)

@@ -57,7 +57,7 @@ Neural networks compute a non-linear continuous function and therefore require c
 
 First, we'll convert all characters to lowercase and remove any extra whitespace using [Text Normalizer](https://docs.rubixml.com/en/latest/transformers/text-normalizer.html). Then, [Word Count Vectorizer](https://docs.rubixml.com/en/latest/transformers/word-count-vectorizer.html) is responsible for creating a continuous feature vector of word counts from the raw text and [TF-IDF Transformer](https://docs.rubixml.com/en/latest/transformers/tf-idf-transformer.html) applies a weighting scheme to those counts. Finally, [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) takes the TF-IDF weighted counts and centers and scales the sample matrix to have 0 mean and unit variance. This last step will help the neural network converge more quickly.
 
-The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a particular document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 3 different documents. In this way, we limit the number of *noise* words that enter the training set.
+The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a particular document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 3 different documents but no more than 5,000 documents. In this way, we limit the number of *noise* words that enter the training set.
 
 Another common feature representation for words is their [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) value, which takes the term frequencies (TF) from Word Count Vectorizer and weights them by their inverse document frequencies (IDF). IDFs can be interpreted as a word's *importance* within the text corpus. Specifically, higher weight is given to words that are rarer within the corpus.

@@ -84,7 +84,7 @@ use Rubix\ML\Persisters\Filesystem;
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 3, 5000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([
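The two steps the README describes, pruning a bag-of-words vocabulary by document frequency and then weighting the counts by inverse document frequency, can be sketched outside of Rubix ML. The following standalone Python sketch is illustrative only: the toy documents, the small thresholds (stand-ins for the 3 / 5,000 document-frequency bounds above), and the smoothed IDF formula are assumptions, not taken from the repository.

```python
from collections import Counter
import math

# Toy corpus (hypothetical sentiment snippets, not from the dataset).
docs = [
    "great movie great acting",
    "terrible movie boring plot",
    "great plot and great acting",
]

MIN_DF, MAX_DF = 2, 3  # keep words appearing in between 2 and 3 documents

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc.split()))

# Fixed vocabulary pruned by document frequency, like Word Count Vectorizer:
# words in too few documents are noise; words in too many carry little signal.
vocab = sorted(w for w, n in df.items() if MIN_DF <= n <= MAX_DF)

# Raw term counts per document (the bag-of-words feature vectors).
counts = [Counter(doc.split()) for doc in docs]

# TF-IDF: weight each count by a (smoothed) inverse document frequency,
# so rarer words in the corpus receive higher weight.
n_docs = len(docs)
idf = {w: 1.0 + math.log(n_docs / df[w]) for w in vocab}
tfidf = [[c[w] * idf[w] for w in vocab] for c in counts]
```

Words such as "terrible" and "boring" appear in only one toy document, so they fall below `MIN_DF` and never enter the vocabulary, which is the pruning effect the new `maxDocumentFrequency`-style argument extends to overly common words.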

Diff for: composer.json (+1 −1)

@@ -23,7 +23,7 @@
     "require": {
         "php": ">=7.2",
         "league/csv": "^9.5",
-        "rubix/ml": "^0.1.0-rc3"
+        "rubix/ml": "^0.1.0-rc5"
     },
     "suggest": {
         "ext-tensor": "For faster training and inference"

Diff for: train.php (+1 −1)

@@ -41,7 +41,7 @@
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 3, 5000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([

0 commit comments
