Commit e7a4ef5: Update to Rubix ML 0.1.0
Parent: 1c6aa78

4 files changed (+32, -13 lines)

README.md (+23, -5)

@@ -57,7 +57,7 @@ Neural networks compute a non-linear continuous function and therefore require c

 First, we'll convert all characters to lowercase using [Text Normalizer](https://docs.rubixml.com/en/latest/transformers/text-normalizer.html) so that every word is represented by only a single token. Then, [Word Count Vectorizer](https://docs.rubixml.com/en/latest/transformers/word-count-vectorizer.html) creates a fixed-length continuous feature vector of word counts from the raw text and [TF-IDF Transformer](https://docs.rubixml.com/en/latest/transformers/tf-idf-transformer.html) applies a weighting scheme to those counts. Finally, [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) takes the TF-IDF weighted counts and centers and scales the sample matrix to have 0 mean and unit variance. This last step will help the neural network converge quicker.

-The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 3 different documents but no more than 5,000 documents. In this way, we limit the amount of *noise* words that enter the training set.
+The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 3 different documents but no more than 10,000 documents. In this way, we limit the amount of *noise* words that enter the training set.

 Another common text feature representation uses [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) values, which take the term frequencies (TF) from Word Count Vectorizer and weigh them by their inverse document frequencies (IDF). IDFs can be interpreted as the word's *importance* within the training corpus. Specifically, higher weight is given to words that are more rare.

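The TF-IDF weighting mentioned in the hunk above is applied by TF-IDF Transformer, but the formula itself never appears in the diff. As a purely illustrative aside, here is a minimal sketch of one common TF-IDF formulation in plain PHP; the corpus size, document frequency, and term count are made-up numbers, and Rubix ML's TfIdfTransformer may use a slightly different variant.

```php
// Illustrative only -- not the Rubix ML implementation; all numbers are hypothetical.
$numDocuments = 25000;     // total documents in the training corpus
$documentFrequency = 500;  // documents that contain this word
$termFrequency = 3;        // times the word appears in the current document

// Rare words receive a larger inverse document frequency than common ones.
$idf = 1.0 + log($numDocuments / $documentFrequency);

// The TF-IDF weight replaces the raw count in the feature vector.
$tfidf = $termFrequency * $idf;

echo round($tfidf, 2) . PHP_EOL; // 14.74 with these made-up numbers
```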

@@ -84,7 +84,7 @@ use Rubix\ML\Persisters\Filesystem;
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, 5000, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 3, 10000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([
@@ -123,8 +123,18 @@ $scores = $estimator->scores();

 $losses = $estimator->steps();
 ```
+Next, we'll use an [Unlabeled](https://docs.rubixml.com/en/latest/datasets/unlabeled.html) dataset object to temporarily store and convert the scores and losses into CSV format so that we can import the data into our favorite plotting application such as [Plotly](https://plotly.com) or [Excel](https://www.microsoft.com/en-us/microsoft-365/excel). The global `array_transpose()` function takes a 2-dimensional array and changes the rows to columns and vice versa. It is necessary to call this function in order to get the samples into the correct *shape* for the dataset object.

-Here is an example of what the validation score and training loss look like when they are plotted. The validation score should be getting better with each epoch as the loss decreases. You can generate your own plots by importing the `progress.csv` file into your favorite plotting software such as [Plotly](https://plotly.com) or [Excel](https://www.microsoft.com/en-us/microsoft-365/excel).
+```php
+use Rubix\ML\Datasets\Unlabeled;
+use function Rubix\ML\array_transpose;
+
+$table = array_transpose([$scores, $losses]);
+
+Unlabeled::build($table)->toCSV()->write('progress.csv');
+```
+
+Here is an example of what the validation score and training loss look like when they are plotted. The validation score should be getting better with each epoch as the loss decreases. You can generate your own plots by importing the `progress.csv` file into your plotting application.

 ![F1 Score](https://raw.githubusercontent.com/RubixML/Sentiment/master/docs/images/validation-score.svg?sanitize=true)

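To make the `array_transpose()` step above more concrete, here is a minimal sketch with hypothetical per-epoch values; only the `Rubix\ML\array_transpose()` function comes from the library, the numbers are made up.

```php
use function Rubix\ML\array_transpose;

// Hypothetical metrics for three training epochs.
$scores = [0.75, 0.81, 0.86];
$losses = [0.62, 0.41, 0.33];

// Two rows of three values become three rows of two values -- one
// [score, loss] pair per epoch, the sample shape the dataset object expects.
$table = array_transpose([$scores, $losses]);

// $table is now [[0.75, 0.62], [0.81, 0.41], [0.86, 0.33]]
```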

@@ -196,13 +206,21 @@ $report = new AggregateReport([
 ]);
 ```

-To generate the report, pass in the predictions along with the labels from the testing set to the `generate()` method on the report.
+To generate the report, pass in the predictions along with the labels from the testing set to the `generate()` method on the report. The return value is a report object that can be echoed out to the console.

 ```php
 $results = $report->generate($predictions, $dataset->labels());
+
+echo $results;
+```
+
+We'll also save a copy of the report to a JSON file.
+
+```php
+$results->toJSON()->write('report.json');
 ```

-Now we can execute the validation script from the command line like we see below to compute the results.
+Now we can execute the validation script from the command line.
 ```sh
 $ php validate.php
 ```

composer.json (+1, -2)

@@ -22,8 +22,7 @@
     ],
     "require": {
         "php": ">=7.2",
-        "league/csv": "^9.5",
-        "rubix/ml": "^0.1.0-rc5"
+        "rubix/ml": "^0.1.0"
     },
     "suggest": {
         "ext-tensor": "For faster training and inference"

train.php (+5, -5)

@@ -19,7 +19,7 @@
 use Rubix\ML\NeuralNet\Optimizers\AdaMax;
 use Rubix\ML\Persisters\Filesystem;
 use Rubix\ML\Other\Loggers\Screen;
-use League\Csv\Writer;
+use Rubix\ML\Datasets\Unlabeled;

 use function Rubix\ML\array_transpose;

@@ -41,7 +41,7 @@
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, 5000, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 3, 10000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([
@@ -69,9 +69,9 @@
 $scores = $estimator->scores();
 $losses = $estimator->steps();

-$writer = Writer::createFromPath('progress.csv', 'w+');
-$writer->insertOne(['score', 'loss']);
-$writer->insertAll(array_transpose([$scores, $losses]));
+Unlabeled::build(array_transpose([$scores, $losses]))
+    ->toCSV(['scores', 'losses'])
+    ->write('progress.csv');

 echo 'Progress saved to progress.csv' . PHP_EOL;

validate.php (+3, -1)

@@ -37,6 +37,8 @@

 $results = $report->generate($predictions, $dataset->labels());

-file_put_contents('report.json', json_encode($results, JSON_PRETTY_PRINT));
+echo $results;
+
+$results->toJSON()->write('report.json');

 echo 'Report saved to report.json' . PHP_EOL;
