Commit 9f55691: Update to Rubix ML 0.3.0

Parent: 7ca1c40

8 files changed: +10 -246 lines

Diff for: LICENSE.md renamed to LICENSE

+1 -1

@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2020 The Rubix ML Community
+Copyright (c) 2020 Rubix ML
 Copyright (c) 2020 Andrew DalPino
 
 Permission is hereby granted, free of charge, to any person obtaining a copy

Diff for: README.md

+5 -5

@@ -57,7 +57,7 @@ Neural networks compute a non-linear continuous function and therefore require c
 
 First, we'll convert all characters to lowercase using [Text Normalizer](https://docs.rubixml.com/en/latest/transformers/text-normalizer.html) so that every word is represented by only a single token. Then, [Word Count Vectorizer](https://docs.rubixml.com/en/latest/transformers/word-count-vectorizer.html) creates a fixed-length continuous feature vector of word counts from the raw text and [TF-IDF Transformer](https://docs.rubixml.com/en/latest/transformers/tf-idf-transformer.html) applies a weighting scheme to those counts. Finally, [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) takes the TF-IDF weighted counts and centers and scales the sample matrix to have 0 mean and unit variance. This last step will help the neural network converge quicker.
 
-The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 3 different documents but no more than 10,000 documents. In this way, we limit the amount of *noise* words that enter the training set.
+The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 2 different documents but no more than 10,000 documents. In this way, we limit the amount of *noise* words that enter the training set.
 
 Another common text feature representation are [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) values which take the term frequencies (TF) from Word Count Vectorizer and weigh them by their inverse document frequencies (IDF). IDFs can be interpreted as the word's *importance* within the training corpus. Specifically, higher weight is given to words that are more rare.
 
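Aside for readers following the tutorial: the hunk above describes IDF weighting in prose. Below is a minimal, hypothetical sketch of that arithmetic using one common smoothed formulation; the numbers are made up, and Rubix ML's TF-IDF Transformer may use a different variant.

<?php

// Illustrative TF-IDF arithmetic with made-up numbers.
$numDocuments = 25000;     // documents in the training corpus
$documentFrequency = 250;  // documents that contain the term

$tf = 4;  // raw count of the term in the current document

// Rarer terms get a larger inverse document frequency; log() is the natural log.
$idf = 1.0 + log($numDocuments / $documentFrequency);

$tfidf = $tf * $idf;  // ~22.42 for these numbers

echo $tfidf, PHP_EOL;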

@@ -84,7 +84,7 @@ use Rubix\ML\Persisters\Filesystem;
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, 10000, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 2, 10000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([
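For context on the one-line change above: the second constructor argument is the minimum document frequency, lowered here from 3 to 2 to match the updated README prose. A hypothetical annotated form of the call follows; the comments are descriptive labels, not the library's parameter names.

<?php

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Tokenizers\NGram; // note: older releases house NGram under Rubix\ML\Other\Tokenizers

$vectorizer = new WordCountVectorizer(
    10000,           // max vocabulary size: keep at most the 10,000 most frequent terms
    2,               // min document frequency: a term must appear in at least 2 documents
    10000,           // max document frequency: drop terms appearing in more than 10,000 documents
    new NGram(1, 2)  // tokenizer: emit unigrams and bigrams
);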
@@ -136,9 +136,9 @@ Unlabeled::build($table)->toCSV()->write('progress.csv');
 
 Here is an example of what the validation score and training loss looks like when they are plotted. The validation score should be getting better with each epoch as the loss decreases. You can generate your own plots by importing the `progress.csv` file into your plotting application.
 
-![F1 Score](https://raw.githubusercontent.com/RubixML/Sentiment/master/docs/images/validation-score.svg?sanitize=true)
+![F1 Score](https://raw.githubusercontent.com/RubixML/Sentiment/master/docs/images/validation-scores.png)
 
-![Cross Entropy Loss](https://raw.githubusercontent.com/RubixML/Sentiment/master/docs/images/training-loss.svg?sanitize=true)
+![Cross Entropy Loss](https://raw.githubusercontent.com/RubixML/Sentiment/master/docs/images/training-losses.png)
 
 ### Saving
 Finally, we save the model so we can load it later in our validation and prediction scripts.
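A minimal sketch of that save-and-load round trip, assuming $estimator is the PersistentModel constructed earlier with a Filesystem persister; the 'sentiment.model' path is illustrative, not fixed by this commit.

<?php

use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;

// Persist the trained wrapper to the path given to the Filesystem persister
// when the PersistentModel was constructed.
$estimator->save();

// Later, in a validation or prediction script, restore the trained model.
$estimator = PersistentModel::load(new Filesystem('sentiment.model'));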
@@ -366,4 +366,4 @@ See DATASET_README. For comments or questions regarding the dataset please conta
 >- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
 
 ## License
-The code is licensed [MIT](LICENSE.md) and the tutorial is licensed [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
+The code is licensed [MIT](LICENSE) and the tutorial is licensed [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

Diff for: composer.json

+3 -7

@@ -3,7 +3,7 @@
     "type": "project",
     "description": "An example project using a multi layer feed forward neural network for text sentiment classification trained with 25,000 movie reviews from IMDB.",
     "homepage": "https://github.com/RubixML/Sentiment",
-    "license": "Apache-2.0",
+    "license": "MIT",
     "readme": "README.md",
     "keywords": [
         "bag of words", "batch norm", "classification", "dataset", "data science", "example project",
@@ -15,17 +15,13 @@
     "authors": [
         {
             "name": "Andrew DalPino",
-            "email": "[email protected]",
-            "homepage": "https://andrewdalpino.com",
+            "homepage": "https://github.com/andrewdalpino",
             "role": "Lead Engineer"
         }
     ],
     "require": {
         "php": ">=7.2",
-        "rubix/ml": "^0.1.0"
-    },
-    "suggest": {
-        "ext-tensor": "For faster training and inference"
+        "rubix/ml": "^0.3.0"
     },
     "scripts": {
         "predict": "@php predict.php",
