Improve clarity of logfile contents #216

Closed

erikr opened this issue Apr 17, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

erikr commented Apr 17, 2020

What

  1. Display training, validation, and test set size at the end of the log file for train mode (and potentially other modes).

  2. Clearly report how many epochs actually completed before early stopping (due to patience).

Why
It is helpful to know the number of tensors used for training, validation, and test, as well as the label count within each set.

Label counts make sense for categorical outputs; it is less clear how best to handle this for regression models.

It is also important to know when early stopping occurred.

Currently this information is not consolidated in one place in the log file; it is also spread across workers.

How
Aggregate over workers.

Acceptance Criteria
After running recipes in train mode, the number of tensors used for the training, validation, and test sets, the label counts in each set, and the number of epochs actually run before early stopping are summarized at the end of the log file.
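
For illustration, a minimal sketch of how such a summary could be emitted; the function, its arguments, and the log format are hypothetical, not the repo's actual logging code:

```python
# Hypothetical end-of-log summary; names and format are illustrative only.
import logging
from collections import Counter

def log_split_summary(split_labels, epochs_run, patience):
    """Log tensor counts and label counts per split, plus early-stopping info.

    split_labels maps a split name ('train', 'validation', 'test') to the
    list of labels actually used in that split.
    """
    for split, labels in split_labels.items():
        counts = dict(Counter(labels))
        logging.info(f"{split}: {len(labels)} tensors, label counts: {counts}")
    logging.info(f"Epochs completed before early stopping: {epochs_run} (patience={patience})")
```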

@erikr erikr added the enhancement New feature or request label Apr 17, 2020
@lucidtronix
Collaborator

For regression models I think emitting max, min, mean, and std would be helpful.
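
A minimal sketch of such a summary, assuming the regression labels are available as a NumPy array (the helper name is hypothetical):

```python
# Hypothetical helper for summarizing a continuous label; not actual repo code.
import numpy as np

def regression_label_summary(values: np.ndarray) -> str:
    return (f"min={values.min():.3f}, max={values.max():.3f}, "
            f"mean={values.mean():.3f}, std={values.std():.3f}")
```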

@erikr
Author

erikr commented May 25, 2020

Label distribution is now logged for the final test set (both truth and predicted).

However, it would be helpful to also log label distribution for train, val, and test splits.

This info is collected in the aggregate stats but is spread throughout the log file. It would be good to have it all in one place at the end.

@erikr erikr changed the title Log training, validation, and test set size Summarize training, validation, and test set size at end of logfile May 25, 2020
@erikr erikr changed the title Summarize training, validation, and test set size at end of logfile Improve clarity of logfile contents Jun 5, 2020
@StevenSong
Collaborator

blocked by #323

@StevenSong
Collaborator

StevenSong commented Jun 29, 2020

> the number of tensors used for training, validation, and test sets

This is somewhat difficult: as I understand it, the number of tensors in each epoch can vary (purposefully, when using balance_csv), so we'd need some way of tracking all the unique tensors used across all epochs.

@erikr
Author

erikr commented Jun 29, 2020

The goal is to understand precisely how much data was given to the model (in train, validation, and test splits, per epoch).

Let me take a step back and ensure I understand some of the plumbing.

The number of tensors per "keras epoch" (== faux epoch) is set by ${split}_steps x batch_size where ${split} can be train, valid, or test.

I find the following information helpful (figures taken from our STS-ECG work):

|            | steps | batch_size | num_samples_in_faux_epoch | num_samples_in_csv | coverage_of_true_epoch |
|------------|-------|------------|---------------------------|--------------------|------------------------|
| train      | 400   | 64         | 25600                     | 11318              | 2.26                   |
| validation | 60    | 64         | 3840                      | 3233               | 1.19                   |
| test       | 32    | 64         | 2048                      | 1619               | 1.26                   |

How is this useful? If coverage_of_true_epoch < 1.0, not all the data in a given split is covered in one faux epoch, and the user should increase steps and/or batch_size.

If coverage_of_true_epoch >> 1.0, perhaps we should reduce steps and/or batch_size, so one faux epoch approximates one true epoch.
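
The derived columns in the table reduce to the following arithmetic (values shown are the train row; the variable names mirror the column headers):

```python
# Reproduce the table's derived columns; inputs come from the run's arguments
# and the split CSV.
steps = 400                 # --train_steps
batch_size = 64             # batch size
num_samples_in_csv = 11318  # rows in --train_csv

num_samples_in_faux_epoch = steps * batch_size                           # 25600
coverage_of_true_epoch = num_samples_in_faux_epoch / num_samples_in_csv  # ~2.26

if coverage_of_true_epoch < 1.0:
    print("One faux epoch does not cover the split; increase steps and/or batch_size.")
elif coverage_of_true_epoch > 1.0:
    print("One faux epoch oversamples the split; consider reducing steps and/or batch_size.")
```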

Note that these counts (e.g. number of samples) are not the number of samples the model actually sees. Rather, they are the number of HD5 files the model attempts to get data from via TMaps.

In fact, even for our STS-ECG work where we specify every MRN in each split, we cannot easily tell from the log file if arrays were successfully obtained from 100% of those HD5 files and passed to the model for training, validation, and testing.

If a user does not set --train_csv, --valid_csv, or --test_csv, there is no num_samples_in_csv.

Instead, the number of samples in each split, per epoch, is determined by --train_steps, --validation_steps, or --test_steps (respectively) multiplied by batch_size.

Each split is randomly sampled (although not yet stratified by label, which would be better; see #313) from training, validation, and test sets that are determined by the following (a sketch of this split logic follows the list):

  1. --train_ratio, --valid_ratio, and --test_ratio
  2. The number of HD5 files at --tensors.
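
A minimal sketch of that ratio-based split over the HD5 files at --tensors (the function and the ratio values are hypothetical; this is not the repo's actual sampling code):

```python
# Hypothetical ratio-based split of HD5 paths; illustrative only.
import os
import random

def split_paths(tensors_dir, train_ratio=0.7, valid_ratio=0.2):
    paths = [os.path.join(tensors_dir, f)
             for f in os.listdir(tensors_dir) if f.endswith(".hd5")]
    random.shuffle(paths)  # random sampling, not stratified by label (see #313)
    n_train = int(len(paths) * train_ratio)
    n_valid = int(len(paths) * valid_ratio)
    return (paths[:n_train],
            paths[n_train:n_train + n_valid],
            paths[n_train + n_valid:])  # remainder is the test set
```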

This is further complicated by the fact that a TMap may not return a valid array.

Let's say we have 1000 HD5 files at --tensors, and we train a model. Imagine we set validation_steps and batch_size so their product is exactly 700, and we use default ratios such that the validation set comprises 700 HD5s. If we are at 699 samples in the first epoch, and the 700th fails to return a valid array (per the TMap), does the gradient update proceed?

I posit it is useful to know how many tensors were given to the model (i.e., had a valid array returned by a TMap), rather than merely how many HD5 files we tried to get tensors from via TMaps.
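
As a sketch of the plumbing this would require, a generator wrapper could tally TMap successes and failures while filling batches; everything here (names, the skip-on-failure behavior) is an assumption, not a description of the current generators:

```python
# Hypothetical batch generator that skips failed TMap extractions and keeps
# counts, so the log can report tensors actually given to the model.
def counting_generator(paths, get_tensor, batch_size):
    successes, failures = 0, 0
    batch = []
    for path in paths:
        try:
            batch.append(get_tensor(path))  # may raise on an invalid array
            successes += 1
        except Exception:
            failures += 1  # skip the path; the batch is topped up by later paths
            continue
        if len(batch) == batch_size:
            yield batch
            batch = []
    # one pass over all paths == one true epoch; log the tally here
    print(f"{successes} tensors given to the model, {failures} failed extraction")
```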

@StevenSong
Collaborator

> I posit it is useful to know how many tensors were given to the model (i.e., had a valid array returned by a TMap), rather than merely how many HD5 files we tried to get tensors from via TMaps.

Agreed; I just meant that it will take more work/plumbing to achieve this output.

@erikr
Author

erikr commented Jun 30, 2020

Spoke w/ @paolodi:

  1. Agrees it would be good to know how many tensors are presented to the model for train / val / test at each epoch, perhaps in the log file or a separate .txt file.
  2. Tensors that do not fail extraction via TMap are cached in memory for efficient recall.
  3. Paths to tensors that fail are stored so they are not re-evaluated next time.
  4. The next true epoch becomes faster because of this caching (see the sketch after this list).
  5. We should explore TensorBoard, which has relevant functionality.
  6. Find a time to discuss w/ @ndiamant and him (next week, as this week is busy).
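
A sketch of the caching described in items 2-4 (a hypothetical structure, not the repo's actual implementation):

```python
# Hypothetical cache: successful arrays are kept in memory, failing paths are
# remembered so they are never retried in later epochs.
class TensorCache:
    def __init__(self, get_tensor):
        self.get_tensor = get_tensor
        self.arrays = {}     # path -> cached array (item 2)
        self.failed = set()  # paths that raised during extraction (item 3)

    def get(self, path):
        if path in self.failed:
            return None      # known-bad path is never re-evaluated (item 3)
        if path not in self.arrays:
            try:
                self.arrays[path] = self.get_tensor(path)
            except Exception:
                self.failed.add(path)
                return None
        return self.arrays[path]  # later epochs hit memory (item 4)
```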

@erikr erikr closed this as completed Jul 23, 2020