Improve clarity of logfile contents #216

Closed

erikr opened this issue Apr 17, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

erikr commented Apr 17, 2020

What

  1. Display training, validation, and test set size at the end of the log file for train mode (and potentially other modes).

  2. Clearly report how many epochs actually completed before early stopping (due to patience).

Why
It is helpful to know the number of tensors used for training, validation, and test, as well as the label count within each set.

Label counts make sense for categorical outputs; it is less clear how best to handle this for regression models.

It is also important to know when early stopping occurred.

Currently this information is not consolidated in one place in the log file; it is also spread across workers.

How
Aggregate over workers.

Acceptance Criteria
After running recipes in train mode, the number of tensors used for the training, validation, and test sets, the label counts in each set, and the number of epochs actually run before early stopping are summarized at the end of the log file.
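
For illustration, a minimal sketch of how such a summary could be emitted; the function, its arguments, and the log format are hypothetical, not the repo's actual logging code:

```python
# Hypothetical end-of-log summary; names and format are illustrative only.
import logging
from collections import Counter

def log_split_summary(split_labels, epochs_run, patience):
    """Log tensor counts and label counts per split, plus early-stopping info.

    split_labels maps a split name ('train', 'validation', 'test') to the
    list of labels actually used in that split.
    """
    for split, labels in split_labels.items():
        counts = dict(Counter(labels))
        logging.info(f"{split}: {len(labels)} tensors, label counts: {counts}")
    logging.info(f"Epochs completed before early stopping: {epochs_run} (patience={patience})")
```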

@erikr erikr added the enhancement New feature or request label Apr 17, 2020
@lucidtronix
Collaborator

For regression models I think emitting max, min, mean, and std would be helpful.
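
A minimal sketch of such a summary, assuming the regression labels are available as a NumPy array (the helper name is hypothetical):

```python
# Hypothetical helper for summarizing a continuous label; not actual repo code.
import numpy as np

def regression_label_summary(values: np.ndarray) -> str:
    return (f"min={values.min():.3f}, max={values.max():.3f}, "
            f"mean={values.mean():.3f}, std={values.std():.3f}")
```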

@erikr
Author

erikr commented May 25, 2020

Label distribution is now logged for the final test set (both truth and predicted).

However, it would be helpful to also log label distribution for train, val, and test splits.

This info is collected in the aggregate stats but is spread throughout the log file. It would be good to have it all in one place at the end.

@erikr erikr changed the title Log training, validation, and test set size Summarize training, validation, and test set size at end of logfile May 25, 2020
@erikr erikr changed the title Summarize training, validation, and test set size at end of logfile Improve clarity of logfile contents Jun 5, 2020
@StevenSong
Collaborator

blocked by #323

@StevenSong
Collaborator

StevenSong commented Jun 29, 2020

> the number of tensors used for training, validation, and test sets

This is somewhat difficult: as I understand it, the number of tensors in each epoch can vary (purposefully, when using balance_csv), so we'd need some way of tracking all the unique tensors used across all epochs.

@erikr
Author

erikr commented Jun 29, 2020

The goal is to understand precisely how much data was given to the model (in train, validation, and test splits, per epoch).

Let me take a step back and ensure I understand some of the plumbing.

The number of tensors per "keras epoch" (== faux epoch) is set by ${split}_steps x batch_size where ${split} can be train, valid, or test.

I find the following information helpful (figures taken from our STS-ECG work):

|            | steps | batch_size | num_samples_in_faux_epoch | num_samples_in_csv | coverage_of_true_epoch |
|------------|-------|------------|---------------------------|--------------------|------------------------|
| train      | 400   | 64         | 25600                     | 11318              | 2.26                   |
| validation | 60    | 64         | 3840                      | 3233               | 1.19                   |
| test       | 32    | 64         | 2048                      | 1619               | 1.26                   |

How is this useful? If coverage_of_true_epoch < 1.0, not all the data in a given split is covered in one faux epoch, and the user should increase steps and/or batch_size.

If coverage_of_true_epoch >> 1.0, perhaps we should reduce steps and/or batch_size, so one faux epoch approximates one true epoch.
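
The derived columns in the table reduce to the following arithmetic (values shown are the train row; the variable names mirror the column headers):

```python
# Reproduce the table's derived columns; inputs come from the run's arguments
# and the split CSV.
steps = 400                 # --train_steps
batch_size = 64             # batch size
num_samples_in_csv = 11318  # rows in --train_csv

num_samples_in_faux_epoch = steps * batch_size                           # 25600
coverage_of_true_epoch = num_samples_in_faux_epoch / num_samples_in_csv  # ~2.26

if coverage_of_true_epoch < 1.0:
    print("One faux epoch does not cover the split; increase steps and/or batch_size.")
elif coverage_of_true_epoch > 1.0:
    print("One faux epoch oversamples the split; consider reducing steps and/or batch_size.")
```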

Note that these counts (e.g. number of samples) are not the number of samples the model actually sees. Rather, they are the number of HD5 files the model attempts to get data from via TMaps.

In fact, even for our STS-ECG work where we specify every MRN in each split, we cannot easily tell from the log file if arrays were successfully obtained from 100% of those HD5 files and passed to the model for training, validation, and testing.

If a user does not set --train_csv, --valid_csv, or --test_csv, there is no num_samples_in_csv.

Instead, the number of samples in each split, per epoch, is determined by --train_steps, --validation_steps, or --test_steps (respectively) multiplied by batch_size.

Each split is randomly sampled (although not yet stratified by label, which would be better; see #313) from training, validation, and test sets that are determined by the following (a sketch of this split logic follows the list):

  1. --train_ratio, --valid_ratio, and --test_ratio
  2. The number of HD5 files at --tensors.
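
A minimal sketch of that ratio-based split over the HD5 files at --tensors (the function and the ratio values are hypothetical; this is not the repo's actual sampling code):

```python
# Hypothetical ratio-based split of HD5 paths; illustrative only.
import os
import random

def split_paths(tensors_dir, train_ratio=0.7, valid_ratio=0.2):
    paths = [os.path.join(tensors_dir, f)
             for f in os.listdir(tensors_dir) if f.endswith(".hd5")]
    random.shuffle(paths)  # random sampling, not stratified by label (see #313)
    n_train = int(len(paths) * train_ratio)
    n_valid = int(len(paths) * valid_ratio)
    return (paths[:n_train],
            paths[n_train:n_train + n_valid],
            paths[n_train + n_valid:])  # remainder is the test set
```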

This is further complicated by the fact that a TMap may not return a valid array.

Let's say we have 1000 HD5 files at --tensors, and we train a model. Imagine we set validation_steps and batch_size so their product is exactly 700, and we use default ratios such that the validation set comprises 700 HD5s. If we are at 699 samples in the first epoch, and the 700th fails to return a valid array (per the TMap), does the gradient update proceed?

I posit it is useful to know how many tensors were given to the model (i.e., had a valid array returned by a TMap), rather than merely how many HD5 files we tried to get tensors from via TMaps.
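
As a sketch of the plumbing this would require, a generator wrapper could tally TMap successes and failures while filling batches; everything here (names, the skip-on-failure behavior) is an assumption, not a description of the current generators:

```python
# Hypothetical batch generator that skips failed TMap extractions and keeps
# counts, so the log can report tensors actually given to the model.
def counting_generator(paths, get_tensor, batch_size):
    successes, failures = 0, 0
    batch = []
    for path in paths:
        try:
            batch.append(get_tensor(path))  # may raise on an invalid array
            successes += 1
        except Exception:
            failures += 1  # skip the path; the batch is topped up by later paths
            continue
        if len(batch) == batch_size:
            yield batch
            batch = []
    # one pass over all paths == one true epoch; log the tally here
    print(f"{successes} tensors given to the model, {failures} failed extraction")
```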

@StevenSong
Collaborator

> I posit it is useful to know how many tensors were given to the model (i.e., had a valid array returned by a TMap), rather than merely how many HD5 files we tried to get tensors from via TMaps.

Agreed; I just meant that it will take more work/plumbing to achieve this output.

@erikr
Author

erikr commented Jun 30, 2020

Spoke w/ @paolodi:

  1. Agrees it would be good to know how many tensors are presented to the model for train / val / test at each epoch, perhaps in the log file or a separate .txt file.
  2. Tensors that do not fail extraction via TMap are cached in memory for efficient recall.
  3. Paths to tensors that fail are stored so they are not re-evaluated next time.
  4. The next true epoch becomes faster because of this caching (see the sketch after this list).
  5. We should explore TensorBoard, which has relevant functionality.
  6. Find a time to discuss w/ @ndiamant and him (next week, as this week is busy).
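
A sketch of the caching described in items 2-4 (a hypothetical structure, not the repo's actual implementation):

```python
# Hypothetical cache: successful arrays are kept in memory, failing paths are
# remembered so they are never retried in later epochs.
class TensorCache:
    def __init__(self, get_tensor):
        self.get_tensor = get_tensor
        self.arrays = {}     # path -> cached array (item 2)
        self.failed = set()  # paths that raised during extraction (item 3)

    def get(self, path):
        if path in self.failed:
            return None      # known-bad path is never re-evaluated (item 3)
        if path not in self.arrays:
            try:
                self.arrays[path] = self.get_tensor(path)
            except Exception:
                self.failed.add(path)
                return None
        return self.arrays[path]  # later epochs hit memory (item 4)
```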

@erikr erikr closed this as completed Jul 23, 2020