Feature/weighted multiple datasets #3

gfickel · 2023-08-01T13:05:23Z

This feature combines multiple datasets, each with its own sampling weight. The weight is the softmax of the normalized dataset sizes, with a user-defined temperature.
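The exact formula is not shown in this thread, so the following is only a minimal sketch of one way such weights could be computed. It assumes that sizes are normalized by their sum and that the temperature multiplies the normalized sizes, so that a temperature of 0.0 gives every dataset the same weight. The helper name `dataset_sampling_weights` is hypothetical and is not taken from the PR.

```python
import math


def dataset_sampling_weights(sizes, temperature=0.0):
    """Softmax over normalized dataset sizes (hypothetical helper).

    Assumption: the temperature multiplies the normalized sizes, so
    temperature=0.0 yields equal weights for all datasets.
    """
    total = sum(sizes)
    logits = [temperature * (n / total) for n in sizes]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Three datasets of very different sizes (made-up numbers).
weights_flat = dataset_sampling_weights([1000, 100, 10], temperature=0.0)
weights_sharp = dataset_sampling_weights([1000, 100, 10], temperature=5.0)
```

Under this assumption, a larger temperature shifts sampling mass toward the larger datasets, while 0.0 treats all datasets equally regardless of size.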

I also changed WeightedRandomSampler to use replacement=True so that we can over-sample the smaller datasets; otherwise each image is sampled exactly once per epoch. A quick overview can be seen here: https://towardsdatascience.com/demystifying-pytorchs-weightedrandomsampler-by-example-a68aceccb452
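To illustrate why replacement matters, here is a small plain-Python stand-in for the sampler (it does not call PyTorch, and the toy data and the helper `draw_without_replacement` are made up for the example). Drawing as many items as the dataset holds without replacement always yields a permutation, so the weights cannot change how often each sample appears.

```python
import random
from collections import Counter

rng = random.Random(0)

# Toy index: 8 samples from a large dataset "a", 2 from a small dataset "b".
sources = ["a"] * 8 + ["b"] * 2
# Per-sample weights chosen so that both datasets carry the same total mass.
weights = [1 / 8] * 8 + [1 / 2] * 2


def draw_without_replacement(weights, k, rng):
    """Weighted draw of k distinct indices, renormalizing after each pick."""
    remaining = list(range(len(weights)))
    picked = []
    for _ in range(k):
        idx = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
        remaining.remove(idx)
        picked.append(idx)
    return picked


# One "epoch" of len(sources) draws without replacement: a permutation,
# so every sample appears exactly once regardless of its weight.
no_repl = draw_without_replacement(weights, len(sources), rng)
no_repl_counts = Counter(sources[i] for i in no_repl)

# Many draws with replacement: the small dataset is over-sampled
# until it makes up roughly half of the draws.
with_repl = rng.choices(range(len(sources)), weights=weights, k=10_000)
with_repl_counts = Counter(sources[i] for i in with_repl)
```

Without replacement the weights only affect the order of the samples; with replacement they affect the frequency, which is what over-sampling needs.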

gustavofuhr · 2023-08-21T13:15:37Z

miyagi_trainer/dataloaders.py

+def _get_pytorch_dataloders(
+        dataset, batch_size, num_workers, balanced_weights=False,
+        multiple_datasets_temperature=0.0):
+
    if balanced_weights:


I know I was the one who wrote this, but I forgot: if I set balanced_weights, the samples are balanced by class size, and that's it? What you implemented is a way to give more or less importance to different data sources (datasets), right?

Additionally, you cannot combine both, right?

Maybe we should change the balanced_weights parameter's name to be more descriptive.

gustavofuhr · 2023-08-21T13:17:55Z

miyagi_trainer/train.py

@@ -263,6 +264,9 @@ def train(args):
    parser.add_argument("--train_datasets", action='store', type=str, nargs="+", required=True)
    parser.add_argument("--val_datasets", action='store', type=str, nargs="+", required=True)
    parser.add_argument("--balanced_weights", action=argparse.BooleanOptionalAction)
+    parser.add_argument("--multiple_datasets_temperature", type=float, required=False,


I didn't get that: isn't the temperature supposed to be defined by one value per dataset? Is there any reference for this type of weighting?

It would be awesome if you could explain these options a bit in the README.md as well.

It was a hacky way to balance multiple datasets that was not very good, honestly. I think it is a nice feature to have, but I would reimplement it with a clearer interface and output. Probably something like passing a list of datasets, [ds1, ds2, ds3], and sampling weights for them, [0.2, 0.1, 0.7], which would draw 20% of the samples from ds1, 10% from ds2 and 70% from ds3. Sounds better, right?
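As a rough sketch of that proposed interface (nothing here exists in the repository; `per_sample_weights` and the dataset sizes are hypothetical), the dataset-level shares can be turned into per-sample weights by spreading each share evenly over the samples of its dataset:

```python
import random
from collections import Counter


def per_sample_weights(dataset_sizes, dataset_weights):
    """Spread each dataset's share evenly over its samples (hypothetical helper)."""
    if len(dataset_sizes) != len(dataset_weights):
        raise ValueError("need one weight per dataset")
    total = sum(dataset_weights)
    weights = []
    for size, share in zip(dataset_sizes, dataset_weights):
        weights.extend([share / total / size] * size)
    return weights


sizes = [500, 2000, 100]   # ds1, ds2, ds3 (made-up sizes)
shares = [0.2, 0.1, 0.7]   # sampling shares from the comment above
weights = per_sample_weights(sizes, shares)

# Label every concatenated sample with its source dataset,
# then draw with replacement and measure the observed shares.
labels = [i for i, size in enumerate(sizes) for _ in range(size)]
rng = random.Random(0)
draws = rng.choices(labels, weights=weights, k=20_000)
observed = {ds: n / len(draws) for ds, n in Counter(draws).items()}
```

A weight list built this way could then be handed to WeightedRandomSampler with replacement=True over the concatenated datasets, so that the observed shares approach the requested ones regardless of dataset size.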


gfickel · 2024-03-14T18:35:39Z

@gustavofuhr I believe the best thing here is to close this PR, and I'll open a new one just for the sampling fix. It is a one-liner. These options for weighting multiple datasets can be redone if needed.

gfickel added 5 commits July 24, 2023 16:14

ADD: weighted datasets on dataloaders

867897c

ADD: options on train.py

0300bf9

ADD: add scheduler option and cleaned some code

5e208f3

FIX: a future feature sliped on the previous commit

7723078

REF: small refactor

ff6a41f

gustavofuhr requested changes Aug 21, 2023


Merge pull request #4 from gustavofuhr/clean_code

96946b2

ADD: add scheduler option and cleaned some code
