[WIP] Add datasets to config api V2 #623

leoromanovich · 2024-07-28T19:32:22Z

I've checked the contribution guide.

AlekseySh

I like this approach. Let's continue with this.

As the next step, I think you need to update the pipelines code to use the new split argument.

AlekseySh · 2024-08-04T11:11:56Z

oml/configs/datasets/image_labeled_dataset.yaml

@@ -0,0 +1,9 @@
+name: image_labeled_dataset
+args:
+  dataframe_name: df.csv


Let's use the name "df" here, so it's more consistent with the argument names, and replace it with the actual df read at runtime?

AlekseySh · 2024-08-04T11:20:44Z

oml/registry/datasets.py

+}
+
+
+def get_dataset_by_cfg(cfg: TCfg, split: Optional[str] = None) -> IBaseDataset:


I suggest reworking this function a bit:

    def get_dataset_by_cfg(cfg: TCfg, split: Optional[str] = None) -> IBaseDataset:
        if cfg['name'] in DATASETS_REGISTRY:
            df = pd.read_csv(Path(cfg["args"]["dataset_root"]) / cfg["args"]["df"], index_col=False)
            mapper = {l: i for i, l in enumerate(df.sort_values(by=[SPLIT_COLUMN])[LABELS_COLUMN].unique())}
            df[LABELS_COLUMN] = df[LABELS_COLUMN].map(mapper)
            if split is not None:
                df = df[df[SPLIT_COLUMN] == split].reset_index(drop=True)
            cfg["args"]["df"] = df
        else:
            if split is not None:
                raise ValueError("We only support <split> option for built-in datasets.")
        if split and "dataframe_name" in cfg["args"].keys():
            return dataset_class(**cfg["args"])

I also removed check_retrieval_dataframe_format because there was no such functionality before.
I also hope that with my implementation we don't need the filtering arguments.
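To make the dataframe steps of the suggestion above easier to try in isolation, here is a minimal sketch on a toy dataframe. The helper name `remap_and_filter`, the toy data, and the concrete column names `"label"` and `"split"` are assumptions made for illustration; the PR only refers to `LABELS_COLUMN` and `SPLIT_COLUMN`. It is not the registry function itself and does not construct a dataset.

```python
from typing import Optional

import pandas as pd

# Assumed values: the review comment only names these constants.
LABELS_COLUMN = "label"
SPLIT_COLUMN = "split"


def remap_and_filter(df: pd.DataFrame, split: Optional[str] = None) -> pd.DataFrame:
    """Hypothetical helper mirroring the dataframe steps of the suggestion."""
    df = df.copy()  # do not mutate the caller's dataframe
    # Build the label -> index mapping on the full dataframe, before filtering,
    # so that every split is encoded with the same mapping.
    ordered = df.sort_values(by=[SPLIT_COLUMN])[LABELS_COLUMN].unique()
    mapper = {label: i for i, label in enumerate(ordered)}
    df[LABELS_COLUMN] = df[LABELS_COLUMN].map(mapper)
    # Keep only the requested split, if any, and renumber the rows.
    if split is not None:
        df = df[df[SPLIT_COLUMN] == split].reset_index(drop=True)
    return df


toy = pd.DataFrame(
    {
        "path": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "label": [10, 10, 20, 30],
        "split": ["train", "train", "validation", "validation"],
    }
)
train = remap_and_filter(toy, split="train")
val = remap_and_filter(toy, split="validation")
```

Because the mapping is computed before the split filter, labels in `train` and `val` come from one shared mapping. The exact indices given to labels that share a split depend on the sort order within that split, so the sketch does not rely on them.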

WIP implementation of dataset registry

2f338ae

AlekseySh requested changes Aug 4, 2024

AlekseySh added the rework label Aug 4, 2024

AlekseySh assigned leoromanovich Aug 4, 2024
