Commit df2013b

Merge branch 'main' into eddie-changes
2 parents 14e6cbe + c5000d5 commit df2013b

7 files changed: +308 −40 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ import fogx as fox
 
 # 🦊 Dataset Creation
 # from distributed dataset storage
-dataset = fox.Dataset(load_from = ["/tmp/rtx", "s3://fox_stroage/"])
+dataset = fox.Dataset(load_from = ["/tmp/rtx", "s3://fox_storage/"])
 
 # 🦊 Data collection:
 # create a new trajectory

docs/Reference.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# API Reference

## Dataset
::: fog_x.dataset.Dataset

-------

## Episode
::: fog_x.episode.Episode

docs/index.md

Lines changed: 12 additions & 12 deletions
@@ -1,17 +1,17 @@
-# Welcome to MkDocs
+# 🦊 Fog-X Documentation
 
-For full documentation visit [mkdocs.org](https://www.mkdocs.org).
+**Fog-X is an efficient and scalable data collection and management framework for robotics learning.**
+Supports datasets from [Open-X-Embodiment](https://robotics-transformer-x.github.io/) and 🤗[HuggingFace](https://huggingface.co/).
+Fog-X considers both speed 🚀 and memory efficiency 📈 with active metadata and lazily-loaded trajectory data. It supports flexible and distributed dataset partitioning.
 
-## Commands
+## Installation
 
-* `mkdocs new [dir-name]` - Create a new project.
-* `mkdocs serve` - Start the live-reloading docs server.
-* `mkdocs build` - Build the documentation site.
-* `mkdocs -h` - Print help message and exit.
+```bash
+pip install fogx
+```
 
-## Project layout
+## Usage
 
-    mkdocs.yml    # The configuration file.
-    docs/
-        index.md  # The documentation homepage.
-        ...       # Other markdown pages, images and other files.
+See [Usage Guide](./usage.md) for an overview of how to use Fog-X.
+
+You can also view [working examples on GitHub](https://github.com/KeplerC/fog_x/tree/main/examples).

docs/usage.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Usage Guide

The code examples below assume the following import:

```py
import fog_x as fox
```

## Definitions

- **episode**: one robot trajectory or action, consisting of multiple steps of data.
- **step data**: data representing a snapshot of the robot action at a certain time.
- **metadata**: information that is consistent across a certain episode, e.g. the language instruction associated with the robot action, the name of the person collecting the data, or any other tags/labels. An example of attaching metadata is sketched below.
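
Metadata is attached when an episode is created. A minimal sketch, assuming a `dataset` object created as in the next section; the metadata keys here are illustrative:

```py
# attach episode-level metadata at creation time (the keys below are hypothetical examples)
episode = dataset.new_episode(
    metadata={"natural_language_instruction": "open door", "data_collector": "Alice"}
)
# ... add step data as shown below, then close the episode
episode.close()
```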

## The Fog-X Dataset

To start, create a `Dataset` object. Any data that is collected, loaded, or exported
will be saved to the provided path.
The path can also point at existing Fog-X data, so you can continue
right where you left off.

```py
dataset = fox.Dataset(name="my_fogx_dataset", path="/local/path/my_fogx_dataset")
```

## Collecting Robot Data

```py
# create a new trajectory
episode = dataset.new_episode()

# run robot and collect data
while robot_is_running:
    # at each step, add data to the episode
    episode.add(feature="arm_view", value="image1.jpg")

# Automatically time-aligns and saves the trajectory
episode.close()
```

## Exporting Data

By default, the exported data will be located under the `/export` directory within
the initialized `dataset.path`.
Currently, the supported data formats are `rtx`, `open-x`, and `rlds`.

```py
# Export and share the dataset as standard Open-X-Embodiment format
dataset.export(desired_episodes, format="rtx")
```
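
The export location can be overridden as well. A minimal sketch using the `export_path` argument documented on `Dataset.export`; the destination directory here is hypothetical:

```py
# write the exported files somewhere other than the default dataset.path/export
# ("/mnt/shared/fogx_export" is a hypothetical path)
dataset.export(desired_episodes, format="rtx", export_path="/mnt/shared/fogx_export")
```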

### PyTorch

```py
import torch

metadata = dataset.get_episode_info()
# select episodes via pandas boolean indexing ("feature1" / "value1" are placeholders)
metadata = metadata[metadata["feature1"] == "value1"]
pytorch_ds = dataset.pytorch_dataset_builder(metadata=metadata)

# get samples from the dataset
for data in torch.utils.data.DataLoader(
    pytorch_ds,
    batch_size=2,
    collate_fn=lambda x: x,
    sampler=torch.utils.data.RandomSampler(pytorch_ds),
):
    print(data)
```

### HuggingFace

WIP: Currently there is limited support for HuggingFace.

```py
huggingface_ds = dataset.get_as_huggingface_dataset()
```
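
Since `get_as_huggingface_dataset()` wraps `datasets.load_dataset("parquet", ...)`, the returned object is a regular 🤗 `DatasetDict`. A minimal sketch, assuming the default `train` split that 🤗 `datasets` produces when loading parquet files:

```py
# inspect the returned DatasetDict; parquet loading defaults to a single "train" split
print(huggingface_ds)
print(huggingface_ds["train"][0])  # first row as a dict of feature values
```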

## Loading Data from Existing Datasets

### RT-X / Tensorflow Datasets

Load any RT-X robotics data available at [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/).
You can also find a preview of all the RT-X datasets [here](https://dibyaghosh.com/rtx_viz/).

When loading episodes, you can optionally specify `additional_metadata` to be associated with them.
You can also load a specific portion of the train or test data with the `split` parameter. See the [Tensorflow Split API](https://www.tensorflow.org/datasets/splits) for specifics.

```py
# load all berkeley_autolab_ur5 data
dataset.load_rtx_episodes(name="berkeley_autolab_ur5")

# load 75% of the berkeley_autolab_ur5 train data, labeled with my_label = "train1"
dataset.load_rtx_episodes(
    name="berkeley_autolab_ur5",
    split="train[:75%]",
    additional_metadata={"my_label": "train1"}
)
```

## Data Management

### Episode Metadata

You can retrieve episode-level information (metadata) using `dataset.get_episode_info()`.
This returns a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html),
so you have access to pandas data management methods including `filter`, `map`, `aggregate`, `groupby`, etc.
After processing the metadata, you can use it to obtain your
desired episodes with `dataset.read_by(desired_metadata)`.

```py
# Retrieve episode-level data as a pandas DataFrame
episode_info = dataset.get_episode_info()

# Use pandas boolean indexing to select episodes
desired_episode_metadata = episode_info[episode_info["natural_language_instruction"] == "open door"]

# Obtain the actual episodes containing their step data
episodes = dataset.read_by(desired_episode_metadata)
```

### Step Data

Step data is stored as a [Polars LazyFrame](https://docs.pola.rs/py-polars/html/reference/lazyframe/index.html).
Lazy loading with Polars can yield speedups of roughly 10 to 100 times compared to pandas.

```py
# Retrieve the Fog-X step data as a Polars LazyFrame
step_data = dataset.get_step_data()

# select only the episode_id and natural_language_instruction
lazy_id_to_language = step_data.select("episode_id", "natural_language_instruction")

# the frame is lazily evaluated in memory when we call collect(); returns a Polars DataFrame
id_to_language = lazy_id_to_language.collect()

# drop rows with duplicate natural_language_instruction to see unique instructions
id_to_language.unique(subset=["natural_language_instruction"], maintain_order=True)
```

Polars also allows chaining methods:

```py
# Same as the example above, but chained
id_to_language = (
    dataset.get_step_data()
    .select("episode_id", "natural_language_instruction")
    .collect()
    .unique(subset=["natural_language_instruction"], maintain_order=True)
)
```
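
Step data can also be fetched for specific episodes by id. A minimal sketch, assuming ids taken from the `episode_id` column of `get_episode_info()` and the `get_step_data_by_episode_ids` method documented in `fog_x/dataset.py`:

```py
# fetch step data only for selected episodes; returns Polars LazyFrames by default
episode_info = dataset.get_episode_info()
selected_ids = list(episode_info["episode_id"])[:2]  # first two episodes, purely for illustration
lazy_frames = dataset.get_step_data_by_episode_ids(selected_ids)
step_tables = [lf.collect() for lf in lazy_frames]  # materialize as Polars DataFrames
```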

fog_x/dataset.py

Lines changed: 91 additions & 26 deletions
@@ -4,6 +4,8 @@
 from typing import Any, Dict, List, Optional, Tuple
 
 import numpy as np
+import polars
+import pandas
 
 from fog_x.database import (
     DatabaseConnector,
@@ -36,6 +38,22 @@ def __init__(
         step_data_connector: DatabaseConnector = None,
         storage: Optional[str] = None,
     ) -> None:
+        """
+
+        Args:
+            name (str): Name of this dataset. Used as the directory name when exporting.
+            path (str): Required. Local path of where this dataset should be stored.
+            features (optional Dict[str, FeatureType]): mapping from feature names to their FeatureType.
+            enable_feature_inference (bool): enable inferring additional FeatureTypes.
+
+        Example:
+        ```
+        >>> dataset = fog_x.Dataset('my_dataset', path='~/fog_x/my_dataset')
+        ```
+
+        TODO:
+        * is replace_existing actually used anywhere?
+        """
         self.name = name
         path = os.path.expanduser(path)
         self.path = path
@@ -55,23 +73,24 @@ def __init__(
             if not os.path.exists(f"{path}/{name}"):
                 os.makedirs(f"{path}/{name}")
             step_data_connector = LazyFrameConnector(f"{path}/{name}")
-        self.db_manager = DatabaseManager(
-            episode_info_connector, step_data_connector
-        )
+        self.db_manager = DatabaseManager(episode_info_connector, step_data_connector)
         self.db_manager.initialize_dataset(self.name, features)
 
         self.storage = storage
         self.obs_keys = []
         self.act_keys = []
         self.step_keys = []
 
-    def new_episode(
-        self, metadata: Optional[Dict[str, Any]] = None
-    ) -> Episode:
+    def new_episode(self, metadata: Optional[Dict[str, Any]] = None) -> Episode:
         """
         Create a new episode / trajectory.
-        TODO #1: support multiple processes writing to the same episode
-        TODO #2: close the previous episode if not closed
+
+        Returns:
+            Episode
+
+        TODO:
+        * support multiple processes writing to the same episode
+        * close the previous episode if not closed
         """
         return Episode(
             metadata=metadata,
@@ -113,6 +132,10 @@ def export(
     ) -> None:
         """
         Export the dataset.
+
+        Args:
+            export_path (optional str): location of exported data. Uses dataset.path/export by default.
+            format (str): Supported formats are `rtx`, `open-x`, and `rlds`.
         """
         if format == "rtx" or format == "open-x" or format == "rlds":
             self.export_rtx(export_path, max_episodes_per_file, version, obs_keys, act_keys, step_keys)
@@ -274,7 +297,18 @@ def load_rtx_episodes(
         additional_metadata: Optional[Dict[str, Any]] = None,
     ):
         """
-        Load the dataset.
+        Load robot data from Tensorflow Datasets.
+
+        Args:
+            name (str): Name of RT-X episodes, which can be found at [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog) under the Robotics category
+            split (optional str): the portion of data to load, see [Tensorflow Split API](https://www.tensorflow.org/datasets/splits)
+            additional_metadata (optional Dict[str, Any]): additional metadata to be associated with the loaded episodes
+
+        Example:
+        ```
+        >>> dataset.load_rtx_episodes(name="berkeley_autolab_ur5")
+        >>> dataset.load_rtx_episodes(name="berkeley_autolab_ur5", split="train[:10]", additional_metadata={"data_collector": "Alice", "custom_tag": "sample"})
+        ```
         """
 
         # this is only required if rtx format is used
@@ -334,26 +368,36 @@ def load_rtx_episodes(
                 fog_episode.add(
                     feature=str(k),
                     value=v.numpy(),
-                    feature_type=FeatureType(
-                        tf_feature_spec=data_type[k]
-                    ),
+                    feature_type=FeatureType(tf_feature_spec=data_type[k]),
                 )
                 self.step_keys.append(k)
             fog_episode.close()
 
-    def get_episode_info(self):
+    def get_episode_info(self) -> pandas.DataFrame:
         """
-        Return the metadata as pandas dataframe.
+        Returns:
+            metadata of all episodes as `pandas.DataFrame`
         """
         return self.db_manager.get_episode_info_table()
 
-    def get_step_data(self):
+    def get_step_data(self) -> polars.LazyFrame:
         """
-        Return the all step data as lazy dataframe.
+        Returns:
+            step data of all episodes
         """
         return self.db_manager.get_step_table_all()
 
-    def get_step_data_by_episode_ids(self, episode_ids: List[int], as_lazy_frame = True):
+    def get_step_data_by_episode_ids(
+        self, episode_ids: List[int], as_lazy_frame=True
+    ) -> List[polars.LazyFrame] | List[polars.DataFrame]:
+        """
+        Args:
+            episode_ids (List[int]): list of episode ids
+            as_lazy_frame (bool): whether to return polars.LazyFrame or polars.DataFrame
+
+        Returns:
+            step data of each episode
+        """
         episodes = []
         for episode_id in episode_ids:
             if episode_id == None:
@@ -363,8 +407,17 @@ def get_step_data_by_episode_ids(self, episode_ids: List[int], as_lazy_frame = True):
             else:
                 episodes.append(self.db_manager.get_step_table(episode_id).collect())
         return episodes
-
-    def read_by(self, episode_info: Any = None):
+
+    def read_by(self, episode_info: Any = None) -> List[polars.LazyFrame]:
+        """
+        To be used with `Dataset.get_episode_info`.
+
+        Args:
+            episode_info (pandas.DataFrame): episode metadata information to determine which episodes to read
+
+        Returns:
+            episodes filtered by `episode_info`
+        """
         episode_ids = list(episode_info["episode_id"])
         logger.info(f"Reading episodes as order: {episode_ids}")
         episodes = []
@@ -384,6 +437,11 @@ def get_episodes_from_metadata(self, metadata: Any = None):
         return episodes
 
     def pytorch_dataset_builder(self, metadata=None, **kwargs):
+        """
+        Used for loading current dataset as a PyTorch dataset.
+        To be used with `torch.utils.data.DataLoader`.
+        """
+
         import torch
         from torch.utils.data import Dataset
         episodes = self.get_episodes_from_metadata(metadata)
@@ -394,17 +452,24 @@ def pytorch_dataset_builder(self, metadata=None, **kwargs):
         return pytorch_dataset
 
     def get_as_huggingface_dataset(self):
+        """
+        Load current dataset as a HuggingFace dataset.
+
+        TODO:
+        * currently the support for hugging face dataset is limited.
+          it only shows its capability of easily returning a hf dataset
+        * add features from the episode metadata
+        * allow selecting episodes based on queries.
+          doing so requires creating a new copy of the dataset on disk
+        """
         import datasets
 
-        # TODO: currently the support for huggingg face dataset is limited
-        # it only shows its capability of easily returning a hf dataset
-        # TODO #1: add features from the episode metadata
-        # TODO #2: allow selecting episodes based on queries
-        # doing so requires creating a new copy of the dataset on disk
         dataset_path = self.path + "/" + self.name
-        parquet_files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path)]
+        parquet_files = [
+            os.path.join(dataset_path, f) for f in os.listdir(dataset_path)
+        ]
 
-        hf_dataset = datasets.load_dataset('parquet', data_files=parquet_files)
+        hf_dataset = datasets.load_dataset("parquet", data_files=parquet_files)
         return hf_dataset
 
 class PyTorchDataset(Dataset):
