Merge pull request #12 from fraunhoferportugal/dev
Minor Patch 0.1.3
matiaspedro authored Nov 6, 2024
2 parents ebd2e4c + 7406ee2 commit 49d8a17
Showing 43 changed files with 1,859 additions and 281 deletions.
15 changes: 15 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,15 @@
# Read the Docs configuration file for Sphinx projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.10"
commands:
- pip install poetry==1.8.3
- poetry install --only docs
- poetry run mkdocs build --site-dir $READTHEDOCS_OUTPUT/html
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,26 @@
All notable changes to this project will be documented in this file.
This format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.3] - 2024-11-05
Documentation and API updates.

### Added
- Objective field in the class docstrings for specifying the objective of the metric
- Added .readthedocs.yml file for readthedocs configuration
- General description of the package in the Getting Started section of the documentation

### Fixed
- Updated unit tests for the API to reflect the new changes in the metric categorization
- Updated developer guidelines for pre-commit hooks
- Updated Makefile hook for documentation generation with output directory
- Reorganized the documentation structure to the new `metric_group` categorization
- Removed any reference to the previous categorizations from the API models

### Changed
- mkdocs heading from 3 to 4 for better organization
- `FrechetDistance` metric now uses `InceptionFID` from piq as a default feature extractor



## [0.1.2] - 2024-10-28
### Added
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -35,7 +35,7 @@ To send a pull request, please:
## Tips for Modifying the Source Code

- We recommend developing on Linux as this is the only OS where all features are currently 100% functional.
- Use **Python >= 3.8.16** for development.
- Use **Python >= 3.9.0** for development.
- Please try to avoid introducing additional dependencies on 3rd party packages.
- We encourage you to add your own unit tests, but please ensure they run quickly (unit tests should train models on
small data-subsample with the lowest values of training iterations and time-limits that suffice to evaluate the intended
34 changes: 6 additions & 28 deletions DEVELOPER.md
@@ -2,15 +2,13 @@

This project uses [Conda](https://anaconda.org/anaconda/python) to manage Python virtual environments and [Poetry](https://python-poetry.org/) as the main dependency manager. The project is structured as a Python src package, with the main package located in the `pymdma` folder.

There are three main modalities: `image`, `time_series`, and `tabular`. Each modality has its own folder/submodule in the `pymdma` package. The `general` and `common` modules contain the main classes definitions used in the API and on the package version of the project.
There are three main modalities: `image`, `time_series`, and `tabular`. Each modality has its own folder/submodule in the `pymdma` package. The `general` and `common` modules contain the main classes definitions used in the API and on the package version of the project.

Each modality dependency is defined as an extra in the [pyproject](pyproject.toml) configuration file. Development dependencies are defined as poetry groups in the same file. More information about packaging and dependencies can be found below.
Each modality dependency is defined as an extra in the [pyproject](pyproject.toml) configuration file. Development dependencies are defined as poetry groups in the same file. More information about packaging and dependencies can be found below.

> **IMPORTANT:** We are using setuptools as the build system, due to limitations in the current version of Poetry. Because of this, you should use poetry only to manage dependencies during the development stage. Requirements for each modality are defined in the requirements folder with the appropriate version constraints to ensure cross-compatibility between dependencies.
The `scripts` folder contains shell scripts that can be used to automate common tasks. You can find some examples of execution in this folder. Additionally, the `notebooks` folder contains Jupyter notebooks with examples of how to import and use the package.

The `scripts` folder contains shell scripts that can be used to automate common tasks. You can find some examples of execution in this folder. Additionally, the `notebooks` folder contains Jupyter notebooks with examples of how to import and use the pa

We also provide a docker image to run a REST API server version of the repository. The docker image is built using the [Dockerfile](Dockerfile) in the root of the repository.
We also provide a docker image to run a REST API version of the repository. The docker image is built using the [Dockerfile](Dockerfile) in the root of the repository.

A coding standard is enforced using [Black](https://github.com/psf/black), [isort](https://pypi.org/project/isort/) and
[Flake8](https://flake8.pycqa.org/en/latest/). Python 3 type hinting is validated using
@@ -60,7 +58,7 @@ To start developing, you should install the project dependencies and the pre-com
```shell
make setup-all # install dependencies
source .venv-dev/bin/activate # activate the virtual environment
make install-dev-tools # install development tools
make install-dev-all # install development tools
```

Alternatively, you can install the dependencies manually by running the following commands:
@@ -117,13 +115,11 @@ poetry install --only <group>,<group>,...

### Extra Dependencies
To add an extra dependency, use:

```
poetry add <package> --extras <extra>
```

To install the extra dependencies, use:

```
poetry install --extras <extra>
```
@@ -132,24 +128,6 @@ Note that `<extra>` is the name of the extra dependencies group or a space separ

A list of all dependencies can be found in the [pyproject.toml](pyproject.toml) configuration file.
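For reference, Poetry extras of this kind are declared in `pyproject.toml` by marking a dependency as optional and listing it under an extras group. A minimal sketch — the package name and version pin here are illustrative, not the project's actual dependencies:

```toml
[tool.poetry.dependencies]
python = ">=3.9"
# Optional dependency: only installed when its extra is requested.
torch = { version = "^2.0", optional = true }

[tool.poetry.extras]
# `poetry install --extras image` (or `pip install pymdma[image]`)
# pulls in the optional packages listed for the extra.
image = ["torch"]
```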

A coding standard is enforced using [Black](https://github.com/psf/black), [isort](https://pypi.org/project/isort/) and
[Flake8](https://flake8.pycqa.org/en/latest/). Python 3 type hinting is validated using
[MyPy](https://pypi.org/project/mypy/).

Unit tests are written using [Pytest](https://docs.pytest.org/en/latest/), documentation is written
using [Numpy Style Python Docstring](https://numpydoc.readthedocs.io/en/latest/format.html).
[Pydocstyle](http://pydocstyle.org/) is used as static analysis tool for checking compliance with Python docstring
conventions.

Additional code security standards are enforced by [Safety](https://github.com/pyupio/safety) and
[Bandit](https://bandit.readthedocs.io/en/latest/). [Git-secrets](https://github.com/awslabs/git-secrets)
ensure you're not pushing any passwords or sensitive information into your Bitbucket repository.
Commits are rejected if the tool matches any of the configured regular expression patterns that indicate that sensitive
information has been stored improperly.

We use [mkdocs](https://www.mkdocs.org) for building documentation.
You can call `make build_docs` from the project root, the docs will be built under `docs/_build/html`.
Detail information about documentation can be found [here](docs/index.md).

## Git Hooks

@@ -167,7 +145,7 @@ that the linting and type checking is correct. If there are errors, the commit w
that need to be made. Alternatively, you can run pre-commit

```
pre-commit run --all-files
pre-commit run
```

If necessary, you can temporarily disable a hook using Git's `--no-verify` switch. However, keep in mind that the CI
4 changes: 2 additions & 2 deletions Makefile
@@ -57,7 +57,7 @@ setup-docs:
python3 -m venv .venv-docs && \
source .venv-docs/bin/activate && \
poetry run pip install --upgrade pip setuptools && \
poetry install --all-extras --with dev,docs && \
poetry install --only docs && \
echo -e "$(SUCCESS) Virtual environment created successfully!$(TERMINATOR)" && \
echo -e "$(HINT) Activate the virtual environment with: source .venv-docs/bin/activate$(TERMINATOR)"

@@ -415,7 +415,7 @@ push-all: push dvc-upload
mkdocs-build: setup-docs
@echo -e "$(INFO) Building documentation...$(TERMINATOR)" && \
source .venv-docs/bin/activate && \
poetry run mkdocs build
poetry run mkdocs build --site-dir html/

## Serve MKDocs documentation on localhost:8000
mkdocs-serve: setup-docs
2 changes: 1 addition & 1 deletion docs/image/annotation.md
@@ -1,7 +1,7 @@
# Annotation Validation

## COCO Dataset Metrics

### Validity
::: pymdma.image.measures.input_val.annotation.coco.DatasetCompletness
::: pymdma.image.measures.input_val.annotation.coco.AnnotationCorrectness
::: pymdma.image.measures.input_val.annotation.coco.AnnotationUniqueness
7 changes: 3 additions & 4 deletions docs/image/input_val.md
@@ -1,7 +1,7 @@
# Input Validation

## No Reference

## Data-based
### Quality (No-reference)
::: pymdma.image.measures.input_val.DOM
::: pymdma.image.measures.input_val.Tenengrad
::: pymdma.image.measures.input_val.TenengradRelative
@@ -14,8 +14,7 @@

______________________________________________________________________

## Full Reference

### Quality (Full-reference)
::: pymdma.image.measures.input_val.PSNR
::: pymdma.image.measures.input_val.SSIM
::: pymdma.image.measures.input_val.MSSIM
8 changes: 5 additions & 3 deletions docs/image/synth_val.md
@@ -2,15 +2,17 @@

For synthesis validation we have only feature based evaluation metrics.

## Feature Based

## Feature-based
### Quality
::: pymdma.image.measures.synthesis_val.GIQA
::: pymdma.image.measures.synthesis_val.ImprovedPrecision
::: pymdma.image.measures.synthesis_val.ImprovedRecall
::: pymdma.image.measures.synthesis_val.Density
::: pymdma.image.measures.synthesis_val.Coverage
::: pymdma.image.measures.synthesis_val.Authenticity
::: pymdma.image.measures.synthesis_val.FrechetDistance
::: pymdma.image.measures.synthesis_val.GeometryScore
::: pymdma.image.measures.synthesis_val.MultiScaleIntrinsicDistance
::: pymdma.image.measures.synthesis_val.PrecisionRecallDistribution

### Privacy
::: pymdma.image.measures.synthesis_val.Authenticity
29 changes: 26 additions & 3 deletions docs/index.md
@@ -1,7 +1,30 @@
# pyMDMA - Multimodal Data Metrics for Auditing real and synthetic datasets

Repository for the development of time series data/model auditing methods.
Data auditing is essential for ensuring the reliability of machine learning models by maintaining the integrity of the datasets upon which these models rely. This work introduces a dedicated repository for data auditing, presenting a comprehensive suite of metrics designed for evaluating data.

## Getting Started
pyMDMA is an open-source Python library that offers metrics for evaluating both
real and synthetic datasets across image, tabular, and time-series data modalities. It was
developed to address gaps in existing evaluation frameworks that either lack metrics for
specific data modalities or do not include certain state-of-the-art metrics. The library is designed to be modular, allowing users to easily extend it with new metrics.

This is the entrypoint for the documentation.
The source code is available on [GitHub](https://github.com/fraunhoferportugal/pymdma/tree/main) and the documentation can be found [here](dummy).

## Metric Categories
Each metric class is organized based on the modality, validation type, metric group and goal. Following is a brief description of these categories:
![Metric Categories](resources/data_auditing.png)

### Validation Type
The platform offers two types of evaluation - input and synthesis validation. The first type includes metrics for assessing raw data quality intended for use in machine learning tasks. The second type evaluates data generated by a synthesis model. Note that input metrics can also be used to evaluate the quality of synthetic datasets.

### Metric Group
Metrics are loosely organized based on the data format and metric input requirements. Data-based metrics require minimal to no preprocessing of the data before computation. Feature-based metrics are computed over embeddings of the data, often obtained with a classification model. Annotation-based metrics validate the integrity and validity of dataset annotations. Currently, this last type is only available for COCO [1] annotated image datasets.

### Metric Goal
These categories represent the types of evaluations each metric performs and are applicable across various validation contexts. For input validation, Quality refers to measurable data attributes, such as contrast and brightness in images or the signal-to-noise ratio in time-series data. In synthesis validation, Quality encompasses three key evaluation pillars for synthetic datasets: Fidelity, Diversity, and Authenticity [2]. Fidelity measures the similarity of a synthetic dataset to real data; Diversity evaluates how well the synthetic dataset spans the full range of the real data manifold; and Authenticity ensures the synthetic dataset is sufficiently distinct from real data to avoid being a copy.

Utility metrics assess the usefulness of synthetic datasets for downstream tasks, which is especially valuable when synthetic data is used to augment real datasets. Privacy metrics examine whether a dataset or instance is overly similar to another; without a reference, they help identify sensitive attributes like names or addresses. Finally, Validity includes metrics that confirm data integrity, such as ensuring that COCO annotations meet standard formatting requirements.
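To make the fidelity pillar concrete, the k-nearest-neighbour manifold idea behind metrics like `ImprovedPrecision` can be sketched as follows. This is an illustrative, self-contained sketch — not pymdma's implementation; the function names and the choice of `k` are assumptions for the example. A synthetic sample counts as faithful if it falls within the k-NN radius of at least one real sample:

```python
import math

def dist(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_radii(points, k=2):
    # Radius of each point = distance to its k-th nearest neighbour;
    # the union of these balls approximates the data manifold.
    radii = []
    for p in points:
        ds = sorted(dist(p, q) for q in points if q is not p)
        radii.append(ds[k - 1])
    return radii

def precision(real, synth, k=2):
    # Fraction of synthetic samples that land inside the real-data manifold.
    radii = knn_radii(real, k)
    hits = sum(
        1 for s in synth
        if any(dist(s, r) <= rad for r, rad in zip(real, radii))
    )
    return hits / len(synth)

real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synth_near = [(0.5, 0.5), (0.2, 0.1)]  # plausible synthetic samples
synth_far = [(5.0, 5.0)]               # off-manifold sample

print(precision(real, synth_near))  # near samples score high
print(precision(real, synth_far))   # distant samples score low
```

Recall/coverage-style metrics invert the roles of the two sets, asking how much of the real manifold the synthetic samples reach — which is why fidelity and diversity are reported as separate pillars.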

## References
[1] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (pp. 740-755). Springer International Publishing.

[2] Alaa, A., Van Breugel, B., Saveliev, E.S. &amp; van der Schaar, M.. (2022). How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models. <i>Proceedings of the 39th International Conference on Machine Learning</i>, in <i>Proceedings of Machine Learning Research</i> 162:290-306 Available from https://proceedings.mlr.press/v162/alaa22a.html.
5 changes: 3 additions & 2 deletions docs/installation.md
@@ -24,10 +24,11 @@ Or alternatively, you can install all modalities by running the following comman
$ pip install pymdma[all]
```

## Torch CPU version

## Minimal Version (CPU)

For a minimal installation (without GPU support), you can install the package with CPU version of torch, which will skip the installation of CUDA dependencies. To do so, run the following command:

```bash
$ pip install pymdma[...] --extra-index-url=https://download.pytorch.org/whl/cpu/torch_stable.html
$ pip install pymdma[...] --find-links=https://download.pytorch.org/whl/cpu/torch_stable.html
```
Binary file added docs/resources/data_auditing.png
6 changes: 3 additions & 3 deletions docs/tabular/input_val.md
@@ -1,7 +1,7 @@
# Input Validation

## Data Quality

## Data-based
### Quality
::: pymdma.tabular.measures.input_val.CorrelationScore
::: pymdma.tabular.measures.input_val.UniquenessScore
::: pymdma.tabular.measures.input_val.UniformityScore
@@ -12,6 +12,6 @@

______________________________________________________________________

## Privacy
### Privacy

::: pymdma.tabular.measures.input_val.KAnonymityScore
13 changes: 8 additions & 5 deletions docs/tabular/synth_val.md
@@ -2,19 +2,22 @@

For synthesis validation we have only feature based evaluation metrics. To access the

## Feature Based

## Feature-based
### Quality
::: pymdma.tabular.measures.synthesis_val.ImprovedPrecision
::: pymdma.tabular.measures.synthesis_val.ImprovedRecall
::: pymdma.tabular.measures.synthesis_val.Density
::: pymdma.tabular.measures.synthesis_val.Coverage
::: pymdma.tabular.measures.synthesis_val.Authenticity
::: pymdma.tabular.measures.synthesis_val.StatisticalSimScore
::: pymdma.tabular.measures.synthesis_val.StatisiticalDivergenceScore
::: pymdma.tabular.measures.synthesis_val.CoherenceScore

### Privacy
::: pymdma.tabular.measures.synthesis_val.Authenticity

______________________________________________________________________

## Data Based
## Data-based
### Utility

::: pymdma.tabular.measures.synthesis_val.Utility
<!-- ::: pymdma.tabular.measures.synthesis_val.Utility -->
4 changes: 2 additions & 2 deletions docs/time_series/input_val.md
@@ -1,6 +1,6 @@
# Input Validation

## Data Quality

## Data-based
### Quality (No-reference)
::: pymdma.time_series.measures.input_val.Uniqueness
::: pymdma.time_series.measures.input_val.SNR
8 changes: 5 additions & 3 deletions docs/time_series/synth_val.md
@@ -4,17 +4,19 @@ For synthesis validation we have only feature based evaluation metrics. Reminder

These metrics require a preprocessing of the data, to extract the features that will be used to compare the generated data with the reference dataset. For more information on this process, please refer to the feature extraction tutorials.

## Feature Based

## Feature-based
### Quality
::: pymdma.time_series.measures.synthesis_val.ImprovedPrecision
::: pymdma.time_series.measures.synthesis_val.ImprovedRecall
::: pymdma.time_series.measures.synthesis_val.Density
::: pymdma.time_series.measures.synthesis_val.Coverage
::: pymdma.time_series.measures.synthesis_val.Authenticity
::: pymdma.time_series.measures.synthesis_val.FrechetDistance
::: pymdma.time_series.measures.synthesis_val.GeometryScore
::: pymdma.time_series.measures.synthesis_val.MultiScaleIntrinsicDistance
::: pymdma.time_series.measures.synthesis_val.PrecisionRecallDistribution
::: pymdma.time_series.measures.synthesis_val.WassersteinDistance
::: pymdma.time_series.measures.synthesis_val.MMD
::: pymdma.time_series.measures.synthesis_val.CosineSimilarity

### Privacy
::: pymdma.time_series.measures.synthesis_val.Authenticity
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -37,7 +37,7 @@ plugins:
show_docstring_functions: false
show_docstring_classes: false
show_root_heading: true
heading_level: 3
heading_level: 4
show_source: true
members: false
show_bases: false
@@ -50,7 +50,7 @@ nav:
- Getting Started:
- Home: index.md
- Installation: installation.md
- Tutorials: tutorials.md
# - Tutorials: tutorials.md
- Contributting: contributing.md
- Developing: developer.md
- Image Metrics: