Commit 42670bf

Update MZKBlank description
1 parent f0f15f0 commit 42670bf

2 files changed: +24 -19 lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,7 @@ Semester work at MFF CUNI under the guidance of Jirka Mayer.
 - [User documentation](docs/README.md)
 - [Download the latest dataset](https://github.com/v-dvorak/omr-layout-analysis/releases/tag/Latest) (~2.5 GB)
 - [Download the latest model](https://github.com/v-dvorak/omr-layout-analysis/releases/tag/Models) (~50 MB)
+- [MZKBlank dataset](app/MZKBlank/README.md)

 ## Dataset overview

@@ -109,6 +110,7 @@ System are mainly created by looking at system staves at approximately the same
 - **September 2024**
 - added random shuffling to train/test split
 - added `seed` option
+- unpublished MZKBlank

 - **August 2024**
 - complete dataset made public in Releases

app/MZKBlank/README.md

Lines changed: 22 additions & 19 deletions
@@ -1,35 +1,24 @@
-# Negative samples (Background images)
+# MZKBlank (Background images)

 ## From the YOLO docs

 > **Background images**. Background images are images with no objects that are added to a dataset to reduce False
 > Positives (FP). We recommend about 0-10% background images to help reduce FPs (COCO has 1000 background images for
 > reference, 1% of the total). No labels are required for background images.

-The final dataset will consist of approximately ten thousand images, that means that we'll need around one thousand
-background images.
+We plan for the dataset to consist of approximately ten thousand images, which means that we'll need around one thousand background images.

 ## Where do we get the images from?

-From Moravská zemská knihovna - MZK.
-
-Using
-the [MZK search engine](https://www.digitalniknihovna.cz/mzk/search?access=open&licences=public&doctypes=sheetmusic) we
-can search specifically for open and public sheet music, there are thousands of them. MZK implements the really nice and
-convenient to use [IIIF API](https://iiif.io/api/image/3.0/). There are only two problems with it:
-
-- I was not able to find any documentation on how to use this api to search for documents using filters. (Like "give me
-IDs to all sheet music documents".)
-
-- Page labelling is not consistent and some page IDs end up raising 4xx or 5xx errors.
-
-### [How does MZKScraper work?](https://github.com/v-dvorak/mzkscraper/blob/main/docs/README.md)
+From Moravská zemská knihovna (MZK). Using
+the [MZK search engine](https://www.digitalniknihovna.cz/mzk/search?access=open&licences=public&doctypes=sheetmusic), we can search specifically for open and public sheet music; there are thousands of such documents. MZK implements the really nice and convenient-to-use [IIIF API](https://iiif.io/api/image/3.0/). However, it does not implement all the features that I would like to use, so I created [MZKScraper](https://github.com/v-dvorak/mzkscraper/blob/main/docs/README.md).

 ## Numbers of representative images

-The number of representative images for each label was counted as $\left\lceil\frac{\#\text{ samples of label} \:
-\cdot \: \#\text{ images wanted}}{\#\text{ samples in total}}\right\rceil$, this ensures that the number of
-representative images in total is around the value we want (1000 images).
+The number of representative images for each label was computed as
+`round_up(total samples of a label * images wanted / samples in total)`, so that the proportions of labels in the dataset are approximately the same as in the whole library.
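The rounding rule above can be sketched in a few lines of Python. This is only an illustration, not the repository's code: the function name `representative_counts` and the label counts in the example are hypothetical. Integer ceiling division is used here to avoid floating-point rounding surprises. Because every label is rounded up, the total can slightly exceed the number of images wanted, which is consistent with the README's total of 1006 for a target of about 1000.

```python
def representative_counts(samples_per_label: dict[str, int], images_wanted: int) -> dict[str, int]:
    """Apply round_up(samples of label * images wanted / samples in total) per label.

    Hypothetical helper for illustration; not taken from the repository.
    """
    total = sum(samples_per_label.values())
    if total <= 0:
        raise ValueError("at least one sample is required")
    # (a + b - 1) // b is the integer ceiling of a / b for non-negative a and positive b
    return {
        label: (count * images_wanted + total - 1) // total
        for label, count in samples_per_label.items()
    }


# Made-up label counts, only to show the effect of rounding up
example = representative_counts({"a": 5, "b": 3, "c": 2}, images_wanted=7)
print(example, sum(example.values()))
```

Rounding each label up rather than to the nearest integer guarantees that every label present in the library gets at least one representative image.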
+
+All used images are cited [here](./docs/citations.txt).

 For quick image overview see [image grids](./docs/README.md).

@@ -49,6 +38,20 @@ For quick image overview see [image grids](./docs/README.md).
 | title page | 212 |
 | **Total** | **1006** |

+## Downloading the dataset
+
+Due to possible copyright issues, only page IDs are publicly available and users have to download the images themselves.
+
+```bash
+# download
+python3 -m app.MZKBlank.download
+
+# create annotations (all will be empty)
+python3 -m app.MZKBlank.build
+```
+
+> Make sure that you have a reliable internet connection. Downloading images from the MZK might take some time.
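Since only page IDs are published, a download step has to turn each ID into an image request. The sketch below shows how a URL following the IIIF Image API 3.0 template (`{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`) might be assembled. It is an assumption-laden illustration and not the implementation of `app.MZKBlank.download`: the function name `iiif_image_url`, the base URL, and the page ID are placeholders, and the real MZK endpoint may differ.

```python
from urllib.parse import quote


def iiif_image_url(
    base_url: str,
    identifier: str,
    region: str = "full",
    size: str = "max",
    rotation: str = "0",
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Build an image request URL following the IIIF Image API 3.0 URI template.

    Hypothetical helper; base_url and identifier must be supplied by the caller.
    """
    # Percent-encode the identifier so that characters such as "/" do not
    # get interpreted as path separators.
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Placeholder server and page ID, for illustration only
print(iiif_image_url("https://iiif.example.org/image", "page/1"))
```

The defaults `full/max/0/default.jpg` request the whole page at the largest size the server offers, unrotated, in the server's default quality.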
+
 ## Labels provided by MZK

 ### Accepted labels
