app/MZKBlank/README.md
22 additions & 19 deletions
@@ -1,35 +1,24 @@
-# Negative samples (Background images)
+# MZKBlank (Background images)

 ## From the YOLO docs

 > **Background images**. Background images are images with no objects that are added to a dataset to reduce False
 > Positives (FP). We recommend about 0-10% background images to help reduce FPs (COCO has 1000 background images for
 > reference, 1% of the total). No labels are required for background images.

-The final dataset will consist of approximately ten thousand images, that means that we'll need around one thousand
-background images.
+We plan for the dataset to consist of approximately ten thousand images, which means that we'll need around one thousand background images.

 ## Where do we get the images from?

-From Moravská zemská knihovna - MZK.
-
-Using
-the [MZK search engine](https://www.digitalniknihovna.cz/mzk/search?access=open&licences=public&doctypes=sheetmusic) we
-can search specifically for open and public sheet music, there are thousands of them. MZK implements the really nice and
-convenient to use [IIIF API](https://iiif.io/api/image/3.0/). There are only two problems with it:
-
-- I was not able to find any documentation on how to use this api to search for documents using filters. (Like "give me
-IDs to all sheet music documents".)
-
-- Page labelling is not consistent and some page IDs end up raising 4xx or 5xx errors.
-
-### [How does MZKScraper work?](https://github.com/v-dvorak/mzkscraper/blob/main/docs/README.md)
+From Moravská zemská knihovna (MZK). Using
+the [MZK search engine](https://www.digitalniknihovna.cz/mzk/search?access=open&licences=public&doctypes=sheetmusic) we can search specifically for open and public sheet music; there are thousands of such documents. MZK implements the convenient [IIIF API](https://iiif.io/api/image/3.0/), but it does not implement all the features that I would like to use, so I created [MZKScraper](https://github.com/v-dvorak/mzkscraper/blob/main/docs/README.md).
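For readers unfamiliar with the IIIF Image API mentioned above, here is a minimal sketch of how a request URL is assembled from the template in the IIIF Image API 3.0 specification (`{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`). The base URL and page identifier are placeholders, not MZK's actual endpoint, and `iiif_image_url` is a hypothetical helper, not part of MZKScraper.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API 3.0 request URL for one page image.

    Reserved characters such as "/" in the identifier are percent-encoded;
    the colon is kept, since identifiers like "uuid:..." use it literally.
    """
    encoded_id = quote(identifier, safe=":")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Placeholder values for illustration only (assumptions, not real MZK data).
HYPOTHETICAL_BASE = "https://iiif.example.org/images"
HYPOTHETICAL_PAGE_ID = "uuid:1234"

if __name__ == "__main__":
    print(iiif_image_url(HYPOTHETICAL_BASE, HYPOTHETICAL_PAGE_ID))
```

The defaults request the full image at the largest size the server allows, unrotated, in the default quality.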

 ## Numbers of representative images

-The number of representative images for each label was counted as $\left\lceil\frac{\#\text{ samples of label} \:
-\cdot \:\#\text{ images wanted}}{\#\text{ samples in total}}\right\rceil$, this ensures that the number of
-representative images in total is around the value we want (1000 images).
+The number of representative images for each label was counted as
+`round_up(total samples of a label * images wanted / samples in total)`, so that the ratios between images in the dataset are approximately the same as in the whole library.
+
+All used images are cited [here](./docs/citations.txt).

 For quick image overview see [image grids](./docs/README.md).

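The rounding rule for representative images described above can be sketched in a few lines. The label names and counts below are invented for illustration; only the formula (the ceiling of `samples of label * images wanted / samples in total`) comes from the README.

```python
def representative_counts(samples_per_label, images_wanted):
    """Return the number of representative images per label.

    Each label gets ceil(samples_of_label * images_wanted / total_samples).
    Integer arithmetic is used so the ceiling is exact.
    """
    total = sum(samples_per_label.values())
    return {
        label: -(-count * images_wanted // total)  # integer ceiling division
        for label, count in samples_per_label.items()
    }


if __name__ == "__main__":
    # Hypothetical library statistics, not the real MZK numbers.
    library = {"label A": 700, "label B": 250, "label C": 50}
    counts = representative_counts(library, images_wanted=12)
    print(counts, "total:", sum(counts.values()))
```

Because every label is rounded up, the sum can slightly exceed the target. That would be consistent with the table reporting 1006 images for a target of about 1000.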
@@ -49,6 +38,20 @@ For quick image overview see [image grids](./docs/README.md).
 | title page | 212 |
 |**Total**|**1006**|

+## Downloading the dataset
+
+Due to possible copyright issues, only page IDs are publicly available; users have to download the images themselves.
+
+```bash
+# download
+python3 -m app.MZKBlank.download
+
+# create annotations (all will be empty)
+python3 -m app.MZKBlank.build
+```
+
+> Make sure that you have a reliable internet connection. Downloading images from the MZK might take some time.
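To illustrate what "all annotations will be empty" means for background images, here is a minimal sketch that writes one empty YOLO-style `.txt` label file per downloaded image. It is not the actual implementation of `app.MZKBlank.build`; the function name, directory layout and accepted image extensions are assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the download step produces.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def write_empty_labels(images_dir, labels_dir):
    """Create an empty label file for every image found in images_dir.

    Returns the list of label paths that were written.
    """
    images_dir, labels_dir = Path(images_dir), Path(labels_dir)
    labels_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for image in sorted(images_dir.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            label = labels_dir / f"{image.stem}.txt"
            label.write_text("")  # background image: no objects, no rows
            created.append(label)
    return created
```

Files with other extensions are skipped, so stray metadata in the image folder does not produce label files.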