forked from dataobservatory-eu/opencollections-manual
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtidy.qmd
104 lines (56 loc) · 10.2 KB
/
tidy.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
# Tidy work {#sec-tidy}
> Все счастливые семьи похожи друг на друга, каждая несчастливая семья несчастлива по-своему.</br> All happy families are alike; each unhappy family is unhappy in its own way.
Our `OpenCollections` systems are complex system which are intended to be used in trustworthy AI applications. They follow the Anna Karenina principle: a deficiency in any of a number of factors dooms an endeavour to fail. Consequently, a successful endeavour (subject to this principle) is one for which every possible deficiency has been avoided.
Once the data is messy, there is a semantic ambiguity (an ambiguity in the intended use or meaning of data) that will render automation impossible or will lead to logical faults when software agents or algorithms use your data. You must keep your numeric and text data tidy at all times. The best way to keep data and text tidy is to keep it simple. Very simple.
Simplicity is simple, if you start simple and keep it that way. Simplifying messy text and messy data is always challenging.
Collective work involving data and data annotations and descriptions requires a shared understanding of the syntax and file formats.
Our curators need to be familiar with two ideas.
- [x] Tidy data means that tabular datasets are organised in a simple but particular manner. All observations are in rows, and all measured variables or characteristics are in columns, with no merged columns or rows. This is the optimal formatting for working with relational databases, and it is also a helpful start for graph databases. (See: @sec-tidy-data.)
- [x] Word processors like Word Work on different operational systems like Windows, MacOS, and Linux, creating very different text files and adding their formatting and other metadata to what you type. When we work together on the World Wide Web, we need something simpler than HTML but a bit more rich than a plain text file, clearly separating text editing from text formatting. The various markup notations, for example, *markdown*, are conventions for indicating that you want to make a text part **bold** or *italics* that works on all computer systems exactly the same way.(See: @sec-markup-text.)
## Tidy data {#sec-tidy-data}
Our data stewardship must follow the tidy data principle, which has very complex computer science and information management consequences, but for the curators of data, it boils down to an organised simplicity.
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations or collection items, and the measures and types of variables.
](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png){fig-align="center"}
In tidy data:
- [x] **1. Every table column is a variable**. We do not use colours (our machine-to-machine pipelines is colourblind). If we need comments or specifications, we add a new column.
- [x] **2. Every row in the table represents an observation, or an individual piece of a collection.** Every variable belonging to `Bulgaria` is in the `Bulgaria` row, and there is one and only `Bulgaria row`.
- [x] **3. Every cell is a single value**. Your blue male apron length is 97? The length column of the blue male apron row is 97.
::: callout-note
A `tidy dataset` is black-and-white, and each table cells contains one element of knowledge that cannot be further divided.
:::
We repeat in negative terms these seamingly simple principles:
- [ ] **Two observed values: two columns**. We do not use colours (our machine-to-machine pipelines is colourblind). If we need comments or specifications, we add a new column. The apron has a variable length of 96-98? Use a column for min and max length, and in the cases when there is a single number, put 97 as both max length and minimum length. But never write 96-98 in on column, it must be either 96 or 98.
- [ ] **Two items or two observations: two rows**: Do you have two types of blue male aprons: write a separate line for the kitchen apron and the gardening apron. Or just call it apron 1 and apron 2. Do you have two responsdents in the apron customer satisfaction form: that will be two rows.
- [x] **We never merge cells!** A tidy dataset has no divided or joined columns and no divided or joined rows. Never write 96-98 into a cell, because 96 goes to a separate cell than 98 because it has a different meaning: 96 means the minimum length of the apron, and 98 the maximum. Is there an apron with a length of 85 cm? That goes to a different row, because it refers to the length of a different type of apron.
Looks easy? If you start with a tidy table, it *is* very easy. If you have to tidy up a messy data table or an entire database, it often requires many years of data-wrangling experience to get it right first.
Is there science behind this? Yes, and it is more complicated than it sounds. In computer science or algebraic terms, you must organise your data to Codd's 3rd standard form. If you start from a well-organised table, it is a piece of cake to keep it that way. Reorganising messy information into a tidy format requires a lot of experience. Understanding that the ambiguity in the meaning of 96-98 should be resolved by treating them as two separate values, one meaning minimum possible length and the other maximum length, will not come naturally for everybody. But we will help in those cases.
### Wide & Long Formats
{fig-align="center"}
The tidy format is unambiguous: we always know that a number or string (value) belongs to its observational subject (in the rows) and the measured property variable (in the columns). Because the meaning is unambiguous, it can be transposed to different formats without loss of knowledge or misunderstandings.
Our knowledge base applications and Wikibase requires the three-column semantic triple format, because it can be organised into a graph; relational database managers usually prefer the wide format, because in this case every observed property of a data subject is in one record.
::: callout-note
A `tidy dataset` is black-and-white, and each table cells contains one element of knowledge that cannot be further divided.
:::
If your table is tidy, it will be easy to reuse in relational or graph database, or it can import easily into a spreadsheet or statistical program. Any further formatting with colours, divided columns, merged rows will stop the data portability, because only you will know why columns or rows are merged, divided, or coloured.
## Markup text {#sec-markup-text}
We create interconnected, interoperable (web) resources. We want to ensure that our research results are findable, accessible, and reusable. It must work in Word and Works, Notebook and VIM, Windows, MacOS, and Linux, with Latvian, English, Greek, and Thai character sets.
The World Wide Web has been a source of high interoperability and findability in the last 30 years, with the introduction of the HTTP protocol and the standardization of the HTML text markup language. We use a much-simplified version of HTML called Markdown.
Markdown text opens on MacOS, Windows, or Linux. It is very easy to translate into HTML, Word, Libre Office, Google Docs, LaTeX, or PDF. Markdown is a simplified HTML text notation that works well with word processors.
{fig-align="center"}
If you want Word output, Word is rendered instead of HTML. You can also create a PDF or EPUB and even a PPTX output.
### Markdown editors
There are countless Markdown editors. Because Markdown is so simple, you can, if you want to, edit markdown files in Notepad, WordPad (Windows) or VIM (Linux).
Most word processors support markdown. For example, Google Docs has a [free extension](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) that converts and document from Docs to markdown.
{fig-align="center"}
There are several online Markdown editors that you can use to try writing in Markdown. [Dillinger](https://dillinger.io/) is one of the best online Markdown editors. Just open the site and start typing in the left pane. A preview of the rendered document appears in the right pane.
- [Basic Syntax](https://www.markdownguide.org/basic-syntax/)
- [Extended Syntax](https://www.markdownguide.org/extended-syntax/)
::: callout-tip
### Using Word or Works
You can work on Word, your iWorks suite, or any preferred word processor. However, you will lose margin settings, font typefaces and sizes, background colors, and other finishing touches.
We discourage the use of word processors for footnotes and bibliographic references due to their varying treatment of such metadata. Our systems rely on standard BibLatex bibliographic references and a simple notation for footnotes, ensuring consistency and reliability.
Our recommended markdown editor is Quarto. You can copy and paste text from Word or other word processors into Quarto, and it will retain **bold**, *italics*, and headings.
Remember, we want to create text that machines and people can read, too, to avoid fancy aesthetics. Keep the text fancy (but of course, you can dress it up in Word or Adobe Illustrator later).
:::
### Wikipedia & MediaWiki
The documentation of our knowledge base and terminological agreements is documented in MediaWiki, the software that makes Wikipedia editable, too. It uses a form of markdown for an interoperable and simple editing of interlinked documents, images, and data documents.