% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames,x11names}{xcolor}
%
\documentclass[
letterpaper,
DIV=11,
numbers=noendperiod]{scrreprt}
\usepackage{amsmath,amssymb}
\usepackage{iftex}
\ifPDFTeX
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
\usepackage{lmodern}
\ifPDFTeX\else
% xetex/luatex font selection
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\setcounter{secnumdepth}{5}
% Make \paragraph and \subparagraph free-standing
\makeatletter
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}{
\@ifstar
\xxxParagraphStar
\xxxParagraphNoStar
}
\newcommand{\xxxParagraphStar}[1]{\oldparagraph*{#1}\mbox{}}
\newcommand{\xxxParagraphNoStar}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}{
\@ifstar
\xxxSubParagraphStar
\xxxSubParagraphNoStar
}
\newcommand{\xxxSubParagraphStar}[1]{\oldsubparagraph*{#1}\mbox{}}
\newcommand{\xxxSubParagraphNoStar}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
\makeatother
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{241,243,245}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.40,0.45,0.13}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\BuiltInTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\ExtensionTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.28,0.35,0.67}{#1}}
\newcommand{\ImportTok}[1]{\textcolor[rgb]{0.00,0.46,0.62}{#1}}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
\newcommand{\RegionMarkerTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.07,0.07,0.07}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
% definitions for citeproc citations
\NewDocumentCommand\citeproctext{}{}
\NewDocumentCommand\citeproc{mm}{%
\begingroup\def\citeproctext{#2}\cite{#1}\endgroup}
\makeatletter
% allow citations to break across lines
\let\@cite@ofmt\@firstofone
% avoid brackets around text for \cite:
\def\@biblabel#1{}
\def\@cite#1#2{{#1\if@tempswa , #2\fi}}
\makeatother
\newlength{\cslhangindent}
\setlength{\cslhangindent}{1.5em}
\newlength{\csllabelwidth}
\setlength{\csllabelwidth}{3em}
\newenvironment{CSLReferences}[2] % #1 hanging-indent, #2 entry-spacing
{\begin{list}{}{%
\setlength{\itemindent}{0pt}
\setlength{\leftmargin}{0pt}
\setlength{\parsep}{0pt}
% turn on hanging indent if param 1 is 1
\ifodd #1
\setlength{\leftmargin}{\cslhangindent}
\setlength{\itemindent}{-1\cslhangindent}
\fi
% set entry spacing
\setlength{\itemsep}{#2\baselineskip}}}
{\end{list}}
\usepackage{calc}
\newcommand{\CSLBlock}[1]{\hfill\break\parbox[t]{\linewidth}{\strut\ignorespaces#1\strut}}
\newcommand{\CSLLeftMargin}[1]{\parbox[t]{\csllabelwidth}{\strut#1\strut}}
\newcommand{\CSLRightInline}[1]{\parbox[t]{\linewidth - \csllabelwidth}{\strut#1\strut}}
\newcommand{\CSLIndent}[1]{\hspace{\cslhangindent}#1}
\KOMAoption{captions}{tableheading}
\makeatletter
\@ifpackageloaded{tcolorbox}{}{\usepackage[skins,breakable]{tcolorbox}}
\@ifpackageloaded{fontawesome5}{}{\usepackage{fontawesome5}}
\definecolor{quarto-callout-color}{HTML}{909090}
\definecolor{quarto-callout-note-color}{HTML}{0758E5}
\definecolor{quarto-callout-important-color}{HTML}{CC1914}
\definecolor{quarto-callout-warning-color}{HTML}{EB9113}
\definecolor{quarto-callout-tip-color}{HTML}{00A047}
\definecolor{quarto-callout-caution-color}{HTML}{FC5300}
\definecolor{quarto-callout-color-frame}{HTML}{acacac}
\definecolor{quarto-callout-note-color-frame}{HTML}{4582ec}
\definecolor{quarto-callout-important-color-frame}{HTML}{d9534f}
\definecolor{quarto-callout-warning-color-frame}{HTML}{f0ad4e}
\definecolor{quarto-callout-tip-color-frame}{HTML}{02b875}
\definecolor{quarto-callout-caution-color-frame}{HTML}{fd7e14}
\makeatother
\makeatletter
\@ifpackageloaded{bookmark}{}{\usepackage{bookmark}}
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\AtBeginDocument{%
\ifdefined\contentsname
\renewcommand*\contentsname{Table of contents}
\else
\newcommand\contentsname{Table of contents}
\fi
\ifdefined\listfigurename
\renewcommand*\listfigurename{List of Figures}
\else
\newcommand\listfigurename{List of Figures}
\fi
\ifdefined\listtablename
\renewcommand*\listtablename{List of Tables}
\else
\newcommand\listtablename{List of Tables}
\fi
\ifdefined\figurename
\renewcommand*\figurename{Figure}
\else
\newcommand\figurename{Figure}
\fi
\ifdefined\tablename
\renewcommand*\tablename{Table}
\else
\newcommand\tablename{Table}
\fi
}
\@ifpackageloaded{float}{}{\usepackage{float}}
\floatstyle{ruled}
\@ifundefined{c@chapter}{\newfloat{codelisting}{h}{lop}}{\newfloat{codelisting}{h}{lop}[chapter]}
\floatname{codelisting}{Listing}
\newcommand*\listoflistings{\listof{codelisting}{List of Listings}}
\makeatother
\makeatletter
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\@ifpackageloaded{subcaption}{}{\usepackage{subcaption}}
\makeatother
\newcounter{quartocallouttipno}
\newcommand{\quartocallouttip}[1]{\refstepcounter{quartocallouttipno}\label{#1}}
\ifLuaTeX
\usepackage{selnolig} % disable illegal ligatures
\fi
\usepackage{bookmark}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same} % disable monospaced font for URLs
\hypersetup{
pdftitle={OpenCollections Manual},
pdfauthor={Daniel Antal, CFA; Ádám Lázár; Andor Kornél Barát; Anna Márta Mester},
colorlinks=true,
linkcolor={blue},
filecolor={Maroon},
citecolor={Blue},
urlcolor={Blue},
pdfcreator={LaTeX via pandoc}}
\title{OpenCollections Manual}
\author{Daniel Antal, CFA \and Ádám Lázár \and Andor Kornél
Barát \and Anna Márta Mester}
\date{2025-01-09}
\begin{document}
\maketitle
\renewcommand*\contentsname{Table of contents}
{
\hypersetup{linkcolor=}
\setcounter{tocdepth}{2}
\tableofcontents
}
\bookmarksetup{startatroot}
\chapter*{Introduction}\label{introduction}
\addcontentsline{toc}{chapter}{Introduction}
\markboth{Introduction}{Introduction}
Reprex's new \texttt{OpenCollections} system helps small and large
enterprises work with big data without large investments in data
infrastructure. \texttt{OpenCollections} is an element of our
collaborative toolkits that enables owners of small, local databases to
remain competitive in training AI in the age of big data. It helps you
fill your databases with up-to-date information, find and correct
errors, and connect your database entries to new information as you need
it, without further IT and data investments.

The \texttt{OpenCollections} component of our solutions aims to
interconnect inventories, collections, and repertoires. We want to
enable private entities, like music distributors, rights management
agencies, and film producers, to synchronise their IT systems with
public GLAM memory institutions: archives, libraries, museums, and
statistical agencies. We also want you to be able to enrich your
inventory or repertoire management from interconnected databases, which
improves automated sales processes and the training of sales, inventory
management or other AI algorithms.

Like many applications in the European open data field,
\texttt{OpenCollections} is built around Wikibase. This open-source
software system powers one of the world's most extensive knowledge
graphs and knowledge bases, Wikidata, which synchronises the knowledge
base of the 329 versions of Wikipedia with global databases, libraries,
statistical authorities, company registers and other digital
infrastructure.

This manual is not aimed at IT professionals or engineers. Wikibase has
many thousands of users and a simple, intuitive user interface. We wrote
this manual for data stewards, data curators, librarians, archivists,
and inventory managers who are responsible for documenting and updating
repertoires, intellectual property assets, rights claims, and webshop
inventories, or for inventory management, and who want to automate their
processes or train AI algorithms to do a better job for them.

Chapter~\ref{sec-inspiration} will need to be rewritten; it is currently
taken from our observatory handbook, which deals with data collection
programs, not broader collecting programs.

Chapter~\ref{sec-tidy} is a very brief introduction to tidy data and
text: how to keep information tidy for automated computer use and easy
database import.

Chapter~\ref{sec-collections} offers a typology of collections and the
most prevalent problems with collections: ambiguous names,
hard-to-translate descriptions, mismatched names and titles. Such
problems appear in all large-scale applications and can negatively
affect business, sales, legal or research integrity. We give some tips
on how to work with our systems to prevent such problems or to resolve
existing collection management problems with automated data improvement,
enrichment or updating.

Chapter~\ref{sec-wikidata-open-graph} introduces Wikidata and other Open
Knowledge Graphs. Using Wikidata, Wikipedia's document database, as an
example, we show how to organise knowledge into a graph database and
connect it with other knowledge graphs on the Internet of Data.

Chapter~\ref{sec-wikibase} introduces the adaptability of Wikibase and
enterprise knowledge graphs that are tailored to your needs and can
handle highly confidential data.

Chapter~\ref{sec-reprex-sandbox} shows how to get familiar with the
system in our Sandbox.

The creation of \texttt{OpenCollections} accounts is explained
step-by-step in Section~\ref{sec-opencollections-create-account}.

Wikibase has been open source for a long time, but it is in its infancy
as a supported open-source product. \texttt{ReprexBase}, our
distribution, is enhanced with know-how, and our software libraries help
you tailor this knowledge system to your needs. Wikibase has been
successfully used in many EU projects, including the creation of the
\href{https://linkedopendata.eu/wiki/The_EU_Knowledge_Graph}{EU
Knowledge Graph↗} (see Section~\ref{sec-eu-knowledge-graph} and
Diefenbach, Wilde, and Alipio 2021). It also has training material on
the EU Academy. While Wikibase is fully open-source and accessible, it
is a complicated system that requires many extensions and adaptations to
support a data-sharing space or a public-private knowledge base like
ours. Reprex's extensions aim to make data importing and enrichment
easier and less costly, and to make data export more reusable.

Using Wikibase allows coordination with Wikidata, which has evolved into
a central hub on the web of data. It is one of the largest existing
knowledge graphs, and perhaps the best-known open one. It is
synchronised with knowledge from respected public institutions like
Eurostat, the German National Library or the BBC, and it is one of the
backbones of many web services like Google Search. Wikibase \emph{is
scalable} to very big graphs.
\bookmarksetup{startatroot}
\chapter*{Glossary}\label{glossary}
\addcontentsline{toc}{chapter}{Glossary}
\markboth{Glossary}{Glossary}
Our glossary is harmonised with the ISO \emph{Information technology ---
Vocabulary} (ISO 2023); \emph{Information technology --- Cloud computing
--- Taxonomy based data handling for cloud services} (ISO 2020) and
\emph{Information technology --- Cloud computing --- Interoperability
and portability} (ISO 2017).
\section*{Data science terms}\label{data-science-terms}
\addcontentsline{toc}{section}{Data science terms}
\markright{Data science terms}
\begin{longtable}[]{@{}
>{\raggedright\arraybackslash}p{(\columnwidth - 2\tabcolsep) * \real{0.2500}}
>{\raggedright\arraybackslash}p{(\columnwidth - 2\tabcolsep) * \real{0.7500}}@{}}
\caption{Data and information science terms, definitions}\tabularnewline
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Term
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Description
\end{minipage} \\
\midrule\noalign{}
\endfirsthead
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Term
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Description
\end{minipage} \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
\texttt{conceptualisation} & an abstract, simplified view of some
selected part of the world, containing the objects, concepts, and other
entities that are presumed of interest for some particular purpose and
the relationships between them \\
\texttt{data} & reinterpretable representation of information in a
formalized manner suitable for communication, interpretation, or
processing. Note 1 to entry: Data can be processed by humans or by
automatic means. {[}SOURCE:ISO/IEC 2382:2015, 2121272{]} \\
\texttt{database} & collection of \texttt{data} organized according to a
conceptual structure describing the characteristics of these data and
the relationships among their corresponding entities, supporting one or
more application areas. {[}SOURCE:ISO/IEC 2382:2015, 2121413{]} \\
\texttt{data\ set} or \texttt{dataset} & identifiable collection of
\texttt{data} available for access or download in one or more formats
{[}SOURCE:Adapted from ISO 19115-2:2009, 4.7{]} \emph{Beware: various
conceptual and information models use different dataset definitions}. \\
\texttt{datatype} & defined set of \texttt{data} objects of a specified
data structure and a set of permissible operations, such that these data
objects act as operands in the execution of any one of these
operations \\
\texttt{big\ data} & \begin{minipage}[t]{\linewidth}\raggedright
extensive datasets -- primarily in the data characteristics of
\texttt{volume}, \texttt{variety}, \texttt{velocity}, and/or
\texttt{variability}~ -- that require a scalable technology for
efficient storage, manipulation, management, and analysis\\
Note: The term big data is commonly used in many different ways, for
example as the name of the scalable technology used to handle extensive
datasets.\strut
\end{minipage} \\
\texttt{data\ variability} & changes in transmission rate, format or
structure, semantics, or quality of datasets \\
\texttt{data\ variety} & range of formats, logical models, timescales,
and semantics of a dataset. Note: Data veracity refers to descriptive
data and self-inquiry about objects to support real-time
decision-making. \\
\texttt{data\ velocity} & rate of flow at which data is created,
transmitted, stored, analysed or visualised \\
\texttt{data\ volatility} & \begin{minipage}[t]{\linewidth}\raggedright
characteristic of data pertaining to the rate of change of these data
over time\\
{[}SOURCE:ISO/IEC 2382:2015, 17.06.06{]}\strut
\end{minipage} \\
\texttt{register} & an official list or record of names or items; it
aims to be a complete list of the objects in a specific group of objects
or population, for example, all copyright-protected musical works in a
country, or all legal person enterprises in another country. \\
\texttt{data\ flow\ definition} & a structure which describes,
categorises and constrains the allowable content of a data set that
providers will supply for different reference periods. {[}SDMX 3.0{]} \\
\texttt{datacube} & A statistical data set created in a
multi-dimensional space (e.g., time, geography, gender), or hyper-cube,
indexed by those dimensions. \emph{The term cube shouldn't be taken
literally, it is not meant to imply that there are exactly three
dimensions.} \\
\texttt{data\ science} & extraction of actionable knowledge from
\texttt{data} through a process of discovery, or hypothesis and
hypothesis testing \\
\texttt{cluster} & \textless distributed data processing\textgreater{}
set of functional units under common control {[}SOURCE:ISO/IEC
2382:2015, 2120586{]} \\
\texttt{scatter} & distribution of processing across multiple nodes in a
\texttt{cluster} \\
\texttt{file} & named set of records treated as a unit
{[}SOURCE:ISO/IEC 2382:2015, 04.07.10{]} \\
\texttt{knowledge\ base} or \texttt{K-base} & database that contains
inference rules and information about human experience and expertise in
a domain. Note 1: In self-improving systems, the knowledge base additionally
contains information resulting from the solution of previously
encountered problems. The terms \texttt{knowledge\ base} and
\texttt{K-base} are standardized by ISO/IEC {[}ISO/IEC
2382-1:1993{]}. \\
\texttt{knowledge\ representation} & process or result of encoding and
storing knowledge in a knowledge base. Term and definition standardized
by ISO/IEC {[}ISO/IEC 2382-28:1995{]}. \\
\texttt{knowledge\ graph} & a \texttt{knowledge\ representation} that
uses a graph-structured data model to represent and operate on data. \\
\texttt{knowledge\ source} & source of information from which a
knowledge base has been created for a specific kind of problem. Term and
definition standardized by ISO/IEC {[}ISO/IEC 2382-28:1995{]}. \\
\texttt{knowledge\ engineering\ tool} & functional tool designed to
facilitate the rapid development of knowledge-based systems. Note 1: A
knowledge engineering tool incorporates specific strategies for
knowledge representation, inference, and control, as well as elementary
modeling constructs for easy handling of typical problems. Term and
definition standardized by ISO/IEC {[}ISO/IEC 2382-28:1995{]}. \\
\texttt{conceptual\ model} & representation of the characteristics of a
universe of discourse by means of entities and entity relationships
(ISO/IEC 2382-17:1999). \emph{In this document, we use the term
conceptual model for models that both humans and computers can use, and
the term information model for models used in IT systems.} \\
\texttt{interoperability} & Ability of two or more systems or
applications to exchange information and to mutually use the information
that has been exchanged. {[}SOURCE:ISO/IEC 19941:2017{]} \\
\texttt{data\ portability} & Ability to easily transfer data from one
system to another without being required to re-enter data. \\
\texttt{metadata} & data that define and describe other data {[}ISO/IEC
11179-1:2023{]}; we use the more functional definition ``a statement
about a potentially informative object.'' ISO/IEC 2382:2015, 17.06.05
defines metadata as \texttt{data} about data or data elements, possibly
including their data descriptions, and data about data ownership, access
paths, access rights and data volatility. \\
\texttt{NERD} & Named-entity recognition and disambiguation \\
\texttt{persistent\ identifier} & A persistent identifier (or
\emph{permanent identifier} or \emph{handle}) is one that never
changes, so that your bookmarks and links don't break when a website,
a database or an API service gets updated. \\
\texttt{CIDOC-CRM} & The conceptual model of CIDOC, the standard
conceptualisation of collection management systems in heritage
organisations. \\
\texttt{RiC} & \emph{Records in Context}, a new conceptual model that
replaces the four most important international archiving standards. \\
\texttt{Wikibase} & Wikibase is a software system that helps the
collaborative management of knowledge in a central repository. It was
originally developed for the management of Wikidata, but it is now
available for the creation of private or public-private partnership
knowledge graphs. It is developed by Wikimedia Deutschland. \\
\texttt{relational\ model} & \begin{minipage}[t]{\linewidth}\raggedright
\texttt{data\ model} whose structure is based on a set of relations\\
{[}SOURCE:ISO/IEC 2382:2015, 17.04.04{]}\strut
\end{minipage} \\
\texttt{non-relational\ model} & logical \texttt{data\ model} that does
not follow a \texttt{relational\ model} for the storage and manipulation
of data \\
\texttt{structured\ data} & \texttt{data} which are organized based on a
pre-defined (applicable) set of rules.
Note: The predefined set of rules governing the basis on which the data
is structured needs to be clearly stated and made known. \\
\texttt{partially\ structured\ data} & \texttt{data} that has some
organization.
Note 1: Partially structured data is often referred to as
semi-structured data by industry.
Note 2: Examples of partially structured data are records with free text
fields in addition to more structured fields. Such data is frequently
represented in computer-interpretable or parsable formats such as XML or
JSON. \\
\texttt{horizontal\ scaling} & providing a single logical unit through
the connection of multiple hardware and software systems.
Note: An example of horizontal scaling is increasing the performance of
distributed data processing through the addition of nodes in the cluster
for additional resources. \\
\texttt{vertical\ scaling} & act of increasing the performance of data
processing through improvements to processors, memory, storage, or
connectivity. \\
\texttt{algorithm} & \begin{minipage}[t]{\linewidth}\raggedright
finite ordered set of well-defined rules for the solution of a problem\\
definition standardized by ISO/IEC 2382-1:1993\strut
\end{minipage} \\
\end{longtable}
\section*{Naming}\label{naming}
\addcontentsline{toc}{section}{Naming}
\markright{Naming}
\begin{longtable}[]{@{}
>{\raggedright\arraybackslash}p{(\columnwidth - 2\tabcolsep) * \real{0.2500}}
>{\raggedright\arraybackslash}p{(\columnwidth - 2\tabcolsep) * \real{0.7500}}@{}}
\caption{Named entity terms, definitions}\tabularnewline
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Term
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Description
\end{minipage} \\
\midrule\noalign{}
\endfirsthead
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Term
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Description
\end{minipage} \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
\texttt{party} & natural person or legal person, whether or not
incorporated, or a group of either (ISNI 3.1, (ISO 2012, p15)) \\
\texttt{registrant} & \texttt{party} that requests an ISNI from the
Registration Authority (ISNI 3.2 (ISO 2012, p15)) \\
\texttt{identity\ of\ a\ party} & identity of a party or a fictional
character that is or was presented to the public (3.2, (ISO 2012,
p15)) \\
\texttt{name} & (3.2, (ISO 2012, p15)) \\
\texttt{name} & word or phrase used for identification (Wikidata
\href{https://www.wikidata.org/wiki/Q82799}{Q82799}) \\
\texttt{common\ name} & name generally used for a taxon, group of taxa
or organism(s) (Wikidata
\href{https://www.wikidata.org/wiki/Q502895}{Q502895}) \\
\texttt{family\ name} & part of a naming scheme for individuals, used in
many cultures worldwide (Wikidata
\href{https://www.wikidata.org/wiki/Q101352}{Q101352}) \\
\texttt{given\ name} & name typically used to differentiate people from
the same family, clan, or other social group who have a common last name
(Wikidata \href{https://www.wikidata.org/wiki/Q202444}{Q202444}) \\
\texttt{company\ name} & \\
\texttt{geographical\ name} & Toponym. Name for a geographical entity or
location. (Wikidata
\href{https://www.wikidata.org/wiki/Q7884789}{Q7884789}) \\
\texttt{namespace} & collection of identifiers with a unique meaning
within the namespace (Wikidata
\href{https://www.wikidata.org/wiki/Q873636}{Q873636}) \\
\texttt{thesaurus} & controlled vocabulary expanded with relations of
broader, narrower and related terms, serving subject indexing and
vocabulary control (Wikidata
\href{https://www.wikidata.org/wiki/Q17152639}{Q17152639}) \\
\texttt{authority\ file} & \\
\texttt{register} & \\
\texttt{name\ ambiguity} & \\
\end{longtable}
\section*{VIAF definitions}\label{viaf-definitions}
\addcontentsline{toc}{section}{VIAF definitions}
\markright{VIAF definitions}
Based on the VIAF website and the OCLC glossary.
\begin{longtable}[]{@{}
>{\raggedright\arraybackslash}p{(\columnwidth - 2\tabcolsep) * \real{0.0569}}
>{\raggedright\arraybackslash}p{(\columnwidth - 2\tabcolsep) * \real{0.9431}}@{}}
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
Term
\end{minipage} & \begin{minipage}[b]{\linewidth}\raggedright
Description
\end{minipage} \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
\texttt{VIAF} & The
\texttt{Virtual\ International\ Authority\ File\ (VIAF)} combines
multiple name authority files into a single OCLC-hosted name authority
service. The goal of the service is to lower the cost and increase the
utility of library authority files by matching and linking widely-used
authority files and making that information available on the Web. VIAF
does not create data but only processes data submitted by VIAF
participants. \\
\texttt{OCLC} & A nonprofit global library cooperative providing shared
technology services, original research, and community programs for its
membership and the library community at large. Originally ``Ohio College
Library Center,'' later ``Online Computer Library Center, Inc.'' or
``OCLC, Inc.''. \\
\texttt{authority\ record} & A collection of information about a name
(personal, corporate, family, or meeting), preferred title, or subject
term (topical, geographic, genre, etc.). \\
\texttt{bibliographic\ record} & A description of the physical or
virtual format and intellectual content of a single resource (a book,
video, map, etc.) encoded in a standardized format such as
\texttt{MARC}. \\
\texttt{MARC\ (Machine-Readable\ Cataloging)} & A family of
international standards for the representation and communication of
bibliographic, authority, holdings, classification, and related
information in machine-readable form, based upon the Format for
Information Exchange, ISO 2709. MARC standards define the three elements
of record structure, content designation, and data content. MARC 21,
originally developed by the Library of Congress in the 1960s, is the
most widely used of the MARC standards. UNIMARC (Universal MARC
Format), developed by the International Federation of Library
Associations and Institutions (IFLA) in the 1970s, is the second most
widely used MARC standard. \\
\texttt{authority\ control} & Verifies an access point in a
\texttt{bibliographic\ record} against an internal or external authority
file such as the Library of Congress Authority File and, if a matching
authority record exists, links the access point to the corresponding
authority record. If the authority record is updated, the controlled
(linked) access point in bibliographic records is updated
automatically. \\
\texttt{access\ point} & A name, term, code, etc., representing a
specific entity that is indexed. \\
\texttt{authorized\ access\ point} & An access point, representing an
entity, formulated according to a specified standard. \\
\texttt{surname} & A name used as a family name that may precede or
follow a \texttt{given\ name}, depending on the culture. \\
\texttt{multipart\ surname} & \texttt{Surname} that includes prefixes,
hyphenated names, or names that begin with articles or prepositions. \\
\texttt{given\ name} & A name chosen for a person at birth that
identifies and differentiates that person from others in the same
family. Depending on the culture a person is born into, the given name
can precede or follow a \texttt{surname} (i.e.~family name). A given
name may also be known as a \texttt{forename}, \texttt{first\ name}, or
\texttt{personal\ name}. \\
\texttt{corporate\ name} & The name of an agency, association, business,
firm, government, institution, nonprofit enterprise, performing group,
etc. used as an authorized access point in a bibliographic record. \\
\texttt{title} & A word, phrase, character, or group of characters,
normally appearing on a resource, that names the manifestation or the
work contained in it. \\
\texttt{subtitle} & A word, character(s), or phrase that appears in
conjunction with, and is subordinate to, a title proper of a
manifestation. Also known as other title information. \\
\texttt{preferred\ title} & A title forming the
\texttt{authorized\ access\ point} that identifies a resource,
especially if it has appeared under varying titles. Preferred titles
generally serve one of two purposes: collocating versions of the
resource including complete works, works in a particular literary or
musical form (sonatas, songs) and distinguishing between different
resources with the same or similar titles. Uniform title is the term
used by AACR2, and Preferred title is the term used by \texttt{RDA}. \\
\texttt{RDA} & \texttt{Resource\ Description\ and\ Access}. An
international standard for creating library and cultural heritage
resource metadata that are well-formed according to international models
for user-focused linked data applications. RDA was created by the RDA
Steering Committee (RSC) to replace the Anglo-American Cataloguing
Rules, 2nd Edition Revised (AACR2), which were first published in 1978.
RDA continues to be developed in a collaborative process led by the RSC
in line with a set of objectives and principles informed by the
Statement of International Cataloguing Principles. \\
\texttt{metadata} & Literally, data about data. It is descriptive
information about a particular data set, object, or resource, including
how it is formatted, and when and by whom it was collected. Originally
metadata most commonly referred to digital resources, but now can refer
to any physical or electronic resource. It may be created automatically
using software or entered by hand. \\
\texttt{identifier} & A term, number, or name used to refer to a library
resource, library metadata description, or an entry within an
ontology. \\
\texttt{subject} & The topic treated, or matter discussed, in a
resource. What a resource is about. \texttt{Subject\ schemes} (for
example, Library of Congress Subject Headings {[}LCSH{]}) use a
controlled vocabulary to categorize library materials about the same
subject. \\
\texttt{subject\ scheme} & \texttt{Subjects} categorize library material
and provide controlled access to the content of resources.
\texttt{Schemes} define concepts and relationships between concepts to
support user navigation. \texttt{Subject\ schemes}, such as Library of
Congress Subject Headings (LCSH), use a controlled vocabulary; that is,
they use the same terms to categorize the library material about the
same subject. For example, a resource about atomic structure and another
resource about neutrons can have the same subject entry, Nuclear
physics. \\
\end{longtable}
\section*{}\label{section}
\addcontentsline{toc}{section}{}
\markright{}
\texttt{statement}: a simple element of knowledge with a true or false
value; an \emph{atomic statement} is a declarative sentence that
attributes one property or relationship to an object or event.
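\texttt{semantic\ triple} (or \texttt{RDF\ triple}, or simply
\texttt{triple}): a statement encoded as a set of three entities in the
form subject--predicate--object.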
\bookmarksetup{startatroot}
\chapter{Inspiration}\label{sec-inspiration}
Data curators, as professionals, are responsible for managing,
maintaining, and enhancing the quality of an organisation's data. Their
work is instrumental in making data easily accessible, accurate, and
relevant to the organisation's needs. In large organisations, they
collaborate closely with data and knowledge engineers, analysts,
scientists, and other stakeholders to establish a robust data ecosystem.
Because micro- and small enterprises (and institutions) dominate the
music sector and many other creative industries and service sectors,
these sectors have very few competent data curators and specialised data
or knowledge engineers. Our approach to solving this problem is to do
the curatorial and engineering work collectively.
\begin{itemize}
\item[$\boxtimes$]
We pool those music experts within the stakeholder network who have
data curatorial skills (for example, music librarians) or, due to
their job, have background skills or an affinity to data curation.
\item[$\boxtimes$]
We provide these data curators with robust tools that only require a
little learning.
\item[$\boxtimes$]
We centralise all the knowledge and data engineering work in the
centre of the data-sharing space, i.e., at the Open Music Observatory.
\end{itemize}
To become a data curator, you do not need to be a data scientist,
statistician, librarian, or data engineer. We are looking for
professionals, researchers, or citizen scientists who have deep
subject-domain knowledge about the data we want to improve: they know a
lot about organs in churches, about species of wild bees, music
publishing, or any other domain on which we collect data. Our ideal
curators share a passion for data-driven evidence or visualisations, can
learn tools that Wikipedia editors use, and have a robust and subjective
idea about the data that would inform them in their work.
\section{We need data}\label{sec-inspiration-data-need}
\subsection{No Data is Available: This Scientist Stung Himself With
Dozens Of Insects Because No One Else
Would}\label{no-data-is-available-this-scientist-stung-himself-with-dozens-of-insects-because-no-one-else-would}
\begin{figure}[H]
{\centering \includegraphics{png/inspiration/schmidt_pain_index.png}
}
\caption{Good data curators are people who share a passion for
measuring, recording and categorising the knowledge about their field,
be it insects, music, or the informal economy.}
\end{figure}%
The \texttt{Schmidt\ Pain\ Index}, as it is informally known, runs from
1 to 4. The common honey bee serves as its anchor point, a solid 2. At the
top end of the scale lie the bullet ant and the tarantula hawk (which is
neither a tarantula nor a hawk; it's
\href{https://www.wired.com/2015/07/absurd-creature-of-the-week-tarantula-hawk/}{a
wasp}). Watch the video with
\href{https://youtu.be/i0LjT-qkUes}{Dr.~Schmidt}, and listen to the
whole interview
\href{https://podcasts.apple.com/us/podcast/48-the-schmidt-sting-pain-index/id1011406983?i=1000391467968}{here}.
\href{https://fivethirtyeight.com/features/this-scientist-stung-himself-with-dozens-of-insects-because-no-one-else-would/}{⏯
This Scientist Stung Himself With Dozens Of Insects Because No One Else
Would}.
\subsection{Nobody Counted Them Before: Big Data Is Saving This Little
Bird}\label{nobody-counted-them-before-big-data-is-saving-this-little-bird}
``We need to improve conservation by improving wildlife monitoring.
Counting plants and animals is really tricky business.''
\href{https://fivethirtyeight.com/features/big-data-is-saving-this-little-bird/}{⏯
Big Data Is Saving This Little Bird}
\section{From Datasets and Files to Living Web
Resources}\label{sec-web-30}
\subsection{Web 3.0}\label{web-3.0}
\begin{figure}[H]
{\centering \includegraphics{png/inspiration/weblinks.png}
}
\caption{We structure your knowledge so that the resulting datasets can
be connected to one another, much like pages on the World Wide Web.}
\end{figure}%
\section{Remain Critical: Ethical Data, Trustworthy
AI}\label{critical-attitude}
Sometimes we get our hands on data that looks like a unique starting
point for creating a new indicator. But our indicator will be flawed if
the original dataset is flawed. And it can be flawed in many ways: most
likely, some important aspect of the information was omitted, or the
data is self-selected, for example, by under-sampling women, people of
colour, or observations from small or less developed countries.
\subsection{Machine Learning from Bad Data: Weapons of Math Destruction,
Algorithms of
Oppression}\label{machine-learning-from-bad-data-weapons-of-math-destruction-algorithms-of-oppression}
Cathy O'Neil:
\href{https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction}{⏯
Weapons of Math Destruction}. In O'Neil's account, weapons of math
destruction are mathematical models or algorithms that claim to quantify
important traits (teacher quality, recidivism risk, creditworthiness)
but have harmful outcomes and often reinforce inequality, keeping the
poor poor and the rich rich. They have three things in common: opacity,
scale, and damage. See the review at
\url{https://blogs.scientificamerican.com/roots-of-unity/review-weapons-of-math-destruction/}.
In \href{https://nyupress.org/9781479837243/algorithms-of-oppression/}{⏯
Algorithms of Oppression}, Safiya Umoja Noble challenges the idea that
search engines like Google offer an equal playing field for all forms of
ideas, identities, and activities. Data discrimination is a real social
problem; Noble argues that the combination of private interests in
promoting certain sites, along with the monopoly status of a relatively
small number of Internet search engines, leads to a biased set of search
algorithms that privilege whiteness and discriminate against people of
colour, especially women of colour.
\subsection{Big Data Creates Inequalities: Data
Feminism}\label{big-data-creates-inequalities-data-feminism}
Catherine D'Ignazio and Lauren F. Klein:
\href{https://mitpress.mit.edu/books/data-feminism}{⏯ Data Feminism}.
This is a much-celebrated book, and with good reason. It views AI and
data problems from a feminist point of view, but the examples and the
toolbox can easily be applied to biases against small countries, racial
or ethnic groups, or small enterprises. It is a very good introduction
to the injustice of big data and the fight for a fairer use of data, and
to how bad data collection practices lead, through ``garbage in, garbage
out'', to misleading information or even misinformation.
\subsection{Bad Data Collection Used for Modeling: Why The Bronx
Burned}\label{bad-data-collection-used-for-modeling-why-the-bronx-burned}
\href{https://fivethirtyeight.com/features/why-the-bronx-really-burned/}{Why
The Bronx Burned}. Between 1970 and 1980, seven census tracts in the
Bronx lost more than 97 percent of their buildings to fire and
abandonment. In his book
\href{https://www.amazon.com/Fires-Computer-Intentions-City-Determined/dp/1594485062}{⏯
The Fires}, Joe Flood blames the misguided ``best and brightest'' effort
by New York City to increase government efficiency. With the help of the
Rand Corp., the city tried to measure fire response times, identify
redundancies in service, and close or re-allocate fire stations
accordingly. What resulted, though, was a perfect storm of bad data: The
methodology was flawed, the analysis was rife with biases, and the
results were interpreted in a way that stacked the deck against poorer
neighbourhoods. The slower response times allowed smaller fires to rage
uncontrolled in the city's most vulnerable communities. Listen to the
podcast
\href{https://podcasts.apple.com/us/podcast/19-why-the-bronx-burned/id1011406983?i=1000391467912}{here}.
\subsection{Bad Incentives Are Blocking Better
Science}\label{bad-incentives-are-blocking-better-science}
\href{https://fivethirtyeight.com/features/podcast-bad-incentives-are-blocking-better-science/}{Bad
Incentives Are Blocking Better Science}: ``There's a difference between
an answer and a result. But all the incentives are pointing toward
telling you that as soon as you get a result, you stop.'' After the
deluge of retractions, the stories of fraudsters, the false positives,
and the high-profile failures to replicate landmark studies, some people
have begun to ask:
``\href{https://fivethirtyeight.com/features/science-isnt-broken/}{⏯ Is
science broken?}'' Listen to the podcast
\href{https://podcasts.apple.com/us/podcast/10-science-is-hard/id1011406983?i=1000391467935}{⏯
Science is Hard}.
\section{Reality Check}\label{reality-check}
\subsection{Looking Behind Data: Moving to America's Worst Place to
Live}\label{looking-behind-data-moving-to-americas-worst-place-to-live}
Christopher Ingraham wrote
\href{https://www.washingtonpost.com/news/wonk/wp/2015/08/17/every-county-in-america-ranked-by-natural-beauty/}{⏯
a quick blog post} for The Washington Post about an obscure USDA data
set called the \texttt{natural\ amenities\ index}, which attempts to
quantify the natural beauty of different parts of the country. He
described the rankings, noted the counties at the top and bottom, hit
publish and did not think much of it. Almost immediately, he started to
hear from the residents of northern Minnesota, who were not very happy
that Ingraham had written, ``The absolute worst place to live in America is
(drumroll, please) \ldots{} Red Lake County, Minn.'' He could not have
been more wrong \ldots{} a year later
\href{https://fivethirtyeight.com/features/he-called-it-americas-worst-place-to-live-now-hes-moving-there/}{he
moved} to Red Lake County with his family.
\bookmarksetup{startatroot}
\chapter{Tidy work}\label{sec-tidy}
\begin{quote}
Все счастливые семьи похожи друг на друга, каждая несчастливая семья
несчастлива по-своему. All happy families are alike; each unhappy family
is unhappy in its own way.
\end{quote}
Our \texttt{OpenCollections} systems are complex systems that are
intended to be used in trustworthy AI applications. They follow the Anna
Karenina principle: a deficiency in any one of a number of factors dooms
an endeavour to failure. Consequently, a successful endeavour (subject to this
principle) is one for which every possible deficiency has been avoided.
Once the data is messy, there is a semantic ambiguity (an ambiguity in
the intended use or meaning of data) that will render automation
impossible or will lead to logical faults when software agents or
algorithms use your data. You must keep your numeric and text data tidy
at all times. The best way to keep data and text tidy is to keep it
simple. Very simple.
Simplicity is simple, if you start simple and keep it that way.
Simplifying messy text and messy data is always challenging.
Collective work involving data and data annotations and descriptions
requires a shared understanding of the syntax and file formats.
Our curators need to be familiar with two ideas.
\begin{itemize}
\item[$\boxtimes$]
Tidy data means that tabular datasets are organised in a simple but
particular manner. All observations are in rows, and all measured
variables or characteristics are in columns, with no merged columns or
rows. This is the optimal formatting for working with relational
databases, and it is also a helpful start for graph databases. (See:
Section~\ref{sec-tidy-data}.)
\item[$\boxtimes$]
  Word processors like Word run on different operating systems such as
  Windows, macOS, and Linux, creating very different text files and
  adding their own formatting and other metadata to what you type. When
  we work together on the World Wide Web, we need something simpler than
  HTML but a bit richer than a plain text file, clearly separating
  text editing from text formatting. The various markup notations, for
  example, \emph{markdown}, are conventions for indicating that you want
  to make a part of the text \textbf{bold} or \emph{italic}, and they
  work on all computer systems in exactly the same way. (See:
  Section~\ref{sec-markup-text}.)
\end{itemize}
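As a rough illustration of the first idea, the table below sketches a
tidy layout. The column names and values are invented for this example
only: each row is one observation (here, one recording) and each column
is one variable.
\begin{longtable}[]{@{}llr@{}}
\toprule\noalign{}
\texttt{recording\_id} & \texttt{performer} & \texttt{duration\_sec} \\
\midrule\noalign{}
\endhead
\bottomrule\noalign{}
\endlastfoot
\texttt{rec-001} & Example Performer A & 215 \\
\texttt{rec-002} & Example Performer B & 187 \\
\texttt{rec-003} & Example Performer A & 242 \\
\end{longtable}
A layout that merged the two cells of Example Performer A, or that
stored two durations in a single cell, would no longer be tidy.
For the second idea, the line below shows how the widely used Markdown
convention marks up plain text: double asterisks request bold and single
asterisks request italics. The sentence itself is only a placeholder.
\begin{verbatim}
This word is **bold** and this word is *italic*.
\end{verbatim}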
\section{Tidy data}\label{sec-tidy-data}