Thoughts about the redesign (well, actually the "design", since originally
there was none) of `datalad crawl`.
Global portion of the config
============================

::

    [dataset]
    path =
    description =
    exec =
Data providers
==============

The `crawl` command collects data that may be spread across different
remote data providers (regular HTTP websites, AWS S3 buckets, etc.) and
consolidates access to it within a single git-annex'ed repository.
`crawl` should also keep track of the status/versions of the files, so
that in case of updates (changes, removals, etc.) on the remote sites
the git-annex repository can be updated correspondingly.
Common config specs::

    [provider:XXX]
    type = (web|s3http|merge|git-annex)  # defaults to web
    branch = master                      # defaults to master
    commit_to_git =                      # regexps of file names to commit directly to git
    ignore =                             # files to ignore entirely
    drop = False                         # whether to drop the load upon 'completion'
    # some sanity checks
    check_entries_limit = -1             # no limit by default
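
A minimal sketch of how such a config could be read with the stock
``configparser`` module; the section/option names follow the draft above,
while the ``load_providers`` helper and the splitting of multi-line values
into lists are assumptions for illustration only::

    import configparser

    def load_providers(path):
        """Read [provider:*] sections from a crawl config (hypothetical helper)."""
        cfg = configparser.ConfigParser()
        cfg.read(path)
        providers = {}
        for section in cfg.sections():
            if not section.startswith('provider:'):
                continue
            name = section.split(':', 1)[1]
            opts = dict(cfg[section])
            # multi-line values (e.g. merge =, filters =) become lists
            for key, value in opts.items():
                if '\n' in value:
                    opts[key] = [v.strip() for v in value.splitlines() if v.strip()]
            providers[name] = opts
        return providers
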
(To be) Supported data providers
--------------------------------

Web
~~~

In many use cases data are hosted on a public portal, lab website,
personal page, etc.  Such data are often provided as tarballs, which
need to be downloaded and later extracted.  Extraction will not be a
part of this provider -- only download from the web::
    [provider:incoming_http]
    type = web
    mode = (download|fast|relaxed)  # fast/relaxed/download
    filename = (url|request)        # of course could also be _e'valuated given the bs4 link: get_video_filename(link, filename)
    recurse_(a|href) =              # regexes to recurse into
    # mimicking scrapy
    start_urls = http://...         #
    certificates =                  # if there is https -- we need to allow specifying those
    allowed_domains = example.com/buga/duga  # to limit recursion
        sample.com
    excluded_hrefs =                # do not even search for "download" URLs on given pages; should it also be allowed to be a function/callback deciding based on the request?
    include_(a|href) =              # what to download
    exclude_(a|href) =              # and what not (even if it matches)
    ???generators = generate_readme # define some additional actions to be performed....
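
A rough sketch of how the recursion/inclusion regexes above could drive a
simple crawler; it uses ``requests`` plus the stdlib, and the helper name
``harvest_urls`` as well as the exact option semantics are assumptions, not
the eventual implementation (which may well end up on top of scrapy)::

    import re
    from urllib.parse import urljoin, urlparse

    import requests

    def harvest_urls(start_urls, allowed_domains,
                     include_href, exclude_href=None, recurse_href=None):
        """Collect candidate download URLs by following <a href=...> links (sketch)."""
        seen, to_visit, harvested = set(), list(start_urls), []
        while to_visit:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            page = requests.get(url).text
            for href in re.findall(r'href="([^"]+)"', page):
                link = urljoin(url, href)
                # simplification: allowed_domains matched on host only
                if urlparse(link).netloc not in allowed_domains:
                    continue
                if exclude_href and re.search(exclude_href, link):
                    continue
                if re.search(include_href, link):
                    harvested.append(link)       # candidate for addurl/annexing
                elif recurse_href and re.search(recurse_href, link):
                    to_visit.append(link)        # crawl further
        return harvested
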
We need to separate the options for crawling (recursion etc.) from those
deciding what to download/annex.

Q: should we just specify xpaths for the information to be extracted
from a response corresponding to a matching URL? or from any crawled page?

Q: allow xpath syntax for better control of what to recurse into/include?

Q: authentication -- we should relate here to the Hostings

!: scrapy's Spider provides start_requests(), which could be used to
initiate the connection, e.g. to authenticate and then reuse that connection.

Authentication details must not be a part of the configuration, BUT it
must know HOW authentication should be achieved.  In many cases regular
netrc-style support (username/password) could suffice.
Those authenticators should later be reused by the "download clients".

Q: we might need to worry about/test virtually every possible scenario
associated with HTTP downloads, e.g. proxy support (with authentication).
Maybe we could just switch to aria2 and allow specifying access options?

Q: maybe (as a new provider?) allow using a scrapy spider's output to
harvest the table of links which need to be fetched
Use cases to keep in mind
+++++++++++++++++++++++++

- versioning present in the file names
  ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/sequence_indices/

- ha -- idea: all those should be referenced in some other branch, like
  with archives, and then the 'public' one would just take care of
  pointing to the "correct one" and serve a "compressed" view.
  Hence: monitor the original, point the "compressed" one to a branch,
  giving it a set of rules on how to determine the version, i.e. on which files.
  This way we could have both referenced in the same repository.
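
For the 1000genomes use case above, the version stamp is embedded in the
index file names; a small illustrative sketch of pulling it out with a
regex (the exact naming pattern here is an assumed example, not a fixed
convention)::

    import re

    # e.g. '20130502.sequence.index' -> ('20130502', 'sequence.index')
    VERSIONED = re.compile(r'^(?P<version>\d{8})\.(?P<rest>.+)$')

    def split_version(filename):
        """Return (version, base name) if the name carries a date stamp."""
        m = VERSIONED.match(filename)
        if not m:
            return None, filename
        return m.group('version'), m.group('rest')
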
Amazon S3
~~~~~~~~~

The initial emphasis will be on S3 buckets which have versioning
enabled, and which expose their content via regular http/https.

Tricky points:

- versioning (must be enabled; if a file was uploaded before versioning
  was enabled, its version is Null)
- etags are md5s BUT only if the upload was not multi-chunked, so
  it becomes difficult to identify files by md5sums (they must be
  downloaded first then, or some meta-info of the file should be modified so
  the etag gets regenerated -- which should result in the file md5sum
  appearing as the etag)
- S3 would most probably be just an additional -- not the primary -- provider
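
A minimal sketch of enumerating versions from such a bucket with boto3
(assuming versioning is enabled; the bucket name and credential handling
are placeholders)::

    import boto3

    def list_bucket_versions(bucket_name):
        """Yield (key, version_id, etag, last_modified) for every object version."""
        s3 = boto3.client('s3')
        paginator = s3.get_paginator('list_object_versions')
        for page in paginator.paginate(Bucket=bucket_name):
            for ver in page.get('Versions', []):
                # the ETag equals the md5 only for non-multipart uploads (see above)
                yield (ver['Key'], ver['VersionId'],
                       ver['ETag'].strip('"'), ver['LastModified'])
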
Generated
~~~~~~~~~

We should allow for files to be generated based on the content of the
repository and/or the original information from the data providers,
e.g. the content of the webpages containing the files to be
downloaded/referenced.  Originally envisioned as a separate branch,
where only archived content would be downloaded and later extracted
into corresponding locations of the "public" branch (e.g. master).
But maybe it should be more similar to the "versioning" idea stated
above, where it would simply be an alternative "view" of another branch
in which some content is simply extracted.  I.e. all those modifications
could be assembled as a set of "filters"::

    [generator:generate_readme]
    filename = README.txt
    content_e = generate_readme(link, filename)  # those should be obtained/provided while crawling

or::

    [generator:fetch_license]
    filename = LICENSE.txt
    content_e = fetch_license(link, filename)  # those should be obtained/provided while crawling
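
A sketch of what such a generator callable might look like; the
``(link, filename)`` signature follows the ``content_e`` examples above,
while everything inside the function body (including the attributes of
``link``) is an assumption::

    def generate_readme(link, filename):
        """Render README content from crawl-time information (illustrative only)."""
        # `link` is assumed to carry the page URL and title harvested while crawling
        return (
            "Dataset crawled by datalad\n"
            "==========================\n\n"
            "Origin: %s\n"
            "Source page title: %s\n" % (link.url, getattr(link, 'title', 'n/a'))
        )
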
Merge
~~~~~

Originally fetched File's might reside in e.g. an 'incoming' branch while
the 'master' branch could then be 'assembled' from a few other branches
with the help of filtering::

    [provider:master]
    type = merge
    branch = master   # here it matches the name, but see below if we need to repeat
    merge = incoming_data_http
        incoming_meta_website
    filters = extract_models
        extract_data
        generate_some_more_files_if_you_like
Q: should we maybe 'git merge --no-commit' and then apply the
filters???

Probably not, since there could be conflicts if a similarly named file
is present in the target branch (e.g. generated) and was present
(moved/renamed via filters) in the original branch.

Q: but merging of branches is way too cool and better establishes the
'timeline' and dependencies...
So the merge should be done "manually" by doing (there must be a cleaner way)::

    git merge -s ours --no-commit
    git rm -r *
    # - collect and copy files for all the File's from the branches to .
    # - stage all the files
    # - pipe those File's from all the branches through the filters
    #   (those should, where necessary, use git rm, mv, etc)
    # - add those File's to git/git-annex
    git commit
But what if a filter (e.g. cmd) requires the current state of files from
different branches?... All possible conflict problems could be
mitigated by storing the content in the branches under some directories,
then manipulating upon "merge" and renaming before actually 'git merging'.

Q: what about filters per incoming branch??? We could add options to the
filter specification
(e.g. extract_models[branches=incoming_data_http]), or allow
only a regular 2-edge merge at a time, but multiple times...
XNAT, COINS, ...
~~~~~~~~~~~~~~~~

Later ... but the idea should be the same I guess: they should expose
collections of File's with a set of URIs so they could be addurl'ed to
the files.  It is not clear yet whether they would need to be crawled
or whether they would provide some API similar to S3 to request all the
necessary information.
git/git-annex
~~~~~~~~~~~~~

If the provider is already a Git(-annex) repository.  Use case:
forrest_gump.  So it is pretty much a regular remote **but** it might
benefit from our filters etc.
torrent
~~~~~~~

I guess similar/identical to archives if a torrent points to a single
file -- so just 'addurl'.  If a torrent provides multiple files, we would
need a mapping of UUIDs, I guess, back to torrents/corresponding files.
So again -- similar to archives...?

aria2 seems to provide unified HTTP/HTTPS/FTP/BitTorrent support, with
fancy simultaneous fetching from multiple remotes/feeding back to the
torrent swarm (caution for non-free data).  It also has RPC support,
which seems to be quite cool and might come in handy (e.g. to monitor
progress etc.).
Wild: Git repository to be rewritten
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Theoretically we could collect all the information to rewrite some
other Git repo, but now injecting some files into git-annex (while
possibly even pointing for the load to e.g. the original SVN repo).

Tricky:

- branches and merges -- would be really tricky, and so far it is not
  envisioned how
- "updates" should correspond to commits in the original repository
- all the commit information should be extracted/provided for the
  commit here
Filters
=======

Considering the idea that all the modifications (archive extraction,
versioning, etc.) could be made through monitoring of (an)other branch(es)
and applying a set of filters.

- files which aren't modified should also propagate into the target
  branch, along with all their urls

Going file by file wouldn't work, since a filter might need to analyze the
entire list of files...::
    def apply_filters(self, files_in):
        """Pass the full list of files through each filter in turn."""
        files_out = files_in
        for filt in self.filters:
            files_out = filt.apply(files_out)
        return files_out
Then each filter would decide on how to treat the list of files.
Maybe some subtyping of filters would be desired
(PerfileFilter/AllfilesFilter); a sketch follows this list.

- filters should provide an API to 'rerun' their action and obtain the
  same result.
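
A minimal sketch of the PerfileFilter/AllfilesFilter split mentioned above;
the class and method names beyond those two are assumptions for
illustration::

    class Filter(object):
        """Base class: takes the full list of File's, returns a new list."""
        def apply(self, files):
            raise NotImplementedError

    class AllfilesFilter(Filter):
        """Filter that needs to see the entire list at once (e.g. merge, extract)."""
        pass

    class PerfileFilter(Filter):
        """Filter that treats files independently; subclasses override apply_one()."""
        def apply_one(self, file_):
            return [file_]              # default: pass the file through unchanged

        def apply(self, files):
            out = []
            for f in files:
                out.extend(self.apply_one(f))
            return out
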
Cross-branch
------------

Some filters are to be applied on files from one branch, with the results
placed into another:

Extract
~~~~~~~

A special kind of a beast: while keeping the original archive (obtained
from any other provider, e.g. 'Web') under git-annex, we extract
the load (possibly with some filtering/selection):
Q: how to deal with extraction from archives -- extraction had
   better be queued, so that multiple files are extracted from the archive at
   once.  But ATM that would not happen, since all those URIs will
   simply be requested by plain wget/curl calls by git-annex, one file
   at a time.

A: upon such a first call, check if there is a .../extracted_key/key/; if
   there is -- use it.  If not -- extract and then use it.  use = hardlink
   into the target file.
   Upon completion of `datalad get` (or some other command) verify
   that all `/extracted/` are removed (and/or provide a setting -- maybe
   we could/should just keep those around)
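
A sketch of that extract-once/hardlink logic, assuming patoolib for the
actual extraction; the cache layout keyed by archive name and the function
name are assumptions::

    import os
    import os.path as op

    import patoolib

    def get_from_archive(archive_path, member, target, cache_dir):
        """Hardlink `member` of `archive_path` into `target`, extracting only once."""
        key_dir = op.join(cache_dir, op.basename(archive_path))
        if not op.exists(key_dir):
            os.makedirs(key_dir)
            # extract the whole archive once; later requests reuse the cache
            patoolib.extract_archive(archive_path, outdir=key_dir)
        source = op.join(key_dir, member)
        if op.lexists(target):
            os.unlink(target)
        os.link(source, target)     # hardlink to avoid a second copy on disk
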
Config Examples:
++++++++++++++++

::

    [filter:extract_models]
    filter = extract              # by default taken as the element after "filter:"
    input = .*(\S+)_models\.tgz$  # and those files are not provided into the output
    output_prefix = models/$1/    # somehow we should allow reusing the input regex's groups
    exclude =                     # regex for files to be excluded from extraction, or passed straight to tar?
    strip_path = 1

Probably we will just use patoolib (I do not remember if it has
strip_path... seems not:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757483)
URI: dl:extract:UID

We keep information on what 'key' (archive) it came from and into what file
(which might later get renamed, so extraction from the archive shouldn't
later happen in-place, but rather outside and then be moved accordingly).

Tricky point(s):

- maybe by default we should still extract all known archive types and
  just rely on the filename logic?
- the same file might be available from multiple archives,
  so we would need to keep track, across previous updates, of which
  archives the files could be fetched from.
- how to remove if an archive is no longer available?
  Probably some fsck should take care of checking whether archives
  are still available, and if not -- remove the url.
- keep track of which files came from the archive, so we could later
  remove them if the archive now misses the file.

Q: allow for 'relaxed' handling?
   If tarballs are not versioned at all, but we would like to create an
   overall (? or just per-file) 'relaxed' git-annex?
   Probably no complication if URIs will be based (natively) on the
   fast or relaxed keys.  Sure thing, things would fail if the archive was
   changed and lacks the file.

Q: hm -- what about MD5SUM checking? e.g. if the archive was posted with
   an MD5SUMS file.
   I guess some kind of additional filter which could be attached
   somehow?
Move/Rename/Delete
~~~~~~~~~~~~~~~~~~

Just move/rename/delete some files around, e.g. for a custom view of
the dataset (e.g. to conform to the OpenfMRI layout).  The key would simply
be reused ;)

Q: should it be a 'Within-branch' filter?
Command
~~~~~~~

A universal filter which would operate on some files and output
possibly in-place or modified ones...
Then it would need to harvest and encode into the file's URI the
provenance -- i.e. so it could later be recreated automagically.
For simple use cases (e.g. creation of a lateralized atlas in HOX, some
data curation, etc.)

URI: dl:cmd:UID

while we keep a file providing the corresponding command for each UID,
where ARGUMENTS would point to the original files' keys in the git
annex.  Should it be kept in PROV format maybe???
Config Examples::

    [filter:command_gunzip]
    in1 = .*\.gz
    in2_e = in1.replace('.gz', '')
    #eval_order = in1 in2
    command = zcat {in1} > {in2}
    output_files = {in2}
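
A sketch of how such a command spec might be expanded and executed; the
option names mirror the config above, while ``run_command_filter``, the
hardcoded ``in2`` expression and the use of ``str.format`` plus the shell
are illustrative assumptions::

    import re
    import subprocess

    def run_command_filter(spec, files):
        """Apply a [filter:command_*] spec to every matching file (sketch)."""
        outputs = []
        for in1 in files:
            if not re.match(spec['in1'], in1):
                continue
            # 'in2_e' is an expression evaluated against the inputs; hardcoded here
            in2 = in1.replace('.gz', '')
            cmd = spec['command'].format(in1=in1, in2=in2)  # e.g. "zcat {in1} > {in2}"
            subprocess.check_call(cmd, shell=True)
            outputs.append(in2)   # provenance (dl:cmd:UID) would be recorded here
        return outputs
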
Problems:

- it might be tricky to provide a generic enough interface?
- we need plenty of use cases to get it right, so this one is just
  to keep in mind for the future -- might be quite cool after all.

Within-branch
-------------

Other "filters" should operate within the branch, primarily simply for
checking the content.

Checksum
~~~~~~~~

E.g. point to an MD5SUMS file stored in the branch, describe how the file
names must be augmented, run the verification -- no files are output, just
the status.
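
A small sketch of such a checksum filter, assuming the conventional
``md5sum``-style layout of the MD5SUMS file (``<hexdigest>  <filename>``
per line); the function name is illustrative::

    import hashlib
    import os.path as op

    def verify_md5sums(md5sums_path, root='.'):
        """Return {filename: True/False} for each entry of an MD5SUMS file."""
        status = {}
        with open(md5sums_path) as f:
            for line in f:
                if not line.strip():
                    continue
                expected, filename = line.split(None, 1)
                filename = filename.strip().lstrip('*')   # '*' marks binary mode
                with open(op.join(root, filename), 'rb') as data:
                    actual = hashlib.md5(data.read()).hexdigest()
                status[filename] = (actual == expected)
        return status
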
Addurl
~~~~~~

If the repository is going to be / was also published online under some URL,
we might like to populate the files with the corresponding urls::

    [filter:addurl]
    prefix = http://psydata.ovgu.de/forrest_gump/.git/annex/
    check = (False|True)   # whether to verify presence or not ???

Use case -- Michael's forrest_gump repository.  Right now files are not
associated explicitly with that URL -- only via a regular git remote.
This makes work with clones cumbersome, since they all must have the
original repository added as a remote.

`check = False` could be the one needed for a 'publish' operation
where the data present locally is not yet published anywhere.
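
A sketch of applying such a prefix with the stock ``git annex addurl``
command (``--relaxed`` skips the availability check, roughly matching
``check = False`` above); the helper name is an assumption::

    import subprocess

    def addurl_with_prefix(files, prefix, check=True):
        """Register `prefix + filename` as an additional URL for each annexed file."""
        for filename in files:
            cmd = ['git', 'annex', 'addurl', '--file=' + filename]
            if not check:
                cmd.append('--relaxed')   # do not verify the URL is reachable now
            cmd.append(prefix + filename)
            subprocess.check_call(cmd)
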
Tagging
~~~~~~~

We might like to tag files... TODO: think about what to provide/use to
develop nice tags.

Ideas:

- a tag given a set of filename regexps::

    [tag:anatomicals]
    files = .*\_anat\.nii\.gz
    tag = modality=anatomy

  or just::

    [tag:anatomicals]
    files = .*\_anat\.nii\.gz
    tag = anatomy

  if it is just a tag (anatomy) without a field

- a (full) filename regexp with groups defining possibly multiple
  tag/value pairs (a sketch follows this list)::

    [tag:modality]
    files = .*\_(?P<modality>\S*)\.nii\.gz
    translate = anat: T1   # might need some translation dictionary?
        dwi: DTI
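
A sketch of deriving tag/value pairs from the named-group regex above and
attaching them via ``git annex metadata``; the translation table and the
helper name are assumptions::

    import re
    import subprocess

    TRANSLATE = {'anat': 'T1', 'dwi': 'DTI'}    # assumed translation dictionary

    def tag_by_filename(filename, pattern=r'.*_(?P<modality>\S*)\.nii\.gz'):
        """Set metadata fields derived from named regex groups (sketch)."""
        m = re.match(pattern, filename)
        if not m:
            return
        for field, value in m.groupdict().items():
            value = TRANSLATE.get(value, value)
            subprocess.check_call(
                ['git', 'annex', 'metadata',
                 '--set', '%s=%s' % (field, value), filename])
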
Notes:

- metadata can be added only to files under git-annex control, so not to
  those committed directly to git
Design thoughts
===============

Data providers should provide a unified interface.

DataProvider
~~~~~~~~~~~~
Common parameters:

- add_to_git -- what files to commit to git directly (should we leverage
  the git-annex largefiles option somehow?)
- ignore -- what files to ignore
- get_items(version=None) -- return a list of File's
- get_item_by_name
- get_item_by_md5

  - should those be additional interfaces?
  - what if multiple items fulfill the request (the content is the same,
    e.g. empty, but the names differ) -- do we pick the most appropriate
    name or not give a damn?
  - what if there is a collision????

- get_item_by_sha256

  - e.g. natively provided by the 'Branch' provider for annexed files
    (what to do about git-committed ones -- compute/keep the info?)

- get_versions(min_version=None)

  A provider-wide version (i.e. not per file), e.g. an S3
  provider can have multiple versions of files.
  It might be that it needs to return a DAG of versions, i.e.
  (version, [prev_version1, prev_version2, ...]) tuples, to represent e.g.
  the history of a Git repo.  In most cases this would degenerate to just
  one previous version, in which case it could just be (version, ).
  We would need to store that meta-information for future updates, at least
  for the last version, so we could 'grow' next ones on top.

- ? get_release_versions() -- by default identical to the above... but it
  might differ (there was an update, but no new official release (yet), so
  no release tag)
- get_version_metainformation() -- primarily conceived when thinking
  about monitoring other VCS repos... so it should be the information to be
  used for a new Git commit into this new repository

A sketch of such a provider interface follows the note below.
.. note::

   Keep in mind:

   - the Web DataProvider must have an option to request the content filename
     (addressing the use case with redirects etc.)
   - some providers might have multiple URIs (mirrors), so right away
     assign them per each file... As such they might come from
     different Hostings!
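
A minimal sketch of that interface as a Python abstract base class; the
method names follow the list above, while the signatures, defaults and the
``get_item_by_name`` fallback are assumptions::

    from abc import ABC, abstractmethod

    class DataProvider(ABC):
        """A provider is a factory of File's with provider-wide versions."""

        def __init__(self, add_to_git=None, ignore=None):
            self.add_to_git = add_to_git   # regexps of files to commit to git directly
            self.ignore = ignore           # regexps of files to skip entirely

        @abstractmethod
        def get_items(self, version=None):
            """Return a list of File's as of `version` (latest if None)."""

        @abstractmethod
        def get_versions(self, min_version=None):
            """Return (version, [prev_versions...]) tuples, oldest first."""

        def get_item_by_name(self, name, version=None):
            # naive default; concrete providers may look this up natively
            for item in self.get_items(version=version):
                if item.filename == name:
                    return item
            return None
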
File
~~~~

What would be saved as a file.  Should know about itself... and its origins!

- filename
- URIs -- list containing origins (e.g. URLs) of where to fetch it from.
  First provided by the original DataProvider, but might then be expanded
  using other DataProviders.

  Q: those might need to be not just URIs but some classes associated
  with the original Hosting's, e.g. for the cases of authentication etc?
  Or would we associate with a Hosting based on the URI?

A combination of the known fields below should be stored/used to detect
changes.  Different data providers might rely on a different subset of
them to see if there was a change.  We should probably assume some
"correspondence".

- key     # was thinking about Branch as a DataProvider -- those must be reused
- md5
- sha256
- mtime
- size

It will be the job of a DataProvider to initiate a File with the
appropriate filename.
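
A sketch of the File structure as a plain Python dataclass; the attribute
names follow the list above, while the defaults and the class shape are
assumptions::

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class File:
        """What would be saved as a file; knows about itself and its origins."""
        filename: str
        urls: List[str] = field(default_factory=list)   # URIs/origins to fetch from
        # fields used (in provider-specific subsets) to detect changes:
        key: Optional[str] = None        # git-annex key, when known
        md5: Optional[str] = None
        sha256: Optional[str] = None
        mtime: Optional[float] = None
        size: Optional[int] = None
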
URI
~~~

-> URL(URI): will be our first and main "target", but it could
also be direct S3, etc.

A URI should be associated with a "Hosting" (many-to-one), so we could
e.g. provide authentication information per actual "Hosting" as the
entity.  But now we are getting back to the DataProvider, which is the
Hosting, or actually also a part of it (since a Hosting could serve
multiple Providers, e.g. OpenfMRI -> a provider per each dataset?).
But a Provider might also use/point to multiple Hostings (e.g. the mirrors
listed on nitp-2013).

Hosting
~~~~~~~

Each DataProvider would be a factory of File's.
Ideas to not forget
~~~~~~~~~~~~~~~~~~~

- Before carrying out some operation, remember the state of all
  (involved) branches, so it would be very easy later on to "cancel"
  the entire transaction through a set of 'git reset --hard' or
  'update-ref's.  Keep a log of the above!  (See the sketch after this list.)
- multiple data providers could be specified, but there should be
  'primary' and 'complementary' ones:

  - primary provider(s) define the layout/content
  - complementary providers just provide references to additional
    locations where that data (uniquely identified via checksums etc.)
    could be obtained, so we could add more data-providing urls
  - Q: should all DataProvider's be able to serve as primary and complementary?

- most probably we should allow for an option to 'fail' or issue a
  warning in some cases:

  - a secondary provider doesn't carry a requested load/file
  - a secondary provider provides some files not provided by the primary
    data provider

- at the end of the crawl operation, verify that all the files have all
  and only the urls from the provided data providers
- allow adding/specifying conventional git/annex clones as additional,
  conventional (non-special) remotes.
- allow prepopulating URLs given e.g. prospective hosting on HTTP.
  This way, whenever content gets published there, all files would
  have the appropriate URLs associated and those would 'transcend' through
  the clones without requiring adding the original remote.
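
A sketch of the 'remember branch states' idea above, recording refs so a
failed crawl can be rolled back with ``git update-ref``; the helper names
and the in-memory log format are assumptions::

    import subprocess

    def snapshot_refs(branches):
        """Record the current commit of each involved branch before a crawl."""
        state = {}
        for branch in branches:
            state[branch] = subprocess.check_output(
                ['git', 'rev-parse', branch]).decode().strip()
        return state

    def rollback_refs(state):
        """Cancel the 'transaction' by resetting every branch to its recorded commit."""
        for branch, commit in state.items():
            subprocess.check_call(
                ['git', 'update-ref', 'refs/heads/%s' % branch, commit])
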
Updates
=======

- must track updates and removals of the files
- must verify presence (add, remove) of the urls associated with the
  files, given a list of data providers
Meta information
================

For a while now `git annex` has provided a neat feature allowing one to
assign tags to files and later use e.g. `git annex view` to quickly
generate customized views of the repository.
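
For instance, tags set via ``git annex metadata`` (as in the Tagging sketch
above) could be turned into a filtered checkout; a small usage sketch, with
the field name assumed::

    import subprocess

    # show only anatomical images by pivoting the repository on the 'modality' field
    subprocess.check_call(['git', 'annex', 'view', 'modality=anatomy'])
    # and go back to the original branch afterwards
    subprocess.check_call(['git', 'annex', 'vpop'])
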
Some cool related tools
=======================

https://github.com/scrapy/scrapely

  Pure Python (no DOM, lxml, etc.) scraping of pages, "training" the
  scraper given a sample.  Maybe it could be handy???

https://github.com/scrapy/slybot

  Brings together scrapy + scrapely to provide json specs for
  spiders/items/etc.
  Might be worth at least adopting the spider specs...?
  Has a neat slybot/validation/schemas.json which validates the schematic.