- This is a summary todo covering several subprojects, which would extend
+ This is a summary todo covering several subprojects, which extend
git-annex to be able to use proxies which sit in front of a cluster of
repositories.

@@ -12,7 +12,7 @@ repositories.

[[ !toc ]]

- ## planned schedule
+ ## plan

Joey has received funding to work on this.
Planned schedule of work:
@@ -24,94 +24,27 @@ Planned schedule of work:
* September: proving behavior of balanced preferred content with proxies
* October: streaming through proxy to special remotes (especially S3)

- [[ !tag projects/openneuro]]
-
- ## remaining things to do in October
-
- * Possibly some of the deferred items listed in following sections:
-
- ## items deferred until later for balanced preferred content and maxsize tracking
-
- * The assistant is using NoLiveUpdate, but it should be posssible to plumb
- a LiveUpdate through it from preferred content checking to location log
- updating.
-
- * ` git-annex info ` in the limitedcalc path in cachedAllRepoData
- double-counts redundant information from the journal due to using
- overLocationLogs. In the other path it does not (any more; it used to),
- and this should be fixed for consistency and correctness.
-
- * getLiveRepoSizes has a filterM getRecentChange over the live updates.
- This could be optimised to a single sql join. There are usually not many
- live updates, but sometimes there will be a great many recent changes,
- so it might be worth doing this optimisation. Persistent is not capable
- of this, would need dependency added on esquelito.
-
- ## items deferred until later for p2p protocol over http
+ > This project is now complete! [[ done]] --[[ Joey]]

- * Support proxying to git remotes that use annex+http urls. This needs a
- translation from P2P protocol to servant-client to P2P protocol.
-
- * Should be possible to use a git-remote-annex annex::$uuid url as
- remote.foo.url with remote.foo.annexUrl using annex+http, and so
- not need a separate web server to serve the git repository. Doesn't work
- currently because git-remote-annex urls only support special remotes.
- It would need a new form of git-remote-annex url, eg:
- annex::$uuid?annex+http://example.com/git-annex/
-
- * ` git-annex p2phttp ` could support systemd socket activation. This would
- allow making a systemd unit that listens on port 80.
-
- ## items deferred until later for [[ design/passthrough_proxy]]
-
- * Check annex.diskreserve when proxying for special remotes
- to avoid the proxy's disk filling up with the temporary object file
- cached there.
-
- * Resuming an interrupted download from proxied special remote makes the proxy
- re-download the whole content. It could instead keep some of the
- object files around when the client does not send SUCCESS. This would
- use more disk, but could minimize to eg, the last 2 or so.
- The design doc has some more thoughts about this.
-
- * Getting a key from a cluster currently picks from amoung
- the lowest cost remotes at random. This could be smarter,
- eg prefer to avoid using remotes that are doing other transfers at the
- same time.
+ [[ !tag projects/openneuro]]

- * The cost of a proxied node that is accessed via an intermediate gateway
- is currently the same as a node accessed via the cluster gateway.
- To fix this, there needs to be some way to tell how many hops through
- gateways it takes to get to a node. Currently the only way is to
- guess based on number of dashes in the node name, which is not satisfying.
+ ## some todos that spun off from this project and didn't get implemented during it:

- Even counting hops is not very satisfying, one cluster gateway could
- be much more expensive to traverse than another one.
+ For balanced preferred content and maxsize tracking:

- If seriously tackling this, it might be worth making enough information
- available to use spanning tree protocol for routing inside clusters.
+ * [[ todo/assistant_does_not_use_LiveUpdate ]]
+ * [[ todo/git-annex_info_with_limit_overcounts ]]

- * Speed: A proxy to a local git repository spawns git-annex-shell
- to communicate with it. It would be more efficient to operate
- directly on the Remote. Especially when transferring content to/from it.
- But: When a cluster has several nodes that are local git repositories,
- and is sending data to all of them, this would need an alternate
- interface than ` storeKey ` , which supports streaming, of chunks
- of a ByteString.
+ For p2p protocol over http:

- * Use ` sendfile() ` to avoid data copying overhead when
- ` receiveBytes ` is being fed right into ` sendBytes ` .
- Library to use:
- < https://hackage.haskell.org/package/hsyscall-0.4/docs/System-Syscall.html >
+ * [[ p2phttp_serve_multiple_repositories]]
+ * [[ git-remote-annex_support_for_p2phttp]]

- * Support using a proxy when its url is a P2P address.
- (Eg tor-annex remotes.)
+ For proxying:

- * When an upload to a cluster is distributed to multiple special remotes,
- a temporary file is written for each one, which may even happen in
- parallel. This is a lot of extra work and may use excess disk space.
- It should be possible to only write a single temp file.
- (With streaming this wouldn't be an issue.)
+ * [[ proxying_for_p2phttp_and_tor-annex_remotes]]
+ * [[ faster_proxying]]
+ * [[ smarter_use_of_disk_when_proxying]]

## completed items for October's work on streaming through proxy to special remotes