Commit 87871f7

split up remaining items from todo/git-annex_proxies and close it!
1 parent 9b7378f commit 87871f7

3 files changed, 27 additions and 82 deletions

Database/RepoSize.hs

Lines changed: 5 additions & 0 deletions
@@ -400,6 +400,11 @@ liveRepoOffsets (RepoSizeHandle (Just h) _) wantedsizechange = H.queryDb h $ do
 map (\(k, v) -> (k, [v])) $
 fromMaybe [] $
 M.lookup u livechanges
+-- This could be optimised to a single SQL join, rather
+-- than querying once for each live change. That would make
+-- it less expensive when there are a lot happening at the
+-- same time. Persistent is not capable of that join,
+-- it would need a dependency on esqueleto.
 livechanges' <- combinelikelivechanges <$>
 filterM (nonredundantlivechange livechangesbykey u)
 (fromMaybe [] $ M.lookup u livechanges)
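
The comment added in this hunk contrasts querying once per live change with a single join. As a rough illustration of that difference only, here is a minimal sketch that models the database as in-memory containers. It uses neither Persistent nor esqueleto, and every name in it (`Change`, `LiveChange`, `RecentChanges`, `filterPerItem`, `filterJoined`) is hypothetical rather than taken from git-annex. It also assumes that a live change counts as redundant when the same key already has the same change recorded, which may not match the real `nonredundantlivechange`.

```haskell
module Main where

import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Hypothetical stand-ins for illustration; not git-annex's real types.
type Key = String

data Change = AddingKey | RemovingKey
    deriving (Eq, Ord, Show)

data LiveChange = LiveChange { lcKey :: Key, lcChange :: Change }
    deriving (Eq, Ord, Show)

-- Models a "recent changes" table: the latest recorded change per key.
type RecentChanges = M.Map Key Change

-- Per-item approach: one lookup per live change, analogous to
-- issuing one database query for each live change.
filterPerItem :: RecentChanges -> [LiveChange] -> [LiveChange]
filterPerItem recent = filter nonredundant
  where
    nonredundant (LiveChange k c) = M.lookup k recent /= Just c

-- Join-like approach: compute all redundant rows in one set
-- intersection, analogous to a single join, then drop them.
filterJoined :: RecentChanges -> [LiveChange] -> [LiveChange]
filterJoined recent lcs = filter (`S.notMember` redundant) lcs
  where
    recentRows = S.fromList [ LiveChange k c | (k, c) <- M.toList recent ]
    redundant = S.intersection (S.fromList lcs) recentRows

main :: IO ()
main = do
    let recent = M.fromList [("k1", AddingKey), ("k2", RemovingKey)]
        live =
            [ LiveChange "k1" AddingKey   -- already recorded, so redundant
            , LiveChange "k2" AddingKey   -- differs from the recorded change
            , LiveChange "k3" RemovingKey -- nothing recorded for this key
            ]
    print (filterPerItem recent live)
    print (filterJoined recent live == filterPerItem recent live)
```

Both functions return the same list. The difference is in how the work is issued: the first performs a separate lookup for each live change, while the second does one bulk operation, which is the shape a real SQL join would have.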
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+`git-annex info` in the limitedcalc path in cachedAllRepoData
+double-counts redundant information from the journal due to using
+overLocationLogs. In the other path it does not (any more; it used to but
+live repo sizes fixed that), and this should be fixed for consistency
+and correctness.
+
+(This is a deferred item from the [[todo/git-annex_proxies]] megatodo.) --[[Joey]]

doc/todo/git-annex_proxies.mdwn

Lines changed: 15 additions & 82 deletions
@@ -1,4 +1,4 @@
-This is a summary todo covering several subprojects, which would extend
+This is a summary todo covering several subprojects, which extend
 git-annex to be able to use proxies which sit in front of a cluster of
 repositories.

@@ -12,7 +12,7 @@ repositories.

 [[!toc ]]

-## planned schedule
+## plan

 Joey has received funding to work on this.
 Planned schedule of work:
@@ -24,94 +24,27 @@ Planned schedule of work:
 * September: proving behavior of balanced preferred content with proxies
 * October: streaming through proxy to special remotes (especially S3)

-[[!tag projects/openneuro]]
-
-## remaining things to do in October
-
-* Possibly some of the deferred items listed in following sections:
-
-## items deferred until later for balanced preferred content and maxsize tracking
-
-* The assistant is using NoLiveUpdate, but it should be possible to plumb
-  a LiveUpdate through it from preferred content checking to location log
-  updating.
-
-* `git-annex info` in the limitedcalc path in cachedAllRepoData
-  double-counts redundant information from the journal due to using
-  overLocationLogs. In the other path it does not (any more; it used to),
-  and this should be fixed for consistency and correctness.
-
-* getLiveRepoSizes has a filterM getRecentChange over the live updates.
-  This could be optimised to a single sql join. There are usually not many
-  live updates, but sometimes there will be a great many recent changes,
-  so it might be worth doing this optimisation. Persistent is not capable
-  of this, would need dependency added on esqueleto.
-
-## items deferred until later for p2p protocol over http
+> This project is now complete! [[done]] --[[Joey]]

-* Support proxying to git remotes that use annex+http urls. This needs a
-  translation from P2P protocol to servant-client to P2P protocol.
-
-* Should be possible to use a git-remote-annex annex::$uuid url as
-  remote.foo.url with remote.foo.annexUrl using annex+http, and so
-  not need a separate web server to serve the git repository. Doesn't work
-  currently because git-remote-annex urls only support special remotes.
-  It would need a new form of git-remote-annex url, eg:
-  annex::$uuid?annex+http://example.com/git-annex/
-
-* `git-annex p2phttp` could support systemd socket activation. This would
-  allow making a systemd unit that listens on port 80.
-
-## items deferred until later for [[design/passthrough_proxy]]
-
-* Check annex.diskreserve when proxying for special remotes
-  to avoid the proxy's disk filling up with the temporary object file
-  cached there.
-
-* Resuming an interrupted download from proxied special remote makes the proxy
-  re-download the whole content. It could instead keep some of the
-  object files around when the client does not send SUCCESS. This would
-  use more disk, but could minimize to eg, the last 2 or so.
-  The design doc has some more thoughts about this.
-
-* Getting a key from a cluster currently picks from among
-  the lowest cost remotes at random. This could be smarter,
-  eg prefer to avoid using remotes that are doing other transfers at the
-  same time.
+[[!tag projects/openneuro]]

-* The cost of a proxied node that is accessed via an intermediate gateway
-  is currently the same as a node accessed via the cluster gateway.
-  To fix this, there needs to be some way to tell how many hops through
-  gateways it takes to get to a node. Currently the only way is to
-  guess based on number of dashes in the node name, which is not satisfying.
+## some todos that spun off from this project and didn't get implemented during it:

-  Even counting hops is not very satisfying, one cluster gateway could
-  be much more expensive to traverse than another one.
+For balanced preferred content and maxsize tracking:

-  If seriously tackling this, it might be worth making enough information
-  available to use spanning tree protocol for routing inside clusters.
+* [[todo/assistant_does_not_use_LiveUpdate]]
+* [[todo/git-annex_info_with_limit_overcounts]]

-* Speed: A proxy to a local git repository spawns git-annex-shell
-  to communicate with it. It would be more efficient to operate
-  directly on the Remote. Especially when transferring content to/from it.
-  But: When a cluster has several nodes that are local git repositories,
-  and is sending data to all of them, this would need an alternate
-  interface than `storeKey`, which supports streaming, of chunks
-  of a ByteString.
+For p2p protocol over http:

-* Use `sendfile()` to avoid data copying overhead when
-  `receiveBytes` is being fed right into `sendBytes`.
-  Library to use:
-  <https://hackage.haskell.org/package/hsyscall-0.4/docs/System-Syscall.html>
+* [[p2phttp_serve_multiple_repositories]]
+* [[git-remote-annex_support_for_p2phttp]]

-* Support using a proxy when its url is a P2P address.
-  (Eg tor-annex remotes.)
+For proxying:

-* When an upload to a cluster is distributed to multiple special remotes,
-  a temporary file is written for each one, which may even happen in
-  parallel. This is a lot of extra work and may use excess disk space.
-  It should be possible to only write a single temp file.
-  (With streaming this wouldn't be an issue.)
+* [[proxying_for_p2phttp_and_tor-annex_remotes]]
+* [[faster_proxying]]
+* [[smarter_use_of_disk_when_proxying]]

 ## completed items for October's work on streaming through proxy to special remotes
