Skip to content
This repository was archived by the owner on Apr 29, 2020. It is now read-only.

Using a protobuf or other IPFS-ish format for the wrapping archive #6

Open
wking opened this issue May 6, 2015 · 6 comments
Open

Comments

@wking
Copy link

wking commented May 6, 2015

In ipfs/kubo#1195, @jbenet proposed making the archive file itself a DAG object (or maybe just a protobuf or something ;). I'm not entirely clear on what he has in mind.

Looking over the protobuf encoding, I think it's a poor choice for the index, because efficient bisection requires fixed-size records, and protobuf encoding works hard to not use fixed-size fields. Jumping to the n-th record is slow if you have to read all n-1 records to figure out how long they are. We'll have the same problem if we try to put something like Git's fanout index in a protobuf. On the other hand, I don't have a problem with putting a fanout table, index, and array of serialized objects into a wrapping protobuf. But I don't see that gaining a lot of efficiency, either in programming time or processing efficiency, with such a small portion of the file syntax being handled by protobufs.

So I'm currently leaning against this, but I admit to not understanding the vision ;). If anyone wants to clear me up on this front, I'd be happy for the enlightenment.

@whyrusleeping
Copy link

@wking we're going to have to do something like the fanout index anyways, we dont have fixed block sizes.

@whyrusleeping
Copy link

my thoughts are that we use a very similar protobuf to the merkledag itself, and simply add a field in the links for 'offset' which will represent the offset of that child within the archive.

@whyrusleeping
Copy link

writing then becomes as easy as traversing the DAG:

rough psuedo code...

func writeDag(n *dag.Node, w io.WriteSeeker) error {
    w.Write(n.Hash())
    w.Write(n.Data)
    offset := w.CurrentOffset
    for link := range n.Links {
        w.Write(link.Name)
        w.Write(link.Hash)
        w.Write(link.Size)
        w.Write(int64(0))
    }

    for link := range n.Links {
        childoff := w.CurrentOffset
        writeDag(link.GetNode())

        // Now seek back and write the offset into the link table
        w.Seek(offset + (math to find right location))
        w.Write(childoff)

        // now seek back down and continue writing
    }
}

This would allow O(log(n)) (depth of the dag) time lookups

@wking
Copy link
Author

wking commented May 6, 2015

On Tue, May 05, 2015 at 08:26:44PM -0700, Jeromy Johnson wrote:

we're going to have to do something like the fanout index anyways,
we dont have fixed block sizes.

I'm all for fanout, but Git's packfiles use the fanout table for
faster searching in the main index [1,2]. And I think we need to
truncate/pad those index entries to a fixed length 3. I don't see
how variable block sizes have an effect on either the fanout or the
main index.

@wking
Copy link
Author

wking commented May 6, 2015

On Tue, May 05, 2015 at 08:36:00PM -0700, Jeromy Johnson wrote:

writing then becomes as easy as traversing the DAG:

rough psuedo code...

func writeDag(n *dag.Node, w io.WriteSeeker) error {
    w.Write(n.Hash())
    w.Write(n.Data)
    offset := w.CurrentOffset
    for link := range n.Links {
        w.Write(link.Name)
        w.Write(link.Hash)
        w.Write(link.Size)
        w.Write(int64(0))
    }

    for link := range n.Links {
        childoff := w.CurrentOffset
        writeDag(link.GetNode())

        // Now seek back and write the offset into the link table
        w.Seek(offset + (math to find right location))
        w.Write(childoff)

        // now seek back down and continue writing
    }
}

This would allow O(log(n)) (depth of the dag) time lookups

If you are looking up objects by following a QmKey1/QmKey2/QmKey3
path, yes. But if your goal is “Extract QmKey3 from the archive” I
think you're stuck.

I was thinking of fanout and index tables as Git uses them, a single
global occurence of each at the top of the archive file, which you can
bisect searching for your key to get the offset into to it's object
payload (or a close relative, see the a/b/c steps here 1).

@whyrusleeping
Copy link

Hrm, i guess we should lay out what properties we are interested in having before attempting to design anything.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants