
support unixfs dag in mater #728

Closed

Conversation

Contributor

@pete-eiger pete-eiger commented Feb 4, 2025

Description

This PR extends mater’s content extraction capabilities to support .car files that store UnixFS DAGs. Previously, extraction was limited to blocks containing arbitrary content and assumed the blocks appeared in the correct sequential order. With this change, mater can now correctly traverse and extract content from files wrapped in UnixFS DAG structures, as used in IPFS.

The implementation introduces a new code branch in the filestore creation and extraction paths. Specifically:

  • New functions have been added to walk a UnixFS DAG. Extraction now recognizes and processes DAG-PB nodes by reading their links, reconstructing the original file content from its underlying tree structure.
  • The reader iterates over the DAG, recursively queuing child nodes and ensuring all parts of the DAG are traversed and written out correctly.
  • The Config enum has been updated with additional constructors: balanced_unixfs(chunk_size, tree_width) for UnixFS-wrapped content (the default behavior, for IPFS compatibility) and balanced_raw(chunk_size, tree_width) for direct/raw storage without UnixFS metadata. These let users choose whether content is processed as UnixFS or in raw mode; a hedged usage sketch follows below. The CLI has been updated with new flags (raw, chunk_size, and tree_width) to control this behavior. Several unit tests have been updated, and inline documentation explains the logic and changes.
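For illustration, a hedged sketch of how these constructors might be called; the 256 KiB chunk size and tree width of 174 are assumed example values (common IPFS defaults), not values taken from this PR:

use mater::Config;

fn example_configs() -> (Config, Config) {
    // Wrap content in a UnixFS DAG (the default, IPFS-compatible behavior).
    let unixfs = Config::balanced_unixfs(256 * 1024, 174);
    // Store content directly as raw blocks, without UnixFS metadata.
    let raw = Config::balanced_raw(256 * 1024, 174);
    (unixfs, raw)
}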

Checklist

  • Are there important points that reviewers should know?
    • If yes, which ones? - we have to confirm that this is the expected behavior of the unixfs functionality and that no regressions have been introduced to the raw data option
  • Make sure that you described what this change does.
  • If there are follow-ups, have you created issues for them?
    • I haven't fully tackled content extraction as I didn't want this PR to get too bloated
  • Have you tested this solution?
    • this needs better testing, am open to suggestions
  • Were there any alternative implementations considered?
  • Did you document new (or modified) APIs? - inline comments

Open questions (aside from the ones raised in the checklist)

Not sure whether to remove the overwrite field on Config. Thoughts?

@pete-eiger pete-eiger marked this pull request as draft February 4, 2025 10:29
@pete-eiger pete-eiger force-pushed the feat/675/mater-convert-wrap-contents-in-unixfs-dag branch from d6ab6b4 to fe052bb Compare February 4, 2025 10:50
@pete-eiger pete-eiger self-assigned this Feb 4, 2025
@pete-eiger pete-eiger requested a review from a team February 4, 2025 10:51
@pete-eiger pete-eiger added the ready for review label Feb 4, 2025
@pete-eiger pete-eiger linked an issue Feb 4, 2025 that may be closed by this pull request
@pete-eiger pete-eiger marked this pull request as ready for review February 4, 2025 10:51
@pete-eiger pete-eiger added and removed the ready for review label Feb 4, 2025
@pete-eiger pete-eiger force-pushed the feat/675/mater-convert-wrap-contents-in-unixfs-dag branch from cdbc77a to 3edb9d5 Compare February 4, 2025 11:03
@pete-eiger pete-eiger added and removed the ready for review label Feb 4, 2025
Collaborator

@jmg-duarte jmg-duarte left a comment

I didn't audit the balanced streams, I'll leave that for the final review.

sizes.push(link_info.raw_data_length);
links.push(PbLink {
cid: *child_cid,
name: Some("".to_string()),
Collaborator

Couldn't it be None instead?

Contributor Author

Ideally, but right now if we use None some tests will fail:

failures:
    stores::blockstore::tests::byte_eq_spaceglenda
    stores::blockstore::tests::dedup_lorem_roundtrip
    stores::filestore::test::test_spaceglenda_roundtrip
    v2::writer::tests::full_spaceglenda

It's because changing the link's name from Some("") to None affects the serialized output. In our DAG-PB, the Name field is encoded as a field in the protobuf message. When you set it to Some(""), even though the string is empty, the encoder still writes the tag and a length (which will be 0). In contrast, if you set it to None, that field is omitted entirely. This difference causes the final serialized CAR file to have a slightly different size (and different offsets for index entries) compared to the expected reference produced by go-car or as specified by our tests.

Since our tests compare exact byte lengths and offsets (for example, expecting 1358 bytes versus 1350 bytes, etc.), the change leads to mismatches. In this case, if the expected behavior (or the spec) requires that an empty name is explicitly encoded as an empty string, we need to keep using Some("").

"No name" and "empty name" are not equivalent; they differ in the encoding. The tests check for the exact binary output, so we have to keep the field present as Some("") if we want to match the expected output.
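For illustration, a minimal standalone sketch of that wire-level difference (hand-encoded bytes following the DAG-PB schema, where PBLink.Name is field 2 with wire type 2; this is not mater code):

fn main() {
    // Tag byte for PBLink.Name = (field_number << 3) | wire_type = (2 << 3) | 2 = 0x12.
    let empty_name: &[u8] = &[0x12, 0x00]; // Some(""): tag + zero length = 2 extra bytes
    let no_name: &[u8] = &[];              // None: the field is omitted entirely

    // Two extra bytes per link, which shifts every byte length and index offset
    // that the tests compare against.
    assert_eq!(empty_name.len() - no_name.len(), 2);
}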

Collaborator

@jmg-duarte jmg-duarte Feb 10, 2025

I remember now. Can you leave a comment explaining the edge case? Since we're looking to match go-car's impl, etc

@pete-eiger pete-eiger added and removed the ready for review label Feb 5, 2025
@pete-eiger pete-eiger force-pushed the feat/675/mater-convert-wrap-contents-in-unixfs-dag branch from 9a9746a to 208c554 Compare February 5, 2025 15:24
@pete-eiger pete-eiger force-pushed the feat/675/mater-convert-wrap-contents-in-unixfs-dag branch from 208c554 to 636873a Compare February 5, 2025 15:25
@pete-eiger pete-eiger added and removed the ready for review label Feb 5, 2025
raw,
);

let cid = convert_file_to_car(&input_path, &output_path, config, false).await?;
Contributor

We seem to not overwrite by default now. If we do not want to support overwriting, can we remove this boolean?

Collaborator

Let's just not remove the overwrite flag at all, what isn't broken doesn't need fixing :)

Contributor Author

We are supporting it (see #728 (comment))

Contributor

The overwrite flag was useful in our experiments, I'd let it be.

@pete-eiger pete-eiger removed the ready for review label Feb 6, 2025
@pete-eiger pete-eiger added the ready for review label Feb 6, 2025
@pete-eiger pete-eiger added and removed the ready for review label Feb 6, 2025
@pete-eiger pete-eiger added and removed the ready for review label several times Feb 7, 2025
Member

@cernicc cernicc left a comment

Publishing a partial review.

@@ -23,6 +24,8 @@ use tokio_util::{
use tower_http::trace::TraceLayer;
use uuid::Uuid;

type BoxedStream = Pin<Box<dyn Stream<Item = Result<Bytes, std::io::Error>> + Send>>;
Member

tracing::error!(%err, "failed to execute blocking task");
(StatusCode::INTERNAL_SERVER_ERROR, err.to_string())
})??;

// Branching needed here since the resulting `StreamReader`s don't have the same type
// Determine how to obtain the file's bytes:
let file_cid = if request.headers().contains_key("Content-Type") {
Member

This change is the biggest in the upload function. I am trying to understand what it actually changes, but I can't see the reason why the custom stream is needed. Field already implements the Stream trait.

Collaborator

I hadn't noticed this before. Why is one branch changed and not the other? Currently, this code is working; why move to a custom-made stream?
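For reference, a hedged sketch of that idea, assuming the field's error type implements std::error::Error; the helper name into_reader is made up here:

use bytes::Bytes;
use futures::{Stream, TryStreamExt};
use tokio::io::AsyncRead;
use tokio_util::io::StreamReader;

// Any `Stream<Item = Result<Bytes, E>>` (e.g. a multipart `Field`) can feed
// `StreamReader` directly once its error is mapped into `std::io::Error`,
// so no hand-rolled stream type should be necessary.
fn into_reader<S, E>(stream: S) -> impl AsyncRead
where
    S: Stream<Item = Result<Bytes, E>>,
    E: std::error::Error + Send + Sync + 'static,
{
    StreamReader::new(stream.map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e)))
}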

/// Reads bytes from the source and writes them to a CAR file.
/// Converts a source stream into a CARv2 file and writes it to an output stream.
///
/// Send + 'static bounds are required because the UnixFS processing involves:
Member

The bounds descriptions are not needed in the docs


/// Catch-all error for miscellaneous cases.
#[error("other error: {0}")]
Other(String),
Member

This Other variant is used only once. Instead of using Other you should use CidError or InvalidCid and remove this variant.

Collaborator

@jmg-duarte jmg-duarte left a comment

I'm sorry but I can't allow this PR to go through with all the seemingly unrelated refactors.

Some of them are even clear downgrades, like removing logging from unhandled errors.

tracing::error!(%err, path = %file_path.display(), "failed to remove uploaded piece");
}

let _ = tokio::fs::remove_file(&file_path).await;
Collaborator

You're removing the log and not returning the error.
Silently failing and leaving the user in the dark IS NOT OK

Comment on lines +311 to +315
.await
.map_err(|err| {
tracing::error!(%err, "failed to rename the CAR file");
(StatusCode::INTERNAL_SERVER_ERROR, err.to_string())
})
.await?;
})?;
Collaborator

Any reason the await is better here?

Comment on lines -301 to -302
// We need to rename the file since the original storage name is based on the whole deal proposal CID,
// however, the piece is stored based on its piece_cid
Collaborator

I don't understand why this is being removed

Comment on lines -267 to -269
// Calculate the piece commitment in the blocking thread pool since `calculate_piece_commitment`
// is CPU intensive — i.e. blocking — potentially improvement is to move this completely out of
// the tokio runtime into an OS thread
Collaborator

Recurring theme: why are you removing comments that explain the context behind the decisions?

Comment on lines -256 to -258
let _ = tokio::fs::remove_file(&file_path).await.inspect_err(
|err| tracing::error!(%err, path = %file_path.display(), "failed to delete file"),
);
Collaborator

Removing the error logs again. Please explain why this would be a good idea

.await
.map_err(|err| {
tracing::error!(%err, "failed to store file into CAR archive");
(StatusCode::INTERNAL_SERVER_ERROR, err.to_string())
})?
} else {
// Read the request body into a CAR archive
// For direct uploads, convert the request body into a stream.
Collaborator

What is a direct upload?

let deal_cid = cid::Cid::from_str(&cid).map_err(|err| {
tracing::error!(cid, "failed to parse cid");
(StatusCode::BAD_REQUEST, err.to_string())
})?;

// Use deal_db (we need it now, so we clone it)
Collaborator

This comment adds no value. I don't get it: you're removing contextual comments that provide insight into why things are done the way they are, but adding comments like this where it's very obvious what's going on.

let proposed_deal =
// Move the fetch to the blocking pool since the RocksDB API is sync
Collaborator

More context being removed

Comment on lines +59 to +77
let read_bytes = source.read_buf(&mut buf).await?;
trace!(bytes_read = read_bytes, buffer_size = buf.len(), "Buffer read status");
// EOF but there's still content to yield -> yield it
while buf.len() >= chunk_size {
    // The buffer may have a larger capacity than chunk_size due to reserve;
    // this also means that our read may have read more bytes than we expected,
    // that's why we check if the length is bigger than the chunk_size and, if so,
    // we split the buffer to the chunk_size, then freeze and return
    let chunk = buf.split_to(chunk_size);
    yield chunk.freeze();
} // otherwise, the buffer is not full, so we don't do a thing

if read_bytes == 0 && !buf.is_empty() {
    let chunk = buf.split();
    yield chunk.freeze();
    break;
} else if read_bytes == 0 {
    break;
}
Collaborator

Unless you can explain where the bug was, I expect these changes to be reverted.

I can't allow you to refactor a core piece of logic that is not broken and was the product of multiple discussions.

Contributor

I think I understand what was going on.
If chunk_size == 2, but read_bytes == 0 and there is still buf.len() == 4 from the previous invocation, you should not yield it in its entirety, but continue chunking it.
It makes sense, HOWEVER.

Then there is a bug here (or it should be a panic):

            if read_bytes == 0 && !buf.is_empty() {
                let chunk = buf.split();
                yield chunk.freeze();
                break;

because if we're chunking, and the loop above gave out all of the nicely sized chunks, why are we yielding something which is not chunk_size? Is this intended?

Imagine, buf.len() == 5, chunk_size == 2, it goes:

yield 2;
yield 2;
yield 1;

so the last yielded chunk won't be a nice chunk.

.....

Then I took a look at let mut buf = BytesMut::with_capacity(chunk_size);. At first glance it looks like source.read_buf(&mut buf).await? is always going to read chunk_size bytes into buf. But is it?

Does BytesMut::with_capacity(chunk_size) guarantee that source.read_buf(&mut buf).await? will always read at most chunk_size bytes? A quick search told me that it depends on the behaviour of the underlying read_buf, so I'm not sure.
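For what it's worth, a small standalone sketch (using only the bytes crate, not the mater code) that reproduces the example above:

use bytes::{Bytes, BytesMut};

// With chunk_size = 2 and 5 bytes buffered at EOF, the loop yields 2, 2 and
// then a trailing chunk of 1.
fn drain_at_eof(mut buf: BytesMut, chunk_size: usize) -> Vec<Bytes> {
    let mut out = Vec::new();
    while buf.len() >= chunk_size {
        out.push(buf.split_to(chunk_size).freeze()); // full-sized chunks first
    }
    if !buf.is_empty() {
        out.push(buf.split().freeze()); // trailing partial chunk (< chunk_size)
    }
    out
}

fn main() {
    let mut buf = BytesMut::new();
    buf.extend_from_slice(b"hello");
    let sizes: Vec<usize> = drain_at_eof(buf, 2).iter().map(Bytes::len).collect();
    assert_eq!(sizes, vec![2, 2, 1]); // the last chunk is not a "nice" chunk
}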

Comment on lines +481 to +482
// Use `file_path` here instead of the undefined `piece_path`
let (piece_commitment, _) = commp(&file_path)?;
Collaborator

Nothing else changed. Was there a bug in delia?

Comment on lines +74 to +75
if processed.contains(&current_cid) {
continue;
Contributor

This logic is duplicated in the DFS. We're already checking it here:

 if !processed.contains(&link.cid) {
 	to_process.push(link.cid);
}

Comment on lines +82 to +84
// Write the raw block data. In a real UnixFS traversal you might need
// to reconstruct file content in order.
output.write_all(&block_bytes).await?;
Contributor

How is that not a real UnixFS traversal?
I'm confused, what is extract_content_via_index supposed to do?
What do we mean by index if we are not actually using an index anywhere?

Comment on lines +87 to +97
if current_cid.codec() == crate::multicodec::DAG_PB_CODE {
    let mut cursor = std::io::Cursor::new(&block_bytes);
    // Propagate any error that occurs during decoding.
    let pb_node: ipld_dagpb::PbNode =
        DagPbCodec::decode(&mut cursor).map_err(Error::DagPbError)?;
    for link in pb_node.links {
        if !processed.contains(&link.cid) {
            to_process.push(link.cid);
        }
    }
}
Contributor

Suggested change (adds an else branch to the code above):

if current_cid.codec() == crate::multicodec::DAG_PB_CODE {
    let mut cursor = std::io::Cursor::new(&block_bytes);
    // Propagate any error that occurs during decoding.
    let pb_node: ipld_dagpb::PbNode =
        DagPbCodec::decode(&mut cursor).map_err(Error::DagPbError)?;
    for link in pb_node.links {
        if !processed.contains(&link.cid) {
            to_process.push(link.cid);
        }
    }
} else {
    return Err(Error::UnsupportedCidCodec(current_cid.codec()));
}

Comment on lines +104 to +107
impl<R> CarBlockStore<R>
where
R: AsyncSeek + AsyncReadExt + Unpin,
{
Contributor

Can't we pull these under one impl block? They seem to be generic over the same trait bounds.

I don't know why one is AsyncSeek, the other AsyncSeekExt though.


Comment on lines +123 to +125
car_v1_start.try_into().unwrap(),
(index_offset - car_v1_start).try_into().unwrap(),
index_offset.try_into().unwrap(),
Contributor

expects?


writer.finish().await?;

Ok(root.unwrap())
Contributor

How are we sure that root is not None?

Comment on lines +147 to +165
let chunker = async_stream::try_stream! {
    let mut buf = BytesMut::with_capacity(chunk_size);
    loop {
        let read_bytes = source.read_buf(&mut buf).await?;
        while buf.len() >= chunk_size {
            let chunk = buf.split_to(chunk_size);
            yield chunk.freeze();
        }

        if read_bytes == 0 && !buf.is_empty() {
            let chunk = buf.split();
            yield chunk.freeze();
            break;
        } else if read_bytes == 0 {
            break;
        }
    }
};

Contributor

If we're reusing the chunker across two methods, it deserves to be a separate method.
This logic is complex enough not to be duplicated by copy-pasting.
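For illustration, a hedged sketch of what that shared helper might look like (the name chunk_stream is hypothetical; it simply mirrors the duplicated try_stream! blocks quoted above):

use async_stream::try_stream;
use bytes::{Bytes, BytesMut};
use futures::Stream;
use tokio::io::{AsyncRead, AsyncReadExt};

// One chunker shared by both import paths: emits chunk_size-sized chunks,
// plus a final partial chunk at EOF if any bytes remain buffered.
fn chunk_stream<Src>(
    mut source: Src,
    chunk_size: usize,
) -> impl Stream<Item = std::io::Result<Bytes>>
where
    Src: AsyncRead + Unpin,
{
    try_stream! {
        let mut buf = BytesMut::with_capacity(chunk_size);
        loop {
            let read_bytes = source.read_buf(&mut buf).await?;
            // Emit every full chunk currently buffered.
            while buf.len() >= chunk_size {
                yield buf.split_to(chunk_size).freeze();
            }
            if read_bytes == 0 {
                // EOF: flush any trailing partial chunk, then stop.
                if !buf.is_empty() {
                    yield buf.split().freeze();
                }
                break;
            }
        }
    }
}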

Ok(root.unwrap())
}

async fn balanced_import_unixfs<Src, Out>(
Contributor

Those methods look almost exactly the same; I struggle to find the difference between the two.

  1. Add docs for this one
  2. Can we make it clearer that most of the logic is actually the same? E.g., extract some shared methods?

@pete-eiger pete-eiger closed this Feb 11, 2025
Labels: ready for review

Successfully merging this pull request may close these issues.

feat: Mater convert, wrap contents in unixfs dag
5 participants