PEP 694: Abstract file upload mechanisms #4431

Open
wants to merge 14 commits into
base: main
from

Conversation

ewdurbin
Member

@ewdurbin ewdurbin commented May 21, 2025

This attempts to defer the implementation details of getting the bits of a given artifact from the client to the server.

The primary motivation here is to decouple this PEP from resumable/multi-part uploads, provide flexibility to implementers of PEP 694, and allow for new upload mechanisms without a PEP cycle.


📚 Documentation preview 📚: https://pep-previews--4431.org.readthedocs.build/

@ewdurbin
Member Author

I spent some time sketching this out as a Finite State Machine in pypi/warehouse#18174, and will be using what I learned to refine this a bit!

@ewdurbin ewdurbin marked this pull request as ready for review May 28, 2025 14:13
@ewdurbin ewdurbin requested review from dstufft and warsaw as code owners May 28, 2025 14:13
@ewdurbin
Member Author

@dstufft @warsaw I think this is ready for a proper review.

@warsaw
Member

warsaw commented May 28, 2025

I'm planning on taking a closer look later today.

Member

@dstufft dstufft left a comment

Couple comments, but overall the changes look good to me.

I just wanted to note a few things I came across when reading the entire PEP (with these changes incorporated; a number of these probably came from my original PEP, so that's on me :D).

  • The PEP specifies that the endpoint for PyPI will be https://upload.pypi.org/2.0. I would probably remove that from the PEP and let PyPI decide what its endpoint will be (in particular the /2.0 part is somewhat confusing given the PEP also uses conneg).
  • The PEP calls out the inability to parallelize or resume an upload as a problem to be solved, and then later states the PEP solves all the identified problems. The TUS based approach didn't really solve parallelization to begin with, and the removal of TUS means the PEP also doesn't solve resuming an upload. I think that's fine, but we should update the wording.
  • The content type handling is kind of wonky I think. Much like PEP 691 the client can use the Accept header to request a particular content type from the server, and the server includes the full version number in the meta.api-version key in the response. However, the requests appear to only be using the meta.api-version key. I think ideally we want to have requests using a correct Content-Type for their request (and the latest wouldn't be supported here), and have the server use that for handling the request data. We probably want to also explicitly require that the meta.api-version matches the Content-Type for major version.
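
A minimal sketch of the check described in the last bullet, assuming the request media type follows the `application/vnd.pypi.upload.v2+json` shape used in the PEP text. The function names and the error handling are hypothetical and only illustrate matching the major version in `Content-Type` against `meta.api-version`; they are not part of the PEP.

```python
import re

# Media type shape assumed from the PEP text, e.g. application/vnd.pypi.upload.v2+json.
# "latest" is deliberately not matched: a request must name a concrete version.
_CONTENT_TYPE_RE = re.compile(r"^application/vnd\.pypi\.upload\.v(?P<major>\d+)\+json$")


def content_type_major(content_type: str) -> int:
    """Return the major version named by a request Content-Type header."""
    # Drop parameters such as "; charset=utf-8" before matching.
    media_type = content_type.split(";", 1)[0].strip().lower()
    match = _CONTENT_TYPE_RE.match(media_type)
    if match is None:
        raise ValueError(f"unsupported Content-Type: {content_type!r}")
    return int(match.group("major"))


def versions_agree(content_type: str, body: dict) -> bool:
    """Check that meta.api-version has the same major version as the Content-Type."""
    api_version = str(body.get("meta", {}).get("api-version", ""))
    major_text = api_version.split(".", 1)[0]
    return major_text.isdecimal() and int(major_text) == content_type_major(content_type)
```

With this shape, a server could reject a request early when the header and the body disagree, instead of guessing which of the two the client meant.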

@ewdurbin
Member Author

ewdurbin commented May 30, 2025

  • The PEP calls out the inability to parallelize or resume an upload as a problem to be solved, and then later states the PEP solves all the identified problems. The TUS based approach didn't really solve parallelization to begin with, and the removal of TUS means the PEP also doesn't solve resuming an upload. I think that's fine, but we should update the wording.

While this PEP would no longer directly implement resumable or parallel uploads, it does solve the problem of how to address them. Individual file-upload sessions may occur in parallel if a server chooses to implement a mechanism that supports it, and resumable uploads can similarly be implemented as a mechanism. I'll clarify this in the "The new upload API..." paragraph; see 05d7fc2

@ewdurbin
Member Author

ewdurbin commented May 30, 2025

Realization while specifying http-post-application-octet-stream: we need to support PEP 740 style attestations... tagging @woodruffw for thoughts :) See: 2ef077c

@ewdurbin
Member Author

  • The content type handling is kind of wonky I think. Much like PEP 691 the client can use the Accept header to request a particular content type from the server, and the server includes the full version number in the meta.api-version key in the response. However, the requests appear to only be using the meta.api-version key. I think ideally we want to have requests using a correct Content-Type for their request (and the latest wouldn't be supported here), and have the server use that for handling the request data. We probably want to also explicitly require that the meta.api-version matches the Content-Type for major version.

See: 924a27d

@ewdurbin
Member Author

  • The PEP specifies that the endpoint for PyPI will be https://upload.pypi.org/2.0. I would probably remove that from the PEP and let PyPI decide what its endpoint will be (in particular the /2.0 part is somewhat confusing given the PEP also uses conneg).

See: f8469cf

Co-authored-by: Donald Stufft <[email protected]>
Member

I'm bouncing between reading the PR preview and the diff. I'll capture some thoughts here based on text not touched in the PR (which I don't think GH gives me the UI to add inline comments to).

  • Instead of "a standard API" I think we're now talking about "an extensible API" with some standard behavior.

Member

Okay, I've pretty much run out of gas. I think some of my comments may not make a ton of sense since I reviewed it sequentially rather than reading the whole thing and then composing my feedback. Apologies for that.

Overall, I think this is a really good simplification, and I really like the direction it's going in. I know I have a lot of musings, feedback, thoughts, comments, and suggestions sprinkled throughout, and I hope they're moderately helpful.

If it would be helpful, I can try to edit the PR locally and push changes, or I could branch your PR and push a new branch/PR, or we can just try to make it all work here. Happy to also chat about it separately!

@@ -24,7 +24,7 @@ with standardization, the upload API provides additional useful features such as

* artifacts which can be overwritten and replaced, until a session is published;

* asynchronous and "chunked", resumable file uploads, for more efficient use of network bandwidth;
* flexible file upload mechanisms for index operators;
Member

Suggested change
* flexible file upload mechanisms for index operators;
* a protocol to extend the supported upload mechanisms in the future without requiring a full PEP; these can be standardized and recommended for all indexes, or be index-specific.

The new upload API proposed in this PEP solves all of these problems, providing for a much more
flexible, bandwidth friendly approach, with better error reporting, a better release testing
experience, and atomic and simultaneous publishing of all release artifacts.
The new upload API proposed in this PEP provides a solution to all of these problems,
Member

Suggested change
The new upload API proposed in this PEP provides a solution to all of these problems,

The new upload API proposed in this PEP provides an immediate solution to many of these problems, and defines a flexible mechanism for future support of the other problems by extension.

flexible, bandwidth friendly approach, with better error reporting, a better release testing
experience, and atomic and simultaneous publishing of all release artifacts.
The new upload API proposed in this PEP provides a solution to all of these problems,
providing for a much more flexible approach, with support for servers to
Member

Suggested change
providing for a much more flexible approach, with support for servers to

In the future, indexes can

experience, and atomic and simultaneous publishing of all release artifacts.
The new upload API proposed in this PEP provides a solution to all of these problems,
providing for a much more flexible approach, with support for servers to
implement resumable and parallel uploads via mechanisms,
Member

Suggested change
implement resumable and parallel uploads via mechanisms,

implement resumable and parallel uploads via extensions,


File Upload Session Completion
Member

This is another case where the response may be specific to the mechanism being used. I think I understand what you're getting at though. You want the client to signal to the server that it's done transferring the file to the server. But that means that completing a file upload is a two-step process:

  1. File is completely uploaded using whatever protocol is defined by the mechanism
  2. Client also has to signal to the server that the upload is completed.

This is rather than the mechanism itself communicating to the server that 2) has been completed. I'm guessing that a specific case you might be thinking about is S3 exchange.

In that case, the server says, hey you can use the S3 pre-signed URL protocol to upload your file. I don't know anything about those details, and in fact I'm out of the loop. You handle it, and then when you're done, you tell me you're done and I can do any post-upload processing I need to do. This is rather than setting up a way for S3 to tell PyPI that the upload is finished (e.g. through a webhook or some such).

Have I got that right, or are you thinking about something else?
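
A minimal sketch of the two-step flow described above, seen from the client side. The class, the state names, and the `{"action": "complete"}` body are all hypothetical (the last is modelled on the `"action": "status"` idea raised in this review); the point is only that the mechanism-specific transfer and the completion signal are separate transitions.

```python
from enum import Enum


class UploadState(Enum):
    PENDING = "pending"          # session created, transfer not finished
    TRANSFERRED = "transferred"  # mechanism-specific transfer finished
    COMPLETE = "complete"        # client has signalled completion to the index


class FileUploadSession:
    """Client-side view of one file upload session (hypothetical model)."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.state = UploadState.PENDING

    def mark_transferred(self) -> None:
        # Step 1: the bytes went wherever the mechanism sent them,
        # for example to a pre-signed URL the index never sees.
        if self.state is not UploadState.PENDING:
            raise RuntimeError(f"cannot transfer from state {self.state.value}")
        self.state = UploadState.TRANSFERRED

    def signal_complete(self) -> dict:
        # Step 2: tell the index the transfer is done so it can post-process.
        if self.state is not UploadState.TRANSFERRED:
            raise RuntimeError("completion signalled before the transfer finished")
        self.state = UploadState.COMPLETE
        return {"action": "complete"}  # assumed request body
```

Keeping step 2 explicit means the index does not need a callback from the storage backend to learn that a transfer has finished.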


.. code-block:: email
Content-Type: application/vnd.pypi.upload.v2+json
Member

As above.

Once the client has retrieved the offset that they need to start from, they can upload the rest of
the file as described above, either in a single request containing all of the remaining bytes, or in
multiple chunks as per the above protocol.
After receiving this request the server **MAY** perform additional asynchronous processing on the file,
Member

If I'm on the right track with my thinking about the intent above, then I think if we keep this "File Upload Session Completion" request, the server MUST respond with either a 200 or 202 in the success case (and of course appropriate error codes if a failure occurs). The server would respond with a 200 if the post-completion processing is done synchronously to the request. It would respond with a 202 if that processing must be done asynchronously (e.g. it would take a long time to verify a checksum or similar). In the latter case, there would have to be an endpoint that the client could poll to get the status of the post-processing.

One option there would be to just post the same request to the same endpoint but use "action": "status" because I think if we make the file upload endpoint mechanism specific, we can't put it here.
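
A minimal sketch of how a client might handle the 200/202 split with the `"action": "status"` polling option. Everything here is an assumption for illustration: the `post` callable stands in for an HTTP POST to the file upload session endpoint, the `{"action": "complete"}` body is hypothetical, and the status request is assumed to keep returning 202 until processing finishes.

```python
import time
from typing import Callable, Tuple

# post(body) -> (status_code, json_body); injected so the sketch needs no HTTP library.
PostFn = Callable[[dict], Tuple[int, dict]]


def complete_file_upload(post: PostFn, poll_interval: float = 1.0, max_polls: int = 30) -> dict:
    """Signal completion, then poll while the server reports 202 Accepted."""
    status, body = post({"action": "complete"})
    polls = 0
    while status == 202:
        # 202 means post-upload processing is still running asynchronously.
        if polls >= max_polls:
            raise TimeoutError("server did not finish post-upload processing")
        time.sleep(poll_interval)
        status, body = post({"action": "status"})
        polls += 1
    if status != 200:
        raise RuntimeError(f"file upload session failed with HTTP {status}")
    return body
```

A server that processes synchronously answers the first request with 200 and the loop never runs, so the same client code covers both cases.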

File Upload Mechanisms
----------------------

Servers **MUST** implement :ref:`required file upload mechansisms <required-file-upload-mechanisms>`.
Member

Suggested change
Servers **MUST** implement :ref:`required file upload mechansisms <required-file-upload-mechanisms>`.

Servers **MUST** implement :ref:`required file upload mechanisms <required-file-upload-mechanisms>`.


A given server **MAY** implement an arbitrary number of server specific mechanisms
and is responsible for documenting their usage.
Server specific implementations **MUST** be prefixed with ``vnd-``
Member

Ah, now that I'm here (and admittedly, running out of gas -- it's EOW) I see that my earlier comment about the mechanism name of pypi-atomic wouldn't be appropriate because it would be a required protocol and not a vendor protocol. So maybe that would be changed to atomic instead.
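
A minimal sketch of the naming rule from the diff above: server-specific mechanisms carry the `vnd-` prefix, and anything else has to be a mechanism the PEP itself defines. The contents of `STANDARD_MECHANISMS` are an assumption (the identifier appears earlier in this thread, but whether it ends up required is not settled here), and `classify_mechanism` is a hypothetical helper.

```python
# Assumption: the set of PEP-defined mechanism names. The identifier below is
# mentioned in this PR thread; treating it as a required mechanism is assumed.
STANDARD_MECHANISMS = frozenset({"http-post-application-octet-stream"})
VENDOR_PREFIX = "vnd-"


def classify_mechanism(name: str) -> str:
    """Return "standard" or "vendor" for a mechanism identifier, else raise ValueError."""
    if name in STANDARD_MECHANISMS:
        return "standard"
    # A bare "vnd-" with nothing after it is not a usable identifier.
    if name.startswith(VENDOR_PREFIX) and len(name) > len(VENDOR_PREFIX):
        return "vendor"
    raise ValueError(
        f"{name!r} is neither a standard mechanism nor prefixed with {VENDOR_PREFIX!r}"
    )
```

Under this rule an unprefixed name such as `pypi-atomic` is rejected unless the PEP defines it, which matches the concern about that name in the comment above.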

5 participants