@IsaevIlya IsaevIlya commented Jul 24, 2025

Description

Improved Performance for Partial Checkpoint Loading

Background

By default, PyTorch orders checkpoint data by size when saving, which scatters tensors/weights across checkpoint shards in an effectively random order. This works well with local storage but can hurt performance on cloud storage. In addition, PyTorch's checkpoint loading process does not currently follow the order used during saving, so shard files are read with non-sequential access patterns.

Changes

1. Sequential Read Optimization

  • Modified the ordering of items in LocalPlan based on their actual offset in checkpoint shards
  • Ensures sequential reading of data, improving I/O efficiency

2. Custom Sorting for Partial Loading

  • Added ability to provide a custom sorting key function during checkpoint saving
  • Allows users to group specific data at the beginning of checkpoints
  • Optimizes partial loading scenarios by reducing the amount of data that needs to be read
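
The sequential-read change in item 1 can be sketched as follows. This is a minimal illustration, not the PR's code: `ReadRequest`, `StorageInfo`, and `order_by_offset` are hypothetical stand-ins for the planner's read items and for the storage metadata that records each item's shard file and byte offset, and the offsets are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageInfo:
    """Hypothetical stand-in: where one tensor lives inside a shard file."""
    relative_path: str
    offset: int
    length: int


@dataclass(frozen=True)
class ReadRequest:
    """Hypothetical stand-in for one item of a load plan."""
    fqn: str


def order_by_offset(items, storage_data):
    """Sort items by (shard file, byte offset) so each shard is read front to back."""
    return sorted(
        items,
        key=lambda item: (
            storage_data[item.fqn].relative_path,
            storage_data[item.fqn].offset,
        ),
    )


# Illustrative offsets only, not taken from a real checkpoint.
storage_data = {
    "layers.1.weight": StorageInfo("__0_0.distcp", 0, 100),
    "layers.2.weight": StorageInfo("__0_0.distcp", 100, 50),
    "lm_head.weight": StorageInfo("__0_0.distcp", 150, 400),
}

# A plan that requests items in an order unrelated to their position in the shard.
plan_items = [
    ReadRequest("lm_head.weight"),
    ReadRequest("layers.1.weight"),
    ReadRequest("layers.2.weight"),
]

ordered = order_by_offset(plan_items, storage_data)
print([item.fqn for item in ordered])
```

Sorting on the shard file first keeps all reads for one file together, and sorting on the offset second means each file is consumed in increasing byte order.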

Usage Example

If you need to load only the tensors whose names start with "model.model.layers.1", you can group them at the beginning of the checkpoint shards:

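import torch.distributed.checkpoint as dcp
from s3torchconnector.dcp import S3StorageWriter

def sort_key_func(item):
    # False sorts before True, so matching tensors come first; ties are broken by name.
    return not item.index.fqn.startswith("model.model.layers.1"), item.index.fqn

def save_checkpoint(state_dict, region, uri):
    # sort_key is the parameter added by this PR.
    storage_writer = S3StorageWriter(region, uri, sort_key=sort_key_func)
    dcp.save(state_dict, storage_writer=storage_writer)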
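
The tuple returned by `sort_key_func` groups the selected tensors first because Python compares tuples element by element and `False` sorts before `True`. The sketch below shows this with plain Python only. `Index` and `Item` are hypothetical stand-ins that mimic just the `item.index.fqn` attribute access used above, and the tensor names are a small illustrative sample.

```python
from dataclasses import dataclass

PREFIX = "model.model.layers.1"


@dataclass(frozen=True)
class Index:
    """Hypothetical stand-in exposing only the fully qualified name."""
    fqn: str


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for an item passed to the sort key."""
    index: Index


def sort_key_func(item):
    # First element: False for matching names, True otherwise.
    # Second element: the name itself, used to order items within each group.
    return not item.index.fqn.startswith(PREFIX), item.index.fqn


names = [
    "model.lm_head.weight",
    "model.model.layers.21.mlp.gate_proj.weight",
    "model.model.layers.10.self_attn.o_proj.weight",
    "model.model.embed_tokens.weight",
    "model.model.layers.1.mlp.down_proj.weight",
]
items = [Item(Index(name)) for name in names]

ordered = [item.index.fqn for item in sorted(items, key=sort_key_func)]
for name in ordered:
    print(name)
```

In this sample the two matching names come first, followed by the remaining names in alphabetical order. Changing `PREFIX` to "model.model.layers.1." (with a trailing dot) would restrict the leading group to layer 1 only.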

Results

The resulting checkpoint shard layout places tensors whose names start with "model.model.layers.1" first. Because the key is a plain prefix match, layers 10 through 19 match as well:

File name: __0_0.distcp
        key: model.model.layers.10.self_attn.o_proj.weight     ,        offset: 0
        key: model.model.layers.11.mlp.down_proj.weight        ,        offset: 67110441
        key: model.model.layers.11.self_attn.k_proj.weight     ,        offset: 247467090
        key: model.model.layers.13.mlp.gate_proj.weight        ,        offset: 314577531
        key: model.model.layers.15.input_layernorm.weight      ,        offset: 494934180
        key: model.model.layers.15.post_attention_layernorm.weight,     offset: 494952141
        ...
        key: model.lm_head.weight                              ,        offset: 1124125164
        key: model.model.embed_tokens.weight                   ,        offset: 1950404629
        key: model.model.layers.21.mlp.gate_proj.weight        ,        offset: 2776684094
        ....
  • I have updated the CHANGELOG or README if appropriate

By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.

@IsaevIlya IsaevIlya requested a review from a team as a code owner July 24, 2025 16:05
@IsaevIlya IsaevIlya temporarily deployed to integration-tests July 24, 2025 16:05 — with GitHub Actions Inactive
@IsaevIlya IsaevIlya marked this pull request as draft July 24, 2025 16:05
@IsaevIlya IsaevIlya temporarily deployed to integration-tests July 24, 2025 20:51 — with GitHub Actions Inactive

@jet-tong jet-tong left a comment

Tested the PR with the "model.model.layers.1" sort key function and can confirm that these changes speed up partial checkpoint loading.

@IsaevIlya IsaevIlya marked this pull request as ready for review August 6, 2025 16:25
@IsaevIlya IsaevIlya force-pushed the experiment/dcp-save-ordering branch from 8c10ef5 to 9e85202 Compare August 7, 2025 07:07
@jet-tong jet-tong temporarily deployed to integration-tests September 12, 2025 16:17 — with GitHub Actions Inactive
…kpoints. Enable custom sorting for tensor/weights when creating checkpoints.
@jet-tong jet-tong force-pushed the experiment/dcp-save-ordering branch from 5c5234f to 3440801 Compare September 17, 2025 15:50
@jet-tong jet-tong temporarily deployed to integration-tests September 17, 2025 15:50 — with GitHub Actions Inactive
jet-tong pushed a commit to jet-tong/s3-connector-for-pytorch that referenced this pull request Sep 29, 2025
…kpoints

Cherry-picked prepare_local_plan method from upstream PR awslabs#352.
Sequentially loads items based on their actual offset in checkpoint shards,
ensuring sequential access patterns and improving I/O efficiency.
jet-tong pushed a commit to jet-tong/s3-connector-for-pytorch that referenced this pull request Oct 6, 2025
…kpoints

Cherry-picked prepare_local_plan method from upstream PR awslabs#352.
Sequentially loads items based on their actual offset in checkpoint shards,
ensuring sequential access patterns and improving I/O efficiency.