feat: list all registered schedulers (#1009) #1050

Open · wants to merge 1 commit into main

Conversation

@clumsy (Contributor) commented Apr 23, 2025

A simple merge of the default and entrypoint-registered schedulers, so the list contains all registered schedulers.

Test plan:
[x] all existing tests should pass

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 23, 2025
@kiukchung (Contributor) commented:

Could you provide more context on what you want to achieve with this?

@clumsy (Contributor, Author) commented Apr 23, 2025

All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.

@kiukchung (Contributor) commented:

Hi @clumsy, thanks for the pointer. torchx.schedulers.get_scheduler_factories() is a public API, and this change is not backwards-compatible for the case get_scheduler_factories(skip_defaults=False) where registered entrypoint schedulers exist. Now users will get their configured + default schedulers instead of just their configured ones.
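
A minimal sketch of that concern, assuming torchx is installed and some package registers its own schedulers via entrypoints; the before/after comments paraphrase this thread rather than the torchx docs:

```python
# Illustrative only: shows where the behavior change would be observed.
from torchx.schedulers import get_scheduler_factories

# With an entrypoint-registered scheduler present:
#   before this change: only the registered schedulers are returned
#   after this change:  the registered schedulers are merged with the defaults
factories = get_scheduler_factories(skip_defaults=False)
print(sorted(factories.keys()))
```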

Could you describe your use-case for wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.

@clumsy (Contributor, Author) commented May 1, 2025

Sure, @kiukchung

Take NeMo, for example: NVIDIA ships it with all dependencies, including nemo-run (https://github.com/NVIDIA/NeMo/blob/94589bde88fab1997c842be4e000faf69180cffb/nemo/collections/common/parts/nemo_run_utils.py#L18).

Unfortunately, nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48

Thus we cannot use local_cwd from within the container, for example, or anywhere nemo-run is installed.
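
For context, a small standard-library sketch (Python 3.10+, not torchx internals) of why such a registration is effectively global: any installed distribution that declares torchx.schedulers entrypoints is visible to every process in the environment, whether or not its schedulers were wanted:

```python
# List every scheduler entrypoint visible in the current environment.
# Requires Python 3.10+ for the group= keyword.
from importlib.metadata import entry_points

for ep in entry_points(group="torchx.schedulers"):
    print(f"{ep.name} -> {ep.value}")
```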

It makes sense to have a feature to restrict the available schedulers, but does it have to be the default behavior?

@kiukchung (Contributor) commented May 1, 2025

@clumsy ah that's an interesting edge-case. What you basically want is for torchx.schedulers to be additive. We don't treat it as such today. Since Python entrypoint groups don't compound, we have to come up with a convention for the group names.

One thing to note about DEFAULT_SCHEDULER_MODULES is that TorchX treats them as the default if you haven't registered your own (akin to map.get("key", default="DEFAULT_VAL")) rather than "generally useful ones" that get added regardless of whether you have your own registrations. This is generally the case for all the TorchX configurations exposed as entrypoints (see: https://pytorch.org/torchx/latest/advanced.html)
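
A toy illustration of the two semantics (plain Python, not the actual torchx loader code):

```python
# "default" semantics: the defaults apply only when nothing is registered,
# akin to dict.get(key, default). "additive" semantics: the defaults are
# merged with whatever is registered.
DEFAULTS = {"local_cwd": "<default factory>", "slurm": "<default factory>"}

def default_semantics(registered: dict) -> dict:
    # Current behavior: any registration fully replaces the defaults.
    return dict(registered) if registered else dict(DEFAULTS)

def additive_semantics(registered: dict) -> dict:
    # What this PR effectively asks for: defaults plus registrations.
    return {**DEFAULTS, **registered}

print(default_semantics({"nemo_run": "<custom factory>"}))   # defaults dropped
print(additive_semantics({"nemo_run": "<custom factory>"}))  # defaults kept
```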

We could do something like: {org_name}.torchx.schedulers and at load time select *.torchx.schedulers entrypoint groups. For BC we'd also have to keep reading torchx.schedulers.

If you're open to it, you can add support for prefixes in torchx.util.entrypoints.load() (signature: group: str, default: Optional[Dict[str, Any]] = None, skip_defaults: bool = False) and make a change in torchx.schedulers.get_schedulers() to call the load() fn appropriately.
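
A rough sketch of that convention with importlib.metadata (hypothetical helper name, not the final implementation): read the legacy torchx.schedulers group plus any group ending in .torchx.schedulers:

```python
# Hypothetical sketch: collect scheduler factories from the legacy
# "torchx.schedulers" group and from any "{org}.torchx.schedulers" group.
# Requires Python 3.10+ for the .groups / .select() API.
from importlib.metadata import entry_points
from typing import Any, Dict

def load_scheduler_entrypoints() -> Dict[str, Any]:
    eps = entry_points()
    factories: Dict[str, Any] = {}
    for group in sorted(eps.groups):
        if group == "torchx.schedulers" or group.endswith(".torchx.schedulers"):
            for ep in eps.select(group=group):
                # Name-conflict policy (first registration wins here) is one
                # of the open questions raised below.
                factories.setdefault(ep.name, ep.load())
    return factories
```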

There are some interesting cases regarding name conflicts and ordering (e.g. if nemo registers a scheduler with the same name as one registered somewhere else, what do you do?).

@clumsy (Contributor, Author) commented May 13, 2025

Thanks, @kiukchung! The proposed solution works for me and I can provide the implementation shortly.

clumsy pushed a commit to clumsy/torchx that referenced this pull request May 13, 2025
@clumsy (Contributor, Author) commented May 13, 2025

Please check this implementation, @kiukchung, @andywag, @d4l3k, @tonykao8080.

The existing commands continue to work, e.g.:

  • torchx runopts
my_package.custom_local_docker:
    usage:
        [copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[debug=DEBUG],[image_repo=IMAGE_REPO],[quiet=QUIET]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        env=ENV (typing.Dict[str, str], None)
            environment variables to be passed to the run. The separator sign can be either comma or semicolon
            (e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
            of the ones from copy_env
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        debug=DEBUG (bool, False)
            run a container with noop entrypoint, useful for debugging environment
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.
  • torchx run -s my_package.custom_local_cwd
usage: torchx run [-h]
                  [-s {local_docker,local_cwd,slurm,kubernetes,kubernetes_mcad,aws_batch,aws_sagemaker,gcp_batch,ray,lsf,my_package.custom_aws_batch,my_package.aws_sagemaker,my_package.custom_gcp_batch,my_package.kubernetes,my_package.kubernetes_mcad,my_package.custom_local_cwd,my_package.local_docker,my_package.lsf,my_package.custom_aws_batch,my_package.custom_local_docker,}]
  • torchx run -s my_package.custom_local_cwd utils.echo --msg "test"
torchx 2025-05-13 12:52:15 INFO     Tracker configurations: {}
torchx 2025-05-13 12:52:15 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-05-13 12:52:15 INFO     Log directory is: /var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg
my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     Launched app: my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles:
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg/torchx/echo-gww0rzcc0z72tc

@facebook-github-bot (Contributor) commented:

@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
