feat: list all registered schedulers (#1009) #1050

Open · wants to merge 1 commit into main

Conversation

@clumsy (Contributor) commented Apr 23, 2025

A simple merge of the default and entrypoint-registered schedulers, so the list contains all registered schedulers.

Test plan:
[x] all existing tests should pass

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 23, 2025
@kiukchung (Contributor) commented:

Could you provide more context on what you want to achieve with this?

@clumsy (Contributor, Author) commented Apr 23, 2025

All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.

@kiukchung (Contributor) commented:

Hi @clumsy, thanks for the pointer. torchx.schedulers.get_scheduler_factories() is a public API, and this change is not backwards-compatible for the case get_scheduler_factories(skip_defaults=False) where registered entrypoint schedulers exist. Now users will get their configured + default schedulers instead of just their configured ones.
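
A minimal sketch of that concern, assuming torchx is installed and some package registers its own schedulers via entrypoints; the before/after comments paraphrase this thread rather than the torchx docs:

```python
# Illustrative only: shows where the behavior change would be observed.
from torchx.schedulers import get_scheduler_factories

# With an entrypoint-registered scheduler present:
#   before this change: only the registered schedulers are returned
#   after this change:  the registered schedulers are merged with the defaults
factories = get_scheduler_factories(skip_defaults=False)
print(sorted(factories.keys()))
```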

Could you describe your use-case for wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.

@clumsy (Contributor, Author) commented May 1, 2025

Sure, @kiukchung

Take NeMo, for example: NVIDIA ships it with all dependencies, including nemo-run (https://github.com/NVIDIA/NeMo/blob/94589bde88fab1997c842be4e000faf69180cffb/nemo/collections/common/parts/nemo_run_utils.py#L18).

Unfortunately, nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48

Thus we cannot use local_cwd from within the container, for example, or anywhere nemo-run is installed.
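
For context, a small standard-library sketch (Python 3.10+, not torchx internals) of why such a registration is effectively global: any installed distribution that declares torchx.schedulers entrypoints is visible to every process in the environment, whether or not its schedulers were wanted:

```python
# List every scheduler entrypoint visible in the current environment.
# Requires Python 3.10+ for the group= keyword.
from importlib.metadata import entry_points

for ep in entry_points(group="torchx.schedulers"):
    print(f"{ep.name} -> {ep.value}")
```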

It makes sense to have a feature to restrict the available schedulers, but does it have to be the default behavior?

@kiukchung (Contributor) commented May 1, 2025

@clumsy ah that's an interesting edge-case. What you basically want is for torchx.schedulers to be additive. We don't treat it as such today. Since Python entrypoint groups don't compound, we have to come up with a convention for the group names.

One thing to note about DEFAULT_SCHEDULER_MODULES is that TorchX treats them as the default if you haven't registered your own (akin to map.get("key", default="DEFAULT_VAL")) rather than "generally useful ones" that get added regardless of whether you have your own registrations. This is generally the case for all the TorchX configurations exposed as entrypoints (see: https://pytorch.org/torchx/latest/advanced.html)
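
A toy illustration of the two semantics (plain Python, not the actual torchx loader code):

```python
# "default" semantics: the defaults apply only when nothing is registered,
# akin to dict.get(key, default). "additive" semantics: the defaults are
# merged with whatever is registered.
DEFAULTS = {"local_cwd": "<default factory>", "slurm": "<default factory>"}

def default_semantics(registered: dict) -> dict:
    # Current behavior: any registration fully replaces the defaults.
    return dict(registered) if registered else dict(DEFAULTS)

def additive_semantics(registered: dict) -> dict:
    # What this PR effectively asks for: defaults plus registrations.
    return {**DEFAULTS, **registered}

print(default_semantics({"nemo_run": "<custom factory>"}))   # defaults dropped
print(additive_semantics({"nemo_run": "<custom factory>"}))  # defaults kept
```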

We could do something like: {org_name}.torchx.schedulers and at load time select *.torchx.schedulers entrypoint groups. For BC we'd also have to keep reading torchx.schedulers.

If you're open to it, you can add support for prefixes in torchx.util.entrypoints.load() (signature: group: str, default: Optional[Dict[str, Any]] = None, skip_defaults: bool = False) and make a change in torchx.schedulers.get_schedulers() to call the load() fn appropriately.
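
A rough sketch of that convention with importlib.metadata (hypothetical helper name, not the final implementation): read the legacy torchx.schedulers group plus any group ending in .torchx.schedulers:

```python
# Hypothetical sketch: collect scheduler factories from the legacy
# "torchx.schedulers" group and from any "{org}.torchx.schedulers" group.
# Requires Python 3.10+ for the .groups / .select() API.
from importlib.metadata import entry_points
from typing import Any, Dict

def load_scheduler_entrypoints() -> Dict[str, Any]:
    eps = entry_points()
    factories: Dict[str, Any] = {}
    for group in sorted(eps.groups):
        if group == "torchx.schedulers" or group.endswith(".torchx.schedulers"):
            for ep in eps.select(group=group):
                # Name-conflict policy (first registration wins here) is one
                # of the open questions raised below.
                factories.setdefault(ep.name, ep.load())
    return factories
```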

There are some interesting cases regarding name conflicts and ordering (e.g. if nemo registers a scheduler with the same name as one registered somewhere else, what do you do?).

@clumsy (Contributor, Author) commented May 13, 2025

Thanks, @kiukchung! The proposed solution works for me and I can provide the implementation shortly.

clumsy pushed a commit to clumsy/torchx that referenced this pull request May 13, 2025
@clumsy (Contributor, Author) commented May 13, 2025

Please check this implementation, @kiukchung, @andywag, @d4l3k, @tonykao8080.

The existing commands continue to work, e.g.:

  • torchx runopts
my_package.custom_local_docker:
    usage:
        [copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[debug=DEBUG],[image_repo=IMAGE_REPO],[quiet=QUIET]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        env=ENV (typing.Dict[str, str], None)
            environment variables to be passed to the run. The separator sign can be either comma or semicolon
            (e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
            of the ones from copy_env
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        debug=DEBUG (bool, False)
            run a container with noop entrypoint, useful for debugging environment
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.
  • torchx run -s my_package.custom_local_cwd
usage: torchx run [-h]
                  [-s {local_docker,local_cwd,slurm,kubernetes,kubernetes_mcad,aws_batch,aws_sagemaker,gcp_batch,ray,lsf,my_package.custom_aws_batch,my_package.aws_sagemaker,my_package.custom_gcp_batch,my_package.kubernetes,my_package.kubernetes_mcad,my_package.custom_local_cwd,my_package.local_docker,my_package.lsf,my_package.custom_aws_batch,my_package.custom_local_docker,}]
  • torchx run -s my_package.custom_local_cwd utils.echo --msg "test"
torchx 2025-05-13 12:52:15 INFO     Tracker configurations: {}
torchx 2025-05-13 12:52:15 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-05-13 12:52:15 INFO     Log directory is: /var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg
my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     Launched app: my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles:
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg/torchx/echo-gww0rzcc0z72tc

@facebook-github-bot (Contributor) commented:

@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
