job_def.submit(cluster) fails with 503 Server Error #249

Open
kpouget opened this issue Jul 26, 2023 · 2 comments

kpouget commented Jul 26, 2023

As part of my automated Codeflare testing, I'm hitting this exception:

ERROR:root:Caught exception HTTPError: 503 Server Error: Service Unavailable for url: http://ray-dashboard-mnisttest-user0-codeflare-sdk-user-test-user-0.apps.kpouget-sutest-20230726-07h01.psap.aws.rhperfscale.org/api/version
Traceback (most recent call last):
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 180, in <module>
    sys.exit(main())
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 175, in main
    fire.Fire(Entrypoint())
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 49, in wrapper
    fct(*args, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 148, in sdk_user_run_one
    test_sdk_user.run_one()
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 165, in run_one
    timeout(entrypoint.main,
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 148, in timeout
    return func(*args, **kwargs)
  File "/mnt/logs/002__run_one/sample.py", line 34, in main
    job = job_def.submit(cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/job/jobs.py", line 166, in submit
    return DDPJob(self, cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/job/jobs.py", line 174, in __init__
    self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
  File "/opt/venv/lib/python3.9/site-packages/torchx/runner/api.py", line 278, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/opt/venv/lib/python3.9/site-packages/torchx/schedulers/ray_scheduler.py", line 199, in schedule
    client: JobSubmissionClient = JobSubmissionClient(
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 100, in __init__
    self._check_connection_and_version(
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 228, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 245, in _check_connection_and_version_with_url
    r.raise_for_status()
  File "/opt/venv/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://ray-dashboard-mnisttest-user0-codeflare-sdk-user-test-user-0.apps.kpouget-sutest-20230726-07h01.psap.aws.rhperfscale.org/api/version

This Python code is being executed:

    # Create our cluster and submit appwrapper
    cluster = Cluster(ClusterConfiguration(
        namespace=namespace, name=f"mnisttest-user{user_idx}",
        min_worker=2, max_worker=2,
        min_cpus=2, max_cpus=2,
        min_memory=4, max_memory=4,
        gpu=0,
        instascale=False))
    # Bring up the cluster
    cluster.up()
    cluster.wait_ready()
    cluster.status()
    cluster.details()

    job_def = DDPJobDefinition(name="mnisttest", script="mnist.py", workspace=".", scheduler_args={"requirements": "./requirements.txt"})
    job = job_def.submit(cluster)

and the last line raises the exception.


  • Codeflare SDK is installed from pip (latest version)
    • I'll remove the --quiet flag to capture the exact version being installed
  • Codeflare stack is installed from ODH + OpenShift Codeflare operator
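
For anyone reproducing this: the traceback shows the 503 coming from the `/api/version` check that Ray's `JobSubmissionClient` performs on the dashboard URL. A minimal sketch of a pre-flight probe for that same endpoint is below. It is not part of the Codeflare SDK or Ray; the names `dashboard_status` and `wait_for_dashboard`, and the retry defaults, are made up for illustration, and whether waiting actually resolves the 503 is an assumption that depends on the root cause.

```python
import time
import urllib.error
import urllib.request


def dashboard_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the connection failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (such as the 503 in the traceback) land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, ...
        return None


def wait_for_dashboard(base_url, attempts=30, delay=10.0,
                       probe=dashboard_status, sleep=time.sleep):
    """Poll <base_url>/api/version until it answers 200.

    Returns True once the endpoint answers 200, False after `attempts` tries.
    `probe` and `sleep` are injectable so the loop can be exercised offline.
    """
    url = base_url.rstrip("/") + "/api/version"
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The idea would be to call `wait_for_dashboard(...)` with the dashboard base URL (the host shown in the error message) between `cluster.wait_ready()` and `job_def.submit(cluster)`, and to collect extra diagnostics if it returns False.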

kpouget commented Jul 26, 2023

Hmm, I noticed that I was missing this line from the example I took the code from:

    image="quay.io/project-codeflare/ray:2.5.0-py38-cu116",

and thus, my RayCluster was created with the following (default?) image:

"image": "ghcr.io/foundation-model-stack/base:ray2.1.0-py38-gpu-pytorch1.12.0cu116-20221213-193103",

@Maxusmusti, is the default image supposed to work? 🤔

Maxusmusti (Collaborator) commented

Yeah, that image should still work, though it's worth noting that in the upcoming release the default will switch to the new 2.5.0 image linked there.
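
To take the default image out of the equation while debugging, the image can be passed explicitly. A minimal sketch, reusing the settings from the snippet in the first comment plus the `image=` line quoted above; the helper name `cluster_settings` is hypothetical, and nothing in this thread confirms that pinning the image fixes the 503.

```python
# Image string copied from the example line quoted earlier in this thread.
RAY_IMAGE = "quay.io/project-codeflare/ray:2.5.0-py38-cu116"


def cluster_settings(namespace, user_idx, image=RAY_IMAGE):
    """Build the ClusterConfiguration keyword arguments with an explicit image."""
    return dict(
        namespace=namespace,
        name=f"mnisttest-user{user_idx}",
        min_worker=2, max_worker=2,
        min_cpus=2, max_cpus=2,
        min_memory=4, max_memory=4,
        gpu=0,
        instascale=False,
        image=image,  # set explicitly instead of relying on the SDK default
    )


# Usage with the SDK installed (left as comments so the helper runs without it):
#   cluster = Cluster(ClusterConfiguration(**cluster_settings(namespace, user_idx)))
#   cluster.up()
#   cluster.wait_ready()
```

Keeping the settings in one dict also makes it easy to log exactly which image a given test run asked for.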
