job_def.submit(cluster) fails with 503 Server Error #249

Open
@kpouget

As part of my automated CodeFlare testing, I'm hitting this exception when submitting a job:

ERROR:root:Caught exception HTTPError: 503 Server Error: Service Unavailable for url: http://ray-dashboard-mnisttest-user0-codeflare-sdk-user-test-user-0.apps.kpouget-sutest-20230726-07h01.psap.aws.rhperfscale.org/api/version
Traceback (most recent call last):
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 180, in <module>
    sys.exit(main())
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 175, in main
    fire.Fire(Entrypoint())
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 49, in wrapper
    fct(*args, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 148, in sdk_user_run_one
    test_sdk_user.run_one()
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 165, in run_one
    timeout(entrypoint.main,
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 148, in timeout
    return func(*args, **kwargs)
  File "/mnt/logs/002__run_one/sample.py", line 34, in main
    job = job_def.submit(cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/job/jobs.py", line 166, in submit
    return DDPJob(self, cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/job/jobs.py", line 174, in __init__
    self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
  File "/opt/venv/lib/python3.9/site-packages/torchx/runner/api.py", line 278, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/opt/venv/lib/python3.9/site-packages/torchx/schedulers/ray_scheduler.py", line 199, in schedule
    client: JobSubmissionClient = JobSubmissionClient(
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 100, in __init__
    self._check_connection_and_version(
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 228, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 245, in _check_connection_and_version_with_url
    r.raise_for_status()
  File "/opt/venv/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://ray-dashboard-mnisttest-user0-codeflare-sdk-user-test-user-0.apps.kpouget-sutest-20230726-07h01.psap.aws.rhperfscale.org/api/version

This Python code is being executed:

    # Create our cluster and submit appwrapper
    cluster = Cluster(ClusterConfiguration(
        namespace=namespace, name=f"mnisttest-user{user_idx}",
        min_worker=2, max_worker=2,
        min_cpus=2, max_cpus=2,
        min_memory=4, max_memory=4,
        gpu=0,
        instascale=False))
    # Bring up the cluster
    cluster.up()
    cluster.wait_ready()
    cluster.status()
    cluster.details()

    job_def = DDPJobDefinition(name="mnisttest", script="mnist.py", workspace=".", scheduler_args={"requirements": "./requirements.txt"})
    job = job_def.submit(cluster)

and the last line raises the exception.
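
The traceback shows the 503 coming from the Ray dashboard's `/api/version` endpoint, which `JobSubmissionClient` queries before submitting. To check whether the dashboard route is simply not serving yet after `cluster.wait_ready()` returns, a polling helper along these lines could run before `job_def.submit(cluster)`. This is only a diagnostic sketch: `http_status` and `wait_for_dashboard` are hypothetical helpers, not part of the CodeFlare SDK, and the retry count and delay are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET on url, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as the 503 above) arrive as HTTPError.
        return err.code


def wait_for_dashboard(base_url, attempts=30, delay=2.0,
                       get_status=http_status, sleep=time.sleep):
    """Poll <base_url>/api/version until it answers 200.

    Returns True once the endpoint responds with 200, or False if it
    never does within the given number of attempts.
    """
    url = base_url.rstrip("/") + "/api/version"
    for _ in range(attempts):
        try:
            if get_status(url) == 200:
                return True
        except OSError:
            # Connection refused, DNS not resolvable yet, or timeout: retry.
            pass
        sleep(delay)
    return False
```

Called with the dashboard base URL (the host in the failing request above), a `False` result would suggest the dashboard never became reachable, while a `True` result after a few retries would point to a readiness race between `wait_ready()` and the dashboard route.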


  • The CodeFlare SDK is installed from pip (latest version)
    • I'll remove the --quiet flag to capture the exact version being installed
  • The CodeFlare stack is installed from ODH + the OpenShift CodeFlare operator
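
As an alternative to reading the pip output, the installed versions could also be logged from inside the test script itself. A minimal sketch using the standard library; `installed_versions` is a hypothetical helper, and the default package names are the pip distribution names assumed for the three libraries that appear in the traceback.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("codeflare-sdk", "ray", "torchx")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```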
