Description
As part of my automated Codeflare testing, I'm hitting this exception:
ERROR:root:Caught exception HTTPError: 503 Server Error: Service Unavailable for url: http://ray-dashboard-mnisttest-user0-codeflare-sdk-user-test-user-0.apps.kpouget-sutest-20230726-07h01.psap.aws.rhperfscale.org/api/version
Traceback (most recent call last):
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 180, in <module>
    sys.exit(main())
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 175, in main
    fire.Fire(Entrypoint())
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 49, in wrapper
    fct(*args, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 148, in sdk_user_run_one
    test_sdk_user.run_one()
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 165, in run_one
    timeout(entrypoint.main,
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 148, in timeout
    return func(*args, **kwargs)
  File "/mnt/logs/002__run_one/sample.py", line 34, in main
    job = job_def.submit(cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/job/jobs.py", line 166, in submit
    return DDPJob(self, cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/job/jobs.py", line 174, in __init__
    self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
  File "/opt/venv/lib/python3.9/site-packages/torchx/runner/api.py", line 278, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/opt/venv/lib/python3.9/site-packages/torchx/schedulers/ray_scheduler.py", line 199, in schedule
    client: JobSubmissionClient = JobSubmissionClient(
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 100, in __init__
    self._check_connection_and_version(
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 228, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/opt/venv/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 245, in _check_connection_and_version_with_url
    r.raise_for_status()
  File "/opt/venv/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://ray-dashboard-mnisttest-user0-codeflare-sdk-user-test-user-0.apps.kpouget-sutest-20230726-07h01.psap.aws.rhperfscale.org/api/version
This Python file is being executed:
# Create our cluster and submit appwrapper
cluster = Cluster(ClusterConfiguration(
    namespace=namespace, name=f"mnisttest-user{user_idx}",
    min_worker=2, max_worker=2,
    min_cpus=2, max_cpus=2,
    min_memory=4, max_memory=4,
    gpu=0,
    instascale=False))
# Bring up the cluster
cluster.up()
cluster.wait_ready()
cluster.status()
cluster.details()
job_def = DDPJobDefinition(name="mnisttest", script="mnist.py", workspace=".", scheduler_args={"requirements": "./requirements.txt"})
job = job_def.submit(cluster)
and the last line raises the exception.
- Codeflare SDK is installed from pip (latest version)
  - I'll remove the --quiet flag to capture the exact version being installed
- Codeflare stack is installed from ODH + OpenShift Codeflare operator
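
One possible way to narrow this down, assuming (not confirmed here) that the 503 is a race between cluster.wait_ready() returning and the Ray dashboard route actually serving requests, is to poll the /api/version endpoint from the error message before calling job_def.submit(cluster). The sketch below uses only the Python standard library; the helper names dashboard_is_up and wait_until, and the dashboard_url variable, are hypothetical and not part of the Codeflare SDK.

```python
import time
import urllib.error
import urllib.request


def dashboard_is_up(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 503 responses;
        # connection errors and timeouts are OSError subclasses.
        return False


def wait_until(check, attempts=30, delay=10.0, sleep=time.sleep):
    """Call check() until it returns True; give up after `attempts` tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause between tries, but not after the last one
    return False


# Hypothetical usage in the test script, between cluster.wait_ready() and submit.
# `dashboard_url` is an assumed variable holding the route seen in the error:
#
#   if not wait_until(lambda: dashboard_is_up(f"{dashboard_url}/api/version")):
#       raise RuntimeError("Ray dashboard never became reachable")
#   job = job_def.submit(cluster)
```

If the submission succeeds after a few polling attempts, that would point to a readiness race rather than a misconfigured route; if the endpoint keeps returning 503, the route or the dashboard service itself is more likely at fault.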