Open
Description
As part of my automated Codeflare testing, I'm hitting this exception:
Traceback (most recent call last):
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 180, in <module>
    sys.exit(main())
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 175, in main
    fire.Fire(Entrypoint())
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 49, in wrapper
    fct(*args, **kwargs)
  File "/opt/ci-artifacts/src/testing/codeflare/test.py", line 148, in sdk_user_run_one
    test_sdk_user.run_one()
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 165, in run_one
    timeout(entrypoint.main,
  File "/opt/ci-artifacts/src/testing/codeflare/test_sdk_user.py", line 148, in timeout
    return func(*args, **kwargs)
  File "/mnt/logs/002__run_one/sample.py", line 28, in main
    cluster.wait_ready()
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 221, in wait_ready
    status, ready = self.status(print_to_console=False)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 160, in status
    appwrapper = _app_wrapper_status(self.config.name, self.config.namespace)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 345, in _app_wrapper_status
    return _map_to_app_wrapper(cluster)
  File "/opt/venv/lib/python3.9/site-packages/codeflare_sdk/cluster/cluster.py", line 469, in _map_to_app_wrapper
    status=AppWrapperStatus(cluster_model.status.state.lower()),
TypeError: 'MissingModel' object is not callable
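My reading of the last frame, offered as a guess rather than a confirmed diagnosis: if the AppWrapper has no `status.state` yet, the lookup may return a "missing" placeholder object instead of a string, so `.lower` resolves to that placeholder and calling it raises the `TypeError`. The sketch below reproduces that shape with hypothetical stand-ins (`MissingStandIn`, `Status`, `read_state_unsafe`, `read_state_safe` are all made up for illustration and are not the SDK's or the client library's real classes):

```python
class MissingStandIn:
    """Hypothetical stand-in for a 'missing field' placeholder (not the real library class)."""

    def __getattr__(self, name):
        # Leave dunder lookups alone so the object behaves normally elsewhere.
        if name.startswith("__"):
            raise AttributeError(name)
        # Any other attribute of a missing field is itself "missing".
        return self


class Status:
    """Hypothetical status object whose `state` may not be populated yet."""

    def __init__(self, state=None):
        self.state = state if state is not None else MissingStandIn()


def read_state_unsafe(status):
    # Same shape as the failing call: `.lower` resolves to the placeholder,
    # and calling the placeholder raises "object is not callable".
    return status.state.lower()


def read_state_safe(status, default="pending"):
    # Only call .lower() when the state really is a string.
    state = status.state
    return state.lower() if isinstance(state, str) else default
```

With these stand-ins, `read_state_unsafe(Status())` raises a `TypeError` of the same kind as above, while `read_state_safe(Status())` falls back to the default. Whether "pending" is the right fallback for the SDK is an assumption on my side.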
This is the Python code being executed:
# Create our cluster and submit appwrapper
cluster = Cluster(ClusterConfiguration(
    namespace=namespace, name=f"mnisttest-user{user_idx}",
    min_worker=2, max_worker=2,
    min_cpus=2, max_cpus=2,
    min_memory=4, max_memory=4,
    gpu=0,
    instascale=False))

# Bring up the cluster
cluster.up()
cluster.wait_ready()  # <-- this line raises the exception
cluster.status()
cluster.details()

job_def = DDPJobDefinition(name="mnisttest", script="mnist.py", workspace=".", scheduler_args={"requirements": "./requirements.txt"})
job = job_def.submit(cluster)
The RayCluster Pods are pending because of project-codeflare/multi-cluster-app-dispatcher#512, but codeflare-sdk shouldn't fail because of it:
codeflare-sdk-user-test-user-1 mnisttest-user1-head-v7fn8 0/1 Pending 0 6m43s <none> <none> <none> <none>
codeflare-sdk-user-test-user-1 nisttest-user1-worker-small-group-mnisttest-user1-dwhb4 0/1 Pending 0 6m43s <none> <none> <none> <none>
codeflare-sdk-user-test-user-1 nisttest-user1-worker-small-group-mnisttest-user1-xccpd 0/1 Pending 0 6m43s <none> <none> <none> <none>
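Until this is handled in the SDK, a test-side workaround could be to poll for readiness and treat the `TypeError` as "not ready yet". This is only a sketch: `wait_until_ready` and its parameters are hypothetical names of mine, and swallowing `TypeError` assumes the exception really does mean the state is not populated yet.

```python
import time


def wait_until_ready(check_ready, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll check_ready() until it returns True or the timeout expires.

    A TypeError raised by check_ready() is treated as "not ready yet"
    (assumption: it comes from a status field that is not populated yet).
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if check_ready():
                return True
        except TypeError:
            pass  # status not readable yet, retry on the next iteration
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Possible usage in the test, based on the (status, ready) tuple seen in the traceback:
# ready = wait_until_ready(lambda: cluster.status(print_to_console=False)[1])
```

The function returns `False` on timeout instead of raising, so the test can still collect the Pod and AppWrapper state before failing.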
Here is the state of the AppWrapper (captured manually after the test):
appwrapper.yaml.log
- Codeflare SDK is installed from pip (latest version)
  - I'll remove the --quiet flag to capture the exact version being installed
- Codeflare stack is installed from ODH + OpenShift Codeflare operator