Update SDK e2e test to reflect poetry changes #315

Merged

Conversation

Srihari1192
Contributor

@Srihari1192 Srihari1192 commented Oct 3, 2023

Issue link

closes #250

What changes have been made

Updated the test TestMNISTRayClusterSDK to install the CodeFlare SDK with poetry instead of pip.
Added the script install-codeflare-sdk.sh, which installs codeflare-sdk using poetry and can be reused in other tests.

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@sutaakar
Contributor

sutaakar commented Oct 4, 2023

@Srihari1192 The TestMNISTRayClusterSDK is failing, can you check it?

@Srihari1192
Contributor Author

> @Srihari1192 The TestMNISTRayClusterSDK is failing, can you check it?

The test TestMNISTRayClusterSDK is failing at the step "Waiting for requested resources to be set up", with the cluster status stuck at queueing. From the operator logs, the test seems to fail because of "Insufficient resources to dispatch AppWrapper.":

    1 queuejob_controller_ex.go:1242] [ScheduleNext] [Agent Mode] Failed to dispatch app wrapper 'test-ns-qvqq7/mnist' due to insufficient resources, activeQ=true Unsched=false &qj=0xc0008a2c00 Version=5500 Status={Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State:Pending Message: SystemPriority:0 QueueJobState:HeadOfLine ControllerFirstTimestamp:2023-10-03 14:25:25.672931 +0000 UTC ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:true Sender:before ScheduleNext - setHOL Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-10-03 14:25:25.672931 +0000 UTC LastTransitionMicroTime:2023-10-03 14:25:25.672932 +0000 UTC Reason: Message:} {Type:Queueing Status:True LastUpdateMicroTime:2023-10-03 14:25:25.672945 +0000 UTC LastTransitionMicroTime:2023-10-03 14:25:25.672945 +0000 UTC Reason:AwaitingHeadOfLine Message:} {Type:HeadOfLine Status:True LastUpdateMicroTime:2023-10-03 14:25:25.695766 +0000 UTC LastTransitionMicroTime:2023-10-03 14:25:25.695766 +0000 UTC Reason:FrontOfQueue. Message:} {Type:Backoff Status:True LastUpdateMicroTime:2023-10-03 14:25:25.730656 +0000 UTC LastTransitionMicroTime:2023-10-03 14:25:25.730656 +0000 UTC Reason:AppWrapperNotRunnable. Message:Insufficient resources to dispatch AppWrapper.}] PendingPodConditions:[] TotalCPU:0 TotalMemory:0 TotalGPU:0 RequeueingTimeInSeconds:0 NumberOfRequeueings:0}
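The status dump in the log line above is easier to read once its `Conditions` entries are pulled apart. The following is a minimal sketch, assuming the entries keep the exact `{Type:... Status:... LastUpdateMicroTime:... LastTransitionMicroTime:... Reason:... Message:...}` layout shown in that line; the helper names `parse_conditions` and `last_condition` are hypothetical and not part of any CodeFlare or MCAD tooling.

```python
import re

# Matches one condition entry as it is printed in the status dump above.
# Assumption: fields always appear in this order and contain no "}" character.
CONDITION_RE = re.compile(
    r"\{Type:(?P<type>\w+) Status:(?P<status>\w+) "
    r"LastUpdateMicroTime:.*? LastTransitionMicroTime:.*? "
    r"Reason:(?P<reason>.*?) Message:(?P<message>.*?)\}"
)


def parse_conditions(log_line):
    """Return every condition found in the log line as a dict, in log order."""
    return [m.groupdict() for m in CONDITION_RE.finditer(log_line)]


def last_condition(log_line):
    """Return the last condition listed in the line, or None if there is none."""
    conditions = parse_conditions(log_line)
    return conditions[-1] if conditions else None


# Shortened excerpt of the log line above (first and last condition only).
SAMPLE = (
    "Conditions:[{Type:Init Status:True "
    "LastUpdateMicroTime:2023-10-03 14:25:25.672931 +0000 UTC "
    "LastTransitionMicroTime:2023-10-03 14:25:25.672932 +0000 UTC "
    "Reason: Message:} "
    "{Type:Backoff Status:True "
    "LastUpdateMicroTime:2023-10-03 14:25:25.730656 +0000 UTC "
    "LastTransitionMicroTime:2023-10-03 14:25:25.730656 +0000 UTC "
    "Reason:AppWrapperNotRunnable. "
    "Message:Insufficient resources to dispatch AppWrapper.}]"
)

if __name__ == "__main__":
    print(last_condition(SAMPLE))
```

Run against the full line, the last entry is the `Backoff` condition, which carries the "Insufficient resources to dispatch AppWrapper." message quoted in the comment.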

@sutaakar
Contributor

sutaakar commented Oct 4, 2023

@Srihari1192 That is because of resource limit for Ray head.
Can you try to apply head_cpus and head_mem with some low values?

Contributor

@Fiona-Waters Fiona-Waters left a comment

I am running locally on a kind cluster and it is failing for me as well. I have added head_cpus=1, head_memory=0.2 to the cluster configuration but am still getting an error (I've also tried increasing these values). The job is not reaching the desired complete or failed state. The end of the log looks like this:
[image: end of the log]

@Srihari1192
Contributor Author

Srihari1192 commented Oct 4, 2023

> @Srihari1192 That is because of resource limit for Ray head. Can you try to apply head_cpus and head_mem with some low values?

I tried running in my fork repo with minimum values set for the head:

    cluster = Cluster(ClusterConfiguration(
        name='mnist',
        namespace=namespace,
        num_workers=1,
        min_cpus='100m',
        max_cpus=1,
        min_memory=0.1,
        max_memory=1,
        num_gpus=0,
        head_cpus='100m',
        head_memory=0.5,
        instascale=False,
    ))

The tests are failing with this error:

    E1004 12:24:34.027917 1 genericresource.go:224] mapping error from raw object: no matches for kind "Route" in version "route.openshift.io/v1"
    E1004 12:24:34.027989 1 queuejob_controller_ex.go:2064] [manageQueueJob] Error dispatching generic item for app wrapper='test-ns-8clhp/mnist' type=no matches for kind "Route" in version "route.openshift.io/v1" err=%!v(MISSING)
    E1004 12:24:34.895897 1 genericresource.go:106] mapping error from raw object:no matches for kind "Route" in version "route.openshift.io/v1"
    E1004 12:24:34.895930 1 queuejob_controller_ex.go:2194] [Cleanup] Error deleting generic item , from app wrapper='test-ns-8clhp/mnist' err=no matches for kind "Route" in version "route.openshift.io/v1".
    E1004 12:24:34.895956 1 queuejob_controller_ex.go:2097] Failed to delete resources associated with app wrapper: 'test-ns-8clhp/mnist', err 1 error occurred: * no matches for kind "Route" in version "route.openshift.io/v1"
    W1004 12:24:34.895984 1 queuejob_controller_ex.go:1932] [worker] Fail to process item from eventQueue, err 1 error occurred: * no matches for kind "Route" in version "route.openshift.io/v1" . Attempting to re-enqueque...
    W1004 12:24:34.895999 1 queuejob_controller_ex.go:1936] [worker] Item re-enqueued.
    E1004 12:24:36.635570 1 genericresource.go:106] mapping error from raw object:no matches for kind "Route" in version "route.openshift.io/v1"
    E1004 12:24:36.635605 1 queuejob_controller_ex.go:2194] [Cleanup] Error deleting generic item , from app wrapper='test-ns-8clhp/mnist' err=no matches for kind "Route" in version "route.openshift.io/v1".
    E1004 12:24:36.635618 1 queuejob_controller_ex.go:1874] [worker] Failed to delete resources for AppWrapper Job 'test-ns-8clhp/mnist', err=1 error occurred: * no matches for kind "Route" in version "route.openshift.io/v1"
    W1004 12:24:36.635741 1 queuejob_controller_ex.go:1932] [worker] Fail to process item from eventQueue, err 1 error occurred: * no matches for kind "Route" in version "route.openshift.io/v1"
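Every error in the log above repeats the same `no matches for kind ... in version ...` message, so counting the distinct (kind, version) pairs shows whether there is one root cause or several. This is a minimal sketch that only assumes the message wording seen in the log; the helper name `missing_kinds` is hypothetical.

```python
import re
from collections import Counter

# Matches the API-discovery failure message as it appears in the log above.
MISSING_KIND_RE = re.compile(r'no matches for kind "([^"]+)" in version "([^"]+)"')


def missing_kinds(log_text):
    """Count how often each (kind, apiVersion) pair is reported as missing."""
    return Counter(MISSING_KIND_RE.findall(log_text))


# Two entries copied from the log above.
SAMPLE = (
    'E1004 12:24:34.027917 1 genericresource.go:224] mapping error from raw object: '
    'no matches for kind "Route" in version "route.openshift.io/v1"\n'
    'W1004 12:24:34.895999 1 queuejob_controller_ex.go:1936] [worker] Item re-enqueued.\n'
    'E1004 12:24:36.635570 1 genericresource.go:106] mapping error from raw object:'
    'no matches for kind "Route" in version "route.openshift.io/v1"\n'
)

if __name__ == "__main__":
    for (kind, version), count in missing_kinds(SAMPLE).items():
        print(f"{kind} ({version}): reported missing {count} time(s)")
```

For this log the only pair reported is `Route` in `route.openshift.io/v1`, meaning the cluster running the test does not serve that OpenShift API.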

@Srihari1192
Contributor Author

I think the tests were failing because Ingress is not supported in the Kind cluster. It should work after the PR is merged.

@Srihari1192 Srihari1192 force-pushed the sdk-install-poetry-250 branch 3 times, most recently from 4baf500 to ffc2004 Compare November 9, 2023 10:22
@Srihari1192 Srihari1192 force-pushed the sdk-install-poetry-250 branch from ca86142 to 7ee094b Compare November 15, 2023 06:01
@Srihari1192 Srihari1192 force-pushed the sdk-install-poetry-250 branch 7 times, most recently from c39c881 to 401bca9 Compare November 17, 2023 05:14
@Srihari1192 Srihari1192 force-pushed the sdk-install-poetry-250 branch 6 times, most recently from b0746ac to 50a5a9c Compare November 20, 2023 16:08
Add Ingress domain for sdk e2e test

Revert "Add Ingress domain for sdk e2e test"

This reverts commit ffc2004.
@Srihari1192 Srihari1192 force-pushed the sdk-install-poetry-250 branch from 50a5a9c to e811c04 Compare November 21, 2023 06:09
@Srihari1192 Srihari1192 marked this pull request as ready for review November 21, 2023 06:30
@astefanutti
Contributor

/lgtm

@astefanutti
Contributor

/approve

openshift-ci bot commented Nov 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, ChristianZaccaria

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 42cce2d into project-codeflare:main Nov 21, 2023
6 participants