Adding prow job config for gcsfuse pytorch dino model for test #1950

Open: wants to merge 4 commits into base: master
4 changes: 4 additions & 0 deletions prow/prowjobs/GoogleCloudPlatform/gcsfuse/OWNER
@@ -0,0 +1,4 @@
approvers:
- sethiay
reviewers:
- sethiay
@@ -0,0 +1,33 @@
periodics:
- name: gcsfuse-pytorch-dino-periodic
  cluster: gcsfuse-prow-test
Collaborator

{"component":"checkconfig","file":"k8s.io/test-infra/prow/cmd/checkconfig/main.go:91","func":"main.reportWarning","level":"warning","msg":"invalid periodic job: job configuration for \"gcsfuse-pytorch-dino-periodic\" specifies unknown 'cluster' value \"gcsfuse-prow-test\"","severity":"warning","time":"2023-05-18T19:00:49Z"}

Is this a new cluster? Prow doesn't seem to recognize it. Can you point to where it was configured?

Contributor Author

Yes, this is a new cluster. I just ran https://github.com/GoogleCloudPlatform/oss-test-infra/blob/master/prow/oss/create-build-cluster.sh. Are there any other configuration steps I need to take? If so, please point me to them.

Prow jobs in this repository must contain TestGrid annotations or be explicitly opted-out. Try removing the commit that deletes the TestGrid annotations.

I removed the TestGrid annotations because they were giving me the error "No dashboard found". Do I have to create the TestGrid dashboard first? If so, is adding an entry to https://github.com/GoogleCloudPlatform/oss-test-infra/blob/master/testgrid/config.yaml all that is needed?

Collaborator

As far as TestGrid dashboards go, that's correct; add your new dashboard there and then you can use it in an annotation.
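
A rough sketch of those two pieces, not the exact change needed here. The dashboard name `googleoss-gcsfuse` and the tab name `pytorch-dino` are placeholders chosen for illustration, and the repository may additionally expect the dashboard to be listed under a dashboard group:

```yaml
# testgrid/config.yaml (sketch): declare the dashboard first.
# "googleoss-gcsfuse" is a hypothetical name, not taken from this PR.
dashboards:
- name: googleoss-gcsfuse
---
# Job config (sketch): reference the declared dashboard from the periodic.
periodics:
- name: gcsfuse-pytorch-dino-periodic
  annotations:
    testgrid-dashboards: googleoss-gcsfuse   # must match a declared dashboard
    testgrid-tab-name: pytorch-dino          # hypothetical tab name
```

A job that should not appear in TestGrid at all can instead opt out with the annotation `testgrid-create-test-group: "false"`.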

As for the steps you need to take after running that script so that Prow recognizes the new build cluster, I'll look into that.

Contributor

I don't think this is the correct name for the cluster you created. Typically the script creates a cluster named build-{TEAM}, so I think yours is build-gcsfuse.

Contributor Author

I changed the name in the script to gcsfuse-prow-test, and I can confirm that a cluster with this name is present in my project as well.

Contributor

I think you may not have edited the script correctly.
It looks like the cluster is configured as build-gcsfuse in https://github.com/GoogleCloudPlatform/oss-test-infra/blob/master/prow/oss/gencred-config/gencred-config.yaml

and build-gcsfuse appears in a lot of the config: https://grep.app/search?q=build-gcsfuse&filter[repo][0]=GoogleCloudPlatform/oss-test-infra

I am not seeing any config for gcsfuse-prow-test.
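
If build-gcsfuse is indeed the name Prow knows about, the job would reference it roughly as follows. This is only a sketch, assuming the rest of the job stays as in the diff and that build-gcsfuse is the cluster you actually want to run on:

```yaml
periodics:
- name: gcsfuse-pytorch-dino-periodic
  # Assumption: the cluster is registered as build-gcsfuse, per gencred-config.yaml.
  cluster: build-gcsfuse
  interval: 180h
  decorate: true
```

The `cluster` value has to match a name Prow already has credentials for, which is why checkconfig reports gcsfuse-prow-test as unknown.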

Contributor Author

Okay, I will create the cluster again.

  interval: 180h
  decorate: true
  spec:
    containers:
    - name: pytorch-dino-model
      image: us-west1-docker.pkg.dev/gcs-fuse-test-ml/test-images/pytorch-dino:latest
      securityContext:
        privileged: true
      command:
      - "/bin/sh"
      - "-c"
      - ./setup_container.sh;
      resources:
        limits:
          cpu: "22"
          memory: 120Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "22"
          memory: 120Gi
          nvidia.com/gpu: "2"
      volumeMounts:
      - name: dshm
        mountPath: /dev/shm
    volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 128Gi