JobSet usage in another operator #131

Closed
vsoch opened this issue May 9, 2023 · 7 comments

Comments

@vsoch
Contributor

vsoch commented May 9, 2023

Hiya! I've started experimenting with using JobSet from another operator (the Flux Operator) and finished most of the implementation (REF), but when I got to testing it I ran into some issues (e.g., I still need to check how the new headless service that is created for me works to update the broker config). What I'm doing is:

Creating the cluster:

$ kind create cluster

Installing JobSet, per the instructions here:

VERSION=v0.1.3
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml

And trying to create a hello world example MiniCluster with the JobSet:

$ kubectl create namespace flux-operator
$ kubectl apply -f examples/dist/flux-operator-dev.yaml
$ kubectl apply -f examples/tests/jobset/minicluster.yaml

I get this error - the type is not registered:

2023-05-09T00:09:51Z    ERROR   Reconciler error        {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "MiniCluster": {"name":"flux-sample","namespace":"flux-operator"}, "namespace": "flux-operator", "name": "flux-sample", "reconcileID": "1db5223d-1f8a-4e49-a219-8d3ba55de826", "error": "no kind is registered for the type v1alpha1.JobSet in scheme \"pkg/runtime/scheme.go:100\""}

And the issue (I think) is that the JobSet namespace isn't really part of the main cluster, or at least can't be seen from the flux-operator namespace, or something of that flavor? I'm thinking there likely needs to be a way to customize how it's installed so the kind "JobSet" is known. This might also be helpful to have in the install docs, for cases where a developer wants to use it in their own applications.

Thank you!

@vsoch
Contributor Author

vsoch commented May 9, 2023

okay maybe(?) getting closer - I registered the scheme:

package main

import (
	"flag"
	"os"

	// Import all Kubernetes client auth plugins (e.g. Azure, GCP, OIDC, etc.)
	// to ensure that exec-entrypoint and run can make use of them.
	_ "k8s.io/client-go/plugin/pkg/client/auth"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/runtime/serializer"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"

	api "flux-framework/flux-operator/api/v1alpha1"

+	jobset "sigs.k8s.io/jobset/api/v1alpha1"

	"flux-framework/flux-operator/controllers/core"
	//+kubebuilder:scaffold:imports
)

var (
	scheme   = runtime.NewScheme()
	setupLog = ctrl.Log.WithName("setup")
)

func init() {
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))

	utilruntime.Must(api.AddToScheme(scheme))
+	utilruntime.Must(jobset.AddToScheme(scheme))

	//+kubebuilder:scaffold:scheme
}

and now it's a permissions thing, but it seems to be seen!

W0509 00:48:36.257473       1 reflector.go:424] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:169: failed to list *v1alpha1.JobSet: jobsets.jobset.x-k8s.io is forbidden: User "system:serviceaccount:operator-system:operator-controller-manager" cannot list resource "jobsets" in API group "jobset.x-k8s.io" at the cluster scope
E0509 00:48:36.257494       1 reflector.go:140] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:169: Failed to watch *v1alpha1.JobSet: failed to list *v1alpha1.JobSet: jobsets.jobset.x-k8s.io is forbidden: User "system:serviceaccount:operator-system:operator-controller-manager" cannot list resource "jobsets" in API group "jobset.x-k8s.io" at the cluster scope

Going to mess with rbac

@vsoch
Contributor Author

vsoch commented May 9, 2023

okay - added this to rbac:

//+kubebuilder:rbac:groups=jobset.x-k8s.io,resources=jobsets,verbs=get;list;watch;create;update;patch;delete;exec
//+kubebuilder:rbac:groups=jobset.x-k8s.io,resources=jobsets/status,verbs=get;list;watch;create;update;patch;delete;exec

but it can't find it still:

2023-05-09T00:59:10Z    ERROR   Reconciler error        {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "MiniCluster": {"name":"flux-sample","namespace":"flux-operator"}, "namespace": "flux-operator", "name": "flux-sample", "reconcileID": "37af6527-8f32-4286-8106-5d6bbf8d398b", "error": "no matches for kind \"JobSet\" in version \"jobset.x-k8s.io/v1alpha1\""}

I think perhaps it doesn't know the namespace to look in?
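For anyone following along, the errors in this thread look similar but usually come from different layers. A rough sketch (not part of the operator; `Classify`, `ErrorField`, and the `Cause` labels are hypothetical names made up for illustration) that pulls the `error` field out of a controller-runtime log line and maps it to the layer it most likely points at. In particular, "no matches for kind" typically means the API server's discovery data (as cached by the client's REST mapper) does not list that kind, for example because the CRD is not installed in this cluster, rather than anything about namespaces:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Cause is a coarse, hypothetical label for where a failure most likely originates.
type Cause string

const (
	CauseScheme  Cause = "client scheme: Go type not added with AddToScheme"
	CauseMapping Cause = "API discovery: kind not served (CRD missing or discovery cache stale)"
	CauseRBAC    Cause = "RBAC: service account lacks a verb on the resource"
	CauseWebhook Cause = "admission webhook: object rejected by validation"
	CauseUnknown Cause = "unknown"
)

// Classify matches well-known substrings of Kubernetes client and API server errors.
// It is a heuristic for reading logs, not an exhaustive or official mapping.
func Classify(msg string) Cause {
	switch {
	case strings.Contains(msg, "no kind is registered for the type"):
		return CauseScheme
	case strings.Contains(msg, "no matches for kind"):
		return CauseMapping
	case strings.Contains(msg, "is forbidden"):
		return CauseRBAC
	case strings.Contains(msg, "admission webhook") && strings.Contains(msg, "denied the request"):
		return CauseWebhook
	}
	return CauseUnknown
}

// ErrorField extracts the "error" value from the JSON payload that follows
// the timestamp, level, and message in a structured log line.
func ErrorField(line string) (string, error) {
	start := strings.Index(line, "{")
	if start < 0 {
		return "", fmt.Errorf("no JSON payload found in log line")
	}
	var payload struct {
		Error string `json:"error"`
	}
	if err := json.Unmarshal([]byte(line[start:]), &payload); err != nil {
		return "", err
	}
	return payload.Error, nil
}

func main() {
	// Shortened version of the reconciler log line quoted above.
	line := `2023-05-09T00:59:10Z ERROR Reconciler error {"controller": "minicluster", "error": "no matches for kind \"JobSet\" in version \"jobset.x-k8s.io/v1alpha1\""}`
	msg, err := ErrorField(line)
	if err != nil {
		fmt.Println("could not parse line:", err)
		return
	}
	fmt.Println(Classify(msg))
}
```

The substring checks are deliberately loose; in operator code that holds the actual `error` value, typed helpers from the Kubernetes client libraries are a better fit than string matching.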

@vsoch
Contributor Author

vsoch commented May 9, 2023

okay getting closer!

2023-05-09T01:25:46Z    ERROR   minicluster-reconciler  Failed to create new MiniCluster Batch Job      {"Namespace:": "flux-operator", "Name:": "flux-sample", "error": "admission webhook \"vjobset.kb.io\" denied the request: EnableDNSHostnames requires job to be in indexed completion mode"}

This is just a broker pod, so there is only one, but I'll try setting it to be in completion mode. It really just needs to be part of a network, and hopefully the two jobs in the JobSet can see one another ... will find out soon!
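To make the webhook message concrete, here is a minimal sketch of the rule it appears to enforce, written against stand-in types rather than the real JobSet or `batch/v1` structs. The `replicatedJob` and `jobSpec` shapes and the `validate` function are assumptions for illustration, not JobSet's actual code; the `"Indexed"` and `"NonIndexed"` strings are the values the Kubernetes Job API uses for `spec.completionMode`, where an unset mode defaults to `NonIndexed`:

```go
package main

import (
	"errors"
	"fmt"
)

// Values accepted by the Kubernetes batch/v1 Job API for spec.completionMode.
const (
	NonIndexed = "NonIndexed"
	Indexed    = "Indexed"
)

// jobSpec is a stand-in carrying only the fields this check cares about.
type jobSpec struct {
	CompletionMode *string // nil is treated as NonIndexed, matching the API default
	Completions    *int32  // the Job API requires this to be set in Indexed mode
	Parallelism    *int32
}

// replicatedJob is a hypothetical, flattened view of one job in a JobSet.
type replicatedJob struct {
	Name               string
	EnableDNSHostnames bool
	Job                jobSpec
}

// validate mirrors the rule in the webhook message quoted above:
// DNS hostnames are only allowed when the job uses Indexed completion mode.
func validate(rj replicatedJob) error {
	if !rj.EnableDNSHostnames {
		return nil
	}
	if rj.Job.CompletionMode == nil || *rj.Job.CompletionMode != Indexed {
		return errors.New("EnableDNSHostnames requires job to be in indexed completion mode")
	}
	return nil
}

// ptr returns a pointer to v, handy for optional API-style fields.
func ptr[T any](v T) *T { return &v }

func main() {
	// A single broker pod with no completion mode set is rejected.
	broker := replicatedJob{Name: "broker", EnableDNSHostnames: true}
	fmt.Println("default mode:", validate(broker))

	// Switching to Indexed mode, even with one completion, passes the check.
	broker.Job.CompletionMode = ptr(Indexed)
	broker.Job.Completions = ptr(int32(1))
	broker.Job.Parallelism = ptr(int32(1))
	fmt.Println("indexed mode:", validate(broker))
}
```

So a one-pod broker job can still satisfy the check: it needs Indexed completion mode with completions set to 1, not more pods.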

@vsoch
Contributor Author

vsoch commented May 9, 2023

Still getting closer (forgot to set the owner reference)

[image not preserved]

@vsoch
Contributor Author

vsoch commented May 9, 2023

okay this all seems to be working - and I'm at the point now of needing to understand how JobSet actually works (it's suspending my jobs and not running any pods, lol). Sorry for the noise! I'll likely engage with folks in another issue / slack tomorrow with my many dumb questions.

@vsoch vsoch closed this as completed May 9, 2023

@ahg-g
Contributor

ahg-g commented May 9, 2023

This is great, thanks @vsoch! I will respond to your questions on the other issue.
