VK node is not working in-edge deployment type #357

Open
hakanaltindag opened this issue Jan 17, 2025 · 10 comments

@hakanaltindag

OS: Rocky 8
Kubernetes Version: 1.24.17

I have created a Kubernetes cluster from the ground up. I have an edge node that can communicate with the cluster, and I followed the installation steps. I can see the vk-node in the cluster, but it is in NotReady status. When I inspect the node, I only see NetworkUnavailable set to true, with the message "RouteController created a route".

The pod logs are below:

time="2025-01-17T11:55:43Z" level=error msg="server error: 502"
time="2025-01-17T11:55:43Z" level=error msg="Ping Failed with exit code: -1"
time="2025-01-17T11:55:43Z" level=error msg="Error: <nil>"
time="2025-01-17T11:55:43Z" level=info msg=endNodeLoop
time="2025-01-17T11:55:43Z" level=debug msg="Received node status update"
time="2025-01-17T11:55:43Z" level=debug msg="got node from api server"
time="2025-01-17T11:55:43Z" level=debug msg="Generated three way patch" error="<nil>" patch="{\"metadata\":{\"annotations\":{\"virtual-kubelet.io/last-applied-node-status\":\"{\\\"capacity\\\":{\\\"cpu\\\":\\\"10\\\",\\\"memory\\\":\\\"64Gi\\\",\\\"nvidia.com/gpu\\\":\\\"0\\\",\\\"pods\\\":\\\"10\\\"},\\\"allocatable\\\":{\\\"cpu\\\":\\\"10\\\",\\\"memory\\\":\\\"64Gi\\\",\\\"nvidia.com/gpu\\\":\\\"0\\\",\\\"pods\\\":\\\"10\\\"},\\\"conditions\\\":[{\\\"type\\\":\\\"Ready\\\",\\\"status\\\":\\\"False\\\",\\\"lastHeartbeatTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"lastTransitionTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"reason\\\":\\\"KubeletPending\\\",\\\"message\\\":\\\"kubelet is pending.\\\"},{\\\"type\\\":\\\"OutOfDisk\\\",\\\"status\\\":\\\"False\\\",\\\"lastHeartbeatTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"lastTransitionTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"reason\\\":\\\"KubeletHasSufficientDisk\\\",\\\"message\\\":\\\"kubelet has sufficient disk space available\\\"},{\\\"type\\\":\\\"MemoryPressure\\\",\\\"status\\\":\\\"False\\\",\\\"lastHeartbeatTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"lastTransitionTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"reason\\\":\\\"KubeletHasSufficientMemory\\\",\\\"message\\\":\\\"kubelet has sufficient memory available\\\"},{\\\"type\\\":\\\"DiskPressure\\\",\\\"status\\\":\\\"False\\\",\\\"lastHeartbeatTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"lastTransitionTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"reason\\\":\\\"KubeletHasNoDiskPressure\\\",\\\"message\\\":\\\"kubelet has no disk pressure\\\"},{\\\"type\\\":\\\"NetworkUnavailable\\\",\\\"status\\\":\\\"True\\\",\\\"lastHeartbeatTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"lastTransitionTime\\\":\\\"2025-01-17T11:55:43Z\\\",\\\"reason\\\":\\\"RouteCreated\\\",\\\"message\\\":\\\"RouteController created a route\\\"}],\\\"addresses\\\":[{\\\"type\\\":\\\"InternalIP\\\",\\\"address\\\":\\\"10.233.64.13\\\"}],\\\"daemonEndpoints\\\":{\\\"kubeletEndpoint\\\":{\\\"Port\\\":10250}},\\\"nodeInfo\\\":{\\\"machineID\\\":\\\"\\\",\\\"systemUUID\\\":\\\"\\\",\\\"bootID\\\":\\\"\\\",\\\"kernelVersion\\\":\\\"\\\",\\\"osImage\\\":\\\"\\\",\\\"containerRuntimeVersion\\\":\\\"\\\",\\\"kubeletVersion\\\":\\\"0.3.6\\\",\\\"kubeProxyVersion\\\":\\\"\\\",\\\"operatingSystem\\\":\\\"linux\\\",\\\"architecture\\\":\\\"virtual-kubelet\\\"}}\"},\"creationTimestamp\":null},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Ready\"},{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"NetworkUnavailable\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2025-01-17T11:55:43Z\",\"lastTransitionTime\":\"2025-01-17T11:55:43Z\",\"type\":\"Ready\"},{\"lastHeartbeatTime\":\"2025-01-17T11:55:43Z\",\"lastTransitionTime\":\"2025-01-17T11:55:43Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2025-01-17T11:55:43Z\",\"lastTransitionTime\":\"2025-01-17T11:55:43Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2025-01-17T11:55:43Z\",\"lastTransitionTime\":\"2025-01-17T11:55:43Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2025-01-17T11:55:43Z\",\"lastTransitionTime\":\"2025-01-17T11:55:43Z\",\"type\":\"NetworkUnavailable\"}]}}"
time="2025-01-17T11:55:43Z" level=debug msg="updated node status in api server" node.Status.Conditions="[{Ready False 2025-01-17 11:55:43 +0000 UTC 2025-01-17 11:55:43 +0000 UTC KubeletPending kubelet is pending.} {OutOfDisk False 2025-01-17 11:55:43 +0000 UTC 2025-01-17 11:55:43 +0000 UTC KubeletHasSufficientDisk kubelet has sufficient disk space available} {MemoryPressure False 2025-01-17 11:55:43 +0000 UTC 2025-01-17 11:55:43 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2025-01-17 11:55:43 +0000 UTC 2025-01-17 11:55:43 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {NetworkUnavailable True 2025-01-17 11:55:43 +0000 UTC 2025-01-17 11:55:43 +0000 UTC RouteCreated RouteController created a route}]" node.resourceVersion=27304
time="2025-01-17T11:55:45Z" level=info msg="No pods to monitor, waiting for the next loop to start"
time="2025-01-17T11:55:45Z" level=info msg="statusLoop=end"
time="2025-01-17T11:55:45Z" level=info msg=statusLoop
time="2025-01-17T11:55:45Z" level=debug msg="404 request not found" uri=/metrics/cadvisor vars="map[]"
W0117 11:55:47.697675 1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Secret: field label not supported: spec.nodeName
E0117 11:55:47.697710 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Secret: failed to list *v1.Secret: field label not supported: spec.nodeName
time="2025-01-17T11:55:50Z" level=info msg="No pods to monitor, waiting for the next loop to start"
time="2025-01-17T11:55:50Z" level=info msg="statusLoop=end"
time="2025-01-17T11:55:50Z" level=info msg=statusLoop
time="2025-01-17T11:55:50Z" level=debug msg="404 request not found" uri=/metrics vars="map[]"
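
Most of the output above is debug-level noise; the entries that matter are the three logged at level=error. A minimal sketch for pulling those out of a logrus-style log, assuming the key=value layout shown above. The function name errors_only is hypothetical and is not part of interLink or Virtual Kubelet.

```shell
# Sketch: print only the entries logged at error level.
# Reads log text on stdin. Works even when several entries share one line,
# because grep -o prints each match on its own line.
# Limitation: a msg value containing escaped quotes (\") is cut at the first one.
errors_only() {
    grep -o 'time="[^"]*" level=error msg="[^"]*"'
}
```

Piping the pod log into `errors_only` would leave the `server error: 502` and `Ping Failed` entries, which point at the interLink endpoint rather than at the node conditions.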

I also got an error in oauth.log:

192.168.23.11:2908 - b8a4a1f7-5596-484f-ab29-741ebfdedc63 - hakanaltindag [2025/01/17 11:57:17] 192.168.23.14:30443 POST /home/msevinc/.interlink.sock "/pinglink" HTTP/1.1 "Go-http-client/1.1" 502 2257 2.295

ConfigMap

Name: cn04-vk-virtual-kubelet-config
Namespace: interlink
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: cn04-vk
meta.helm.sh/release-namespace: interlink

Data

InterLinkConfig.yaml:

InterlinkURL: "https://192.168.23.14"
InterlinkPort: "30443"
ExportPodData: false
VerboseLogging: true
ErrorsOnlyLogging: false
ServiceAccount: "cn04-vk"
Namespace: "interlink"
VKTokenFile: /opt/interlink/token
CPU: "10"
Memory: "64Gi"
Pods: "10"
nvidia.com/gpu: "0"
HTTP:
  Insecure: true
KubeletHTTP:
  Insecure: true

BinaryData

Events:

Do you have any idea about this issue? Or can you provide detailed documentation for installing interLink in the in-edge node scenario?

@dciangot
Collaborator

Hi @hakanaltindag , thanks for reaching out. We'd like to ask for some additional information.

Which plugin are you using? The slurm one? In any case we might need the logs from the interLink api server and the plugin.

In addition, did the manual health check succeed as indicated here? https://intertwin-eu.github.io/interLink/docs/Cookbook#test-interlink-stack-health

Which version of interLink are you running?

@hakanaltindag
Author

I installed slurm as described in the documentation. Here is the interLink log:

time="2025-01-20T06:16:41Z" level=info msg="InterLink: received Ping call"
time="2025-01-20T06:16:41Z" level=info msg="InterLink: forwarding GetStatus call to sidecar"
time="2025-01-20T06:16:41Z" level=error msg="HTTP InterLink session Request-944c0c1f-f097-48f0-a1ae-53e27f8e6267: error doing DoReq() of ReqWithErrorWithSessionNumber error Get \"http://unix/status\": dial unix /home/msevinc/.plugin.sock: connect: no such file or directory"
2025/01/20 06:16:41 http: superfluous response.WriteHeader call from github.com/intertwin-eu/interlink/pkg/interlink/api.(*InterLinkHandler).Ping (ping.go:57)
time="2025-01-20T06:16:57Z" level=info msg="InterLink: received Ping call"
time="2025-01-20T06:16:57Z" level=info msg="InterLink: forwarding GetStatus call to sidecar"
time="2025-01-20T06:16:57Z" level=error msg="HTTP InterLink session Request-207b5822-ad86-4797-b3b1-b49857827ad9: error doing DoReq() of ReqWithErrorWithSessionNumber error Get \"http://unix/status\": dial unix /home/msevinc/.plugin.sock: connect: no such file or directory"
2025/01/20 06:16:57 http: superfluous response.WriteHeader call from github.com/intertwin-eu/interlink/pkg/interlink/api.(*InterLinkHandler).Ping (ping.go:57)
time="2025-01-20T06:17:29Z" level=info msg="InterLink: received Ping call"
time="2025-01-20T06:17:29Z" level=info msg="InterLink: forwarding GetStatus call to sidecar"
time="2025-01-20T06:17:29Z" level=error msg="HTTP InterLink session Request-727b75a6-7e45-40bf-b8b6-d30e02dbdfd5: error doing DoReq() of ReqWithErrorWithSessionNumber error Get \"http://unix/status\": dial unix /home/msevinc/.plugin.sock: connect: no such file or directory"
2025/01/20 06:17:29 http: superfluous response.WriteHeader call from github.com/intertwin-eu/interlink/pkg/interlink/api.(*InterLinkHandler).Ping (ping.go:57)
time="2025-01-20T06:18:02Z" level=info msg="InterLink: received Ping call"
time="2025-01-20T06:18:02Z" level=info msg="InterLink: forwarding GetStatus call to sidecar"
time="2025-01-20T06:18:02Z" level=error msg="HTTP InterLink session Request-db06841e-b3f6-4f58-bcee-82bea2697a84: error doing DoReq() of ReqWithErrorWithSessionNumber error Get \"http://unix/status\": dial unix /home/msevinc/.plugin.sock: connect: no such file or directory"
2025/01/20 06:18:02 http: superfluous response.WriteHeader call from github.com/intertwin-eu/interlink/pkg/interlink/api.(*InterLinkHandler).Ping (ping.go:57)

Connection test

curl -v --unix-socket ${HOME}/.interlink.sock http://unix/pinglink
*   Trying /home/msevinc/.interlink.sock...
* Connected to unix (/home/msevinc/.interlink.sock) port 80 (#0)
> GET /pinglink HTTP/1.1
> Host: unix
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Mon, 20 Jan 2025 06:16:41 GMT
< Content-Length: 3
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host unix left intact
503

Also /logs/plugin.log is empty. Do you have any idea?
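
The interLink log above fails on `dial unix /home/msevinc/.plugin.sock: connect: no such file or directory`, so one quick check is whether anything has created that socket at all. A minimal sketch, where socket_ready is a hypothetical helper name:

```shell
# Sketch: report whether a Unix domain socket exists at the given path.
# [ -S path ] is true only when the path exists and is a socket.
socket_ready() {
    if [ -S "$1" ]; then
        echo "socket present: $1"
    else
        echo "no socket at: $1" >&2
        return 1
    fi
}
```

Calling `socket_ready "$HOME/.plugin.sock"` in this situation would be expected to report a missing socket, which suggests the plugin process is not running rather than a problem in interLink itself.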

dciangot added the question (Further information is requested) and triage labels on Jan 20, 2025
dciangot self-assigned this on Jan 20, 2025
@dciangot
Collaborator

Thanks @hakanaltindag, it's definitely the plugin not starting for some reason; everything else looks like it is working fine.

I suppose you don't see any plugin process running, do you? If so, I think I need your plugin config file and the command you are executing. If you are following the guide, I suppose it is the following, though:

export SLURMCONFIGPATH=$PWD/interlink/manifests/plugin-config.yaml
$HOME/.interlink/bin/plugin &> $HOME/.interlink/logs/plugin.log &
echo $! > $HOME/.interlink/plugin.pid     

Is this right? If so, then I only need the config file that you pass in the env variable above.

@hakanaltindag
Author

Yes that's right.

[msevinc@cn04 ~]$ export SLURMCONFIGPATH=$PWD/interlink/manifests/plugin-config.yaml
[msevinc@cn04 ~]$ $HOME/.interlink/bin/plugin &> $HOME/.interlink/logs/plugin.log &
[1] 3559979
[msevinc@cn04 ~]$ echo $! > $HOME/.interlink/plugin.pid
[1]+  Done(126)               $HOME/.interlink/bin/plugin &> $HOME/.interlink/logs/plugin.log
[msevinc@cn04 ~]$ ps -ef | grep slurm
root        2689       1  0  2024 ?        00:00:13 /usr/sbin/slurmd -D -s

When I try to check the logs:

[msevinc@cn04 ~]$ cat $HOME/.interlink/logs/plugin.log
-sh: /home/msevinc/.interlink/bin/plugin: Permission denied
[msevinc@cn04 ~]$ sudo !!
sudo cat $HOME/.interlink/logs/plugin.log
-sh: /home/msevinc/.interlink/bin/plugin: Permission denied
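
The `Done(126)` in the job status above is the shell's exit status for the plugin command. By POSIX shell convention, 126 means the command was found but could not be executed, and 127 means it was not found at all, which matches the `Permission denied` message captured in the log file. A small sketch that maps a status to that meaning, where explain_exit is a hypothetical helper name:

```shell
# Sketch: translate a shell exit status into a short description.
# Only the statuses reserved by the shell itself are special-cased here.
explain_exit() {
    case "$1" in
        0)   echo "success" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found" ;;
        *)   echo "command-specific exit status $1" ;;
    esac
}
```

For the session above, `explain_exit 126` would print the "found but not executable" case.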

Here is my plugin-config.yaml:

Socket: "unix:///home/msevinc/.plugin.sock"
InterlinkPort: "0"
SidecarPort: "4001"

CommandPrefix: ""
DataRootFolder: "/home/msevinc/.interlink/jobs/"
BashPath: /bin/bash
VerboseLogging: false
ErrorsOnlyLogging: false
SbatchPath: "/usr/bin/sbatch"
ScancelPath: "/usr/bin/scancel"
SqueuePath: "/usr/bin/squeue"
SingularityPrefix: ""
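
The Socket entry above matches the path that the interLink log reports dialing (`/home/msevinc/.plugin.sock`), so it can help to read it back from the file the plugin is actually given. A rough sketch, assuming the entry is a single top-level line with a double-quoted unix:// URL as shown. config_socket_path is a hypothetical helper name, and a real YAML parser would be more robust.

```shell
# Sketch: print the filesystem path from the Socket: entry of a plugin config.
# Assumes a line of the form:  Socket: "unix:///some/path"
# Prints nothing if no such line is present.
config_socket_path() {
    sed -n 's|^Socket:[[:space:]]*"unix://\(.*\)"[[:space:]]*$|\1|p' "$1"
}
```

With the file above, `config_socket_path "$SLURMCONFIGPATH"` should print the socket path, which can then be compared with the path in the interLink error message.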

The health check seems to connect to interlink.sock:

[msevinc@cn04 ~]$ curl -v --unix-socket ${HOME}/.interlink.sock  http://unix/pinglink
*   Trying /home/msevinc/.interlink.sock...
* Connected to unix (/home/msevinc/.interlink.sock) port 80 (#0)
> GET /pinglink HTTP/1.1
> Host: unix
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Mon, 20 Jan 2025 08:14:48 GMT
< Content-Length: 3
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host unix left intact

The only question on my mind is what exactly the interLink port and the sidecar port should be. InterLink port 30443? Sidecar port 4001?

@dciangot
Collaborator

Can you try chmod +x $HOME/.interlink/bin/plugin? It should fix it. If you confirm, I'll update the doc.

Regarding the ports, what you shared looks good. interLink is listening on a Unix socket, so there is no need to indicate any port to the plugin.
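
The fix above amounts to restoring the execute bit on the plugin binary. A minimal sketch of a pre-flight check that could run before the start command from the guide; ensure_executable is a hypothetical helper name, not something shipped with interLink.

```shell
# Sketch: make sure a binary exists and is executable before launching it.
# Returns non-zero if the file is missing or the mode cannot be changed.
ensure_executable() {
    bin="$1"
    if [ ! -f "$bin" ]; then
        echo "missing: $bin" >&2
        return 1
    fi
    if [ ! -x "$bin" ]; then
        # Same fix as suggested in the thread: add the execute bit.
        chmod +x "$bin" || return 1
    fi
    return 0
}
```

A possible use is `ensure_executable "$HOME/.interlink/bin/plugin"` immediately before the start command quoted earlier in the thread.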

@hakanaltindag
Author

Yes! It worked. The node was added and is now Ready! It would be great if you could update the doc. Thanks for your support.

@dciangot
Collaborator

@Bianco95 can you take care of this?

@dciangot
Collaborator

@hakanaltindag thanks for your patience!

If you want to reach out to share your case study, we are all eager to hear it! So, feel free to jump into the Slack channel (the link is on the home page of interlink)!

@dciangot
Collaborator

dciangot commented Feb 8, 2025

@Bianco95 any news on this? Please let me know if you can fix it! Also @chbrandt, I think this needs to be included in your PR.

@Bianco95
Collaborator

Bianco95 commented Feb 9, 2025

Ok I am going to update the doc
