Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to move stopped VM instance between cluster nodes (ceph storage) #1612

Open
tregubovav-dev opened this issue Jan 27, 2025 · 0 comments
Assignees
Milestone

Comments

@tregubovav-dev
Copy link

tregubovav-dev commented Jan 27, 2025

My incus cluster with ceph and cephfs storages unable to move stopped VM between cluster nodes. Cluster uses ceph and cephfs storages.
The command incus move <instance_name> <instance_name> --project <project_name> --target <target_node> always fail with message: Error: Migration operation failure: Instance move to destination failed on source: Failed migration on source: Error reading migration control target: websocket: close 1006 (abnormal closure): unexpected EOF

Pre-history:
I needed evacuate the node for maintenance and rebuilding host OS. Evacuation operation failed with the same error. Containers were moved to other nodes, except the VM. I tried to move it using incus move command with the same result. Finally I removed the node from cluster and then I was able to move the VM instance to other node.
Now I'm trying to return this VM to the rebuilt cluster node without success.

Cluster information:

  • 7x RPI4B nodes
  • incus 6.9
  • microceph cluster used as a storage backend for incus cluster.

Here is incus monitor --pretty command output:

DEBUG  [2025-01-27T13:52:47-08:00] Handling API request                          ip="<client>:54289" method=GET protocol=tls url=/1.0 username=<username>
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
INFO   [2025-01-27T13:52:47-08:00] http: TLS handshake error from <client>:65282: EOF
DEBUG  [2025-01-27T13:52:47-08:00] Handling API request                          ip="<client>:65286" method=GET protocol=tls url=/1.0 username=<username>
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Handling API request                          ip="<client>:65292" method=GET protocol=tls url="/1.0/events?project=<project>&target=node-02" username=<username>
DEBUG  [2025-01-27T13:52:47-08:00] Event listener server handler started         id=51297469-cade-48df-ae83-8c305978afdf local="node-01:8443" remote="<client>:65292"
INFO   [2025-01-27T13:52:47-08:00] http: TLS handshake error from <client>:65296: EOF
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Handling API request                          ip="<client>:65300" method=POST protocol=tls url="/1.0/instances/<instance>?project=<project>&target=node-02" username=<username>
DEBUG  [2025-01-27T13:52:47-08:00] New operation                                 class=task description="Migrating instance" operation=ff7b6168-bcf6-4467-ba31-3834a5dc8a6c project=<project>
DEBUG  [2025-01-27T13:52:47-08:00] Started operation                             class=task description="Migrating instance" operation=ff7b6168-bcf6-4467-ba31-3834a5dc8a6c project=<project>
INFO   [2025-01-27T13:52:47-08:00] http: TLS handshake error from <client>:65304: EOF
DEBUG  [2025-01-27T13:52:47-08:00] Connecting to a remote Incus over HTTPS       url="https://node-02:8443"
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] Handling API request                          ip="<client>:65265" method=GET protocol=tls url="/1.0/operations/ff7b6168-bcf6-4467-ba31-3834a5dc8a6c?project=<project>&target=node-02" username=<username>
DEBUG  [2025-01-27T13:52:47-08:00] Matched trusted cert                          fingerprint=<fingerprint> subject="<subject>"
DEBUG  [2025-01-27T13:52:47-08:00] New operation                                 class=websocket description="Migrating instance" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:47-08:00] Started operation                             class=websocket description="Migrating instance" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:47-08:00] Waiting for migration control connection on source  clusterMoveSourceName=<instance> instance=<instance> live=false project=<project> push=false
DEBUG  [2025-01-27T13:52:48-08:00] Connected to the websocket: wss://node-02:8443/1.0/events?project=<project>&target=node-02
DEBUG  [2025-01-27T13:52:48-08:00] Sending request to Incus                      etag= method=POST url="https://node-02:8443/1.0/instances?project=<project>&target=node-02"
DEBUG  [2025-01-27T13:52:48-08:00]
        {
                "architecture": "aarch64",
                "config": {
                        "image.architecture": "arm64",
                        "image.description": "Alpine edge arm64 (20240722_13:00)",
                        "image.os": "Alpine",
                        "image.release": "edge",
                        "image.requirements.secureboot": "false",
                        "image.serial": "20240722_13:00",
                        "image.type": "disk-kvm.img",
                        "image.variant": "default",
                        "volatile.base_image": "ac9da54e09b367f2f10c86110049a072ce91a18f10ebc3b829e99fec40417224",
                        "volatile.cloud-init.instance-id": "52a326f4-0d58-47cd-83be-5130f6d9c797",
                        "volatile.evacuate.origin": "node-05",
                        "volatile.last_state.power": "STOPPED",
                        "volatile.uuid": "4b53593d-e2f3-49a2-97a0-0e1f583b9fad",
                        "volatile.uuid.generation": "4b53593d-e2f3-49a2-97a0-0e1f583b9fad",
                        "volatile.vsock_id": "2532049073"
                },
                "devices": {},
                "ephemeral": false,
                "profiles": [
                        "<profile>"
                ],
                "stateful": false,
                "description": "",
                "name": "<instance>",
                "source": {
                        "type": "migration",
                        "certificate": "-----BEGIN CERTIFICATE-----\n<CERTIFICATE CONTENT\n-----END CERTIFICATE-----\n",
                        "mode": "pull",
                        "operation": "https://node-01:8443/1.0/operations/e6c7b100-ed5c-4914-ab41-cd705c26e9f4",
                        "secrets": {
                                "control": "ee64a8b54b4039d60b7191dccd7d1fa46c8a7a85710b758a6206bb64f59a9028",
                                "fs": "7920355491130277c06810da496cc7289caf95ea8508233f1bfcc0b4b63c6daf"
                        },
                        "source": "<instance>",
                        "allow_inconsistent": false
                },
                "instance_type": "",
                "type": "virtual-machine",
                "start": false
        }
DEBUG  [2025-01-27T13:52:48-08:00] Got operation from Incus
DEBUG  [2025-01-27T13:52:48-08:00]
        {
                "id": "6a322e9b-7a73-47c1-9865-28698610167f",
                "class": "task",
                "description": "Creating instance",
                "created_at": "2025-01-27T13:52:48.060995142-08:00",
                "updated_at": "2025-01-27T13:52:48.060995142-08:00",
                "status": "Running",
                "status_code": 103,
                "resources": {
                        "instances": [
                                "/1.0/instances/<instance>?project=<project>"
                        ]
                },
                "metadata": null,
                "may_cancel": false,
                "err": "",
                "location": "node-02"
        }
DEBUG  [2025-01-27T13:52:48-08:00] Sending request to Incus                      etag= method=GET url="https://node-02:8443/1.0/operations/6a322e9b-7a73-47c1-9865-28698610167f?project=<project>&target=node-02"
DEBUG  [2025-01-27T13:52:48-08:00] Allowing untrusted GET                        ip="node-02:49682" url="/1.0/operations/e6c7b100-ed5c-4914-ab41-cd705c26e9f4/websocket?secret=ee64a8b54b4039d60b7191dccd7d1fa46c8a7a85710b758a6206bb64f59a9028"
DEBUG  [2025-01-27T13:52:48-08:00] Connecting to operation                       class=websocket description="Migrating instance" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:48-08:00] Connected to operation                        class=websocket description="Migrating instance" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:48-08:00] Migration send starting                       instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:48-08:00] Instance operation lock created               action=migrate instance=<instance> project=<project> reusable=false
DEBUG  [2025-01-27T13:52:48-08:00] Migration control connection established on source  clusterMoveSourceName=<instance> instance=<instance> live=false project=<project> push=false
DEBUG  [2025-01-27T13:52:48-08:00] Got response struct from Incus
DEBUG  [2025-01-27T13:52:48-08:00]
        {
                "id": "6a322e9b-7a73-47c1-9865-28698610167f",
                "class": "task",
                "description": "Creating instance",
                "created_at": "2025-01-27T13:52:48.060995142-08:00",
                "updated_at": "2025-01-27T13:52:48.060995142-08:00",
                "status": "Running",
                "status_code": 103,
                "resources": {
                        "instances": [
                                "/1.0/instances/<instance>?project=<project>"
                        ]
                },
                "metadata": null,
                "may_cancel": false,
                "err": "",
                "location": "node-02"
        }
DEBUG  [2025-01-27T13:52:48-08:00] Allowing untrusted GET                        ip="node-02:49692" url="/1.0/operations/e6c7b100-ed5c-4914-ab41-cd705c26e9f4/websocket?secret=7920355491130277c06810da496cc7289caf95ea8508233f1bfcc0b4b63c6daf"
DEBUG  [2025-01-27T13:52:48-08:00] Connecting to operation                       class=websocket description="Migrating instance" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:48-08:00] Connected to operation                        class=websocket description="Migrating instance" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:48-08:00] MountInstance started                         driver=ceph instance=<instance> pool=<storage_pool> project=<project>
DEBUG  [2025-01-27T13:52:48-08:00] Activated RBD volume                          dev=/dev/rbd3 driver=ceph pool=<storage_pool> volName=virtual-machine_<project>_<instance>.block
DEBUG  [2025-01-27T13:52:49-08:00] Activated RBD volume                          dev=/dev/rbd7 driver=ceph pool=<storage_pool> volName=virtual-machine_<project>_<instance>
DEBUG  [2025-01-27T13:52:49-08:00] Mounted RBD volume                            dev=/dev/rbd7 driver=ceph options=discard path=/var/lib/incus/storage-pools/<storage_pool>/virtual-machines/<project>_<instance> pool=<storage_pool> volName=<project>_<instance>
DEBUG  [2025-01-27T13:52:49-08:00] UnmountInstance started                       driver=ceph instance=<instance> pool=<storage_pool> project=<project>
DEBUG  [2025-01-27T13:52:49-08:00] MountInstance finished                        driver=ceph instance=<instance> pool=<storage_pool> project=<project>
DEBUG  [2025-01-27T13:52:49-08:00] Unmounted RBD volume                          driver=ceph keepBlockDev=false path=/var/lib/incus/storage-pools/<storage_pool>/virtual-machines/<project>_<instance> pool=<storage_pool> volName=<project>_<instance>
DEBUG  [2025-01-27T13:52:50-08:00] Deactivated RBD volume                        driver=ceph pool=<storage_pool> volName=virtual-machine_<project>_<instance>
DEBUG  [2025-01-27T13:52:50-08:00] Set migration offer volume size               blockSize=10737418240 instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] UnmountInstance finished                      driver=ceph instance=<instance> pool=<storage_pool> project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Deactivated RBD volume                        driver=ceph pool=<storage_pool> volName=virtual-machine_<project>_<instance>.block
DEBUG  [2025-01-27T13:52:50-08:00] Waiting for migration offer response from target  instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Sending migration offer to target             instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Got migration offer response from target      instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Migrate send transfer started                 instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] MigrateInstance started                       args="&{IndexHeaderVersion:1 Name:<instance> Snapshots:[] MigrationType:{FSType:RBD Features:[]} TrackProgress:true MultiSync:false FinalSync:false Data:<nil> ContentType: AllowInconsistent:false Refresh:false Info:0x4001c203f0 VolumeOnly:false ClusterMove:true StorageMove:false}" driver=ceph instance=<instance> pool=<storage_pool> project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Migrate send control monitor started          instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Websocket: Sending barrier message            address="node-02:49692"
DEBUG  [2025-01-27T13:52:50-08:00] Sent migration index header, waiting for response  args="&{IndexHeaderVersion:1 Name:<instance> Snapshots:[] MigrationType:{FSType:RBD Features:[]} TrackProgress:true MultiSync:false FinalSync:false Data:<nil> ContentType: AllowInconsistent:false Refresh:false Info:0x4001c203f0 VolumeOnly:false ClusterMove:true StorageMove:false}" driver=ceph
 instance=<instance> pool=<storage_pool> project=<project> version=1
DEBUG  [2025-01-27T13:52:50-08:00] Websocket: Got barrier message                address="node-02:49692"
DEBUG  [2025-01-27T13:52:50-08:00] Received migration index header response      args="&{IndexHeaderVersion:1 Name:<instance> Snapshots:[] MigrationType:{FSType:RBD Features:[]} TrackProgress:true MultiSync:false FinalSync:false Data:<nil> ContentType: AllowInconsistent:false Refresh:false Info:0x4001c203f0 VolumeOnly:false ClusterMove:true StorageMove:false}" driver=ceph instance=<instance> pool=<storage_pool> project=<project> response="{StatusCode:200 Error: Refresh:0x4001627e78}" version=1
DEBUG  [2025-01-27T13:52:50-08:00] MigrateInstance finished                      args="&{IndexHeaderVersion:1 Name:<instance> Snapshots:[] MigrationType:{FSType:RBD Features:[]} TrackProgress:true MultiSync:false FinalSync:false Data:<nil> ContentType: AllowInconsistent:false Refresh:false Info:0x4001c203f0 VolumeOnly:false ClusterMove:true StorageMove:false}" driver=ceph instance=<instance> pool=<storage_pool> project=<project>
DEBUG  [2025-01-27T13:52:50-08:00] Migrate send transfer finished                instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:51-08:00] Migrate send control monitor finished         instance=<instance> instanceType=virtual-machine project=<project>
DEBUG  [2025-01-27T13:52:51-08:00] Instance operation lock finished              action=migrate err="Error reading migration control target: websocket: close 1006 (abnormal closure): unexpected EOF" instance=<instance> project=<project> reusable=false
DEBUG  [2025-01-27T13:52:51-08:00] Event listener server handler stopped         listener=68db05fb-c925-40a3-ad9f-f1b6616a99fe local="node-01:8443" remote="node-02:59930"
DEBUG  [2025-01-27T13:52:51-08:00] Migration send stopped                        instance=<instance> instanceType=virtual-machine project=<project>
ERROR  [2025-01-27T13:52:51-08:00] Failed migration on source                    clusterMoveSourceName=<instance> err="Error reading migration control target: websocket: close 1006 (abnormal closure): unexpected EOF" instance=<instance> live=false project=<project> push=false
DEBUG  [2025-01-27T13:52:51-08:00] Migration channels disconnected on source     clusterMoveSourceName=<instance> instance=<instance> live=false project=<project> push=false
DEBUG  [2025-01-27T13:52:51-08:00] Failure for operation                         class=task description="Migrating instance" err="Instance move to destination failed on source: Failed migration on source: Error reading migration control target: websocket: close 1006 (abnormal closure): unexpected EOF" operation=ff7b6168-bcf6-4467-ba31-3834a5dc8a6c project=<project>
DEBUG  [2025-01-27T13:52:51-08:00] Failure for operation                         class=websocket description="Migrating instance" err="Failed migration on source: Error reading migration control target: websocket: close 1006 (abnormal closure): unexpected EOF" operation=e6c7b100-ed5c-4914-ab41-cd705c26e9f4 project=<project>
DEBUG  [2025-01-27T13:52:51-08:00] Event listener server handler stopped         listener=51297469-cade-48df-ae83-8c305978afdf local="node-01:8443" remote="<client>:65292"
DEBUG  [2025-01-27T13:52:56-08:00] Matched trusted cert                          fingerprint=ed77d5e3be52b9a8b7db12f200101d2f2d995518fc5f16a57ab6cf64328d31d2 subject="CN=root@node-07,O=LXD"
DEBUG  [2025-01-27T13:52:56-08:00] Replace current raft nodes                    raftMembers="[{{9 node-02:8443 voter} node-02} {{10 node-03:8443 voter} node-03} {{11 node-04:8443 voter} node-04} {{12 node-05:8443 stand-by} node-05} {{6 node-06:8443 stand-by} node-06} {{7 node-07:8443 voter} node-07} {{8 node-01:8443 voter} node-01}]"
DEBUG  [2025-01-27T13:52:56-08:00] Removed inactive member event listener client  address="node-02:8443"

incusd crashed on destination node:

Jan 27 13:52:51 picl-02 incusd[130630]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 27 13:52:51 picl-02 incusd[130630]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4cb964]
Jan 27 13:52:51 picl-02 incusd[130630]: goroutine 460893 [running]:
Jan 27 13:52:51 picl-02 incusd[130630]: io.copyBuffer({0x21206a0, 0x4000c84ad0}, {0x0, 0x0}, {0x0, 0x0, 0x0})
Jan 27 13:52:51 picl-02 incusd[130630]:         /usr/local/go/src/io/io.go:429 +0x174
Jan 27 13:52:51 picl-02 incusd[130630]: io.Copy(...)
Jan 27 13:52:51 picl-02 incusd[130630]:         /usr/local/go/src/io/io.go:388
Jan 27 13:52:51 picl-02 incusd[130630]: os.genericReadFrom(0x400006e500?, {0x0, 0x0})
Jan 27 13:52:51 picl-02 incusd[130630]:         /usr/local/go/src/os/file.go:185 +0x58
Jan 27 13:52:51 picl-02 incusd[130630]: os.(*File).ReadFrom(0x400006e500, {0x0, 0x0})
Jan 27 13:52:51 picl-02 incusd[130630]:         /usr/local/go/src/os/file.go:161 +0xa0
Jan 27 13:52:51 picl-02 incusd[130630]: io.copyBuffer({0x211f5e0, 0x400006e500}, {0x0, 0x0}, {0x0, 0x0, 0x0})
Jan 27 13:52:51 picl-02 incusd[130630]:         /usr/local/go/src/io/io.go:415 +0x14c
Jan 27 13:52:51 picl-02 incusd[130630]: io.Copy(...)
Jan 27 13:52:51 picl-02 incusd[130630]:         /usr/local/go/src/io/io.go:388
Jan 27 13:52:51 picl-02 incusd[130630]: github.com/lxc/incus/v6/internal/rsync.Recv.func2()
Jan 27 13:52:51 picl-02 incusd[130630]:         /build/incus/internal/rsync/rsync.go:375 +0xa0
Jan 27 13:52:51 picl-02 incusd[130630]: created by github.com/lxc/incus/v6/internal/rsync.Recv in goroutine 460817
Jan 27 13:52:51 picl-02 incusd[130630]:         /build/incus/internal/rsync/rsync.go:374 +0x520
Jan 27 13:52:51 picl-02 systemd[1]: incus.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit incus.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 2.
Jan 27 13:52:51 picl-02 systemd[1]: incus.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit incus.service has entered the 'failed' state with result 'exit-code'.
Jan 27 13:52:51 picl-02 systemd[1]: incus.service: Unit process 179838 (rsync) remains running after unit stopped.
@stgraber stgraber self-assigned this Jan 31, 2025
@stgraber stgraber added this to the incus-6.10 milestone Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants