[VPP-593] VPP becomes intermittently unresponsive on nodes hosting VMs #1940
This issue appears to be resolved with commit 7104f93 on the stable/1701 branch ("Python API: Missing locking of results data structure"); the full commit message is reproduced in the Comments section below. Thanks for all the hard work everyone!
Description
Creating a VM on a hypervisor node (MHV) causes VPP to become unresponsive after a period of time. This was tested against 17.01-rc2~3_g2d7e163b15.x86_64 in an OpenStack environment using the Neutron ML2 plugin for VPP (https://github.com/openstack/networking-vpp) and virtio for the NIC.
Reproduction steps:
1. Install VPP from the 1701 stable repo, and also the vpp-agent:
-bash-4.2# rpm -qa|grep vpp
vpp-devel-17.01-rc2~3_g2d7e163b15.x86_64
vpp-17.01-rc2~3_g2d7e163b15.x86_64
networking-vpp-0.0.1.dev121-2.noarch
vpp-plugins-17.01-rc2~3_g2d7e163b15.x86_64
vpp-python-api-17.01-rc2~3_g2d7e163b15.x86_64
vpp-lib-17.01-rc2~3_g2d7e163b15.x86_64
2. Create a network, and also a VM on that network.
The namespace/port created on the MCP:
-bash-4.2# vppctl show bridge-domain 4 detail
ID Index Learning U-Forwrd UU-Flood Flooding ARP-Term BVI-Intf
4 1 on on on on off N/A
Interface Index SHG BVI TxFlood VLAN-Tag-Rewrite
The vhost interface created on the MHV:
-bash-4.2# vppctl show vhost
Virtio vhost-user interfaces
Global:
coalesce frames 32 time 1e-3
Interface: VirtualEthernet0/0/0 (ifindex 4)
virtio_net_hdr_sz 12
features mask (0xffffffffffffffff):
features (0x50008000):
VIRTIO_NET_F_MRG_RXBUF (15)
VIRTIO_F_INDIRECT_DESC (28)
VHOST_USER_F_PROTOCOL_FEATURES (30)
protocol features (0x3)
VHOST_USER_PROTOCOL_F_MQ (0)
VHOST_USER_PROTOCOL_F_LOG_SHMFD (1)
socket filename /tmp/7853dd37-6796-476d-bb8b-ae2267acb59e type client errno "Success"
rx placement:
thread 0 on vring 1
tx placement: lock-free
thread 0 on vring 0
Memory regions (total 2)
region fd guest_phys_addr memory_size userspace_addr mmap_offset mmap_addr
====== ===== ================== ================== ================== ================== ==================
0 21 0x0000000000000000 0x00000000000a0000 0x00002aaaaac00000 0x0000000000000000 0x00002aaaaac00000
1 22 0x00000000000c0000 0x000000001ff40000 0x00002aaaaacc0000 0x00000000000c0000 0x00002aaaaaec0000
Virtqueue 0 (TX)
qsz 256 last_avail_idx 81 last_used_idx 81
avail.flags 0 avail.idx 256 used.flags 1 used.idx 81
kickfd 23 callfd 24 errfd -1
Virtqueue 1 (RX)
qsz 256 last_avail_idx 118 last_used_idx 118
avail.flags 1 avail.idx 118 used.flags 1 used.idx 118
kickfd 19 callfd 25 errfd -1
3. After a while (5-10 minutes), VPP stopped responding on the MHV (all VPP-related commands give no response):
2017-01-10T17:39:10.406206+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
2017-01-10T17:39:10.408508+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuProcessReportLogError:1810 : internal error: qemu unexpectedly closed the monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:10.602566+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25991: error : qemuProcessReportLogError:1810 : internal error: process exited while connecting to monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:49.027951+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: unix_signal_handler:118: received signal SIGCONT, PC 0x7f56e7785590
2017-01-10T17:39:49.027979+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: received SIGTERM, exiting...
2017-01-10T17:39:49.139973+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Probing VFIO support...
2017-01-10T17:40:00.407293+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
2017-01-10T17:40:00.407587+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Initializing pmd_bond for eth_bond0
2017-01-10T17:40:00.407881+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Create bonded device eth_bond0 on port 0 in mode 1 on socket 0.
2017-01-10T17:40:00.408138+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:14.0 on NUMA socket -1
2017-01-10T17:40:00.408147+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
2017-01-10T17:40:00.412447+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:15.0 on NUMA socket -1
2017-01-10T17:40:00.412457+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
Further investigation shows the process is stuck in a FUTEX_WAIT:
-bash-4.2# strace -p 696966
Process 696966 attached
futex(0x305e7f6c, FUTEX_WAIT, 245, NULL
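Since the symptom is that every VPP-related command simply never returns, the hang is easy to detect by putting a deadline on a harmless CLI call. A small watchdog sketch, assuming only that vppctl is on the PATH (the command choice and polling interval are illustrative, not from the original report):

import subprocess
import time

def vpp_responds(timeout_s=5):
    # "vppctl show version" is cheap; if VPP's CLI is wedged the call never
    # returns, so a timeout is treated as "unresponsive".
    try:
        subprocess.run(["vppctl", "show", "version"],
                       check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout_s)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

if __name__ == "__main__":
    while True:
        if not vpp_responds():
            print("vppctl did not answer within the deadline; VPP looks hung")
        time.sleep(30)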
This also happens on nodes running VPP which are not hosting VMs but are tapping into a Linux namespace for DHCP assignment ("MCP nodes"). On those nodes it happens much more intermittently and has been much more difficult to reproduce.
Assignee
Ole Trøan
Reporter
Kevin Bringard
Comments
commit 7104f93
Author: Ole Troan [email protected]
Date: Thu Jan 19 09:44:44 2017 +0100
Python API: Missing locking of results data structure.
The wrong assumption that the GIL combined with CPython's "mostly" [...]
Added suitable locking. Also added an atexit() to attempt a VPP disconnect on shutdown.
Also: lots more comments, docstrings, duplicated code removed.
Change-Id: I0cb7d0026db660ec141425c5ad474f14bacea36e
Thanks for all the hard work everyone!
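The commit message above describes two changes: guarding the shared results structure with explicit locking instead of relying on the GIL, and registering an atexit() handler that disconnects from VPP on interpreter shutdown. A rough sketch of that shape, using only the standard library (illustrative only, not the actual vpp_papi code):

import atexit
import threading

class ResultsStore:
    # Replies arrive on a reader thread and are picked up by the thread that
    # issued the request; a condition variable makes that hand-off explicit
    # instead of assuming the GIL keeps dict operations race-free.
    def __init__(self):
        self._cv = threading.Condition()
        self._results = {}

    def deliver(self, context, reply):
        # Called from the thread draining the shared-memory queue.
        with self._cv:
            self._results[context] = reply
            self._cv.notify_all()

    def wait_for(self, context, timeout=5.0):
        # Called from the thread that issued the request.
        with self._cv:
            self._cv.wait_for(lambda: context in self._results, timeout)
            return self._results.pop(context, None)

def _disconnect_on_exit(connection):
    # "connection" is a hypothetical stand-in for the API client object; the
    # point is only that a best-effort disconnect runs at interpreter shutdown.
    try:
        connection.disconnect()
    except Exception:
        pass

# atexit.register(_disconnect_on_exit, connection)  # registered once a connection exists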
If it's useful at all, this looks to be the point at which it hangs:
[] futex_wait_queue_me+0xc4/0x120
[] futex_wait+0x179/0x280
[] do_futex+0xfe/0x5b0
[] SyS_futex+0x80/0x180
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
It'd be very interesting to see what the API client is doing. Please attach a post-mortem API trace if you can find one. I have a theory what's going on. How hard would it be to try a private image?
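For context, the API client in question is the networking-vpp agent, which drives VPP through the Python bindings shipped in the vpp-python-api package listed above. A minimal sketch of such a client, assuming vpp_papi is installed with its generated API JSON definitions under /usr/share/vpp/api/ (the client name and message choice are illustrative):

import fnmatch
import os
from vpp_papi import VPP

# Collect the generated *.api.json definitions that vpp_papi needs.
jsonfiles = []
for root, _, filenames in os.walk('/usr/share/vpp/api/'):
    for name in fnmatch.filter(filenames, '*.api.json'):
        jsonfiles.append(os.path.join(root, name))

vpp = VPP(jsonfiles)
vpp.connect('demo-client')        # attaches to VPP's shared-memory API segment
print(vpp.api.show_version())     # a simple synchronous request/reply
vpp.disconnect()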
kevinbringard (Thu, 12 Jan 2017 17:29:57 +0000):
(gdb) info threads
Id Target Id Frame
3 Thread 0x7fe9614fc700 (LWP 24843) "eal-intr-thread" 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
2 Thread 0x7fe960cfb700 (LWP 24844) "vpp_stats" 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fe9e6fd2900 (LWP 24738))]
#0 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fe9e69ac0a2 in unix_shared_memory_queue_add () from /lib64/libvlibmemory.so.0
#2 0x00000000005bca26 in vl_api_control_ping_t_handler ()
#3 0x00007fe9e6bb9b73 in vl_msg_api_handler_with_vm_node () from /lib64/libvlibapi.so.0
#4 0x00007fe9e69a5763 in memclnt_process () from /lib64/libvlibmemory.so.0
#5 0x00007fe9e6536e26 in vlib_process_bootstrap () from /lib64/libvlib.so.0
#6 0x00007fe9e54abd20 in clib_calljmp () from /lib64/libvppinfra.so.0
#7 0x00007fe9a45a0e30 in ?? ()
#8 0x00007fe9e6537ec1 in dispatch_process.constprop.14 () from /lib64/libvlib.so.0
#9 0x0000001700000010 in ?? ()
#10 0x736c706d0000003b in ?? ()
#11 0x636e692d6a64612d in ?? ()
#12 0x206574656c706d6f in ?? ()
#13 0x4552472d534c504d in ?? ()
#14 0x70207475706e6920 in ?? ()
#15 0x64207374656b6361 in ?? ()
#16 0x616c757370616365 in ?? ()
#17 0x006425203a646574 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe960cfb700 (LWP 24844))]
#0 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
#1 0x000000000059e026 in ip46_fib_stats_delay.isra.5 ()
#2 0x00000000005a4d5a in stats_thread_fn ()
#3 0x00007fe9e54abd20 in clib_calljmp () from /lib64/libvppinfra.so.0
#4 0x00007fe960cfad60 in ?? ()
#5 0x00007fe9e4f69dc5 in start_thread () from /lib64/libpthread.so.0
#6 0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe9614fc700 (LWP 24843))]
#0 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
#1 0x00000000004f9094 in eal_intr_thread_main ()
#2 0x00007fe9e4f69dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fe9e46ac1cd in clone () from /lib64/libc.so.6
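Thread 1's backtrace matches the strace and kernel stack above: the main thread is handling a control_ping and is blocked in pthread_cond_wait() inside unix_shared_memory_queue_add(), i.e. it is waiting for room on the API client's shared-memory reply queue, which only frees up if the client keeps draining replies. A stdlib-only model of that failure mode, assuming nothing about VPP internals (the queue size and message counts are made up for illustration):

import queue
import threading
import time

reply_queue = queue.Queue(maxsize=4)   # stands in for the client's bounded shared-memory queue

def vpp_main_thread():
    # Keeps posting replies (think control_ping replies). Once the queue is
    # full and nobody drains it, put() blocks on a condition variable -- the
    # analogue of the pthread_cond_wait seen in thread 1 above.
    for seq in range(10):
        print("vpp: queueing reply", seq)
        reply_queue.put(seq)
    print("vpp: finished")             # never reached once the client stops reading

def api_client():
    # Drains a few replies, then stops -- roughly what a client whose results
    # bookkeeping has gone wrong ends up doing.
    for _ in range(3):
        print("client: got reply", reply_queue.get())
    print("client: stopped draining the queue")

threading.Thread(target=vpp_main_thread, daemon=True).start()
api_client()
time.sleep(1)                          # give the producer time to fill the queue and block
print("main: the 'vpp' thread is now stuck in put(), like the FUTEX_WAIT above")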
Original issue: https://jira.fd.io/browse/VPP-593