[VPP-593] VPP becomes intermittently unresponsive on nodes hosting VMs #1940
This issue appears to be resolved with commit 7104f93 on the stable/1701 branch ("Python API: Missing locking of results data structure"); the full commit message is reproduced in the Comments section below. Thanks for all the hard work everyone!
Description
Creating a VM on a hypervisor node (MHV) causes VPP to become unresponsive after a period of time. This was tested against 17.01-rc2~3_g2d7e163b15.x86_64 in an OpenStack environment using the Neutron ML2 plugin for VPP (https://github.com/openstack/networking-vpp) and virtio for the NIC.
Reproduction steps:
1. Install VPP from the 1701 stable repo, and also the vpp-agent:
-bash-4.2# rpm -qa|grep vpp
vpp-devel-17.01-rc2~3_g2d7e163b15.x86_64
vpp-17.01-rc2~3_g2d7e163b15.x86_64
networking-vpp-0.0.1.dev121-2.noarch
vpp-plugins-17.01-rc2~3_g2d7e163b15.x86_64
vpp-python-api-17.01-rc2~3_g2d7e163b15.x86_64
vpp-lib-17.01-rc2~3_g2d7e163b15.x86_64
2. Create a network, and also a VM on that network.
The namespace/port created on the MCP:
-bash-4.2# vppctl show bridge-domain 4 detail
ID Index Learning U-Forwrd UU-Flood Flooding ARP-Term BVI-Intf
4 1 on on on on off N/A
Interface Index SHG BVI TxFlood VLAN-Tag-Rewrite
The vhost interface created on the MHV:
-bash-4.2# vppctl show vhost
Virtio vhost-user interfaces
Global:
coalesce frames 32 time 1e-3
Interface: VirtualEthernet0/0/0 (ifindex 4)
virtio_net_hdr_sz 12
features mask (0xffffffffffffffff):
features (0x50008000):
VIRTIO_NET_F_MRG_RXBUF (15)
VIRTIO_F_INDIRECT_DESC (28)
VHOST_USER_F_PROTOCOL_FEATURES (30)
protocol features (0x3)
VHOST_USER_PROTOCOL_F_MQ (0)
VHOST_USER_PROTOCOL_F_LOG_SHMFD (1)
socket filename /tmp/7853dd37-6796-476d-bb8b-ae2267acb59e type client errno "Success"
rx placement:
thread 0 on vring 1
tx placement: lock-free
thread 0 on vring 0
Memory regions (total 2)
region fd guest_phys_addr memory_size userspace_addr mmap_offset mmap_addr
====== ===== ================== ================== ================== ================== ==================
0 21 0x0000000000000000 0x00000000000a0000 0x00002aaaaac00000 0x0000000000000000 0x00002aaaaac00000
1 22 0x00000000000c0000 0x000000001ff40000 0x00002aaaaacc0000 0x00000000000c0000 0x00002aaaaaec0000
Virtqueue 0 (TX)
qsz 256 last_avail_idx 81 last_used_idx 81
avail.flags 0 avail.idx 256 used.flags 1 used.idx 81
kickfd 23 callfd 24 errfd -1
Virtqueue 1 (RX)
qsz 256 last_avail_idx 118 last_used_idx 118
avail.flags 1 avail.idx 118 used.flags 1 used.idx 118
kickfd 19 callfd 25 errfd -1
3. After a while (5-10 minutes), VPP stopped responding on the MHV (all VPP-related commands give no response):
2017-01-10T17:39:10.406206+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
2017-01-10T17:39:10.408508+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuProcessReportLogError:1810 : internal error: qemu unexpectedly closed the monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:10.602566+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25991: error : qemuProcessReportLogError:1810 : internal error: process exited while connecting to monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:49.027951+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: unix_signal_handler:118: received signal SIGCONT, PC 0x7f56e7785590
2017-01-10T17:39:49.027979+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: received SIGTERM, exiting...
2017-01-10T17:39:49.139973+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Probing VFIO support...
2017-01-10T17:40:00.407293+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
2017-01-10T17:40:00.407587+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Initializing pmd_bond for eth_bond0
2017-01-10T17:40:00.407881+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Create bonded device eth_bond0 on port 0 in mode 1 on socket 0.
2017-01-10T17:40:00.408138+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:14.0 on NUMA socket -1
2017-01-10T17:40:00.408147+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
2017-01-10T17:40:00.412447+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:15.0 on NUMA socket -1
2017-01-10T17:40:00.412457+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
Further investigation shows the process is stuck in a FUTEX_WAIT:
-bash-4.2# strace -p 696966
Process 696966 attached
futex(0x305e7f6c, FUTEX_WAIT, 245, NULL
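Since the symptom is that every VPP-related command simply never returns, the hang is easy to detect by putting a deadline on a harmless CLI call. A small watchdog sketch, assuming only that vppctl is on the PATH (the command choice and polling interval are illustrative, not from the original report):

import subprocess
import time

def vpp_responds(timeout_s=5):
    # "vppctl show version" is cheap; if VPP's CLI is wedged the call never
    # returns, so a timeout is treated as "unresponsive".
    try:
        subprocess.run(["vppctl", "show", "version"],
                       check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout_s)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

if __name__ == "__main__":
    while True:
        if not vpp_responds():
            print("vppctl did not answer within the deadline; VPP looks hung")
        time.sleep(30)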
This also happens on nodes running VPP which are not hosting VMs but are tapping into a Linux namespace for DHCP assignment ("MCP nodes"). On those nodes it happens much more intermittently and has been much more difficult to reproduce.
Assignee
Ole Trøan
Reporter
Kevin Bringard
Comments
commit 7104f93
Author: Ole Troan [email protected]
Date: Thu Jan 19 09:44:44 2017 +0100
Python API: Missing locking of results data structure.
The wrong assumption that the GIL combined with CPython's "mostly" [...]
Added suitable locking. Also added an atexit() to attempt a VPP disconnect on shutdown.
Also: lots more comments, docstrings, duplicated code removed.
Change-Id: I0cb7d0026db660ec141425c5ad474f14bacea36e
Thanks for all the hard work everyone!
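The commit message above describes two changes: guarding the shared results structure with explicit locking instead of relying on the GIL, and registering an atexit() handler that disconnects from VPP on interpreter shutdown. A rough sketch of that shape, using only the standard library (illustrative only, not the actual vpp_papi code):

import atexit
import threading

class ResultsStore:
    # Replies arrive on a reader thread and are picked up by the thread that
    # issued the request; a condition variable makes that hand-off explicit
    # instead of assuming the GIL keeps dict operations race-free.
    def __init__(self):
        self._cv = threading.Condition()
        self._results = {}

    def deliver(self, context, reply):
        # Called from the thread draining the shared-memory queue.
        with self._cv:
            self._results[context] = reply
            self._cv.notify_all()

    def wait_for(self, context, timeout=5.0):
        # Called from the thread that issued the request.
        with self._cv:
            self._cv.wait_for(lambda: context in self._results, timeout)
            return self._results.pop(context, None)

def _disconnect_on_exit(connection):
    # "connection" is a hypothetical stand-in for the API client object; the
    # point is only that a best-effort disconnect runs at interpreter shutdown.
    try:
        connection.disconnect()
    except Exception:
        pass

# atexit.register(_disconnect_on_exit, connection)  # registered once a connection exists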
If it's useful at all, this looks to be the point at which it hangs:
[] futex_wait_queue_me+0xc4/0x120
[] futex_wait+0x179/0x280
[] do_futex+0xfe/0x5b0
[] SyS_futex+0x80/0x180
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
It'd be very interesting to see what the API client is doing. Please attach a post-mortem API trace if you can find one. I have a theory what's going on. How hard would it be to try a private image?
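For context, the API client in question is the networking-vpp agent, which drives VPP through the Python bindings shipped in the vpp-python-api package listed above. A minimal sketch of such a client, assuming vpp_papi is installed with its generated API JSON definitions under /usr/share/vpp/api/ (the client name and message choice are illustrative):

import fnmatch
import os
from vpp_papi import VPP

# Collect the generated *.api.json definitions that vpp_papi needs.
jsonfiles = []
for root, _, filenames in os.walk('/usr/share/vpp/api/'):
    for name in fnmatch.filter(filenames, '*.api.json'):
        jsonfiles.append(os.path.join(root, name))

vpp = VPP(jsonfiles)
vpp.connect('demo-client')        # attaches to VPP's shared-memory API segment
print(vpp.api.show_version())     # a simple synchronous request/reply
vpp.disconnect()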
kevinbringard (Thu, 12 Jan 2017 17:29:57 +0000):
(gdb) info threads
Id Target Id Frame
3 Thread 0x7fe9614fc700 (LWP 24843) "eal-intr-thread" 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
2 Thread 0x7fe960cfb700 (LWP 24844) "vpp_stats" 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fe9e6fd2900 (LWP 24738))]
#0 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fe9e69ac0a2 in unix_shared_memory_queue_add () from /lib64/libvlibmemory.so.0
#2 0x00000000005bca26 in vl_api_control_ping_t_handler ()
#3 0x00007fe9e6bb9b73 in vl_msg_api_handler_with_vm_node () from /lib64/libvlibapi.so.0
#4 0x00007fe9e69a5763 in memclnt_process () from /lib64/libvlibmemory.so.0
#5 0x00007fe9e6536e26 in vlib_process_bootstrap () from /lib64/libvlib.so.0
#6 0x00007fe9e54abd20 in clib_calljmp () from /lib64/libvppinfra.so.0
#7 0x00007fe9a45a0e30 in ?? ()
#8 0x00007fe9e6537ec1 in dispatch_process.constprop.14 () from /lib64/libvlib.so.0
#9 0x0000001700000010 in ?? ()
#10 0x736c706d0000003b in ?? ()
#11 0x636e692d6a64612d in ?? ()
#12 0x206574656c706d6f in ?? ()
#13 0x4552472d534c504d in ?? ()
#14 0x70207475706e6920 in ?? ()
#15 0x64207374656b6361 in ?? ()
#16 0x616c757370616365 in ?? ()
#17 0x006425203a646574 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe960cfb700 (LWP 24844))]
#0 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
#1 0x000000000059e026 in ip46_fib_stats_delay.isra.5 ()
#2 0x00000000005a4d5a in stats_thread_fn ()
#3 0x00007fe9e54abd20 in clib_calljmp () from /lib64/libvppinfra.so.0
#4 0x00007fe960cfad60 in ?? ()
#5 0x00007fe9e4f69dc5 in start_thread () from /lib64/libpthread.so.0
#6 0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe9614fc700 (LWP 24843))]
#0 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
#1 0x00000000004f9094 in eal_intr_thread_main ()
#2 0x00007fe9e4f69dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fe9e46ac1cd in clone () from /lib64/libc.so.6
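Thread 1's backtrace matches the strace and kernel stack above: the main thread is handling a control_ping and is blocked in pthread_cond_wait() inside unix_shared_memory_queue_add(), i.e. it is waiting for room on the API client's shared-memory reply queue, which only frees up if the client keeps draining replies. A stdlib-only model of that failure mode, assuming nothing about VPP internals (the queue size and message counts are made up for illustration):

import queue
import threading
import time

reply_queue = queue.Queue(maxsize=4)   # stands in for the client's bounded shared-memory queue

def vpp_main_thread():
    # Keeps posting replies (think control_ping replies). Once the queue is
    # full and nobody drains it, put() blocks on a condition variable -- the
    # analogue of the pthread_cond_wait seen in thread 1 above.
    for seq in range(10):
        print("vpp: queueing reply", seq)
        reply_queue.put(seq)
    print("vpp: finished")             # never reached once the client stops reading

def api_client():
    # Drains a few replies, then stops -- roughly what a client whose results
    # bookkeeping has gone wrong ends up doing.
    for _ in range(3):
        print("client: got reply", reply_queue.get())
    print("client: stopped draining the queue")

threading.Thread(target=vpp_main_thread, daemon=True).start()
api_client()
time.sleep(1)                          # give the producer time to fill the queue and block
print("main: the 'vpp' thread is now stuck in put(), like the FUTEX_WAIT above")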
Original issue: https://jira.fd.io/browse/VPP-593