Description
I hit a VPP-LB crash under heavy traffic with some specific commits applied.
This issue is reproducible and I need help to solve it.
VPP version
VPP 18.10 with 2 patches:
- https://gerrit.fd.io/r/#/c/12680/
- https://gerrit.fd.io/r/#/c/18826/
> Reproducing steps
1. VIP setup
Create one VIP and 3 members.
ex.
vpp# lb vip 192.168.1.1/32 protocol tcp port 80 encap l3dsr dscp 23 new_len 1024
vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.1
vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.2
vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.3
2. Start heavy traffic
ex.
$ while :; do curl localhost ; done
3. Delete the member with flush via CLI
When deleting a VIP member with flush, VPP crashes.
When deleting a VIP member without flush, VPP doesn't crash.
This crash is almost 100% reproducible. Even when VPP doesn't crash right away, repeatedly creating and deleting the member soon makes it crash.
ex.
$ sudo vppctl
vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.2 del flush
> log & coredump
- Please see the attached dump files & vpp-*.rpm files for details
When deleting a VIP member with flush, VPP receives SIGSEGV and crashes.
sudo journalctl -xe -u vpp
...
May 17 13:43:34 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc94e37c0
May 17 13:43:35 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc948d180
May 17 13:43:35 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc949d280
May 17 13:43:36 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc9583ec0
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: received signal SIGSEGV, PC 0x7efbc566f1c7, faulting address 0x7effca81c2d8
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #0 0x00007efc0b1a03e8 0x7efc0b1a03e8
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #1 0x00007efc0aaf15d0 0x7efc0aaf15d0
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #2 0x00007efbc566f1c7 lb4_l3dsr_port_node_fn + 0x8f7
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #3 0x00007efc0b173caf 0x7efc0b173caf
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #4 0x00007efc0b17485f vlib_worker_loop + 0x23f
May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #5 0x00007efc0a87b6a4 0x7efc0a87b6a4
May 17 13:44:43 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp systemd[1]: vpp.service: main process exited, code=killed, status=6/ABRT
May 17 13:44:43 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp systemd[1]: Unit vpp.service entered failed state.
May 17 13:44:43 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp systemd[1]: vpp.service failed.
According to gdb, I think calling vlib_refcount_add() with an inappropriate value at src/plugins/lb/node.c:366 leads to this issue, but I can't find the root cause.
sudo gdb /usr/bin/vpp vpp_wk_3-1558068278.33199
[sudo] password for ytatsumi:
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/vpp...Reading symbols from /usr/lib/debug/usr/bin/vpp.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 33210]
[New LWP 33207]
[New LWP 33209]
[New LWP 33208]
[New LWP 33204]
[New LWP 33215]
[New LWP 33205]
[New LWP 33199]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/vpp -c /etc/vpp/startup.conf'.
Program terminated with signal 6, Aborted.
#0 0x00007efc09add207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb)
(gdb) bt
#0 0x00007efc09add207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007efc09ade8f8 in __GI_abort () at abort.c:90
#2 0x0000000000407cee in os_exit (code=code@entry=1) at /usr/src/debug/vpp-18.10/src/vpp/vnet/main.c:349
#3 0x00007efc0b1a049c in unix_signal_handler (signum=<optimized out>, si=<optimized out>, uc=<optimized out>) at /usr/src/debug/vpp-18.10/src/vlib/unix/main.c:157
#5 vlib_refcount_add (v=-1, counter_index=<optimized out>, thread_index=<optimized out>, r=<optimized out>) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/src/vnet/util/refcount.h:74
#6 lb_node_fn (per_port_vip=1 '\001', encap_type=LB_ENCAP_TYPE_L3DSR, is_input_v4=1 '\001', frame=0x7efbc953d840, node=0x7efbc94bf980, vm=0x7efbc9529ac0) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/src/plugins/lb/node.c:366
#7 lb4_l3dsr_port_node_fn (vm=0x7efbc9529ac0, node=0x7efbc94bf980, frame=0x7efbc953d840) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/src/plugins/lb/node.c:1015
#8 0x00007efc0b173caf in dispatch_pending_node (vm=vm@entry=0x7efbc9529ac0, pending_frame_index=pending_frame_index@entry=2, last_time_stamp=25449526040817171) at /usr/src/debug/vpp-18.10/src/vlib/main.c:989
#9 0x00007efc0b17485f in vlib_main_or_worker_loop (is_main=0, vm=0x7efbc9529ac0) at /usr/src/debug/vpp-18.10/src/vlib/main.c:1555
#10 vlib_worker_loop (vm=0x7efbc9529ac0) at /usr/src/debug/vpp-18.10/src/vlib/main.c:1635
#11 0x00007efc0a87b6a4 in clib_calljmp () from /lib64/libvppinfra.so.18.10
#12 0x00007ef95c65bcc0 in ?? ()
#13 0x00007efbc6e5d26d in eal_thread_loop (arg=<optimized out>) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/build-root/build-vpp-native/external/dpdk-18.08/lib/librte_eal/linuxapp/eal/eal_thread.c:153
#14 0xc9532e6400007efb in ?? ()
#15 0xc9532f8400007efb in ?? ()
#16 0xc95330a400007efb in ?? ()
> Corresponding part of code
I picked out the parts of the code corresponding to the gdb result.
If you need the full source code, please see the "VPP version" section above.
src/vnet/util/refcount.h
static_always_inline
void vlib_refcount_add(vlib_refcount_t *r, u32 thread_index, u32 counter_index, i32 v)
{
  vlib_refcount_per_cpu_t *per_cpu = &r->per_cpu[thread_index];
  if (PREDICT_FALSE(counter_index >= vec_len(per_cpu->counters)))
    __vlib_refcount_resize(per_cpu, clib_max(counter_index + 16,
                                             (vec_len(per_cpu->counters)) * 2));

  per_cpu->counters[counter_index] += v;  // refcount.h:74
}
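To illustrate why a stale index would match this backtrace, below is a minimal standalone sketch. It is plain C, not VPP code: refcount_t and refcount_add are hypothetical stand-ins for vlib_refcount_t and vlib_refcount_add(), and the premise that a flushed sticky-hash bucket can return garbage is exactly that, an assumption.

/*
 * Minimal standalone sketch (plain C, NOT VPP code) of the suspected
 * failure mode: a garbage counter_index reaches the resize logic.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
  int64_t *counters;   /* stands in for per_cpu->counters          */
  uint32_t n_counters; /* stands in for vec_len(per_cpu->counters) */
} refcount_t;

static void
refcount_add (refcount_t *r, uint32_t counter_index, int32_t v)
{
  if (counter_index >= r->n_counters)
    {
      /* Mirrors clib_max(counter_index + 16, vec_len * 2): when
       * counter_index is garbage near 2^32, "counter_index + 16" wraps in
       * 32-bit arithmetic, the doubling term wins, and the vector never
       * grows enough to cover counter_index. */
      uint32_t new_len = counter_index + 16;
      if (r->n_counters * 2 > new_len)
        new_len = r->n_counters * 2;
      r->counters = realloc (r->counters, (size_t) new_len * sizeof (int64_t));
      if (!r->counters)
        abort ();
      memset (r->counters + r->n_counters, 0,
              (size_t) (new_len - r->n_counters) * sizeof (int64_t));
      r->n_counters = new_len;
    }
  /* With a wild index, this store is the crash site, matching the
   * backtrace frame at refcount.h:74. */
  r->counters[counter_index] += v;
}

int
main (void)
{
  refcount_t r = { 0, 0 };
  refcount_add (&r, 3, 1);            /* sane index: vector grows to 19 */
  refcount_add (&r, 0xfffffff0u, -1); /* stale index: wrapped resize, bad store */
  return 0;
}

Compiled and run, the second call does a wrapped (far too small) resize and then stores roughly 32 GB past the allocation, which has the same shape as the SIGSEGV reported at refcount.h:74.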
src/plugins/lb/node.c
      else if (PREDICT_TRUE(available_index0 != ~0))
        {
          u32 old_as_index = 0;
          //There is an available slot for a new flow
          asindex0 =
            vip0->new_flow_table[hash0 & vip0->new_flow_table_mask].as_index;
          counter = LB_VIP_COUNTER_FIRST_PACKET;
          counter = (asindex0 == 0) ? LB_VIP_COUNTER_NO_SERVER : counter;

          //TODO: There are race conditions with as0 and vip0 manipulation.
          //Configuration may be changed, vectors resized, etc...

          //Dereference previously used
          old_as_index = lb_hash_available_value(sticky_ht,
                                                 hash0, available_index0);
          vlib_refcount_add(&lbm->as_refcount, thread_index,  // node.c:366
                            old_as_index, -1);
          vlib_refcount_add(&lbm->as_refcount, thread_index,
                            asindex0, 1);
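If the stale-value assumption is right, a bound check on old_as_index would at least confirm it. The following is only an illustrative debugging guard under that assumption, not the actual fix; it relies on lbm->ass being the AS pool, as elsewhere in the lb plugin.

          /* Illustrative debugging guard only, not the submitted patch:
           * validate the value recovered from the sticky hash before using
           * it as a refcount index. */
          old_as_index = lb_hash_available_value(sticky_ht,
                                                 hash0, available_index0);
          if (PREDICT_FALSE(old_as_index >= pool_len(lbm->ass)))
            {
              clib_warning("stale AS index %u from sticky table",
                           old_as_index);
              old_as_index = 0; /* AS index 0 is the "no server" entry */
            }
          vlib_refcount_add(&lbm->as_refcount, thread_index,
                            old_as_index, -1);

If the warning fires right after a "del flush", that would point at the sticky table returning entries for the removed AS.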
Assignee
Hongjun Ni
Reporter
Yusuke Tatsumi
Comments
henry_ni (Tue, 4 Jun 2019 01:26:54 +0000): Hi Tatsumi-san,
Thank you for your confirmation and help!
Thanks,
Hongjun

yusuketatsumi (Mon, 3 Jun 2019 10:54:45 +0000): Hi Hongjun,
I could confirm that VPP doesn't crash with this patch on my machine either!
Thanks a lot.

yusuketatsumi (Thu, 30 May 2019 11:27:40 +0000): Hi Hongjun,
Thanks for your help.
I will try your patch on my machine.

henry_ni (Thu, 30 May 2019 08:13:08 +0000): Hi Tatsumi,
I have reproduced this issue on my local server
and have submitted a patch to fix it.
I have tested this patch and it works well.
Below are my test commands and packet trace:
Packet 1 uses AS1: 192.168.50.74.
After I deleted and flushed AS1: 192.168.50.74,
packet 2 switched to AS2: 192.168.50.75.
The issue we met before has now disappeared.
----------------------------------------------------------------------
lb vip 90.1.2.1/32 protocol tcp port 20000 encap l3dsr dscp 7 new_len 16
lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.74
lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.75
DBGvpp# sh lb vip verbose
ip4-l3dsr [1] 90.1.2.1/32
new_size:16
protocol:6 port:20000
dscp:7
counters:
packet from existing sessions: 0
first session packet: 0
untracked packet: 0
no server configured: 0
#as:2
192.168.50.74 8 buckets 0 flows dpo:14 used
192.168.50.75 8 buckets 0 flows dpo:15 used
----------------------------------------------------------------------
DBGvpp# ex /root/lb_l3dsr.pg
DBGvpp# trace add pg-input 1
DBGvpp# pa en
DBGvpp# 0: lb_get_sticky_table:154: Regenerated sticky table 0x7fffb6a7fb00
Packet 1:
00:00:53:787360: lb4-l3dsr-port
lb vip[1]: ip4-l3dsr 90.1.2.1/32 new_size:16 #as:2 protocol:6 port:20000 dscp:7
lb as[1]: 192.168.50.74 used
----------------------------------------------------------------------
DBGvpp# lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.74 del flush
0: lb_as_command_fn:269: vip index is 1
DBGvpp#
DBGvpp# sh lb vip verbose
ip4-l3dsr [1] 90.1.2.1/32
new_size:16
protocol:6 port:20000
dscp:7
counters:
packet from existing sessions: 1023
first session packet: 1
untracked packet: 0
no server configured: 0
#as:2
192.168.50.74 0 buckets 0 flows dpo:14 removed
192.168.50.75 16 buckets 0 flows dpo:15 used
----------------------------------------------------------------------
DBGvpp# ex /root/lb_l3dsr.pg
DBGvpp# trace add pg-input 1
DBGvpp# pa en
Packet 2
00:04:43:560694: lb4-l3dsr-port
lb vip[1]: ip4-l3dsr 90.1.2.1/32 new_size:16 #as:2 protocol:6 port:20000 dscp:7
lb as[2]: 192.168.50.75 used
yusuketatsumi (Fri, 17 May 2019 07:59:32 +0000): I used the split command to divide the large files into smaller ones.
Please reassemble the files as below:
cat vpp_wk_3-1558068278.33199.tar.gz.a* > vpp_wk_3-1558068278.33199.tar.gz
cat vpp-debuginfo-18.10-3_ci_builddate_20190416_1119.x86_64.rpm.* > vpp-debuginfo-18.10-3_ci_builddate_20190416_1119.x86_64.rpm
cat vpp-plugins-18.10-3_ci_builddate_20190416_1119.x86_64.rpm.* > vpp-plugins-18.10-3_ci_builddate_20190416_1119.x86_64.rpm
Original issue: https://jira.fd.io/browse/VPP-1680