
[VPP-1680] VPP-LB crashed with some specific commit under heavy traffic #3143

Closed
vvalderrv opened this issue Feb 2, 2025 · 5 comments
Description

VPP-LB crashes under heavy traffic when certain specific commits are applied.

This issue is reproducible, and I need help resolving it.

 

VPP version

VPP 18.10 with 2 patches:

    - https://gerrit.fd.io/r/#/c/12680/
    - https://gerrit.fd.io/r/#/c/18826/

    > Reproduction steps

    1. VIP setup

    Create one VIP and three members, for example:

    ex.

    vpp# lb vip 192.168.1.1/32 protocol tcp port 80 encap l3dsr dscp 23 new_len 1024

    vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.1

    vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.2

    vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.3


    2. Start wget (heavy traffic)

    ex.

    $ while true; do curl localhost; done


     

    3. Delete the member with flush via the CLI

    When deleting a VIP member with flush, VPP crashes.

    When deleting a VIP member without flush, VPP does not crash.

    This crash is almost 100% reproducible. If VPP does not crash immediately, repeatedly creating and deleting the member soon triggers it.

    ex.

    $ sudo vppctl

    vpp# lb as 192.168.1.1/32 protocol tcp port 80 172.16.1.2 del flush

    > log & coredump

      - Please see the attached dump files & vpp-*.rpms for details

    When deleting a VIP member with flush, VPP receives SIGSEGV and crashes.

    sudo journalctl -xe -u vpp

    ...

    May 17 13:43:34 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc94e37c0

    May 17 13:43:35 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc948d180

    May 17 13:43:35 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc949d280

    May 17 13:43:36 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: lb_get_sticky_table:154: Regenerated sticky table 0x7efbc9583ec0

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: received signal SIGSEGV, PC 0x7efbc566f1c7, faulting address 0x7effca81c2d8

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #0 0x00007efc0b1a03e8 0x7efc0b1a03e8

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #1 0x00007efc0aaf15d0 0x7efc0aaf15d0

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #2 0x00007efbc566f1c7 lb4_l3dsr_port_node_fn + 0x8f7

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #3 0x00007efc0b173caf 0x7efc0b173caf

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #4 0x00007efc0b17485f vlib_worker_loop + 0x23f

    May 17 13:44:38 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp vnet[33199]: #5 0x00007efc0a87b6a4 0x7efc0a87b6a4

    May 17 13:44:43 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp systemd[1]: vpp.service: main process exited, code=killed, status=6/ABRT

    May 17 13:44:43 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp systemd[1]: Unit vpp.service entered failed state.

    May 17 13:44:43 vpplb-node007.ops.ynwm.ssk.yahoo.co.jp systemd[1]: vpp.service failed.


     

    According to gdb, I think calling vlib_refcount_add() with an inappropriate value at src/plugins/lb/node.c:366 leads to this issue, but I can't find the root cause.

    sudo gdb /usr/bin/vpp vpp_wk_3-1558068278.33199

    [sudo] password for ytatsumi:

    GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7

    Copyright (C) 2013 Free Software Foundation, Inc.

    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

    This is free software: you are free to change and redistribute it.

    There is NO WARRANTY, to the extent permitted by law. Type "show copying"

    and "show warranty" for details.

    This GDB was configured as "x86_64-redhat-linux-gnu".

    For bug reporting instructions, please see:

    <http://www.gnu.org/software/gdb/bugs/>...

    Reading symbols from /usr/bin/vpp...Reading symbols from /usr/lib/debug/usr/bin/vpp.debug...done.

    done.

    warning: core file may not match specified executable file.

    [New LWP 33210]

    [New LWP 33207]

    [New LWP 33209]

    [New LWP 33208]

    [New LWP 33204]

    [New LWP 33215]

    [New LWP 33205]

    [New LWP 33199]

    [Thread debugging using libthread_db enabled]

    Using host libthread_db library "/lib64/libthread_db.so.1".

    Core was generated by `/usr/bin/vpp -c /etc/vpp/startup.conf'.

    Program terminated with signal 6, Aborted.

    #0 0x00007efc09add207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55

    55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);

    (gdb)

    (gdb) bt

    #0 0x00007efc09add207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55

    #1 0x00007efc09ade8f8 in __GI_abort () at abort.c:90

    #2 0x0000000000407cee in os_exit (code=code@entry=1) at /usr/src/debug/vpp-18.10/src/vpp/vnet/main.c:349

    #3 0x00007efc0b1a049c in unix_signal_handler (signum=, si=, uc=) at /usr/src/debug/vpp-18.10/src/vlib/unix/main.c:157

    #4

    #5 vlib_refcount_add (v=-1, counter_index=, thread_index=, r=) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/src/vnet/util/refcount.h:74

    #6 lb_node_fn (per_port_vip=1 '\001', encap_type=LB_ENCAP_TYPE_L3DSR, is_input_v4=1 '\001', frame=0x7efbc953d840, node=0x7efbc94bf980, vm=0x7efbc9529ac0) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/src/plugins/lb/node.c:366

    #7 lb4_l3dsr_port_node_fn (vm=0x7efbc9529ac0, node=0x7efbc94bf980, frame=0x7efbc953d840) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/src/plugins/lb/node.c:1015

    #8 0x00007efc0b173caf in dispatch_pending_node (vm=vm@entry=0x7efbc9529ac0, pending_frame_index=pending_frame_index@entry=2, last_time_stamp=25449526040817171) at /usr/src/debug/vpp-18.10/src/vlib/main.c:989

    #9 0x00007efc0b17485f in vlib_main_or_worker_loop (is_main=0, vm=0x7efbc9529ac0) at /usr/src/debug/vpp-18.10/src/vlib/main.c:1555

    #10 vlib_worker_loop (vm=0x7efbc9529ac0) at /usr/src/debug/vpp-18.10/src/vlib/main.c:1635

    #11 0x00007efc0a87b6a4 in clib_calljmp () from /lib64/libvppinfra.so.18.10

    #12 0x00007ef95c65bcc0 in ?? ()

    #13 0x00007efbc6e5d26d in eal_thread_loop (arg=) at /tmp/ci-package/vpp/rpmbuild/SOURCES/vpp-18.10/build-root/rpmbuild/vpp-18.10/build-root/build-vpp-native/external/dpdk-18.08/lib/librte_eal/linuxapp/eal/eal_thread.c:153

    #14 0xc9532e6400007efb in ?? ()

    #15 0xc9532f8400007efb in ?? ()

    #16 0xc95330a400007efb in ?? ()


     

    > Corresponding part of code

    I have picked out the parts of the code that correspond to the gdb backtrace.

    If you need the full source code, please see the "VPP version" section above.

    src/vnet/util/refcount.h

    static_always_inline
    void vlib_refcount_add(vlib_refcount_t *r, u32 thread_index, u32 counter_index, i32 v)
    {
      vlib_refcount_per_cpu_t *per_cpu = &r->per_cpu[thread_index];
      if (PREDICT_FALSE(counter_index >= vec_len(per_cpu->counters)))
        __vlib_refcount_resize(per_cpu,
                               clib_max(counter_index + 16,
                                        (vec_len(per_cpu->counters)) * 2));
      per_cpu->counters[counter_index] += v; // refcount.h
    }
     

    src/plugins/lb/node.c

    else if (PREDICT_TRUE(available_index0 != ~0))
      {
        u32 old_as_index = 0;
        //There is an available slot for a new flow
        asindex0 =
          vip0->new_flow_table[hash0 & vip0->new_flow_table_mask].as_index;
        counter = LB_VIP_COUNTER_FIRST_PACKET;
        counter = (asindex0 == 0) ? LB_VIP_COUNTER_NO_SERVER : counter;

        //TODO: There are race conditions with as0 and vip0 manipulation.
        //Configuration may be changed, vectors resized, etc...

        //Dereference previously used
        old_as_index = lb_hash_available_value(sticky_ht,
                                               hash0, available_index0);
        vlib_refcount_add(&lbm->as_refcount, thread_index, // node.c:366
                          old_as_index, -1);
        vlib_refcount_add(&lbm->as_refcount, thread_index,
                          asindex0, 1);

    Assignee: Hongjun Ni

    Reporter: Yusuke Tatsumi

    Comments

    • henry_ni (Tue, 4 Jun 2019 01:26:54 +0000): Hi Tatsumi-san,

    Thank you for confirming, and for your help!

    Thanks,

    Hongjun

    • yusuketatsumi (Mon, 3 Jun 2019 10:54:45 +0000): Hi Hongjun,

    I can confirm that VPP no longer crashes with this patch on my machine, too!

    Thanks a lot.

    • yusuketatsumi (Thu, 30 May 2019 11:27:40 +0000): Hi Hongjun,

    Thanks for your help.

    I will try your patch on my machine.

    • henry_ni (Thu, 30 May 2019 08:13:08 +0000): Hi Tatsumi,

    I have reproduced this issue on my local server and have submitted a patch to fix it.

    I have tested the patch and it works well.

     

    Below are my test commands and packet trace:

    Packet 1 uses AS1: 192.168.50.74.

    After I deleted and flushed AS1: 192.168.50.74,

    Packet 2 switched to AS2: 192.168.50.75.

    The issue we hit before no longer occurs.

    ----------------------------------------------------------------------

    lb vip 90.1.2.1/32 protocol tcp port 20000 encap l3dsr dscp 7 new_len 16

    lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.74

    lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.75

    DBGvpp# sh lb vip verbose

    sh lb vip verbose

    ip4-l3dsr [1] 90.1.2.1/32

    new_size:16

    protocol:6 port:20000

    dscp:7

    counters:

    packet from existing sessions: 0

    first session packet: 0

    untracked packet: 0

    no server configured: 0

    #as:2

    192.168.50.74 8 buckets 0 flows dpo:14 used

    192.168.50.75 8 buckets 0 flows dpo:15 used

    ----------------------------------------------------------------------

    ex /root/lb_l3dsr.pg

    DBGvpp# trace add pg-input 1

    DBGvpp# pa en

    DBGvpp# 0: lb_get_sticky_table:154: Regenerated sticky table 0x7fffb6a7fb00

    Packet 1:

    00:00:53:787360: lb4-l3dsr-port

    lb vip[1]: ip4-l3dsr 90.1.2.1/32 new_size:16 #as:2 protocol:6 port:20000 ds

    cp:7

    lb as[1]: 192.168.50.74 used

    ----------------------------------------------------------------------

    DBGvpp# lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.74 del flush

    lb as 90.1.2.1/32 protocol tcp port 20000 192.168.50.74 del flush

    0: lb_as_command_fn:269: vip index is 1

    DBGvpp#

    DBGvpp# sh lb vip verbose

    sh lb vip verbose

    ip4-l3dsr [1] 90.1.2.1/32

    new_size:16

    protocol:6 port:20000

    dscp:7

    counters:

    packet from existing sessions: 1023

    first session packet: 1

    untracked packet: 0

    no server configured: 0

    #as:2

    192.168.50.74 0 buckets 0 flows dpo:14 removed

    192.168.50.75 16 buckets 0 flows dpo:15 used

    ----------------------------------------------------------------------

    DBGvpp# ex /root/lb_l3dsr.pg

    trace add pg-input 1

    pa en

    ex /root/lb_l3dsr.pg

    DBGvpp# trace add pg-input 1

    DBGvpp# pa en

    Packet 2

    00:04:43:560694: lb4-l3dsr-port

    lb vip[1]: ip4-l3dsr 90.1.2.1/32 new_size:16 #as:2 protocol:6 port:20000 ds

    cp:7

    lb as[2]: 192.168.50.75 used


    • yusuketatsumi (Fri, 17 May 2019 07:59:32 +0000): I used the split command to divide the large files into smaller ones.

    Please reassemble the files as below:

    cat vpp_wk_3-1558068278.33199.tar.gz.a* > vpp_wk_3-1558068278.33199.tar.gz

    cat vpp-debuginfo-18.10-3_ci_builddate_20190416_1119.x86_64.rpm.* > vpp-debuginfo-18.10-3_ci_builddate_20190416_1119.x86_64.rpm

    cat vpp-plugins-18.10-3_ci_builddate_20190416_1119.x86_64.rpm.* > vpp-plugins-18.10-3_ci_builddate_20190416_1119.x86_64.rpm

    Original issue: https://jira.fd.io/browse/VPP-1680

