Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VPP-2033] Crash when creating AVF interface #3495

Closed
vvalderrv opened this issue Feb 2, 2025 · 8 comments
Closed

[VPP-2033] Crash when creating AVF interface #3495

vvalderrv opened this issue Feb 2, 2025 · 8 comments

Comments

@vvalderrv
Copy link
Contributor

Description

This is not seen in CSIT under usual conditions.

But it somewhat frequently happens with a new (unmerged) CSIT code [0] tested on 2n-icx testbed. (Daily job [2] usually passes there without any failures.)

That code should not do anything specific for AVF, but it switches Python API client to async mode, so messages may be arriving (over socket) somewhat differently (not sure how differently yet).

Either way, VPP should not occasionally crash in avf_process_request [1] just because something is different on API level.

[0] https://gerrit.fd.io/r/c/csit/+/36774

[1] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k4-k1

[2] https://jenkins.fd.io/view/csit/job/csit-vpp-perf-mrr-daily-master-2n-icx/

Assignee

Unassigned

Reporter

Vratko Polak

Comments

  • vrpolak (Mon, 2 Sep 2024 11:38:33 +0000): VPP-2121 was opened later, but contains better insights, so closing this one.
  • vrpolak (Fri, 9 Sep 2022 14:17:41 +0000): > One way to fix this would be to use vlib_process_get_events_with_type.

Contributed here [9].

[9] https://gerrit.fd.io/r/c/vpp/+/37083

  • vrpolak (Fri, 9 Sep 2022 08:35:58 +0000): After reading about processes and events in the docs [7], I think I understand what is going on.

it seems like the vlib_process_wait_for_event_or_clock returns prematurely,

The return is not premature, because there is an event. But it is not the expected confirmation from the AVF process (event type 0), it is [8] a new SOCKET_READ_EVENT (event type 2). The current code thinks it is a timeout, because it sees the event type is non-zero, but does not check if it is really a timeout (event type ~0).

One way to fix this would be to use vlib_process_get_events_with_type.

[7] https://github.com/FDio/vpp/blob/9ad39c026c8a3c945a7003c4aa4f5cb1d4c80160/docs/developer/corearchitecture/vlib.rst

[8] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-ubuntu2204-x86_64-1n-skx/133/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k3

  • ayourtch (Thu, 1 Sep 2022 15:06:50 +0000): I did some debugging, this is preliminary notes:

Part of the backtrace:

#9  0x00007ffff6d73887 in _clib_error (how_to_die=2, function_name=0x7fff769edb0c "avf_process_request", line_number=1285, fmt=0x7fff769edb20 "avf process node failed to reply in 5 seconds") at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vppinfra/error.c:143

#10 0x00007fff769cc9a1 in avf_process_request (vm=0x7fff76a00740, req=0x7fff766efa58) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1285

#11 0x00007fff769cc614 in avf_flag_change (vnm=0x7ffff7f4c368 <vnet_main>, hw=0x7fff7e508e40, flags=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1318

#12 0x00007ffff70d9f8f in ethernet_set_flags (vnm=0x7ffff7f4c368 <vnet_main>, hw_if_index=1, flags=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vnet/ethernet/interface.c:484

#13 0x00007fff769cb9bf in avf_create_if (vm=0x7fff76a00740, args=0x7fff766efcb0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1733

#14 0x00007fff769e047a in vl_api_avf_create_t_handler (mp=0x7fff7e52c978) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/avf_api.c:52

#15 0x00007ffff7fa46c7 in msg_handler_internal (am=0x7ffff7fbbd78 <api_global_main>, the_msg=0x7fff7e52c978, msg_len=24, trace_it=0, do_it=1, tfree_it=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vlibapi/api_shared.c:576

The avf_create handler calls avf_flag_change, which calls avf_process_request; which posts the request to avf_process and suspends the thread for 5 seconds.

the avf_process picks up the request and then calls vf_process_handle_request which calls avf_op_config_promisc_mode and then signals the event by using vlib_process_signal_event. however, avf_op_config_promisc_mode may in turn call vlib_process_suspend in a cycle with doubling timeout, while awaiting for the reply from hardware.

experimenting, it seems like the vlib_process_wait_for_event_or_clock returns prematurely, and thus the panic happens - because the event has not been posted yet...

  • vrpolak (Thu, 11 Aug 2022 12:30:18 +0000): > messages may be arriving (over socket) somewhat differently (not sure how differently yet).

I think I identified what is the difference. VPP PAPI sync code is smart, it sends control ping only when the number of replies is not known (for dump/detail pair). Otherwise (request/reply) it just waits for the response without sending control ping. The new CSIT code was always sending control ping (or a subsequent request without waiting for the first reply). When I changed CSIT code to be also (almost) as smart, failures stopped.

Another way to stop failures was to hack the socket so it sends both a request and subsequent control ping as a single binary chunk.

So I think the crash happens when VPP socket API receives another message while processing previous message that relies on avf_flag_change. Possible fix attempts include:

  1. Make sure AVF related processing does not get confused by signals caused by API socket.

  2. Set API socket to stop sending signals while any AVF (or any API in general) call is running.

  • vrpolak (Thu, 11 Aug 2022 09:43:22 +0000): Here [6] is a backtrace from debug build.

[6] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/7/csit_current/0/log.html.gz#s1-s1-s1-s1-s1-t3-k3-k4-k1

  • vrpolak (Wed, 10 Aug 2022 16:52:46 +0000): > VPP stopped responding without crashing.

Looking more closely, I see it does crash [4], but still no idea why it does so without producing core file.

other failures with new CSIT code

I noticed (when there IS a core file) the other failures also involve avf_flag_change [5].

[4] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s3-t1-k3-k5-k1-k3-k1

[5] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/6/csit_current/0/log.html.gz#s1-s1-s1-s1-s2-t2-k3-k4-k1

  • vrpolak (Wed, 10 Aug 2022 14:41:20 +0000):

    There are other failures with new CSIT code which do not happen with old code, for example here [3] VPP stopped responding without crashing.

I see the word "signal" in the avf_process_request code, so perhaps when something goes wrong with API socket, it creates a signal, and that can be picked up by AVF code?

[3] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k9-k1-k1-k1-k1-k1-k6-k3

Original issue: https://jira.fd.io/browse/VPP-2033

@vvalderrv
Copy link
Contributor Author

VPP-2121 was opened later, but contains better insights, so closing this one.

@vvalderrv
Copy link
Contributor Author

> One way to fix this would be to use vlib_process_get_events_with_type.

Contributed here [9].

[9] https://gerrit.fd.io/r/c/vpp/+/37083

@vvalderrv
Copy link
Contributor Author

After reading about processes and events in the docs [7], I think I understand what is going on.

> it seems like the vlib_process_wait_for_event_or_clock returns prematurely,

The return is not premature, because there is an event. But it is not the expected confirmation from the AVF process (event type 0), it is [8] a new SOCKET_READ_EVENT (event type 2). The current code thinks it is a timeout, because it sees the event type is non-zero, but does not check if it is really a timeout (event type ~0).
One way to fix this would be to use vlib_process_get_events_with_type.

[7] https://github.com/FDio/vpp/blob/9ad39c026c8a3c945a7003c4aa4f5cb1d4c80160/docs/developer/corearchitecture/vlib.rst
[8] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-ubuntu2204-x86_64-1n-skx/133/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k3

@vvalderrv
Copy link
Contributor Author

I did some debugging, this is preliminary notes:

Part of the backtrace:

#9  0x00007ffff6d73887 in _clib_error (how_to_die=2, function_name=0x7fff769edb0c "avf_process_request", line_number=1285, fmt=0x7fff769edb20 "avf process node failed to reply in 5 seconds") at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vppinfra/error.c:143

#10 0x00007fff769cc9a1 in avf_process_request (vm=0x7fff76a00740, req=0x7fff766efa58) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1285

#11 0x00007fff769cc614 in avf_flag_change (vnm=0x7ffff7f4c368 <vnet_main>, hw=0x7fff7e508e40, flags=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1318

#12 0x00007ffff70d9f8f in ethernet_set_flags (vnm=0x7ffff7f4c368 <vnet_main>, hw_if_index=1, flags=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vnet/ethernet/interface.c:484

#13 0x00007fff769cb9bf in avf_create_if (vm=0x7fff76a00740, args=0x7fff766efcb0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1733

#14 0x00007fff769e047a in vl_api_avf_create_t_handler (mp=0x7fff7e52c978) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/avf_api.c:52

#15 0x00007ffff7fa46c7 in msg_handler_internal (am=0x7ffff7fbbd78 <api_global_main>, the_msg=0x7fff7e52c978, msg_len=24, trace_it=0, do_it=1, tfree_it=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vlibapi/api_shared.c:576

The avf_create handler calls avf_flag_change, which calls avf_process_request; which posts the request to avf_process and suspends the thread for 5 seconds.

the avf_process picks up the request and then calls vf_process_handle_request which calls avf_op_config_promisc_mode and then signals the event by using vlib_process_signal_event. however, avf_op_config_promisc_mode may in turn call vlib_process_suspend in a cycle with doubling timeout, while awaiting for the reply from hardware.

experimenting, it seems like the vlib_process_wait_for_event_or_clock returns prematurely, and thus the panic happens - because the event has not been posted yet...

 

@vvalderrv
Copy link
Contributor Author

> messages may be arriving (over socket) somewhat differently (not sure how differently yet).

I think I identified what is the difference. VPP PAPI sync code is smart, it sends control ping only when the number of replies is not known (for dump/detail pair). Otherwise (request/reply) it just waits for the response without sending control ping. The new CSIT code was always sending control ping (or a subsequent request without waiting for the first reply). When I changed CSIT code to be also (almost) as smart, failures stopped.
Another way to stop failures was to hack the socket so it sends both a request and subsequent control ping as a single binary chunk.

So I think the crash happens when VPP socket API receives another message while processing previous message that relies on avf_flag_change. Possible fix attempts include:
1. Make sure AVF related processing does not get confused by signals caused by API socket.
2. Set API socket to stop sending signals while any AVF (or any API in general) call is running.

@vvalderrv
Copy link
Contributor Author

@vvalderrv
Copy link
Contributor Author

> VPP stopped responding without crashing.

Looking more closely, I see it does crash [4], but still no idea why it does so without producing core file.

> other failures with new CSIT code

I noticed (when there IS a core file) the other failures also involve avf_flag_change [5].

[4] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s3-t1-k3-k5-k1-k3-k1
[5] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/6/csit_current/0/log.html.gz#s1-s1-s1-s1-s2-t2-k3-k4-k1

@vvalderrv
Copy link
Contributor Author

There are other failures with new CSIT code which do not happen with old code, for example here [3] VPP stopped responding without crashing.
I see the word "signal" in the avf_process_request code, so perhaps when something goes wrong with API socket, it creates a signal, and that can be picked up by AVF code?

[3] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k9-k1-k1-k1-k1-k1-k6-k3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant