Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLE Client connection failure causes m_connectedCount to incorrectly decrement #3974

Closed
jamesi8086 opened this issue May 7, 2020 · 26 comments
Labels
Status: Stale Issue is stale stage (outdated/stuck)

Comments

@jamesi8086
Copy link
Contributor

jamesi8086 commented May 7, 2020

Hardware:

Board: M5Stick-C
Core Installation version: 1.0.4
IDE name: Arduino IDE
Flash Frequency: 40Mhz
PSRAM enabled: no
Upload Speed: 1500000
Computer OS: Win 10

Description:

I have an initialized BLE Server and BLE Client on the same device.
BLE Client connects and disconnects repeatedly to another device e.g. my android phone.

When the Client fails to connect, a ESP_GATTS_DISCONNECT_EVT is somehow triggered, decrementing m_connectedCount. This should not happen, as m_connectedCount was not incremented.

Symptom is that m_connectedCount is now 1 less than expected, and potentially underflowing to 65535.

Reproducibility happens rarely, probably only during the boundary where the phone changes its MAC address and the ESP32 is just about to read it. We can increase the chance of this happening by turning off the radio at the right time, during the connecting phase.

Sketch: (leave the backquotes for code formatting)

Related code:


uint16_t _OT_ProtocolV2::get_connected_count()
{
  return this->bleServer->getConnectedCount();
}


bool _OT_ProtocolV2::connect_and_exchange(BLEAddress address, int8_t rssi)
{
  // Connect to the BLE Server
  // - we need BLE_ADDR_TYPE_RANDOM as phones are using random addressing which changes over time
  this->bleClient->connect(address, BLE_ADDR_TYPE_RANDOM);
  BLERemoteService* pRemoteService;
  BLERemoteCharacteristic* pRemoteCharacteristic;

  if(!this->bleClient->isConnected()) 
  {
    Serial.println(F("Client connection failed"));
    return false;
  }

  pRemoteService = this->bleClient->getService(this->serviceUUID);
  if (pRemoteService == NULL)
  {
    if(this->bleClient->isConnected()) this->bleClient->disconnect();
    Serial.println(F("Failed to find our service UUID"));
    return false;
  }

  pRemoteCharacteristic = pRemoteService->getCharacteristic(this->characteristicUUID);
  if (pRemoteCharacteristic == NULL)
  {
    if(this->bleClient->isConnected()) this->bleClient->disconnect();
    Serial.println(F("Failed to find our characteristic UUID"));
    return false;
  }

  if(!pRemoteCharacteristic->canRead() || !pRemoteCharacteristic->canWrite())
  {
    if(this->bleClient->isConnected()) this->bleClient->disconnect();
    Serial.println(F("Unable to read or write"));
    return false;
  }

  std::string buf;
  this->prepare_central_write_request(buf, rssi);
  pRemoteCharacteristic->writeValue(buf, false);
  Serial.print("BLE central Send: ");
  Serial.println(buf.c_str());

  OT_ConnectionRecord connectionRecord;
  buf = pRemoteCharacteristic->readValue();

  // no need for ble client after this point
  if(this->bleClient->isConnected()) this->bleClient->disconnect();  
  
  if(!this->process_central_read_request(buf, connectionRecord))
  {
    Serial.println(F("Json parse error or read failed"));
    return false;
  }

  Serial.print("BLE central Recv: ");
  Serial.println(buf.c_str());
  
  // TODO: store data read into connectionRecord somewhere

  return true;
}

/**

**/

Debug Messages:

Enable Core debug level: Debug on tools menu of Arduino IDE, then put the serial output here 

Recompiled with debug level: Debug, but I don't see any debug messages.

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:8896
load:0x40080400,len:5816
entry 0x400806ac
;�⸮⸮U⸮⸮⸮2-hal-cpu.c:178] setCpuFrequencyMhz(): PLL: 320 / 4 = 80 Mhz, APB: 80000000 Hz
M5StickC initializing...OK
Devices connected: 0
Devices connected: 0
Devices connected: 0
...
Devices connected: 0
Devices connected: 0
Devices connected: 0
5a:62:51:40:dd:99 rssi: -65
Device connected to BLE
BLE central Send: {"id":"8Vej+n4NAutyZlS1ItKDL//RcfqWP/Tq/T/BBBUOsmAF0U+TGBqd2xcMhpfcSOyN1cSGN3znSGguodP+NQ==","v":2,"o":"SG_MOH","mc":"TraceStick V0.1","rs":-65}
BLE central Recv: {"mp":"Pixel 3a XL","v":2,"id":"SElnQHsPCfc/TGSQ6kR0sdTRS07g1KABGxRwcBzAjB6C884CIVB89SIdiwiDMBCbL9wCEKMaDaXOQQKMvg==","o":"SG_MOH"}
Device disconnected from BLE
Devices connected: 0
Devices connected: 0
...
Devices connected: 0
Devices connected: 0
Devices connected: 0
47:aa:ac:6c:32:ce rssi: -55
Device disconnected from BLE
Client connection failed
Devices connected: 65535
Devices connected: 65535
...
Devices connected: 65535
Devices connected: 65535
Devices connected: 65535
7f:46:54:44:b6:57 rssi: -71
Device disconnected from BLE
Client connection failed
Devices connected: 65534
Devices connected: 65534
Devices connected: 65534

@jamesi8086 jamesi8086 changed the title Failed BLE Client connection causes m_connectedCount to decrement BLE Client connection failure causes m_connectedCount to decrement May 7, 2020
@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 9, 2020

On a related note, I'm noticing that this crash happens when we try to restart advertising after the client connection-failure underflow. It does not always happen, suspect there could be some invalid pointer there. I've been leaving the device running 24/7, allow the glitch to happen, and see if it could remain stable even with the underflow glitch.

Update: Ok I don't think this crash is related to the m_connectedCount underflow, seems closer to start/stop advertising repeatedly over long periods of time.

Heres the stack trace:

assertion "head != NULL" failed: file "/home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c", line 214, function: multi_heap_free

0x40092678: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 155
0x400928a9: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 170
0x400efaaf: __assert_func at ../../../.././newlib/libc/stdlib/assert.c line 63
0x40092305: multi_heap_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c line 214
0x40085862: heap_caps_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps.c line 268
0x40085e1d: _free_r at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/syscalls.c line 42
0x40174399: operator delete(void*) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/del_op.cc line 46
0x40174721: operator delete at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/del_opv.cc line 32
0x400d42b5: BLEAdvertising::start() at C:\Users\nexuslord\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\BLE\src\BLEAdvertising.cpp line 227
0x400d23e5: _OT_ProtocolV2::advertising_start() at C:\Users\NEXUSL~1\AppData\Local\Temp\arduino_build_532939\sketch\opentracev2.cpp line 185
0x400d1e6a: loop() at E:\Dropbox\Git Workspace\tracestick\firmware_alpha\alpha/alpha.ino line 88
0x400dfce1: loopTask(void*) at C:\Users\nexuslord\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\cores\esp32\main.cpp line 19
0x4008edb5: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

@jamesi8086 jamesi8086 changed the title BLE Client connection failure causes m_connectedCount to decrement BLE Client connection failure causes m_connectedCount to incorrectly decrement May 10, 2020
@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 17, 2020

bump, this is an active bug

@Alvin1Zhang
Copy link
Collaborator

@me-no-dev

@me-no-dev
Copy link
Member

@yongkimleng can you please enable the core debug to "Debug" and do that same thing again? Unfortunately the person who wrote and maintained the BLE lib in Arduino is no longer around. I had a look at the code, but the flow of events is not very clear to me. Debug will help track what is going on. I imagine that a server event should not be received for a client, it could be an IDF issue, or it could be some miss-configuration in the code.

@chegewara
Copy link
Contributor

Unfortunately the person who wrote and maintained the BLE lib in Arduino is no longer around.

Unfortunately that part of code has been wrote by me. It was intended to support multiple connections.

@me-no-dev
Copy link
Member

@chegewara any clues on a fix?

@chegewara
Copy link
Contributor

I will look into it.

@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 22, 2020

@yongkimleng can you please enable the core debug to "Debug" and do that same thing again? Unfortunately the person who wrote and maintained the BLE lib in Arduino is no longer around. I had a look at the code, but the flow of events is not very clear to me. Debug will help track what is going on. I imagine that a server event should not be received for a client, it could be an IDF issue, or it could be some miss-configuration in the code.

Thats what I did when I took the console output, but did not see any additional output even after enabling debug in the IDE. I can again later tomorrow.

@me-no-dev
Copy link
Member

@yongkimleng can I bother you to try a small change?
https://github.com/espressif/arduino-esp32/blob/master/libraries/BLE/src/BLEServer.cpp#L167-L176

change to:

        case ESP_GATTS_DISCONNECT_EVT: {
            if (m_pServerCallbacks != nullptr) {         // If we have callbacks, call now.
                m_pServerCallbacks->onDisconnect(this);
            }
            if(m_connId == ESP_GATT_IF_NONE){
                return;
            }
            removePeerDevice(param->disconnect.conn_id, false);
            m_connectedCount--;                          // Decrement the number of connected devices count.
            break;
        } // ESP_GATTS_DISCONNECT_EVT

@me-no-dev
Copy link
Member

@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 28, 2020

Applied this fix, leaving it to run overnight

@jamesi8086
Copy link
Contributor Author

Hmm nope the problem still occurs

I think that ESP_GATTS_CONNECT_EVT isn't even triggered in such a case, but ESP_GATTS_DISCONNECT_EVT is.
But looking at the code, I'm not sure where I could add any connected-flag checks.

BLEClient waiting for disconnected
Device disconnected from BLE
Devices connected: 0
Devices connected: 0
Devices connected: 0
Devices connected: 0
Devices connected: 0
Devices connected: 0
5e:fe:cb:54:a4:ee rssi: -59
Connecting to ble device
Device disconnected from BLE
Client connection failed
Devices connected: 65535
Devices connected: 65535
Devices connected: 65535
Devices connected: 65535
Devices connected: 65535
Devices connected: 65535
5e:fe:cb:54:a4:ee rssi: -65
Connecting to ble device
Device connected to BLE
Getting service

@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 28, 2020

Ok I modified removePeerDevice to return true if element is found and removed, so that we can at least do a check. Testing now.

// only decrement if connection is found in map and removed
// sometimes this event triggers w/o a valid connection
if(removePeerDevice(param->disconnect.conn_id, false)) {
        m_connectedCount--;                          // Decrement the number of connected devices count.
}

@chegewara
Copy link
Contributor

chegewara commented May 28, 2020

If there is no ESP_GATTS_CONNECT_EVT then m_connId = ESP_GATT_IF_NONE;
https://github.com/espressif/arduino-esp32/blob/master/libraries/BLE/src/BLEServer.cpp#L32

In that case this should cause return before counter is decreased:

            if(m_connId == ESP_GATT_IF_NONE){
                return;
            }

Please make sure that code is using library you are changing and testing code, maybe add additional log to print m_connId?

@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 28, 2020

I'm not sure on the underlying code, why are we looking at BLEServer? Does every BLEClient create a new instance of BLEServer for every peer?

Yeah its rebuilding when I change the library code, patched a BLEClient destructor bug yesterday as commented in the other thread about it

I tried using log_e or log_v but Verbose mode locks up my M5StickC on boot, and Debug mode does not get me any output from the lower level logs.

@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 28, 2020

I notice that m_connId is set in ESP_GATTS_CONNECT_EVT but its never cleared. So your check would never return ESP_GATT_IF_NONE after the first connection is established.

For each of my devices which has both BLEServer and BLEClients,

  • BLEServer is created once on startup and persists.
  • BLEClient are created and destroyed per connection attempt.

@chegewara
Copy link
Contributor

Yes, it is possible if you are reconnecting to the same BLEClient without creating new instance. As far as i remember when BLEClient is disconnected its instance should be destroyed and free memory.

I'm not sure on the underlying code, why are we looking at BLEServer? Does every BLEClient create a new instance of BLEServer for every peer?

because you are having problem with m_connectedCount which exists only in BLEServer to track how many clients are currently connected.

@jamesi8086
Copy link
Contributor Author

jamesi8086 commented May 28, 2020

Each of my devices function both as a server and client (for p2p). I do recreate BLEClients per connection attempt.

Oddly, the BLEClient's connection/disconnection does trigger the event which occurs in the BLEServer on its own device.

me-no-dev pushed a commit that referenced this issue May 31, 2020
…onnection exists

There is no need to decrement if nothing was removed from removePeerDevice

Reference issue:
#3974
@chegewara
Copy link
Contributor

chegewara commented Jun 1, 2020

To help with such situations there is implemented set of custom gap, gatts and gattc callbacks. Those callbacks are receiving raw evens straight from bluedroid task and can be used to debug or to add special functions not implemented in library:
nkolban/esp32-snippets#711 (comment)

You can filter gatts anc gattc to see only disconnect events that should help you more.

@buxtronix
Copy link
Contributor

I have had similar issues here. I think I have found the root cause - the gattc_if attribute is not stored properly upon client registration. I'm working on a fix that seems to solve it for good.

#4055 is the PR though the change there may need some further updates.

@buxtronix
Copy link
Contributor

On further investigation, it appears that when a client gets a CONNECT or DISCONNECT event, then the callback BLEDevice::gattClientEventHandler gets triggered for every client even if that client was not being (dis)connected at the time.

For example, if I had two clients running, one with esp_gatt_if=4 and the other with esp_gatt_if=5, then if client 4 got disconnected, I would get two disconnect events, one for esp_gatt_if=4 and another for esp_gatt_if=5.

Shouldn't the callback only be called once, for the specific client whose connection status changed?

I have worked around it by matching on conn_id in the BLEClient callback handler, though I suspect that's not the full solution - perhaps the ESP SDK might be buggy..?

@stale
Copy link

stale bot commented Aug 3, 2020

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Aug 3, 2020
@jamesi8086
Copy link
Contributor Author

I did not have time to further improve this, is anyone aware of newer fixes?

@stale
Copy link

stale bot commented Aug 6, 2020

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it openin the future.

@stale stale bot removed the Status: Stale Issue is stale stage (outdated/stuck) label Aug 6, 2020
@stale
Copy link

stale bot commented Oct 6, 2020

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Oct 6, 2020
@stale
Copy link

stale bot commented Oct 20, 2020

[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions.

@stale stale bot closed this as completed Oct 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Stale Issue is stale stage (outdated/stuck)
Projects
None yet
Development

No branches or pull requests

5 participants