-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simultanuous use of BLE for heartrate and app connection crashes bluetooth stack #69
Comments
This is difficult. Lars's hack was that he spun up another process with anther BLE stack and that worked for the receiving end. Can you let me know the exact version of bookworm that does not work? I will try to reproduce the issue and find a fix. |
Yeah. I added the environment variable in the fork command that tells Noble that Bleno is also active, and now it works well on Bullseye 64 Bit Lite (latest release).
I did a clean install of Bookworm 64 Bit Lite on a Raspberry Pi 4, 8Gb (same machine I used for the Bullseye test), but EXR won't even connect decently. The process seems to be locked in some waiting loop: the bluetooth buttons are not responsive as soon as BLE HRM is active. |
I did an endurance test, and Bullseye seems to unregister the rowingdata service after 45 minutes. Tomorrow I'll try a fresh install on Bullseye and retrofit the relevant changes to the @abandonware version and see what happens. |
So I should use latest bullseye and the extremely_experimental branch right? |
That should result in a stack that works for about 45 minutes. When you install the extremely_experimental branch on Bookwork, it crashes immediatly. |
Do you use 32 or 64 bit version? |
64 bit. An update from my side. I installed main on Bullseye, and it worked flawlessly for over 100 minutes with BLE HRM active, but no HRM connected. When I connect a HRM, it disconnects within 10 minutes with the following error:
Next session for over half an hour, I get the follwong:
(still on a freshly installed main branch on Bullseye) |
Another observation, still on Bullseye and main: when I switch to PM5 mode, it is shown in the GUI, but EXR doesn't see it even with BLE HRM off. Weird. |
I have also frequent connection drop on the HR side. It is irrespective whether any other device is connected or not... This is very difficult to debug... I am now not even sure if this is really with the bleno or noble actually... This seems more like the hci layer not being able to separate the two. The HR is connected starts sending data and suddenly gets disconnected:
Bluetooth log says that the remote host terminted the connections but that does not seem right. In a few seconds HR reconnects and the this goes on and on. |
Might related to this: bluez/bluez#914 Which in case it is true, it will be difficult to deal with the upstream issue... But also can you please try this: https://github.com/noble/noble?tab=readme-ov-file#bleno-compatibility for me it does not work either but may be you have a different setup :) |
Ok, I think I have some improvement. I tried to use the HR broadcast of my Garmin but that does not work for some reason as it disconnects contionusly in every scenario (even if its standalone). I switch to my polar HRM Now I am able to run things. I created a branch on my fork "ble-fix" but basically this is what has been done: package.json: removed the hci socket package and updated both bleno and noble. In hte HR peripheral I switch to the default noble import and remove Lars's hack. And most importantly I run the npm start with NOBLE_MULTI_ROLE=1 environment variable as per the docs of noble. |
Oke, small trick here (see https://github.com/JaapvanEkris/openrowingmonitor/blob/V1Beta-Architecture_Revision-EXTREMELY_EXPERIMENTAL/app/peripherals/ble/HrmPeripheral.js) it spawns off the HRM peripheral in a seperate proces, and I added it as an environment variable there. I checked via logging process.env.NOBLE_MULTI_ROLE in the child process, and that way it is recieved. I don't know if environment variables are inherited by child processes. |
Creating a separate process is fine I suppose. I am not sure how much this brings to the table as noble and bleno are already async anyway. on the new library you are using, are you sure @stoprocent/node-bluetooth-hci-socket needed as an explicit dependency? I am asking as I was looking at the docs and the code currently does not seem to use the UART version hence it is unnecessary. This is the same on my fork where I removed it so bleno and noble can be updated to their latest version for abandonware. But it makes sense to update a the bleno and noble to a something that is being actually more actively maintained and has more downloads. But still I would try without node-bluetooth-hci-socket package. EDIT: Also I would remove Lars's workaround as I think that is no longer necessary. |
It PROBABLY (not sure here) is to create as much distance between the bleno and noble instances. But I don't think it indeed brings many advantages, but it adds complexity.
Not at all. My approach was to just move forward with newer packages and hope that the issues we had a year ago would be resolved by updates, as they originated upstream. If it works without, lets remove it. As I see it, @stoprocent forked the repo and started fixing stuff. It is 1-on-1 compatible with @abandonware's packages, so it was an easy replacement. As an aside, if you look carefully at the package list, you can see that I'm clearing out several packages as well. Snowpack was also included, which caused over 20 security issues, but is only needed for specific development environments. eslint is frustrating during checking but helpful, but also needs a major cleanup. Babel is in there for some obscure reason I haven't figured out yet. So I'll do some cleaning in the next year.
Yeah. I'm not sure how the hack actually worked, or why it was implemented in the first place. The NOBLE_MULTI_ROLE environment variable was introduced 7 years ago and is part of all documentation, even the original one (@abandonware is already a fork of an abandoned project...). The code for the hrm is about 4 years old. So why Lars impelemented it this way is still a mystery for me.
Completely agree. But I rather do it after fixing the bug and have some decent due dilligence, as the @stoprocent doesn't fix the bug anyway. So your approach to hook onto EXPERIMENTAL is a good one.
Completely agree. It does weird stuff, and you get exposed to all kind of undocumented changes. In this case, the HeartRateManager calls 'const noble = new Noble(new NobleBindings())' which must get a {} as argument with newer versions of noble. But it is completely undocumented as nobody is expected to use internal parameters like this. |
Actually it is documented under "Advanced usage". Just variable names are different. New version requires a Actually I should have done the cleaning of these things year ago when did the rework of the peripherals, but at the time I did not look deep into the inners of ble and the code worked. This is the issue with legacy code noone dares touching it :) Now I went through high level the API of bleno and noble and I am confident that they should be able to work simultaneously without real issues. The only part I don't get is why my Garmin watch get disconnected very frequently while a normal HRM strap works. That is a mystery to me.... Note to self: Could it be that it does not have a battery characteristic and the current OEM code does not tolerate this? But there are no errors nothing.... |
Also, the latest abandonware supports async await that simplifies the noble code. I have not converted the code because first it should work, but this is also something that would make the code simpler. Once this bug is cleared I am happy to do the refactoring. |
You can't fix everything in one go. And indeed, legacy code is scary business. The issue is, @laberning 's fix has been implemented while the NOBLE_MULTI_ROLE environment variable already existed for years. The issue then becomes "Did he copy an outdated workaround thinking it was the only way to do it, or did he deliberatly do this because he knew the NOBLE_MULTI_ROLE wouldn't work for ORM?"
True, but I still get
I agree with your feeling that something deeper in the OS may be at play here: seeing both HRM and FTMS disconnect at the same time suggest that the underlying stack has a hickup or quick reset. But the logs show nothing of that kind. The thing is that the disconnects sometimes come at minutes into the row, and sometimes it takes ages. (over an hour...) Could it be some kind of collission thing below the surface that isn't handled well (i.e. peripheral both wanting to send and recieve at the same time?). I added // The environmental variable below is needed to keep BleNo and NoBle apart, as they both want access to the same hardware but for different roles
// See https://github.com/stoprocent/noble?tab=readme-ov-file#bleno-compatibility-linux-specific
const env = { env: { NOBLE_MULTI_ROLE: 1 } }
const bleHrmProcess = child_process.fork('./app/peripherals/ble/hrm/HrmService.js', env) To app/peripherals/ble/HrmPeripheral.js but that didn't help either. An option is to force this by adding process.env[' NOBLE_MULTI_ROLE'] = 1 before every invoke. But it doesn't do justice to setting an environment variable. I also tried switching to stoprocent 's version, and it delivers some error messages. So it definitely is different. Not sure this is an improvement. It might be that abandonware's version just fails silently and stoprocent is vocal about it. |
Could our dance around the lack of async in earlier BleNo/NoBle code introduce this bug? |
based on the info in the docs, I dont think it should. But it is possible that some order of method invokations via the callbacks are cuasing this. but I dont see this too realistic. For logging I use Anyway, I will give it another go, today evening and refactor to async, clean up the code etc. I will stay with no child process fork, I start the process like this: This is how I run anyways when I was testing, but to be fair, I have not run it for hours, only for 5-10 minutes as testing and not a real excersize.
The workaround especially the part where the new Noble is created essentially just creates the default of everything actually, with the small exception that those additional events are added. I believe it does the same thing as the env flag: i.e. it removes the warnings that are raised when there is a message meant for the peripheral rather than the central. If you look at the code https://github.com/abandonware/noble/blob/49a889e4ffbf33aabbdbcc54a2d487b2aa784f3e/lib/hci-socket/bindings.js#L359 it shows that the patch is almost exactly the same apart from the else block that raises a warning. basically Lars switched this off. So I believe the patch is actually a no-op for our case. What can be done (and I will try this out) is adding the |
Yeah, I'll run those. Set ORM to simulation (the C2 2K sample can be set in a loop :), which is set to a Full Marathon), and let my HR belt report and connect to EXR. Just listening to the paddling stop :)
I'll do the crazy thing then. I'm going to install laberning's main (over 3 years old) on Bullseye, see if the issue is present there as well. Lars has pinned bleno and noble to very specific versions. If this fails as well, it excludes any unmanaged update of the core packages and dependencies. |
Yes that is a good idea! |
Following your conversation here, as I have in general the BLE connection problem with a PI Zero 2W. I currently work with a Raspberry Pi OS Lite 32 Bit Bullseye and the laberning version. This version is the only combination that works for me. I‘m using EXR and a Wahoo Tickr Fit HRM. |
Can I help? I'm not aware of the specifics of the issue but if you can help me reproduce something I can look into this. |
Ok I have looked a bot through the code and the issue and I understand you would like to run both |
Interesting, I have just tried this version on a Raspberry Pi 4 8Gb (using Lars install script, and his package list, installing @abandonware/bleno at 0.5.1-4, @abandonware/bluetooth-hci-socket at 0.5.3-7 and @abandonware/noble at 1.9.2-15 on Nodev16):
|
Thank you, that is extremely kind (as is the help from @Abasz and @antaron1973 in resolving this). |
@JaapvanEkris Some update on whether starting a new process for HRM and Lars's comment I tried to remove the forking and I get a HCI error (command disallowed). Its possible that this was the reason for Lars starting a new process. I think this error is due because the BLE central and peripheral is trying to access the HCI socket at the same time and one command interfere with another and does not work. For testing purposes I added a 1 sec delay in starting the central (i.e. HRM) and that fixes. I will try things without the forked process. I will test if things are within the same thread would make things better or worth. I have made significantly changed the HRM setup (moved to async, added better listener clean up etc.) I will do testing.
Yes this seems to be the issue here. |
Made some changes: https://github.com/Abasz/openrowingmonitor/tree/ble-fix. This seems to be working for me (polar H9 strap) but I did not try a marathon :) I what I had to do is, I changed the order of peripheral setup, first central role with noble (HRM) then add a big delay 10 sec before setting up of the peripheral (FTMS). Delay can most certainly be reduced but I wanted to make sure there are no conflicting message on the startup. Note, this still spins up a new process, but I've tested the version where I dont and it works the same. EDIT: my take on the whole thing is that noble and bleno are sending conflicting messages to the HCI socket like @stoprocent suggested. This version still uses abandonware but I tested stoprocent/bleno and noble and it has the same issue. |
I dug into the cpp addon code and run a lot of tests and I think this is what blocks: I suppose it is blocking becuase the HCI socket is busy (though this is not how thing should work when reading the bytes in the socket...). I will not be able to find a solution for this with my current understanding of this driver. The abandonware version has similar polling code (though there is no while loop, and there is a call to an Async scope, but based on my testing its not the while or the sync jsCallback call that blocks, but specifically the read method which is the same in the abandonware version). Also we should note that with previous kernels and previous version of abandonware this was working. I checked the version of abandonware 0.5.3-7 that was the version in Lars's code and I dont see any obvious change there not to mention that latest abandonware worked for a while. The part I cannot wrap my head around is the fact that everything is working when HRM reconnect (noble) happens before EXR connection (bleno)... I wonder whether the HCI socket supports interrupt based polling (assuming this is the actual problem). |
I am following up on what you guys are trying to do but I don't understand the patch you are trying to do. Can you explain simply? |
Basically what we are trying to achieve is that we broadcast BLE data from our app (such as FTMS BLE Profile) as a peripheral to centrals (mobile phone, smart watches etc), while we also receive data (e.g. from a heart rate strap) as central. Our patch is just a throttling the bleno and noble initial setup and pray :) So I would not say this is an actual fix. There are two types of issues with two types of solutions that we were trying to think of
I tracked this down and this is the blocking line in node: https://github.com/stoprocent/noble/blob/ecb43339054cd3ebc2f5ff3da2600eb67c8e3fe9/lib/hci-socket/hci.js#L578 And I think this is what blocks in the cpp driver: Let me know if you have any questions |
@stoprocent any progress on your side on this (were you able to find ot what we are going?)? |
I did some testing while looking for an alternative BLE library and I was able to achieve some success with this: https://github.com/Emill/node-ble-host/ It was able to support central and peripheral role at the same time, including the handling of turning on and off each functionality. Actually I had a better and more stable connection on the central side (HRM) as it was able to connect to my FR255 flawlessly with which both abandonware and stoprocent noble library struggled. I checked the HCI driver of this library and it seems that it communicates directly to the Bluetooth controller via kernel (so it does not need bluetooth service or hciconfig hci0 up). I suppose this is the reason that it does not work together with the other noble and bleno libraries. But I dont fully understand the driver as it has a lot of similarities with stoprocent and abandonware. It uses libuv polling for changes (similar to abandonware) in the contrary to stoprocent where the polling is seemingly offloaded to a new thread. Also as BUT, it seems that it has not been updated for two years (there is one fork that made updates but mostly aesthetic). So if issue arises later on I expect no updates (or not quickly anyway). The API is completely different than the current, so moving to this library would require a complete refactor of the BLE part. The API is not bad but different very much (its fully callback based but I can turn it into async/await with little effort fort more streamlined usage). I currently lean towards stoprocent's library assuming that he is able to make the necessary changes to allow seamless simultaneous access. Considering support as well as requirement for limited changes. |
That is promissing. I notice that my new Garmin HRM belt is struggling somehow. It doesn't work well via Bluetooth (ANT+ is much more reliable).
Well, Abandonware hasn't seen an update in 2 years either. When I look at the download statistics and open issues, it doesn't look like there are fundamental things going on (and if it ain't broken, why modify). If they interact on a fairly low (kernel) level, I guess API's aren't that dynamic. So it might be a bit more stable hook than higher level packages that move a bit faster.
That sucks, especially since that kind of development isn't my strong point. I'll help where I can, and having a good example will help me a lot, but this stuff is a bit closer to I/O management than I' used to.
Let's give @stoprocent until the end of next week to see if he can come up with a solution (or at least a prognosis of a resolution time). By then he had at least a month to see what the possibilities are for Bleno/Noble to resolve this. If he can't give us a solution or estimate, I am inclined to assume it is a dead end for us, despite being a a very promising and potentially better maintained package. I think ORM is better served with making a decision one way or the other, otherwise this issue keeps dragging on, potentially blocking other development as well. While such a major change remains undecided, it will block any functional development in the BLE area (for example, adding Force Curves to the PM5 interface, receiving workouts, etc.). |
That is fine, I was going to suggest I take on the task of redoing it anyway :) I would start with moving FTMS (and HRM) and then share so it can be tested. Its the testing that takes the longest actually. I would do the changes on V1Beta-Architecture_Revision-EXPERIMENTAL_BLE-Fix branch. Actually since there is a good separation between the controller logic (Peripheral Manager) and the service implementation the need changes are well I isolated.
I fully agree.
I fully agree, and also everything is source control so rolling back is not that issue. EDIT: The only question that remains is what should be done to the merging of the Architecture revision into main. As I said above, BLE rework does not interfere because of the controller abstraction, but clearly there is an error, and I think the current BLE fix provides a usable solution (yes one cannot change peripherals on the fly but needs to start with the correct settings) So I would actually merge that into main too with a big README update. |
I can help out here. I have tested a lot of BLE interaction with EXR and with a dedicated test-RPi.
I agree with that one. Could you start a new PR with the V1Beta-Architecture_Revision as a base? The merge of V1Beta-Architecture_Revision-EXPERIMENTAL into the V1Beta-Architecture_Revision had a weird knock-on effect that the V1Beta-Architecture_Revision-EXPERIMENTAL_BLE-Fix branch now has merge conflicts. Not problematic, but the package-lock.json has 331 conflics, which is quite a hassle to resolve. |
What I would do with the lock.json is that I would merge with confilcts then create a commit where I would regenerate it (delete lock.json and the whole node folder and hit npm install. |
I've merged the branch succesfully, including some small tweaks. I've tested it today and Bluetooth FTMS has proven to be stable. I'll try the PM5 emulator tomorrow. My HRM belt isn't that stable itself, so testing might prove difficult, but I'll do some tests on wednesday until sunday to see if it behaves well. If that is succesfull, I'll squash-merge the branches into main. Then at least we have a working version by the weekend. From there on, we are in the position to look for a long term solution and implement it, be it stoprocent's library or another. There were other tasks on my personal todo-list I also wanted to release (a GUI extension for setting targets and cleaning up the total package mess that Babel and eslint are creating), but we'll bump them to the next version as well. It will probably line up with the final version of the bluetooth implementation, making a nice interim-release. These things aren't perfect, but they will do for now. Rome wasn't build in a day either. If you look at the colossal changes we made under the hood, it is surprising we still got a working app. |
Is it the same issue? After a clean raspian x64 & rowingmonitor install can be found by the apps like kinomap and exr. After a reboot of the system the bluetooth is gone? Also the bluetooth service is dead ` Feb 08 14:30:28 rowingmonitor bluetoothd[801]: Starting SDP server `openrowingmonitor.service - Open Rowing Monitor Feb 08 14:27:22 rowingmonitor npm[682]: > [email protected] start |
Could be. Which branch did you install? Looking at the logs you installed the 'main' branch. Is that correct? I planned to merge the updated (fixed) branch into main this evening. |
yes its the main branch. |
# New functionality in 0.9.5 - Added **FIT-File support**: you an now automatically generate a FIT-file after a rowing session, which alows for a more detailed reporting than the tcx-format, and is commonly accepted by most platforms - **Introduction of the session manager**, which provides support for intervals, splits, rest intervals and spontanuous pauses in the session and also adds these to the FIT, tcx and RowingData recordings. Please note, setting predetermined intervals and splits in a user friendly way (via PM5 and webinterface) is still a ToDo that is intended for 1.0.0. - **Improvement of Magnetic rower support**: the new session manager makes sure that the session is nicely stopped, even when the flywheel has stopped quite abruptly before pause timeouts have time to kick in. This is the case on some magnetic rowers which have an extreme high drag, resulting in very short spin down times of their flywheel. # Bugfixes and robustness improvements in 0.9.5 - **Improvement of the architecture**: we cleaned up the old architecture and went to a more message bus structure where clients are responsible for listening to the datatransmissions they are interested in. See [the architecture description](Architecture.md) for a deep-dive of the implementation. Key benefit is that this is more maintainable as it allows serving data more easily to totally different clients (webGUI, recorders and BLE/ANT+) with totally different needs, making future enhancements easier. - **Improvement of Bluetooth stability**: we moved away from abandonware's BLE NoBle/BleNo implementation and moved to stoprocent's implementation, as that package is better maintained and works better with newer installs of BlueZ, which should fix some issues on Raspberry Pi Bookworm. Unfortunatly, none of the NoBle/BleNo descendents are immune to some specific BlueZ issues (see known issues). - **Performance improvement of the TS estimator**, further reducing CPU load, which significantly improves accuracy of the measurements and metrics as the Linux kernel has an easier job keeping the time accurate. - **Removed a lot of memory leaks**, although only being problematic in large simulations (i.e. over 3000K), we want to keep our code to behave nice - **Improved robustness of the stroke detection algorithm** - **Validation of the engine against a PM5 for over 3000KM**, where the deviation is a maximum of 0.1% # Known issues in 0.9.5 - **Bluetooth Heartrate can't be switched dynamically**: due to some underlying changes in the OS, BLE heartrate monitors can't be activated through the GUI without crashing the BLE metrics broadcast (see [the description of issue 69](#69)). As this is an issue in the OS, and all previous versions of OpenRowingMonitor are also affected by this issue. Version 0.9.5 has a workaround implemented that mitigates this at startup. So configuring the use of a BLE heartrate monitor in the config file should work.
The workaround is now in the main branch. I've tested it over 250K in a live environment, both with FTMS and PM5 Mode, and it works. I wasn't able to test it with my HRM belt, as itself has an unstable BLE signal. As stoprocent hasn't responded in over a week, I'm assuming no fix will come from his branch of NoBle/Bleno. So for the long term, I guess we have to move to another package? |
The reboot problem seem to be fixed. nice job! |
I will create a new branch for the new ble. I will start with the full ftms rowing implementation. To set up a example and all that and we will see how it goes. I would say end of Feb fir this. Once that works we can start with the other ble profiles (ftms is probably the most used anyway) |
Sounds like a great plan to me. |
The webbluetooth library seems to be having an update that would resolve the issue I had in the past. So I will do another attempt with using this in combination with bleno. This would be much easier as no re implementing the whole peripheral part would be required. This does not support peripheral so some reliance would remain, but would provide a quick fix for now. thegecko/webbluetooth#289 (comment) I mean, we may still move to the other BLE library (that handles central and peripheral at the same time as one library) as a long term plan |
We could, but I have the impression that the current workaround is sufficient for us to buy some decent time to get a good fix in place. I already made a 0.9.6 branch, and I plan to release some features pretty soon in that branch. But I am not in a rush here to release it, as the 0.9.5 versions seems to be working well. If webbluetooth would provide a good solution, that would be fine. But if there are better (more widely used) packages, I rather take the time to do it right in one go. |
I agree with you in terms of a more robust solution. I want to push the webbluetooth trough (even though it may not be the final solution) as testing to see what works and what does not. If it fails than the choice seems to be simple :) if it works we have a choice :) If it works this is very easy to implement, it has a very good API. Its a shame it only supports central role :( |
With the newer verions of Raspberry Pi OS seem to introduce a problem: the Bluetooth stack seems to be slightly modified, and this blocks simultanuous use of Bluetooth for recieving Heartrate and sending metrics to recieving apps. Without HRM via Bluetooth it works well. But as soon as the HRM via Bluetooth is enabled, it crashes.
Several have observed this (see also laberning#174) and upgrading the bluetooth npm packages doesn't seem to help. It is an issue on both Bullseye and Bookworm, although I had a working version on Bullseye at one moment (but can't get it to reproduce after an install).
I upgraded from @ababdonware to @soprocent npm libs, and looked at https://github.com/noble/noble#bleno-compatibility but I can't get it to work. Lars has put a hack in the HeartRateManager.js to get around it, but it seems to be failing.
I have stored my progress in the https://github.com/JaapvanEkris/openrowingmonitor/tree/V1Beta-Architecture_Revision-EXTREMELY_EXPERIMENTAL branch, but after banging my head against walls for a day, I think I need someone else to have a look to help fix this. Anyone (@Abasz)?
The text was updated successfully, but these errors were encountered: