[BUG] DP scheduler introducing 10ms+ latency to AEC capture streams #8640
Can we have an ASCII-art view of the pipeline and of the "much worse" latency result? Using the AEC in a low-latency setting would require changing the tick to 10ms, so I'm not sure the comparison is really with a 1ms tick in all cases. I wonder if the issue is not really related to the DP component, but rather that it's followed by an LL component that works one chunk at a time. There's really no way for an LL component to do more than process one chunk with the least amount of MCPS; that's what they are designed for. But there might be a way out of this: IIRC there is an option to make host copiers go faster and dump all the information. @mmaka1 can you chime in?
I am not sure how "The DP scheduler is introducing unacceptable latency to the stream". As I understand it, we are talking about the latency of the path from the AEC output to the host, and IMO it is all about the configuration of any additional buffering inside that part, not about the DP scheduler. If there is no extra buffering, the first samples (1ms, 2ms, 10ms - depending on how the LL period is configured) are delivered to the HD/A Host Input DMA buffer in the next LL cycle and there is no extra latency. If there is an extra buffering threshold, say N ms, that drives the transfers from the DSP through the SW stack up to the application, then yes, the LL period could be configured to deliver N ms of output in one tick; otherwise there would be extra buffering time to collect the N ms required to start the transfer. An alternative to increasing the LL period, as @plbossart mentioned, is to connect the Copier component directly between the output of the AEC and the Host Input DMA and configure it in the so-called "fast mode". Then the entire output would be transferred to the DMA buffer in one LL tick, right after the AEC finishes processing.
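The buffering argument above can be reduced to a small back-of-envelope model (an illustrative sketch with assumed units, not SOF code): transfers gated by an N ms threshold can only fire on an LL tick boundary, so the wait gets rounded up to a whole number of ticks.

```c
#include <assert.h>

/* Sketch: if host transfers are gated by a threshold_ms buffering
 * threshold and each LL tick delivers ll_period_ms of output, the
 * transfer starts on the first tick boundary at or past the
 * threshold. The return value is the extra wait beyond the
 * threshold itself. */
int extra_buffering_ms(int threshold_ms, int ll_period_ms)
{
	/* ticks needed to accumulate the threshold (ceiling division) */
	int ticks = (threshold_ms + ll_period_ms - 1) / ll_period_ms;

	/* extra wait beyond the threshold itself */
	return ticks * ll_period_ms - threshold_ms;
}
```

With a 10ms threshold, both a 1ms and a 10ms LL period divide it evenly and add no extra wait; a period that does not divide the threshold (say 3ms) adds the rounding remainder on top.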
@andyross I need more details to support you with this. The mtl-007 branch seems to be working fine using the AEC_MOCK code. It also contains a couple of workarounds, like the channel-count one you pointed out in #8639. To simplify debugging, can we focus on bringing AEC up using 007 as a base? Once AEC is stable there, it would be easier to go forward with your changes.
Unfortunately I'm travelling this week without the device, so I won't be able to test until Friday. I was pretty sure I saw exactly the same behavior in the drop branch too. If you're sure it's working there, look to the fact that the new component code will consume bytes as they arrive in process (this is a feature, it front-loads the conversion overhead) and not all at once from the buffer. Is there a rule that a 10ms DP component isn't allowed to take 1ms input, maybe? |
@andyross
That's totally not true. Keep in mind that the DP scheduler works asynchronously to LL, in a thread. It is possible that something happens between is_ready and process: is_ready says "go", and before DP gets CPU time, another LL iteration adds more data to some of the inputs - but that data belongs to the next cycle and should be processed later.
AEC should process 10ms from each input (0..9) and that's it, leaving newer data for the next cycle.
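The rule described above can be sketched in a few lines (illustrative names and frame counts, not the actual SOF component API): consume exactly one block per process() invocation and leave everything newer queued, even if extra LL cycles ran between is_ready() and process().

```c
#include <assert.h>

#define BLOCK_FRAMES 480 /* one 10ms block at an assumed 48kHz rate */

/* Minimal model of a DP input queue; avail_frames is whatever the LL
 * producer has queued so far - possibly more than one block if extra
 * LL cycles ran between is_ready() and process(). */
struct dp_input {
	int avail_frames;
};

/* Consume exactly one block per invocation and no more; newer data
 * stays queued for the next DP cycle. Returns frames consumed. */
int dp_process_one_block(struct dp_input *in)
{
	if (in->avail_frames < BLOCK_FRAMES)
		return 0; /* not a full block yet: do nothing this cycle */

	in->avail_frames -= BLOCK_FRAMES;
	return BLOCK_FRAMES;
}
```

Even if 1000 frames have piled up, a single call takes only 480; the remainder is the next cycle's business.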
After topology changes and FW fixes (all on the 007 branch) I was able to remove all WAs from the AEC code. All changes are here: https://github.com/thesofproject/sof/commits/mtl-007-drop-stable/ The number of channels is now taken from the topology, and all buffer sizes are set properly. I strongly advise using code from the 007 branch to enable AEC. It would be way easier for me to support you, and you'd be less likely to run into problems that may have already been solved on 007. Once we have a stable base for test and reference, you'll be able to safely proceed with your changes (720c38a)
Indeed. There's a comment from me to this effect in the PR too (#8571 (comment)). Nonetheless this is actually needed on synchronous pipelines, because (at least on mt8195) the startup of the capture pipeline lags the prepare() call by 20-30ms; we end up with too much data buffered and hit a latency bug. The solution needs to be to flush the reference data on capture startup only.
That's a non-starter for enabling AEC on other platforms though, which I'm trying to do simultaneously. The whole reason we're in this mess is that AEC has traditionally been enabled only on device branches like this, where it bitrots. It's never worked on main, and I'm trying to fix that. I promise I'll continue to use this as a validation reference, but at the end of the day the only thing that can merge on main is a unified component that works everywhere. |
FWIW, I can confirm that the most recent code in mtl-007-drop-stable does not show the backlog I was measuring earlier. When capture is running, the remaining bytes in the mic input buffer are never more than 40 samples (i.e. <1ms, which is the maximum precision available in the LL scheduler on the other core anyway) more than needed to complete one 10ms iteration. I'll leave this bug open for comments until I can find the relevant fixes and get them merged into the main branch PR, but this looks good to me. |
So, I reverse engineered the recent changes in mtl-007-drop-stable to isolate what it was that fixed the buffer backups. See notes in #8639. I can reproduce clean streaming in both my tree and the drop. Essentially this is a duplicate bug, the root cause was pipeline state, not DP scheduling. I'm happy to close this, but will leave it open for comment until folks start to return to work after the holidays. |
Close? |
Good point, thanks for the ping |
The DP scheduler is introducing unacceptable latency to the stream. When used with Google AEC (see PR #8571), the capture pipeline is delaying delivery of output data by at least 10ms.
The architecture of the DP component is that it accumulates data and runs analysis on blocks 10ms at a time, but it's being fed from a capture pipeline running at 1ms. There is a warning condition in the new code[1] that detects when it would need to run the AEC analysis twice in one process() call (which it can't do, because it doesn't have output buffer space) because too much data is waiting in the input buffers (both its own and the upstream sources). See here:
sof/src/audio/google/google_rtc_audio_processing.c, line 816 (commit 720c38a)
This is a "comp_dbg()" currently, because if I enable it, process() ends up spamming the log at 100 Hz with the warning: at steady state there is always 20ms+ worth of data to analyze in process().
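The backlog condition being logged amounts to the following check (a sketch with illustrative names, not the actual code at line 816): process() can run the AEC analysis only once per call, so more than one full block already queued means the component is falling behind.

```c
#include <assert.h>

/* How many complete analysis blocks are sitting in the input queue. */
int queued_blocks(int queued_frames, int block_frames)
{
	return queued_frames / block_frames;
}

/* Warn when a second full block is already waiting: process() can
 * only consume one block per call, so this is a steady backlog. */
int backlog_warning(int queued_frames, int block_frames)
{
	return queued_blocks(queued_frames, block_frames) >= 2;
}
```

At steady state the report of "20ms+ of data" for a 10ms block size is exactly the `>= 2` case firing on every call.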
Discussion in #8621 implies that this is the DP queueing code trying to rate limit the output, which I'm having a hard time understanding. The component (simply by virtue of running slower) provides lots of output buffering already.
Basically: we can't be doing this. The requirement has to be that as soon as there is output data available from the component, it gets flushed downstream to the host ALSA stream as fast as possible. Right now AEC is seeing much worse latency with DP than it does under a synchronous 10ms pipeline.
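The complaint above boils down to a simple latency budget (a back-of-envelope sketch under assumed terms, not a measurement of SOF behavior): a block-based DP component inherently waits to fill one block, the downstream LL path needs one tick to move data toward the host, and any output-side holdback in the DP queue is pure additional latency on top.

```c
#include <assert.h>

/* Worst-case capture latency through a block-based DP component:
 * block accumulation + one LL tick to forward output + any extra
 * holdback the output queue imposes before releasing data. */
int worst_case_latency_ms(int block_ms, int ll_tick_ms, int holdback_ms)
{
	return block_ms + ll_tick_ms + holdback_ms;
}
```

The argument here is that the third term should be zero: with a 10ms block and 1ms tick the floor is ~11ms, and any queue-side rate limiting pushes the stream well past that.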
[1] You can also see it in the unadulterated mtl-007-drop-stable branch by e.g. logging the fill status of the input buffers in process().