
Add Min Initial Latency Selector #3402

Merged
merged 7 commits into from
Feb 21, 2025

Conversation

@leszko (Contributor) commented Feb 20, 2025

Currently, the selection logic (for both Transcoding and AI) works as follows:

  1. Select the best O
  2. Use it for the work
  3. When done, cache it in knownSessions
  4. When a new request arrives, re-use the O from knownSessions
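The four steps above can be sketched as follows. This is a minimal, hypothetical Go sketch; `selector`, `selectOrch`, and `complete` are illustrative names, not the actual go-livepeer API:

```go
package main

import "fmt"

// Sketch of the current flow: once an orchestrator (O) completes work,
// it is cached in knownSessions and reused for the next request, even
// if a better O has since become available.
type selector struct {
	knownSessions   []string // Os that already completed work
	unknownSessions []string // Os never used before
}

func (s *selector) selectOrch() string {
	// knownSessions always win, regardless of current conditions.
	if len(s.knownSessions) > 0 {
		return s.knownSessions[0]
	}
	o := s.unknownSessions[0]
	s.unknownSessions = s.unknownSessions[1:]
	return o
}

func (s *selector) complete(o string) {
	// Cache the session so it is favored for the next request.
	s.knownSessions = append(s.knownSessions, o)
}

func main() {
	s := &selector{unknownSessions: []string{"O1", "O2"}}
	first := s.selectOrch()
	s.complete(first)
	// The second request reuses O1 even though O2 may now be faster.
	fmt.Println(s.selectOrch()) // prints "O1"
}
```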

For Live video, this can be suboptimal, because the cached known session is not always the best Orchestrator to use. For example:

  • During peak time, few Os are available, so the G selects an O with suboptimal latency
  • A few hours later, traffic is low, so most Os are available
  • However, the G still uses the previously selected O

This PR introduces a new Selector which is much simpler than the currently used MinLSSelector. The new Selector doesn't cache anything and doesn't favor known sessions; it always selects the O with the lowest InitialLatency.
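The stateless behavior described above could look roughly like this. A hedged sketch only: the `session` struct and `selectMinInitialLatency` are illustrative stand-ins, not the actual selector added by this PR:

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch of the new stateless selector: no knownSessions cache, no
// favoring of previously used Os; every request just picks the
// orchestrator with the lowest InitialLatency.
type session struct {
	Addr           string
	InitialLatency float64 // e.g. seconds until the first response
}

func selectMinInitialLatency(sessions []session) session {
	sorted := make([]session, len(sessions))
	copy(sorted, sessions)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].InitialLatency < sorted[j].InitialLatency
	})
	return sorted[0]
}

func main() {
	sessions := []session{
		{Addr: "O1", InitialLatency: 0.9},
		{Addr: "O2", InitialLatency: 0.3},
		{Addr: "O3", InitialLatency: 0.5},
	}
	fmt.Println(selectMinInitialLatency(sessions).Addr) // prints "O2"
}
```

Because nothing is cached between calls, a previously slow O is naturally picked up again as soon as its InitialLatency becomes competitive.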

Fixes https://linear.app/livepeer/issue/ENG-2454/startup-time-suboptimal-g-o-selection

@leszko leszko requested review from victorges and mjh1 February 20, 2025 13:45
@github-actions github-actions bot added go Pull requests that update Go code AI Issues and PR related to the AI-video branch. labels Feb 20, 2025

codecov bot commented Feb 20, 2025

Codecov Report

Attention: Patch coverage is 76.19048% with 15 lines in your changes missing coverage. Please review.

Project coverage is 32.15283%. Comparing base (232df3a) to head (aa520dc).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
server/ai_session.go 0.00000% 12 Missing ⚠️
server/broadcast.go 40.00000% 2 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #3402         +/-   ##
===================================================
+ Coverage   32.11405%   32.15283%   +0.03878%     
===================================================
  Files            147         147                 
  Lines          40789       40830         +41     
===================================================
+ Hits           13099       13128         +29     
- Misses         26916       26927         +11     
- Partials         774         775          +1     
Files with missing lines Coverage Δ
server/rpc.go 66.66667% <ø> (ø)
server/selection.go 94.54545% <100.00000%> (+1.02027%) ⬆️
server/broadcast.go 79.54545% <40.00000%> (-0.15557%) ⬇️
server/ai_session.go 2.25806% <0.00000%> (-0.07527%) ⬇️

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 232df3a...aa520dc.


@leszko leszko requested a review from emranemran February 20, 2025 14:37
@ad-astra-video (Collaborator) commented Feb 20, 2025

@rickstaa I think this is better for batch AI jobs as well. knownSessions is causing issues with the distribution of jobs across the network that are hard to overcome. The main complaint is that the first selected Os are chosen every time when requests don't saturate capacity, and there is no way to add randomness to the selection.

@leszko I was exploring something very similar here: baf178e.

A couple of questions:

  • Why not use LatencyScore to sort? Or is InitialLatency populated somewhere else? The unknownSessions list is initially sorted by how fast the Orchestrator responds to the GetOrchestrator request. This may not indicate how fast they are for the AI work, but it starts to update as Orchestrators are selected and work is done. Note that this updated sort will order sessions that have not been selected first.
  • Could we wire in the PerfScores the selection algo uses, if available, as the latency score? Then we could fall back to the LatencyScore if PerfScores are not available.
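The fallback suggested in the second bullet could be sketched like this. Field names (`PerfScore`, `LatencyScore`) are assumptions for illustration, not the actual go-livepeer types:

```go
package main

import "fmt"

// Sketch of a sort key with fallback: prefer a PerfScore when one is
// available, otherwise fall back to the measured LatencyScore.
type orchStats struct {
	Addr         string
	PerfScore    *float64 // nil when no PerfScores are available
	LatencyScore float64
}

func sortKey(o orchStats) float64 {
	if o.PerfScore != nil {
		// Assume a higher PerfScore is better; negate it so that
		// lower keys sort first, matching latency semantics.
		return -*o.PerfScore
	}
	return o.LatencyScore
}

func main() {
	score := 0.8
	withPerf := orchStats{Addr: "O1", PerfScore: &score, LatencyScore: 1.2}
	withoutPerf := orchStats{Addr: "O2", LatencyScore: 0.4}
	fmt.Println(sortKey(withPerf), sortKey(withoutPerf)) // prints "-0.8 0.4"
}
```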

@leszko (Contributor, Author) commented Feb 21, 2025

@rickstaa I think this is better for batch AI jobs as well. knownSessions is causing issues with the distribution of jobs across the network that are hard to overcome. The main complaint is that the first selected Os are chosen every time when requests don't saturate capacity, and there is no way to add randomness to the selection.

IMO it will work better for you as well. You may just need to adapt it to use the LatencyScore 👇

  • Why not use LatencyScore to sort? Or is InitialLatency populated somewhere else? The unknownSessions is initially sorted by how fast the Orchestrator responds to the GetOrchestrator request. This may not indicate how fast they are for the AI work but starts to update as the Orchestrators are selected and work is done.

Short answer: for Live AI Video we don't have a LatencyScore. We don't measure the time from when a segment is sent until the processed segment is received. The reason is that it's not trivial to do, since Live AI Video operates on a stream, not on individual segments. So I'd keep it that way. In any case, this is just sorting; the actual selection should be done in the selection algorithm here.

Note that this updated sort will order sessions that have not been selected first.

Good spot, I updated it to also sort after executing Complete().
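The fix mentioned here could look roughly like the following. A hypothetical sketch: `latSession`, `pool`, and this `Complete` signature are illustrative, not the actual go-livepeer code.

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch: Complete() records the measured latency for a session and
// then re-sorts the pool, so sessions that have never been selected
// don't permanently sit at the front of the list.
type latSession struct {
	Addr           string
	InitialLatency float64
}

type pool struct {
	sessions []latSession
}

func (p *pool) Complete(addr string, measured float64) {
	for i := range p.sessions {
		if p.sessions[i].Addr == addr {
			p.sessions[i].InitialLatency = measured
		}
	}
	// Re-sort so the next selection sees the freshest ordering.
	sort.Slice(p.sessions, func(i, j int) bool {
		return p.sessions[i].InitialLatency < p.sessions[j].InitialLatency
	})
}

func main() {
	p := &pool{sessions: []latSession{{"O1", 0.2}, {"O2", 0.5}}}
	p.Complete("O1", 0.9) // O1 turns out to be slower than first measured
	fmt.Println(p.sessions[0].Addr) // prints "O2"
}
```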

  • Could we wire in the PerfScores the selection algo uses, if available, as the latency score? Then we could fall back to the LatencyScore if PerfScores are not available.

I think we could add LatencyScore into the selection algorithm here, then you could take it into account while selecting the O.

@leszko leszko merged commit 0f9be6a into master Feb 21, 2025
17 of 18 checks passed
@leszko leszko deleted the rafal/min-latency-selector branch February 21, 2025 13:19