
Add Min Initial Latency Selector #3402

Merged
merged 7 commits into from
Feb 21, 2025

Conversation

@leszko (Contributor) commented Feb 20, 2025

Currently, the selection logic (for both Transcoding and AI) works as follows:

  1. Select the best O
  2. Use it for the work
  3. When done, cache it in knownSessions
  4. When a new request arrives, re-use the O from knownSessions
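The four steps above can be sketched as follows. This is a minimal, hypothetical Go sketch; `selector`, `selectOrch`, and `complete` are illustrative names, not the actual go-livepeer API:

```go
package main

import "fmt"

// Sketch of the current flow: once an orchestrator (O) completes work,
// it is cached in knownSessions and reused for the next request, even
// if a better O has since become available.
type selector struct {
	knownSessions   []string // Os that already completed work
	unknownSessions []string // Os never used before
}

func (s *selector) selectOrch() string {
	// knownSessions always win, regardless of current conditions.
	if len(s.knownSessions) > 0 {
		return s.knownSessions[0]
	}
	o := s.unknownSessions[0]
	s.unknownSessions = s.unknownSessions[1:]
	return o
}

func (s *selector) complete(o string) {
	// Cache the session so it is favored for the next request.
	s.knownSessions = append(s.knownSessions, o)
}

func main() {
	s := &selector{unknownSessions: []string{"O1", "O2"}}
	first := s.selectOrch()
	s.complete(first)
	// The second request reuses O1 even though O2 may now be faster.
	fmt.Println(s.selectOrch()) // prints "O1"
}
```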

For Live video, this can be suboptimal, because the cached known session is not always the best Orchestrator to use. For example:

  • During peak time, few Os are available, so the G selects an O with suboptimal latency
  • A few hours later, traffic is low, so most Os are available
  • However, the G still uses the previously selected O

This PR introduces a new Selector which is much simpler than the currently used MinLSSelector. The new Selector doesn't cache anything and doesn't favor known sessions; it always selects the O with the lowest InitialLatency.
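The stateless behavior described above could look roughly like this. A hedged sketch only: the `session` struct and `selectMinInitialLatency` are illustrative stand-ins, not the actual selector added by this PR:

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch of the new stateless selector: no knownSessions cache, no
// favoring of previously used Os; every request just picks the
// orchestrator with the lowest InitialLatency.
type session struct {
	Addr           string
	InitialLatency float64 // e.g. seconds until the first response
}

func selectMinInitialLatency(sessions []session) session {
	sorted := make([]session, len(sessions))
	copy(sorted, sessions)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].InitialLatency < sorted[j].InitialLatency
	})
	return sorted[0]
}

func main() {
	sessions := []session{
		{Addr: "O1", InitialLatency: 0.9},
		{Addr: "O2", InitialLatency: 0.3},
		{Addr: "O3", InitialLatency: 0.5},
	}
	fmt.Println(selectMinInitialLatency(sessions).Addr) // prints "O2"
}
```

Because nothing is cached between calls, a previously slow O is naturally picked up again as soon as its InitialLatency becomes competitive.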

Fixes https://linear.app/livepeer/issue/ENG-2454/startup-time-suboptimal-g-o-selection

@leszko leszko requested review from victorges and mjh1 February 20, 2025 13:45
@github-actions github-actions bot added go Pull requests that update Go code AI Issues and PR related to the AI-video branch. labels Feb 20, 2025

codecov bot commented Feb 20, 2025

Codecov Report

Attention: Patch coverage is 76.19048% with 15 lines in your changes missing coverage. Please review.

Project coverage is 32.15283%. Comparing base (232df3a) to head (aa520dc).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
server/ai_session.go 0.00000% 12 Missing ⚠️
server/broadcast.go 40.00000% 2 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #3402         +/-   ##
===================================================
+ Coverage   32.11405%   32.15283%   +0.03878%     
===================================================
  Files            147         147                 
  Lines          40789       40830         +41     
===================================================
+ Hits           13099       13128         +29     
- Misses         26916       26927         +11     
- Partials         774         775          +1     
Files with missing lines Coverage Δ
server/rpc.go 66.66667% <ø> (ø)
server/selection.go 94.54545% <100.00000%> (+1.02027%) ⬆️
server/broadcast.go 79.54545% <40.00000%> (-0.15557%) ⬇️
server/ai_session.go 2.25806% <0.00000%> (-0.07527%) ⬇️

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 232df3a...aa520dc.


@leszko leszko requested a review from emranemran February 20, 2025 14:37
@ad-astra-video (Collaborator) commented Feb 20, 2025

@rickstaa I think this is better for batch AI jobs as well. knownSessions is causing issues with the distribution of jobs across the network that are hard to overcome. The main complaint is that the first selected Os are chosen every time when requests don't saturate capacity, and there is no way to add randomness to the selection.

@leszko I was exploring something very similar here: baf178e.

A couple of questions:

  • Why not use LatencyScore to sort? Or is InitialLatency populated somewhere else? The unknownSessions list is initially sorted by how fast the Orchestrator responds to the GetOrchestrator request. This may not indicate how fast they are for the AI work, but it starts to update as Orchestrators are selected and work is done. Note that this updated sort will order sessions that have not been selected first.
  • Could we wire in the PerfScores the selection algo uses, if available, as the latency score? Then we could fall back to the LatencyScore if PerfScores are not available.
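The fallback suggested in the second bullet could be sketched like this. Field names (`PerfScore`, `LatencyScore`) are assumptions for illustration, not the actual go-livepeer types:

```go
package main

import "fmt"

// Sketch of a sort key with fallback: prefer a PerfScore when one is
// available, otherwise fall back to the measured LatencyScore.
type orchStats struct {
	Addr         string
	PerfScore    *float64 // nil when no PerfScores are available
	LatencyScore float64
}

func sortKey(o orchStats) float64 {
	if o.PerfScore != nil {
		// Assume a higher PerfScore is better; negate it so that
		// lower keys sort first, matching latency semantics.
		return -*o.PerfScore
	}
	return o.LatencyScore
}

func main() {
	score := 0.8
	withPerf := orchStats{Addr: "O1", PerfScore: &score, LatencyScore: 1.2}
	withoutPerf := orchStats{Addr: "O2", LatencyScore: 0.4}
	fmt.Println(sortKey(withPerf), sortKey(withoutPerf)) // prints "-0.8 0.4"
}
```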

@leszko (Contributor, Author) commented Feb 21, 2025

@rickstaa I think this is better for batch AI jobs as well. knownSessions is causing issues with the distribution of jobs across the network that are hard to overcome. The main complaint is that the first selected Os are chosen every time when requests don't saturate capacity, and there is no way to add randomness to the selection.

IMO it will work better for you as well. You may just need to adapt it to use the LatencyScore 👇

  • Why not use LatencyScore to sort? Or is InitialLatency populated somewhere else? The unknownSessions is initially sorted by how fast the Orchestrator responds to the GetOrchestrator request. This may not indicate how fast they are for the AI work but starts to update as the Orchestrators are selected and work is done.

Short answer: for Live AI Video we don't have a LatencyScore. We don't measure the time from when a segment is sent until the processed segment is received. The reason is that it's not trivial to do, since Live AI Video operates on a stream, not on individual segments. So I'd keep it that way. In any case, this is just sorting; the actual selection should be done in the selection algorithm here.

Note that this updated sort will order sessions that have not been selected first.

Good spot, I updated it to also sort after executing Complete().
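The fix mentioned here could look roughly like the following. A hypothetical sketch: `latSession`, `pool`, and this `Complete` signature are illustrative, not the actual go-livepeer code.

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch: Complete() records the measured latency for a session and
// then re-sorts the pool, so sessions that have never been selected
// don't permanently sit at the front of the list.
type latSession struct {
	Addr           string
	InitialLatency float64
}

type pool struct {
	sessions []latSession
}

func (p *pool) Complete(addr string, measured float64) {
	for i := range p.sessions {
		if p.sessions[i].Addr == addr {
			p.sessions[i].InitialLatency = measured
		}
	}
	// Re-sort so the next selection sees the freshest ordering.
	sort.Slice(p.sessions, func(i, j int) bool {
		return p.sessions[i].InitialLatency < p.sessions[j].InitialLatency
	})
}

func main() {
	p := &pool{sessions: []latSession{{"O1", 0.2}, {"O2", 0.5}}}
	p.Complete("O1", 0.9) // O1 turns out to be slower than first measured
	fmt.Println(p.sessions[0].Addr) // prints "O2"
}
```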

  • Could we wire in the PerfScores the selection algo uses, if available, as the latency score? Then we could fall back to the LatencyScore if PerfScores are not available.

I think we could add LatencyScore into the selection algorithm here, then you could take it into account while selecting the O.

@leszko leszko merged commit 0f9be6a into master Feb 21, 2025
17 of 18 checks passed
@leszko leszko deleted the rafal/min-latency-selector branch February 21, 2025 13:19