feat(sdk): Implement `EventCache` lazy-loading #4632

Hywan · 2025-02-05T16:24:49Z

This is a follow-up of #4594.

The situation

Lazy-loading is now implemented on the Timeline's output. The Timeline outputs a certain maximum number of initial items. When a backwards pagination is run, the Timeline first tries to paginate through in-memory items. When they are all exhausted, the Timeline asks the EventCache to provide more events, which will be transformed into items.

Prior to this patch, the EventCache was reading all the events of a room. This is wrong, and it is the cause of some bugs:

Gaps between events are represented in the EventCache (it relies on the LinkedChunk which has been designed to represent gaps). However, because the Timeline has no way to represent gaps, gaps are removed: only events are kept. The Timeline will do a pagination when it reaches one of its ends, e.g. “I've reached my top, please feed me with more events!”. The Timeline isn't aware that events may be missing inside its set of events, because gaps have been removed. That's why users were seeing missing messages in their Timeline! And it was impossible to recover, except by clearing the cache.
Loading all events for a particular room might be expensive. It can easily result in hundreds of thousands of events. Patches have tried to improve the performance, with dramatic results. With fix(ui): Fix performance of TimelineEventHandler::deduplicate_local_timeline_item #4601, fix(ui): Fix performance of AllRemoteEvents::(in|de)crement_all_timeline_item_index_after #4608, fix(ui): Fix performance of ReadReceiptTimelineUpdate::apply #4612, and fix(sdk): Improve performance of RoomEvents::maybe_apply_new_redaction #4616, the Timeline is able to handle 10'000 events 10 times faster. The Timeline lazy-loading also dramatically reduces the number of items broadcasted to the apps/callers/consumers/subscribers. But still, the EventCache outputs too many events, and the Timeline hasn't been designed for that. The EventCache relies on the LinkedChunk to hold the events. The capacity of a chunk in the LinkedChunk is 128 in the current implementation, which means a chunk can contain up to 128 events. Much better than thousands.
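To illustrate the first bug, here is a minimal sketch of what happens when gaps are dropped while flattening chunks into a list of events. The `Chunk` enum and both functions are hypothetical simplifications written for this example; they are not the SDK's `LinkedChunk` API.

```rust
/// Simplified stand-in for a chunk of the `LinkedChunk` (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Chunk {
    /// A run of events, represented here by their event IDs.
    Items(Vec<&'static str>),
    /// A hole: events exist on the homeserver but have not been fetched yet.
    Gap,
}

/// Flatten chunks into the list of events that a gap-unaware consumer sees.
/// Gaps contribute nothing, so the hole becomes invisible.
fn flatten_without_gaps(chunks: &[Chunk]) -> Vec<&'static str> {
    chunks
        .iter()
        .flat_map(|chunk| match chunk {
            Chunk::Items(events) => events.clone(),
            Chunk::Gap => Vec::new(),
        })
        .collect()
}

/// Count the gaps that the flattened view has silently dropped.
fn hidden_gaps(chunks: &[Chunk]) -> usize {
    chunks.iter().filter(|chunk| matches!(chunk, Chunk::Gap)).count()
}
```

With `[Items($a, $b), Gap, Items($c)]`, the flattened view is `[$a, $b, $c]`: nothing tells the consumer that events may be missing between `$b` and `$c`.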

The solution

This patch is going to change the behaviour of the EventCache when its storage is enabled. Instead of loading all chunks, only the last one will be loaded. That's it. When the Timeline triggers a pagination, the EventCache will first try to load the previous chunk if it exists; otherwise, it means all events have been exhausted and a network pagination must be done. However, if the previous chunk exists, there are 2 cases:

The previous chunk is of kind Items —i.e. it contains events—, then it's all good, the chunk is loaded, and updates will be broadcasted to the Timeline, happy easy-peasy path.
The previous chunk is of kind Gap —i.e. it contains a… gap—, then a pagination is triggered over the network to fill this gap. The code will stay unchanged here, we already have this mechanism: the Gap is replaced by Items, updates will be broadcasted to the Timeline, happy slow path.
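The decision described above can be sketched as a toy model. Every name below (`ChunkContent`, `PaginationOutcome`, `LazyRoomCache`) is hypothetical and only exists for this example; the real code works on the `LinkedChunk` and the store, and filling the gap over the network is out of scope here.

```rust
/// Simplified chunk content (hypothetical stand-in for the real types).
#[derive(Debug, Clone, PartialEq)]
enum ChunkContent {
    Items(Vec<String>),
    Gap,
}

/// What happens when the Timeline asks for a backwards pagination.
#[derive(Debug, PartialEq)]
enum PaginationOutcome {
    /// The previous chunk held events; they were loaded from storage.
    LoadedFromStore(Vec<String>),
    /// The previous chunk is a gap; a network pagination must fill it.
    NetworkToFillGap,
    /// No previous chunk is stored; a network pagination must fetch older events.
    NetworkNoMoreStored,
}

/// Toy model of a room's cache. `stored` is the full stored list of chunks
/// (oldest first); `first_loaded` is the index of the oldest chunk in memory.
struct LazyRoomCache {
    stored: Vec<ChunkContent>,
    first_loaded: usize,
}

impl LazyRoomCache {
    /// Start with only the last chunk loaded (if there is one).
    fn new(stored: Vec<ChunkContent>) -> Self {
        let first_loaded = stored.len().saturating_sub(1);
        Self { stored, first_loaded }
    }

    /// Load the previous chunk if it exists, and report what the caller must do.
    fn paginate_backwards(&mut self) -> PaginationOutcome {
        if self.first_loaded == 0 {
            return PaginationOutcome::NetworkNoMoreStored;
        }
        self.first_loaded -= 1;
        match &self.stored[self.first_loaded] {
            ChunkContent::Items(events) => PaginationOutcome::LoadedFromStore(events.clone()),
            ChunkContent::Gap => PaginationOutcome::NetworkToFillGap,
        }
    }
}
```

In this model the gap is simply marked as loaded; in the behaviour described above, the network response replaces the Gap by Items before updates reach the Timeline.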

This will fix performance issues and bugs (cf. missing messages), but… this isn't the end of the journey.

Apollo e Dafne

Imagine the following scenario:

the user is online, the Matrix client is running, sync is working, new events are coming
the user backgrounds/reduces the app for 2 hours
the user foregrounds the app, sync is working, new events are coming, but a gap has been inserted for certain rooms because not all events have been synced (this is how the sync mechanism works, the first sync response after the app has been foregrounded may contain a limited flag)
(even here, the user will not see the missing messages, and that's a bug, but not the one I want to illustrate)
the user kills the app
the user goes offline
the user re-opens the app
the user opens a room and scrolls to see all its messages
only the messages that came after the gap can be loaded by the EventCache, and so displayed by the Timeline

This behaviour can feel absurd for many reasons:

the user expects to see all their messages
the events are in the EventCache, they are here (!), but we don't want to load them because they are before a gap; otherwise we end up in the current situation, where we load all events regardless of whether gaps are present, and we end up in the missing messages situation.

One solution to this problem is to add a way for the Timeline to represent gaps. I propose to introduce (later, in another patch) a new VirtualTimelineItem of kind Gap. It changes the behaviour of the Timeline greatly: when a VirtualTimelineItem::Gap enters the viewport of a Timeline, the app/caller has to trigger a pagination. Such a virtual timeline item can be rendered as a loader. The Timeline won't trigger a pagination when it reaches one of its ends anymore: this new VirtualTimelineItem of kind Gap will be entirely managed by the Timeline. A new method will allow filling/replacing gaps, which will trigger a pagination from the EventCache. The behaviour is still undefined and it raises many unknowns:

When do we insert a Gap in the Timeline? When we are offline only?
Do we also load events that are before this Gap? It can be a bit disturbing for the user to see their messages, then on top of them a loader, then on top of it some events.
When a gap is met in the EventCache, do we always load this gap and its previous chunk?

Well, it raises many many questions. We need to be extremely careful before digging into this.

An alternative exists though: automatic backwards pagination ✨. The SDK can automatically run backwards paginations to fill all the existing gaps, in parallel with the sync. A correct heuristic must be determined so as not to overload the network or drain the battery of the user's device (e.g. auto-run for the most used rooms, up to n events, stuff like that; these are random ideas). It brings several advantages:

It solves the problem of having to support a gap in the Timeline.
It solves the problem that the user may have missed events, which may include notifications.
It solves the problem of the app badge counter, which, for the moment, gets its value from the homeserver, but which is wrong by design (because the homeserver is blind regarding encrypted rooms, so it misses notifications).
It helps to gather all messages, so users have all their messages, which is good for text search and so on.
In offline mode, the user is likely to be able to load a lot of messages before reaching a gap, if not all their messages for their most frequent rooms. The “bug” becomes a “limitation”: just as Signal or WhatsApp limit the number of participants in a room, the SDK will limit the number of events it can display in offline mode. That limit is likely to be very high if no gap is present. Remember: the goal of the automatic backwards pagination is to fill all the gaps. For regular users, this is going to be fast. Heuristics are useful and required for the power users.
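As an illustration only, one of the "random ideas" above (auto-run for the most used rooms) could look like the sketch below. `RoomStats`, its fields and `rooms_to_backpaginate` are all hypothetical names invented for this example; no such heuristic exists in the SDK yet, and a real one would also need an event budget and network and battery awareness.

```rust
/// Hypothetical per-room statistics; none of these fields come from the SDK.
#[derive(Debug, Clone)]
struct RoomStats {
    room_id: String,
    /// How often the user opened the room recently.
    recent_opens: u32,
    /// Whether the cache currently holds at least one gap for this room.
    has_gap: bool,
}

/// Pick up to `max_rooms` rooms that still have gaps, most used first.
fn rooms_to_backpaginate(rooms: &[RoomStats], max_rooms: usize) -> Vec<&str> {
    // Rooms without gaps have nothing to fill, so they are skipped.
    let mut candidates: Vec<&RoomStats> = rooms.iter().filter(|room| room.has_gap).collect();
    // Sort by descending usage; the sort is stable, so ties keep their input order.
    candidates.sort_by(|a, b| b.recent_opens.cmp(&a.recent_opens));
    candidates
        .into_iter()
        .take(max_rooms)
        .map(|room| room.room_id.as_str())
        .collect()
}
```

Capping the number of rooms is the simplest possible knob; power users with many active rooms are exactly where a smarter heuristic would be needed.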

A note about Apollo e Dafne. First off, le Bernin is one of my favourite artists. Second, this sculpture is fantastic in many regards. The movements. The unique representation of this myth. The Greek inspirations. The details (oh, the sandals…). Third, this patch reminds me of the story of Apollo and Dafne. Apollo is in love with Dafne, and Dafne doesn't like him. Apollo is running after Dafne, and Dafne avoids him, escapes him as much as possible. This story is based on Cupido who shot two arrows: one made of gold to create love, another one made of lead to extinguish love. Every time we get one step closer to perfect offline support, this goal slips away. #RomanticProgramming

Address [meta] EventCache storage #3280

This is a cosmetic patch. Nothing fancy except some variable renamings.

This patch adds `EventCacheStore::load_one_chunk_of_linked_chunk` trait method along with the `ChunkRelativePosition` enum. The idea is to be able to load one chunk in isolation from a relative position, either `ChunkRelativePosition::Last` to load the last chunk, or `ChunkRelativePosition::Before(ChunkIdentifier)` to load the chunk before another one.
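A minimal sketch of the idea, assuming a toy in-memory list of chunks. Only the names `ChunkRelativePosition`, `Last`, `Before` and `ChunkIdentifier` come from the description above; the `ChunkIdentifier` and `StoredChunk` definitions and the free function `load_one_chunk` are hypothetical, and the real trait method's signature is not shown here.

```rust
/// Hypothetical stand-in for the SDK's chunk identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChunkIdentifier(u64);

/// The two relative positions named in the description.
#[derive(Debug, Clone, Copy)]
enum ChunkRelativePosition {
    /// Load the last chunk.
    Last,
    /// Load the chunk before the given one.
    Before(ChunkIdentifier),
}

/// Hypothetical stored chunk: its identifier plus a link to the next chunk.
#[derive(Debug, Clone, PartialEq)]
struct StoredChunk {
    id: ChunkIdentifier,
    next: Option<ChunkIdentifier>,
}

/// Toy in-memory lookup of a single chunk, in isolation from the others.
fn load_one_chunk(
    chunks: &[StoredChunk],
    position: ChunkRelativePosition,
) -> Option<&StoredChunk> {
    match position {
        // The last chunk is the one without a successor.
        ChunkRelativePosition::Last => chunks.iter().find(|chunk| chunk.next.is_none()),
        // The chunk before `id` is the one whose `next` link points to `id`.
        ChunkRelativePosition::Before(id) => {
            chunks.iter().find(|chunk| chunk.next == Some(id))
        }
    }
}
```

Starting from `Last` and repeatedly asking for `Before(current.id)` walks the chunks backwards one at a time, until `None` signals that nothing older is stored.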

This patch renames `RoomEvents::with_initial_chunks` to `with_initial_linked_chunk`. It avoids a confusion between several chunks, like `RawChunk`s, and `LinkedChunk` which represents several `Chunk`s.

…the last chunk. This patch updates `RoomEventCacheState::try_reload_linked_chunk` to load only the last chunk instead of all the chunks.

bnjbvr · 2025-02-06T14:31:50Z

Thanks for the detailed summary. Could it be incorporated as part of internal documentation, somehow?

> That's why users were seeing missing messages in their Timeline! And it was impossible to recover, except by clearing the cache.

I don't think it's true: the timeline would recover once it hits the start of the timeline, by triggering back-paginations. Those back-paginations would then fill the gaps at the event cache layer, and those filled gaps would be propagated to the timeline via new vectordiffs. The result would still be confusing to users, though: you got to the top of the apparent timeline, and now the gaps that you hadn't seen below would be filled with new messages.

> The previous chunk is of kind Gap —i.e. it contains a… gap—, then a pagination is triggered over the network to fill this gap. The code will stay unchanged here, we already have this mechanism: the Gap is replaced by Items, updates will be broadcasted to the Timeline, happy slow path.

I think that in this case, instead of loading the gap and being satisfied with it (and letting a future call to back-pagination resolve it into events), we could block, at this point, and resolve such a gap right now. Then we don't have the problem with missing messages, after an app has been backgrounded for a while.

If the app is offline at this point, we would still need a way to display potential gaps.

There's also another caveat to add, if we wanted to represent gaps: we would have many of them that are spurious. In particular, when restarting the app, we don't reuse the SSS's previous pos, which will result in an initial sync, that may include events we already know about; in this case, we'll get a prev-batch token => a gap, but when resolving it, it may be spurious (aka, we in fact knew about all the events that have been back-paginated). In terms of user experience, this means we'd display a temporary gap in the timeline, that would later disappear and be replaced by… nothing. Slightly confusing, but that's likely the best we can do without heuristics.

(With heuristics, we could decide to show gaps only after the "initial gap", i.e. a gap that would be observed after an initial SSS response.)
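To make the "spurious gap" notion concrete, here is a small sketch, assuming events are identified by plain string IDs. Both function names are hypothetical and are not part of the SDK; the real deduplication happens inside the event cache.

```rust
use std::collections::HashSet;

/// Return the back-paginated events that were not already known.
fn new_events<'a>(known: &HashSet<&'a str>, paginated: &[&'a str]) -> Vec<&'a str> {
    paginated
        .iter()
        .copied()
        .filter(|event_id| !known.contains(event_id))
        .collect()
}

/// A gap is spurious when resolving it brings no event we didn't already have.
fn is_spurious_gap(known: &HashSet<&str>, paginated: &[&str]) -> bool {
    new_events(known, paginated).is_empty()
}
```

If the cache already knows `$a` and `$b` and the back-pagination returns exactly those, the gap resolves to nothing, which is the "temporary gap that later disappears" case.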

Hywan added 4 commits February 5, 2025 17:24

chore(sdk): Change variable names.

b28ba3d

This is a cosmetic patch. Nothing fancy except some variable renamings.

chore(sdk): Rename RoomEvents::with_initial_chunks.

47ee818

This patch renames `RoomEvents::with_initial_chunks` to `with_initial_linked_chunk`. It avoids a confusion between several chunks, like `RawChunk`s, and `LinkedChunk` which represents several `Chunk`s.

task(sdk): RoomEventCacheState::try_reload_linked_chunk loads only …

21fad7f

…the last chunk. This patch updates `RoomEventCacheState::try_reload_linked_chunk` to load only the last chunk instead of all the chunks.

fixup! revamp with load_last_chunk_of_linked_chunk

2635db5
