Skip to content

4.x: Occasionally force cache cleanup in PreparedStatementCachingIT #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: scylla-4.x
Choose a base branch
from

Conversation

Bouncheck
Copy link
Collaborator

@Bouncheck Bouncheck commented Apr 8, 2025

This integration test from time to time fails on one of its methods. Evidence points to the cache itself being the culprit, as there is no guarantee that elements will be removed right away. This leads to removal event not being fired and timing out on wait for it. Calling cache.cleanUp() seems to greatly improve stability of this test.

Fixes #476 and #477

This integration test periodically fails on one of its methods.
Evidence points to the cache itself being the culprit, as there is
no guarantee that elements will be removed right away.
This leads to removal event not being fired and timing out on wait for it.
Calling `cache.cleanUp()` seems to greatly improve stability of this test.

Fixes scylladb#476 and scylladb#477
@Bouncheck Bouncheck force-pushed the scylla-4.x-PreparedStatementCachingIT-cache-cleanup branch from ed59669 to 02b72d6 Compare April 8, 2025 09:11
@dkropachev
Copy link
Collaborator

TBH I don't understand this fix, test checks if certain events where fired, it waits 10 seconds for event to be executed.
Now, your fix is just cleaning up the cache, which have nothing to do with events test is looking for.

@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Apr 8, 2025

TBH I don't understand this fix, test checks if certain events where fired, it waits 10 seconds for event to be executed. Now, your fix is just cleaning up the cache, which have nothing to do with events test is looking for.

The event is tied to the removal from cache.
This test sets up custom cache that on element removal executes a callback (see buildCacheRemoveCallback) that in turn fires the PreparedStatementRemovalEvent. If the cache does not perform the removal the event won't be fired either. If it's not fired the listener won't see it and trigger the latch which is part of the condition that the test is waiting on.

@dkropachev dkropachev force-pushed the scylla-4.x branch 2 times, most recently from 0666b08 to de10f2f Compare April 8, 2025 16:53
@dkropachev
Copy link
Collaborator

TBH I don't understand this fix, test checks if certain events where fired, it waits 10 seconds for event to be executed. Now, your fix is just cleaning up the cache, which have nothing to do with events test is looking for.

The event is tied to the removal from cache. This test sets up custom cache that on element removal executes a callback (see buildCacheRemoveCallback) that in turn fires the PreparedStatementRemovalEvent. If the cache does not perform the removal the event won't be fired either. If it's not fired the listener won't see it and trigger the latch which is part of the condition that the test is waiting on.

Just for the record:
We have agreed on the fact that this fix is not relevant because getCache().cleanUp() is already called.
We need to find source of this instability.
Here is what should be done:

  1. Drop global root log level for this particular test
  2. Swap assertions on typeChangeEventLatch and preparedStmtCacheRemoveLatch: typeChangeEventLatch should go first, it will help us to see if issue in event not arriving or in how cache works.

@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Apr 10, 2025

Just for the record: We have agreed on the fact that this fix is not relevant because getCache().cleanUp() is already called.

In theory yes, but in practice I don't see it being called on failing runs. After adding prints it seems like this place is never reached (only when failing). Cache is capable of returning a view map of size 2 but when forced to print elements it prints nothing as if it was empty. While cache.size() is expected to be approximate I don't know if it's the same with view map.
The place where 2 prepared statements are cached is always reached.

When running with debugger the test suddenly becomes surprisingly stable, so I was not able to learn anything from that yet.

  1. Swap assertions on typeChangeEventLatch and preparedStmtCacheRemoveLatch: typeChangeEventLatch should go first, it will help us to see if issue in event not arriving or in how cache works.

After the swap the test still fails on prepared statement latch.

Maybe another option is that on some runs those prepared statements get somehow garbage collected, but I was under the impression (maybe wrong) that this test avoids that.

@Bouncheck Bouncheck marked this pull request as draft April 10, 2025 13:30
@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Apr 10, 2025

Marking as draft since I'm getting inconsistent results.

After creating WeakReferences to the prepared statements and running extra cache.cleanUp() after failure I do get removal callbacks with RemovalCause.COLLECTED (which means "The entry was removed automatically because its key or value was garbage-collected.") however I can still access the prepared statements under weak references.
From what I see cache should be using strong keys and weak values.
Normal removals in regular passing runs have RemovalCause.EXPLICIT.

@dkropachev
Copy link
Collaborator

I have made some tests, this thing happens because there are no records in the cache.
Next step would be to check how assertThat(getPreparedCacheSize(session)).isEqualTo(2) goes through.
Anyways, i think reason why it is failing is the same as at #514, i.e. stmt is not getting registered on the cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4.x: PreparedStatementCachingIT.should_invalidate_cache_entry_on_tuple_udt_change_result_set can fail
2 participants