
Improving DataLayerService read performance #10

Open
ruichuan opened this issue May 8, 2020 · 6 comments
Assignees
Labels
feature_request (New feature request), improvement (Improvements to an existing component), in progress (This issue is already being fixed)

Comments

@ruichuan
Collaborator

ruichuan commented May 8, 2020

[Environment]: Kubernetes, bare metal
[Known affected releases]: master (includes all releases)

The current DataLayerService asks users to specify whether they want to retrieve the data from the local or the global data layer. In the future, it'd be valuable to handle this transparently with a local cache, as follows:

  1. Each data layer server keeps additional metadata for each item in the cache. This metadata contains the expiry timestamp.
  2. When an item is requested from the data layer server, it does the following:
    2.1) If the item is in the cache and hasn't expired yet, the server returns the cached value.
    2.2) If the item is in the cache but expired, or if it is not in the cache, the server retrieves it from the global data layer, caches it with a new expiry timestamp, and returns that value.

The 'lease duration' is then a configurable value.
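The read path above can be sketched as follows. This is a minimal sketch, not the actual DataLayerService API: `global_store`, the method names, and the in-memory dict are all placeholder assumptions.

```python
import time


class TTLCache:
    """Sketch of the proposed read path: each cached item carries an
    expiry timestamp; reads within the lease duration are served locally,
    otherwise the value is re-fetched from the global data layer."""

    def __init__(self, global_store, lease_duration_s=5.0):
        self._global = global_store     # any object exposing get(key)
        self._lease = lease_duration_s  # the configurable lease duration
        self._cache = {}                # key -> (value, expiry_timestamp)

    def get(self, key):
        now = time.monotonic()
        entry = self._cache.get(key)
        # 2.1) cached and not expired: serve locally
        if entry is not None and entry[1] > now:
            return entry[0]
        # 2.2) expired or missing: fetch from the global data layer,
        # cache with a fresh expiry timestamp, and return the value
        value = self._global.get(key)
        self._cache[key] = (value, now + self._lease)
        return value
```

A reasonably long lease keeps repeated reads local, at the cost of potentially serving a value that another local data layer has since overwritten in the global backend.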

As another improvement, at step 2.1 (when the cached value hasn't expired yet), after returning to the client, the data layer server may make an asynchronous request to the global data layer around the expiry time to retrieve the latest value and update the expiry timestamp accordingly. This approach would help keep the caches synchronized. It adds some overhead, but only for items that have been accessed recently; for the others, the cache expires as normal. (So a cached item that is accessed once stays in the cache for at least one more lease duration. Afterwards, if it hasn't been accessed again, it expires normally.)

Then we could have a lazy background thread running once in a while to clean up the cache by removing the expired values (simply go over the cache and remove the expired entries). This is an optimization to reduce resource consumption and may not be critical.
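A sketch of that sweeper, assuming the same hypothetical `key -> (value, expiry_timestamp)` cache layout as above; the sweep logic is factored out so it can run either periodically or on demand.

```python
import threading
import time


def sweep_once(cache, now):
    """Remove entries whose expiry timestamp has passed; return the count."""
    expired = [key for key, (_, expiry) in cache.items() if expiry <= now]
    for key in expired:
        del cache[key]
    return len(expired)


def start_cleanup_thread(cache, lock, interval_s=60.0):
    """Start a lazy daemon thread that sweeps the cache once in a while.
    A generous interval is fine: this only reclaims memory, correctness
    is already guaranteed by the expiry check on the read path."""
    def loop():
        while True:
            time.sleep(interval_s)
            with lock:
                sweep_once(cache, time.monotonic())
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```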

The write synchronization stays the same: the value is written to the local cache and sent to the global data layer asynchronously. We can apply the same principle here as well: after writing to the local cache, the expiry timestamp is set accordingly (hopefully, by the time it expires, the synchronization to the global data layer will have finished). Perhaps we can use this criterion to guide the choice of the lease duration.
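The write path could then be sketched like this (again with placeholder names; the point is that the local write and the expiry stamp are synchronous, while the global write happens in the background):

```python
import threading
import time


class WriteBehindCache:
    """Sketch of the write path: writes land in the local cache
    immediately with a fresh expiry timestamp and are propagated to
    the global data layer on a background thread. The lease duration
    should comfortably exceed the typical global-write latency."""

    def __init__(self, global_store, lease_duration_s=5.0):
        self._global = global_store  # any object exposing put(key, value)
        self._lease = lease_duration_s
        self._cache = {}             # key -> (value, expiry_timestamp)

    def put(self, key, value):
        # Write locally and stamp the expiry; by the time the lease runs
        # out, the asynchronous global write should have completed.
        self._cache[key] = (value, time.monotonic() + self._lease)
        threading.Thread(target=self._global.put, args=(key, value),
                         daemon=True).start()
```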

@ruichuan ruichuan added feature_request New feature request TODO This issue is next on the list improvement Improvements to an existing component labels May 8, 2020
@ruichuan ruichuan self-assigned this May 8, 2020
@ruichuan ruichuan added in progress This issue is already being fixed and removed TODO This issue is next on the list labels May 11, 2020
@ruichuan
Collaborator Author

ruichuan commented May 11, 2020

created a new branch "feature/datalayer_lease" for this feature.

@manuelstein
Collaborator

It's not a lease, because that would imply the local data layer [cache] has temporal ownership of the global data, which it doesn't: any other local data layer can write the same identifier to the global backend, which renders the local cache content invalid. It's an expiration timer, typically labelled TTL.

@iakkus
Member

iakkus commented Jun 13, 2020

As an alternative to the above solution, one could utilize Redis as a caching layer in front of Riak. See https://riak.com/products/riak-kv/redis-integration/index.html.

In this alternative, the DataLayerService would become a translation layer for storage operations:

  1. If the operation is a KV operation, then the DataLayerService will redirect it to Redis (which has its own synchronization with Riak).
  2. If the operation is a CRDT operation, then the DataLayerService will redirect it to Riak.
  3. Other operations related to other features (e.g., triggerable tables) need to be figured out, but will most likely use a combination of Redis and Riak.
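A minimal sketch of such a dispatcher. The operation names and backend handles here are illustrative assumptions, not the actual DataLayerService API; the open question of step 3 is left as an explicit error.

```python
# Hypothetical operation sets; the real DataLayerService interface
# would define these precisely.
KV_OPS = {"get", "put", "delete"}
CRDT_OPS = {"counter_increment", "set_add", "map_update"}


def route_operation(op, redis_backend, riak_backend):
    """Translation-layer routing: KV operations go to Redis (which has
    its own synchronization with Riak), CRDT operations go to Riak."""
    if op in KV_OPS:
        return redis_backend
    if op in CRDT_OPS:
        return riak_backend
    # Step 3: operations for other features (e.g., triggerable tables)
    # are still open and will likely need a combination of both backends.
    raise NotImplementedError(f"routing for {op!r} not decided yet")
```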

As a result of these modifications, the DataLayerService will not cache anything itself, thus offloading cache-related problems, such as invalidation, to Redis. Furthermore, its logic will be simplified, so that it can become part of each sandbox. As an implementation optimization regarding JVMs, it can be merged with the queue service.

@ruichuan, @paarijaat: Did I miss anything?

@iakkus
Member

iakkus commented Jun 15, 2020

As far as I can tell, the data layer service is only used by sandboxes (including management) and the storage frontend, which is going to be removed (see #28). After that, the data layer service will be the only remaining component in the bare metal setup that requires the fluentbit server to send the logs to elasticsearch (in fact, the fluentbit server is not utilized in the knative setup anyway).

If the data layer service moves inside the sandbox, we can also get rid of the extra fluentbit server in bare metal, because the sandbox logs are already sent to elasticsearch. That move would affect the way the management service is bootstrapped, though.

@ruichuan
Collaborator Author

ruichuan commented Jun 15, 2020

@iakkus thanks, it captures our discussion well. The data layer service (DLS) is a bit more than a simple translation layer that redirects requests to riak/redis for processing: DLS needs to use the respective client libraries to organize operations in certain ways for correct processing. I agree that the new `riak_redis` approach can simplify DLS.

Regarding where to put DLS (either standalone or inside a sandbox), we can decide later. It is a simple move anyway.

@ruichuan
Collaborator Author

ruichuan commented Jul 1, 2020

switched this work to "feature/redis_caching" branch


3 participants