Task handling is incomplete #774

Open

snazy opened this issue Jan 15, 2025 · 4 comments · May be fixed by #1523 or #1585
Labels: 1.0-blocker, bug (Something isn't working)

Comments

snazy (Member) commented Jan 15, 2025

Describe the bug

Polaris uses asynchronously executed tasks to run operations for table and manifest-file cleanup. Those tasks are potentially executed in a separate thread in the same JVM. There is, however, no guarantee that those tasks will eventually run, for multiple reasons:

  • Tasks (e.g. via org.apache.polaris.service.catalog.BasePolarisCatalog#dropTable) are triggered after the fact.
  • Although tasks are persisted, there is no mechanism to pick up tasks that did not start or did not finish (e.g. long-lasting failures, or the JVM terminating).

Overall this means that, for example, a "drop table with purge" may return a successful result to the user while the actual purge never happens.
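
For illustration, a minimal, hypothetical sketch of the fire-and-forget pattern described above. FireAndForgetDrop, PurgeTask, and TASK_STORE are invented names, not Polaris classes; they only model the behavior of persisting a task and handing it to an in-JVM executor:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of the fire-and-forget behavior described above.
// PurgeTask, TASK_STORE and dropTableWithPurge are illustrative, not Polaris code.
public class FireAndForgetDrop {

  record PurgeTask(String id, String tableLocation) {}

  // Stand-in for the persisted task entities.
  static final Map<String, PurgeTask> TASK_STORE = new ConcurrentHashMap<>();

  static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

  static void dropTableWithPurge(String tableLocation) {
    PurgeTask task = new PurgeTask(UUID.randomUUID().toString(), tableLocation);
    TASK_STORE.put(task.id(), task);

    // The drop is acknowledged as soon as this method returns; the purge runs
    // "after the fact" on an in-JVM executor. If the JVM dies before or during
    // this runnable, nothing ever re-reads TASK_STORE, so the persisted task is
    // orphaned and the purge never happens.
    EXECUTOR.submit(() -> {
      // ... delete data and manifest files under task.tableLocation() ...
      TASK_STORE.remove(task.id());
    });
  }

  public static void main(String[] args) {
    dropTableWithPurge("s3://bucket/warehouse/db/tbl");
    EXECUTOR.shutdown();
  }
}
```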

To Reproduce

No response

Actual Behavior

No response

Expected Behavior

No response

Additional context

No response

System information

No response

snazy added the bug label on Jan 15, 2025
eric-maynard (Contributor) commented

This is true; it's the reason for #270, but ideally we should make the operation reliable.

danielhumanmod (Contributor) commented May 4, 2025

This issue is also mentioned in #1179. I can help patch this by:

  1. When a cleanup task fails, persist and requeue it instead of retrying inline (to prevent losing track if the service dies).
  2. Recover failed tasks on Polaris startup.
  3. Drain the queue on a regular interval (e.g., every 15 minutes) to ensure no tasks remain pending.
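
A rough sketch of how steps 2 and 3 might look, assuming a simple polling loop; TaskStore and TaskExecutor are hypothetical interfaces for this sketch, not existing Polaris APIs:

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: recover persisted-but-unfinished tasks at startup and on
// a fixed interval. TaskStore and TaskExecutor are illustrative interfaces, not
// existing Polaris APIs.
public class TaskRecoveryScheduler {

  interface TaskStore {
    /** Returns persisted tasks that never completed (failed or never started). */
    List<String> loadPendingTaskIds();
  }

  interface TaskExecutor {
    void execute(String taskId);
  }

  private final TaskStore store;
  private final TaskExecutor executor;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  TaskRecoveryScheduler(TaskStore store, TaskExecutor executor) {
    this.store = store;
    this.executor = executor;
  }

  void start(Duration interval) {
    // Run once at startup (step 2) and then periodically (step 3).
    scheduler.scheduleAtFixedRate(this::drain, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
  }

  private void drain() {
    // Step 1 means a failed task is already persisted, so a crash between
    // load and execute only delays it until the next drain.
    for (String taskId : store.loadPendingTaskIds()) {
      executor.execute(taskId);
    }
  }

  public static void main(String[] args) {
    TaskStore store = () -> List.of("purge-table-42");             // pretend one orphaned task
    TaskExecutor exec = id -> System.out.println("re-running " + id);
    new TaskRecoveryScheduler(store, exec).start(Duration.ofMinutes(15));
  }
}
```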

But in the long term, we might want to move this purge execution logic to the Table Maintenance Service #538 for better scalability.

snazy (Member, Author) commented May 8, 2025

I think that it's also very important that no two instances run the same task.
Otherwise it's not safe to run Polaris in an HA/LB setup.

danielhumanmod (Contributor) commented May 14, 2025

> I think that it's also very important that no two instances run the same task. Otherwise it's not safe to run Polaris in an HA/LB setup.

Yup, we prevent this from happening by storing a LAST_ATTEMPT_START_TIME for each task entity.

When loadTasks is called, each task is selected based on whether it has timed out (LAST_ATTEMPT_START_TIME < now - TASK_TIMEOUT_MILLIS), and then updated transactionally to set a new LAST_ATTEMPT_START_TIME and assign it to the executor.

So in an HA/LB setup, once one executor picks and updates a task, no others will be able to pick the same one until it times out again.
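
For illustration, a hypothetical in-memory sketch of that leasing scheme; the real implementation would perform the conditional update inside a metastore transaction rather than against a ConcurrentHashMap, and TaskEntity / TASK_TIMEOUT_MILLIS below only mirror the names mentioned above:

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of the leasing scheme described above. The
// real loadTasks would do the conditional update inside a metastore transaction;
// TaskEntity and TASK_TIMEOUT_MILLIS here only mirror the names in the comment.
public class TaskLeasing {

  static final long TASK_TIMEOUT_MILLIS = 5 * 60_000L;

  record TaskEntity(String id, long lastAttemptStartTime, String executorId) {}

  static final ConcurrentHashMap<String, TaskEntity> TASKS = new ConcurrentHashMap<>();

  /** Claim up to {@code limit} timed-out tasks for {@code executorId}. */
  static List<TaskEntity> loadTasks(String executorId, int limit, long now) {
    return TASKS.values().stream()
        .filter(t -> t.lastAttemptStartTime() < now - TASK_TIMEOUT_MILLIS)
        .limit(limit)
        // Atomically re-check and stamp the task: replace() fails if another
        // executor claimed it first, so no two instances get the same task.
        .map(t -> {
          TaskEntity claimed = new TaskEntity(t.id(), now, executorId);
          return TASKS.replace(t.id(), t, claimed)
              ? Optional.of(claimed)
              : Optional.<TaskEntity>empty();
        })
        .flatMap(Optional::stream)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    TASKS.put("t1", new TaskEntity("t1", 0L, null)); // last attempt long ago -> timed out
    System.out.println(loadTasks("executor-A", 10, now)); // executor-A claims t1
    System.out.println(loadTasks("executor-B", 10, now)); // empty: t1 stays leased until it times out again
  }
}
```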
