Task handling is incomplete #774

Open

snazy opened this issue Jan 15, 2025 · 4 comments · May be fixed by #1523 or #1585
Labels: 1.0-blocker, bug (Something isn't working)

Comments

snazy (Member) commented Jan 15, 2025

Describe the bug

Polaris uses asynchronously executed tasks to run operations for table and manifest-file cleanup. Those tasks are potentially executed in a separate thread in the same JVM. There is, however, no guarantee that those tasks will eventually run, for multiple reasons:

  • Tasks (e.g. via org.apache.polaris.service.catalog.BasePolarisCatalog#dropTable) are triggered after the fact.
  • Although tasks are persisted, there is no mechanism to pick up tasks that did not start or did not finish (e.g. long-lasting failures, or the JVM terminating).

Overall this means that, for example, a "drop table with purge" may return a successful result to the user while the actual purge never happens.
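
For illustration, a minimal, hypothetical sketch of the fire-and-forget pattern described above. FireAndForgetDrop, PurgeTask, and TASK_STORE are invented names, not Polaris classes; they only model the behavior of persisting a task and handing it to an in-JVM executor:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of the fire-and-forget behavior described above.
// PurgeTask, TASK_STORE and dropTableWithPurge are illustrative, not Polaris code.
public class FireAndForgetDrop {

  record PurgeTask(String id, String tableLocation) {}

  // Stand-in for the persisted task entities.
  static final Map<String, PurgeTask> TASK_STORE = new ConcurrentHashMap<>();

  static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

  static void dropTableWithPurge(String tableLocation) {
    PurgeTask task = new PurgeTask(UUID.randomUUID().toString(), tableLocation);
    TASK_STORE.put(task.id(), task);

    // The drop is acknowledged as soon as this method returns; the purge runs
    // "after the fact" on an in-JVM executor. If the JVM dies before or during
    // this runnable, nothing ever re-reads TASK_STORE, so the persisted task is
    // orphaned and the purge never happens.
    EXECUTOR.submit(() -> {
      // ... delete data and manifest files under task.tableLocation() ...
      TASK_STORE.remove(task.id());
    });
  }

  public static void main(String[] args) {
    dropTableWithPurge("s3://bucket/warehouse/db/tbl");
    EXECUTOR.shutdown();
  }
}
```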

To Reproduce

No response

Actual Behavior

No response

Expected Behavior

No response

Additional context

No response

System information

No response

snazy added the bug label on Jan 15, 2025
eric-maynard (Contributor) commented

This is true; it's the reason for #270, but ideally we should make the operation reliable.

danielhumanmod (Contributor) commented May 4, 2025

This issue is also mentioned in #1179. I can help patch this by:

  1. When a cleanup task fails, persist and requeue it instead of retrying inline (to prevent losing track if the service dies).
  2. Recover failed tasks on Polaris startup.
  3. Drain the queue on a regular interval (e.g., every 15 minutes) to ensure no tasks remain pending.
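
A rough sketch of how steps 2 and 3 might look, assuming a simple polling loop; TaskStore and TaskExecutor are hypothetical interfaces for this sketch, not existing Polaris APIs:

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: recover persisted-but-unfinished tasks at startup and on
// a fixed interval. TaskStore and TaskExecutor are illustrative interfaces, not
// existing Polaris APIs.
public class TaskRecoveryScheduler {

  interface TaskStore {
    /** Returns persisted tasks that never completed (failed or never started). */
    List<String> loadPendingTaskIds();
  }

  interface TaskExecutor {
    void execute(String taskId);
  }

  private final TaskStore store;
  private final TaskExecutor executor;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  TaskRecoveryScheduler(TaskStore store, TaskExecutor executor) {
    this.store = store;
    this.executor = executor;
  }

  void start(Duration interval) {
    // Run once at startup (step 2) and then periodically (step 3).
    scheduler.scheduleAtFixedRate(this::drain, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
  }

  private void drain() {
    // Step 1 means a failed task is already persisted, so a crash between
    // load and execute only delays it until the next drain.
    for (String taskId : store.loadPendingTaskIds()) {
      executor.execute(taskId);
    }
  }

  public static void main(String[] args) {
    TaskStore store = () -> List.of("purge-table-42");             // pretend one orphaned task
    TaskExecutor exec = id -> System.out.println("re-running " + id);
    new TaskRecoveryScheduler(store, exec).start(Duration.ofMinutes(15));
  }
}
```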

But in the long term, we might want to move this purge execution logic to the Table Maintenance Service #538 for better scalability.

snazy (Member, Author) commented May 8, 2025

I think that it's also very important that no two instances run the same task.
Otherwise it's not safe to run Polaris in an HA/LB setup.

danielhumanmod (Contributor) commented May 14, 2025

> I think that it's also very important that no two instances run the same task. Otherwise it's not safe to run Polaris in an HA/LB setup.

Yup, we prevent this from happening by storing a LAST_ATTEMPT_START_TIME for each task entity.

When loadTasks is called, each task is selected based on whether it has timed out (LAST_ATTEMPT_START_TIME < now - TASK_TIMEOUT_MILLIS), and then updated transactionally to set a new LAST_ATTEMPT_START_TIME and assign it to the executor.

So in an HA/LB setup, once one executor picks and updates a task, no others will be able to pick the same one until it times out again.
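
For illustration, a hypothetical in-memory sketch of that leasing scheme; the real implementation would perform the conditional update inside a metastore transaction rather than against a ConcurrentHashMap, and TaskEntity / TASK_TIMEOUT_MILLIS below only mirror the names mentioned above:

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of the leasing scheme described above. The
// real loadTasks would do the conditional update inside a metastore transaction;
// TaskEntity and TASK_TIMEOUT_MILLIS here only mirror the names in the comment.
public class TaskLeasing {

  static final long TASK_TIMEOUT_MILLIS = 5 * 60_000L;

  record TaskEntity(String id, long lastAttemptStartTime, String executorId) {}

  static final ConcurrentHashMap<String, TaskEntity> TASKS = new ConcurrentHashMap<>();

  /** Claim up to {@code limit} timed-out tasks for {@code executorId}. */
  static List<TaskEntity> loadTasks(String executorId, int limit, long now) {
    return TASKS.values().stream()
        .filter(t -> t.lastAttemptStartTime() < now - TASK_TIMEOUT_MILLIS)
        .limit(limit)
        // Atomically re-check and stamp the task: replace() fails if another
        // executor claimed it first, so no two instances get the same task.
        .map(t -> {
          TaskEntity claimed = new TaskEntity(t.id(), now, executorId);
          return TASKS.replace(t.id(), t, claimed)
              ? Optional.of(claimed)
              : Optional.<TaskEntity>empty();
        })
        .flatMap(Optional::stream)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    TASKS.put("t1", new TaskEntity("t1", 0L, null)); // last attempt long ago -> timed out
    System.out.println(loadTasks("executor-A", 10, now)); // executor-A claims t1
    System.out.println(loadTasks("executor-B", 10, now)); // empty: t1 stays leased until it times out again
  }
}
```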
