Skip to content

Add button to set failed job to pending #8377

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Mar 12, 2025
Merged

Add button to set failed job to pending #8377

merged 20 commits into from
Mar 12, 2025

Conversation

frcroth
Copy link
Member

@frcroth frcroth commented Feb 5, 2025

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Steps to test:

  • Enable jobs
  • Set a job to "FAILURE"
  • Resume the job. The issue did not specify what should happen to the job so I assumed "PENDING". Should it be "RUNNING" instead?

TODOs:

  • @ frontend (@knollengewaechs ?) Can you please configure the button to only show for super users? I don't know how to do that.

Issues:


(Please delete unneeded items, merge only when none are left open)

Copy link
Contributor

coderabbitai bot commented Feb 5, 2025

📝 Walkthrough

Walkthrough

This pull request introduces a new feature that allows super-users to retry failed jobs. The update spans across backend, frontend, and database layers. A new method is added in the job controller along with its corresponding route, and the frontend UI is updated to show a retry option for failed jobs. Furthermore, the job model and DAO have been enhanced with a new boolean field to track retries, and the job service logic is modified to conditionally suppress email notifications. Database schema and migration files have been updated to support these changes.

Changes

Files Change Summary
CHANGELOG.unreleased.md, MIGRATIONS.unreleased.md Documented the new super-user job retry feature and added a migration entry for SQL file 127-job-retried-by-super-user.sql.
app/controllers/JobController.scala, conf/.../latest.routes Added a new retry method and a PATCH route (/jobs/:id/retry) enabling super-users to retry jobs.
frontend/.../jobs.ts, frontend/.../job_list_view.tsx Introduced an async API function retryJob and updated the job list UI to include a retry icon for failed jobs when accessed by super-users.
app/models/job/Job.scala, app/models/job/JobService.scala Added the retriedBySuperUser field, a retryOne DAO method, and conditional email notification logic based on retry status.
tools/.../schema.sql, conf/evolutions/127-job-retried-by-super-user.sql, conf/evolutions/reversions/127-job-retried-by-super-user.sql Updated the schema version, added a new non-null boolean column retriedBySuperUser to the jobs table, and provided corresponding migration and reversion scripts.

Possibly related PRs

  • Evaluate segmentation in infer neurons task #8221: The changes in the main PR focus on adding a retry functionality for jobs in the JobController, while the retrieved PR modifies the runInferNeuronsJob method in the same JobController class, indicating a direct relationship between the two PRs through modifications to the same class.

Suggested labels

enhancement, bug, backend, frontend

Suggested reviewers

  • MichaelBuessemeyer

Poem

Oh, I'm a little bunny in the codeyard so bright,
Hopping through patches with joy and delight.
Super-users now can retry a job gone wrong,
With a swift click and a fix that's strong.
I wiggle my nose and celebrate this day—
CodeRabbit cheers the changes in its playful way!
🐰✨

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@frcroth frcroth self-assigned this Feb 5, 2025
@frcroth frcroth changed the title Add route to set failed job to pending Add button to set failed job to pending Feb 10, 2025
@frcroth frcroth marked this pull request as ready for review February 10, 2025 12:33
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
CHANGELOG.unreleased.md (1)

19-19: New Feature: Resume Failed Jobs for Super-Users

The changelog correctly reflects the new feature allowing super-users to resume failed jobs. Please verify that the backend and frontend behavior (i.e., whether the job state transitions to "PENDING" or "RUNNING") is clearly defined in the documentation and aligns with the intended functionality as discussed in PR [#8377].

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d04993 and 24c529e.

📒 Files selected for processing (5)
  • CHANGELOG.unreleased.md (1 hunks)
  • app/controllers/JobController.scala (3 hunks)
  • conf/webknossos.latest.routes (1 hunks)
  • frontend/javascripts/admin/api/jobs.ts (1 hunks)
  • frontend/javascripts/admin/job/job_list_view.tsx (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: circleci_build
🔇 Additional comments (8)
frontend/javascripts/admin/api/jobs.ts (1)

77-81: LGTM! The implementation follows the established pattern.

The resumeJob function is well-implemented, following the same pattern as other job-related functions in the file, making it consistent and maintainable.

frontend/javascripts/admin/job/job_list_view.tsx (3)

10-10: LGTM! Icon import follows the established pattern.

The PlayCircleOutlined icon import is correctly added, maintaining consistency with other icon imports.


139-139: LGTM! Super user check is properly implemented.

The Redux selector is correctly used to determine if the current user is a super user, ensuring proper access control.


305-316: LGTM! Resume action follows the same pattern as cancel action.

The resume functionality is well-implemented:

  • Correctly checks for failed job state and super user status
  • Uses AsyncLink for handling the action
  • Follows the same pattern as the cancel action
  • Updates the job list after resuming
app/controllers/JobController.scala (2)

53-67: LGTM! UserService is properly injected.

The UserService is correctly added to the constructor using dependency injection, following the established pattern.


115-123: LGTM! Resume method is secure and follows established patterns.

The implementation:

  • Properly checks if jobs feature is enabled
  • Verifies super user permissions with proper error handling
  • Updates job state using existing DAO methods
  • Returns consistent response format
conf/webknossos.latest.routes (1)

271-271: LGTM! Route follows RESTful conventions.

The new route is well-defined:

  • Uses appropriate HTTP method (PATCH) for state transition
  • Follows the same pattern as other job routes
  • Is correctly placed in the jobs section
CHANGELOG.unreleased.md (1)

18-18: Accessibility Update: Measurement Tools Enhancement

This changelog entry now notes that measurement tools are accessible when viewing datasets outside an annotation. Ensure that this entry remains consistent with the UI changes and accessibility guidelines applied in [#8334].

Copy link
Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hej,

thanks for implementing this feature 🎉. The code looked fine to me except one potential minor improvement.

But could you please check the bug I found during testing? A restarted job is not actually restarted by the worker :/

onClick={async () => {
resumeJob(job.id).then(() => fetchData());
}}
icon={<PlayCircleOutlined key="resume" className="icon-margin-right" />}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key prop here shouldn't be needed (just as above). Could you please remove them before merging and give this change a quick check?

_ <- bool2Fox(wkconf.Features.jobsEnabled) ?~> "job.disabled"
_ <- userService.assertIsSuperUser(request.identity) ?~> "notAllowed" ~> FORBIDDEN
job <- jobDAO.findOne(id)
_ <- jobDAO.updateManualState(id, JobState.PENDING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing I noticed that updating the manual state isn't enough (using the current wk worker master code). Could you please investigate this? I think that also updating the job's state should also work, but somehow the manual state is ignored by the backend / worker. I think is more likely the behaviour that needs to be fixed. Not sure whether this needs to be fixed in the backend or worker

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think the problem was that the field manual state was used, which only has an effect on the displayed state.

@daniel-wer
Copy link
Member

Thanks for implementing! When a superuser retries/resumes a task by another user, ideally the user wouldn't get the "Job Failed" email if the job fails another time, because that is irritating and unnecessary email noise for him/her. Once the job is successful, the respective mail should go through. Could the two of you coordinate on how to best implement this as part of this PR or a followup? /cc @normanrz

@MichaelBuessemeyer
Copy link
Contributor

Thanks for implementing! When a superuser retries/resumes a task by another user, ideally the user wouldn't get the "Job Failed" email if the job fails another time, because that is irritating and unnecessary email noise for him/her. Once the job is successful, the respective mail should go through. Could the two of you coordinate on how to best implement this as part of this PR or a followup?

Oh that's a good idea 🍌 👯.
First thoughts: Add a retry count to the jobs table and only send failed mail in case the retry count is 0 upon failure. Save some bool like sentFailedMessage to the job. And also maybe other ideas. But this sounds definitely doable and not so complicated. => I'd vote for implementing this in this pr :).

But maybe I am also overseeing things.

Moreover: I think I like the approach saving the retry count more, as it provides more information for e.g. later analysis on how many attempts were needed instead of saving a simple boolean which provides less information. 🤔

@daniel-wer
Copy link
Member

These are good ideas, but I think the differentiating factor should rather be whether another user resumed the workflow run. From a users perspective, when resuming my own workflow, I'd like to be notified of failure and success. However, if a super user resumed a run, the run owner should only be notified of success.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24c529e and 67ed6e0.

📒 Files selected for processing (9)
  • CHANGELOG.unreleased.md (1 hunks)
  • app/controllers/JobController.scala (3 hunks)
  • app/models/job/Job.scala (3 hunks)
  • app/models/job/JobService.scala (1 hunks)
  • conf/application.conf (1 hunks)
  • conf/evolutions/126-job-resumed-by-super-user.sql (1 hunks)
  • conf/evolutions/reversions/126-job-resumed-by-super-user.sql (1 hunks)
  • frontend/javascripts/admin/job/job_list_view.tsx (4 hunks)
  • tools/postgres/schema.sql (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • CHANGELOG.unreleased.md
  • frontend/javascripts/admin/job/job_list_view.tsx
🔇 Additional comments (7)
conf/evolutions/126-job-resumed-by-super-user.sql (1)

1-8: LGTM! The SQL migration is well-structured.

The migration correctly:

  • Uses a transaction for atomicity
  • Validates schema version
  • Adds the column with appropriate default value
  • Updates schema version
conf/evolutions/reversions/126-job-resumed-by-super-user.sql (1)

1-8: LGTM! The SQL reversion is properly implemented.

The reversion correctly:

  • Uses a transaction for atomicity
  • Validates schema version
  • Drops the column
  • Updates schema version
conf/application.conf (1)

154-155: Verify if both features need to be enabled.

While enabling jobsEnabled is necessary for the new job management functionality, please verify if voxelyticsEnabled is also required for this feature.

app/models/job/Job.scala (1)

32-32: LGTM! The field is properly defined.

The resumedBySuperUser field is correctly added with appropriate type and default value.

app/models/job/JobService.scala (1)

66-66: LGTM! The notification logic aligns with the requirements.

The change ensures that failed email notifications are only sent if the job was not resumed by a super user, which aligns with the discussion in the PR comments about preventing unnecessary email notifications.

app/controllers/JobController.scala (1)

115-123: LGTM! The implementation is secure and follows controller patterns.

The new resume method correctly:

  • Validates that jobs are enabled
  • Verifies super user permissions
  • Handles job resumption
tools/postgres/schema.sql (1)

23-23: LGTM! The schema changes are well-structured.

The changes correctly:

  • Increment the schema version
  • Add the resumedBySuperUser column with appropriate constraints

Also applies to: 475-475

Comment on lines 258 to 263
def resumeOne(id: ObjectId)(implicit ctx: DBAccessContext): Fox[Unit] =
for {
_ <- assertUpdateAccess(id)
_ <- run(q"""UPDATE webknossos.jobs SET state = 'PENDING', resumedBySuperUser = True WHERE _id = $id""".asUpdate)
} yield ()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix SQL syntax and add state validation.

Two issues need to be addressed:

  1. Use lowercase true instead of True in SQL to follow PostgreSQL boolean literal syntax
  2. Add validation to ensure the job is in FAILURE state before resuming

Here's the suggested implementation:

 def resumeOne(id: ObjectId)(implicit ctx: DBAccessContext): Fox[Unit] =
   for {
     _ <- assertUpdateAccess(id)
-    _ <- run(q"""UPDATE webknossos.jobs SET state = 'PENDING', resumedBySuperUser = True WHERE _id = $id""".asUpdate)
+    _ <- run(q"""UPDATE webknossos.jobs 
+                 SET state = 'PENDING', resumedBySuperUser = true 
+                 WHERE _id = $id 
+                 AND state = 'FAILURE'""".asUpdate)
   } yield ()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def resumeOne(id: ObjectId)(implicit ctx: DBAccessContext): Fox[Unit] =
for {
_ <- assertUpdateAccess(id)
_ <- run(q"""UPDATE webknossos.jobs SET state = 'PENDING', resumedBySuperUser = True WHERE _id = $id""".asUpdate)
} yield ()
def resumeOne(id: ObjectId)(implicit ctx: DBAccessContext): Fox[Unit] =
for {
_ <- assertUpdateAccess(id)
_ <- run(q"""UPDATE webknosos.jobs
SET state = 'PENDING', resumedBySuperUser = true
WHERE _id = $id
AND state = 'FAILURE'""".asUpdate)
} yield ()

@frcroth
Copy link
Member Author

frcroth commented Feb 24, 2025

@daniel-wer @MichaelBuessemeyer I added the column "resumedBySuperUser" to the jobs table, which is checked when sending a failure email.


DROP VIEW webknossos.jobs_;

ALTER TABLE webknossos.jobs ADD column resumedBySuperUser boolean NOT NULL DEFAULT false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, very nitpicky but boolean should be in caps here as well, correct?

Suggested change
ALTER TABLE webknossos.jobs ADD column resumedBySuperUser boolean NOT NULL DEFAULT false;
ALTER TABLE webknossos.jobs ADD column resumedBySuperUser BOOLEAN NOT NULL DEFAULT false;

Comment on lines 154 to 155
jobsEnabled = true
voxelyticsEnabled = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undo me -> should also make the ci green

Suggested change
jobsEnabled = true
voxelyticsEnabled = true
jobsEnabled = false
voxelyticsEnabled = false

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

Copy link
Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and works 🎉

Thanks for implementing this 🧨

Maybe consider my final potential improvement. IMO the button restarts / retries a run and not resumes it 🤔

}}
icon={<PlayCircleOutlined className="icon-margin-right" />}
>
Resume
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe restart/retry is a better word here 🤔?

Suggested change
Resume
Retry

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your reasoning, I am just following the specification. @valentin-pinkau @daniel-wer Opinion on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit unclear, because it's actually both. Resume, because the tasks that already finished are skipped and retry, because the failed task is retried. Still, Retry is probably more accurate. Since it's only visible for super user that's fine I'd say. Otherwise one could add a tooltip, noting Restarts the workflow from the failed task, skipping and reusing artifacts from preceding tasks that were already successful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restarts the workflow from the failed task, skipping and reusing artifacts from preceding tasks that were already successful.

That sounds like a valuable addon regardless of this being only available to super users. As we might make this available to non super users in the future, we do not need to remember to add this tooltip. => I'd vote to add it here as this shouldn't be difficult and also should need a re-review thus not delaying merging this :)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
frontend/javascripts/admin/job/job_list_view.tsx (1)

314-314: Consider adding confirmation dialog for resuming jobs.

Since resuming a job may have side effects (especially if the job belongs to another user), consider adding a confirmation dialog similar to the one used for job cancellation.

You could implement this by adding a confirmation dialog:

-  onClick={async () => {
-    resumeJob(job.id).then(() => fetchData());
-  }}
+  onClick={async () => {
+    const isResumeConfirmed = await confirmAsync({
+      title: <p>Are you sure you want to retry job {job.id}?</p>,
+      okText: "Yes, retry job",
+      cancelText: "No, cancel",
+    });
+
+    if (isResumeConfirmed) {
+      resumeJob(job.id).then(() => fetchData());
+    }
+  }}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82ab1f1 and 9e38d44.

📒 Files selected for processing (3)
  • CHANGELOG.unreleased.md (1 hunks)
  • MIGRATIONS.unreleased.md (1 hunks)
  • frontend/javascripts/admin/job/job_list_view.tsx (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • CHANGELOG.unreleased.md
  • MIGRATIONS.unreleased.md
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: circleci_build
🔇 Additional comments (7)
frontend/javascripts/admin/job/job_list_view.tsx (7)

10-10: Appropriate icon selection for the resume action.

The PlayCircleOutlined icon is a good choice for the resume action as it visually communicates the concept of restarting or continuing a process.


14-14: Correctly added resumeJob to imports.

The resumeJob function import is correctly added alongside related job management functions.


25-25: Redux state management properly configured.

The imports for Redux's useSelector and the OxalisState type are correctly added to support the super user check functionality.

Also applies to: 28-28


139-139: Well-implemented super user permission check.

Using Redux's useSelector to check if the current user is a super user is an appropriate approach for restricting access to the resume functionality.


300-300: Consistent icon implementation.

The icon is properly added to the AsyncLink component, maintaining consistency with other action buttons in the interface.


305-317: Successfully implemented resume functionality for failed jobs.

The implementation correctly:

  1. Restricts the resume functionality to super users
  2. Only shows the button for jobs in "FAILURE" state
  3. Uses the resumeJob API call to restart the job
  4. Refreshes the data after resuming to show updated state
  5. Includes an informative tooltip explaining what happens when resuming a job

The "Retry" label is more accurate than "Resume" as it better communicates that the failed task will be retried, as discussed in previous comments.


309-309: Informative tooltip provides clarity for super users.

The tooltip clearly explains the behavior when resuming a job, which is particularly important for super users who need to understand the implications of restarting a workflow. This addresses the request from previous review comments.

Comment on lines 310 to 312
onClick={async () => {
resumeJob(job.id).then(() => fetchData());
}}
Copy link
Contributor

@coderabbitai coderabbitai bot Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify error handling for resumeJob API call.

The code assumes the resumeJob API call will always succeed. Consider adding error handling to manage potential failures and provide feedback to the user.


🏁 Script executed:

#!/bin/bash
# Check if there's error handling in the resumeJob implementation

echo "Checking if resumeJob has error handling:"
rg -A 5 "resumeJob.*=" admin/admin_rest_api.ts || echo "resumeJob function not found"

echo "Checking global error handling for API calls:"
rg -A 5 "catch\(" admin/admin_rest_api.ts || echo "No global error handling found"

Length of output: 552


Action Required: Enhance Error Handling for resumeJob API Call

  • The onClick handler in frontend/javascripts/admin/job/job_list_view.tsx currently calls resumeJob(job.id) without any error handling.
  • Our search for the resumeJob implementation in admin/admin_rest_api.ts did not locate its definition, and no global error handling (e.g., catch blocks) was found.
  • Please add explicit error handling for resumeJob (for example, using a try-catch structure with async/await or adding a .catch() callback) to manage potential API failures and provide appropriate user feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the AI is correct here, I'd say. Some kind of feedback would be useful. Maybe add an error toast upon failure and a success toast upon success?

Maybe something like this:

 onClick={async () => {
  try{
    await retryJob(job.id)
    await fetchData();
    Toast.success("Job is being retried");
    } catch (e) {
      console.error("Could not retry job", e);
      Toast.error("Failed to start retrying the job");
    }
  }}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

Based on the results, I need to check for the imports in the file to determine which function name is correct:

@MichaelBuessemeyer I'll implement your suggestion with the error and success toasts. However, I noticed that the function name in the diffed code was resumeJob, but it should be retryJob based on the imports:

import { cancelJob, getJobs, retryJob } from "admin/admin_rest_api";

Here's the updated code with error handling:

onClick={async () => {
  try {
    await retryJob(job.id);
    await fetchData();
    Toast.success("Job is being retried");
  } catch (e) {
    console.error("Could not retry job", e);
    Toast.error("Failed to start retrying the job");
  }
}}

Is there a Toast component available in the codebase, or should we use a different notification system like Ant Design's message or notification APIs since the project seems to be using Ant Design components?


🏁 Script executed:

#!/bin/bash
# Check for resumeJob and retryJob in admin API files
echo "Looking for resumeJob in frontend files:"
rg "resumeJob" frontend/javascripts/admin/

echo "Looking for retryJob in frontend files:"
rg "retryJob" frontend/javascripts/admin/

Length of output: 589


🏁 Script executed:

#!/bin/bash
# Check imports in the job_list_view.tsx file
cat frontend/javascripts/admin/job/job_list_view.tsx | grep -n "import" | head -15

Length of output: 885


Based on the results, I need to look for more specific toast/notification implementations:


🏁 Script executed:

#!/bin/bash
# Check for toast or notification usage in the codebase
echo "Looking for Toast usage:"
rg "Toast\." frontend/javascripts/ --type ts --type tsx | head -10

echo "Looking for message or notification from Ant Design:"
rg "import.*message|import.*notification" frontend/javascripts/ --type ts --type tsx | head -10

Length of output: 396


🏁 Script executed:

#!/bin/bash
# Check for common notification patterns
echo "Looking for toast/notification implementations:"
rg -A 1 "notification\.success|message\.success|\.success\(" frontend/javascripts/ --type ts --type tsx | head -10

Length of output: 249


🏁 Script executed:

#!/bin/bash
# Check for common notification patterns without file type filtering
echo "Looking for message.success usage:"
rg "message\.success" frontend/javascripts/ | head -10

echo "Looking for notification.success usage:"
rg "notification\.success" frontend/javascripts/ | head -10

echo "Looking for antd message/notification imports:"
rg "import.*\{ .*message" frontend/javascripts/ | head -10

Length of output: 1993


@MichaelBuessemeyer You're absolutely right about adding error handling. I've checked the codebase, and it looks like the function should be retryJob (not resumeJob as in the diff).

For notifications, I see that the project uses Ant Design's message component. Here's the updated implementation with proper error handling:

onClick={async () => {
  try {
    await retryJob(job.id);
    await fetchData();
    message.success("Job is being retried");
  } catch (e) {
    console.error("Could not retry job", e);
    message.error("Failed to start retrying the job");
  }
}}

This will require adding the message import from Ant Design if it's not already there:

import { Input, Modal, Spin, Table, Tooltip, Typography, message } from "antd";

This implementation provides clear feedback to users when the retry action succeeds or fails, making the UI more user-friendly.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
app/controllers/JobController.scala (1)

118-118: Consider adding a more descriptive error message for unauthorized access.

The current error message "notAllowed" is generic. A more specific message would help users understand why they can't perform this action.

-      _ <- userService.assertIsSuperUser(request.identity) ?~> "notAllowed" ~> FORBIDDEN
+      _ <- userService.assertIsSuperUser(request.identity) ?~> "Only super users can retry jobs" ~> FORBIDDEN
conf/evolutions/126-job-retried-by-super-user.sql (1)

5-8: Adding the New Column to Jobs Table
Line 7 adds the column retriedBySuperUser with a default value of false. While the implementation is correct, note that the SQL syntax typically uses “ADD COLUMN” (with COLUMN in capital or lowercase) rather than “ADD column”. For clarity and consistency, consider rewriting it as follows:

-ALTER TABLE webknossos.jobs ADD column retriedBySuperUser BOOLEAN NOT NULL DEFAULT false;
+ALTER TABLE webknossos.jobs ADD COLUMN retriedBySuperUser BOOLEAN NOT NULL DEFAULT false;

Additionally, double-check that the naming (retriedBySuperUser) aligns with the intended business terminology. The PR objectives mention a superuser action that “resets” or “resumes” a job, yet here the column is named “retriedBySuperUser.” Ensuring consistency between documentation, code, and feature behavior will help reduce confusion in the future.

tools/postgres/schema.sql (1)

473-476: Incorporation of the New Column in the Jobs Table
At line 475, the new column retriedBySuperUser BOOLEAN NOT NULL DEFAULT false, is added to the webknossos.jobs table within the schema definition. This reflects the change in the evolution script. As mentioned previously, verify that the naming is consistent with the feature’s intent (i.e. whether the column should be called “retriedBySuperUser” or “resumedBySuperUser”) and that any part of the codebase referencing this field is updated accordingly.
Optionally, if filtering jobs by this column becomes common, consider adding an index on it later to optimize query performance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e38d44 and 20689dd.

📒 Files selected for processing (11)
  • CHANGELOG.unreleased.md (1 hunks)
  • MIGRATIONS.unreleased.md (1 hunks)
  • app/controllers/JobController.scala (3 hunks)
  • app/models/job/Job.scala (3 hunks)
  • app/models/job/JobService.scala (1 hunks)
  • conf/evolutions/126-job-retried-by-super-user.sql (1 hunks)
  • conf/evolutions/reversions/126-job-retried-by-super-user.sql (1 hunks)
  • conf/webknossos.latest.routes (1 hunks)
  • frontend/javascripts/admin/api/jobs.ts (1 hunks)
  • frontend/javascripts/admin/job/job_list_view.tsx (4 hunks)
  • tools/postgres/schema.sql (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • CHANGELOG.unreleased.md
  • MIGRATIONS.unreleased.md
  • conf/webknossos.latest.routes
  • app/models/job/JobService.scala
  • frontend/javascripts/admin/job/job_list_view.tsx
🔇 Additional comments (11)
frontend/javascripts/admin/api/jobs.ts (1)

77-81: Concise and consistent implementation of the retry function.

The implementation follows the established pattern of other job-related API functions in this file, providing a clean interface for the frontend to interact with the backend retry endpoint.

conf/evolutions/reversions/126-job-retried-by-super-user.sql (2)

3-3: Great use of schema version assertion.

The assertion ensures the reversion script runs only when the correct schema version is in place, preventing accidental application to incorrect database states.


10-10: Correctly recreated the jobs_ view.

The view properly excludes deleted jobs, maintaining the expected behavior of the system.

app/models/job/Job.scala (3)

32-32: Well-named and properly defaulted field.

The retriedBySuperUser field clearly communicates its purpose and defaults to false, which is the correct initial state for a job.


130-130: Field mapping implementation is correct.

The parse method correctly maps the database column to the model field.


258-265:

❓ Verification inconclusive

Well-implemented retry functionality with proper validation.

The implementation includes:

  1. Access control check with assertUpdateAccess
  2. State validation ensuring only FAILURE jobs can be retried
  3. Setting the retriedBySuperUser flag for proper tracking

To verify all retry operations set the state to PENDING, run:


🏁 Script executed:

#!/bin/bash
# Check implementation of retry functionality across the codebase
echo "Checking for retryOne implementations:"
rg -A 10 "def retryOne" --type scala

echo "\nChecking for SQL UPDATE operations that modify job state:"
rg -A 3 "UPDATE webknossos.jobs.*SET state" --type scala

Length of output: 975


Verified Retry Functionality Implementation in Job.scala

The implementation in app/models/job/Job.scala (lines 258–265) has been confirmed to:

  • Check access via assertUpdateAccess.
  • Restrict retries strictly to jobs in the FAILURE state.
  • Update the job state to PENDING and set the retriedBySuperUser flag for audit tracking.

The search for the retryOne method and its SQL update shows the expected code snippet. However, the additional check for other SQL update operations modifying the job state produced no extra output. Please manually verify that no unintended SQL updates are present elsewhere in the repository.

app/controllers/JobController.scala (2)

67-67: Updated constructor with proper dependency injection.

Adding the UserService as a dependency follows the established pattern and provides the necessary functionality for super user verification.


115-123:

❓ Verification inconclusive

Well-implemented retry endpoint with proper authorization.

The implementation:

  1. Ensures jobs are enabled before proceeding
  2. Restricts the operation to super users only
  3. Returns an appropriate HTTP response (403 FORBIDDEN) when authorization fails
  4. Uses the jobDAO.retryOne method to perform the actual retry operation

To ensure this controller method is properly connected to the API route, run:


🏁 Script executed:

#!/bin/bash
# Verify that the retry route is properly defined in the routes file
echo "Looking for retry route configuration:"
rg "PATCH.*/jobs/.*retry" --type text

# Check if there are any other routes related to job retrying
echo "\nChecking for other job retry related routes:"
rg "jobs.*retry" --type text

Length of output: 318


Below is a new script that re-runs the route search without the unsupported file type argument. This should help verify whether the retry endpoint is actually configured in a routes file (typically in a file like conf/routes):

Manual Verification Needed: Confirm route configuration for JobController.retry endpoint
The implementation of the retry endpoint in app/controllers/JobController.scala looks well-implemented with proper authorization and error handling. However, since the original verification script did not produce the expected results due to an unsupported --type text flag, please run the updated script above to manually verify that the correct route for retrying jobs is defined in your routes file.

• Ensure that the HTTP method, path, and mapping in the routes file correctly connect to the retry action in the controller.
• Confirm that any existing API documentation or integration tests reflect this configuration.


🏁 Script executed:

#!/bin/bash
# Verify that the retry route is properly defined in the routes file(s)
echo "Searching for PATCH retry route configuration in conf/routes file(s):"
rg "PATCH.*\/jobs\/.*retry" conf/routes*

echo "Searching for any mention of 'retry' related to jobs in conf/routes file(s):"
rg "jobs.*retry" conf/routes*

Length of output: 608


Action: Verify the Routing Setup for the Retry Endpoint

The implementation in app/controllers/JobController.scala appears solid with proper authorization and error handling. However, our automated search for a corresponding route in the conf/routes file couldn’t produce definitive results due to regex parsing issues. Please verify manually that the route mapping for the retry action is correctly defined. In particular:

  • Ensure the HTTP method and path match: Confirm that the route (likely using PATCH or another method) targeting /jobs/<id>/retry is present in your conf/routes file.
  • Review integration tests or API documentation: Make sure that this mapping is documented and tested accordingly.
conf/evolutions/126-job-retried-by-super-user.sql (2)

1-4: Transaction Block and Schema Version Assertion Check
The migration starts with a transactional block and asserts that the current schema version is 125. This is a good safeguard to ensure the migration applies in the correct order. Please verify that the use of the ASSERT statement is supported in your target PostgreSQL version (as it is available from PostgreSQL 14 onward).


8-12: Recreating the Jobs View and Committing the Transaction
The steps to update the view (webknossos.jobs_) and commit the transaction are straightforward and correctly ordered. The view now selects records from the updated jobs table (including the new column) and excludes deleted entries.

tools/postgres/schema.sql (1)

23-24: Schema Version Update in Release Information
Line 23 updates the schemaVersion to 126 in the webknossos.releaseInformation table. This change is necessary to track the evolution of the schema correctly. Please ensure that this version update is coordinated with the corresponding evolution script.


DROP VIEW webknossos.jobs_;

ALTER TABLE webknossos.jobs DROP column resumedBySuperUser;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Field name inconsistency detected.

The column being dropped is named resumedBySuperUser, but in the Job model class (Job.scala), the field is named retriedBySuperUser. This inconsistency will cause mapping issues between the database and application model.

Update the column name to match the model:

-ALTER TABLE webknossos.jobs DROP column resumedBySuperUser;
+ALTER TABLE webknossos.jobs DROP column retriedBySuperUser;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ALTER TABLE webknossos.jobs DROP column resumedBySuperUser;
ALTER TABLE webknossos.jobs DROP column retriedBySuperUser;

@frcroth frcroth force-pushed the resume-failed-jobs branch from 20689dd to fa52211 Compare March 3, 2025 15:11
Copy link
Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still works, nice 🎉

Please seem my two comments (one is attached to an already existing thread).

After that I should be good to go 🏃

Comment on lines 307 to 316
<AsyncLink
href="#"
title="Restarts the workflow from the failed task, skipping and reusing artifacts from preceding tasks that were already successful."
onClick={async () => {
retryJob(job.id).then(() => fetchData());
}}
icon={<PlayCircleOutlined className="icon-margin-right" />}
>
Retry
</AsyncLink>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please move the text from the title prop to a wrapping Tooltip? This is more conventional with the codebase and matches the styling of the other tooltips.

Suggested change
<AsyncLink
href="#"
title="Restarts the workflow from the failed task, skipping and reusing artifacts from preceding tasks that were already successful."
onClick={async () => {
retryJob(job.id).then(() => fetchData());
}}
icon={<PlayCircleOutlined className="icon-margin-right" />}
>
Retry
</AsyncLink>
<Tooltip title="Restarts the workflow from the failed task, skipping and reusing artifacts from preceding tasks that were already successful.">
<AsyncLink
href="#"
title="Restarts the workflow from the failed task, skipping and reusing artifacts from preceding tasks that were already successful."
onClick={async () => {
retryJob(job.id).then(() => fetchData());
}}
icon={<PlayCircleOutlined className="icon-margin-right" />}
>
Retry
</AsyncLink>
</Tooltip>

@MichaelBuessemeyer
Copy link
Contributor

Awesome looks great 🎉

Did you do retesting a little or should I do this quickly? The changes are so minor that one person giving it a quick test should be enough IMO :)

Copy link
Member

@daniel-wer daniel-wer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could get this merged. I just tested the functionality and it worked nicely. Is there anything else holding this back?

Copy link
Contributor

@MichaelBuessemeyer MichaelBuessemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think as well that this would be very helpful now.

I'd give it a go 🐐

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
CHANGELOG.unreleased.md (3)

13-15: New Feature Entry: Consider Additional Details
The "Added" entry on line 14 clearly highlights that failed jobs may be resumed by super-users. Since the PR’s main objective is to transition a failed job from "FAILURE" to "PENDING," it might add clarity to explicitly mention the state change. This would help users quickly grasp the exact behavior introduced by this PR.


24-29: Duplicate "Removed" Sections and Mis-categorized Entries
There are two separate "Removed" sections (lines 24–29) which creates confusion. Moreover, the entries under these headers describe bug fixes (“Fixed a bug …”) rather than actual removals. Consider merging these into a single section and reclassifying them as "Fixed" if they indeed represent bug resolutions. For example, you might remove the second "### Removed" header and adjust the categorization as follows:

-### Removed
-
- - Fixed a bug where newly added remote datasets would always appear in root folder, regardless of actual selected folder. [#8425]
- - Fixed a bug where the python libs functionality `wk.RemoteDataset.explore_and_add_remote` would error. [#8425]
-
-### Removed
+### Fixed
+ - Fixed a bug where newly added remote datasets would always appear in root folder, regardless of the selected folder. [#8425]
+ - Fixed a bug where `wk.RemoteDataset.explore_and_add_remote` would error. [#8425]

This adjustment would improve clarity and consistency with the Keep a Changelog guidelines.

🧰 Tools
🪛 LanguageTool

[style] ~25-~25: ‘in the meantime’ might be wordy. Consider a shorter alternative.
Context: ... if some of its layers had been deleted in the meantime. [#8405](https://github.com/scalablemin...

(EN_WORDINESS_PREMIUM_IN_THE_MEANTIME)


[uncategorized] ~26-~26: You might be missing the article “the” here.
Context: ... remote datasets would always appear in root folder, regardless of actual selected f...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~26-~26: You might be missing the article “the” here.
Context: ...ys appear in root folder, regardless of actual selected folder. [#8425](https://github...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


31-32: Empty "Breaking Changes" Section
The "### Breaking Changes" header is currently empty. If there are no breaking changes in this release, consider either removing the header or adding a note such as "None" for better clarity and to prevent confusion for users scanning the changelog.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c18f73 and d6145f5.

📒 Files selected for processing (13)
  • CHANGELOG.unreleased.md (1 hunks)
  • CHANGELOG.unreleased.md (1 hunks)
  • frontend/javascripts/oxalis/view/left-border-tabs/layer_settings_tab.tsx (3 hunks)
  • CHANGELOG.unreleased.md (1 hunks)
  • CHANGELOG.unreleased.md (1 hunks)
  • CHANGELOG.unreleased.md (1 hunks)
  • CHANGELOG.unreleased.md (1 hunks)
  • CHANGELOG.unreleased.md (1 hunks)
  • frontend/javascripts/oxalis/model/accessors/dataset_layer_transformation_accessor.ts (6 hunks)
  • frontend/javascripts/oxalis/view/left-border-tabs/layer_settings_tab.tsx (2 hunks)
  • CHANGELOG.unreleased.md (0 hunks)
  • frontend/javascripts/oxalis/model/accessors/dataset_layer_transformation_accessor.ts (5 hunks)
  • frontend/javascripts/oxalis/view/left-border-tabs/layer_settings_tab.tsx (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • CHANGELOG.unreleased.md
  • CHANGELOG.unreleased.md
  • CHANGELOG.unreleased.md
  • CHANGELOG.unreleased.md
  • CHANGELOG.unreleased.md
  • CHANGELOG.unreleased.md
  • CHANGELOG.unreleased.md
🧰 Additional context used
🪛 LanguageTool
CHANGELOG.unreleased.md

[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...calableminds/webknossos/pull/8405) ### Fixed - Fixed a bug that would lock a non-existing ma...

(ENGLISH_WORD_REPEAT_RULE)

🔇 Additional comments (9)
frontend/javascripts/oxalis/view/left-border-tabs/layer_settings_tab.tsx (3)

133-133: The import update improves layer configuration specificity.

This import allows for retrieving layer-specific default configurations instead of using generic defaults, which will improve the precision of layer settings.


1015-1019: LGTM: Enhanced layer configuration with context-aware defaults.

This change improves the configuration system by using getSpecificDefaultsForLayer to obtain context-aware default values based on the dataset, layer name, and layer type, rather than using generic defaults.


1046-1046: Appropriate application of the layer-specific defaults.

The default alpha value now comes from the layer-specific defaults, which properly completes the implementation of the contextual default configuration system.

frontend/javascripts/oxalis/model/accessors/dataset_layer_transformation_accessor.ts (3)

194-196: Good defensive programming with standardized transform handling.

The new toIdentityTransformMaybe function centralizes the logic for safely returning either the provided transform (if it exists and isn't an identity transform) or a standardized identity transform. This prevents null returns and ensures consistent behavior.


217-217: Consistent application of the transform normalization logic.

The function calls to toIdentityTransformMaybe replace direct returns, ensuring all code paths consistently handle null or identity transformations in the same way. This increases code reliability and prevents potential null reference errors.

Also applies to: 228-228, 232-232, 273-273, 288-288


246-248: Renamed function better expresses its comparison purpose.

The function was renamed from isIdentityTransform to equalsIdentityTransform, which more accurately reflects that it's comparing a transform with the identity transform rather than checking some inherent property.

CHANGELOG.unreleased.md (3)

11-11: Commit Link Formatting is Clear
The commit comparison link on line 11 is correctly formatted and provides useful context for the unreleased changes.


16-18: Changed Section is Well Documented
The "Changed" entry on line 17 is clear, documenting that when using a zarr link as another layer’s source, the user’s token is used for access. This change is straightforward and appears consistent with the intended improvements.


19-23: Fixed Section: Clear Bug Fixes
The fixed entries on lines 20, 21, and 22 effectively describe the issues resolved, including mapping locks, alignment issues, and a saving bug. The descriptions provide adequate context for developers and users alike.

🧰 Tools
🪛 LanguageTool

[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...calableminds/webknossos/pull/8405) ### Fixed - Fixed a bug that would lock a non-existing ma...

(ENGLISH_WORD_REPEAT_RULE)

@frcroth frcroth merged commit c00aef9 into master Mar 12, 2025
3 checks passed
@frcroth frcroth deleted the resume-failed-jobs branch March 12, 2025 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add button to processing jobs page to resume failed jobs for superusers.
4 participants