NC | Lifecycle | Status, Events, Timeout #8860

Open
romayalon wants to merge 2 commits into master from romy-lifecycle-status-events-timeout

Conversation

@romayalon (Contributor) commented Mar 9, 2025

Describe the Problem

The NC lifecycle process should -

  1. Write the lifecycle run status to a log file on finish/fail.
  2. Emit events on start, finish and failure.
  3. Have a configurable timeout, 6 hours by default.

Explain the Changes

  1. Added a config.NC_LIFECYLE_TIMEOUT_MS configuration, set by default to 6 hours, and added a P.race([lifecyle_run(), lifecycle_timeout()]) call in lifecycle_run_under_lock() (a sketch of the mechanism follows the status example below).
  2. Added 3 new events - LIFECYCLE_STARTED, LIFECYCLE_SUCCESSFUL and LIFECYCLE_FAILED - that are logged to events.log.
  3. Added lifecycle run status capture in nc_lifecycle, per run, bucket and rule. After every run a JSON file is written under /var/log/noobaa/lifecycle/.
    The name of the file is lifecycle_run_1741525171954.json, where 1741525171954 is the lifecycle run start timestamp.
    The following is an example of the timings and other statuses captured in a lifecycle run status:
{
  "running_host": "my-hostname1",
  "lifecycle_run_times": {
    "start_time": 1741529394579,
    "end_time": 1741529394597,
    "took_ms": 18,
    "buckets_list_start_time": 1741529394584,
    "buckets_list_end_time": 1741529394584,
    "buckets_list_took_ms": 0,
    "buckets_process_took_ms": 13
  },
  "errors": [],
  "buckets_statuses": [
    {
      "bucket_name": "bucket1",
      "bucket_process_times": {
        "start_time": 1741529394585,
        "end_time": 1741529394597,
        "took_ms": 12,
        "error": { "code": "", "message": "", "stack": "" }
      },
      "rules_statuses": [
        {
          "lifecycle_rule": "rule1",
          "rule_process_times": {
            "start_time": 1741529394587,
            "end_time": 1741529394597,
            "took_ms": 10,
            "list_candidates_start_time": 1741529394589,
            "list_candidates_took_ms": 0,
            "delete_candidates_start_time": 1741529394589,
            "delete_candidates_took_ms": 0,
            "update_last_sync_time": 1741529394589,
            "update_last_sync_took_ms": 8
          },
          "num_objects_deleted": 0,
          "num_objects_delete_failed": 0,
          "objects_delete_errors": [],
          "error": { "code": "", "message": "", "stack": "" }
        }
      ]
    }
  ]
}
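
For illustration, here is a minimal sketch of the timeout mechanism from change 1 above, using a plain Promise.race. The function bodies and the way the run function is passed in are assumptions for the sketch, not the actual nc_lifecycle.js code.

'use strict';

// Assumed default from change 1: 6 hours, in milliseconds.
const NC_LIFECYLE_TIMEOUT_MS = 6 * 60 * 60 * 1000;

// Rejects after timeout_ms with the same error text seen in the timeout manual test below.
function lifecycle_timeout(timeout_ms) {
    return new Promise((resolve, reject) => {
        setTimeout(() => reject(new Error('lifecycle worker reached timeout')), timeout_ms).unref();
    });
}

// Race the actual lifecycle run against the timeout, as described for lifecycle_run_under_lock().
async function lifecycle_run_under_lock(lifecycle_run) {
    return Promise.race([
        lifecycle_run(),
        lifecycle_timeout(NC_LIFECYLE_TIMEOUT_MS),
    ]);
}

If the timeout promise wins the race, the run rejects and is reported as failed, which matches the LifecycleFailed error shown in the timeout manual test below.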

Issues: Fixed #xxx / Gap #xxx

  1. Add automatic tests.
  2. Add docs.

Testing Instructions:

Status manual test -

  1. Create an account and a bucket, and set a lifecycle configuration on the bucket.
  2. Start the nsfs process - sudo node noobaa-core/src/cmd/nsfs.js --debug 5
  3. Run noobaa-cli lifecycle and check the status of the rule processing in the run status file (a small inspection sketch follows this list).
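
For step 3, a hypothetical inspection snippet (not part of this PR) that reads the newest lifecycle_run_*.json from /var/log/noobaa/lifecycle/ and prints per-rule results, assuming the status structure shown above:

'use strict';
const fs = require('fs');
const path = require('path');

const LIFECYCLE_LOGS_DIR = '/var/log/noobaa/lifecycle';

// Pick the newest run status file by the timestamp embedded in its name.
const latest = fs.readdirSync(LIFECYCLE_LOGS_DIR)
    .filter(name => /^lifecycle_run_\d+\.json$/.test(name))
    .sort()
    .pop();
if (!latest) throw new Error('no lifecycle run status file found');

const status = JSON.parse(fs.readFileSync(path.join(LIFECYCLE_LOGS_DIR, latest), 'utf8'));
for (const bucket of status.buckets_statuses) {
    for (const rule of bucket.rules_statuses) {
        console.log(bucket.bucket_name, rule.lifecycle_rule,
            'took_ms:', rule.rule_process_times.took_ms,
            'deleted:', rule.num_objects_deleted);
    }
}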

Events manual test -
Happy path -

  1. Start the nsfs process - sudo node noobaa-core/src/cmd/nsfs.js --debug 5
  2. Run noobaa-cli lifecycle and check the stderr logs (if running locally) or events.log for the following -
Mar-9 16:09:54.583 [noobaa-cli/91017] [EVENT]{"timestamp":"2025-03-09T14:09:54.580Z","host":"hostname1","event":{"code":"noobaa_lifecycle_worker_started","message":"NooBaa Lifecycle worker run started.","description":"NooBaa Lifecycle worker run started.","entity_type":"NODE","event_type":"INFO","scope":"NODE","severity":"INFO","state":"HEALTHY","pid":91017}}
Mar-9 16:09:54.597 [noobaa-cli/91017] [EVENT]{"timestamp":"2025-03-09T14:09:54.597Z","host":"hostname1","event":{"code":"noobaa_lifecycle_worker_finished_successfully","message":"NooBaa Lifecycle worker run finished successfully.","description":"NooBaa Lifecycle worker finished successfully.","entity_type":"NODE","event_type":"INFO","scope":"NODE","severity":"INFO","state":"HEALTHY","pid":91017}}

Sad path -

  1. Don't start the nsfs process.
  2. Run noobaa-cli lifecycle and check the stderr logs (if running locally) or events.log for the following -
Mar-9 16:48:22.733 [noobaa-cli/97456] [EVENT]{"timestamp":"2025-03-09T14:48:22.732Z","host":"hostname1","event":{"code":"noobaa_lifecycle_worker_started","message":"NooBaa Lifecycle worker run started.","description":"NooBaa Lifecycle worker run started.","entity_type":"NODE","event_type":"INFO","scope":"NODE","severity":"INFO","state":"HEALTHY","pid":97456}}
Mar-9 16:48:22.773 [noobaa-cli/97456] [EVENT]{"timestamp":"2025-03-09T14:48:22.773Z","host":"hostname1","event":{"code":"noobaa_lifecycle_worker_failed","message":"NooBaa Failed to run lifecycle worker.","description":"NooBaa Lifecycle worker run failed due to an error.","entity_type":"NODE","event_type":"ERROR","scope":"NODE","severity":"ERROR","state":"DEGRADED","pid":97456}}
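
The records above all share the same shape: a top-level timestamp and host plus a nested event object. As a rough, illustrative sketch only (this is not the actual noobaa events API), a record of that shape could be built like this:

'use strict';
const os = require('os');

// Illustrative only: builds a record shaped like the [EVENT] lines above.
// The real PR uses noobaa's events infrastructure; this just mirrors the JSON shape.
function build_lifecycle_event(code, message, description, severity) {
    return {
        timestamp: new Date().toISOString(),
        host: os.hostname(),
        event: {
            code,
            message,
            description,
            entity_type: 'NODE',
            event_type: severity === 'ERROR' ? 'ERROR' : 'INFO',
            scope: 'NODE',
            severity,
            state: severity === 'ERROR' ? 'DEGRADED' : 'HEALTHY',
            pid: process.pid,
        },
    };
}

// Example: the "started" event from the happy path above.
console.log('[EVENT]' + JSON.stringify(build_lifecycle_event(
    'noobaa_lifecycle_worker_started',
    'NooBaa Lifecycle worker run started.',
    'NooBaa Lifecycle worker run started.',
    'INFO')));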

Timeout manual test -

  1. Change config.NC_LIFECYLE_TIMEOUT_MS to 1 (ms).
  2. Run noobaa-cli lifecycle and expect the following error -
{
  "error": {
    "code": "LifecycleFailed",
    "message": "Lifecycle worker run failed.",
    "detail": {
      "message": "lifecycle worker reached timeout",
      "stack": "Error: lifecycle worker reached timeout\n    at Timeout._onTimeout (noobaa-core/src/manage_nsfs/nc_lifecycle.js:34:69)\n    at listOnTimeout (node:internal/timers:581:17)\n    at process.processTimers (node:internal/timers:519:7)"
    }
  }
}
  • Doc added/updated
  • Tests added

@romayalon force-pushed the romy-lifecycle-status-events-timeout branch from f7f4e64 to ccd4f4c on March 9, 2025 14:56
@romayalon requested review from guymguym and nadavMiz on March 9, 2025 15:20
@nadavMiz (Contributor) commented Mar 9, 2025

In the future we should probably use a wrapper function instead of having start_time / end_time bookkeeping all over the code.
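
For illustration, one possible shape for such a wrapper (a hypothetical helper, not code from this PR): it runs an async step and records the start/end/took timings into the status object in a single place.

'use strict';

// Hypothetical helper: run an async step and capture its timings under `times`
// with the given prefix, instead of scattering Date.now() calls around the code.
async function timed(times, prefix, step) {
    times[`${prefix}_start_time`] = Date.now();
    try {
        return await step();
    } finally {
        times[`${prefix}_end_time`] = Date.now();
        times[`${prefix}_took_ms`] = times[`${prefix}_end_time`] - times[`${prefix}_start_time`];
    }
}

// Usage sketch:
//   const lifecycle_run_times = {};
//   const buckets = await timed(lifecycle_run_times, 'buckets_list', () => list_buckets());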

await config_fs.create_dir_if_missing(lifecyle_logs_dir_path);
const lock_path = path.join(lifecyle_logs_dir_path, CLUSTER_LOCK);

await native_fs_utils.lock_and_run(config_fs.fs_context, lock_path, async () => {
This function got a bit messy. I think you should probably move the lambda function into a proper function with a relevant name.
