
Move resource optimisation changes to production #2455

Merged: 1 commit merged into staging on Feb 13, 2025

Conversation

@BenjaminSsempala (Contributor) commented Feb 12, 2025

Summary of Changes (What does this PR do?)

  • Move resource optimisation changes to production

Summary by CodeRabbit

  • Chores
    • Optimized resource allocation by fine-tuning CPU and memory settings across deployments.
    • Refined scheduling configurations with updated node selection and affinity rules to improve workload distribution.
    • Adjusted autoscaling parameters to enhance scaling behavior and overall system stability.


coderabbitai bot commented Feb 12, 2025

📝 Walkthrough

The pull request updates multiple Kubernetes production configuration files. Each modified values-prod.yaml now specifies a nodeSelector of { role: control-plane } and replaces the previous node-type affinity criteria with a role-based preference. In addition, CPU and memory allocations have been revised across several modules, and autoscaling parameters have been updated for the netmanager and reports deployments. These changes standardize scheduling and resource management across deployments.

Changes

File(s): k8s/calibrate/values-prod.yaml, k8s/docs/values-prod.yaml
  • Resource updates: Calibrate CPU limit increased from 50m to 100m; Docs CPU limit increased from 50m to 200m.
  • Node selector: set to { role: control-plane }.
  • Affinity: previous node-type based preferences replaced with role-based matching (weight adjusted to 1).

File(s): k8s/inventory/values-prod.yaml, k8s/platform/values-prod.yaml
  • Resource updates: Inventory CPU request increased (5m → 10m), memory request decreased (60Mi → 20Mi), CPU limit reduced (100m → 50m); Platform memory limit reduced (700Mi → 350Mi), CPU and memory requests decreased (100m → 20m; 250Mi → 150Mi).
  • Node selector & affinity: both now use { role: control-plane } and convert affinity from node-type to role-based matching.

File(s): k8s/netmanager/values-prod.yaml, k8s/reports/values-prod.yaml
  • Resource & autoscaling updates: Netmanager maxReplicas decreased from 4 to 3; Reports CPU request (5m → 10m), memory request (60Mi → 150Mi), CPU limit (100m → 50m), memory limit (60Mi → 200Mi), and autoscaling adjusted (maxReplicas: 4 → 3, targetMemoryUtilizationPercentage: 70 → 80).
  • Node selector & affinity: both updated to use { role: control-plane }, with the affinity key changed from node-type to role (value changed from general-purpose to control-plane).
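
For orientation, a rough sketch of the scheduling stanza these summaries describe, assembled from the values quoted in this PR (surrounding keys and exact indentation in the individual charts may differ):

nodeSelector:
  role: control-plane
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - control-plane
        weight: 1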


Suggested reviewers

  • Baalmart
  • Psalmz777

Poem

In YAML realms where settings align,
CPU and memory dance in perfect time.
Node selectors now call for control-plane might,
Affinity rules reshaped and set just right.
Cheers to clear configs and clusters that soar—
Our code sings in harmony evermore! 🚀

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)
  • JIRA integration encountered authorization issues. Please disconnect and reconnect the integration in the CodeRabbit UI.


@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
k8s/reports/values-prod.yaml (1)

28-39: Consider impact of control-plane scheduling

Moving workloads to control-plane nodes:

  1. Could impact cluster management operations
  2. Might affect control-plane performance

Consider:

  • Maintaining some workload diversity across node types
  • Setting resource quotas for the workloads now scheduled onto control-plane nodes (a sketch follows this list)
  • Monitoring control-plane node metrics closely
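
A minimal sketch of the kind of quota the second bullet refers to, assuming the services run in the production namespace used by the kubectl commands later in this review; the name and figures are illustrative assumptions, not values from this PR:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: airqo-prod-quota   # hypothetical name
  namespace: production    # assumed namespace
spec:
  hard:
    requests.cpu: "2"      # illustrative ceilings; size these from observed usage
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi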
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 28-28: trailing spaces

(trailing-spaces)


[warning] 35-35: wrong indentation: expected 12 but found 10

(indentation)


[warning] 38-38: wrong indentation: expected 14 but found 12

(indentation)

k8s/calibrate/values-prod.yaml (1)

1-1: Implement monitoring and fallback strategy

With all services being moved to control-plane nodes and various resource limit changes, consider:

  1. Setting up detailed monitoring for control-plane node metrics
  2. Implementing pod disruption budgets
  3. Creating a fallback strategy for node scheduling if control-plane nodes become overloaded

Example Pod Disruption Budget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airqo-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: airqo-*  # note: matchLabels requires exact values; replace the wildcard with the chart's actual app label
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1e6daf and 85092b8.

📒 Files selected for processing (6)
  • k8s/calibrate/values-prod.yaml (2 hunks)
  • k8s/docs/values-prod.yaml (2 hunks)
  • k8s/inventory/values-prod.yaml (2 hunks)
  • k8s/netmanager/values-prod.yaml (2 hunks)
  • k8s/platform/values-prod.yaml (1 hunks)
  • k8s/reports/values-prod.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1)
k8s/docs/values-prod.yaml

[error] 31-31: trailing spaces

(trailing-spaces)


[warning] 39-39: wrong indentation: expected 12 but found 10

(indentation)


[warning] 42-42: wrong indentation: expected 14 but found 12

(indentation)

k8s/calibrate/values-prod.yaml

[error] 31-31: trailing spaces

(trailing-spaces)


[warning] 39-39: wrong indentation: expected 12 but found 10

(indentation)


[warning] 42-42: wrong indentation: expected 14 but found 12

(indentation)

k8s/inventory/values-prod.yaml

[error] 16-16: trailing spaces

(trailing-spaces)


[warning] 35-35: wrong indentation: expected 12 but found 10

(indentation)


[warning] 38-38: wrong indentation: expected 14 but found 12

(indentation)

k8s/reports/values-prod.yaml

[error] 28-28: trailing spaces

(trailing-spaces)


[warning] 35-35: wrong indentation: expected 12 but found 10

(indentation)


[warning] 38-38: wrong indentation: expected 14 but found 12

(indentation)

🔇 Additional comments (6)
k8s/reports/values-prod.yaml (2)

26-27: Verify autoscaling thresholds

The changes make autoscaling more conservative:

  • Reduced max replicas (3 instead of 4)
  • Higher memory utilization threshold (80% vs 70%)

This could impact service availability during high load.

Let's check historical scaling patterns:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check HPA events to verify if max replicas were ever reached
kubectl -n production describe hpa -l app=airqo-reports | grep -A 5 "Events:"

Length of output: 175


Verify Autoscaling Thresholds: Manual Verification Recommended

The automated script didn’t produce results because the environment lacked the kubectl command. Please verify the HPA events in a proper environment where kubectl is installed to determine if the autoscaling adjustments (3 max replicas and an 80% memory utilization threshold) are suitable for production loads.
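
For reference, a sketch of what the adjusted autoscaling stanza in k8s/reports/values-prod.yaml should roughly look like after this change; the enabled and minReplicas keys are assumptions based on common Helm chart layouts and are not part of this diff:

autoscaling:
  enabled: true                            # assumed key, not shown in this diff
  minReplicas: 1                           # assumed; only maxReplicas changed here
  maxReplicas: 3                           # reduced from 4
  targetMemoryUtilizationPercentage: 80    # raised from 70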


19-23: Review resource allocation changes

The changes show a more balanced resource profile:

  • CPU: Higher baseline (10m) with tighter ceiling (50m) suggests better resource utilization
  • Memory: Significant increase in both request (150Mi) and limit (200Mi) indicates addressing potential memory pressure

Let's verify if these memory limits are sufficient by checking the application's memory usage patterns:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Get memory usage metrics for the reports service
kubectl -n production top pod -l app=airqo-reports --containers

Length of output: 163


Memory Usage Verification Still Pending

The initial shell command didn't yield output—likely due to environment limitations with kubectl—so we can't conclusively verify the memory usage metrics automatically. Please run the memory metrics command in an environment where kubectl is available or verify manually that the updated limit of 200Mi sufficiently supports the application's usage profile.
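
Taken together, the reports resource block described above should now read roughly as follows (a sketch assembled from the values quoted in this review, not a copy of the file):

resources:
  requests:
    cpu: 10m        # raised from 5m
    memory: 150Mi   # raised from 60Mi
  limits:
    cpu: 50m        # lowered from 100m
    memory: 200Mi   # raised from 60Mi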

k8s/inventory/values-prod.yaml (1)

21-25: Verify aggressive memory reduction

The memory request has been significantly reduced (60Mi → 20Mi) while keeping a high limit (150Mi). This wide range might cause:

  1. More frequent pod evictions
  2. Unpredictable node scheduling

Let's check if the service can operate with this low memory:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check minimum memory usage of inventory pods
kubectl -n production top pod -l app=airqo-inventory --containers | sort -k4 -n | head -n 1

Length of output: 191


Action Required: Manual Verification on Memory Usage

The provided script did not return any memory usage data because kubectl is unavailable in the current environment. To confidently verify that the aggressive memory reduction from 60Mi to 20Mi is safe, please run the memory usage check in an environment where kubectl is accessible. This will help determine if the inventory pods are indeed operating within acceptable memory constraints.

k8s/netmanager/values-prod.yaml (1)

16-39: Configuration aligns with optimization strategy

The changes maintain consistency with other services:

  • Reduced max replicas (3)
  • Aligned node scheduling with control-plane preference
k8s/docs/values-prod.yaml (1)

21-23: Verify high CPU limit increase

The CPU limit has been quadrupled (50m → 200m), which is significantly higher than other services.

Let's check if this increase is justified:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check CPU usage patterns for docs service
kubectl -n production top pod -l app=airqo-prod-docs --containers | sort -k3 -n

Length of output: 179


Manual Verification Needed for CPU Usage

The kubectl script did not yield any output because kubectl is absent in this environment. As a result, we can't reliably determine if the 4× increase in CPU limit (50m → 200m) is warranted. Please manually verify the CPU usage of the docs service in the production environment or rerun the command in an appropriate environment where kubectl is available.

k8s/calibrate/values-prod.yaml (1)

21-43: Resource changes align with optimization strategy

The changes show balanced optimization:

  • Moderate CPU limit increase (100m)
  • Consistent node scheduling strategy
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 31-31: trailing spaces

(trailing-spaces)


[warning] 39-39: wrong indentation: expected 12 but found 10

(indentation)


[warning] 42-42: wrong indentation: expected 14 but found 12

(indentation)

Comment on lines 20 to +26
  resources:
    limits:
      cpu: 200m
-     memory: 700Mi
+     memory: 350Mi
    requests:
-     cpu: 100m
-     memory: 250Mi
+     cpu: 20m
+     memory: 150Mi

🛠️ Refactor suggestion

Consider adjusting the CPU request to prevent resource contention.

The current CPU request (20m) is quite low compared to its limit (200m), creating a 10:1 ratio. This wide gap could lead to resource contention issues in production. Consider increasing the CPU request to maintain a more balanced ratio (typically 3:1 to 4:1).

  requests:
-    cpu: 20m
+    cpu: 50m
    memory: 150Mi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  resources:
    limits:
      cpu: 200m
      memory: 350Mi
    requests:
-     cpu: 20m
+     cpu: 50m
      memory: 150Mi

Comment on lines +31 to +43
  nodeSelector:
    role: control-plane
  tolerations: []
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
-       - weight: 10
-         preference:
+       - preference:
            matchExpressions:
-             - key: node-type
+             - key: role
                operator: In
                values:
-                 - general-purpose
-       - weight: 1
-         preference:
-           matchExpressions:
-             - key: node-type
-               operator: In
-               values:
-                 - high-memory
+                 - control-plane
+         weight: 1

⚠️ Potential issue

Reconsider scheduling workloads on control-plane nodes.

The current configuration forces workload placement on control-plane nodes, which is not recommended for production:

  1. Control-plane nodes should be reserved for critical cluster components
  2. This setup could impact cluster stability and scalability
  3. The combination of nodeSelector (hard requirement) and preferred nodeAffinity is redundant

Consider one of these alternatives:

# Option 1: Use worker nodes (preferred)
nodeSelector:
-  role: control-plane
+  role: worker

# Option 2: If control-plane is necessary, use only soft affinity
-nodeSelector:
-  role: control-plane
+nodeSelector: {}
 affinity:
   nodeAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
       - preference:
           matchExpressions:
             - key: role
               operator: In
               values:
                 - control-plane
         weight: 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  nodeSelector:
-   role: control-plane
+   role: worker
  tolerations: []
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: role
                operator: In
                values:
                  - control-plane
          weight: 1

@Baalmart (Collaborator) left a comment

@Baalmart merged commit 06d2836 into staging on Feb 13, 2025
31 checks passed
@Baalmart deleted the resource-prod branch on February 13, 2025 at 04:54
@Baalmart mentioned this pull request on Feb 13, 2025 (2 tasks)