
Move resource optimisation changes to production #2455

Merged: 1 commit merged into staging on Feb 13, 2025

Conversation

@BenjaminSsempala (Contributor) commented Feb 12, 2025

Summary of Changes (What does this PR do?)

  • Move resource optimisation changes to production

Summary by CodeRabbit

  • Chores
    • Optimized resource allocation by fine-tuning CPU and memory settings across deployments.
    • Refined scheduling configurations with updated node selection and affinity rules to improve workload distribution.
    • Adjusted autoscaling parameters to enhance scaling behavior and overall system stability.


coderabbitai bot commented Feb 12, 2025

📝 Walkthrough

The pull request updates multiple Kubernetes production configuration files. Each modified values-prod.yaml now specifies a nodeSelector of { role: control-plane } and replaces the previous node-type affinity criteria with a role-based preference. In addition, CPU and memory allocations have been revised across several modules, and autoscaling parameters have been updated for the netmanager and reports deployments. These changes standardize scheduling and resource management across deployments.

Changes

File(s): k8s/calibrate/values-prod.yaml, k8s/docs/values-prod.yaml
  • Resource updates: Calibrate CPU limit increased from 50m to 100m; Docs CPU limit increased from 50m to 200m.
  • Node selector: set to { role: control-plane }.
  • Affinity: previous node-type based preferences replaced with role-based matching (weight adjusted to 1).

File(s): k8s/inventory/values-prod.yaml, k8s/platform/values-prod.yaml
  • Resource updates: Inventory CPU request increased (5m → 10m), memory request decreased (60Mi → 20Mi), CPU limit reduced (100m → 50m); Platform memory limit reduced (700Mi → 350Mi), CPU and memory requests decreased (100m → 20m; 250Mi → 150Mi).
  • Node selector & affinity: both now use { role: control-plane } and convert affinity from node-type to role-based matching.

File(s): k8s/netmanager/values-prod.yaml, k8s/reports/values-prod.yaml
  • Resource & autoscaling updates: Netmanager maxReplicas decreased from 4 to 3; Reports CPU request (5m → 10m), memory request (60Mi → 150Mi), CPU limit (100m → 50m), memory limit (60Mi → 200Mi), and autoscaling adjusted (maxReplicas: 4 → 3, targetMemoryUtilizationPercentage: 70 → 80).
  • Node selector & affinity: both updated to use { role: control-plane }, with the affinity key changed from node-type to role (value changed from general-purpose to control-plane).
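
For orientation, a rough sketch of the scheduling stanza these summaries describe, assembled from the values quoted in this PR (surrounding keys and exact indentation in the individual charts may differ):

nodeSelector:
  role: control-plane
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - control-plane
        weight: 1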


Suggested reviewers

  • Baalmart
  • Psalmz777

Poem

In YAML realms where settings align,
CPU and memory dance in perfect time.
Node selectors now call for control-plane might,
Affinity rules reshaped and set just right.
Cheers to clear configs and clusters that soar—
Our code sings in harmony evermore! 🚀

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)
  • JIRA integration encountered authorization issues. Please disconnect and reconnect the integration in the CodeRabbit UI.


@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
k8s/reports/values-prod.yaml (1)

28-39: Consider impact of control-plane scheduling

Moving workloads to control-plane nodes:

  1. Could impact cluster management operations
  2. Might affect control-plane performance

Consider:

  • Maintaining some workload diversity across node types
  • Setting resource quotas for the workloads now scheduled onto control-plane nodes (a sketch follows this list)
  • Monitoring control-plane node metrics closely
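
A minimal sketch of the kind of quota the second bullet refers to, assuming the services run in the production namespace used by the kubectl commands later in this review; the name and figures are illustrative assumptions, not values from this PR:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: airqo-prod-quota   # hypothetical name
  namespace: production    # assumed namespace
spec:
  hard:
    requests.cpu: "2"      # illustrative ceilings; size these from observed usage
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi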
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 28-28: trailing spaces

(trailing-spaces)


[warning] 35-35: wrong indentation: expected 12 but found 10

(indentation)


[warning] 38-38: wrong indentation: expected 14 but found 12

(indentation)

k8s/calibrate/values-prod.yaml (1)

1-1: Implement monitoring and fallback strategy

With all services being moved to control-plane nodes and various resource limit changes, consider:

  1. Setting up detailed monitoring for control-plane node metrics
  2. Implementing pod disruption budgets
  3. Creating a fallback strategy for node scheduling if control-plane nodes become overloaded

Example Pod Disruption Budget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airqo-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: airqo-*  # note: matchLabels requires exact values; replace the wildcard with the chart's actual app label
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1e6daf and 85092b8.

📒 Files selected for processing (6)
  • k8s/calibrate/values-prod.yaml (2 hunks)
  • k8s/docs/values-prod.yaml (2 hunks)
  • k8s/inventory/values-prod.yaml (2 hunks)
  • k8s/netmanager/values-prod.yaml (2 hunks)
  • k8s/platform/values-prod.yaml (1 hunks)
  • k8s/reports/values-prod.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1)
k8s/docs/values-prod.yaml

[error] 31-31: trailing spaces

(trailing-spaces)


[warning] 39-39: wrong indentation: expected 12 but found 10

(indentation)


[warning] 42-42: wrong indentation: expected 14 but found 12

(indentation)

k8s/calibrate/values-prod.yaml

[error] 31-31: trailing spaces

(trailing-spaces)


[warning] 39-39: wrong indentation: expected 12 but found 10

(indentation)


[warning] 42-42: wrong indentation: expected 14 but found 12

(indentation)

k8s/inventory/values-prod.yaml

[error] 16-16: trailing spaces

(trailing-spaces)


[warning] 35-35: wrong indentation: expected 12 but found 10

(indentation)


[warning] 38-38: wrong indentation: expected 14 but found 12

(indentation)

k8s/reports/values-prod.yaml

[error] 28-28: trailing spaces

(trailing-spaces)


[warning] 35-35: wrong indentation: expected 12 but found 10

(indentation)


[warning] 38-38: wrong indentation: expected 14 but found 12

(indentation)

🔇 Additional comments (6)
k8s/reports/values-prod.yaml (2)

26-27: Verify autoscaling thresholds

The changes make autoscaling more conservative:

  • Reduced max replicas (3 instead of 4)
  • Higher memory utilization threshold (80% vs 70%)

This could impact service availability during high load.

Let's check historical scaling patterns:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check HPA events to verify if max replicas were ever reached
kubectl -n production describe hpa -l app=airqo-reports | grep -A 5 "Events:"

Length of output: 175


Verify Autoscaling Thresholds: Manual Verification Recommended

The automated script didn’t produce results because the environment lacked the kubectl command. Please verify the HPA events in a proper environment where kubectl is installed to determine if the autoscaling adjustments (3 max replicas and an 80% memory utilization threshold) are suitable for production loads.
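
For reference, a sketch of what the adjusted autoscaling stanza in k8s/reports/values-prod.yaml should roughly look like after this change; the enabled and minReplicas keys are assumptions based on common Helm chart layouts and are not part of this diff:

autoscaling:
  enabled: true                            # assumed key, not shown in this diff
  minReplicas: 1                           # assumed; only maxReplicas changed here
  maxReplicas: 3                           # reduced from 4
  targetMemoryUtilizationPercentage: 80    # raised from 70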


19-23: Review resource allocation changes

The changes show a more balanced resource profile:

  • CPU: Higher baseline (10m) with tighter ceiling (50m) suggests better resource utilization
  • Memory: Significant increase in both request (150Mi) and limit (200Mi) indicates addressing potential memory pressure

Let's verify if these memory limits are sufficient by checking the application's memory usage patterns:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Get memory usage metrics for the reports service
kubectl -n production top pod -l app=airqo-reports --containers

Length of output: 163


Memory Usage Verification Still Pending

The initial shell command didn't yield output—likely due to environment limitations with kubectl—so we can't conclusively verify the memory usage metrics automatically. Please run the memory metrics command in an environment where kubectl is available or verify manually that the updated limit of 200Mi sufficiently supports the application's usage profile.
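
Taken together, the reports resource block described above should now read roughly as follows (a sketch assembled from the values quoted in this review, not a copy of the file):

resources:
  requests:
    cpu: 10m        # raised from 5m
    memory: 150Mi   # raised from 60Mi
  limits:
    cpu: 50m        # lowered from 100m
    memory: 200Mi   # raised from 60Mi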

k8s/inventory/values-prod.yaml (1)

21-25: Verify aggressive memory reduction

The memory request has been significantly reduced (60Mi → 20Mi) while keeping a high limit (150Mi). This wide range might cause:

  1. More frequent pod evictions
  2. Unpredictable node scheduling

Let's check if the service can operate with this low memory:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check minimum memory usage of inventory pods
kubectl -n production top pod -l app=airqo-inventory --containers | sort -k4 -n | head -n 1

Length of output: 191


Action Required: Manual Verification on Memory Usage

The provided script did not return any memory usage data because kubectl is unavailable in the current environment. To confidently verify that the aggressive memory reduction from 60Mi to 20Mi is safe, please run the memory usage check in an environment where kubectl is accessible. This will help determine if the inventory pods are indeed operating within acceptable memory constraints.

k8s/netmanager/values-prod.yaml (1)

16-39: Configuration aligns with optimization strategy

The changes maintain consistency with other services:

  • Reduced max replicas (3)
  • Aligned node scheduling with control-plane preference
k8s/docs/values-prod.yaml (1)

21-23: Verify high CPU limit increase

The CPU limit has been quadrupled (50m → 200m), which is significantly higher than other services.

Let's check if this increase is justified:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check CPU usage patterns for docs service
kubectl -n production top pod -l app=airqo-prod-docs --containers | sort -k3 -n

Length of output: 179


Manual Verification Needed for CPU Usage

The kubectl script did not yield any output because kubectl is absent in this environment. As a result, we can't reliably determine if the 4× increase in CPU limit (50m → 200m) is warranted. Please manually verify the CPU usage of the docs service in the production environment or rerun the command in an appropriate environment where kubectl is available.

k8s/calibrate/values-prod.yaml (1)

21-43: Resource changes align with optimization strategy

The changes show balanced optimization:

  • Moderate CPU limit increase (100m)
  • Consistent node scheduling strategy
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 31-31: trailing spaces

(trailing-spaces)


[warning] 39-39: wrong indentation: expected 12 but found 10

(indentation)


[warning] 42-42: wrong indentation: expected 14 but found 12

(indentation)

Comment on lines 20 to +26
  resources:
    limits:
      cpu: 200m
-     memory: 700Mi
+     memory: 350Mi
    requests:
-     cpu: 100m
-     memory: 250Mi
+     cpu: 20m
+     memory: 150Mi

🛠️ Refactor suggestion

Consider adjusting the CPU request to prevent resource contention.

The current CPU request (20m) is quite low compared to its limit (200m), creating a 10:1 ratio. This wide gap could lead to resource contention issues in production. Consider increasing the CPU request to maintain a more balanced ratio (typically 3:1 to 4:1).

  requests:
-    cpu: 20m
+    cpu: 50m
    memory: 150Mi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  resources:
    limits:
      cpu: 200m
      memory: 350Mi
    requests:
-     cpu: 20m
+     cpu: 50m
      memory: 150Mi

Comment on lines +31 to +43
  nodeSelector:
    role: control-plane
  tolerations: []
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
-       - weight: 10
-         preference:
+       - preference:
            matchExpressions:
-             - key: node-type
+             - key: role
                operator: In
                values:
-                 - general-purpose
-       - weight: 1
-         preference:
-           matchExpressions:
-             - key: node-type
-               operator: In
-               values:
-                 - high-memory
+                 - control-plane
+         weight: 1

⚠️ Potential issue

Reconsider scheduling workloads on control-plane nodes.

The current configuration forces workload placement on control-plane nodes, which is not recommended for production:

  1. Control-plane nodes should be reserved for critical cluster components
  2. This setup could impact cluster stability and scalability
  3. The combination of nodeSelector (hard requirement) and preferred nodeAffinity is redundant

Consider one of these alternatives:

# Option 1: Use worker nodes (preferred)
nodeSelector:
-  role: control-plane
+  role: worker

# Option 2: If control-plane is necessary, use only soft affinity
-nodeSelector:
-  role: control-plane
+nodeSelector: {}
 affinity:
   nodeAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
       - preference:
           matchExpressions:
             - key: role
               operator: In
               values:
                 - control-plane
         weight: 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  nodeSelector:
-   role: control-plane
+   role: worker
  tolerations: []
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: role
                operator: In
                values:
                  - control-plane
          weight: 1

@Baalmart (Collaborator) left a comment

@Baalmart merged commit 06d2836 into staging on Feb 13, 2025
31 checks passed
@Baalmart deleted the resource-prod branch on February 13, 2025 at 04:54
@Baalmart mentioned this pull request on Feb 13, 2025 (2 tasks)