Move resource optimisation changes to production #2455
Conversation
📝 Walkthrough: The pull request updates multiple Kubernetes production configuration files.
Actionable comments posted: 2
🧹 Nitpick comments (2)
k8s/reports/values-prod.yaml (1)
`28-39`: **Consider impact of control-plane scheduling**

Moving workloads to control-plane nodes:
- Could impact cluster management operations
- Might affect control-plane performance
Consider:
- Maintaining some workload diversity across node types
- Setting resource quotas for control-plane nodes
- Monitoring control-plane node metrics closely
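One way to bound what these workloads can consume is a namespace-level `ResourceQuota`. Note that quotas apply per namespace rather than per node, so this caps the total footprint of the workloads that now prefer control-plane nodes. The sketch below is illustrative only; the namespace name and figures are assumptions, not taken from this PR:

```yaml
# Illustrative sketch (namespace and limits are assumed, not from this PR):
# cap the aggregate requests/limits of the namespace whose workloads
# now schedule onto control-plane nodes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-workload-quota
  namespace: production
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```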
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 28-28: trailing spaces
(trailing-spaces)
[warning] 35-35: wrong indentation: expected 12 but found 10
(indentation)
[warning] 38-38: wrong indentation: expected 14 but found 12
(indentation)
k8s/calibrate/values-prod.yaml (1)
`1-1`: **Implement monitoring and fallback strategy**

With all services being moved to control-plane nodes and various resource limit changes, consider:
- Setting up detailed monitoring for control-plane node metrics
- Implementing pod disruption budgets
- Creating a fallback strategy for node scheduling if control-plane nodes become overloaded
Example Pod Disruption Budget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airqo-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: airqo-*  # placeholder; matchLabels does not support wildcards,
                    # so in practice use a concrete label per service
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- k8s/calibrate/values-prod.yaml (2 hunks)
- k8s/docs/values-prod.yaml (2 hunks)
- k8s/inventory/values-prod.yaml (2 hunks)
- k8s/netmanager/values-prod.yaml (2 hunks)
- k8s/platform/values-prod.yaml (1 hunk)
- k8s/reports/values-prod.yaml (1 hunk)
🧰 Additional context used
🪛 YAMLlint (1.35.1)
k8s/docs/values-prod.yaml
[error] 31-31: trailing spaces
(trailing-spaces)
[warning] 39-39: wrong indentation: expected 12 but found 10
(indentation)
[warning] 42-42: wrong indentation: expected 14 but found 12
(indentation)
k8s/calibrate/values-prod.yaml
[error] 31-31: trailing spaces
(trailing-spaces)
[warning] 39-39: wrong indentation: expected 12 but found 10
(indentation)
[warning] 42-42: wrong indentation: expected 14 but found 12
(indentation)
k8s/inventory/values-prod.yaml
[error] 16-16: trailing spaces
(trailing-spaces)
[warning] 35-35: wrong indentation: expected 12 but found 10
(indentation)
[warning] 38-38: wrong indentation: expected 14 but found 12
(indentation)
k8s/reports/values-prod.yaml
[error] 28-28: trailing spaces
(trailing-spaces)
[warning] 35-35: wrong indentation: expected 12 but found 10
(indentation)
[warning] 38-38: wrong indentation: expected 14 but found 12
(indentation)
🔇 Additional comments (6)
k8s/reports/values-prod.yaml (2)
`26-27`: **Verify autoscaling thresholds**

The changes make autoscaling more conservative:
- Reduced max replicas (3 instead of 4)
- Higher memory utilization threshold (80% vs 70%)
This could impact service availability during high load.
Let's check historical scaling patterns:
❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Check HPA events to verify if max replicas were ever reached
kubectl -n production describe hpa -l app=airqo-reports | grep -A 5 "Events:"
```

Length of output: 175
Verify Autoscaling Thresholds: Manual Verification Recommended
The automated script didn't produce results because the environment lacked the `kubectl` command. Please verify the HPA events in a proper environment where `kubectl` is installed to determine if the autoscaling adjustments (3 max replicas and an 80% memory utilization threshold) are suitable for production loads.
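For reference, the thresholds under discussion would look roughly like this in a Helm values file. This is a sketch only; the exact key names used by these charts are assumed, not copied from the PR:

```yaml
# Sketch of the autoscaling values under review (key names assumed):
autoscaling:
  minReplicas: 1
  maxReplicas: 3                          # reduced from 4
  targetMemoryUtilizationPercentage: 80   # raised from 70
```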
`19-23`: **Review resource allocation changes**

The changes show a more balanced resource profile:
- CPU: Higher baseline (10m) with tighter ceiling (50m) suggests better resource utilization
- Memory: Significant increase in both request (150Mi) and limit (200Mi) indicates addressing potential memory pressure
Let's verify if these memory limits are sufficient by checking the application's memory usage patterns:
❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Get memory usage metrics for the reports service
kubectl -n production top pod -l app=airqo-reports --containers
```

Length of output: 163
Memory Usage Verification Still Pending
The initial shell command didn't yield output, likely due to environment limitations with `kubectl`, so we can't conclusively verify the memory usage metrics automatically. Please run the memory metrics command in an environment where `kubectl` is available, or verify manually that the updated limit of 200Mi sufficiently supports the application's usage profile.

k8s/inventory/values-prod.yaml (1)
`21-25`: **Verify aggressive memory reduction**

The memory request has been significantly reduced (60Mi → 20Mi) while keeping a high limit (150Mi). This wide range might cause:
- More frequent pod evictions
- Unpredictable node scheduling
Let's check if the service can operate with this low memory:
❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Check minimum memory usage of inventory pods
kubectl -n production top pod -l app=airqo-inventory --containers | sort -k4 -n | head -n 1
```

Length of output: 191
Action Required: Manual Verification on Memory Usage
The provided script did not return any memory usage data because `kubectl` is unavailable in the current environment. To confidently verify that the aggressive memory reduction from 60Mi to 20Mi is safe, please run the memory usage check in an environment where `kubectl` is accessible. This will help determine if the inventory pods are indeed operating within acceptable memory constraints.

k8s/netmanager/values-prod.yaml (1)
`16-39`: **Configuration aligns with optimization strategy**

The changes maintain consistency with other services:
- Reduced max replicas (3)
- Aligned node scheduling with control-plane preference
k8s/docs/values-prod.yaml (1)
`21-23`: **Verify high CPU limit increase**

The CPU limit has been quadrupled (50m → 200m), which is significantly higher than for other services.
Let's check if this increase is justified:
❌ Verification inconclusive
🏁 Script executed:

```shell
#!/bin/bash
# Check CPU usage patterns for docs service
kubectl -n production top pod -l app=airqo-prod-docs --containers | sort -k3 -n
```

Length of output: 179
Manual Verification Needed for CPU Usage
The script did not yield any output because `kubectl` is absent in this environment. As a result, we can't reliably determine whether the 4× increase in the CPU limit (50m → 200m) is warranted. Please manually verify the CPU usage of the docs service in the production environment, or rerun the command in an environment where `kubectl` is available.

k8s/calibrate/values-prod.yaml (1)
`21-43`: **Resource changes align with optimization strategy**

The changes show balanced optimization:
- Moderate CPU limit increase (100m)
- Consistent node scheduling strategy
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 31-31: trailing spaces
(trailing-spaces)
[warning] 39-39: wrong indentation: expected 12 but found 10
(indentation)
[warning] 42-42: wrong indentation: expected 14 but found 12
(indentation)
```diff
 resources:
   limits:
     cpu: 200m
-    memory: 700Mi
+    memory: 350Mi
   requests:
-    cpu: 100m
-    memory: 250Mi
+    cpu: 20m
+    memory: 150Mi
```
🛠️ Refactor suggestion
Consider adjusting the CPU request to prevent resource contention.
The current CPU request (20m) is quite low compared to its limit (200m), creating a 10:1 ratio. This wide gap could lead to resource contention issues in production. Consider increasing the CPU request to maintain a more balanced ratio (typically 3:1 to 4:1).
```diff
 requests:
-  cpu: 20m
+  cpu: 50m
   memory: 150Mi
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
resources:
  limits:
    cpu: 200m
    memory: 350Mi
  requests:
    cpu: 50m
    memory: 150Mi
```
```diff
 nodeSelector:
   role: control-plane
 tolerations: []
 affinity:
   nodeAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
-      - weight: 10
-        preference:
+      - preference:
           matchExpressions:
-            - key: node-type
+            - key: role
               operator: In
               values:
-                - general-purpose
-      - weight: 1
-        preference:
-          matchExpressions:
-            - key: node-type
-              operator: In
-              values:
-                - high-memory
+                - control-plane
+        weight: 1
```
Reconsider scheduling workloads on control-plane nodes.
The current configuration forces workload placement on control-plane nodes, which is not recommended for production:
- Control-plane nodes should be reserved for critical cluster components
- This setup could impact cluster stability and scalability
- The combination of `nodeSelector` (a hard requirement) and a preferred `nodeAffinity` is redundant
Consider one of these alternatives:
```diff
 # Option 1: Use worker nodes (preferred)
 nodeSelector:
-  role: control-plane
+  role: worker
```

```diff
 # Option 2: If control-plane is necessary, use only soft affinity
-nodeSelector:
-  role: control-plane
+nodeSelector: {}
 affinity:
   nodeAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
       - preference:
           matchExpressions:
             - key: role
               operator: In
               values:
                 - control-plane
         weight: 1
```
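A related point, not shown above: kubeadm-style clusters normally protect control-plane nodes with a taint, so scheduling onto them also requires a matching toleration. The snippet below is a sketch that assumes the default kubeadm taint is present on these nodes:

```yaml
# Sketch (assumes the default kubeadm control-plane taint):
# pods without a matching toleration are never scheduled on tainted
# control-plane nodes, so a workload that must run there would need:
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```

Conversely, if the taint is absent (which the `tolerations: []` here suggests), re-adding it is a simple way to keep general workloads off the control plane by default.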
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
nodeSelector:
  role: worker
tolerations: []
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - control-plane
        weight: 1
```
thanks @BenjaminSsempala