perf: increase parquet arrow buffer size #340
Conversation
I'm not sure this is sufficient; this configmap may not be mounted in all of the existing deployments.
Actionable comments posted: 0
🧹 Nitpick comments (1)
charts/operator-wandb/charts/app/templates/deployment.yaml (1)
311-313: Integration of Buffer Size Variable in App Deployment:
The new environment variable GORILLA_PARQUET_ARROW_BUFFER_SIZE with a value of "2147483648" (2GB) is consistently added in the app deployment template. This ensures uniform behavior across the system regarding Parquet operations. One suggestion: consider making this value configurable via the Helm values (if it isn't already) to allow easier tuning in the future without modifying the template directly.
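A sketch of what that suggestion could look like (the values key parquet.arrowBufferSize and its location under global are hypothetical names, not existing chart values):

```yaml
# values.yaml (hypothetical key, defaulting to the value this PR hardcodes)
global:
  parquet:
    arrowBufferSize: "2147483648"  # 2GB; small instances could lower this

# charts/app/templates/deployment.yaml (sketch)
env:
  - name: GORILLA_PARQUET_ARROW_BUFFER_SIZE
    value: {{ .Values.global.parquet.arrowBufferSize | default "2147483648" | quote }}
```

This would keep 2GB as the fleet-wide default while letting the tiniest instances dial it down from values.yaml instead of editing four templates.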
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- charts/operator-wandb/Chart.yaml (1 hunks)
- charts/operator-wandb/charts/app/templates/deployment.yaml (1 hunks)
- charts/operator-wandb/charts/executor/templates/deployment.yaml (1 hunks)
- charts/operator-wandb/charts/parquet/templates/cron.yaml (1 hunks)
- charts/operator-wandb/charts/parquet/templates/deployment.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- charts/operator-wandb/Chart.yaml
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-test
🔇 Additional comments (3)
charts/operator-wandb/charts/parquet/templates/cron.yaml (1)
129-131: New Environment Variable Addition in CronJob:
The new GORILLA_PARQUET_ARROW_BUFFER_SIZE is correctly added with a value of "2147483648" (2GB). This change clearly aligns with the performance objectives of reducing excessive row groups and minimizing overhead during Parquet exports.

charts/operator-wandb/charts/executor/templates/deployment.yaml (1)
160-162: Consistent Buffer Size Configuration in Executor Deployment:
The GORILLA_PARQUET_ARROW_BUFFER_SIZE environment variable has been added with the appropriate value of "2147483648" (2GB). This update is consistent with similar changes in other components, ensuring that the executor benefits from the intended performance improvement while writing Parquet files.

charts/operator-wandb/charts/parquet/templates/deployment.yaml (1)
190-192: Environment Variable Addition in Parquet Deployment:
The addition of the GORILLA_PARQUET_ARROW_BUFFER_SIZE variable with the value "2147483648" (2GB) is implemented correctly in the deployment configuration. This ensures that the service handling Parquet operations uses a larger buffer, which can help to reduce memory overhead during reads by consolidating row groups.
This is a variable that we want to set fleet-wide (outside of the tiniest instances).
The bigger buffer lets us write more data into fewer files and row groups. Each row group adds memory overhead when reading, so fewer files and row groups make reading much smoother.
The current default is 512MB, but that triggers a bit too easily (in some cases) due to how we measure the buffer size. Bumping to 2GB makes this more consistent, at the cost of higher memory consumption on export.
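The trade-off can be sketched with back-of-the-envelope arithmetic (a simplified model that assumes the writer flushes one row group each time the buffer fills; the 10 GiB export size is an illustrative number, not a W&B measurement):

```python
import math

def row_groups(export_bytes: int, buffer_bytes: int) -> int:
    """Rough row-group count if one group is flushed per full buffer."""
    return math.ceil(export_bytes / buffer_bytes)

export = 10 * 2**30                      # hypothetical 10 GiB export
print(row_groups(export, 512 * 2**20))   # 512 MiB buffer -> 20 row groups
print(row_groups(export, 2 * 2**30))     # 2 GiB buffer   ->  5 row groups
```

Fewer row groups means less per-group metadata and buffering for readers to manage, at the cost of the writer holding up to 2GB in memory during export.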
https://weightsandbiases.slack.com/archives/C010Y174QGH/p1738711339558679?thread_ts=1738455939.730879&cid=C010Y174QGH