Description
After upgrading Trino from v471 to v475, queries through the BigQuery connector with bigquery.arrow-serialization.enabled set to true (the default value) fail with an OutOfMemoryException and the message "Unable to allocate buffer of size". The failure occurs when the connector allocates memory for Arrow record batches.
The issue does not occur when we set bigquery.arrow-serialization.enabled to false in the connector configuration, so the problem appears to be tied to the Arrow serialization path.
The query is a simple SELECT statement on a large BigQuery table. The worker node has 768 GiB of memory and the query is the only one running, so a lack of overall system resources is unlikely to be the cause. The error logs point to an allocation failure in the Arrow buffer allocator that the connector creates for the read stream, which suggests either a memory leak or an incorrectly sized memory limit in the connector's Arrow implementation.
Steps to Reproduce
- Configure the Trino BigQuery connector with the following settings:
  connector.name=bigquery
  bigquery.project-id=REDACTED
  bigquery.credentials-file=/etc/secrets/gcp/REDACTED.json
  bigquery.views-enabled=true
  bigquery.skip-view-materialization=true
  bigquery.metadata.parallelism=32
- Execute the following query:
  select * from "REDACTED"."REDACTED"."REDACTED" limit 5;
- Observe the OutOfMemoryException error in the worker logs.
Expected Behavior
The query should execute successfully, retrieving the data from BigQuery without memory allocation errors.
Actual Behavior
The query fails with an OutOfMemoryException and a "Memory was leaked by query" message. The log shows the Arrow buffer allocator for the read stream failing to reserve 262144 bytes against a limit of 557753 bytes with 361472 bytes already in use. All of these figures are far below the memory available on the node.
Environment
Worker Config and JVM
coordinator=false
http-server.http.port=8080
query.max-memory=1000000GB
query.max-memory-per-node=650GB
memory.heap-headroom-per-node=50GB
discovery.uri=http://trinoetl-cluster-trino:8080
internal-communication.shared-secret=REDACTED
exchange.http-client.max-content-length=2047MB
http-server.process-forwarded=true
query.max-execution-time=1h
jmx.rmiregistry.port=9080
jmx.rmiserver.port=9081
shutdown.grace-period=120s
-server
-agentpath:/usr/lib/trino/bin/libjvmkill.so
-Xmx700G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Allow loading dynamic agent used by JOL
-XX:+EnableDynamicAgentLoading
-Djava.security.krb5.conf=/etc/secrets/krb5.conf/REDACTED.conf
--add-opens=java.base/java.nio=ALL-UNNAMED
-Dcom.sun.management.jmxremote.rmi.port=9081
Coordinator Config and JVM
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=1000000GB
query.max-memory-per-node=1GB
discovery.uri=http://localhost:8080
http-server.authentication.type=PASSWORD
internal-communication.shared-secret=REDACTED
exchange.http-client.max-content-length=2047MB
http-server.process-forwarded=true
query.max-execution-time=1h
jmx.rmiregistry.port=9080
jmx.rmiserver.port=9081
shutdown.grace-period=120s
-server
-agentpath:/usr/lib/trino/bin/libjvmkill.so
-Xmx26G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Allow loading dynamic agent used by JOL
-XX:+EnableDynamicAgentLoading
-Djava.security.krb5.conf=/etc/secrets/krb5.conf/REDACTED.conf
--add-opens=java.base/java.nio=ALL-UNNAMED
-Dcom.sun.management.jmxremote.rmi.port=9081
System Information
- Trino Version: v475
- Java Version: Corretto-23.0.2.7.1 (build 23.0.2+7-FR)
- Platform: amd64
- Worker AWS VM Type: r6a.24xlarge (96 cores, 768 GiB)
- Hosted Platform: EKS
- System Load: The query was executed when no other query was running or scheduled.
Error Log
2025-08-13T08:47:53.891Z DEBUG SplitRunner-20250813_084752_00003_t2t8i.1.0.0-1-364 io.trino.plugin.bigquery.BigQueryStorageArrowPageSource Starting to read from projects/GCPPROJECT/locations/us/sessions/REDACTED/streams/REDACTED
2025-08-13T08:47:53.891Z DEBUG SplitRunner-20250813_084752_00003_t2t8i.1.0.0-1-364 io.trino.plugin.bigquery.ReadRowsHelper Reading rows from projects/GCPPROJECT/locations/us/sessions/GCPPROJECT/streams/REDACTED offset 0
2025-08-13T08:47:54.248Z DEBUG bigquery-bq_ibp_aws-1 io.trino.plugin.bigquery.ReadRowsHelper ReadRowsResponse from BigQuery: stats {
progress {
at_response_end: 0.008838748559355736
}
}
arrow_record_batch {
serialized_record_batch: "REDACTED"
}
row_count: 4096
arrow_schema {
serialized_schema: "REDACTED"
}
2025-08-13T08:47:54.249Z DEBUG SplitRunner-20250813_084752_00003_t2t8i.1.0.0-1-364 io.trino.plugin.bigquery.BigQueryStorageArrowPageSource Read 53324 bytes (total 53324) from projects/GCPPROJECT/locations/us/sessions/REDACTED/streams/REDACTED
2025-08-13T08:47:54.250Z WARN SplitRunner-20250813_084752_00003_t2t8i.1.0.0-1-364 io.trino.plugin.bigquery.BigQueryArrowBufferAllocator Failed to allocate 262144 bytes for allocator 'projects/GCPPROJECT/locations/us/sessions/GCPPROJECT/streams/REDACTED' due to FAILED_LOCAL
2025-08-13T08:47:54.250Z WARN SplitRunner-20250813_084752_00003_t2t8i.1.0.0-1-364 io.trino.plugin.bigquery.BigQueryArrowBufferAllocator Allocation failure details: Allocation outcome details:
allocator[projects/REDACTED/locations/us/sessions/GCPPROJECT/streams/REDACTED] reservation: 0 limit: 557753 used: 361472 requestedSize: 262144 allocatedSize: 0 localAllocationStatus: fail
2025-08-13T08:47:54.250Z ERROR SplitRunner-20250813_084752_00003_t2t8i.1.0.0-1-364 org.apache.arrow.memory.BaseAllocator Memory was leaked by query. Memory leaked: (361472)
Allocator(projects/REDACTED/locations/us/sessions/GCPPROJECT/streams/REDACTED) 0/361472/361472/557753 (res/actual/peak/limit)
2025-08-13T08:47:54.250Z DEBUG task-notification-1 io.trino.execution.TaskStateMachine Task 20250813_084752_00003_t2t8i.1.0.0 is FAILING
2025-08-13T08:47:54.250Z DEBUG SplitRunner-67 io.trino.execution.executor.dedicated.SplitProcessor Unable to allocate buffer of size 262144 (rounded from 138955) due to memory limit. Current allocation: 361472
org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 262144 (rounded from 138955) due to memory limit. Current allocation: 361472
at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:330)
at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:298)
at org.apache.arrow.compression.ZstdCompressionCodec.doDecompress(ZstdCompressionCodec.java:62)
at org.apache.arrow.vector.compression.AbstractCompressionCodec.decompress(AbstractCompressionCodec.java:78)
at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:128)
at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:88)
at io.trino.plugin.bigquery.BigQueryArrowToPageConverter.convert(BigQueryArrowToPageConverter.java:105)
at io.trino.plugin.bigquery.BigQueryStorageArrowPageSource.getNextSourcePage(BigQueryStorageArrowPageSource.java:126)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:269)
at io.trino.operator.Driver.processInternal(Driver.java:403)
at io.trino.operator.Driver.lambda$process$8(Driver.java:306)
at io.trino.operator.Driver.tryWithLock(Driver.java:709)
at io.trino.operator.Driver.process(Driver.java:298)
at io.trino.operator.Driver.processForDuration(Driver.java:269)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)
at io.trino.execution.executor.dedicated.SplitProcessor.run(SplitProcessor.java:77)
at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.lambda$run$0(TaskEntry.java:201)
at io.trino.$gen.Trino_475____20250813_083524_2.run(Unknown Source)
at io.trino.execution.executor.dedicated.TaskEntry$VersionEmbedderBridge.run(TaskEntry.java:202)
at io.trino.execution.executor.scheduler.FairScheduler.runTask(FairScheduler.java:177)
at io.trino.execution.executor.scheduler.FairScheduler.lambda$submit$0(FairScheduler.java:164)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:128)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:80)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1575)
Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (361472)
Allocator(projects/GCPPROJECT/locations/us/sessions/GCPPROJECT/streams/REDACTED) 0/361472/361472/557753 (res/actual/peak/limit)
at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:504)
at io.trino.plugin.bigquery.BigQueryStorageArrowPageSource.getNextSourcePage(BigQueryStorageArrowPageSource.java:124)
... 20 more
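
The figures in the log above can be checked with a little arithmetic. The following is a minimal sketch in plain Java, not Arrow or Trino code: the class and method names are hypothetical, the power-of-two rounding is inferred from the "rounded from 138955" message, and the "used + size <= limit" check is a simplified model of the allocator's limit accounting. Under those assumptions, the unrounded request would fit within the stream allocator's limit, while the rounded one does not.

```java
public final class ArrowAllocationCheck {
    // Values copied from the allocator log lines in this report.
    static final long LIMIT = 557_753;      // allocator limit
    static final long USED = 361_472;       // "Current allocation"
    static final long REQUESTED = 138_955;  // size before rounding

    private ArrowAllocationCheck() {}

    /** Rounds a positive size up to the next power of two (assumed rounding policy). */
    public static long roundUpToPowerOfTwo(long size) {
        if (size <= 1) {
            return 1;
        }
        return Long.highestOneBit(size - 1) << 1;
    }

    /** Simplified limit check: the request fits if it does not push usage past the limit. */
    public static boolean fits(long limit, long used, long size) {
        return used + size <= limit;
    }

    public static void main(String[] args) {
        long rounded = roundUpToPowerOfTwo(REQUESTED);
        System.out.println("rounded request:  " + rounded);                        // 262144
        System.out.println("headroom:         " + (LIMIT - USED));                 // 196281
        System.out.println("unrounded fits:   " + fits(LIMIT, USED, REQUESTED));   // true
        System.out.println("rounded fits:     " + fits(LIMIT, USED, rounded));     // false
        System.out.println("total if granted: " + (USED + rounded));               // 623616
    }
}
```

If this model is right, the failure would come from a per-stream limit that leaves too little headroom for rounded decompression buffers, not from exhaustion of worker memory. That is a reading of the log, not a confirmed root cause.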