FIX: Change type for body to LargeBinaryArray. #72

Open
d33tah wants to merge 2 commits into maxcountryman:main from d33tah:main

d33tah commented Sep 8, 2025

While migrating a large number of huge private WARC files, I noticed that a single batch can exceed the 32-bit offset limit of BinaryArray (its offsets are i32, so roughly 2 GiB of body data per array) even at the default batch size. Instead of changing the batch size, I changed the type of the body column to LargeBinaryArray, which uses 64-bit offsets. Hopefully this makes the project more stable in this specific case.
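As a rough illustration of the limit described above, the following std-only sketch checks whether a batch of record bodies can still be addressed with i32 offsets, and estimates the average body size at which a batch of a given size stops fitting. The helper names `fits_i32_offsets` and `max_avg_body_len` are hypothetical and are not part of warc-parquet or arrow; only the i32 offset width of `BinaryArray` is taken from Arrow itself.

```rust
/// Largest cumulative payload (in bytes) that i32 offsets can address.
/// Arrow's `BinaryArray` uses i32 offsets; `LargeBinaryArray` uses i64.
const MAX_I32_PAYLOAD: u64 = i32::MAX as u64;

/// Hypothetical helper: true when the cumulative length of all bodies
/// in one batch never exceeds what an i32 offset can represent.
fn fits_i32_offsets(body_lens: &[u64]) -> bool {
    let mut total: u64 = 0;
    for &len in body_lens {
        total = match total.checked_add(len) {
            Some(t) => t,
            None => return false, // even u64 overflowed
        };
        if total > MAX_I32_PAYLOAD {
            return false;
        }
    }
    true
}

/// Hypothetical helper: the largest average body size (in bytes) for which
/// `batch_size` records still fit; `None` for an empty batch.
fn max_avg_body_len(batch_size: u64) -> Option<u64> {
    MAX_I32_PAYLOAD.checked_div(batch_size)
}
```

Under this model, `max_avg_body_len(4000)` comes out at a little over half a megabyte per record, so a batch of 4000 records with larger bodies on average would no longer fit in a single i32-offset array.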

@maxcountryman (Owner)

Would be nice to have a test case that demonstrates the problem and ensures we don’t regress.

d33tah (Author) commented Sep 8, 2025

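@maxcountryman apologies, I tried to isolate a test case, but gave up after spending some time on it. For what it's worth, here is the output of two runs on the same input: with --batch-size=3500 the tool fails with UnexpectedEOB, and with --batch-size=4000 it panics with the backtrace below.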

root@jacek-experiment-warc-reencoding:/mnt# ( head -n1 2023-10-bug.txt ; cat 2023-10-bug-single2.txt ; echo ) | ~/warc-parquet/target/debug/warc-parquet --batch-size=3500> /tmp/a.txt
Error: UnexpectedEOB
root@jacek-experiment-warc-reencoding:/mnt# ( head -n1 2023-10-bug.txt ; cat 2023-10-bug-single2.txt ; echo ) | ~/warc-parquet/target/debug/warc-parquet --batch-size=4000> /tmp/a.txt

thread 'main' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-47.0.0/src/array/byte_array.rs:211:45:
offset overflow
stack backtrace:
   0:     0x559a5516b7a2 - std::backtrace_rs::backtrace::libunwind::trace::h9c1aa7b29a521839                     
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/../../backtrace/src/backtrace/libunwind.rs:117:9
   1:     0x559a5516b7a2 - std::backtrace_rs::backtrace::trace_unsynchronized::hb123c31478ec901c             
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/../../backtrace/src/backtrace/mod.rs:66:14
   2:     0x559a5516b7a2 - std::sys::backtrace::_print_fmt::hdda75a118fd2034a                                
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/backtrace.rs:66:9
   3:     0x559a5516b7a2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hf435e8e9347709a8
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/backtrace.rs:39:26
   4:     0x559a5518ca93 - core::fmt::rt::Argument::fmt::h9802ea71fd88c728
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/fmt/rt.rs:173:76
   5:     0x559a5518ca93 - core::fmt::write::h0a51fad3804c5e7c
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/fmt/mod.rs:1465:25
   6:     0x559a551693e3 - std::io::default_write_fmt::h7e00b0a8732ee2a2
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/io/mod.rs:639:11
   7:     0x559a551693e3 - std::io::Write::write_fmt::h9759e4151bf4a45e
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/io/mod.rs:1954:13
   8:     0x559a5516b5f2 - std::sys::backtrace::BacktraceLock::print::h1ec5ce5bb8ee285e
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/backtrace.rs:42:9
   9:     0x559a5516c676 - std::panicking::default_hook::{{closure}}::h5ffefe997a3c75e4
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:300:27
  10:     0x559a5516c479 - std::panicking::default_hook::h820c77ba0601d6bb
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:327:9
  11:     0x559a5516d002 - std::panicking::rust_panic_with_hook::h8b29cbe181d50030
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:833:13
  12:     0x559a5516cdba - std::panicking::begin_panic_handler::{{closure}}::h9f5b6f6dc6fde83e
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:706:13
  13:     0x559a5516bca9 - std::sys::backtrace::rust_end_short_backtrace::hd7b0c344383b0b61
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/backtrace.rs:168:18
  14:     0x559a5516ca4d - rustc[5224e6b81cd82a8f]::rust_begin_unwind
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
  15:     0x559a544fcc80 - core::panicking::panic_fmt::hc49fc28484033487
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
  16:     0x559a544fcc5b - core::panicking::panic_display::h6b4caeeb29cac1c9
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:268:5
  17:     0x559a544fcc5b - core::option::expect_failed::hfe7afbd436ce9c45
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/option.rs:2081:5
  18:     0x559a54fded91 - core::option::Option<T>::expect::hb33ecd65b45c9065
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/option.rs:960:21
  19:     0x559a54559e02 - arrow_array::array::byte_array::GenericByteArray<T>::from_iter_values::h14acbf5f5ba0536b
                               at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-47.0.0/src/array/byte_array.rs:211:45
  20:     0x559a5455a78e - arrow_array::array::binary_array::<impl core::convert::From<alloc::vec::Vec<&[u8]>> for arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericBinaryType<OffsetSize>>>::from::h8240cfa630739d70
                               at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-47.0.0/src/array/binary_array.rs:115:9
  21:     0x559a5454aa09 - warc_parquet::reader::build_record_batch::h0c7775c4f9e02aa4
                               at /root/warc-parquet/src/reader.rs:419:26
  22:     0x559a54534a99 - <warc_parquet::reader::IterReader<R> as core::iter::traits::iterator::Iterator>::next::hd4560ad26e0df4b1
                               at /root/warc-parquet/src/reader.rs:183:18
  23:     0x559a54526461 - warc_parquet::write_row_groups::h41332abf56f5e194
                               at /root/warc-parquet/src/main.rs:88:25
  24:     0x559a545102c3 - warc_parquet::main::hb284ea0f950d46f1
                               at /root/warc-parquet/src/main.rs:128:9
  25:     0x559a5452270b - core::ops::function::FnOnce::call_once::heac3c12f3a96ca20
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/ops/function.rs:250:5
  26:     0x559a5451abee - std::sys::backtrace::rust_begin_short_backtrace::h002c9990de245e45
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/sys/backtrace.rs:152:18
  27:     0x559a54527f41 - std::rt::lang_start::{{closure}}::h387e11f190645d88
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/rt.rs:206:18
  28:     0x559a55163a70 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hf19f6f3c4f0cdb1c
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/ops/function.rs:284:21
  29:     0x559a55163a70 - std::panicking::catch_unwind::do_call::hdc689d1fa1f67ace
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:589:40
  30:     0x559a55163a70 - std::panicking::catch_unwind::h1025d97250558c4b
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:552:19
  31:     0x559a55163a70 - std::panic::catch_unwind::h3f76beef3f07b6dc
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panic.rs:359:14
  32:     0x559a55163a70 - std::rt::lang_start_internal::{{closure}}::haf71a34e0fbc4d76
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/rt.rs:175:24
  33:     0x559a55163a70 - std::panicking::catch_unwind::do_call::hbd7dad3d92d409ee 
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:589:40   
  34:     0x559a55163a70 - std::panicking::catch_unwind::h69749cff2ef3daa8                     
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:552:19    
  35:     0x559a55163a70 - std::panic::catch_unwind::ha18d8f0ab15c4858        
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panic.rs:359:14
  36:     0x559a55163a70 - std::rt::lang_start_internal::h31bbb7f936fd6b5d                                                                                                                                                                                   
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/rt.rs:171:5            
  37:     0x559a54527f27 - std::rt::lang_start::h8e6f93d87e5c7b14                  
                               at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/rt.rs:205:5        
  38:     0x559a5451253e - main                                           
  39:     0x7f43733b2ca8 - <unknown>                                                                                 
  40:     0x7f43733b2d65 - libc_start_main                          
  41:     0x559a544fd761 - _start                                                                                
  42:                0x0 - <unknown>                                                   
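The trace above ends in an `Option::expect` inside `GenericByteArray::from_iter_values`, reached from `build_record_batch` through the `From<Vec<&[u8]>>` conversion, with the message "offset overflow". A minimal std-only model of that failure mode is sketched below, assuming that each offset is the running byte total converted into the array's offset type. `build_offsets` and `OffsetOverflow` are hypothetical names for illustration, and this is not the arrow-array source. The example values assume a 64-bit target.

```rust
use std::convert::TryFrom;

/// Hypothetical error: which record pushed the running total past the
/// range of the offset type, and what that total was.
#[derive(Debug, PartialEq)]
struct OffsetOverflow {
    record_index: usize,
    total_bytes: usize,
}

/// Hypothetical model of offset construction for a variable-length binary
/// column: offsets[0] = 0 and offsets[i + 1] = offsets[i] + len(body i).
/// `O` stands in for the offset width (i32 or i64).
fn build_offsets<O>(body_lens: &[usize]) -> Result<Vec<O>, OffsetOverflow>
where
    O: TryFrom<usize>,
{
    let mut offsets = Vec::with_capacity(body_lens.len() + 1);
    let mut total: usize = 0;
    let zero = O::try_from(0usize).map_err(|_| OffsetOverflow {
        record_index: 0,
        total_bytes: 0,
    })?;
    offsets.push(zero);
    for (i, &len) in body_lens.iter().enumerate() {
        total = total.checked_add(len).ok_or(OffsetOverflow {
            record_index: i,
            total_bytes: usize::MAX,
        })?;
        // Fails once the running total no longer fits the offset type.
        let offset = O::try_from(total).map_err(|_| OffsetOverflow {
            record_index: i,
            total_bytes: total,
        })?;
        offsets.push(offset);
    }
    Ok(offsets)
}
```

With three bodies of 1,000,000,000 bytes each, `build_offsets::<i32>` reports an overflow at the third record, while `build_offsets::<i64>` succeeds. That is the same difference the switch from BinaryArray to LargeBinaryArray relies on.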

d33tah (Author) commented Sep 8, 2025

@maxcountryman

I sent you an e-mail with an 84 MB .warc.zst file (3.1 GB uncompressed). Please let me know whether you can reproduce the issue and whether the file is of any help to you.
