failed to cast tokenizer #32

Open
frederickmannings opened this issue Jan 17, 2025 · 5 comments

frederickmannings commented Jan 17, 2025

Running a Hugging Face embedding pipeline inside a Go test panics with the following error:

thread '<unnamed>' panicked at 'failed to cast tokenizer', src/lib.rs:86:54
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x7e279629eb1c m=4 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 105 gp=0xc000104fc0 m=4 mp=0xc0000a3508 [syscall]:
runtime.cgocall(0x96d380, 0xc00029c688)
        /usr/local/go/src/runtime/cgocall.go:167 +0x4b fp=0xc00029c660 sp=0xc00029c628 pc=0x49002b
github.com/daulet/tokenizers._Cfunc_encode(0x0, 0x7e273408ab90, 0xc000422330)
        _cgo_gotypes.go:161 +0x6b fp=0xc00029c688 sp=0xc00029c660 pc=0x91d18b
github.com/daulet/tokenizers.(*Tokenizer).EncodeWithOptions.func2(0x0?, 0x7e273408ab90, 0xc0004223

The same process works just fine when running outside of tests. @daulet any ideas about what might be the issue here? tokenizers.a is present in /usr/lib/

@frederickmannings
Author

Using Hugot as the library to generate embeddings.

@frederickmannings
Author

Debug logs with full Rust backtrace:

$ RUST_BACKTRACE=full CGO_LDFLAGS="-L/usr/lib/" go test -v ./internal/embed
=== RUN   TestConfigOptions
=== RUN   TestConfigOptions/with_custom_model_dir
=== RUN   TestConfigOptions/with_custom_model_name
=== RUN   TestConfigOptions/with_both_options
--- PASS: TestConfigOptions (0.00s)
    --- PASS: TestConfigOptions/with_custom_model_dir (0.00s)
    --- PASS: TestConfigOptions/with_custom_model_name (0.00s)
    --- PASS: TestConfigOptions/with_both_options (0.00s)
=== RUN   TestEmbedder
=== RUN   TestEmbedder/initialisation


=== RUN   TestEmbedder/singleton_pattern
=== RUN   TestEmbedder/embedding_generation
thread '<unnamed>' panicked at 'failed to cast tokenizer', src/lib.rs:86:54
stack backtrace:
   0:           0xc9a6a1 - <unknown>
   1:           0xcf7f2f - <unknown>
   2:           0xc8f1c1 - <unknown>
   3:           0xc9a4b5 - <unknown>
   4:           0xc9d177 - <unknown>
   5:           0xc9cf64 - <unknown>
   6:           0xc9d6ec - <unknown>
   7:           0xc9d5e7 - <unknown>
   8:           0xc9aad6 - <unknown>
   9:           0xc9d332 - <unknown>
  10:           0x422363 - <unknown>
  11:           0x422323 - <unknown>
  12:           0x97638e - <unknown>
  13:           0x96d3b8 - <unknown>
  14:           0x49e2e4 - <unknown>
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x74a06aa9eb1c m=8 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 121 gp=0xc000104fc0 m=8 mp=0xc000580008 [syscall]:
runtime.cgocall(0x96d380, 0xc0000b1688)
        /usr/local/go/src/runtime/cgocall.go:167 +0x4b fp=0xc0000b1660 sp=0xc0000b1628 pc=0x49002b
github.com/daulet/tokenizers._Cfunc_encode(0x0, 0x749ff413a6d0, 0xc000598820)
        _cgo_gotypes.go:160 +0x6b fp=0xc0000b1688 sp=0xc0000b1660 pc=0x91d18b
github.com/daulet/tokenizers.(*Tokenizer).EncodeWithOptions.func2(0x0?, 0x749ff413a6d0, 0xc000598820)
        /home/fred/go/pkg/mod/github.com/daulet/[email protected]/tokenizer.go:368 +0x91 fp=0xc0000b1728 sp=0xc0000b1688 pc=0x91e0f1
github.com/daulet/tokenizers.(*Tokenizer).EncodeWithOptions(0xc0000a00a0, {0xeeb462?, 0x44543c?}, 0x1, {0xc00013c100, 0x3, 0x432ebe?})
        /home/fred/go/pkg/mod/github.com/daulet/[email protected]/tokenizer.go:368 +0x12d fp=0xc0000b1910 sp=0xc0000b1728 pc=0x91d94d
github.com/knights-analytics/hugot/pipelineBackends.tokenizeInputsRust(0xc0000b1d08, 0xc000206450, {0xc0000b1ed0, 0x1, 0x9214c6?})
        /home/fred/go/pkg/mod/github.com/knights-analytics/[email protected]/pipelineBackends/tokenizer_rust.go:55 +0x134 fp=0xc0000b1b88 sp=0xc0000b1910 pc=0x921934
github.com/knights-analytics/hugot/pipelineBackends.TokenizeInputs(...)
        /home/fred/go/pkg/mod/github.com/knights-analytics/[email protected]/pipelineBackends/tokenizer.go:34
github.com/knights-analytics/hugot/pipelines.(*FeatureExtractionPipeline).Preprocess(0xc0004460a0, 0xc0000b1d08, {0xc0000b1ed0, 0x1, 0x1})
        /home/fred/go/pkg/mod/github.com/knights-analytics/[email protected]/pipelines/featureExtraction.go:151 +0x7e fp=0xc0000b1bd8 sp=0xc0000b1b88 pc=0x95399e
github.com/knights-analytics/hugot/pipelines.(*FeatureExtractionPipeline).RunPipeline(0xc0004460a0, {0xc0002536d0?, 0x53fd5e?, 0xc000204820?})
        /home/fred/go/pkg/mod/github.com/knights-analytics/[email protected]/pipelines/featureExtraction.go:235 +0x11e fp=0xc0000b1da0 sp=0xc0000b1bd8 pc=0x95425e
github.com/Predixus/DynaRAG/internal/embed.(*Embedder).GetEmbeddings(0xfcfc78?, {0xc0002536d0?, 0x0?, 0x0?})
        /home/fred/git/predixus/DynaRAG/internal/embed/main.go:146 +0xc5 fp=0xc0000b1e18 sp=0xc0000b1da0 pc=0x968905
github.com/Predixus/DynaRAG/internal/embed.TestEmbedder.func4(0xc000204820)
        /home/fred/git/predixus/DynaRAG/internal/embed/main_test.go:168 +0x29e fp=0xc0000b1f70 sp=0xc0000b1e18 pc=0x9698de
testing.tRunner(0xc000204820, 0xc0002b00d8)
        /usr/local/go/src/testing/testing.go:1690 +0xf4 fp=0xc0000b1fc0 sp=0xc0000b1f70 pc=0x53b1f4
testing.(*T).Run.gowrap1()
        /usr/local/go/src/testing/testing.go:1743 +0x25 fp=0xc0000b1fe0 sp=0xc0000b1fc0 pc=0x53c1e5
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000b1fe8 sp=0xc0000b1fe0 pc=0x49e661
created by testing.(*T).Run in goroutine 24
        /usr/local/go/src/testing/testing.go:1743 +0x390

goroutine 1 gp=0xc0000061c0 m=nil [chan receive]:
runtime.gopark(0x210340?, 0x74a0240c1688?, 0x18?, 0x0?, 0xe51d20?)
        /usr/local/go/src/runtime/proc.go:424 +0xce fp=0xc00018f9c8 sp=0xc00018f9a8 pc=0x49650e
runtime.chanrecv(0xc000112a10, 0xc00018faaf, 0x1)
        /usr/local/go/src/runtime/chan.go:639 +0x41c fp=0xc00018fa40 sp=0xc00018f9c8 pc=0x42c7bc
runtime.chanrecv1(0x15dfd00?, 0xe09ce0?)
        /usr/local/go/src/runtime/chan.go:489 +0x12 fp=0xc00018fa68 sp=0xc00018fa40 pc=0x42c372
testing.(*T).Run(0xc0002044e0, {0xeeab88?, 0x0?}, 0xf2b0f8)
        /usr/local/go/src/testing/testing.go:1751 +0x3ab fp=0xc00018fb28 sp=0xc00018fa68 pc=0x53c08b
testing.runTests.func1(0xc0002044e0)
        /usr/local/go/src/testing/testing.go:2168 +0x37 fp=0xc00018fb68 sp=0xc00018fb28 pc=0x53e357
testing.tRunner(0xc0002044e0, 0xc00018fc70)
        /usr/local/go/src/testing/testing.go:1690 +0xf4 fp=0xc00018fbb8 sp=0xc00018fb68 pc=0x53b1f4
testing.runTests(0xc0001387f8, {0x159bf80, 0x2, 0x2}, {0x494390?, 0x493ffa?, 0x15e05a0?})
        /usr/local/go/src/testing/testing.go:2166 +0x43d fp=0xc00018fca0 sp=0xc00018fbb8 pc=0x53e23d
testing.(*M).Run(0xc0001db900)
        /usr/local/go/src/testing/testing.go:2034 +0x64a fp=0xc00018fed0 sp=0xc00018fca0 pc=0x53cc6a
main.main()
        _testmain.go:47 +0x9b fp=0xc00018ff50 sp=0xc00018fed0 pc=0x96b4bb
runtime.main()
        /usr/local/go/src/runtime/proc.go:272 +0x28b fp=0xc00018ffe0 sp=0xc00018ff50 pc=0x46052b
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00018ffe8 sp=0xc00018ffe0 pc=0x49e661

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:

@frederickmannings
Author

It's finding the file just fine. Seems to have an issue with parsing it.
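
One detail in the traces above may be worth checking before settling on a parsing problem: in both crashes the `_Cfunc_encode` frame prints `0x0` as its first argument. Assuming from the argument order that this is the tokenizer handle (an assumption, not something confirmed in this thread), a nil or already-released tokenizer reaching the Rust side would be another possible explanation for "failed to cast tokenizer". Go traceback arguments are not always exact, so treat this as a hint only. Below is a minimal stdlib-only sketch for pulling that argument out of a pasted trace; `firstCgoArg` is a hypothetical helper, not part of any library mentioned here.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// firstCgoArg scans a Go crash trace for the first frame whose function name
// contains fn (for example "_Cfunc_encode") and returns that frame's first
// argument exactly as the runtime printed it. ok is false when no frame matches.
// Hypothetical helper for illustration only.
func firstCgoArg(trace, fn string) (arg string, ok bool) {
	sc := bufio.NewScanner(strings.NewReader(trace))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		i := strings.Index(line, fn+"(")
		if i < 0 {
			continue
		}
		// Keep everything after "fn(" up to the first ',' or ')'.
		args := line[i+len(fn)+1:]
		if j := strings.IndexAny(args, ",)"); j >= 0 {
			args = args[:j]
		}
		return strings.TrimSpace(args), true
	}
	return "", false
}

func main() {
	// Frames copied from the first trace in this issue.
	trace := "runtime.cgocall(0x96d380, 0xc00029c688)\n" +
		"github.com/daulet/tokenizers._Cfunc_encode(0x0, 0x7e273408ab90, 0xc000422330)\n"

	arg, ok := firstCgoArg(trace, "_Cfunc_encode")
	if ok && arg == "0x0" {
		fmt.Println("first argument to _Cfunc_encode is nil; the handle may be unset or closed")
	}
}
```

If the handle really is nil, comparing the order of the `TestEmbedder` subtests against where the embedder is created and destroyed would be a natural next step.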

@daulet
Owner

daulet commented Feb 1, 2025

@frederickmannings can you share the config? Perhaps a minimized version, if you can reduce it to an essential repro?

@frederickmannings
Author

Thanks @daulet for replying. I will try to shrink this down to a minimal example.
