-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use rocksdb instead of in-memory map to reduce memory footprint #638
Conversation
WalkthroughThe pull request introduces a new dependency Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI Import Command
participant LoadFunc as load_variants_jsonl
participant RocksDB as RocksDB Instance
participant JSONL as JSONL Files
CLI->>LoadFunc: Call with variant files, DB path, options
LoadFunc->>JSONL: Read variant files
LoadFunc->>RocksDB: Open/Create temporary database
LoadFunc->>RocksDB: Update variants per release
LoadFunc-->>CLI: Return DB instance and releases
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/clinvar_genes/cli/import.rs (3)
93-93
: Consider wrapping partial writes in a transaction or providing rollback logic.
If an error occurs midway, partial data may remain in the database. This might not be critical here, but transactional support can avoid inadvertent partial states.Also applies to: 96-102, 105-105
123-158
: Evaluate potential concurrency collisions.
Repeatedly retrieving, deserializing, and re-serializing variants in a loop could lead to race conditions or overwrites if multiple threads process the same key concurrently. Consider a locking strategy or write batches to minimize conflicts.
202-202
: Efficiently merging keys from ephemeral DB.
Chaining DB keys with the in-memory sets is a direct solution for ensuring completeness. For huge datasets, though, an on-the-fly approach could reduce overall memory usage.Also applies to: 207-220
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (2)
Cargo.toml
(1 hunks)src/clinvar_genes/cli/import.rs
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Schema
- GitHub Check: Testing
- GitHub Check: Linting
- GitHub Check: build-and-push-image
🔇 Additional comments (6)
src/clinvar_genes/cli/import.rs (5)
8-12
: New imports appear coherent with the overall purpose.
Addingclap::Parser
,prost::Message
,rocksdb::{DBWithThreadMode, MultiThreaded}
, and so forth is aligned with the transition to RocksDB-based data handling.
89-90
: Good use of type aliases for improved readability.
DefiningVariantsPerGeneDb
andReleases
clarifies the intent of the data structures and simplifies referencing them throughout the code.
162-162
: Return structure is clear and consistent.
Returning both the RocksDB handle and the set of releases provides a clean way to proceed with subsequent operations.
188-194
: Temporary directory usage aligns well with ephemeral DB creation.
Usingtempfile::TempDir
inside the specified output path is a neat approach for local, transient data storage.
228-239
: Final assembly ofClinvarPerGeneRecord
is well-structured.
Gathering variant data per release and merging them with impact/frequency data is straightforward. Just be mindful of the potential overhead if the number of releases grows large.Cargo.toml (1)
58-58
: Dependency addition is appropriate for ephemeral file management.
Addingtempfile = "3.15.0"
aligns with the creation of temporary directories for rock-solid ephemeral database usage.
Closing in favour of #640 |
The CI of https://github.com/varfish-org/annonars-data-clinvar currently fails due to crossing memory limits of the runners. Therefore, we replace the in-memory indexmap used here with a temporary rocksdb instance. May need some fine-tuning, though.
Summary by CodeRabbit
Dependencies
tempfile
crate version 3.15.0 to project dependenciesImprovements
Technical Updates