fix: use rocksdb instead of in-memory map to reduce memory footprint #638

Closed
wants to merge 1 commit into from

Conversation

tedil
Contributor

@tedil tedil commented Jan 27, 2025

The CI of https://github.com/varfish-org/annonars-data-clinvar currently fails because it exceeds the memory limits of the runners. We therefore replace the in-memory indexmap used here with a temporary RocksDB instance. This may still need some fine-tuning, though.

Summary by CodeRabbit

  • Dependencies

    • Added tempfile crate version 3.15.0 to project dependencies
  • Improvements

    • Enhanced variant data loading and storage mechanism in RocksDB
    • Simplified data handling for variant information
    • Improved error management during JSON deserialization
  • Technical Updates

    • Updated method signatures for variant import functionality
    • Added new type aliases to support database operations

@tedil tedil requested a review from holtgrewe January 27, 2025 08:19

coderabbitai bot commented Jan 27, 2025

Walkthrough

The pull request introduces a new dependency tempfile in the Cargo.toml and significantly refactors the variant data loading mechanism in the src/clinvar_genes/cli/import.rs file. The changes focus on improving the RocksDB data storage approach by simplifying the data handling process, directly storing variants in the database, and modifying the function signatures to support more flexible database operations.

Changes

| File | Change Summary |
| --- | --- |
| Cargo.toml | Added tempfile = "3.15.0" dependency |
| src/clinvar_genes/cli/import.rs | Updated load_variants_jsonl function signature; added type aliases VariantsPerGeneDb and Releases; modified variant loading logic to use direct RocksDB storage; streamlined error handling for JSON deserialization |

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Import Command
    participant LoadFunc as load_variants_jsonl
    participant RocksDB as RocksDB Instance
    participant JSONL as JSONL Files

    CLI->>LoadFunc: Call with variant files, DB path, options
    LoadFunc->>JSONL: Read variant files
    LoadFunc->>RocksDB: Open/Create temporary database
    LoadFunc->>RocksDB: Update variants per release
    LoadFunc-->>CLI: Return DB instance and releases

Poem

🐰 In the realm of code, a tempfile springs,
Variants dance through RocksDB's strings,
Simplicity blooms, complexity wanes,
A rabbit's refactor, where efficiency reigns!
🔬 Hop, hop, data flows with grace! 🚀



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/clinvar_genes/cli/import.rs (3)

93-93: Consider wrapping partial writes in a transaction or providing rollback logic.
If an error occurs midway, partial data may remain in the database. This might not be critical here, but transactional support can avoid inadvertent partial states.

Also applies to: 96-102, 105-105
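One way to read the suggestion above is to stage all writes and apply them only after the whole input has parsed. The following is a minimal sketch of that idea, not the PR's code: `Store`, `Batch`, and `import_lines` are hypothetical names, a `BTreeMap` stands in for the database, and the `key=value` line format is an assumption made for illustration. With the rocksdb crate, a `WriteBatch` would likely play the role of `Batch`.

```rust
use std::collections::BTreeMap;

/// Stand-in for the key-value store (hypothetical; the PR uses RocksDB).
#[derive(Default)]
struct Store {
    data: BTreeMap<String, String>,
}

/// Writes staged in memory until the whole input has been validated.
#[derive(Default)]
struct Batch {
    puts: Vec<(String, String)>,
}

impl Store {
    /// Apply every staged write; nothing is visible before this call.
    fn commit(&mut self, batch: Batch) {
        for (key, value) in batch.puts {
            self.data.insert(key, value);
        }
    }
}

/// Parse `key=value` lines. On the first malformed line, return an error
/// and leave the store untouched.
fn import_lines(store: &mut Store, lines: &[&str]) -> Result<usize, String> {
    let mut batch = Batch::default();
    for (idx, line) in lines.iter().enumerate() {
        let (key, value) = line
            .split_once('=')
            .ok_or_else(|| format!("line {}: missing '='", idx + 1))?;
        batch.puts.push((key.to_string(), value.to_string()));
    }
    let count = batch.puts.len();
    store.commit(batch);
    Ok(count)
}
```

Staging everything keeps the batch in memory, which works against the goal of this PR for large inputs; committing in bounded chunks would trade some atomicity for a smaller footprint.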


123-158: Evaluate potential concurrency collisions.
Repeatedly retrieving, deserializing, and re-serializing variants in a loop could lead to race conditions or overwrites if multiple threads process the same key concurrently. Consider a locking strategy or write batches to minimize conflicts.
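
Whether the loop in this PR ever runs on several threads is not visible from this conversation. If it did, the locking strategy mentioned above could look roughly like the sketch below, where the whole read-modify-write happens under one lock. `GeneStore`, `append_variant`, and `append_concurrently` are hypothetical names, and a `Mutex<BTreeMap>` stands in for the database.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the per-gene store: gene symbol -> variant IDs (hypothetical).
type GeneStore = Arc<Mutex<BTreeMap<String, Vec<String>>>>;

/// Append one variant to a gene's list while holding the lock for the whole
/// read-modify-write, so concurrent appends cannot overwrite each other.
fn append_variant(store: &GeneStore, gene: &str, variant: &str) {
    let mut guard = store.lock().expect("store mutex poisoned");
    guard
        .entry(gene.to_string())
        .or_default()
        .push(variant.to_string());
}

/// Spawn `workers` threads that each append `per_worker` variants to one gene.
fn append_concurrently(store: &GeneStore, gene: &str, workers: usize, per_worker: usize) {
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let store = Arc::clone(store);
            let gene = gene.to_string();
            thread::spawn(move || {
                for i in 0..per_worker {
                    append_variant(&store, &gene, &format!("v{}-{}", w, i));
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
}
```

A single global lock serializes all updates; per-key locking or a single writer thread would be alternatives if contention mattered.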


202-202: Efficiently merging keys from ephemeral DB.
Chaining DB keys with the in-memory sets is a direct solution for ensuring completeness. For huge datasets, though, an on-the-fly approach could reduce overall memory usage.

Also applies to: 207-220
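
The "on-the-fly approach" could be a lazy merge that yields each distinct key once without first collecting all keys. The sketch below assumes both inputs are strictly ascending, which holds for RocksDB key iteration under the default comparator and for a `BTreeSet`, but not for a hash-based set. `MergedKeys` and `merge_sorted_keys` are hypothetical names, not part of the PR.

```rust
use std::cmp::Ordering;
use std::iter::Peekable;

/// Lazily merges two strictly ascending key streams, yielding each key once.
struct MergedKeys<A, B>
where
    A: Iterator<Item = String>,
    B: Iterator<Item = String>,
{
    left: Peekable<A>,
    right: Peekable<B>,
}

fn merge_sorted_keys<A, B>(left: A, right: B) -> MergedKeys<A, B>
where
    A: Iterator<Item = String>,
    B: Iterator<Item = String>,
{
    MergedKeys {
        left: left.peekable(),
        right: right.peekable(),
    }
}

impl<A, B> Iterator for MergedKeys<A, B>
where
    A: Iterator<Item = String>,
    B: Iterator<Item = String>,
{
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Decide which side holds the smaller head without consuming anything.
        let order = match (self.left.peek(), self.right.peek()) {
            (Some(l), Some(r)) => Some(l.cmp(r)),
            (Some(_), None) => Some(Ordering::Less),
            (None, Some(_)) => Some(Ordering::Greater),
            (None, None) => None,
        };
        match order? {
            Ordering::Less => self.left.next(),
            Ordering::Greater => self.right.next(),
            Ordering::Equal => {
                // Same key on both sides: drop one copy, yield the other.
                self.right.next();
                self.left.next()
            }
        }
    }
}
```

Only the two current head keys are held at any time, so memory use stays constant regardless of how many keys the streams contain.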

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2875a28 and 482b773.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • Cargo.toml (1 hunks)
  • src/clinvar_genes/cli/import.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Schema
  • GitHub Check: Testing
  • GitHub Check: Linting
  • GitHub Check: build-and-push-image
🔇 Additional comments (6)
src/clinvar_genes/cli/import.rs (5)

8-12: New imports appear coherent with the overall purpose.
Adding clap::Parser, prost::Message, rocksdb::{DBWithThreadMode, MultiThreaded}, and so forth is aligned with the transition to RocksDB-based data handling.


89-90: Good use of type aliases for improved readability.
Defining VariantsPerGeneDb and Releases clarifies the intent of the data structures and simplifies referencing them throughout the code.


162-162: Return structure is clear and consistent.
Returning both the RocksDB handle and the set of releases provides a clean way to proceed with subsequent operations.


188-194: Temporary directory usage aligns well with ephemeral DB creation.
Using tempfile::TempDir inside the specified output path is a neat approach for local, transient data storage.
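
To illustrate the cleanup-on-drop behaviour this comment relies on, here is a std-only sketch of a scratch directory created inside a given parent and removed when its guard goes out of scope. `ScratchDir` is a hypothetical stand-in; the PR itself uses the tempfile crate, whose `TempDir::new_in` additionally picks a unique random name.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Directory that is deleted when the guard goes out of scope (hypothetical).
struct ScratchDir {
    path: PathBuf,
}

impl ScratchDir {
    /// Create `<parent>/<name>`; fails if it already exists or the parent is missing.
    fn create_in(parent: &Path, name: &str) -> io::Result<Self> {
        let path = parent.join(name);
        fs::create_dir(&path)?;
        Ok(Self { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because Drop cannot fail.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

Placing the scratch directory under the output path keeps the temporary database on the same filesystem as the final output, at the cost of leaving debris there if the process is killed before the guard drops.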


228-239: Final assembly of ClinvarPerGeneRecord is well-structured.
Gathering variant data per release and merging them with impact/frequency data is straightforward. Just be mindful of the potential overhead if the number of releases grows large.

Cargo.toml (1)

58-58: Dependency addition is appropriate for ephemeral file management.
Adding tempfile = "3.15.0" aligns with the creation of temporary directories for rock-solid ephemeral database usage.

@tedil
Contributor Author

tedil commented Jan 30, 2025

Closing in favour of #640

@tedil tedil closed this Jan 30, 2025