Commit d131df2

Update cargo docs, update crate to version 0.3.0 for publishing (#20)
* Updates to cargo docs
* Skip second README doc test (requires first to have run and execution order isn't guaranteed)
* Update to version 0.3.0 for publishing to Cargo

  List @Keats, @GSGerritsen, and @boydgreenfield as maintainers.
* Update to mmap-bitvec 0.4.1

  This fixes an issue introduced by rust-lang/rust#98112 in 1.70+ that otherwise breaks pointer dereferencing in `mmap-bitvec`.
* Ignore notebook .python-version files
1 parent 1604535 commit d131df2

7 files changed: +80 -48 lines changed


.github/workflows/ci.yml

+6-6
@@ -2,20 +2,20 @@ name: CI
 on:
   push:
     branches:
-      - master
+      - main
   pull_request:

 jobs:
   tests:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@master
+        uses: actions/checkout@main

       - uses: actions-rs/toolchain@v1
         with:
           profile: minimal
-          toolchain: 1.60.0
+          toolchain: stable
           override: true

       - name: version info
@@ -28,7 +28,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@master
+        uses: actions/checkout@main

       - uses: actions-rs/toolchain@v1
         with:
@@ -46,7 +46,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@master
+        uses: actions/checkout@main

       - uses: actions-rs/toolchain@v1
         with:
@@ -63,7 +63,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@master
+        uses: actions/checkout@main

       - uses: actions-rs/toolchain@v1
         with:

.gitignore

+1
@@ -4,3 +4,4 @@ Cargo.lock
 .DS_Store
 .idea/
 old/
+docs/notebook/.python-version

Cargo.toml

+15-3
@@ -1,12 +1,24 @@
 [package]
 name = "bfield"
-version = "0.2.1"
-authors = ["Roderick Bovee <[email protected]>"]
+description = "B-field datastructure implementation in Rust"
+version = "0.3.0"
+authors = ["Vincent Prouillet <[email protected]>", "Gerrit Gerritsen <[email protected]>", "Nick Greenfield <[email protected]>"]
+homepage = "https://github.com/onecodex/rust-bfield/"
+repository = "https://github.com/onecodex/rust-bfield/"
+readme = "README.md"
+keywords = ["B-field", "probabilistic data structures"]
+categories = ["data-structures"]
 edition = "2018"
+license = "Apache 2.0"
+exclude = [
+    ".gitignore",
+    ".github/*",
+    "docs/*",
+]

 [dependencies]
 bincode = "1"
-mmap-bitvec = "0.4.0"
+mmap-bitvec = "0.4.1"
 murmurhash3 = "0.0.5"
 serde = { version = "1.0", features = ["derive"] }
 once_cell = "1.3.1"

README.md

+4-2
@@ -1,10 +1,12 @@
 # `rust-bfield`, an implementation of the B-field probabilistic key-value data structure

+[![Crates.io Version](https://img.shields.io/crates/v/bfield.svg)](https://crates.io/crates/bfield)
+
 The B-field is a novel, probabilistic data structure for storing key-value pairs (or, said differently, it is a probabilistic associative array or map). B-fields support insertion (`insert`) and lookup (`get`) operations, and share a number of mathematical and performance properties with the well-known [Bloom filter](https://doi.org/10.1145/362686.362692).

 At [One Codex](https://www.onecodex.com), we use the `rust-bfield` crate in bioinformatics applications to efficiently store associations between billions of $k$-length nucleotide substrings (["k-mers"](https://en.wikipedia.org/wiki/K-mer)) and [their taxonomic identity](https://www.ncbi.nlm.nih.gov/taxonomy) _**using only 6-7 bytes per `(kmer, value)` pair**_ for up to 100,000 unique taxonomic IDs (distinct values) and a 0.1% error rate. We hope others are able to use this library (or implementations in other languages) for applications in bioinformatics and beyond.

-> _Note: In the [Implementation Details](#implementation-details) section below, we detail the use of this B-field implementation in Rust and use `code` formatting and English parameter names (e.g., we discuss the B-field being a data structure for storing `(key, value)` pairs). In the following [Formal Data Structure Details](#formal-data-structure-details) section, we detail the design and mechanics of the B-field using mathematical notation (i.e., we discuss it as an associate array mapping a set of_ $(x, y)$ _pairs). The generated Rust documentation includes both notations for ease of reference._
+> _Note: In the [Implementation Details](#implementation-details) section below, we detail the use of this B-field implementation in Rust and use `code` formatting and English parameter names (e.g., we discuss the B-field being a data structure for storing `(key, value)` pairs). In the following [Formal Data Structure Details](#formal-data-structure-details) section, we detail the design and mechanics of the B-field using mathematical notation (i.e., we discuss it as an associate array mapping a set of_ $(x, y)$ _pairs). The [generated Rust documentation](https://docs.rs/bfield/latest/bfield/) includes both notations for ease of reference._

 ## Implementation Details

@@ -73,7 +75,7 @@ for p in 0..4u32 {

 * After creation, a B-field can optionally be loaded from a directory containing the produced `mmap` and related files with the `load` function. And once created or loaded, a B-field can be directly queried using the `get` function, which will either return `None`, `Indeterminate`, or `Some(BFieldValue)` (which is currently an alias for `Some(u32)` see [limitations](#⚠️-current-limitations-of-the-rust-bfield-implementation) below for more details):

-```rust
+```rust no_run
 use bfield::BField;

 // Load based on filename of the first array ".0.bfd"

src/bfield.rs

+43-34
@@ -7,7 +7,7 @@ use serde::Serialize;

 use crate::bfield_member::{BFieldLookup, BFieldMember, BFieldVal};

-/// The struct holding the various bfields
+/// The `struct` holding the `BField` primary and secondary bit arrays.
 pub struct BField<T> {
     members: Vec<BFieldMember<T>>,
     read_only: bool,
@@ -18,18 +18,26 @@ unsafe impl<T> Send for BField<T> {}
 unsafe impl<T> Sync for BField<T> {}

 impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
-    /// The (complicated) method to create a bfield.
-    /// The bfield files will be created in `directory` with the given `filename` and the
-    /// suffixes `(0..n_secondaries).bfd`
-    /// `size` is the primary bfield size, subsequent bfield sizes will be determined by
-    /// `secondary_scaledown` and `max_scaledown`.
-    /// If you set `in_memory` to true, remember to call `persist_to_disk` when it's built to
+    /// A (rather complex) method for creating a `BField`.
+    ///
+    /// This will create a series of `BField` bit array files in `directory` with the given `filename` and the
+    /// suffixes `(0..n_secondaries).bfd`. If you set `in_memory` to true, remember to call `persist_to_disk` once it's built to
     /// save it.
-    /// The params are the following in the paper:
-    /// `n_hashes` -> k
-    /// `marker_width` -> v (nu)
-    /// `n_marker_bits` -> κ (kappa)
-    /// `secondary_scaledown` -> β (beta)
+    ///
+    /// The following parameters are required. See the [README.md](https://github.com/onecodex/rust-bfield/)
+    /// for additional details as well as the
+    /// [parameter selection notebook](https://github.com/onecodex/rust-bfield/blob/main/docs/notebook/calculate-parameters.ipynb)
+    /// for helpful guidance in picking optimal parameters.
+    /// - `size` is the primary `BField` size, subsequent `BField` sizes will be determined
+    /// by the `secondary_scaledown` and `max_scaledown` parameters
+    /// - `n_hashes`. The number of hash functions _k_ to use.
+    /// - `marker_width` or v (nu). The length of the bit-string to use for
+    /// - `n_marker_bits` or κ (kappa). The number of 1s to set in each v-length bit-string (also its Hamming weight).
+    /// - `secondary_scaledown` or β (beta). The scaling factor to use for each subsequent `BField` size.
+    /// - `max_scaledown`. A maximum scaling factor to use for secondary `BField` sizes, since β raised to the power of
+    /// `n_secondaries` can be impractically/needlessly small.
+    /// - `n_secondaries`. The number of secondary `BField`s to create.
+    /// - `in_memory`. Whether to create the `BField` in memory or on disk.
     #[allow(clippy::too_many_arguments)]
     pub fn create<P>(
         directory: P,
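
The sizing rule described in the `create` doc comment above can be sketched as follows. This is a minimal illustration, not the crate's code: `member_sizes` is a hypothetical helper, and it assumes that array `n` is sized as `size * max(β^n, max_scaledown)` and that `n_arrays` counts the primary array as well as the secondaries.

```rust
/// Hypothetical helper (not part of the `bfield` crate's API).
/// Returns the assumed size of each bit array, primary array first.
fn member_sizes(
    size: usize,
    secondary_scaledown: f64, // β in the paper's notation
    max_scaledown: f64,       // assumed floor on the cumulative scaling factor
    n_arrays: u8,             // assumed to include the primary array
) -> Vec<usize> {
    (0..n_arrays)
        .map(|n| {
            // β^n shrinks geometrically; clamp it so later arrays
            // never drop below `size * max_scaledown`.
            let scale = secondary_scaledown.powi(i32::from(n)).max(max_scaledown);
            (size as f64 * scale) as usize
        })
        .collect()
}
```

With `size = 1024`, `secondary_scaledown = 0.5`, `max_scaledown = 0.125`, and five arrays, this sketch yields `[1024, 512, 256, 128, 128]`: the last array is held at the `max_scaledown` floor instead of shrinking to 64.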
@@ -84,7 +92,7 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
         })
     }

-    /// Loads the bfield given the path to the "main" db path (eg the one ending with `0.bfd`).
+    /// Loads the `BField` given the path to the primary array data file (eg the one ending with `0.bfd`).
     pub fn load<P: AsRef<Path>>(main_db_path: P, read_only: bool) -> Result<Self, io::Error> {
         let mut members = Vec::new();
         let mut n = 0;
@@ -126,8 +134,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
         Ok(BField { members, read_only })
     }

-    /// Write the current bfields to disk.
-    /// Only useful if you are creating a bfield in memory
+    /// Write the current `BField` to disk.
+    /// Only useful if you are creating a `BField` in memory.
     pub fn persist_to_disk(self) -> Result<Self, io::Error> {
         let mut members = Vec::with_capacity(self.members.len());
         for m in self.members {
@@ -139,32 +147,32 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
         })
     }

-    /// Returns (n_hashes, marker_width, n_marker_bits, Vec<size of each member>)
+    /// Returns `(n_hashes, marker_width, n_marker_bits, Vec<size of each member>)`.
     pub fn build_params(&self) -> (u8, u8, u8, Vec<usize>) {
         let (_, n_hashes, marker_width, n_marker_bits) = self.members[0].info();
         let sizes = self.members.iter().map(|i| i.info().0).collect();
         (n_hashes, marker_width, n_marker_bits, sizes)
     }

-    /// Returns the params given at build time to the bfields
+    /// Returns the params given at build time to the `BField` arrays.
     pub fn params(&self) -> &Option<T> {
         &self.members[0].params.other
     }

-    /// This doesn't actually update the file, so we can use it to e.g.
-    /// simulate params on an old legacy file that may not actually have
-    /// them set.
+    /// ⚠️ Method for setting parameters without actually updating any files on disk. **Only useful for supporting legacy file formats
+    /// in which these parameters are not saved.**
     pub fn mock_params(&mut self, params: T) {
         self.members[0].params.other = Some(params);
     }

-    /// This allows an insert of a value into the b-field after the entire
-    /// b-field build process has been completed.
-    ///
-    /// It has the very bad downside of potentially knocking other keys out
-    /// of the b-field by making them indeterminate (which will make them fall
-    /// back to the secondaries where they don't exist and thus it'll appear
-    /// as if they were never inserted to begin with)
+    /// ⚠️ Method for inserting a value into a `BField`
+    /// after it has been fully built and finalized.
+    /// **This method should be used with extreme care**
+    /// as it does not guarantee that keys are properly propagated
+    /// to secondary arrays and therefore may make lookups of previously
+    /// set values return an indeterminate result in the primary array,
+    /// then causing fallback to the secondary arrays where they were never
+    /// inserted (and returning a false negative).
     pub fn force_insert(&self, key: &[u8], value: BFieldVal) {
         debug_assert!(!self.read_only, "Can't insert into read_only bfields");
         for secondary in &self.members {
@@ -174,8 +182,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
         }
     }

-    /// Insert the given key/value at the given pass
-    /// Returns whether the value was inserted during this call, eg will return `false` if
+    /// Insert the given key/value at the given pass (1-indexed `BField` array/member).
+    /// Returns whether the value was inserted during this call, i.e., will return `false` if
     /// the value was already present.
     pub fn insert(&self, key: &[u8], value: BFieldVal, pass: usize) -> bool {
         debug_assert!(!self.read_only, "Can't insert into read_only bfields");
@@ -195,8 +203,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
         true
     }

-    /// Returns the value of the given key if found, None otherwise.
-    /// If the value is indeterminate, we still return None.
+    /// Returns the value of the given key if found, `None` otherwise.
+    /// The current implementation also returns `None` for indeterminate values.
     pub fn get(&self, key: &[u8]) -> Option<BFieldVal> {
         for secondary in self.members.iter() {
             match secondary.get(key) {
@@ -210,8 +218,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
         None
     }

-    /// Get the info of each member
-    /// Returns Vec<(size, n_hashes, marker_width, n_marker_bits)>
+    /// Get the info of each secondary array (`BFieldMember`) in the `BField`.
+    /// Returns `Vec<(size, n_hashes, marker_width, n_marker_bits)>`.
     pub fn info(&self) -> Vec<(usize, u8, u8, u8)> {
         self.members.iter().map(|m| m.info()).collect()
     }
@@ -304,6 +312,7 @@ mod tests {
     }
 }

+// Causes cargo test to run doc tests on all `rust` code blocks
 #[doc = include_str!("../README.md")]
 #[cfg(doctest)]
-pub struct ReadmeDoctests;
+struct ReadmeDoctests;
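
The fallback behaviour described in the `get` and `force_insert` doc comments can be sketched in isolation. The types below are hypothetical stand-ins, not the crate's `BFieldLookup` or `BFieldMember`; the sketch assumes each array reports one of three outcomes and that lookup stops at the first array giving a definite answer.

```rust
/// Hypothetical per-array lookup outcome (a stand-in for the crate's own type).
#[derive(Clone, Copy)]
enum Lookup {
    /// The array decoded exactly one value for the key.
    Found(u32),
    /// The array holds no marker for the key.
    Absent,
    /// Colliding markers: this array cannot decide, so try the next one.
    Indeterminate,
}

/// Resolves per-array outcomes in order, primary array first.
fn resolve(outcomes: &[Lookup]) -> Option<u32> {
    for outcome in outcomes {
        match *outcome {
            Lookup::Found(value) => return Some(value),
            Lookup::Absent => return None,
            Lookup::Indeterminate => continue,
        }
    }
    // Every array was indeterminate: reported as `None`, matching the
    // documented behaviour of the current implementation.
    None
}
```

Under these assumptions, `resolve(&[Lookup::Indeterminate, Lookup::Absent])` is `None`. That is the false negative the `force_insert` warning describes: a key made indeterminate in the primary array falls through to a secondary array where it was never inserted.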

src/combinatorial.rs

+2-2
@@ -63,9 +63,9 @@ pub fn unrank(marker: u128) -> usize {
     value as usize
 }

-/// (Hopefully) fast implementation of a binomial
+/// (Hopefully) fast implementation of a binomial.
 ///
-/// This uses a preset group of equations for k < 8 and then falls back to a
+/// This function uses a preset group of equations for k < 8 and then falls back to a
 /// multiplicative implementation that tries to prevent overflows while
 /// maintaining all results as exact integers.
 #[inline]
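
For readers unfamiliar with the multiplicative fallback mentioned in the doc comment above, here is a minimal sketch of that general technique. It is not the crate's function: the name, signature, and integer widths are assumptions, and the preset equations for k < 8 are omitted.

```rust
/// Hypothetical illustration of a multiplicative binomial coefficient.
/// Computes C(n, k) using integer arithmetic only.
fn binomial(n: u64, k: u64) -> u128 {
    if k > n {
        return 0;
    }
    // C(n, k) == C(n, n - k); the smaller k means fewer iterations.
    let k = k.min(n - k);
    let mut result: u128 = 1;
    for i in 1..=k {
        // Invariant: after this step, result == C(n - k + i, i).
        // Multiplying before dividing keeps every intermediate an exact
        // integer, because C(n - k + i - 1, i - 1) * (n - k + i) is
        // always divisible by i.
        result = result * u128::from(n - k + i) / u128::from(i);
    }
    result
}
```

The number of bit-strings of length ν with Hamming weight κ is C(ν, κ), so a call such as `binomial(64, 4)` (which evaluates to 635376) counts the distinct markers available for a given `marker_width` and `n_marker_bits`.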

src/lib.rs

+9-1
@@ -1,7 +1,15 @@
 #![deny(missing_docs)]

-//! The bfield datastructure, implemented in Rust.
+//! The B-field datastructure, implemented in Rust.
 //! A space-efficient, probabilistic data structure and storage and retrieval method for key-value information.
+//! These Rust docs represent some minimal documentation of the crate itself.
+//! See the [Github README](https://github.com/onecodex/rust-bfield) for an
+//! extensive write-up, including the math and design underpinning the B-field
+//! data structure, guidance on B-field parameter selection, as well as usage
+//! examples.[^1]
+//!
+//! [^1]: These are not embeddable in the Cargo docs as they include MathJax,
+//! which is currently unsupported.

 mod bfield;
 mod bfield_member;