Recent updates for HyperLogLog and MinHash-like algorithms for your book #28
Comments
Thank you for the email, Jianshu.
Yes, a lot of new research in this field has been completed since 2018, when the book was finished. Covering all of it would make for a very long discussion, and the original idea was to keep the book feasible to read by highlighting the typical approaches without explaining every possible algorithm. As you surely know, every slight change can produce a new algorithm that is a bit better, more efficient, or nicer than the original, but in most cases the underlying idea stays the same.
There is an idea to write another version of the book that describes more algorithms but in less detail, more like a textbook. But these are only plans for now.
Currently, I primarily have to focus on different languages (by the way, we recently published an edition in simplified Chinese) and on collecting the new algorithms to decide in which format to re-publish the book.
Your list contains excellent points, thank you for it. Of course, once there is a draft of a new edition, I would be happy to send it to you for early reading ;)
Thank you for reading!
Andrii Gakhov
On Sun., June 4, 2023 at 06:46, Jianshu_Zhao <***@***.***> wrote:
Dear Andrii,
This is Jianshu, a Ph.D. student in CSE & Bioinformatics, studying probabilistic data structures and their applications in biological sequence analysis. I read your book on probabilistic data structures and algorithms for big data applications, which is a very detailed and thorough introduction to many probabilistic data structures from the very beginning. I love it, and I think it is the first book devoted to this topic, despite the fact that many data structures and algorithms books cover some of these structures, but not all.
Since the book was written around 2019, it covers almost all the cutting-edge algorithms of that time in cardinality estimation and similarity search (MinHash et al.). However, in recent years (including the 2 or 3 years before 2019), there have been many new breakthroughs, such as those related to HyperLogLog and MinHash.
1. In 2017, Ertl came up with new estimators for HyperLogLog (https://arxiv.org/pdf/1706.07290.pdf), including an improved estimator and a Maximum Likelihood Estimator (MLE), which is theoretically optimal (it is mentioned again below). The improved estimator is much better at small cardinalities, and unlike HyperLogLog++ it needs no empirical bias correction, but it still has large variance at very small and very large cardinalities. The MLE has the smallest relative variance both in theory and in practice; however, it is expensive to compute because it requires solving a three-dimensional optimization problem, which does not easily translate into a fast and robust algorithm. The estimated cardinalities of two sets A and B can be used to estimate the Jaccard index via the inclusion-exclusion rule for similarity search (competing with MinHash, which also approximates the Jaccard index); this requires much less memory than MinHash for similar accuracy, but is slower. The paper also proposed a joint Maximum Likelihood Estimator (JMLE), which estimates the Jaccard index (estimating the intersection and union first) directly from the HyperLogLog sketches of A and B; it is also theoretically optimal and has smaller variance than the approach just mentioned (cardinalities of A and B plus inclusion-exclusion). It is even slower for Jaccard estimation, but the most accurate. Ertl's work builds on Daniel Ting's pioneering work here (https://dl.acm.org/doi/abs/10.1145/2939672.2939772) and also Cohen's work here (https://dl.acm.org/doi/abs/10.1145/3097983.3097999).
2. Seth Pettie's team developed a more general framework called Tau-GRA (generalized remaining area), which includes LogLog, PCSA, and HyperLogLog, and improves the HyperLogLog estimator's relative variance from about 0.4% above the Cramér-Rao lower bound (1.079/m versus the bound of 1.074/m) to about 0.04% above it (1.075/m) when Tau is 0.88989; it also needs no correction for small cardinalities. The maximum likelihood estimator mentioned above meets the Cramér-Rao lower bound but is very expensive, while Tau-GRA is much easier to compute and update than the MLE methods, despite not being optimal (its variance is still very small). The paper is here: https://arxiv.org/abs/2208.10578. This work builds on the pioneering work of Daniel Ting here (https://dl.acm.org/doi/abs/10.1145/2623330.2623669).
3. New MinHash algorithms: SuperMinHash, which improves the MSE of the original MinHash (https://arxiv.org/abs/1706.05698), and ProbMinHash (https://ieeexplore.ieee.org/abstract/document/9185081), which estimates Jp rather than the exact Jaccard index and takes into account set multiplicity/abundance and set size (e.g., several identical elements in a set each have to be counted, or relative frequencies, since it also normalizes by the total set size). BagMinHash (https://dl.acm.org/doi/abs/10.1145/3219819.3220089) and DartMinHash (https://arxiv.org/abs/2005.11547) both take into account the weights of the elements in a set (but not set size); these two approximate the weighted Jaccard index Jw instead of the simple Jaccard index. The latter is the fastest weighted MinHash algorithm.
4. B-bit one-permutation MinHash with optimal densification, theoretically the fastest MinHash algorithm, based on work from several other papers but summarized as a whole in this paper: http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf
5. A new data structure, SetSketch, which combines the benefits of MinHash and HyperLogLog and is both fast and space efficient. The interesting point of the paper is that MinHash and HLL can actually be merged into one such data structure. Several estimators are also derived; see the paper here: https://vldb.org/pvldb/vol14/p2244-ertl.pdf
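(Most of the sketches above refine the classic k-hash MinHash estimator. For readers unfamiliar with it, here is a minimal Python sketch of that O(dk) baseline; the 2-universal hash family and all names are my own choices, not taken from any of the papers.)

```python
import random

P = (1 << 61) - 1  # Mersenne prime; h(x) = (a*x + b) mod P is 2-universal

def make_hashes(k, seed=42):
    """Draw parameters (a, b) for k independent hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash_signature(items, hashes):
    # Classic MinHash: keep the minimum of each hash function over the
    # whole set -> O(d * k) work for d elements and k hash functions.
    return [min((a * x + b) % P for x in items) for a, b in hashes]

def jaccard_estimate(sig_a, sig_b):
    # Pr[min of h over A == min of h over B] = Jaccard(A, B), so the
    # fraction of agreeing signature positions is an unbiased estimate.
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

A, B = set(range(1000)), set(range(500, 1500))  # true Jaccard = 1/3
hs = make_hashes(512)
est = jaccard_estimate(minhash_signature(A, hs), minhash_signature(B, hs))
# est is close to 1/3; the standard error is sqrt(J*(1-J)/k) ~ 0.02
```

The k hash evaluations per element are exactly the cost that one-permutation hashing (point 4) removes.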
I am wondering whether there will be a new version of the book in the coming years. If so, these topics could be very relevant and interesting to many readers, and I would be happy to be part of it.
Looking forward to your reply.
Thanks,
Jianshu
Hello Andrii, I would also be happy to read the Chinese version (you know I am from China). I am working on implementing those structures in Rust, a high-performance compiled language with memory and thread safety. For real-world large data, performance and parallelization matter most, e.g., parallelizing HLL on multi-threaded machines. Is the Chinese version available now? I cannot find it online.
We have already implemented SetSketch and ProbMinHash/SuperMinHash; next will be BagMinHash or DartMinHash and the generalized remaining area for PCSA and HLL, and perhaps also b-bit one-permutation MinHash with optimal densification. Those are not just small changes but better estimators with theoretically minimal relative variance, which should be very interesting in a lot of applications. I also forgot to mention the new Count-Min sketch estimator from Daniel Ting, likewise an MLE-based method with much better relative variance than the original estimator; see here: https://dl.acm.org/doi/abs/10.1145/3219819.3219975
I think explaining those new ideas on top of your detailed explanations would be excellent for many beginners who want to understand and implement all these cutting-edge algorithms. Thanks for the quick response, greatly appreciated. Jianshu
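(For context on the last point: the classic Count-Min estimate that Ting's MLE approach improves on is simply the minimum over depth hashed counters. A minimal Python sketch of that textbook baseline follows; the class and parameter names are mine, and Python's built-in `hash` stands in for a proper pairwise-independent family.)

```python
import random

class CountMin:
    """Textbook Count-Min sketch: `depth` rows of `width` counters.
    The classic point query returns the minimum counter over the rows,
    which can only over-estimate the true count; this is the estimator
    that Ting's MLE-based method improves on."""

    def __init__(self, width, depth, seed=7):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _col(self, x, i):
        # Row-specific salted hash selects the counter column.
        return hash((self.salts[i], x)) % self.width

    def add(self, x, count=1):
        for i in range(self.depth):
            self.rows[i][self._col(x, i)] += count

    def query(self, x):
        return min(self.rows[i][self._col(x, i)] for i in range(self.depth))

cm = CountMin(width=1024, depth=4)
for _ in range(100):
    cm.add("kmer_a")
# cm.query("kmer_a") is >= 100, and exactly 100 unless every row collides
```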
The most important one for MinHash is #4 mentioned above, which breaks the complexity from O(dk) down to O(d+k), where d is the number of nonzero elements in the set and k is the number of minhashes, thanks to the densification idea combined with a 2-universal hash function. This had never been achieved before, and my implementation and others (e.g., BinDash) show that it is at least 100x faster than classic MinHash while its error still shrinks at the usual rate, with standard error sqrt(J(1-J)/m). It should be a replacement for traditional MinHash. Thanks, Jianshu
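(A minimal Python sketch of the one-permutation idea, to make the O(d+k) claim concrete: each element is hashed once and routed to one bucket, and empty buckets are filled by copying a random non-empty bucket with set-independent probes, in the spirit of Shrivastava's optimal densification. This is a simplified illustration, not the b-bit variant from the paper, and all names and constants are mine.)

```python
import random

M61 = (1 << 61) - 1  # Mersenne prime for a 2-universal hash (a*x + b) mod M61

def oph_signature(items, k, seed=1):
    """One-permutation MinHash: every element is hashed ONCE and routed
    to one of k buckets, so building the sketch costs O(d + k) rather
    than the O(d * k) of classic k-hash MinHash."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, M61), rng.randrange(M61)
    sig = [None] * k
    for x in items:
        h = (a * x + b) % M61
        j, v = h % k, h // k          # bucket index, value within bucket
        if sig[j] is None or v < sig[j]:
            sig[j] = v
    _densify(sig, seed)
    return sig

def _densify(sig, seed):
    # Fill each empty bucket by copying a uniformly random non-empty
    # bucket, probing with randomness that depends only on (seed, bucket).
    # Two sketches built with the same seed therefore borrow from the
    # same positions, so borrowed positions still match with probability J.
    filled = [v is not None for v in sig]
    if not any(filled):
        return
    k = len(sig)
    for j in range(k):
        if not filled[j]:
            probe = random.Random(seed * 1_000_003 + j)
            t = probe.randrange(k)
            while not filled[t]:
                t = probe.randrange(k)
            sig[j] = sig[t]

def jaccard_estimate(sig_a, sig_b):
    # Fraction of matching bucket minima estimates the Jaccard index.
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)
```

With k = 256 and two sets of true Jaccard 1/3, the estimate lands near 0.33 with standard error roughly sqrt(J(1-J)/k) ≈ 0.03, while each element is touched only once.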
Dear Andrii, it seems the recent breakthroughs in densification methods for one-permutation MinHash have reached the theoretical limit. In addition to the optimal densification method mentioned above, fast densification (http://proceedings.mlr.press/v115/mai20a.html) and bidirectional densification (https://dl.acm.org/doi/abs/10.1145/3448016.3452833) push the limit further and achieve better average-case running time, which appears to be the limit of this approach. Thanks, Jianshu