---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- oss
- kubernetes
- clients
description: Learn how to use approximate calculations with Redis.
linkTitle: Probabilistic data types
title: Probabilistic data types
weight: 5
---

Redis supports several
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
that let you calculate values approximately rather than exactly.
The types fall into two basic categories:

- [Set operations](#set-operations): These types let you calculate (approximately)
  the number of items in a set of distinct values, and whether or not a given value is
  a member of a set.
- [Statistics](#statistics): These types give you an approximation of
  statistics such as the quantiles, ranks, and frequencies of numeric data points in
  a list.

To see why these approximate calculations are useful, consider the task of
counting the number of distinct IP addresses that access a website in one day.

Assuming that you already have code that supplies you with each IP
address as a string, you could record the addresses in Redis using
a [set]({{< relref "/develop/data-types/sets" >}}):

```js
await client.sAdd("ip_tracker", new_ip_address);
```

A set can only contain each member once, so if the same address
appears again during the day, the new instance will not change
the set. At the end of the day, you could get the exact number of
distinct addresses using the `sCard()` function:

```js
const num_distinct_ips = await client.sCard("ip_tracker");
```

This approach is simple, effective, and precise, but if your website
is very busy, the `ip_tracker` set could become very large and consume
a lot of memory.

You would probably round the count of distinct IP addresses to the
nearest thousand or more to deliver the usage statistics, so
getting it exactly right is not important. It would be useful
if you could trade off some accuracy in exchange for lower memory
consumption. The probabilistic data types provide exactly this kind of
trade-off. Specifically, you can count the approximate number of items in a
set using the [HyperLogLog](#set-cardinality) data type, as described below.

In general, the probabilistic data types let you perform approximations with a
bounded degree of error that have much lower memory consumption or execution
time than the equivalent precise calculations.
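
For example, the set-based IP tracker shown earlier needs only a small change to
use a HyperLogLog instead. The following is a sketch that assumes the same
connected `client` and `new_ip_address` variable as the example above; the
`ip_tracker_hll` key name is just an illustration:

```js
// Record each address in a HyperLogLog rather than a set.
await client.pfAdd("ip_tracker_hll", new_ip_address);

// At the end of the day, get the approximate number of
// distinct addresses.
const approx_distinct_ips = await client.pfCount("ip_tracker_hll");
```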

## Set operations

Redis supports the following approximate set operations:

- [Membership](#set-membership): The
  [Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
  [Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
  data types let you track whether or not a given item is a member of a set.
- [Cardinality](#set-cardinality): The
  [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
  data type gives you an approximate value for the number of items in a set, also
  known as the *cardinality* of the set.

The sections below describe these operations in more detail.

### Set membership

[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
objects provide a set membership operation that lets you track whether or not a
particular item has been added to a set. These two types provide different
trade-offs for memory usage and speed, so you can select the best one for your
use case. Note that for both types, there is an asymmetry between presence and
absence of items in the set. If an item is reported as absent, then it is definitely
absent, but if it is reported as present, then there is a small chance it may really be
absent.
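
To see why this asymmetry arises, consider a toy Bloom filter written in plain
JavaScript. This is a simplified sketch for illustration only (real filters,
including the Redis implementation, use better hash families and sizing), but it
shows why an item that was added is always reported present, while an unseen
item can occasionally collide with existing hashes:

```js
// Toy Bloom filter illustrating the presence/absence asymmetry.
// This is a simplified sketch, not the Redis implementation.
class ToyBloom {
  constructor(bits = 64) {
    this.bits = bits;
    this.array = new Array(bits).fill(false);
  }

  // Two cheap string hashes; real filters use stronger hash families.
  hashes(item) {
    let h1 = 0, h2 = 0;
    for (const ch of item) {
      h1 = (h1 * 31 + ch.charCodeAt(0)) % this.bits;
      h2 = (h2 * 17 + ch.charCodeAt(0) + 7) % this.bits;
    }
    return [h1, h2];
  }

  add(item) {
    for (const h of this.hashes(item)) this.array[h] = true;
  }

  mightContain(item) {
    return this.hashes(item).every((h) => this.array[h]);
  }
}

const bloom = new ToyBloom();
["andy", "cameron", "david"].forEach((name) => bloom.add(name));

// Added items are always reported present (no false negatives)...
console.log(bloom.mightContain("cameron")); // true
// ...but a "present" answer for an unseen item can be a false
// positive if its hashes collide with bits set by other items, so
// only a "false" answer is definitive.
console.log(bloom.mightContain("zoe"));
```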

Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
a Bloom filter records the presence or absence of the
[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
This gives a very compact representation of the
set's membership with a fixed memory size, regardless of how many items you
add. The following example adds some names to a Bloom filter representing
a list of users and checks for the presence or absence of users in the list.
Note that you must use the `bf` property to access the Bloom filter commands.

```js
const res1 = await client.bf.mAdd(
  "recorded_users",
  ["andy", "cameron", "david", "michelle"]
);
console.log(res1); // >>> [true, true, true, true]

const res2 = await client.bf.exists("recorded_users", "cameron");
console.log(res2); // >>> true

const res3 = await client.bf.exists("recorded_users", "kaitlyn");
console.log(res3); // >>> false
```

A Cuckoo filter has similar features to a Bloom filter, but also supports
a deletion operation to remove hashes from a set, as shown in the example
below. Note that you must use the `cf` property to access the Cuckoo filter
commands.

```js
const res4 = await client.cf.add("other_users", "paolo");
console.log(res4); // >>> true

const res5 = await client.cf.add("other_users", "kaitlyn");
console.log(res5); // >>> true

const res6 = await client.cf.add("other_users", "rachel");
console.log(res6); // >>> true

const res7 = await client.cf.exists("other_users", "paolo");
const res7a = await client.cf.exists("other_users", "kaitlyn");
const res7b = await client.cf.exists("other_users", "rachel");
const res7c = await client.cf.exists("other_users", "andy");
console.log([res7, res7a, res7b, res7c]); // >>> [true, true, true, false]

const res8 = await client.cf.del("other_users", "paolo");
console.log(res8); // >>> true

const res9 = await client.cf.exists("other_users", "paolo");
console.log(res9); // >>> false
```

Which of these two data types you choose depends on your use case.
Bloom filters are generally faster than Cuckoo filters when adding new items,
and also have better memory usage. Cuckoo filters are generally faster
at checking membership and also support the delete operation. See the
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
reference pages for more information and a comparison of the two types.

### Set cardinality

A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
object calculates the cardinality of a set. As you add
items, the HyperLogLog tracks the number of distinct set members but
doesn't let you retrieve them or query which items have been added.
You can also merge two or more HyperLogLogs to find the cardinality of the
[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
represent.

```js
const res10 = await client.pfAdd("group:1", ["andy", "cameron", "david"]);
console.log(res10); // >>> true

const res11 = await client.pfCount("group:1");
console.log(res11); // >>> 3

const res12 = await client.pfAdd("group:2", ["kaitlyn", "michelle", "paolo", "rachel"]);
console.log(res12); // >>> true

const res13 = await client.pfCount("group:2");
console.log(res13); // >>> 4

const res14 = await client.pfMerge("both_groups", ["group:1", "group:2"]);
console.log(res14); // >>> OK

const res15 = await client.pfCount("both_groups");
console.log(res15); // >>> 7
```

The main benefit that HyperLogLogs offer is their very low
memory usage. They can count up to 2^64 items with less than
1% standard error using at most 12KB of memory. This makes
them very useful for counting things like the total number of distinct
IP addresses that access a website or the total number of distinct
bank card numbers that make purchases within a day.

## Statistics

Redis supports several approximate statistical calculations
on numeric data sets:

- [Frequency](#frequency): The
  [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
  data type lets you find the approximate frequency of a labeled item in a data stream.
- [Quantiles](#quantiles): The
  [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
  data type estimates the quantile of a query value in a data stream.
- [Ranking](#ranking): The
  [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
  estimates the ranking of labeled items by frequency in a data stream.

The sections below describe these operations in more detail.

### Frequency

A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
(CMS) object keeps count of a set of related items represented by
string labels. The count is approximate, but you can specify
how close you want to keep the count to the true value (as a fraction)
and the acceptable probability of failing to keep it in this
desired range. For example, you can request that the count should
stay within 1% of the true value and have a 0.5% probability
of going outside this limit. The example below shows how to create
a Count-min sketch object, add data to it, and then query it.
Note that you must use the `cms` property to access the Count-min
sketch commands.

```js
// Specify that you want to keep the counts within 0.01
// (1%) of the true value with a 0.005 (0.5%) chance
// of going outside this limit.
const res16 = await client.cms.initByProb("items_sold", 0.01, 0.005);
console.log(res16); // >>> OK

// The parameters for `incrBy()` are passed as an array of objects,
// each containing an `item` and an `incrementBy` property.
const res17 = await client.cms.incrBy(
  "items_sold",
  [
    { item: "bread", incrementBy: 300 },
    { item: "tea", incrementBy: 200 },
    { item: "coffee", incrementBy: 200 },
    { item: "beer", incrementBy: 100 }
  ]
);
console.log(res17); // >>> [300, 200, 200, 100]

const res18 = await client.cms.incrBy(
  "items_sold",
  [
    { item: "bread", incrementBy: 100 },
    { item: "coffee", incrementBy: 150 }
  ]
);
console.log(res18); // >>> [400, 350]

const res19 = await client.cms.query(
  "items_sold",
  ["bread", "tea", "coffee", "beer"]
);
console.log(res19); // >>> [400, 200, 350, 100]
```

The advantage of using a CMS over keeping an exact count with a
[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
is that a CMS has very low, fixed memory usage, even for
large numbers of items. Use CMS objects to keep daily counts of
items sold, accesses to individual web pages on your site, and
other similar statistics.
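
For contrast, the exact version of this kind of counter is a sorted set keyed by
item label. The sketch below assumes the same connected `client`; the
`items_sold_exact` key name is just an illustration. Unlike a CMS, this
structure's memory usage grows with the number of distinct items:

```js
// Keep exact counts: each distinct item is a sorted set member
// whose score is its running total.
await client.zIncrBy("items_sold_exact", 300, "bread");
await client.zIncrBy("items_sold_exact", 200, "tea");

// Read back an exact count for one item.
const breadSold = await client.zScore("items_sold_exact", "bread");
```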

### Quantiles

A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
below which a certain fraction of samples lie. For example, with
a set of measurements of people's heights, the quantile of 0.75 is
the height below which 75% of all the measured heights lie.
[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
to quantiles, except that the fraction is expressed as a percentage.
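
To make the definition concrete, the short plain-JavaScript sketch below
computes an exact quantile using the simple nearest-rank method. It has to keep
and sort every sample, which is exactly the cost that a t-digest avoids:

```js
// Exact nearest-rank quantile: the smallest sample such that at
// least a fraction `q` of all samples are less than or equal to it.
function exactQuantile(samples, q) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length)); // 1-based rank
  return sorted[rank - 1];
}

// The same sample values as the `male_heights` t-digest example below.
const heights = [175.5, 181, 160.8, 152, 177, 196, 164];
console.log(exactQuantile(heights, 0.75)); // 181
```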

A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
object can estimate quantiles from a set of values added to it
without having to store each value in the set explicitly. This can
save a lot of memory when you have a large number of samples.

The example below shows how to add data samples to a t-digest
object and obtain some basic statistics, such as the minimum and
maximum values, the quantile of 0.75, and the
[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
(CDF), which is effectively the inverse of the quantile function. It also
shows how to merge two or more t-digest objects to query the combined
data set. Note that you must use the `tDigest` property to access the
t-digest commands.

```js
const res20 = await client.tDigest.create("male_heights");
console.log(res20); // >>> OK

const res21 = await client.tDigest.add(
  "male_heights",
  [175.5, 181, 160.8, 152, 177, 196, 164]
);
console.log(res21); // >>> OK

const res22 = await client.tDigest.min("male_heights");
console.log(res22); // >>> 152

const res23 = await client.tDigest.max("male_heights");
console.log(res23); // >>> 196

const res24 = await client.tDigest.quantile("male_heights", [0.75]);
console.log(res24); // >>> [181]

// Note that the CDF value for 181 is not exactly
// 0.75. Both values are estimates.
const res25 = await client.tDigest.cdf("male_heights", [181]);
console.log(res25); // >>> [0.7857142857142857]

const res26 = await client.tDigest.create("female_heights");
console.log(res26); // >>> OK

const res27 = await client.tDigest.add(
  "female_heights",
  [155.5, 161, 168.5, 170, 157.5, 163, 171]
);
console.log(res27); // >>> OK

const res28 = await client.tDigest.quantile("female_heights", [0.75]);
console.log(res28); // >>> [170]

const res29 = await client.tDigest.merge(
  "all_heights", ["male_heights", "female_heights"]
);
console.log(res29); // >>> OK

const res30 = await client.tDigest.quantile("all_heights", [0.75]);
console.log(res30); // >>> [175.5]
```

A t-digest object also supports several other related commands, such
as querying by rank. See the
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
reference for more information.
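
For example, rank queries look like the sketch below. It assumes the same
connected `client` and the `male_heights` object created above; the method names
follow the underlying `TDIGEST.RANK` and `TDIGEST.BYRANK` commands, so check the
node-redis reference if your version differs:

```js
// Rank of a value: roughly how many added observations are smaller
// than it (an estimate, like the other t-digest queries).
const ranks = await client.tDigest.rank("male_heights", [181]);

// The reverse lookup: the value estimated to sit at a given rank.
const values = await client.tDigest.byRank("male_heights", [0]);
```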

### Ranking

A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}})
object estimates the rankings of different labeled items in a data
stream according to frequency. For example, you could use this to
track the top ten most frequently accessed pages on a website, or the
top five most popular items sold.

The example below adds several different items to a Top-K object
that tracks the top three items (this is the second parameter to
the `topK.reserve()` method). It also shows how to list the
top *k* items and query whether or not a given item is in the
list. Note that you must use the `topK` property to access the
Top-K commands.

```js
// The `reserve()` method creates the Top-K object with
// the given key. The parameters are the number of items
// in the ranking and values for `width`, `depth`, and
// `decay`, described in the Top-K reference page.
const res31 = await client.topK.reserve(
  "top_3_songs", 3,
  { width: 7, depth: 8, decay: 0.9 }
);
console.log(res31); // >>> OK

// The parameters for `incrBy()` are passed as an array of objects,
// each containing an `item` and an `incrementBy` property.
const res32 = await client.topK.incrBy(
  "top_3_songs",
  [
    { item: "Starfish Trooper", incrementBy: 3000 },
    { item: "Only one more time", incrementBy: 1850 },
    { item: "Rock me, Handel", incrementBy: 1325 },
    { item: "How will anyone know?", incrementBy: 3890 },
    { item: "Average lover", incrementBy: 4098 },
    { item: "Road to everywhere", incrementBy: 770 }
  ]
);
console.log(res32);
// >>> [null, null, null, 'Rock me, Handel', 'Only one more time', null]

const res33 = await client.topK.list("top_3_songs");
console.log(res33);
// >>> ['Average lover', 'How will anyone know?', 'Starfish Trooper']

const res34 = await client.topK.query(
  "top_3_songs", ["Starfish Trooper", "Road to everywhere"]
);
console.log(res34); // >>> [true, false]
```