|
| 1 | +# Reservoir sampling |
| 2 | + |
| 3 | +Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, |
| 4 | +of k items from a population of unknown size n in a single pass over the items. The size of the population n is not |
| 5 | +known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed |
| 6 | +to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of |
| 7 | +the algorithm must permit extraction of a simple random sample without replacement of size k over the part of |
| 8 | +the population seen so far. |
| 9 | + |
| 10 | +## **🔹 Variants of Reservoir Sampling** |
| 11 | +While **Algorithm R** is the simplest and most commonly used, there are **other variants** that improve performance in specific cases: |
| 12 | + |
| 13 | +| **Algorithm** | **Description** | **Complexity** | |
| 14 | +|--------------------------------|----------------|---------------| |
| 15 | +| **Algorithm R** | Basic reservoir sampling, replaces elements with probability `k/i` | **O(N)** | |
| 16 | +| **Algorithm L** | Optimized for large `N`, skips over elements between replacements instead of testing each one | **O(N) to read the stream, O(k(1 + log(N/k))) expected random draws** | |
| 17 | +| **Weighted Reservoir Sampling** | Assigns elements weights, prioritizing selection based on weight | **O(N log k)** (heap-based) | |
| 18 | +| **Random Sort Reservoir Sampling** | Uses a min-heap priority queue, selecting `k` elements with highest random priority scores | **O(N log k)** | |
| 19 | + |
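As a point of reference for the table above, here is a minimal sketch of Algorithm R in Python. The function name `algorithm_r` and the optional `rng` parameter are illustrative assumptions, not names taken from this repository.

```python
import random


def algorithm_r(stream, k, rng=None):
    """Uniform reservoir sample of up to k items from an iterable (sketch)."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # j is uniform on [0, i), so j < k happens with probability k/i.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Every element costs one random draw, which is the per-element work that Algorithm L avoids.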
| 20 | +## Algorithm Weighted R – Weighted Reservoir Sampling |
| 21 | +**Weighted Reservoir Sampling** is an **efficient algorithm** for selecting `k` elements **proportionally to their weights** from a stream of unknown length `N`, using only `O(k)` memory. |
| 22 | + |
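| 23 | +This repository implements **Weighted Algorithm R**, a weighted extension of **Algorithm R** (described in Jeffrey Vitter's 1985 paper, which credits it to Alan Waterman). Each element receives a random priority derived from its weight, and a **heap** keeps the `k` highest priorities. |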
| 24 | + |
| 25 | +> This algorithm uses a **min-heap-based priority selection**, ensuring **O(N log k)** time complexity, making it efficient for large streaming datasets. |
| 26 | + |
| 27 | +## 📊 **Mathematical Formula for Weighted Algorithm R** |
| 28 | + |
| 29 | +### **Problem Definition** |
| 30 | +We need to select **`k` elements** from a data stream **of unknown length `N`**, ensuring **each element is selected with a probability proportional to its weight `w_i`**. |
| 31 | + |
| 32 | +### **Algorithm Steps** |
| 33 | +1. **Initialize a Min-Heap of Size `k`** |
| 34 | + - Store the first `k` elements **with their priority scores**: |
| 35 | + |
| 36 | + $p_i = \frac{w_i}{U_i}$ |
| 37 | + |
| 38 | + where $U_i$ is a uniform random number from **(0,1]**. |
| 39 | + |
| 40 | +2. **Process Remaining Elements (`i > k`)** |
| 41 | + - For each new element `s_i`: |
| 42 | + - Compute **priority score**: |
| 43 | + |
| 44 | + $p_i = \frac{w_i}{U_i}$ |
| 45 | + |
| 46 | + - If `p_i` is greater than the **smallest priority in the heap**, replace the smallest element. |
| 47 | + |
| 48 | +3. **After processing `N` elements**, the reservoir will contain `k` elements **selected proportionally to their weights**. |
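
The steps above can be sketched in Python with the standard `heapq` module. This is a minimal illustration under stated assumptions, not the repository's implementation: the name `weighted_reservoir_sample`, the `(item, weight)` input format, and the `rng` parameter are all hypothetical, and weights are assumed to be strictly positive.

```python
import heapq
import random


def weighted_reservoir_sample(stream, k, rng=None):
    """Keep the k items with the largest priority w_i / U_i (sketch).

    `stream` yields (item, weight) pairs with weight > 0.
    """
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    heap = []  # min-heap of (priority, arrival_index, item)
    for index, (item, weight) in enumerate(stream):
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        priority = weight / u
        entry = (priority, index, item)  # index breaks ties between equal priorities
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif priority > heap[0][0]:
            # The new priority beats the smallest one kept so far.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

Each heap operation costs O(log k), which gives the O(N log k) total stated earlier. The returned list is in heap order, not in priority or arrival order.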
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## 🔬 **Probability Proof** |
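| 53 | +For any element $s_i$ with weight $w_i$: |
| 54 | +1. The **priority score** is: |
| 55 | + |
| 56 | + $p_i = \frac{w_i}{U_i}$ |
| 57 | + |
| 58 | + where $U_i \sim U(0,1]$, so for any threshold $t \ge w_i$: |
| 59 | + |
| 60 | + $P(p_i > t) = P\left(U_i < \frac{w_i}{t}\right) = \frac{w_i}{t}$ |
| 61 | + |
| 62 | +2. `s_i` is kept exactly when `p_i` is among the `k` largest priorities. The chance of clearing a given threshold grows linearly with $w_i$ (and equals 1 once $t \le w_i$), meaning elements with **higher weights** are **more likely to be selected**. |
| 63 | + |
| 64 | +3. The resulting inclusion probabilities increase with weight but are only approximately proportional to $w_i$. For example, with `k = 1` and two elements of weights 1 and 2, the lighter element wins with probability 1/4 rather than 1/3. The key $U_i^{1/w_i}$ of Efraimidis and Spirakis is the variant that reproduces weighted sampling without replacement exactly. |
| 65 | + |
| 66 | +✅ **Conclusion:** Weighted Algorithm R favors elements **according to their weights**, unlike uniform Algorithm R, with proportionality holding approximately rather than exactly. |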
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +## 🧪 **Test Case Formula for Weighted Algorithm R** |
| 71 | + |
| 72 | +### **Test Case Design** |
| 73 | +To validate Weighted Algorithm R, we must check if: |
| 74 | +- **Higher-weight elements are chosen more frequently**. |
| 75 | +- **Selection follows the weight distribution over multiple runs**. |
| 76 | + |
| 77 | +### **Mathematical Test** |
| 78 | +For `T` independent runs: |
| 79 | +- Let `count(s_i)` be the number of times `s_i` appears in the reservoir. |
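| 80 | +- Approximate inclusion probability, for a stream that is long relative to `k` and in which no single weight dominates (so that $k \, w_i / \sum_j w_j$ stays well below 1): |
| 81 | + |
| 82 | + $P(s_i) \approx \frac{k \, w_i}{\sum_j w_j}$ |
| 83 | + |
| 84 | +- Expected occurrence over `T` runs: |
| 85 | + |
| 86 | + $\text{Expected count}(s_i) \approx T \times \frac{k \, w_i}{\sum_j w_j}$ |
| 87 | + |
| 88 | +- We verify that `count(s_i)` is **statistically close** to this value. When the stream has at least `k` elements, the counts over all elements sum to exactly `kT`, because every run returns `k` elements. |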
| 89 | + |
| 90 | +## 🎯 Algorithm L |
| 91 | + |
| 92 | +**Reservoir Sampling** is a technique for randomly selecting `k` elements from a stream of unknown length `N`. |
| 93 | +**Algorithm L**, introduced by **Kim-Hung Li (1994)**, improves on Algorithm R by drawing how many elements to skip before the next replacement. Random numbers are generated only around replacements, an expected O(k(1 + log(N/k))) times, rather than once per element. |
| 94 | + |
| 95 | +### **Problem Definition** |
| 96 | +We need to select **`k` elements** from a data stream **of unknown length `N`**, ensuring **each element has an equal probability `k/N`** of being chosen. |
| 97 | + |
| 98 | +### **Algorithm Steps** |
| 99 | +1. **Fill the reservoir** with the **first `k` elements**. |
| 100 | +2. **Initialize weight factor `W`** using: |
| 101 | + |
| 102 | + $W = \exp\left(\frac{\log(\text{random}())}{k}\right)$ |
| 103 | + |
| 104 | +3. **Skip elements efficiently**: draw the number of elements to pass over from a geometric distribution: |
| 105 | + |
| 106 | + $\text{skip} = \left\lfloor \frac{\log(\text{random}())}{\log(1 - W)} \right\rfloor$ |
| 107 | + |
| 108 | +4. **If the stream has not ended**, take the element that follows the skipped ones and use it to **replace a uniformly random slot** in the reservoir. |
| 109 | +5. **Update `W`** for the next iteration using: |
| 110 | + |
| 111 | + $W = W \times \exp\left(\frac{\log(\text{random}())}{k}\right)$ |
| 112 | + |
| 113 | +6. **Repeat until the end of the stream**. |
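
The six steps above can be sketched in Python as follows. This is an illustrative sketch under assumptions rather than the repository's code: `algorithm_l`, `_open_unit`, and the `rng` parameter are hypothetical names, and the sketch does not guard against floating-point corner cases such as `W` rounding to exactly 1.

```python
import itertools
import math
import random

_END = object()  # sentinel returned when the stream is exhausted


def _open_unit(rng):
    """Uniform draw from the open interval (0, 1), so log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def algorithm_l(stream, k, rng=None):
    """Uniform reservoir sample of up to k items using geometric skips (sketch)."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    it = iter(stream)
    reservoir = list(itertools.islice(it, k))  # step 1: fill the reservoir
    if len(reservoir) < k:
        return reservoir  # the stream had fewer than k items
    w = math.exp(math.log(_open_unit(rng)) / k)  # step 2
    while True:
        # Step 3: number of items to pass over; log1p(-w) computes log(1 - w).
        skip = math.floor(math.log(_open_unit(rng)) / math.log1p(-w))
        # Consume `skip` items and fetch the one after them.
        item = next(itertools.islice(it, skip, skip + 1), _END)
        if item is _END:
            return reservoir  # step 6: end of stream
        reservoir[rng.randrange(k)] = item  # step 4: replace a random slot
        w *= math.exp(math.log(_open_unit(rng)) / k)  # step 5
```

Skipped items are still read from the iterator, so the pass over the stream remains O(N); the saving is in random draws and replacement work.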
| 114 | + |
| 115 | +### **Probability Proof** |
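| 116 | +For each element $s_i$, we show that it has an equal probability of being selected. The skip lengths are drawn so that Algorithm L makes the same replacement decisions, in distribution, as Algorithm R, so it suffices to analyze those: |
| 117 | + |
| 118 | +1. The probability that $s_i$ **enters the reservoir** when it arrives (for $i > k$; the first `k` elements enter with probability 1): |
| 119 | + |
| 120 | + $P(s_i \text{ enters}) = \frac{k}{i}$ |
| 121 | + |
| 122 | +2. Once inside, $s_i$ survives the arrival of $s_j$ with probability $1 - \frac{k}{j} \cdot \frac{1}{k} = \frac{j-1}{j}$, so the probability that it **remains in the reservoir** is: |
| 123 | + |
| 124 | + $P(s_i \text{ in final reservoir}) = \frac{k}{i} \prod_{j=i+1}^{N} \frac{j-1}{j} = \frac{k}{i} \cdot \frac{i}{N} = \frac{k}{N}, \quad \forall i \in \{k+1, ..., N\}$ |
| 125 | + |
| 126 | +The same product taken from $j = k + 1$ gives $k/N$ for the first `k` elements as well. This confirms that **Algorithm L ensures uniform selection**. |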
| 127 | + |
| 128 | + |
| 129 | +## 🧪 **Test Case Formula for Algorithm L** |
| 130 | + |
| 131 | +### **Test Case Design** |
| 132 | +To validate Algorithm L, we must check if: |
| 133 | +- **Each element is chosen with probability `k/N`**. |
| 134 | +- **Selection is uniform over multiple runs**. |
| 135 | + |
| 136 | +### **Mathematical Test** |
| 137 | +For `T` independent runs: |
| 138 | +- Let `count(s_i)` be the number of times `s_i` appears in the reservoir. |
| 139 | +- Expected probability: |
| 140 | + |
| 141 | + $P(s_i) = \frac{k}{N}$ |
| 142 | + |
| 143 | +- Expected occurrence over `T` runs: |
| 144 | + |
| 145 | + $\text{Expected count}(s_i) = T \times \frac{k}{N}$ |
| 146 | + |
| 147 | +- We verify that `count(s_i)` is **statistically close** to this value. |
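
The frequency check above can be sketched as a small Python harness. The helper names `inclusion_counts` and `max_relative_error` are hypothetical, and `random.Random.sample` is used purely as a stand-in uniform sampler so the block runs on its own; in practice the sampler under test would be passed in its place.

```python
import random
from collections import Counter


def inclusion_counts(sampler, population, k, runs):
    """Count how often each item appears across `runs` independent samples."""
    counts = Counter()
    for _ in range(runs):
        counts.update(sampler(population, k))
    return counts


def max_relative_error(counts, expected):
    """Largest |observed - expected| / expected over all items."""
    return max(abs(counts[item] - exp) / exp for item, exp in expected.items())


rng = random.Random(12345)
population = list(range(10))
k, runs = 3, 20000

# Stand-in sampler (assumption): replace with the reservoir sampler under test.
counts = inclusion_counts(lambda pop, size: rng.sample(pop, size), population, k, runs)

# Expected count(s_i) = T * k / N for a uniform sampler.
expected = {item: runs * k / len(population) for item in population}
error = max_relative_error(counts, expected)
```

A fixed relative tolerance is the simplest acceptance rule; a chi-squared goodness-of-fit test is a stricter alternative when a formal significance level is needed.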