
Commit 8e27f8f

Add samza doc

1 parent fb12e62 · commit 8e27f8f

2 files changed: +91 -10

docs/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -113,7 +113,7 @@ Venice supports flexible data ingestion:
 
 - **Batch Push**: Full dataset replacement from Hadoop, Spark
 - **Incremental Push**: Bulk additions without full replacement
-- **Streaming Writes**: Real-time updates via Kafka, Samza, Flink, or the
+- **Streaming Writes**: Real-time updates via Apache Samza or the
   [Online Producer](./user-guide/write-apis/online-producer.md)
 - **Write Compute**: Partial updates and collection merging for efficiency
 - **Hybrid Stores**: Mix batch and streaming with configurable rewind time
```
Lines changed: 90 additions & 9 deletions

```diff
@@ -1,15 +1,96 @@
 # Stream Processor
 
-Data can be produced to Venice in a nearline fashion, from stream processors. The best supported stream processor is
-Apache Samza though we intend to add first-class support for other stream processors in the future. The difference
-between using a stream processor library and the [Online Producer](online-producer.md) library is that a stream
-processor has well-defined semantics around when to ensure that produced data is flushed and a built-in mechanism to
-checkpoint its progress relative to its consumption progress in upstream data sources, whereas the online producer
-library is a lower-level building block which leaves these reliability details up to the user.
-
-For Apache Samza, the integration point is done at the level of the
 [VeniceSystemProducer](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemProducer.java)
 and
 [VeniceSystemFactory](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemFactory.java).
 
-More details to come.
```

# Stream Processor

Stream processors enable nearline data ingestion into Venice with automatic checkpointing and at-least-once delivery
semantics. The best supported stream processor is Apache Samza, with integration provided through the `venice-samza`
module.

## When to Use

Choose stream processors when you need:

- **Reliable delivery after restarts** - Samza's checkpointing resumes from the last committed offset, so no writes
  are lost (records processed since the last checkpoint may be written again)
- **Stateful stream processing** - Transform and enrich data before writing to Venice
- **Hybrid stores with nearline updates** - Combine batch push with real-time streaming updates
- **Automatic checkpointing** - Built-in progress tracking relative to upstream data sources

For simpler use cases without stream processing logic, consider the [Online Producer](online-producer.md).

## Prerequisites

To write to Venice from a stream processor, the store must be configured as a **hybrid store** with the following
(a configuration sketch follows the list):

1. Hybrid store enabled with a rewind time
2. A current version capable of receiving nearline writes
3. Either `ACTIVE_ACTIVE` or `NON_AGGREGATE` replication policy
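
For step 1, here is a minimal sketch of enabling hybrid mode with Venice's `ControllerClient` admin API. The cluster
name, store name, URL, and threshold values are placeholders, and the exact factory-method name may differ between
Venice releases, so verify against your version:

```java
import com.linkedin.venice.controllerapi.ControllerClient;
import com.linkedin.venice.controllerapi.ControllerResponse;
import com.linkedin.venice.controllerapi.UpdateStoreQueryParams;

public final class MakeStoreHybrid {
  public static void main(String[] args) {
    // Placeholder cluster name and controller discovery URL; substitute your own.
    try (ControllerClient client =
        ControllerClient.constructClusterControllerClient("venice-cluster0", "http://controller.host:1234")) {
      UpdateStoreQueryParams params = new UpdateStoreQueryParams()
          // How far back real-time data is replayed when a new store version comes online.
          .setHybridRewindSeconds(24L * 60 * 60)
          // The version is considered caught up once its consumption lag falls below this record count.
          .setHybridOffsetLagThreshold(1000L);
      ControllerResponse response = client.updateStore("my-store-name", params);
      if (response.isError()) {
        throw new IllegalStateException("updateStore failed: " + response.getError());
      }
    }
  }
}
```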

## Apache Samza Integration

Venice integrates with Apache Samza through
[VeniceSystemProducer](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemProducer.java)
and
[VeniceSystemFactory](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemFactory.java).

### Configuration

Configure Venice as an output system in your Samza job properties file:

```properties
# Define Venice as an output system
systems.venice.samza.factory=com.linkedin.venice.samza.VeniceSystemFactory

# Required: Store name to write to
systems.venice.store=my-store-name

# Required: Push type
systems.venice.push.type=STREAM

# Required: Venice controller discovery URL
systems.venice.venice.controller.discovery.url=http://controller.host:1234
```

### Writing Data

Once configured, write to Venice using Samza's `MessageCollector`:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class MyStreamTask implements StreamTask {
  // The system name must match the "systems.venice.*" config block; the stream name is the Venice store.
  private static final SystemStream VENICE = new SystemStream("venice", "my-store-name");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // Derive the Venice key and Avro value from the incoming message (application-specific logic, elided here).
    String key = processKey(envelope);
    GenericRecord value = processValue(envelope);

    // Send the key/value pair to Venice; Samza checkpoints consumption progress automatically.
    collector.send(new OutgoingMessageEnvelope(VENICE, key, value));
  }
}
```
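
`VeniceSystemProducer` can also be driven outside a deployed Samza job by instantiating it through the factory. The
sketch below relies only on Samza's standard `SystemFactory`/`SystemProducer` interfaces and reuses the config keys
from the properties example above; treat the lifecycle details (source registration, flush semantics) as assumptions
to verify against your Venice version:

```java
import java.util.Map;

import org.apache.samza.config.MapConfig;
import org.apache.samza.metrics.MetricsRegistryMap;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemProducer;
import org.apache.samza.system.SystemStream;

import com.linkedin.venice.samza.VeniceSystemFactory;

public final class StandaloneVeniceWriter {
  public static void main(String[] args) {
    // Same settings as the properties file above, provided programmatically.
    MapConfig config = new MapConfig(Map.of(
        "systems.venice.samza.factory", "com.linkedin.venice.samza.VeniceSystemFactory",
        "systems.venice.store", "my-store-name",
        "systems.venice.push.type", "STREAM",
        "systems.venice.venice.controller.discovery.url", "http://controller.host:1234"));

    SystemProducer producer = new VeniceSystemFactory().getProducer("venice", config, new MetricsRegistryMap());
    producer.register("standalone-writer");
    producer.start();
    try {
      // Key and value types must match the store's registered key/value schemas.
      producer.send("standalone-writer",
          new OutgoingMessageEnvelope(new SystemStream("venice", "my-store-name"), "some-key", "some-value"));
      producer.flush("standalone-writer"); // block until pending writes are acknowledged
    } finally {
      producer.stop();
    }
  }
}
```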

## Comparison: Stream Processor vs Online Producer

| Feature             | Stream Processor (Samza)             | Online Producer                     |
| ------------------- | ------------------------------------ | ----------------------------------- |
| Checkpointing       | Automatic via Samza                  | Manual application responsibility   |
| Delivery guarantees | At-least-once (checkpoint-driven)    | At-least-once (application-managed) |
| Stream processing   | Full Samza capabilities              | None - direct write only            |
| Complexity          | Higher (Samza deployment required)   | Lower (library in application)      |
| Latency             | Moderate (batched by Samza)          | Lower (direct writes)               |
| Best for            | Complex pipelines, historical replay | Simple online writes from services  |

## Best Practices

1. **Monitor Samza lag metrics** - Ensure your stream processor keeps up with upstream data
2. **Configure appropriate buffer sizes** - Balance memory usage and throughput via Kafka producer configs
3. **Handle schema evolution** - Ensure your Samza job can handle multiple value schema versions (see the sketch
   after this list)
4. **Set an appropriate rewind time** - Configure the hybrid store's rewind time based on expected downtime and
   reprocessing needs
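
To illustrate the schema-evolution practice, a job can build values with Avro's `GenericRecordBuilder`, which fills
in defaults for fields it does not set, so values stay well-formed as the registered schema grows. The schema below
is a made-up example; a real job should use the value schema registered with the Venice store:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public final class ValueBuilderExample {
  // Made-up value schema; a newer registered schema could add more fields, provided they carry defaults.
  private static final Schema VALUE_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"MyValue\",\"fields\":["
          + "{\"name\":\"score\",\"type\":\"float\"},"
          + "{\"name\":\"tags\",\"type\":{\"type\":\"array\",\"items\":\"string\"},\"default\":[]}]}");

  public static GenericRecord buildValue(float score) {
    // Unset fields with defaults (like "tags") are filled in automatically by the builder.
    return new GenericRecordBuilder(VALUE_SCHEMA).set("score", score).build();
  }
}
```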

## See Also

- [Online Producer](online-producer.md) - Lower-level direct write API
- [Batch Push](batch-push.md) - Full dataset replacement from Hadoop/Spark
- [Hybrid Stores](../../getting-started/learn-venice/merging-batch-and-rt-data.md#hybrid-store) - Combining batch and
  real-time data
