# Stream Processor

Stream processors enable nearline data ingestion into Venice with automatic checkpointing of consumption progress.
The best supported stream processor is Apache Samza, with integration provided through the `venice-samza` module.

## When to Use

Choose stream processors when you need:

- **Checkpointed delivery** - Samza resumes from its last committed offset after a restart, so no produced data is lost; replayed puts simply overwrite the same keys in Venice
- **Stateful stream processing** - Transform and enrich data before writing to Venice
- **Hybrid stores with nearline updates** - Combine batch push with real-time streaming updates
- **Automatic checkpointing** - Built-in progress tracking relative to upstream data sources

For simpler use cases without stream processing logic, consider the [Online Producer](online-producer.md).

## Prerequisites

To write to Venice from a stream processor, the store must be configured as a **hybrid store** with:

1. Hybrid mode enabled with a rewind time
2. A current version capable of receiving nearline writes
3. Either the `ACTIVE_ACTIVE` or `NON_AGGREGATE` replication policy

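As a sketch, the hybrid settings above can be applied with the Venice admin tool. Flag names can vary by version, and the controller URL, cluster name, store name, and threshold values here are placeholders:

```shell
# Hypothetical invocation: enable hybrid mode on an existing store with a
# 1-day rewind window and an offset-lag threshold for marking a version
# ready to serve. Verify the flags against your admin tool's --help output.
java -jar venice-admin-tool-all.jar --update-store \
  --url http://controller.host:1234 \
  --cluster venice-cluster0 \
  --store my-store-name \
  --hybrid-rewind-seconds 86400 \
  --hybrid-offset-lag 1000
```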
## Apache Samza Integration

Venice integrates with Apache Samza through
[VeniceSystemProducer](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemProducer.java)
and
[VeniceSystemFactory](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemFactory.java).

### Configuration

Configure Venice as an output system in your Samza job properties file:

```properties
# Define Venice as an output system
systems.venice.samza.factory=com.linkedin.venice.samza.VeniceSystemFactory

# Required: Store name to write to
systems.venice.store=my-store-name

# Required: Push type
systems.venice.push.type=STREAM

# Required: Venice controller discovery URL
systems.venice.venice.controller.discovery.url=http://controller.host:1234
```

### Writing Data

Once configured, write to Venice using Samza's `MessageCollector`:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class MyStreamTask implements StreamTask {

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // Process your data (processKey/processValue are application-specific helpers)
    String key = processKey(envelope);
    GenericRecord value = processValue(envelope);

    // Send to Venice via the "venice" system defined in the job config
    OutgoingMessageEnvelope out = new OutgoingMessageEnvelope(new SystemStream("venice", "my-store-name"), key, value);
    collector.send(out);
  }
}
```

## Comparison: Stream Processor vs Online Producer

| Feature             | Stream Processor (Samza)              | Online Producer                    |
| ------------------- | ------------------------------------- | ---------------------------------- |
| Checkpointing       | Automatic via Samza                   | Manual application responsibility  |
| Delivery guarantees | At-least-once, with checkpoint replay | At-least-once                      |
| Stream processing   | Full Samza capabilities               | None - direct write only           |
| Complexity          | Higher (Samza deployment required)    | Lower (library in application)     |
| Latency             | Moderate (batched by Samza)           | Lower (direct writes)              |
| Best for            | Complex pipelines, historical replay  | Simple online writes from services |

## Best Practices

1. **Monitor Samza lag metrics** - Ensure your stream processor keeps up with upstream data
2. **Configure appropriate buffer sizes** - Balance memory usage and throughput via Kafka producer configs
3. **Handle schema evolution** - Ensure your Samza job can handle multiple value schema versions
4. **Set appropriate rewind time** - Configure hybrid store rewind time based on expected downtime and reprocessing needs

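For the last point, one reasonable starting heuristic (a sketch, not Venice API code; the class and method names are made up for illustration) is to size the rewind window to cover the longest tolerated processor outage plus a safety margin:

```java
import java.time.Duration;

public class RewindSizing {
  // The rewind window must cover the longest expected consumer downtime plus
  // a margin, so a restarted job can replay every event it missed.
  static long rewindSeconds(Duration maxDowntime, Duration margin) {
    return maxDowntime.plus(margin).getSeconds();
  }

  public static void main(String[] args) {
    // Tolerate one day of downtime with a two-hour buffer
    System.out.println(rewindSeconds(Duration.ofDays(1), Duration.ofHours(2)));
  }
}
```

A larger window increases replay cost after a restart, so balance it against how much reprocessing the store's consumers can absorb.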
## See Also

- [Online Producer](online-producer.md) - Lower-level direct write API
- [Batch Push](batch-push.md) - Full dataset replacement from Hadoop/Spark
- [Hybrid Stores](../../getting-started/learn-venice/merging-batch-and-rt-data.md#hybrid-store) - Combining batch and real-time data