
Commit 8e27f8f

Add samza doc

1 parent fb12e62 · commit 8e27f8f

2 files changed: +91 -10

docs/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -113,7 +113,7 @@ Venice supports flexible data ingestion:
 
 - **Batch Push**: Full dataset replacement from Hadoop, Spark
 - **Incremental Push**: Bulk additions without full replacement
-- **Streaming Writes**: Real-time updates via Kafka, Samza, Flink, or the
+- **Streaming Writes**: Real-time updates via Apache Samza or the
   [Online Producer](./user-guide/write-apis/online-producer.md)
 - **Write Compute**: Partial updates and collection merging for efficiency
 - **Hybrid Stores**: Mix batch and streaming with configurable rewind time
```
Lines changed: 90 additions & 9 deletions

```diff
@@ -1,15 +1,96 @@
 # Stream Processor
 
-Data can be produced to Venice in a nearline fashion, from stream processors. The best supported stream processor is
-Apache Samza though we intend to add first-class support for other stream processors in the future. The difference
-between using a stream processor library and the [Online Producer](online-producer.md) library is that a stream
-processor has well-defined semantics around when to ensure that produced data is flushed and a built-in mechanism to
-checkpoint its progress relative to its consumption progress in upstream data sources, whereas the online producer
-library is a lower-level building block which leaves these reliability details up to the user.
-
-For Apache Samza, the integration point is done at the level of the
 [VeniceSystemProducer](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemProducer.java)
 and
 [VeniceSystemFactory](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemFactory.java).
 
-More details to come.
```

# Stream Processor

Stream processors enable nearline data ingestion into Venice with automatic checkpointing and at-least-once delivery
semantics. The best supported stream processor is Apache Samza, with integration provided through the `venice-samza`
module.

## When to Use

Choose stream processors when you need:

- **Reliable delivery after restarts** - Samza's checkpointing resumes from the last committed offset, so no writes
  are lost (records processed since the last checkpoint may be written again)
- **Stateful stream processing** - Transform and enrich data before writing to Venice
- **Hybrid stores with nearline updates** - Combine batch push with real-time streaming updates
- **Automatic checkpointing** - Built-in progress tracking relative to upstream data sources

For simpler use cases without stream processing logic, consider the [Online Producer](online-producer.md).

## Prerequisites

To write to Venice from a stream processor, the store must be configured as a **hybrid store** with the following
(a configuration sketch follows the list):

1. Hybrid store enabled with a rewind time
2. A current version capable of receiving nearline writes
3. Either `ACTIVE_ACTIVE` or `NON_AGGREGATE` replication policy
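
For step 1, here is a minimal sketch of enabling hybrid mode with Venice's `ControllerClient` admin API. The cluster
name, store name, URL, and threshold values are placeholders, and the exact factory-method name may differ between
Venice releases, so verify against your version:

```java
import com.linkedin.venice.controllerapi.ControllerClient;
import com.linkedin.venice.controllerapi.ControllerResponse;
import com.linkedin.venice.controllerapi.UpdateStoreQueryParams;

public final class MakeStoreHybrid {
  public static void main(String[] args) {
    // Placeholder cluster name and controller discovery URL; substitute your own.
    try (ControllerClient client =
        ControllerClient.constructClusterControllerClient("venice-cluster0", "http://controller.host:1234")) {
      UpdateStoreQueryParams params = new UpdateStoreQueryParams()
          // How far back real-time data is replayed when a new store version comes online.
          .setHybridRewindSeconds(24L * 60 * 60)
          // The version is considered caught up once its consumption lag falls below this record count.
          .setHybridOffsetLagThreshold(1000L);
      ControllerResponse response = client.updateStore("my-store-name", params);
      if (response.isError()) {
        throw new IllegalStateException("updateStore failed: " + response.getError());
      }
    }
  }
}
```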

## Apache Samza Integration

Venice integrates with Apache Samza through
[VeniceSystemProducer](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemProducer.java)
and
[VeniceSystemFactory](https://github.com/linkedin/venice/blob/main/integrations/venice-samza/src/main/java/com/linkedin/venice/samza/VeniceSystemFactory.java).

### Configuration

Configure Venice as an output system in your Samza job properties file:

```properties
# Define Venice as an output system
systems.venice.samza.factory=com.linkedin.venice.samza.VeniceSystemFactory

# Required: Store name to write to
systems.venice.store=my-store-name

# Required: Push type
systems.venice.push.type=STREAM

# Required: Venice controller discovery URL
systems.venice.venice.controller.discovery.url=http://controller.host:1234
```

### Writing Data

Once configured, write to Venice using Samza's `MessageCollector`:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class MyStreamTask implements StreamTask {
  // The system name must match the "systems.venice.*" config block; the stream name is the Venice store.
  private static final SystemStream VENICE = new SystemStream("venice", "my-store-name");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // Derive the Venice key and Avro value from the incoming message (application-specific logic, elided here).
    String key = processKey(envelope);
    GenericRecord value = processValue(envelope);

    // Send the key/value pair to Venice; Samza checkpoints consumption progress automatically.
    collector.send(new OutgoingMessageEnvelope(VENICE, key, value));
  }
}
```
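
`VeniceSystemProducer` can also be driven outside a deployed Samza job by instantiating it through the factory. The
sketch below relies only on Samza's standard `SystemFactory`/`SystemProducer` interfaces and reuses the config keys
from the properties example above; treat the lifecycle details (source registration, flush semantics) as assumptions
to verify against your Venice version:

```java
import java.util.Map;

import org.apache.samza.config.MapConfig;
import org.apache.samza.metrics.MetricsRegistryMap;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemProducer;
import org.apache.samza.system.SystemStream;

import com.linkedin.venice.samza.VeniceSystemFactory;

public final class StandaloneVeniceWriter {
  public static void main(String[] args) {
    // Same settings as the properties file above, provided programmatically.
    MapConfig config = new MapConfig(Map.of(
        "systems.venice.samza.factory", "com.linkedin.venice.samza.VeniceSystemFactory",
        "systems.venice.store", "my-store-name",
        "systems.venice.push.type", "STREAM",
        "systems.venice.venice.controller.discovery.url", "http://controller.host:1234"));

    SystemProducer producer = new VeniceSystemFactory().getProducer("venice", config, new MetricsRegistryMap());
    producer.register("standalone-writer");
    producer.start();
    try {
      // Key and value types must match the store's registered key/value schemas.
      producer.send("standalone-writer",
          new OutgoingMessageEnvelope(new SystemStream("venice", "my-store-name"), "some-key", "some-value"));
      producer.flush("standalone-writer"); // block until pending writes are acknowledged
    } finally {
      producer.stop();
    }
  }
}
```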

## Comparison: Stream Processor vs Online Producer

| Feature             | Stream Processor (Samza)             | Online Producer                     |
| ------------------- | ------------------------------------ | ----------------------------------- |
| Checkpointing       | Automatic via Samza                  | Manual application responsibility   |
| Delivery guarantees | At-least-once (checkpoint-driven)    | At-least-once (application-managed) |
| Stream processing   | Full Samza capabilities              | None - direct write only            |
| Complexity          | Higher (Samza deployment required)   | Lower (library in application)      |
| Latency             | Moderate (batched by Samza)          | Lower (direct writes)               |
| Best for            | Complex pipelines, historical replay | Simple online writes from services  |

## Best Practices

1. **Monitor Samza lag metrics** - Ensure your stream processor keeps up with upstream data
2. **Configure appropriate buffer sizes** - Balance memory usage and throughput via Kafka producer configs
3. **Handle schema evolution** - Ensure your Samza job can handle multiple value schema versions (see the sketch
   after this list)
4. **Set an appropriate rewind time** - Configure the hybrid store's rewind time based on expected downtime and
   reprocessing needs
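
To illustrate the schema-evolution practice, a job can build values with Avro's `GenericRecordBuilder`, which fills
in defaults for fields it does not set, so values stay well-formed as the registered schema grows. The schema below
is a made-up example; a real job should use the value schema registered with the Venice store:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public final class ValueBuilderExample {
  // Made-up value schema; a newer registered schema could add more fields, provided they carry defaults.
  private static final Schema VALUE_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"MyValue\",\"fields\":["
          + "{\"name\":\"score\",\"type\":\"float\"},"
          + "{\"name\":\"tags\",\"type\":{\"type\":\"array\",\"items\":\"string\"},\"default\":[]}]}");

  public static GenericRecord buildValue(float score) {
    // Unset fields with defaults (like "tags") are filled in automatically by the builder.
    return new GenericRecordBuilder(VALUE_SCHEMA).set("score", score).build();
  }
}
```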

## See Also

- [Online Producer](online-producer.md) - Lower-level direct write API
- [Batch Push](batch-push.md) - Full dataset replacement from Hadoop/Spark
- [Hybrid Stores](../../getting-started/learn-venice/merging-batch-and-rt-data.md#hybrid-store) - Combining batch and
  real-time data
