Add control for deserialization error behavior #443

Merged: 1 commit merged into master from invalid-data on Dec 18, 2023
Conversation

@jbeisen (Collaborator) commented Dec 8, 2023

Adds a config option `bad_data` to control whether deserialization errors result in dropping the data or failing the job. The logic is mostly contained in `source_collector.rs`, and it uses a new `RateLimiter` struct to limit the frequency of logging and user-error reporting.

@jbeisen force-pushed the invalid-data branch 4 times, most recently from 1fa5420 to 0de45fb, on December 9, 2023
@mwylde (Member) left a comment:

This is a great start.

A few general thoughts:

  • invalid_data_behavior is pretty long — can we come up with a shorter name for this (at least in SQL)?
  • I think the key design decision here will be what part of the logic is handled in the connectors vs. the deserializer. In the very near future we might want to support policies like "allow 10 failures per second" (of either processing or event time) or "10 failures per 100 messages". Where would that logic be possible/easy to implement? (A possible shape for such a policy is sketched after this list.)
  • For Avro with a schema registry, there are different kinds of errors, some of which may be transient. For example, if we fail to fetch the id for a record, that may be because the schema registry is temporarily down or temporarily inconsistent. We may want to handle those differently than bad-data errors.
  • Longer term, we want to support redirecting invalid records elsewhere in the pipeline, for example allowing users to define a sink as a dead-letter queue. That will likely need a more complex implementation of this feature that's more integrated with the dataflow, but we don't have to solve that now.
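
To make the rate-based policies in the second bullet concrete, here is a minimal sketch of one possible shape for an "allow N failures per window" budget, measured in processing time. This is not part of the PR, and all names here are hypothetical:

```rust
use std::time::{Duration, Instant};

/// Hypothetical policy: tolerate up to `limit` failures per `window`
/// of processing time; past that, the job should fail.
struct FailureBudget {
    limit: u32,
    window: Duration,
    window_start: Instant,
    failures: u32,
}

impl FailureBudget {
    /// Records one failure. Returns true if it is within budget (drop the
    /// record and continue), false if the budget is exhausted (fail the job).
    fn record_failure(&mut self, now: Instant) -> bool {
        if now.duration_since(self.window_start) >= self.window {
            // The previous window has elapsed; start a fresh one.
            self.window_start = now;
            self.failures = 0;
        }
        self.failures += 1;
        self.failures <= self.limit
    }
}
```

A "10 failures per 100 messages" variant would count total messages instead of elapsed time; either way the state is small and per-source, which bears on whether it should live in the connectors or the deserializer.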

```diff
@@ -290,7 +298,7 @@ impl<T: SchemaData> DataDeserializer<T> {
     pub async fn deserialize_slice<'a>(
         &mut self,
         msg: &'a [u8],
-    ) -> impl Iterator<Item = Result<T, UserError>> + 'a + Send {
+    ) -> impl Iterator<Item = Result<Option<T>, UserError>> + 'a + Send {
```
@mwylde (Member):

Why is this now an `Option`? It looks like the deserializer is responsible for handling the `InvalidDataBehavior`, and all of the callers just ignore `None` values.

@jbeisen (Collaborator, author):
True, changed.

```rust
Ok(t) => Ok(Some(t)),
Err(e) => match self.invalid_data_behavior {
    None | Some(InvalidDataBehavior::Drop) => {
        warn!("Dropping invalid data: {}", e.details.clone());
```
@mwylde (Member):
I think in this case we still want to report back to the user. We also likely need some amount of rate limiting on logging to the console to avoid a perf impact (imagine you're reading 100k messages/s and something changes such that 10% of them are invalid—you're going to want to know, but also not log every one of them).

@mwylde (Member):

(Unrelatedly — the `clone` isn't necessary.)

@jbeisen (Collaborator, author):

I've moved this logic to `source_collector.rs` and added rate limiting.
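
The `RateLimiter` itself doesn't appear in this thread; as a rough idea of the kind of limiter being described, here is a minimal sketch (the real struct in `source_collector.rs` may differ, and the method name is assumed):

```rust
use std::time::{Duration, Instant};

/// Runs an action (e.g. logging a warning) at most once per `interval`,
/// silently dropping requests that arrive too soon after the last run.
pub struct RateLimiter {
    interval: Duration,
    last_fired: Option<Instant>,
}

impl RateLimiter {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_fired: None }
    }

    /// Invokes `f` only if at least `interval` has elapsed since the
    /// last invocation; otherwise the call is suppressed.
    pub fn rate_limit<F: FnOnce()>(&mut self, f: F) {
        let now = Instant::now();
        let ready = self
            .last_fired
            .map_or(true, |last| now.duration_since(last) >= self.interval);
        if ready {
            self.last_fired = Some(now);
            f();
        }
    }
}
```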

```rust
    key: None,
    value,
}).await;
if let Some(value) = value {
```
@mwylde (Member):

The HTTP sources (SSE and websocket) have their own implementation of this that we probably want to replace.

@jbeisen force-pushed the invalid-data branch 6 times, most recently from 60b62d1 to c25839f, on December 13, 2023
@jbeisen (Collaborator, author) commented Dec 13, 2023

I've made a few changes:

  • moved the handling logic out of the DataDeserializer and into a function, `collect_source_record`, that the sources invoke (a sketch of how it fits together follows this comment)
  • the deserialization functions now produce iterators with items of type `Result<T, SourceError>`; the sources pass these to `collect_source_record`, which either drops the errors or converts them into `UserError`s
  • renamed the option to `bad_data`

Regarding Avro, I left the existing behavior for non-deserialization errors, which is that the job fails. Should we change that?
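
Putting the fragments from this thread together (the call sites appear in later hunks), `collect_source_record` plausibly looks something like the sketch below. The `Context`, `Record`, `SourceError`, and `UserError` types are the project's own; the body is reconstructed from the discussion, not the actual implementation:

```rust
// Sketch only: emits good records, and applies the bad_data policy to
// errors. Drop logs a rate-limited warning and skips the record, while
// Fail surfaces a UserError that fails the job. Assumes that
// From<SourceError> is implemented for UserError.
pub async fn collect_source_record<T: SchemaData>(
    ctx: &mut Context<(), T>,
    timestamp: SystemTime,
    record: Result<T, SourceError>,
    bad_data: &Option<BadData>,
    rate_limiter: &mut RateLimiter,
) -> Result<(), UserError> {
    match record {
        Ok(value) => {
            ctx.collect(Record { timestamp, key: None, value }).await;
            Ok(())
        }
        Err(err) => match bad_data {
            Some(BadData::Fail {}) => Err(err.into()),
            // Earlier in the thread, an unset option defaulted to dropping.
            Some(BadData::Drop {}) | None => {
                rate_limiter.rate_limit(|| warn!("Dropping invalid data: {}", err));
                Ok(())
            }
        },
    }
}
```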

@jbeisen force-pushed the invalid-data branch 2 times, most recently from 6d2a9e6 to 8381f4f, on December 13, 2023

@jbeisen marked this pull request as ready for review on December 13, 2023
```diff
@@ -260,6 +260,29 @@ impl Format {
     }
 }
+
+#[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, ToSchema)]
+#[serde(rename_all = "snake_case")]
+pub enum BadData {
```
@mwylde (Member):
I don't think this will be extensible to policies that require configuration (like allowing 10 bad records/sec). This turns into a JSON schema like:

```json
"BadData": {
  "type": "string",
  "enum": [
    "drop",
    "fail"
  ]
},
```

@jbeisen (Collaborator, author):

I think if/when we add policies like that, we'll want something like this:

```rust
struct BadData {
    limit: i32,
    per_seconds: i32,
}
```

or we could even represent it as a percentage, like an SLA.

But either way, I'm not sure there's a clean way of implementing something backwards compatible with the current binary option.

@jbeisen (Collaborator, author):
I've made the enum variants structs, so the option is serialized as a JSON object and we can add more complex variants in the future.
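
For reference, here is a minimal standalone example of that pattern, with the variant names taken from the `drop`/`fail` values in the schema above (the real enum and its derives may differ):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum BadData {
    Drop {},
    Fail {},
}

fn main() {
    // Struct variants use serde's externally tagged representation, so
    // each value serializes as a JSON object rather than a bare string.
    let json = serde_json::to_string(&BadData::Drop {}).unwrap();
    assert_eq!(json, r#"{"drop":{}}"#);

    let parsed: BadData = serde_json::from_str(r#"{"fail":{}}"#).unwrap();
    assert_eq!(parsed, BadData::Fail {});

    // A future variant with fields, e.g. RateLimit { limit: u32 },
    // would serialize as {"rate_limit":{"limit":10}} without changing
    // the shape of the existing values.
}
```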

```rust
    key: None,
    value: value?,
}).await;
match value {
```
@mwylde (Member):

Does this need to be updated to use `collect_source_record`?

@jbeisen (Collaborator, author):
Yup, missed this. It's done now.

```rust
Err(e) => {
    ctx.report_user_error(e).await;
}
if let Err(e) = collect_source_record(ctx, SystemTime::now(), record, &self.bad_data, &mut self.rate_limiter).await {
```
@mwylde (Member):

I think this needs to be updated to remove the existing bad-data handling code.

@jbeisen (Collaborator, author):
It looks like most sources panic on user errors, except for this one and SSE. Is there a reason for that or should we make them all consistent?

@jbeisen (Collaborator, author):
I've removed this and made all the sources consistent in handling user errors.

```rust
    }).await;
}
Err(e) => {
    if let Err(e) = collect_source_record(ctx, SystemTime::now(), v, &self.bad_data, &mut self.rate_limiter).await {
```
@mwylde (Member):

Same here — there's existing error-handling code that is now duplicative with what `collect_source_record` is doing.

@jbeisen (Collaborator, author):

I've removed this.

@jbeisen force-pushed the invalid-data branch 2 times, most recently from ec31279 to 0b42734, on December 15, 2023
Commit message: Adds a config option `bad_data` to control whether deserialization errors result in dropping the data or failing the job. The logic is mostly contained in `source_collector.rs`, and it uses a new `RateLimiter` struct to limit the frequency of logging and user-error reporting.
@jbeisen merged commit 9aa7b59 into master on Dec 18, 2023. 8 checks passed.

@jbeisen deleted the invalid-data branch on December 18, 2023.