[Bug]: Unexpected Empty Records in Nested Arrays When Using BigQueryIO.write() with .withAutoSchemaUpdate(true) and .ignoreUnknownValues() #33842

Open
Igor-Domin opened this issue Feb 3, 2025 · 5 comments


Igor-Domin commented Feb 3, 2025

What happened?

When .withAutoSchemaUpdate(true) and .ignoreUnknownValues() are both enabled on BigQueryIO.write(), records with nested arrays fail to be written to BigQuery. An empty record is added to the array, which causes schema validation to fail. This behavior does not occur without these two settings enabled.

Expected behavior:
Records should be mapped without additional empty objects being added to arrays.

Actual Behavior:
Records in arrays are unexpectedly populated with an empty record, resulting in schema validation failures during write operations.

Steps to reproduce:

  1. Use BigQueryIO.write() with: .withAutoSchemaUpdate(true) and .ignoreUnknownValues() (Beam version is 2.59.0)
  2. Write records containing array data with the schema shown below
  3. Observe failures due to an empty record being inserted into arrays.

Example BigQuery nested record field:

{
  "name": "external_ids",
  "mode": "REPEATED",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "mode": "REQUIRED",
      "type": "STRING"
    },
    {
      "name": "name",
      "mode": "REQUIRED",
      "type": "STRING"
    }
  ]
}

This is what the error response looks like:

(...)
"errorMessage": "Field value of id cannot be empty. on field external_ids.",
(...)
  "stringifiedData": {
    "external_ids": [
      {
        "id": "fd7da837-13cb-4b67-8ced-220a27130e34",
        "name": "some_custom_id"
      },
      {}
    ],
  }
(...)
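
To make the failure concrete, here is a minimal, self-contained sketch of the rule the error message describes: every element of the REPEATED record must carry non-empty values for its REQUIRED subfields, so a trailing {} is rejected. This is not BigQuery's or Beam's validation code. The class name RepeatedRecordCheck and the method findViolations are hypothetical names used only for illustration, and plain Map objects stand in for the nested records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Beam or the BigQuery client.
class RepeatedRecordCheck {
    // Subfields marked REQUIRED in the example schema for external_ids.
    static final List<String> REQUIRED_SUBFIELDS = List.of("id", "name");

    // Returns one message per missing or empty REQUIRED subfield, across all elements.
    static List<String> findViolations(String fieldName, List<Map<String, Object>> elements) {
        List<String> violations = new ArrayList<>();
        for (int i = 0; i < elements.size(); i++) {
            Map<String, Object> element = elements.get(i);
            for (String required : REQUIRED_SUBFIELDS) {
                Object value = element.get(required);
                if (value == null || value.toString().isEmpty()) {
                    violations.add(fieldName + "[" + i + "]." + required + " cannot be empty");
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        Map<String, Object> valid = Map.of(
            "id", "fd7da837-13cb-4b67-8ced-220a27130e34",
            "name", "some_custom_id");
        Map<String, Object> empty = Map.of(); // the unexpected {} element

        // The array as it appears in stringifiedData: one valid record plus {}.
        List<Map<String, Object>> observed = List.of(valid, empty);
        for (String violation : findViolations("external_ids", observed)) {
            System.out.println(violation);
        }
    }
}
```

The valid first element passes; only the extra empty element produces violations, which matches the reported symptom that the original data is fine and the failure comes from the added record.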

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

michalmisiewicz commented Feb 3, 2025

For additional context, we’re writing to a dynamically selected BigQuery table:

BigQueryIO.<TableRowWrapper>write()
    .to(input -> createTableSpec(input, project, topicTableMapping))

The issue may stem from line 181, where unknownFields is assigned an empty TableRow and passed to the function that converts a TableRow to a proto Message.

boolean ignoreUnknown = ignoreUnknownValues || autoSchemaUpdates;
@Nullable TableRow unknownFields = autoSchemaUpdates ? new TableRow() : null;
boolean allowMissingFields = autoSchemaUpdates;
Message msg =
    TableRowToStorageApiProto.messageFromTableRow(
        schemaInformation,
        descriptorToUse,
        tableRow,
        ignoreUnknown,
        allowMissingFields,
        unknownFields,
        changeType,
        changeSequenceNum);
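
As a reading aid, the three derived values in the snippet above can be tabulated for each combination of the two user-facing options. The sketch below only mirrors those three assignments; it does not reproduce what messageFromTableRow does with them. The class DerivedWriteFlags is a hypothetical name, and a plain Map stands in for TableRow so the example runs without Beam on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; mirrors only the three assignments quoted above.
class DerivedWriteFlags {
    final boolean ignoreUnknown;
    final boolean allowMissingFields;
    // Stand-in for the TableRow that collects unknown fields; null when not collecting.
    final Map<String, Object> unknownFields;

    DerivedWriteFlags(boolean ignoreUnknownValues, boolean autoSchemaUpdates) {
        this.ignoreUnknown = ignoreUnknownValues || autoSchemaUpdates;
        this.unknownFields = autoSchemaUpdates ? new HashMap<String, Object>() : null;
        this.allowMissingFields = autoSchemaUpdates;
    }

    public static void main(String[] args) {
        boolean[] options = {false, true};
        for (boolean ignoreUnknownValues : options) {
            for (boolean autoSchemaUpdates : options) {
                DerivedWriteFlags f = new DerivedWriteFlags(ignoreUnknownValues, autoSchemaUpdates);
                System.out.println(
                    "ignoreUnknownValues=" + ignoreUnknownValues
                        + " autoSchemaUpdates=" + autoSchemaUpdates
                        + " -> ignoreUnknown=" + f.ignoreUnknown
                        + " allowMissingFields=" + f.allowMissingFields
                        + " unknownFields=" + (f.unknownFields == null ? "null" : "empty container"));
            }
        }
    }
}
```

With auto schema update on, an empty container is always allocated, whether or not the row actually has unknown fields. Whether that container is what ends up as the extra {} in the array is the hypothesis raised in this comment, not something this sketch demonstrates.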

Contributor

liferoad commented Feb 3, 2025

cc @ahmedabu98

@ahmedabu98
Contributor

Hey @Igor-Domin, I'm trying to figure out why you're writing array data to a REQUIRED field.

Unless you intend for "external_ids" to indeed be an array of records, in which case you'd need to change the mode to REPEATED.

Lmk if I'm missing something

Author

Igor-Domin commented Feb 4, 2025

Hello @ahmedabu98, oh I've made a typo, the mode is set to REPEATED :)

@michalmisiewicz

@ahmedabu98 Could you have a second look at this?
