docs/api-reference/loaders-storage-targets/bigquery-loader/configuration-reference/_common_config.md
+38-2
@@ -4,7 +4,7 @@ import Link from '@docusaurus/Link';
<tr>
<td><code>batching.maxBytes</code></td>
- <td>Optional. Default value <code>16000000</code>. Events are emitted to BigQuery when the batch reaches this size in bytes</td>
+ <td>Optional. Default value <code>10000000</code>. Events are emitted to BigQuery when the batch reaches this size in bytes</td>
</tr>
<tr>
<td><code>batching.maxDelay</code></td>
@@ -37,7 +37,35 @@ import Link from '@docusaurus/Link';
</tr>
<tr>
<td><code>skipSchemas</code></td>
- <td>Optional, e.g. <code>["iglu:com.example/skipped1/jsonschema/1-0-0"]</code> or with wildcards <code>["iglu:com.example/skipped2/jsonschema/1-*-*"]</code>. A list of schemas that won't be loaded to BigQuery. This feature could be helpful when recovering from edge-case schemas which for some reason cannot be loaded to the table.</td>
+ <td>
+ Optional, e.g. <code>["iglu:com.example/skipped1/jsonschema/1-0-0"]</code> or with wildcards <code>["iglu:com.example/skipped2/jsonschema/1-*-*"]</code>.
+ A list of schemas that won't be loaded to BigQuery.
+ This feature can be helpful when recovering from edge-case schemas which for some reason cannot be loaded to the table.
+ </td>
+ </tr>
+ <tr>
+ <td><code>legacyColumnMode</code></td>
+ <td>Optional. Default value <code>false</code>.
+ When this mode is enabled, the loader uses the legacy column style used by the v1 BigQuery loader.
+ For example, an entity for a <code>1-0-0</code> schema is loaded into a column ending in <code>_1_0_0</code>, instead of a column ending in <code>_1</code>.
+ This feature can be helpful when migrating from the v1 loader to the v2 loader.
+ </td>
+ </tr>
+ <tr>
+ <td><code>legacyColumns</code></td>
+ <td>
+ Optional, e.g. <code>["iglu:com.example/legacy/jsonschema/1-0-0"]</code> or with wildcards <code>["iglu:com.example/legacy/jsonschema/1-*-*"]</code>.
+ Schemas for which to use the legacy column style used by the v1 BigQuery loader, even when <code>legacyColumnMode</code> is disabled.
+ </td>
+ </tr>
+ <tr>
+ <td><code>exitOnMissingIgluSchema</code></td>
+ <td>
+ Optional. Default value <code>true</code>.
+ Whether the loader should crash and exit if it fails to resolve an Iglu schema.
+ We recommend <code>true</code>, because Snowplow enriched events have already passed validation, so a missing schema normally indicates an error that needs addressing.
+ Change to <code>false</code> so that events go to the failed events stream instead of crashing the loader.
+ </td>

<td>Optional. Default value 4. Configures the internal HTTP client used for the Iglu resolver, alerts and telemetry. The maximum number of open HTTP requests to any single server at any one time.</td>
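For orientation, here is a minimal sketch of how the options above might appear in the loader's HOCON configuration file. It assumes the dotted option names map onto top-level keys and omits the required `input`/`output` blocks; the values shown are either the documented defaults or illustrative placeholders.

```hocon
{
  "batching": {
    # Events are emitted to BigQuery once a batch reaches this many bytes
    "maxBytes": 10000000
  }

  # Schemas that should never be loaded, e.g. to recover from an edge-case schema
  "skipSchemas": [
    "iglu:com.example/skipped1/jsonschema/1-0-0"
  ]

  # Use v1-style column names (ending in _1_0_0) for all self-describing events and entities...
  "legacyColumnMode": false

  # ...or only for the schemas listed here
  "legacyColumns": [
    "iglu:com.example/legacy/jsonschema/1-*-*"
  ]

  # Crash on an unresolvable Iglu schema (true), or send those events to the failed events stream (false)
  "exitOnMissingIgluSchema": true
}
```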
docs/api-reference/loaders-storage-targets/bigquery-loader/configuration-reference/_kinesis_config.md
+18-2
@@ -23,8 +23,24 @@
<td>Optional. Default value 1000. How many events the Kinesis client may fetch in a single poll. Only used when `input.retrievalMode` is Polling.</td>
</tr>
<tr>
- <td><code>input.bufferSize</code></td>
- <td>Optional. Default value 1. The number of batches of events which are pre-fetched from kinesis. The default value is known to work well.</td>
+ <td><code>input.workerIdentifier</code></td>
+ <td>Optional. Defaults to the <code>HOSTNAME</code> environment variable. The name of this KCL worker, used in the DynamoDB lease table.</td>
+ </tr>
+ <tr>
+ <td><code>input.leaseDuration</code></td>
+ <td>Optional. Default value <code>10 seconds</code>. The duration of shard leases. KCL workers must periodically refresh leases in the DynamoDB table before this duration expires.</td>

<td>Optional. Default value <code>2.0</code>. Controls how to pick the maximum number of shard leases to steal at one time. E.g. if there are 4 available processors and <code>maxLeasesToStealAtOneTimeFactor = 2.0</code>, the loader is allowed to steal up to 8 leases. This allows bigger instances to more quickly acquire the shard leases they need to combat latency.</td>
<td>Optional. Default value <code>100 milliseconds</code>. Initial backoff used to retry checkpointing if we exceed the DynamoDB provisioned write limits.</td>
<td>Optional. Default value <code>1 second</code>. Maximum backoff used to retry checkpointing if we exceed the DynamoDB provisioned write limits.</td>
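As a rough sketch (not a complete configuration), the new Kinesis options described above might be set like this. It assumes the options live under the `input` block, as their dotted names suggest, and that `maxLeasesToStealAtOneTimeFactor` is also an `input` key — that path is inferred from the description rather than confirmed; the worker name is a placeholder.

```hocon
{
  "input": {
    # Name of this KCL worker in the DynamoDB lease table (defaults to $HOSTNAME)
    "workerIdentifier": "bq-loader-worker-1"

    # Shard leases must be refreshed in DynamoDB before this duration expires
    "leaseDuration": "10 seconds"

    # With 4 available processors and a factor of 2.0, the worker may steal up to 8 leases at once
    "maxLeasesToStealAtOneTimeFactor": 2.0
  }
}
```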
docs/api-reference/loaders-storage-targets/bigquery-loader/configuration-reference/_pubsub_config.md
+19-10
@@ -3,24 +3,33 @@
<td>Required, e.g. <code>projects/myproject/subscriptions/snowplow-enriched</code>. Name of the Pub/Sub subscription with the enriched events</td>
</tr>
<tr>
- <td><code>input.parallelPullCount</code></td>
- <td>Optional. Default value 1. Number of threads used internally by the pubsub client library for fetching events</td>
+ <td><code>input.parallelPullFactor</code></td>
+ <td>Optional. Default value 0.5. <code>parallelPullFactor * cpu count</code> determines the number of threads used internally by the Pub/Sub client library for fetching events</td>
</tr>
<tr>
- <td><code>input.bufferMaxBytes</code></td>
- <td>Optional. Default value 10000000. How many bytes can be buffered by the loader app before blocking the pubsub client library from fetching more events. This is a balance between memory usage vs how efficiently the app can operate. The default value works well.</td>

Controls when ack deadlines are re-extended, for a message that is close to exceeding its ack deadline.
+ For example, if <code>durationPerAckExtension</code> is <code>60 seconds</code> and <code>minRemainingAckDeadline</code> is <code>0.1</code>, then the loader
+ will wait until there are <code>6 seconds</code> left of the deadline before re-extending the message deadline.
<td>Optional. Default value 60 seconds. Sets the min boundary on the value by which an ack deadline is extended. The actual value used is guided by runtime statistics collected by the pubsub client library.</td>
+ <td><code>input.maxMessagesPerPull</code></td>
+ <td>Optional. Default value 1000. How many Pub/Sub messages to pull from the server in a single request.</td>
<td>Optional. Default value 600 seconds. Sets the max boundary on the value by which an ack deadline is extended. The actual value used is guided by runtime statistics collected by the pubsub client library.</td>
+ <td><code>input.debounceRequests</code></td>
+ <td>
+ Optional. Default value <code>100 millis</code>.
+ Adds an artificial delay between consecutive requests to Pub/Sub for more messages.
+ Under some circumstances, this was found to slightly alleviate a problem in which Pub/Sub might re-deliver the same messages multiple times.
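A hedged sketch of the Pub/Sub options above, assuming they all sit under the `input` block. The `minRemainingAckDeadline` path and its value are taken from the worked example rather than from a confirmed default; the other values are the documented defaults.

```hocon
{
  "input": {
    # Fetch threads = parallelPullFactor * cpu count
    "parallelPullFactor": 0.5

    # Illustrative only: re-extend the ack deadline once 10% of it remains
    "minRemainingAckDeadline": 0.1

    # How many messages to pull from the server per request
    "maxMessagesPerPull": 1000

    # Artificial delay between consecutive pulls, to reduce duplicate re-deliveries
    "debounceRequests": "100 millis"
  }
}
```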
See the [configuration reference](/docs/api-reference/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters.
+ ### Iglu
+
+ The BigQuery Loader requires an [Iglu resolver file](/docs/api-reference/iglu/iglu-resolver/index.md) which describes the Iglu repositories that host your schemas. This should be the same Iglu configuration file that you used in the Enrichment process.
+
+ ## Metrics
+
+ The BigQuery Loader can be configured to send the following custom metrics to a [StatsD](https://www.datadoghq.com/statsd-monitoring/) receiver:
+
+ | Metric | Definition |
+ |-----------------------------|------------|
+ |`events_good`| A count of events that are successfully written to BigQuery. |
+ |`events_bad`| A count of failed events that could not be loaded, and were instead sent to the bad output stream. |
+ |`latency_millis`| The time in milliseconds from when events are written to the source stream (i.e. by Enrich) until when they are read by the loader. |
+ |`e2e_latency_millis`| The end-to-end latency of the Snowplow pipeline: the time in milliseconds from when an event was received by the collector until it is written into BigQuery. |
+
+ See the `monitoring.metrics.statsd` options in the [configuration reference](/docs/api-reference/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for how to configure the StatsD receiver.
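As an illustration only, a StatsD receiver would be configured under `monitoring.metrics.statsd`, along these lines — the field names and values here are assumptions, so check the configuration reference for the exact options.

```hocon
{
  "monitoring": {
    "metrics": {
      "statsd": {
        # Assumed field names; consult the configuration reference
        "hostname": "127.0.0.1"
        "port": 8125
        "tags": {
          "pipeline": "prod"
        }
      }
    }
  }
}
```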
+ ```mdx-code-block
+ import Telemetry from "@site/docs/reusable/telemetry/_index.md"
docs/api-reference/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md
+17
@@ -41,6 +41,23 @@ If you are not using Snowplow dbt models but still use dbt, you can employ [this
:::

+ ### Enable legacy mode for the old table format
+
+ To simplify migration to the new table format, it is possible to run the 2.x loader in legacy mode, so it loads self-describing events and entities using the old column names of the 1.x loader.
+
+ **Option 1:** In the configuration file, set `legacyColumnMode` to `true`. When this mode is enabled, the loader uses the legacy column style for all self-describing events and entities.
+
+ **Option 2:** In the configuration file, set `legacyColumns` to list specific schemas for which to use the legacy column style. This list is used when `legacyColumnMode` is `false` (the default).
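For example, the two options could be set in the loader's HOCON configuration file roughly as follows (the schema URI is a placeholder; choose one option or the other):

```hocon
# Option 1: legacy column names for all self-describing events and entities
"legacyColumnMode": true

# Option 2: legacy column names only for the listed schemas
# (consulted only while legacyColumnMode is false)
"legacyColumns": [
  "iglu:com.example/legacy/jsonschema/1-*-*"
]
```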
docs/destinations/warehouses-lakes/querying-data/index.md
+2-1
@@ -87,7 +87,8 @@ FROM
```

:::note
- The column name produced by previous versions of the BigQuery Loader (<2.0.0) would contain full schema version, e.g. `unstruct_event_my_example_event_1_0_0`
+ The column name produced by previous versions of the BigQuery Loader (<2.0.0) would contain the full schema version, e.g. `unstruct_event_my_example_event_1_0_0`.
+ The [BigQuery Loader upgrade guide](/docs/api-reference/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md) describes how to enable the legacy column names in the 2.0.0 loader.