Commit e1b34d5

relud authored and jklukas committed
Document utility function for determining a view's underlying table
1 parent 4c7335f commit e1b34d5

File tree

1 file changed: src/tools/spark.md (+27 −6 lines)
````diff
@@ -70,22 +70,43 @@ Example of using the Storage API from Databricks:
 dbutils.library.installPyPI("google-cloud-bigquery", "1.16.0")
 dbutils.library.restartPython()
 
-# Read one day of pings and select a subset of columns.
+from google.cloud import bigquery
+
+
+def get_table(view):
+    bq = bigquery.Client()
+    view = view.replace(":", ".")
+    # partition filter is required, so try a couple options
+    for partition_column in ["DATE(submission_timestamp)", "submission_date"]:
+        try:
+            job = bq.query(
+                f"SELECT * FROM `{view}` WHERE {partition_column} = CURRENT_DATE",
+                bigquery.QueryJobConfig(dry_run=True),
+            )
+            break
+        except Exception:
+            continue
+    else:
+        raise ValueError("could not determine partition column")
+    assert len(job.referenced_tables) == 1
+    table = job.referenced_tables[0]
+    return f"{table.project}:{table.dataset_id}.{table.table_id}"
+
+
+# Read one day of main pings and select a subset of columns.
 core_pings_single_day = spark.read.format("bigquery") \
-    .option("table", "moz-fx-data-shared-prod.telemetry_stable.core_v10") \
+    .option("table", get_table("moz-fx-data-shared-prod.telemetry.main")) \
     .load() \
     .where("submission_timestamp >= to_date('2019-08-25') and submission_timestamp < to_date('2019-08-26')") \
     .select("client_id", "experiments", "normalized_channel")
 ```
 
 A couple of things are worth noting in the above example.
 
-* You must supply an actual _table_ name to read from here, fully qualified
-  with project name and dataset name.
+* `get_table` is necessary because an actual _table_ name is required to read
+  from BigQuery here, fully qualified with project name and dataset name.
   The Storage API does not support accessing `VIEW`s, so the convenience names
   such as `telemetry.core` are not available via this API.
-  You can find the table corresponding to a given view using the BigQuery
-  console or using Data Catalog.
 * You must supply a filter on the table's date partitioning column, in this
   case `submission_timestamp`.
   Additionally, you must use the `to_date` function to make sure that predicate
````
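The technique the documented helper relies on is that a dry-run query against a view reports, in the job metadata, which tables the query would actually touch; trying each candidate partition filter in turn and falling through to the `for`/`else` clause handles views whose partition column is unknown up front. The sketch below exercises that control flow offline: `FakeClient`, `FakeJob`, and `FakeTable` are hypothetical stand-ins for the `google-cloud-bigquery` objects (they exist only for this illustration), and the client is injected as a parameter so no credentials are needed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeTable:
    # Mirrors the TableReference fields the helper reads.
    project: str
    dataset_id: str
    table_id: str


@dataclass
class FakeJob:
    referenced_tables: List[FakeTable]


class FakeClient:
    """Hypothetical stand-in for bigquery.Client: accepts only a
    DATE(submission_timestamp) filter, as a timestamp-partitioned table would."""

    def query(self, sql, job_config=None):
        if "DATE(submission_timestamp)" not in sql:
            raise Exception("no matching signature for operator =")
        return FakeJob(
            [FakeTable("moz-fx-data-shared-prod", "telemetry_stable", "main_v4")]
        )


def get_table(view, bq):
    # Same logic as the documented helper, with the client injected for testing.
    view = view.replace(":", ".")
    for partition_column in ["DATE(submission_timestamp)", "submission_date"]:
        try:
            job = bq.query(
                f"SELECT * FROM `{view}` WHERE {partition_column} = CURRENT_DATE"
            )
            break
        except Exception:
            continue
    else:
        # Loop finished without break: no candidate filter was accepted.
        raise ValueError("could not determine partition column")
    assert len(job.referenced_tables) == 1
    table = job.referenced_tables[0]
    return f"{table.project}:{table.dataset_id}.{table.table_id}"


print(get_table("moz-fx-data-shared-prod:telemetry.main", FakeClient()))
# → moz-fx-data-shared-prod:telemetry_stable.main_v4
```

In the real helper, passing `bigquery.QueryJobConfig(dry_run=True)` means the probe queries are validated and planned but never executed, so resolving a view to its table processes no bytes.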
