# Athena exploration

Walkthrough of different Athena functionality using the [Amazon Customer Reviews](https://registry.opendata.aws/amazon-reviews/) open dataset.

This dataset provides both TSV and Parquet versions of over 130 million customer reviews since 1995.

## Table Definitions

Create a table in Athena over the TSV dataset.
```sql
CREATE EXTERNAL TABLE amazon_reviews_tsv(
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  product_title string,
  product_category string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  vine string,
  verified_purchase string,
  review_headline string,
  review_body string,
  review_date date)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION
  's3://amazon-reviews-pds/tsv/'
TBLPROPERTIES ("skip.header.line.count"="1");
```
36+
37+
Run a simple query to preview the data.
38+
39+
```sql
40+
SELECT * FROM "amazon_reviews_tsv"
41+
WHERE marketplace = 'US'
42+
limit 10;
43+
```
Create a table over the Parquet dataset. It's partitioned by `product_category`, so after creating the table, load the partitions (for example with `MSCK REPAIR TABLE amazon_reviews_parquet;`) before querying.

Run a couple of aggregation queries to see that the amount of data scanned is minimal (KB to MB) compared to the full size of the data on S3 (~50 GiB).
```sql
CREATE EXTERNAL TABLE `amazon_reviews_parquet`(
  `marketplace` string,
  `customer_id` string,
  `review_id` string,
  `product_id` string,
  `product_parent` string,
  `product_title` string,
  `star_rating` int,
  `helpful_votes` int,
  `total_votes` int,
  `vine` string,
  `verified_purchase` string,
  `review_headline` string,
  `review_body` string,
  `review_date` bigint,
  `year` int)
PARTITIONED BY (`product_category` string)
STORED AS PARQUET
LOCATION 's3://amazon-reviews-pds/parquet';
```
```sql
SELECT product_id, product_title, COUNT(*) AS num_reviews, AVG(star_rating) AS avg_stars
FROM amazon_reviews_parquet
WHERE product_category = 'Toys'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100;

SELECT COUNT(*) FROM amazon_reviews_parquet
WHERE product_category = 'Toys' AND year >= 2012;

SELECT * FROM amazon_reviews_parquet
WHERE product_category = 'Toys'
LIMIT 100;
```
## CTAS Example

Re-partition by marketplace and year to allow for efficient queries. (This takes ~5 minutes to run.)

By default, the results are stored in a bucket automatically created in your account for Athena output: `aws-athena-query-results-<account_id>-<region>`.

See [Athena CTAS examples](https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html) for how to specify a specific S3 location with the `external_location` parameter.
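As a sketch of that option (the table name and output prefix here are placeholders, not part of the original walkthrough), `external_location` goes in the CTAS `WITH` clause:

```sql
CREATE TABLE amazon_reviews_ctas_example
WITH (
  external_location = 's3://<bucket>/amazon-reviews-ctas/',
  format = 'PARQUET'
) AS
SELECT marketplace, review_id, star_rating
FROM amazon_reviews_tsv;
```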
```sql
CREATE TABLE amazon_reviews_by_marketplace
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['marketplace', 'year']
) AS
SELECT customer_id, review_id, product_id, product_parent, product_title, product_category,
  star_rating, helpful_votes, total_votes, verified_purchase, review_headline, review_body, review_date,
  marketplace,
  year(review_date) AS year
FROM amazon_reviews_tsv
WHERE "$path" LIKE '%tsv.gz'
-- Run time: 4 minutes 43 seconds, Data scanned: 32.24 GB
```
Show the difference in query times and data scanned:

```sql
SELECT product_id, COUNT(*) FROM amazon_reviews_by_marketplace
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
-- Run time: 6.7 seconds, Data scanned: 790.26 MB
```
vs.

```sql
SELECT product_id, COUNT(*) FROM amazon_reviews_by_marketplace
WHERE marketplace='US' AND year = 2013
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
-- Run time: 3.87 seconds, Data scanned: 145 MB
```
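Since Athena bills by data scanned, partition pruning translates directly into cost savings. A quick sketch of what the two queries above would cost at the common $5 per TB rate, with Athena's 10 MB per-query minimum (both numbers are assumptions; the rate varies by region):

```python
def athena_cost_usd(mb_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Estimate Athena cost for a query from megabytes of data scanned."""
    mb = max(mb_scanned, 10.0)  # assumed 10 MB minimum billed per query
    return mb / (1024 * 1024) * usd_per_tb  # MB -> TB, then price per TB

# The two aggregation queries above:
print(f"full scan:   ${athena_cost_usd(790.26):.6f}")  # ~ $0.003768
print(f"partitioned: ${athena_cost_usd(145.0):.6f}")   # ~ $0.000691
```

Fractions of a cent either way at this scale, but the same ~5x ratio holds as the dataset grows.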
## Optimization Techniques

### Sorting by a specific field

If you frequently query data based on an ID and expect a limited amount of data back, you can sort the original dataset by that ID and write it out to a limited number of objects on S3. Athena will then use the [Parquet metadata](#parquet-metadata) to determine whether it needs to read the underlying data.
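To illustrate that pruning (a toy model, not Athena's actual implementation): each file or row group carries min/max statistics per column, and a reader can skip any file whose range cannot contain the predicate value. Sorting produces tight, non-overlapping ranges, so most files are skipped:

```python
import random

random.seed(0)
ids = [f"P{n:05d}" for n in range(10_000)]
rows = [random.choice(ids) for _ in range(100_000)]

def make_files(rows, n_files=7):
    """Split rows into chunks and record per-file min/max column stats."""
    size = len(rows) // n_files
    files = [rows[i * size:(i + 1) * size] for i in range(n_files)]
    return [(min(f), max(f), f) for f in files]

def files_read(files, target):
    """A reader opens only files whose [min, max] range could hold target."""
    return sum(1 for lo, hi, _ in files if lo <= target <= hi)

target = "P04242"
sorted_files = make_files(sorted(rows))
unsorted_files = make_files(rows)

print("sorted:  ", files_read(sorted_files, target))    # 1-2 files opened
print("unsorted:", files_read(unsorted_files, target))  # all 7 files opened
```

With unsorted data, every file's min/max range spans nearly the whole ID space, so nothing can be skipped.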
One option is to use CTAS to create a derivative dataset sorted on the specific fields. This can take a while to run because of the sorting step in the execution plan.

```sql
CREATE TABLE amazon_reviews_sorted
WITH (
  format = 'PARQUET'
) AS
SELECT product_id, customer_id, product_parent, star_rating, helpful_votes, total_votes,
  verified_purchase, marketplace, product_category, review_date
FROM amazon_reviews_by_marketplace
ORDER BY product_id ASC
-- Run time: 18 minutes 13 seconds, Data scanned: 2.44 GB
```
Note that this only outputs seven heavily skewed files, but all rows for a specific `product_id` should be in one file.

```sql
SELECT "$path", product_id, COUNT(*) FROM amazon_reviews_sorted
WHERE product_id = 'B00E8KLWB4'
GROUP BY 1, 2 ORDER BY 1 ASC
-- Run time: 4.18 seconds, Data scanned: 81.9 MB
```
vs.

```sql
CREATE TABLE amazon_reviews_unsorted
WITH (
  format = 'PARQUET',
  bucketed_by = ARRAY['review_id'],
  bucket_count = 30
) AS
SELECT review_id, product_id, customer_id, product_parent, star_rating, helpful_votes, total_votes,
  verified_purchase, marketplace, product_category, review_date
FROM amazon_reviews_by_marketplace
-- Run time: 40.04 seconds, Data scanned: 2.44 GB
```
We used the bucketing functionality to distribute the data evenly across 30 buckets, bucketing on `review_id` since it is high-cardinality and allows for an even distribution.

```sql
SELECT "$path", product_id, COUNT(*) FROM amazon_reviews_unsorted
WHERE product_id = 'B00E8KLWB4'
GROUP BY 1, 2 ORDER BY 1 ASC
-- Run time: 4.39 seconds, Data scanned: 834.36 MB
```

Initially I tried to bucket by `product_id`, which still puts all rows for a given `product_id` in one file. The files aren't sorted by `product_id`, though, because the value is hashed to choose a bucket. Instead, we bucket on `review_id`, which effectively randomizes the `product_id`s across files.
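The bucket assignment can be sketched as a hash of the key modulo the bucket count (a toy model of the behavior, not Athena's exact hash function):

```python
import hashlib

def bucket_for(key: str, n_buckets: int = 30) -> int:
    """Deterministically assign a key to a bucket by hashing it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_buckets

# A high-cardinality key like review_id spreads rows across all buckets...
counts = [0] * 30
for n in range(10_000):
    counts[bucket_for(f"R{n:07d}")] += 1
print(min(counts), max(counts))  # roughly even, around 333 per bucket

# ...while all rows sharing a key always land in the same single bucket.
assert bucket_for("B00E8KLWB4") == bucket_for("B00E8KLWB4")
```

This is why bucketing keeps each `product_id` in one file but gives no ordering across files: neighbors in key space hash to unrelated buckets.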
It's tough to control sorting and the number of output files using CTAS, but Spark can do this well. Using something like EMR Notebooks or AWS Glue, we read the original dataset and use `repartitionByRange` to sort `product_id` into 30 different output files.

```python
(spark.read.parquet("s3://amazon-reviews-pds/parquet/")
    .select("marketplace", "customer_id", "review_id", "product_id", "product_parent", "star_rating")
    .repartitionByRange(30, "product_id")
    .write.mode("overwrite")
    .parquet("s3://<bucket>/amazon-reviews-sorted-subset/", compression="gzip")
)
```
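Conceptually, `repartitionByRange` differs from hash bucketing in that keys are assigned to files by sorted range, so each file covers a disjoint, ordered slice of the key space. A minimal sketch of the idea (a toy model, not Spark's sampling-based range partitioner):

```python
def range_partition(keys, n_parts):
    """Build a partitioner that maps keys to contiguous sorted ranges."""
    bounds = sorted(set(keys))
    step = max(1, len(bounds) // n_parts)
    cuts = bounds[step::step][:n_parts - 1]  # n_parts - 1 split points

    def part(key):
        # Partition index = number of split points the key is >= to
        return sum(key >= c for c in cuts)

    return part

keys = [f"P{n:05d}" for n in range(1000)]
part = range_partition(keys, 30)

# Low keys land in early partitions, high keys in late ones,
# and partition index never decreases as the key grows.
print(part("P00000"), part("P00500"), part("P00999"))
```

Because ranges are disjoint and ordered, min/max statistics per file are tight, which is exactly what the metadata-based pruning above needs.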
And then back in Athena...

```sql
CREATE EXTERNAL TABLE amazon_reviews_spark_sorted (
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  star_rating int
)
STORED AS PARQUET
LOCATION 's3://<bucket>/amazon-reviews-sorted-subset/'
```

```sql
SELECT "$path", COUNT(*) FROM amazon_reviews_spark_sorted
GROUP BY 1 ORDER BY 1 ASC
-- About 5-6M records per file
```
210+
211+
## Parquet metadata
212+
213+
[parquet-tools](https://github.com/apache/parquet-mr/tree/master/parquet-tools) is a fantastic utility for analyzing the content of Parquet files.
214+
215+
If you're on a mac, it's available via homebrew: `brew install parquet-tools`
216+
217+
Download a sample Parquet file and print out the metadata:
218+
219+
```shell
curl -O https://s3.amazonaws.com/amazon-reviews-pds/parquet/product_category=Watches/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
parquet-tools meta part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
```
You'll see detailed information about the file, including the number of rows, minimum and maximum values per column, and the schema.

_Some rows left out for brevity_
```
file:        file:/private/tmp/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
creator:     parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)

file schema: spark_schema
--------------------------------------------------------------------------------
product_title: OPTIONAL BINARY O:UTF8 R:0 D:1
star_rating:   OPTIONAL INT32 R:0 D:1
helpful_votes: OPTIONAL INT32 R:0 D:1
review_date:   OPTIONAL INT32 O:DATE R:0 D:1
year:          OPTIONAL INT32 R:0 D:1

row group 1: RC:97608 TS:39755962 OFFSET:4
--------------------------------------------------------------------------------
product_title: BINARY SNAPPY DO:0 FPO:3243045 SZ:3170609/6450771/2.03 VC:97608 ENC:PLAIN,PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[no stats for this column]
star_rating:   INT32 SNAPPY DO:0 FPO:6413654 SZ:36016/36709/1.02 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1, max: 5, num_nulls: 0]
helpful_votes: INT32 SNAPPY DO:0 FPO:6449670 SZ:48348/93031/1.92 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 0, max: 753, num_nulls: 0]
review_date:   INT32 SNAPPY DO:0 FPO:23689606 SZ:35674/146381/4.10 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2001-04-05, max: 2015-08-31, num_nulls: 0]
year:          INT32 SNAPPY DO:0 FPO:23725280 SZ:2004/37279/18.60 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2001, max: 2015, num_nulls: 0]
```
More detailed information on the different fields for each column is in the [meta legend](https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend).

Note that current versions of the tool may not show string statistics by default, as they could be incorrect: [PARQUET-686](https://issues.apache.org/jira/browse/PARQUET-686).
