# Athena exploration

Walkthrough of different Athena functionality using the [Amazon Customer Reviews](https://registry.opendata.aws/amazon-reviews/) open dataset.

This dataset provides both TSV and Parquet versions of over 130 million customer reviews since 1995.

## Table Definitions

Create a table in Athena over the TSV dataset.
```sql
CREATE EXTERNAL TABLE amazon_reviews_tsv(
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  product_title string,
  product_category string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  vine string,
  verified_purchase string,
  review_headline string,
  review_body string,
  review_date date)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION
  's3://amazon-reviews-pds/tsv/'
TBLPROPERTIES ("skip.header.line.count"="1");
```

Run a simple query to preview the data.

```sql
SELECT * FROM "amazon_reviews_tsv"
WHERE marketplace = 'US'
LIMIT 10;
```

Create a table over the Parquet dataset. It's partitioned by `product_category`.

Run a couple of aggregation queries to see that the amount of data scanned is minimal (KBs to MBs) compared to the full size of the data on S3 (~50 GiB).

```sql
CREATE EXTERNAL TABLE `amazon_reviews_parquet`(
  `marketplace` string,
  `customer_id` string,
  `review_id` string,
  `product_id` string,
  `product_parent` string,
  `product_title` string,
  `star_rating` int,
  `helpful_votes` int,
  `total_votes` int,
  `vine` string,
  `verified_purchase` string,
  `review_headline` string,
  `review_body` string,
  `review_date` bigint,
  `year` int)
PARTITIONED BY ( `product_category` string )
STORED AS PARQUET
LOCATION 's3://amazon-reviews-pds/parquet';
```
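Because the Parquet table is partitioned, the partition metadata needs to be loaded before querying it; otherwise queries will return no results:

```sql
MSCK REPAIR TABLE amazon_reviews_parquet;
```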

```sql
SELECT product_id, product_title, count(*) AS num_reviews, avg(star_rating) AS avg_stars
FROM amazon_reviews_parquet WHERE product_category='Toys'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 100;

SELECT COUNT(*) FROM amazon_reviews_parquet WHERE product_category='Toys' AND year >= 2012;

SELECT * FROM amazon_reviews_parquet
WHERE product_category='Toys'
LIMIT 100;
```

## CTAS Example

Re-partition by marketplace and year to allow for efficient queries. (This takes ~5 minutes to run.)

By default, the results are stored in a bucket automatically created in your account for Athena output: `aws-athena-query-results-<account_id>-<region>`.

See [Athena CTAS examples](https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html) for how to write to a specific S3 location with the `external_location` parameter.
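As a sketch (the bucket name is a placeholder), adding `external_location` to the `WITH` clause would look like:

```sql
CREATE TABLE amazon_reviews_by_marketplace
WITH (
    external_location = 's3://<bucket>/amazon-reviews-by-marketplace/',
    format = 'PARQUET',
    partitioned_by = ARRAY['marketplace', 'year']
) AS
SELECT ... -- same SELECT as in the full statement below
```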

```sql
CREATE TABLE amazon_reviews_by_marketplace
WITH (
    format='PARQUET',
    partitioned_by = ARRAY['marketplace', 'year']
) AS
SELECT customer_id, review_id, product_id, product_parent, product_title, product_category, star_rating, helpful_votes, total_votes, verified_purchase, review_headline, review_body, review_date,
    marketplace,
    year(review_date) as year
FROM amazon_reviews_tsv
WHERE "$path" LIKE '%tsv.gz'
-- Run time: 4 minutes 43 seconds, Data scanned: 32.24 GB
```

Compare the query times and data scanned with and without partition filters:

```sql
SELECT product_id, COUNT(*) FROM amazon_reviews_by_marketplace
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
-- Run time: 6.7 seconds, Data scanned: 790.26 MB
```

vs.

```sql
SELECT product_id, COUNT(*) FROM amazon_reviews_by_marketplace
WHERE marketplace='US' AND year = 2013
GROUP BY 1 ORDER BY 2 DESC LIMIT 10
-- Run time: 3.87 seconds, Data scanned: 145 MB
```

## Optimization Techniques

### Sorting by a specific field

If you frequently query data based on an ID and expect a limited amount of data to be returned, you can sort the original dataset by that ID and write it out to a limited number of objects on S3. Athena will use the [Parquet metadata](#parquet-metadata) to determine whether it needs to read the underlying data.

One option is to use CTAS to create a derivative dataset sorted on the specific fields. This can take a while to run due to the sorting step in the execution plan.

```sql
CREATE TABLE amazon_reviews_sorted
WITH (
    format='PARQUET'
) AS
SELECT product_id, customer_id, product_parent, star_rating, helpful_votes, total_votes, verified_purchase, marketplace, product_category, review_date
FROM amazon_reviews_by_marketplace
ORDER BY product_id ASC
-- Run time: 18 minutes 13 seconds, Data scanned: 2.44 GB
```

Note that this only outputs seven heavily skewed files, but all rows for a specific `product_id` should be in one file.

```sql
SELECT "$path", product_id, COUNT(*) FROM amazon_reviews_sorted
WHERE product_id = 'B00E8KLWB4'
GROUP BY 1, 2 ORDER BY 1 ASC
-- Run time: 4.18 seconds, Data scanned: 81.9 MB
```

vs.

```sql
CREATE TABLE amazon_reviews_unsorted
WITH (
    format='PARQUET',
    bucketed_by = ARRAY['review_id'],
    bucket_count = 30
) AS
SELECT review_id, product_id, customer_id, product_parent, star_rating, helpful_votes, total_votes, verified_purchase, marketplace, product_category, review_date
FROM amazon_reviews_by_marketplace
-- Run time: 40.04 seconds, Data scanned: 2.44 GB
```

Here we used the bucketing functionality to distribute the data evenly across 30 files, bucketing by `review_id` because its high cardinality allows for an even distribution.

```sql
SELECT "$path", product_id, COUNT(*) FROM amazon_reviews_unsorted
WHERE product_id = 'B00E8KLWB4'
GROUP BY 1, 2 ORDER BY 1 ASC
-- Run time: 4.39 seconds, Data scanned: 834.36 MB
```

Initially I tried to bucket by `product_id`, but that still puts each `product_id` in one file.
The data isn't sorted across files, though, since the field is hashed and the hash determines the bucket.
Instead, we bucket on `review_id`, which effectively randomizes the `product_id`s.
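Conceptually, bucketing routes each row to a file by hashing the bucket column, along the lines of this sketch (the `bucket_for` helper is hypothetical, and Hive/Athena use their own bucketing hash, not MD5):

```python
import hashlib

def bucket_for(value: str, bucket_count: int = 30) -> int:
    # Deterministic hash of the bucket column value; every row with the
    # same value maps to the same bucket (and therefore the same file).
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count

# All reviews for one product_id land in a single file...
assert bucket_for("B00E8KLWB4") == bucket_for("B00E8KLWB4")
# ...but nearby IDs scatter across buckets rather than sorting together.
```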

It's tough to control sorting and the number of output files using CTAS, but Spark can do this well. Using something like EMR Notebooks or AWS Glue, we read the original dataset and use `repartitionByRange` to sort `product_id` into 30 different output files.

```python
(spark.read.parquet("s3://amazon-reviews-pds/parquet/")
    .select("marketplace", "customer_id", "review_id", "product_id", "product_parent", "star_rating")
    .repartitionByRange(30, "product_id")
    .write.mode("overwrite")
    .parquet("s3://<bucket>/amazon-reviews-sorted-subset/", compression="gzip")
)
```

And then back in Athena...

```sql
CREATE EXTERNAL TABLE amazon_reviews_spark_sorted (
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  star_rating int
)
STORED AS PARQUET
LOCATION 's3://<bucket>/amazon-reviews-sorted-subset/'
```

```sql
SELECT "$path", COUNT(*) FROM amazon_reviews_spark_sorted
GROUP BY 1 ORDER BY 1 ASC
-- About 5-6M records per file
```

## Parquet metadata

[parquet-tools](https://github.com/apache/parquet-mr/tree/master/parquet-tools) is a fantastic utility for analyzing the contents of Parquet files.

If you're on a Mac, it's available via Homebrew: `brew install parquet-tools`

Download a sample Parquet file and print out the metadata:

```shell
curl -O https://s3.amazonaws.com/amazon-reviews-pds/parquet/product_category=Watches/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
parquet-tools meta part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
```

You'll see detailed information about the file, including the number of rows, minimum and maximum values, and the schema.

_Some rows left out for brevity_

```
file:           file:/private/tmp/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
creator:        parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)

file schema:    spark_schema
--------------------------------------------------------------------------------
product_title:  OPTIONAL BINARY O:UTF8 R:0 D:1
star_rating:    OPTIONAL INT32 R:0 D:1
helpful_votes:  OPTIONAL INT32 R:0 D:1
review_date:    OPTIONAL INT32 O:DATE R:0 D:1
year:           OPTIONAL INT32 R:0 D:1

row group 1:    RC:97608 TS:39755962 OFFSET:4
--------------------------------------------------------------------------------
product_title:  BINARY SNAPPY DO:0 FPO:3243045 SZ:3170609/6450771/2.03 VC:97608 ENC:PLAIN,PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[no stats for this column]
star_rating:    INT32 SNAPPY DO:0 FPO:6413654 SZ:36016/36709/1.02 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1, max: 5, num_nulls: 0]
helpful_votes:  INT32 SNAPPY DO:0 FPO:6449670 SZ:48348/93031/1.92 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 0, max: 753, num_nulls: 0]
review_date:    INT32 SNAPPY DO:0 FPO:23689606 SZ:35674/146381/4.10 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2001-04-05, max: 2015-08-31, num_nulls: 0]
year:           INT32 SNAPPY DO:0 FPO:23725280 SZ:2004/37279/18.60 VC:97608 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2001, max: 2015, num_nulls: 0]
```

More detailed information on the different fields for each column is [here](https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend).

Note that current versions of the tool may not show string statistics by default, as they could be incorrect: [PARQUET-686](https://issues.apache.org/jira/browse/PARQUET-686).