Expedia has integrated Alluxio support into our datalakes and into our query tools to allow efficient cached cross-region querying of data stored in AWS S3 buckets. For this demo, we will be showing how our Analytics datalake and platform can access data in the Hotels.com datalake in a different region. We do this transparently to the query engine, in this case Databricks, with the following methodology:
- Databricks is in one AWS region and the Hotels.com datalake is in us-west-2.
- Create an Alluxio cluster in AWS, configured with a list of S3 buckets in us-west-2 that it should mount.
- Install the Alluxio client JAR into the classpath of all Databricks clusters so that the alluxio:// URI scheme is supported by the query engine.
- Set the Hive Metastore URI for Databricks to our Hive federation process, Waggledance. Waggledance is a Hive/Thrift API-compatible proxy that can federate multiple Hive metastores into one logical metastore.
- Waggledance is configured with the same list of S3 buckets. When Waggledance gets an HMS response where a table or partition path contains one of those buckets, it replaces the location with a corresponding alluxio:// URI pointing to our Alluxio cluster.
- When Databricks makes a call to the Hive metastore, any tables that reference those buckets are processed with the returned alluxio:// URI, and the Alluxio client JAR is used to read that data from the Alluxio cluster.
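
The location rewrite described in the steps above can be sketched in a few lines of Python. This is an illustration only, not the Waggledance or Apiary filter code: the function name, the bucket names, and the Alluxio master hostname are hypothetical, and the sketch assumes each bucket is mounted at /<bucket> in the Alluxio namespace (as in the mount script in this post) and that the master listens on Alluxio's default RPC port, 19998.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration; the real bucket list and Alluxio
# master address are set in Terraform and are not shown in this post.
ALLUXIO_AUTHORITY = "alluxio-master.example.internal:19998"
CACHED_BUCKETS = {"example-hotels-bucket-a", "example-hotels-bucket-b"}

# Assumption: table locations may use any of the common S3 URI schemes.
S3_SCHEMES = {"s3", "s3a", "s3n"}


def to_alluxio_location(location, buckets=CACHED_BUCKETS,
                        authority=ALLUXIO_AUTHORITY):
    """Return an alluxio:// URI for a location in a cached bucket.

    Locations in any other bucket or scheme are returned unchanged.
    """
    parts = urlsplit(location)
    if parts.scheme not in S3_SCHEMES or parts.netloc not in buckets:
        return location
    # The bucket name becomes the first path component, matching a
    # mount point of /<bucket> in the Alluxio namespace.
    return f"alluxio://{authority}/{parts.netloc}{parts.path}"


if __name__ == "__main__":
    print(to_alluxio_location("s3://example-hotels-bucket-a/db/table/day=1"))
    print(to_alluxio_location("s3://some-other-bucket/db/table"))
```

Because locations outside the configured bucket list pass through untouched, only tables in the cross-region buckets are routed through the cache; everything else is still read directly from S3.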
- Apiary: Expedia Group's open-source collection of Terraform, Docker, and Java to create a federated cloud data lake with Ranger authorization.
- Waggledance: Open-source Hive Metastore federation proxy.
  - Responsible for presenting a federated view of various Expedia datalakes as one logical datalake, allowing cross-datalake queries, joins, etc.
  - This is the piece of code doing the actual S3-to-Alluxio path translations on Hive tables and partitions.
- Apiary Hive Metastore Filter: Open-source Hive metastore filter that can do arbitrary path replacements.
  - We install this into the container running Waggledance.
  - It is configured to replace s3:// paths of our target bucket list with the corresponding alluxio:// paths.
- Apiary Waggledance Docker: Open-source Docker image to configure Waggledance to run in Expedia's datalake.
  - Hive filter is configured here.
- Apiary Federation: Open-source Terraform module to configure Waggledance via our Docker image.
  - Alluxio replacements are configured here.
- Inner-source Terraform module to deploy an Alluxio cluster.
  - The list of S3 buckets for path replacement is set in Terraform and mounted using AWS State Manager (Ansible playbook):
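    # Note: `$$` is kept as in the source; it appears to be template escaping for a literal `$`.
    - name: script to mount s3 buckets
      copy:
        dest: /tmp/mount-s3-buckets.sh
        mode: "0755"
        owner: hdfs
        group: hdfs
        content: |
          #!/bin/bash
          # Mount each bucket read-only at /<bucket> in the Alluxio namespace.
          for s3_bucket in $$s3_mounts_ro
          do
            /opt/alluxio/bin/alluxio fs mount --shared --readonly /$$s3_bucket s3://$$s3_bucket
          done
    - name: mount s3 buckets
      command: /tmp/mount-s3-buckets.sh
      become: yes
      become_user: hdfs
      ignore_errors: yes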
- egdp-tf-app-egap-apiary Federation component: Inner-source Terraform application used to configure Waggledance, Alluxio, and the federation layer for our Analytics Platform Data Lake.
  - The list of cross-region S3 buckets to access via Alluxio is specified as a Terraform variable.
  - Alluxio is created and configured using its Terraform module with the above list of buckets.
  - The Waggledance instance is created using its Terraform module, which also configures the PathConversion Hive filter in Waggledance to use the same list of buckets.
- Inner-source Terraform application to deploy Databricks to the EG Analytics Platform.
  - The Alluxio client JAR is configured in each Databricks cluster using code similar to:

    # alluxio client
    aws s3 cp "${TFVAR_ALLUXIO_CLIENT_PACKAGE}" "/databricks/jars/alluxio-${TFVAR_ALLUXIO_VERSION}-client.jar"
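
The components above all depend on one shared list of buckets: Alluxio mounts them and the Hive filter rewrites paths for them. A rough Python sketch of that single-source-of-truth idea follows. It is not how the Terraform modules are implemented; the function names, bucket names, and Alluxio address are hypothetical, and only the mount command itself is taken from the playbook shown earlier.

```python
def mount_commands(buckets):
    """Build one read-only Alluxio mount command per bucket.

    The command mirrors the one in the Ansible playbook: each bucket is
    mounted at /<bucket> in the Alluxio namespace.
    """
    return [
        f"/opt/alluxio/bin/alluxio fs mount --shared --readonly /{b} s3://{b}"
        for b in buckets
    ]


def path_replacements(buckets, alluxio_authority):
    """Map each s3:// bucket prefix to its alluxio:// counterpart."""
    return {
        f"s3://{b}/": f"alluxio://{alluxio_authority}/{b}/"
        for b in buckets
    }


if __name__ == "__main__":
    # Hypothetical bucket list and Alluxio master address.
    buckets = ["example-hotels-bucket-a", "example-hotels-bucket-b"]
    authority = "alluxio-master.example.internal:19998"

    for cmd in mount_commands(buckets):
        print(cmd)
    for src, dst in path_replacements(buckets, authority).items():
        print(f"{src} -> {dst}")
```

Deriving both outputs from the same list keeps the mounts and the rewrites in step: a bucket that is rewritten but not mounted would produce alluxio:// locations that cannot be read.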