This StreamSets Transformer pipeline performs clickstream analysis and runs on Apache Spark deployed on an Amazon EMR cluster. It ingests raw clickstream logs from Amazon S3, aggregates them, and stores the aggregated results in Amazon Redshift for analysis. The pipeline also sends the raw logs to Elasticsearch for querying and quick visualizations.
- StreamSets Transformer 3.14.0 or higher. You can deploy Transformer on your choice of cloud provider or download it for local development.
- Access to an Amazon EMR cluster with Spark
- Ensure the prerequisites for Amazon EMR are satisfied
- Access to Amazon S3
- Access to an Amazon Redshift cluster
- Download and import the pipeline into your instance of Transformer
- Download the sample dataset and upload it to your Amazon S3 bucket
- After importing the pipeline into your environment and before running the pipeline, update the following pipeline parameters:
[
{
"key": "EMR_STAGING",
"value": ""
},
{
"key": "EMR_CLUSTER_ID",
"value": ""
},
{
"key": "AWS_DATA_BUCKET",
"value": ""
},
{
"key": "ES_URL",
"value": ""
},
{
"key": "REDSHIFT_ENDPOINT",
"value": ""
},
{
"key": "AWS_TEMP_BUCKET",
"value": ""
},
{
"key": "ES_INDEX",
"value": ""
},
{
"key": "REDSHIFT_USER",
"value": ""
},
{
"key": "REDSHIFT_SCHEMA",
"value": ""
}
]
These pipeline parameters are used by various stages in the pipeline. They supply values such as the Amazon S3 buckets, the Amazon Redshift endpoint and credentials, and the Elasticsearch URL and index name.
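Because every parameter above ships with an empty value, it can help to check that none were left blank before starting the pipeline. The following is a minimal sketch in Python, not part of Transformer itself: it assumes you have the parameter list available as JSON text in the same `key`/`value` shape shown above. The function name `find_unset_parameters` and the sample values in the usage section are hypothetical and only for illustration.

```python
import json

# Parameter keys listed in this README; each must be set before running.
REQUIRED_KEYS = [
    "EMR_STAGING",
    "EMR_CLUSTER_ID",
    "AWS_DATA_BUCKET",
    "ES_URL",
    "REDSHIFT_ENDPOINT",
    "AWS_TEMP_BUCKET",
    "ES_INDEX",
    "REDSHIFT_USER",
    "REDSHIFT_SCHEMA",
]


def find_unset_parameters(parameters_json):
    """Return the required keys that are missing or have a blank value.

    `parameters_json` is JSON text holding a list of
    {"key": ..., "value": ...} objects, as shown in this README.
    """
    entries = json.loads(parameters_json)
    values = {entry["key"]: entry.get("value", "") for entry in entries}
    return [
        key
        for key in REQUIRED_KEYS
        if not str(values.get(key) or "").strip()
    ]


if __name__ == "__main__":
    # Hypothetical, partially filled parameter list used only as a demo.
    sample = json.dumps(
        [
            {"key": "EMR_CLUSTER_ID", "value": "example-cluster-id"},
            {"key": "ES_INDEX", "value": "example-index"},
            {"key": "REDSHIFT_USER", "value": ""},
        ]
    )
    for key in find_unset_parameters(sample):
        print(f"Parameter {key} is not set")
```

The check only confirms that a value is present; it does not verify that a bucket, cluster ID, or endpoint actually exists or is reachable.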
For technical details, a detailed explanation of this use case, and a demo video, read this blog.