This repository has been archived by the owner on Feb 17, 2025. It is now read-only.

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

This StreamSets Transformer pipeline runs on Apache Spark deployed on an Amazon EMR cluster and performs clickstream analysis. It ingests raw clickstream logs from Amazon S3, performs aggregations, and stores the results in Amazon Redshift for analysis. The pipeline also sends the raw logs to Elasticsearch for querying and quick visualizations.

Prerequisites

  • StreamSets Transformer 3.14.0 or higher. You can deploy Transformer on your choice of cloud provider or download it for local development.
  • Access to an Amazon EMR cluster with Spark
  • Access to Amazon S3
  • Access to an Amazon Redshift cluster

Setup

[
  {
    "key": "EMR_STAGING",
    "value": ""
  },
  {
    "key": "EMR_CLUSTER_ID",
    "value": ""
  },
  {
    "key": "AWS_DATA_BUCKET",
    "value": ""
  },
  {
    "key": "ES_URL",
    "value": ""
  },
  {
    "key": "REDSHIFT_ENDPOINT",
    "value": ""
  },
  {
    "key": "AWS_TEMP_BUCKET",
    "value": ""
  },
  {
    "key": "ES_INDEX",
    "value": ""
  },
  {
    "key": "REDSHIFT_USER",
    "value": ""
  },
  {
    "key": "REDSHIFT_SCHEMA",
    "value": ""
  }
]

These pipeline parameters are used by various stages in the pipeline. They supply the Amazon S3 buckets, the Amazon Redshift endpoint and credentials, the Elasticsearch URL and index name, and so on.
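
Because every parameter above ships with an empty value, it can help to check that none was left blank before running the pipeline. The following is a minimal sketch, not part of this repository: the helper name `fill_parameters` and all example values are hypothetical placeholders, and only the parameter keys are taken from the list above.

```python
import json

# Parameter keys copied from the list above.
REQUIRED_KEYS = [
    "EMR_STAGING", "EMR_CLUSTER_ID", "AWS_DATA_BUCKET", "ES_URL",
    "REDSHIFT_ENDPOINT", "AWS_TEMP_BUCKET", "ES_INDEX",
    "REDSHIFT_USER", "REDSHIFT_SCHEMA",
]


def fill_parameters(values):
    """Build the key/value parameter list in the shape shown above.

    Hypothetical helper: raises ValueError if any required key is
    missing or has a blank value.
    """
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    if missing:
        raise ValueError("Missing parameter values: " + ", ".join(missing))
    return [{"key": key, "value": values[key]} for key in REQUIRED_KEYS]


if __name__ == "__main__":
    # All values below are made-up placeholders; substitute your own.
    example = {
        "EMR_STAGING": "s3://example-staging-bucket/transformer",
        "EMR_CLUSTER_ID": "j-EXAMPLE",
        "AWS_DATA_BUCKET": "example-data-bucket",
        "ES_URL": "https://es.example.com:9200",
        "REDSHIFT_ENDPOINT": "redshift.example.com:5439/dev",
        "AWS_TEMP_BUCKET": "example-temp-bucket",
        "ES_INDEX": "clickstream",
        "REDSHIFT_USER": "example_user",
        "REDSHIFT_SCHEMA": "public",
    }
    print(json.dumps(fill_parameters(example), indent=2))
```

Running the script prints a filled-in list in the same key/value shape as above. Leaving any value blank raises a `ValueError` that names the missing keys.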

Technical Details & Demo Video

For technical information, a detailed explanation of this use case, and a demo video, read this blog.