Important: These instructions assume you have access to StreamSets Transformer.
- For help installing StreamSets Transformer, see StreamSets Transformer Installation.
Here is a link to a short video on using this pipeline template: Video Link
This pipeline demonstrates how to create, register, and use a User-Defined Function in Scala using StreamSets Transformer.
The source data for this pipeline is included in the Dev Raw Data Source stage as an example. Typically, you would replace this with your actual source data (JDBC, files, etc.). This template writes data to a file on the local file system, but you would typically replace this with your actual destination.
Disclaimer: This pipeline is meant to serve as a template for creating, registering, and using a User-Defined Function in Scala.
NOTE: Templates are supported in StreamSets Control Hub. If you do not have Control Hub, you can import the template pipeline into Transformer, but you will need to do that each time you want to use the template.
Stage | Description |
---|---|
Dev Raw Data Source | Generates records based on user-supplied data |
Create UDFs | Creates a small example function and registers it with SparkSQL as a column function |
Use UDF | Leverages created UDF as a SparkSQL Expression Function |
Write udf | Writes data to a local file system |
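
The Create UDFs and Use UDF stages described above follow the usual Spark pattern: define a Scala function, register it with Spark SQL under a name, then call it by that name in a SQL expression. The following is a minimal standalone sketch of that pattern, not the code shipped in the template. The function `cleanName`, the registered name `clean_name`, the column names, and the sample rows are all hypothetical, and the local `SparkSession` is created here only so the example can run on its own (inside Transformer, the stage is assumed to supply the session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object ScalaUdfSketch {
  // Hypothetical example function: trims and upper-cases a string.
  // Nulls are passed through so the UDF does not fail on missing values.
  val cleanName: String => String =
    s => if (s == null) null else s.trim.toUpperCase

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: in Transformer the
    // session is provided by the stage rather than built like this).
    val spark = SparkSession.builder()
      .appName("scala-udf-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Step 1 (Create UDFs): register the function with Spark SQL
    // under the hypothetical name "clean_name".
    spark.udf.register("clean_name", cleanName)

    // Hypothetical stand-in for the records produced by Dev Raw Data Source.
    val input = Seq((1, "  ada "), (2, "grace")).toDF("id", "name")

    // Step 2 (Use UDF): call the registered function by name
    // inside a Spark SQL expression.
    val result = input.withColumn("name_clean", expr("clean_name(name)"))
    result.show()

    spark.stop()
  }
}
```

Keeping the function as a plain Scala value, separate from the registration call, makes it possible to unit test the logic without starting Spark.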
Click Here to download the pipeline and save it to your drive.
Click the down arrow next to "Create New Pipeline" and select "Import Pipeline From Archive".
Click "Browse" and locate the pipeline file you just downloaded, click "OK", then click "Import".
Click the pipeline you just imported to open it, then click the "Parameters" tab and fill in the appropriate information for your environment.
Important: For this pipeline, you only need to specify the output directory for the file. This directory is on the local file system of the machine where Transformer is installed. Make sure the directory exists and that its permissions allow the transformer user to create files. By default, the directory /data/udf is used, but you can change it to any path you want.
The following parameters are set up for this pipeline:
Parameter | Description |
---|---|
destination_directory | Path to the directory for the output files. Use the following format: |
Click the "START" button to run the pipeline.