Configure ETL Application

On the Pipeline Definition page, you can tailor and refine numerous settings for your ETL application. These configurations define how the application behaves at runtime.

Application Name

Provide a unique name for the ETL application. This name is used to save and identify your pipeline. Application names must begin with a letter and can include alphanumeric characters and the special characters !@$-;:()-_?=~/*<>'.
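
For example: Sales_Orders_ETL or retail_ingest_01 (placeholder names for illustration; both begin with a letter).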


Application Deployment

Choose where your ETL application will run: on a Gathr-managed cluster or on a cluster registered by you (for example, an EMR cluster).

  • Local Mode: This is the default option. Your application runs on a cluster managed by Gathr, which takes care of the cluster infrastructure and ensures seamless execution of your applications.

  • Registered Cluster: Select this option if you prefer to run your applications on clusters managed by you.

    To use registered clusters for running applications, you must first establish a virtual private connection from the User Settings > Compute Setup tab.

    To understand the steps for setting up PrivateLink connections, see Compute Setup →


Additional configuration fields for Registered Clusters with Apache Spark compute setup:

Runtime Account and Name

Select the Apache Spark account to be used for deploying the application.

Additional configuration fields for Registered Clusters with AWS compute setup:

AWS Region

Option to select the preferred region associated with the compute environment.

AWS Account

Option to select the registered AWS account ID associated with the compute environment.

DNS Name

Option to select the DNS name linked to the VPC endpoint for Gathr.

EMR Cluster Config

Select a saved EMR cluster configuration from the list, or create one with the Add New Config for EMR Cluster option.

For more details on how to save EMR cluster configurations in Gathr, see EMR Cluster Configuration →

The application will be deployed on the EMR cluster using the configuration selected in this field.


After providing the deployment preferences, continue with the rest of the pipeline definition configuration.

Skip validating connections

Enable this option to skip connection validation before the application starts.


Auto Restart on Failure

Enable or disable automatic restart of failed streaming ETL applications.

If Auto Restart on Failure is enabled, the following additional fields are displayed:

Max Restart Count

Specify the maximum number of times the streaming ETL application should be restarted if it fails to run.

Wait time before Restart

Specify the time (in minutes) to wait before the pipeline attempts an auto-restart.

Pending Restart Attempts

Specify the total number of restart attempts that remain pending for the application.

If Auto Restart on Failure is disabled, proceed by updating the following fields.


Enable Monitoring Graph

Select the checkbox to enable monitoring graphs for the application.

Components

Select the components for which monitoring graphs need to be enabled.


Store Raw Data in Error Logs

Enable this option to capture the raw data of corrupt records in the error logs, along with the error message.


Error Handler

If this option is disabled, the error monitoring graphs will not be visible.

Error Log Target

Select the target where you want to move the data that failed to process in the application.


Status Alert

Select the checkbox to enable the status alert option.

Target Status

An alert is triggered whenever the status of the application changes to Active, Starting, Stopped, or Error, as per the selection(s) made in the Target Status field.

Status Alert Target

By default, the Kafka component is supported as a target for status alerts.

Connection

Select a connection from the drop-down list of saved connections. To know more about creating connections, see Manage Connections.

Topic Name

Enter the Kafka topic to which the alert message should be sent.

Partitions

Enter the number of partitions to create for the Kafka topic.

Replication Factor

The number of replicas to create for the Kafka topic, for stronger durability and higher availability.
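
Partitions and replication factor here carry the same meaning as Kafka's own topic settings. As an illustration only (the topic name, counts, and broker address below are placeholders), a topic with equivalent settings could be created with Kafka's CLI:

# create a 3-partition topic with replication factor 2 (placeholder values)
kafka-topics.sh --create --topic status-alerts --partitions 3 --replication-factor 2 --bootstrap-server broker-host:9092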


Create Version

This option is visible when an existing ETL application is edited and updated. It creates a new version of the pipeline. The current version is called the Working Copy, and each saved version is numbered incrementally (n+1).


Tags

Option to assign customized tags to the application for better organization and filtering.


Description

Option to write notes specific to the ETL application.


Add Detailed Notes

A modal window opens for the user to add notes.


More Configurations

Additional configurations for ETL deployment

Configure Email

Enable this option to receive email notifications when an application is stopped or has failed.

Provide Comma-Separated Email Ids
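
For example: user1@example.com,user2@example.com (placeholder addresses).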

Log Level

Controls the verbosity of the logs generated by the application. Each level includes messages of that severity and all more severe levels:

Trace: the most verbose level; shows trace, debug, info, warn, and error messages.

Debug: shows debug, info, warn, and error messages.

Info: shows info, warn, and error messages.

Warn: shows warn and error messages.

Error: the least verbose level; shows only error messages.

Driver Cores

Number of cores to use for the driver process.

Driver Memory

Amount of memory to use for the driver process.

Executor Cores

Number of cores to be used on each executor.

Executor Memory

Amount of memory to be used per executor process.

Dynamic Allocation Enabled

Option to enable dynamic allocation of executor instances.

Initial Executors

Set the initial number of executor instances to allocate when dynamic allocation is enabled.

Maximum Executors

Set the maximum number of executor instances allowed to be allocated dynamically.

Executor Instances

If dynamic allocation is not enabled, enter the number of executor instances.
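
These resource fields correspond to standard Spark configuration properties. As a rough sketch only, assuming Gathr passes them through to Spark at submission time (all values below are placeholders), the equivalent spark-submit configuration would be:

--conf spark.driver.cores=2
--conf spark.driver.memory=4g
--conf spark.executor.cores=4
--conf spark.executor.memory=8g
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.maxExecutors=10

With dynamic allocation disabled, a fixed executor count would be set instead, for example --conf spark.executor.instances=5.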

Extra Driver Java Options

A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.

For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Executor Java Options

A string of extra JVM options to pass to executors. For instance, GC settings or other logging.

For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Spark Submit Options

The configuration provided in this field will be passed to Spark when the job is submitted.

Please ensure that the configuration follows the exact format shown:

--conf <key>=<value>

You can also configure extra Java options for both the Spark driver and executor in this field. It’s recommended to enclose these parameters in double quotes to avoid any potential issues, like application failure.

For example: Setting Spark Driver and Executor Extra Java Options

--conf "spark.driver.extraJavaOptions=-Duser.timezone=GMT -Dengine=spark" 
--conf "spark.executor.extraJavaOptions=-Duser.timezone=GMT -Dengine=spark" 
--conf spark.sql.session.timeZone=UTC 
--conf spark.sql.parquet.int96RebaseModeInRead=CORRECTED

Save and exit

Once the pipeline deployment configurations are set, save and exit the Pipeline Definition page.
