Configure ETL Application
In this article
- Application Name
- Application Deployment
- Runtime Account and Name
- AWS Region
- AWS Account
- DNS Name
- EMR Cluster Config
- Skip validating connections
- Auto Restart on Failure
- Enable Monitoring Graph
- Store Raw Data in Error Logs
- Error Handler
- Status Alert
- Create Version
- Tags
- Description
- Add Detailed Notes
- More Configurations
- Save and exit
On the Pipeline Definition page, you can tailor and refine numerous settings for your ETL application.
These configurations are crucial for defining the behavior of your ETL application during runtime.
Application Name
Provide a unique name for the ETL application. This name is used to save and identify your pipeline. Application names must start with a letter and can contain alphanumeric characters and special characters such as !@$-;:()-_?=~/*<>’.
Application Deployment
Choose where your ETL application will run: on a cluster managed by Gathr or on a cluster registered by you.
- Local Mode: This is the default option. Your application runs on a cluster managed by Gathr, which takes care of the cluster infrastructure and ensures seamless execution of your applications.
- Registered Cluster: Select this option if you prefer to run your applications on clusters managed by you.
  - The prerequisite to using registered clusters for running applications is to establish a virtual private connection from the User Settings > Compute Setup tab.
  - To understand the steps for setting up PrivateLink connections, see Compute Setup →
Additional configuration fields for Registered Clusters with Apache Spark compute setup:
Runtime Account and Name
Select the Apache Spark account to be used for deploying the application.
Additional configuration fields for Registered Clusters with AWS compute setup:
AWS Region
Option to select the preferred region associated with the compute environment.
AWS Account
Option to select the registered AWS account ID associated with the compute environment.
DNS Name
Option to select the DNS name linked to the VPC endpoint for Gathr.
EMR Cluster Config
Select a saved EMR cluster configuration from the list, or create a new one with the Add New Config for EMR Cluster option.
For more details on how to save EMR cluster configurations in Gathr, see EMR Cluster Configuration →
The application will be deployed on the EMR cluster using the custom configuration that is selected from this field.
Continue with the pipeline definition configuration after providing the deployment preferences.
Skip validating connections
Enable this option to skip validating connections before the application starts.
Auto Restart on Failure
Enable/disable restarting of failed streaming ETL applications.
If Auto Restart on Failure is enabled for the ETL application deployment, additional fields will be displayed as given below:
Max Restart Count
Specify the maximum number of times the streaming ETL application should restart if it fails to run.
Wait time before Restart
Specify the wait time (in minutes) before the pipeline attempts to auto-restart.
Pending Restart Attempts
Provide the total number of pending restart attempts.
If Auto Restart on Failure is disabled, then proceed by updating the following fields.
Enable Monitoring Graph
Select the checkbox to enable monitoring graph.
Components
Select the components for which monitoring graphs need to be enabled.
Store Raw Data in Error Logs
Enable this option to capture raw data coming from corrupt records in error logs along with the error message.
Error Handler
If the Error Handler option is disabled, the error monitoring graphs will not be visible.
Error Log Target
Select the target where you want to move the data that failed to process in the application.
Status Alert
Select the check box to enable the status alert option.
Target Status
An alert is triggered whenever the status of the application changes to Active, Starting, Stopped, or Error, as per the selection(s) made in the Target Status field.
Status Alert Target
By default, the Kafka component is supported as a target for status alerts.
Connection
Select a connection name from the list of saved connections from the drop-down. To know more about creating connections, see Manage Connections.
Topic Name
Enter the Kafka topic on which the alert message should be sent.
Partitions
Enter the number of partitions to be made in the Kafka Topic.
Replication Factor
Number of replicas to be created for the Kafka topic, for stronger durability and higher availability.
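For reference, the Topic Name, Partitions, and Replication Factor fields correspond to standard Kafka topic settings. The following is a minimal sketch of an equivalent topic created with the Kafka CLI; the topic name, counts, and broker address are illustrative placeholders, not values used by Gathr:
kafka-topics.sh --create --topic etl-status-alerts --partitions 3 --replication-factor 2 --bootstrap-server broker-host:9092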
Create Version
This option is visible when an existing ETL application is edited and updated. It creates a new version of the pipeline. The current version is called the Working Copy, and each saved version is numbered incrementally (n+1).
Tags
Option to assign customized tags to the application for better organization and filtering.
Description
Option to write notes specific to the ETL application.
Add Detailed Notes
A modal window opens for the user to add notes.
More Configurations
Additional configurations for ETL deployment
Configure Email
Enable this option to receive notifications when an application stops or fails.
Provide Comma-Separated Email Ids
Log Level
It controls the logs generated by the application based on the selected log level.
- Trace: View information of trace log levels.
- Debug: View information of debug and trace log levels.
- Info: View information of trace, debug, and info log levels.
- Warn: View information of trace, debug, warn, and info log levels.
- Error: View information of trace, debug, warn, info, and error log levels.
Driver Cores
Number of cores to be used for the driver processes.
Driver Memory
Amount of memory to use for the driver processes.
Executor Cores
Number of cores to be used on each executor.
Executor Memory
Amount of memory to be used per executor process.
Dynamic Allocation Enabled
Option to enable dynamic allocation of executor instances.
Initial Executors
Set the initial number of executor instances to be allocated when dynamic allocation is enabled.
Maximum Executors
Set the maximum number of executor instances allowed to be allocated dynamically.
Executor Instances
If dynamic allocation is not enabled, enter the number of executor instances.
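For reference, the resource and allocation fields above map to standard Spark properties. A minimal sketch of the equivalent spark-submit configuration is shown below; the values are illustrative placeholders only:
--conf spark.driver.cores=2
--conf spark.driver.memory=4g
--conf spark.executor.cores=2
--conf spark.executor.memory=4g
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.maxExecutors=10
When dynamic allocation is not enabled, spark.executor.instances is the equivalent property for the Executor Instances field.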
Extra Driver Java Options
A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.
For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Extra Executor Java Options
A string of extra JVM options to pass to executors. For instance, GC settings or other logging.
For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Extra Spark Submit Options
The configuration provided in this field will be passed to Spark when the job is submitted.
Please ensure that the configuration follows the exact format shown:
--conf <key>=<value>
You can also configure extra Java options for both the Spark driver and executor in this field. It’s recommended to enclose these parameters in double quotes to avoid any potential issues, like application failure.
For Example: Setting Spark Driver and Executor Extra Java Options
--conf "spark.driver.extraJavaOptions=-Duser.timezone=GMT -Dengine=spark" 
--conf "spark.executor.extraJavaOptions=-Duser.timezone=GMT -Dengine=spark" 
--conf spark.sql.session.timeZone=UTC 
--conf spark.sql.parquet.int96RebaseModeInRead=CORRECTED
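For illustration only, the options provided in this field are passed along when the application is submitted to Spark. A hypothetical spark-submit fragment using the example above might look like the following; the master, deploy mode, and application artifact are placeholders, not the exact command that Gathr constructs:
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Duser.timezone=GMT -Dengine=spark" \
  --conf "spark.executor.extraJavaOptions=-Duser.timezone=GMT -Dengine=spark" \
  --conf spark.sql.session.timeZone=UTC \
  --conf spark.sql.parquet.int96RebaseModeInRead=CORRECTED \
  <application-jar>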
Save and exit
Once the pipeline deployment configurations are set, save and exit the Pipeline Definition page.