Create Jobs
In this article
- Steps to Add Validation
- Entity Types
- Save Data Validation
- Application Name
- Application Deployment
- Runtime Account and Name
- Skip validating connections
- Auto Restart on Failure
- Store Raw Data in Error Logs
- Error Handler
- Status Alert
- Create Version
- Tags
- Description
- More Configurations
- Configure Email
- Provide comma-separated email IDs
- Log Level
- Driver Cores
- Driver Memory
- Executor Cores
- Executor Memory
- Executor Instances
- Extra Driver Java Options
- Extra Executor Java Options
- Extra Spark Submit Options
Steps to Add Validation
- Create a data validation application by clicking the Add Validation button on the listing page.
- Multiple validations can be added within a single job. To add further validations, click the + Add button.
- Provide a unique name for the data validation job. A description of the validation being created can optionally be provided.
- Data validation jobs created in one project cannot be accessed from any other project.
Entity Types
Entity 1 and Entity 2 are the data sources that can be configured for validation and comparison. The user can create a validation job using Data Sources, Ingestion, ETL, or Data Assets. Click one of the options and select the entity that will be used as a data source for validation.
Pipelines created within the project are listed here as entities. Selecting a particular pipeline lists its channel and emitter.
The user can configure the entities (source and target) at runtime.
Note: Pipelines and Data Assets list pre-configured channels and emitters, referred to as entities, while Data Sources are entities that the user configures manually. If a pre-configured entity is not available, the user can opt for a Data Source.
Data Assets created within the workspace or the project will be listed here as entities.
Note: Data Assets with a project scope will not be listed as entities in any other project of the workspace for performing data validation jobs.
Configure and provide the details of selected Entity 1.
Click on the + Add Validate Criteria button.
Filter Criteria
Select the columns that need to be filtered out from the drop-down list. Click VALIDATE CRITERIA.
View Both Entity’s Schema
Click the button to view both Entity 1 and Entity 2’s schema.
Apply Validation Strategy
Click this button to select the Validation Strategy Type.
Count
Upon checking this option, the record counts of both entities are compared. If the counts match, the validation passes; otherwise it fails.
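Conceptually, the count check behaves like the following PySpark sketch; the DataFrames and paths are illustrative placeholders, not Gathr internals.

```python
# Minimal sketch of a count comparison between two entities, assuming they are
# available as Spark DataFrames. Paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-validation-sketch").getOrCreate()

entity1_df = spark.read.parquet("path/to/entity1")  # stand-in for Entity 1
entity2_df = spark.read.parquet("path/to/entity2")  # stand-in for Entity 2

count1, count2 = entity1_df.count(), entity2_df.count()
print(f"Entity 1: {count1}, Entity 2: {count2}, validation passed: {count1 == count2}")
```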
Profile
Option to view the aggregated profile stats of all the mapped columns. Compares aggregated metrics (for example: Min, Max, Avg, Distinct count, etc., depending upon the data type) of the mapped columns of both entities. If all the individual metric comparisons pass, the validation passes; otherwise it fails. The user needs to map the columns of both entities one-to-one for comparison. A sketch of this comparison is shown after the options below.
Options available are:
None, Basic and Advanced
None
Select the None option if profile stats are not required.
Basic
Select the Basic option to view the aggregated profile stats of all the mapped columns.
Advanced
Select the Advanced option to compare advanced aggregated metrics of the mapped columns of both entities.
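As a rough illustration of the Profile strategy, the sketch below computes a few aggregated metrics for one mapped column pair and compares them. The DataFrames, column names, and the particular set of metrics are assumptions for illustration, not the exact set Gathr computes.

```python
# Sketch of a profile-style comparison for one mapped column pair.
# entity1_df / entity2_df are the placeholder DataFrames from the earlier sketch.
from pyspark.sql import functions as F

def profile(df, col):
    # Aggregate a few representative metrics for a numeric column.
    row = df.agg(
        F.min(col).alias("min"),
        F.max(col).alias("max"),
        F.avg(col).alias("avg"),
        F.countDistinct(col).alias("distinct_count"),
    ).collect()[0]
    return row.asDict()

stats1 = profile(entity1_df, "amount")  # column from Entity 1 (hypothetical)
stats2 = profile(entity2_df, "amt")     # column mapped to it in Entity 2 (hypothetical)
passed = all(stats1[metric] == stats2[metric] for metric in stats1)
```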
Capture Difference
Option to compare individual record values of both entities. Differing records are captured and stored in the selected store type; a sketch of this comparison follows the list below.
The difference is evaluated in 2 ways:
- All records that are in Entity 1 but not in Entity 2. 
- All records that are in Entity 2 but not in Entity 1. 
- Duplicate records will also be considered a difference.
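As a sketch of this evaluation, assuming both entities are Spark DataFrames with matching schemas (names are placeholders), the two directions of the difference can be captured with exceptAll, which also treats duplicates as differences:

```python
# Records in Entity 1 that are not in Entity 2, and vice versa.
# exceptAll keeps duplicates, so duplicate records count toward the difference.
only_in_entity1 = entity1_df.exceptAll(entity2_df)
only_in_entity2 = entity2_df.exceptAll(entity1_df)

difference_count = only_in_entity1.count() + only_in_entity2.count()
# With "Count with Data", the differing records themselves would also be
# written out, e.g. to the HDFS or S3 location configured under Store Type.
```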
Options available are: None, Count and Count with Data
If Count with Data is selected, then the count along with data will be stored.
The below options will be available upon selecting Count with Data.
Store Type
A row-to-row, value-based comparison is used to capture the difference between records. The captured record differences are stored either in HDFS or S3.
Upon selecting HDFS as Store Type, the below options will be available:
Connection Name
Select the connection in which the schema has to be stored.
Path
File or directory path from which data is to be read. The path must end with * in the case of a directory, for example: outdir/*. In the case of an incremental read, the exact directory path should be provided.
Compression Type
Select the compression type for HDFS schema compression from the drop down list. Available options are: NONE, DEFLATE, GZIP, BZIP2.
Upon selecting S3 as Store Type, the below options will be available:
Connection Name
Select the connection in which the schema has to be stored.
Bucket Name
Select or enter the S3 bucket name.
Path
File or directory path from which data is to be read. The path must end with * in the case of a directory, for example: outdir/*. In the case of an incremental read, the exact directory path should be provided.
Click Next for Schema Mapping.
Schema Mapping
When a profile is selected, column-wise metrics can be generated. The user can map the schema of the two entities against each other, column by column. If all the columns are identical, the auto-map option will be active; otherwise, the user can drag and drop to map the columns of the two entities.
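For intuition, a one-to-one column mapping can be thought of as renaming the columns of one entity so that both sides line up before comparison; the mapping values below are made up for illustration.

```python
# Hypothetical mapping of Entity 2 column names to Entity 1 column names.
column_mapping = {"cust_id": "customer_id", "amt": "amount"}

entity2_aligned = entity2_df
for entity2_col, entity1_col in column_mapping.items():
    entity2_aligned = entity2_aligned.withColumnRenamed(entity2_col, entity1_col)
# entity2_aligned can now be compared column by column against entity1_df.
```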
Click Done and save the job.
Save Data Validation
Application Name
Provide a unique name for application.
Application Deployment
Select an application deployment mode. Available options are: Local Mode and Registered Cluster.
Local Mode
Deploys the application on the Gathr server. Not recommended for production environments.
Registered Cluster
Utilizes compute clusters from registered accounts in Gathr for application deployment.
Runtime Account and Name
Select the runtime account and associated name.
Skip validating connections
Check the option to skip validating the connections before starting the application.
Auto Restart on Failure
Automatically restart the ETL application on failure at runtime. Configure restart count and wait time between restarts.
Max Restart Count
Specify the maximum number of times the ETL application will automatically restart if it fails.
Wait time before Restart
Enter the time (in minutes) to wait before each automatic restart attempt.
Pending Restart Attempts
Displays the total number of restart attempts that are currently pending for the ETL application.
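The restart settings behave roughly as in the sketch below. This only illustrates the semantics of Max Restart Count and Wait time before Restart; run_application is a hypothetical stand-in, not Gathr's implementation.

```python
import time

max_restart_count = 3   # Max Restart Count
wait_minutes = 5        # Wait time before Restart

attempts_left = max_restart_count   # Pending Restart Attempts
while True:
    try:
        run_application()           # hypothetical application entry point
        break                       # application finished without failure
    except Exception:
        if attempts_left == 0:
            raise                   # no restart attempts remain
        attempts_left -= 1
        time.sleep(wait_minutes * 60)
```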
Store Raw Data in Error Logs
Enable this option to capture raw data coming from corrupt records in error logs along with the error message.
Error Handler
Check this option to enable the error handler. If this option is disabled, the error monitoring graphs will not be visible.
Error Log Target
Select the target to which data that has failed to process in the application should be moved. By default, application errors are logged in the logger.
If RabbitMQ is selected as Error Log Target then provide the below details:
Connection
Select the connection or add a new connection where error logs will be saved.
Queue Name
Provide the name of the RabbitMQ queue where errors are to be published in case of an exception.
Channels
Select the channel from the drop down list.
Processors/Emitters
Select the processor/emitter from the drop down list.
If Kafka is selected as the Error Log Target then the below field will be available:
Connection
Select the connection or add a new connection where error logs will be saved.
Channels
Select the channel from the drop down list.
Processors/Emitters
Select the processor/emitter from the drop down list.
Partitions
Enter the number of partitions to be made in the Kafka Topic.
Replication Factor
Number of replications to be created for the Kafka topic for stronger durability and higher availability.
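For reference, the Partitions and Replication Factor values correspond to standard Kafka topic settings. A minimal sketch using the kafka-python admin client is shown below; the broker address and topic name are placeholders, and Gathr manages the actual topic itself.

```python
# Sketch only: how partitions and replication factor map onto a Kafka topic.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="broker:9092")  # placeholder broker
admin.create_topics([
    NewTopic(
        name="app-error-logs",   # placeholder topic name
        num_partitions=3,        # Partitions
        replication_factor=2,    # Replication Factor
    )
])
```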
Status Alert
Select the checkbox to enable the status alert option. By selecting this option, you can send an alert/message to a Kafka topic for any change in the pipeline status.
Target Status
An alert will be triggered whenever the status of the pipeline changes to Active, Starting, Stopped, or Error, as per the selection(s) made in the Target Status field.
Status Alert Target
By default, the Kafka component is supported as a target for status alerts.
Connection
Select a connection name from the list of saved connections from the drop-down.
Topic Name
Enter a Kafka topic on which alert/message should be sent.
Partitions
Enter the number of partitions to be made in the Kafka Topic.
Replication Factor
Number of replications to be created for the Kafka topic for stronger durability and higher availability.
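A downstream consumer of the status alerts could look roughly like the sketch below, assuming the kafka-python client. The topic name, broker address, and message shape are assumptions, since the actual alert payload format is defined by Gathr.

```python
# Sketch of consuming pipeline status alerts from the configured Kafka topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pipeline-status-alerts",            # placeholder topic name
    bootstrap_servers="broker:9092",     # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    alert = message.value                # hypothetical payload shape
    if alert.get("status") in ("Stopped", "Error"):
        print("Pipeline status changed:", alert)
```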
Create Version
Allows you to create a version of the pipeline while updating it, or from the pipeline listing page using the pipeline ellipsis menu. This option is available when Version Control under SETUP is set to Gathr Metastore.
Tags
Provide tags for the application. (Optional)
Description
Option to add detailed notes. (Optional)
More Configurations
Configure Email
Enable this option to receive notifications if the application stops or fails.
Provide comma-separated email IDs
When email is configured, provide comma-separated email IDs.
Log Level
It controls the logs generated by the pipeline based on the selected log level.
- Trace: View information of trace log levels.
- Debug: View information of debug and trace log levels.
- Info: View information of trace, debug and info log levels.
- Warn: View information of trace, debug, warn and info log levels.
- Error: View information of trace, debug, warn, info and error log levels.
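For context only, the comparable control in a standalone Spark application is the SparkContext log level; whether Gathr uses this exact mechanism internally is not covered here.

```python
# Set the log level on an existing SparkSession (valid values include
# TRACE, DEBUG, INFO, WARN, ERROR).
spark.sparkContext.setLogLevel("WARN")
```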
Driver Cores
Number of cores to be used for the driver process.
Driver Memory
Amount of memory to be used for the driver process.
Executor Cores
Number of cores to be used on each executor.
Executor Memory
Amount of memory to be used per executor process.
Executor Instances
Enter the number of executor instances.
Extra Driver Java Options
A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Extra Executor Java Options
A string of extra JVM options to pass to executors. For instance, GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Extra Spark Submit Options
A string with --conf options for passing all of the above configurations to a Spark application. For example: --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
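For reference, the resource and JVM settings above correspond to standard Spark configuration properties. The sketch below shows the equivalent settings on a standalone SparkSession, with placeholder values.

```python
# Illustrative mapping of the form fields to Spark configuration properties.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("data-validation-job")                 # placeholder name
    .config("spark.driver.cores", "2")              # Driver Cores
    .config("spark.driver.memory", "4g")            # Driver Memory
    .config("spark.executor.cores", "2")            # Executor Cores
    .config("spark.executor.memory", "4g")          # Executor Memory
    .config("spark.executor.instances", "3")        # Executor Instances
    .config("spark.driver.extraJavaOptions",
            "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")   # Extra Driver Java Options
    .config("spark.executor.extraJavaOptions",
            "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")   # Extra Executor Java Options
    .getOrCreate()
)
```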
Click SAVE.