Create Jobs

Steps to Add Validation

  • Create a data validation application by clicking the Add Validation button on the listing page.

  • Multiple validations can be added within a job. To add further validations, click the + Add button.

  • Provide a unique name for the data validation job. Optionally, provide a description of the validation that is to be created.

  • Data Validation jobs that are created in one project cannot be accessed in any other project.

Entity Types

Entity 1 and Entity 2 are the data sources that can be configured for validation and comparison. The user can create a validation job by using Data Sources, Ingestion, ETL, or Data Assets. Click one of the options and select the entity that will be used as a data source for validation.

Pipelines that are created within the project are listed here as entities. By selecting a particular pipeline, its channels and emitters will be listed.

The user can configure the entities (source and target) at runtime.

Note: The Pipeline and Data Assets options show the list of pre-configured channels and emitters, referred to as entities.

Data Sources, on the other hand, are entities that the user needs to configure on their own. If the user does not have a pre-configured entity, they can opt for a Data Source.

The Data Assets that are created within the workspace or the project will be listed here as entities.

Note: The Data Assets with a project scope will not be listed as entities in any other project of a workspace for performing Data Validation jobs.

Configure and provide the details of the selected Entity 1.

Click on the + Add Validate Criteria button.

Filter Criteria

Select the columns that need to be filtered out from the drop-down list. Click VALIDATE CRITERIA.

View Both Entity’s Schema

Click the button to view both Entity 1 and Entity 2’s schema.

Apply Validation Strategy

Click this button to select the Validation Strategy Type.

Count

Upon checking this option, the record counts of both entities are compared. If the counts match, the validation will pass; otherwise it will fail.
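For orientation, a count comparison amounts to the check sketched below. This is a minimal PySpark sketch, not Gathr's implementation; the DataFrame names and source paths are placeholders.

```python
# Minimal PySpark sketch of a count-based validation; entity1_df and
# entity2_df are placeholder names for the two configured entities.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-validation-sketch").getOrCreate()

entity1_df = spark.read.parquet("/data/entity1")  # hypothetical source path
entity2_df = spark.read.parquet("/data/entity2")  # hypothetical source path

count1 = entity1_df.count()
count2 = entity2_df.count()

# The validation passes only when the record counts match exactly.
validation_passed = (count1 == count2)
print(f"Entity 1: {count1}, Entity 2: {count2}, passed: {validation_passed}")
```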

Profile

Option to view the aggregated profile stats of all the mapped columns. Compares aggregated metrics (for example: Min, Max, Avg, Distinct Count, depending upon the data type) of the mapped columns of both entities. If all the individual metric comparisons pass, the validation will pass; otherwise it will fail. The user needs to map the columns of both entities one-to-one for comparison.

Options available are:

None, Basic and Advanced

None

Select the None option if the Profile strategy is not to be applied.

Basic

Select the Basic option to view the aggregated profile stats of all the mapped columns.

Advanced

Select the Advanced option to compare advanced aggregated metrics of the mapped columns of both entities.
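As a rough sketch of what a profile comparison computes, the aggregated metrics of each mapped column pair could be derived and compared as below. The column mapping and DataFrames (entity1_df, entity2_df from the earlier sketch) are placeholders, and the exact metric set Gathr applies per data type may differ.

```python
# Minimal sketch of a profile-style comparison for numeric mapped columns.
# entity1_df / entity2_df and column_mapping are placeholders.
from pyspark.sql import functions as F

column_mapping = {"amount": "amount", "qty": "quantity"}  # hypothetical mapping

def profile(df, col):
    """Aggregated metrics for one column (Min, Max, Avg, Distinct Count)."""
    row = df.agg(
        F.min(col).alias("min"),
        F.max(col).alias("max"),
        F.avg(col).alias("avg"),
        F.countDistinct(col).alias("distinct_count"),
    ).collect()[0]
    return row.asDict()

# Validation passes only if every mapped column pair has identical metrics.
validation_passed = all(
    profile(entity1_df, c1) == profile(entity2_df, c2)
    for c1, c2 in column_mapping.items()
)
```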

Capture Difference

Option to compare individual record values of both entities. Differing records are captured and stored in the selected store type.

The difference is evaluated in two ways (see the sketch after the list below):

  • All records that are in Entity 1 but not in Entity 2.

  • All records that are in Entity 2 but not in Entity 1.

  • Duplicate records will also be considered as a difference.
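A minimal PySpark sketch of this two-way difference, reusing the placeholder DataFrames from the earlier sketches, could look as follows. exceptAll keeps duplicate rows, which is why extra duplicates also surface as differences.

```python
# Minimal sketch of capturing record-level differences; exceptAll preserves
# duplicates, so extra duplicate records also count as differences.
only_in_entity1 = entity1_df.exceptAll(entity2_df)  # in Entity 1, not Entity 2
only_in_entity2 = entity2_df.exceptAll(entity1_df)  # in Entity 2, not Entity 1

diff_count = only_in_entity1.count() + only_in_entity2.count()
validation_passed = (diff_count == 0)
```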

Options available are: None, Count and Count with Data

If Count with Data is selected, then the count along with data will be stored.

The below options will be available upon selecting Count with Data.

Store Type

A row-to-row value-based comparison captures the difference between records. The captured record differences are stored in either HDFS or S3.

Upon selecting HDFS as Store Type, the below options will be available:

Connection Name

Select the connection in which the schema has to be stored.

Path

File or directory path from where data is to be read. The path must end with * in case of a directory. Example: outdir/*. In case of incremental read, the exact directory path should be provided.

Compression Type

Select the compression type for the HDFS output from the drop-down list. Available options are: NONE, DEFLATE, GZIP, and BZIP2.
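For illustration only, the captured difference records could be persisted to an HDFS path with GZIP compression roughly as follows; the namenode address and path are placeholders, and Gathr derives the actual location and connection from the fields above.

```python
# Minimal sketch of storing captured differences in HDFS with GZIP
# compression; the namenode address and path are placeholders.
(only_in_entity1
    .write
    .mode("overwrite")
    .option("compression", "gzip")
    .csv("hdfs://namenode:8020/validation/diff/entity1_only"))
```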

Upon selecting S3 as Store Type, the below options will be available:

Connection Name

Select the connection in which the schema has to be stored.

Bucket Name

Select or enter the S3 bucket name.

Path

File or directory path from where data is to be read. The path must end with * in case of a directory. Example: outdir/*. In case of incremental read, the exact directory path should be provided.

Click Next for Schema Mapping.

Schema Mapping

When a profile is selected, column-wise metrics can be generated. The user can map the schema of the two entities (column to column). If all the columns are identical, the auto-map option will be active; otherwise, the user can drag and drop to map the columns of the two entities.
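Conceptually, the auto-map option pairs columns that share a name in both entities, along the lines of the sketch below (placeholder DataFrames again; remaining columns are mapped manually by drag and drop in the UI).

```python
# Minimal sketch of the auto-map idea: pair columns whose names are
# identical in both entities; the rest need manual mapping in the UI.
common = [c for c in entity1_df.columns if c in entity2_df.columns]
auto_mapping = {c: c for c in common}

unmapped_entity1 = [c for c in entity1_df.columns if c not in common]
unmapped_entity2 = [c for c in entity2_df.columns if c not in common]
```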

Click Done and save the job.

Save Data Validation

Application Name

Provide a unique name for the application.

Application Deployment

Select an application deployment mode. Available options are: Local Mode and Registered Cluster.

Local Mode

Deploys the application on the Gathr server. Not recommended for production environments.

Registered Cluster

Utilizes compute clusters from registered accounts in Gathr for application deployment.

Runtime Account and Name

Select the runtime account and associated name.

Skip validating connections

Check the option to skip validating the connections before starting the application.

Auto Restart on Failure

Automatically restart the ETL application on failure at runtime. Configure restart count and wait time between restarts.

Max Restart Count

Specify the maximum number of times the ETL application will automatically restart if it fails.

Wait time before Restart

Enter the time (in minutes) to wait before each automatic restart attempt.

Pending Restart Attempts

Displays the total number of restart attempts that are currently pending for the ETL application.

Store Raw Data in Error Logs

Enable this option to capture raw data coming from corrupt records in error logs along with the error message.

Error Handler

Check this option to enable the error handler. If this option is disabled, the error monitoring graphs will not be visible.

Error Log Target

Select the target to which data that has failed to process in the application should be moved. By default, application errors will be logged in the logger.

If RabbitMQ is selected as Error Log Target then provide the below details:

Connection

Select the connection or add a new connection where error logs will be saved.

Queue Name

Provide the name of the RabbitMQ queue where errors are to be published in case of exceptions.

Channels

Select the channel from the drop-down list.

Processors/Emitters

Select the processor/emitter from the drop-down list.

If Kafka is selected as the Error Log Target, then the below fields will be available:

Connection

Select the connection or add a new connection where error logs will be saved.

Channels

Select the channel from the drop-down list.

Processors/Emitters

Select the processor/emitter from the drop-down list.

Partitions

Enter the number of partitions to be made in the Kafka Topic.

Replication Factor

Number of replications to be created for the Kafka topic for stronger durability and higher availability.

Status Alert

Select the checkbox to enable the status alert option. By selecting this option, you can send an alert/message to a Kafka topic for any change in the pipeline status.

Target Status

An alert will be triggered whenever the status of the pipeline gets updated to Active, Starting, Stopped, or Error, as per the selection(s) made in the Target Status field.

Status Alert Target

By default, the Kafka component is supported as a target for status alerts.

Connection

Select a connection name from the drop-down list of saved connections.

Topic Name

Enter the Kafka topic to which the alert/message should be sent.

Partitions

Enter the number of partitions to be made in the Kafka Topic.

Replication Factor

Number of replications to be created for the Kafka topic for stronger durability and higher availability.
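Downstream, the status alerts can be consumed from the configured topic like any other Kafka messages. The sketch below uses kafka-python with a hypothetical topic name and broker address; the alert payload format is not specified here, so it is printed raw.

```python
# Minimal sketch of consuming status alerts; the topic name, broker
# address, and payload format are assumptions, not Gathr specifics.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pipeline-status-alerts",              # hypothetical topic name
    bootstrap_servers="broker-host:9092",  # hypothetical broker
)
for message in consumer:
    # Each message should describe a status change (Active, Starting,
    # Stopped, or Error); printed raw since the schema is not documented.
    print(message.value)
```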

Create Version

Allows you to create a version of the pipeline while updating it, or from the pipeline listing page using the pipeline ellipsis menu. This option is available when Version Control under SETUP is set to Gathr Metastore.

Tags

Provide tags for the application. (Optional)

Description

Option to add detailed notes. (Optional)

More Configurations

Configure Email

Enable this option to receive notifications in case an application stops or fails.

Provide comma separated email id’s

When email is configured, provide comma-separated email IDs.

Log Level

It controls the logs generated by the pipeline based on the selected log level.

Trace: View trace level logs.

Debug: View debug and trace level logs.

Info: View trace, debug, and info level logs.

Warn: View trace, debug, info, and warn level logs.

Error: View trace, debug, info, warn, and error level logs.

Driver Cores

Number of cores to be used for the driver processes.

Driver Memory

Amount of memory to use for the driver processes.

Executor Cores

Number of cores to be used on each executor.

Executor Memory

Amount of memory to be used per executor process.

Executor Instances

Enter the number of executor instances.

Extra Driver Java Options

A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Executor Java Options

A string of extra JVM options to pass to executors. For instance, GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Spark Submit Options

A string with the --conf option for passing all the above configurations to a Spark application. For example: --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'

Click SAVE.
