Create Data Validation Jobs
Steps to Add Validation
Initiate Data Validation: Create a data validation application using the “Add Validation” button on the listing page.
Add Multiple Validations: Click the “+ Add” button to incorporate multiple validations within a single job.
Name and Describe: Assign a unique name to the data validation job and optionally provide a description.
Project Restriction: Data validation jobs created in one project are not accessible in others.
Configuring Entity Types
Entity Selection: Choose from Data Sources, Ingestion, ETL, and Data Assets as the entity types for validation, then select the specific entity that will act as the data source.
Pipeline Integration: Pipelines within the project are available as entities. Select a pipeline to list its channel and emitter.
Runtime Configuration: Configure entities (source and target) at runtime.
Data Source Customization: Manually configure Data Sources if no pre-configured entity is available.
Project-Specific Data Assets: Data Assets with project scope are exclusive to their respective projects.
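The selected entities ultimately resolve to two tabular datasets that the validation strategies compare. The sketch below is illustrative only and is not Gathr's API: it loads a hypothetical source and target as Spark DataFrames, and both paths are placeholders.

```python
# Illustrative only, not Gathr's API: once the source and target entities are
# resolved, each behaves like a tabular dataset. Both paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entity-loading-sketch").getOrCreate()

# Entity 1: a pre-configured data asset, assumed to be stored as Parquet on HDFS.
entity1 = spark.read.parquet("hdfs:///data/assets/orders")            # placeholder path

# Entity 2: the output of a pipeline emitter, assumed to land in S3 as Parquet.
entity2 = spark.read.parquet("s3a://example-bucket/pipeline/orders")  # placeholder path
```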
Validation Criteria
Add Validation Criteria: Click “+ Add Validate Criteria” to set up filtering and validation strategies.
Filter Criteria
Column Filtering: Select columns to filter from a dropdown list and click “VALIDATE CRITERIA”.
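As a rough illustration of what a filter criterion does (this is not Gathr's implementation), the sketch below restricts a DataFrame to the selected columns and rows; the column names and the predicate are hypothetical.

```python
# Illustrative sketch of a filter criterion: keep only the rows and columns
# that should take part in the validation. Names and the predicate are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-criteria-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 250.0), (3, "US", 75.0)],
    ["order_id", "country", "amount"],
)

# Apply the row filter first, then keep only the columns selected for validation.
filtered = df.where(df.country == "US").select("order_id", "amount")
filtered.show()
```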
Schema Viewing and Strategy Application
View Schemas: Access the schema for both Entity 1 and Entity 2.
Apply Validation Strategy: Select the type of validation strategy to apply.
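Conceptually, viewing the two schemas amounts to inspecting the column names and types of both entities. The sketch below is illustrative only, with hypothetical columns, and shows one way to surface a mismatch that the mapping step would have to resolve.

```python
# Illustrative only: inspect the schemas of Entity 1 and Entity 2 and list
# columns that exist in one but not the other. Column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-view-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1, "a")], ["id", "name"])
entity2 = spark.createDataFrame([(1, "a", 3.5)], ["id", "name", "score"])

entity1.printSchema()
entity2.printSchema()

# Columns present in only one entity typically drive the mapping decisions.
print(set(entity1.columns) ^ set(entity2.columns))   # {'score'}
```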
Validation Strategies
Count
Record Count Comparison: Compare record counts between entities. The validation passes if counts match.
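A minimal sketch of the idea behind the Count strategy, assuming two already-loaded DataFrames stand in for the entities (this is not Gathr's implementation):

```python
# Count strategy sketch: the validation passes when both entities contain the
# same number of records. DataFrames and data are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-strategy-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
entity2 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

passed = entity1.count() == entity2.count()
print("Count validation passed:", passed)
```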
Profile
Profile Metrics: View and compare aggregated metrics (e.g., Min, Max, Avg) across mapped columns.
Profile Options: Choose from “None”, “Basic”, or “Advanced” depending on profiling needs.
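The sketch below illustrates the kind of aggregation behind profile metrics, assuming a single mapped numeric column; it is not Gathr's implementation, and the column name and data are hypothetical.

```python
# Profile strategy sketch: compute Min, Max and Avg for a mapped column on both
# entities and compare the results. All names and values are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profile-strategy-sketch").getOrCreate()

entity1 = spark.createDataFrame([(10.0,), (20.0,), (30.0,)], ["amount"])
entity2 = spark.createDataFrame([(10.0,), (20.0,), (30.0,)], ["amount"])

def profile(df, col):
    # Basic profile metrics for a single mapped column.
    return df.agg(
        F.min(col).alias("min"),
        F.max(col).alias("max"),
        F.avg(col).alias("avg"),
    ).first().asDict()

print(profile(entity1, "amount"))
print(profile(entity2, "amount"))
print("Profiles match:", profile(entity1, "amount") == profile(entity2, "amount"))
```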
Capture Difference
Record Value Comparison: Identify and capture differences in individual records.
Difference Evaluation: Capture records that exist only in one entity, as well as duplicate records.
Storage Options: Choose from “None”, “Count”, or “Count with Data” to control what is stored.
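A rough sketch of the idea behind Capture Difference (not Gathr's implementation), assuming the two entities are available as DataFrames with hypothetical data:

```python
# Capture Difference sketch: find records that exist in one entity but not the
# other, keeping duplicate rows distinct via exceptAll. Data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("capture-difference-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["id", "name"])
entity2 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

only_in_entity1 = entity1.exceptAll(entity2)   # rows (including duplicates) missing from Entity 2
only_in_entity2 = entity2.exceptAll(entity1)   # rows missing from Entity 1

# A "Count" storage option would persist only these numbers; "Count with Data"
# would persist the differing rows as well.
print(only_in_entity1.count(), only_in_entity2.count())
```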
Data Storage Options
Store Type
HDFS Storage: Configure options for storing differences in HDFS, including connection name, path, and compression type.
S3 Storage: Configure options for S3 storage, including connection name, bucket name, and path.
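To make the storage options concrete, here is a hedged sketch of persisting captured differences to HDFS and S3 with a compression codec; the paths, bucket name, and codecs are placeholders, not values Gathr prescribes.

```python
# Hypothetical sketch of writing captured differences to HDFS and S3.
# Connection handling, paths, bucket and codecs are all placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("store-differences-sketch").getOrCreate()

diff = spark.createDataFrame([(2, "b")], ["id", "name"])

# HDFS target with a compression type
diff.write.mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("hdfs:///validation/differences/run_001")              # placeholder path

# S3 target (bucket name and path are placeholders)
diff.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3a://example-validation-bucket/differences/run_001")
```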
Schema Mapping
Column Mapping: Map schema columns for both entities. Use auto-map when column names are identical, or map columns manually by dragging and dropping (see the sketch below).
Save and Complete: Click “Done” and save to finalize the job.
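The sketch below illustrates what a manual column mapping amounts to, assuming hypothetical column names; it is not Gathr's auto-map feature, only an analogy in PySpark.

```python
# Schema mapping sketch: rename Entity 2's columns so they line up with
# Entity 1 before comparison. The mapping and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-mapping-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1, 100.0)], ["order_id", "amount"])
entity2 = spark.createDataFrame([(1, 100.0)], ["id", "total"])

# Manual mapping: Entity 2 column -> Entity 1 column
column_mapping = {"id": "order_id", "total": "amount"}

entity2_mapped = entity2
for source_col, target_col in column_mapping.items():
    entity2_mapped = entity2_mapped.withColumnRenamed(source_col, target_col)

entity2_mapped.printSchema()   # column names now match Entity 1
```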
Application Configuration
Naming and Deployment
Application Name: Assign a unique name to the application.
Deployment Mode: Choose between “Local Mode” and “Registered Cluster” for deployment.
Runtime and Error Handling
Runtime Account: Select the account and associated name for runtime.
Connection Validation: Optionally skip connection validation before starting the application.
Auto Restart: Configure automatic restart on failure with specified retry count and wait time.
Error Logging: Enable options to capture raw data in error logs and configure error handlers.
Error Log Target: Choose between RabbitMQ and Kafka for error log targets and configure accordingly.
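For context, the sketch below shows what publishing an error-log record to Kafka could look like from a client's perspective, using the kafka-python library; the broker address, topic name, and payload fields are hypothetical and are not Gathr settings.

```python
# Hypothetical error-log record published to Kafka with kafka-python.
# Broker, topic and payload fields are placeholders for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                    # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

error_record = {
    "job": "orders_validation",                            # hypothetical job name
    "error": "record count mismatch",
    "raw_data": {"order_id": 42},                          # present only if raw-data capture is enabled
}

producer.send("validation-error-logs", error_record)       # placeholder topic
producer.flush()
```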
Alerts and Notifications
Status Alerts: Enable status alerts to send notifications via Kafka when the pipeline status changes (see the consumer sketch below).
Email Notifications: Configure email notifications for application failures and provide recipient emails.
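Since status alerts are delivered over Kafka, a downstream service could consume them as in the following sketch; the topic, broker, and message fields are assumptions for illustration, not Gathr's actual alert schema.

```python
# Hypothetical consumer for pipeline status alerts delivered via Kafka.
# Topic, broker address and message fields are illustrative placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pipeline-status-alerts",                              # placeholder topic
    bootstrap_servers="localhost:9092",                    # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    alert = message.value
    print(alert.get("pipeline"), alert.get("status"))      # e.g. "orders_job", "FAILED"
```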
Advanced Configurations
Logging and Resource Allocation: Set log levels, driver and executor cores, memory, and Java options.
Spark Submit Options: Include additional Spark submit options as needed.
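For orientation, the resource-related settings correspond to standard Spark configuration properties. The sketch below shows equivalent properties set on a SparkSession; the values are placeholders, and in Gathr these settings are supplied through the UI rather than code.

```python
# Sketch of the Spark properties behind the resource and logging settings.
# All values are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("advanced-config-sketch")
    .config("spark.driver.cores", "2")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")   # example Java option
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")   # log level
```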
Finalize and Save
Save Configuration: Ensure all configurations are complete, then click “SAVE” to finalize the data validation job.