Create Data Validation Jobs
Steps to Add Validation
Initiate Data Validation: Create a data validation application using the “Add Validation” button on the listing page.
Add Multiple Validations: Click the “+ Add” button to incorporate multiple validations within a single job.
Name and Describe: Assign a unique name to the data validation job and optionally provide a description.
Project Restriction: Data validation jobs created in one project are not accessible in others.
Configuring Entity Types
Entity Selection: Choose from Data Sources, Ingestion, ETL, and Data Assets as the entity types for validation, then select the specific entity that will act as the data source.
Pipeline Integration: Pipelines within the project are available as entities. Select a pipeline to list its channel and emitter.
Runtime Configuration: Configure entities (source and target) at runtime.
Data Source Customization: Manually configure Data Sources if no pre-configured entity is available.
Project-Specific Data Assets: Data Assets with project scope are exclusive to their respective projects.
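The selected entities ultimately resolve to two tabular datasets that the validation strategies compare. The sketch below is illustrative only and is not Gathr's API: it loads a hypothetical source and target as Spark DataFrames, and both paths are placeholders.

```python
# Illustrative only, not Gathr's API: once the source and target entities are
# resolved, each behaves like a tabular dataset. Both paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entity-loading-sketch").getOrCreate()

# Entity 1: a pre-configured data asset, assumed to be stored as Parquet on HDFS.
entity1 = spark.read.parquet("hdfs:///data/assets/orders")            # placeholder path

# Entity 2: the output of a pipeline emitter, assumed to land in S3 as Parquet.
entity2 = spark.read.parquet("s3a://example-bucket/pipeline/orders")  # placeholder path
```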
Validation Criteria
Add Validation Criteria: Click “+ Add Validate Criteria” to set up filtering and validation strategies.
Filter Criteria
Column Filtering: Select columns to filter from a dropdown list and click “VALIDATE CRITERIA”.
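As a rough illustration of what a filter criterion does (this is not Gathr's implementation), the sketch below restricts a DataFrame to the selected columns and rows; the column names and the predicate are hypothetical.

```python
# Illustrative sketch of a filter criterion: keep only the rows and columns
# that should take part in the validation. Names and the predicate are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-criteria-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 250.0), (3, "US", 75.0)],
    ["order_id", "country", "amount"],
)

# Apply the row filter first, then keep only the columns selected for validation.
filtered = df.where(df.country == "US").select("order_id", "amount")
filtered.show()
```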
Schema Viewing and Strategy Application
View Schemas: Access the schema for both Entity 1 and Entity 2.
Apply Validation Strategy: Select the type of validation strategy to apply.
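Conceptually, viewing the two schemas amounts to inspecting the column names and types of both entities. The sketch below is illustrative only, with hypothetical columns, and shows one way to surface a mismatch that the mapping step would have to resolve.

```python
# Illustrative only: inspect the schemas of Entity 1 and Entity 2 and list
# columns that exist in one but not the other. Column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-view-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1, "a")], ["id", "name"])
entity2 = spark.createDataFrame([(1, "a", 3.5)], ["id", "name", "score"])

entity1.printSchema()
entity2.printSchema()

# Columns present in only one entity typically drive the mapping decisions.
print(set(entity1.columns) ^ set(entity2.columns))   # {'score'}
```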
Validation Strategies
Count
Record Count Comparison: Compare record counts between entities. The validation passes if counts match.
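A minimal sketch of the idea behind the Count strategy, assuming two already-loaded DataFrames stand in for the entities (this is not Gathr's implementation):

```python
# Count strategy sketch: the validation passes when both entities contain the
# same number of records. DataFrames and data are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-strategy-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
entity2 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

passed = entity1.count() == entity2.count()
print("Count validation passed:", passed)
```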
Profile
Profile Metrics: View and compare aggregated metrics (e.g., Min, Max, Avg) across mapped columns.
Profile Options: Choose from “None”, “Basic”, or “Advanced” depending on profiling needs.
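The sketch below illustrates the kind of aggregation behind profile metrics, assuming a single mapped numeric column; it is not Gathr's implementation, and the column name and data are hypothetical.

```python
# Profile strategy sketch: compute Min, Max and Avg for a mapped column on both
# entities and compare the results. All names and values are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profile-strategy-sketch").getOrCreate()

entity1 = spark.createDataFrame([(10.0,), (20.0,), (30.0,)], ["amount"])
entity2 = spark.createDataFrame([(10.0,), (20.0,), (30.0,)], ["amount"])

def profile(df, col):
    # Basic profile metrics for a single mapped column.
    return df.agg(
        F.min(col).alias("min"),
        F.max(col).alias("max"),
        F.avg(col).alias("avg"),
    ).first().asDict()

print(profile(entity1, "amount"))
print(profile(entity2, "amount"))
print("Profiles match:", profile(entity1, "amount") == profile(entity2, "amount"))
```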
Capture Difference
Record Value Comparison: Identify and capture differences in individual records.
Difference Evaluation: Capture records that exist only in one entity, as well as duplicate records.
Storage Options: Choose from “None”, “Count”, or “Count with Data” to control what is stored.
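A rough sketch of the idea behind Capture Difference (not Gathr's implementation), assuming the two entities are available as DataFrames with hypothetical data:

```python
# Capture Difference sketch: find records that exist in one entity but not the
# other, keeping duplicate rows distinct via exceptAll. Data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("capture-difference-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["id", "name"])
entity2 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

only_in_entity1 = entity1.exceptAll(entity2)   # rows (including duplicates) missing from Entity 2
only_in_entity2 = entity2.exceptAll(entity1)   # rows missing from Entity 1

# A "Count" storage option would persist only these numbers; "Count with Data"
# would persist the differing rows as well.
print(only_in_entity1.count(), only_in_entity2.count())
```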
Data Storage Options
Store Type
HDFS Storage: Configure options for storing differences in HDFS, including connection name, path, and compression type.
S3 Storage: Configure options for S3 storage, including connection name, bucket name, and path.
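To make the storage options concrete, here is a hedged sketch of persisting captured differences to HDFS and S3 with a compression codec; the paths, bucket name, and codecs are placeholders, not values Gathr prescribes.

```python
# Hypothetical sketch of writing captured differences to HDFS and S3.
# Connection handling, paths, bucket and codecs are all placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("store-differences-sketch").getOrCreate()

diff = spark.createDataFrame([(2, "b")], ["id", "name"])

# HDFS target with a compression type
diff.write.mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("hdfs:///validation/differences/run_001")              # placeholder path

# S3 target (bucket name and path are placeholders)
diff.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3a://example-validation-bucket/differences/run_001")
```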
Schema Mapping
Column Mapping: Map schema columns for both entities. Use auto-map when column names are identical, or map columns manually by dragging and dropping (see the sketch below).
Save and Complete: Click “Done” and save to finalize the job.
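The sketch below illustrates what a manual column mapping amounts to, assuming hypothetical column names; it is not Gathr's auto-map feature, only an analogy in PySpark.

```python
# Schema mapping sketch: rename Entity 2's columns so they line up with
# Entity 1 before comparison. The mapping and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-mapping-sketch").getOrCreate()

entity1 = spark.createDataFrame([(1, 100.0)], ["order_id", "amount"])
entity2 = spark.createDataFrame([(1, 100.0)], ["id", "total"])

# Manual mapping: Entity 2 column -> Entity 1 column
column_mapping = {"id": "order_id", "total": "amount"}

entity2_mapped = entity2
for source_col, target_col in column_mapping.items():
    entity2_mapped = entity2_mapped.withColumnRenamed(source_col, target_col)

entity2_mapped.printSchema()   # column names now match Entity 1
```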
Application Configuration
Naming and Deployment
Application Name: Assign a unique name to the application.
Deployment Mode: Choose between “Local Mode” and “Registered Cluster” for deployment.
Runtime and Error Handling
Runtime Account: Select the account and associated name for runtime.
Connection Validation: Optionally skip connection validation before starting the application.
Auto Restart: Configure automatic restart on failure with specified retry count and wait time.
Error Logging: Enable options to capture raw data in error logs and configure error handlers.
Error Log Target: Choose between RabbitMQ and Kafka for error log targets and configure accordingly.
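For context, the sketch below shows what publishing an error-log record to Kafka could look like from a client's perspective, using the kafka-python library; the broker address, topic name, and payload fields are hypothetical and are not Gathr settings.

```python
# Hypothetical error-log record published to Kafka with kafka-python.
# Broker, topic and payload fields are placeholders for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                    # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

error_record = {
    "job": "orders_validation",                            # hypothetical job name
    "error": "record count mismatch",
    "raw_data": {"order_id": 42},                          # present only if raw-data capture is enabled
}

producer.send("validation-error-logs", error_record)       # placeholder topic
producer.flush()
```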
Alerts and Notifications
Status Alerts: Enable status alerts to send notifications via Kafka when the pipeline status changes (see the consumer sketch below).
Email Notifications: Configure email notifications for application failures and provide recipient emails.
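Since status alerts are delivered over Kafka, a downstream service could consume them as in the following sketch; the topic, broker, and message fields are assumptions for illustration, not Gathr's actual alert schema.

```python
# Hypothetical consumer for pipeline status alerts delivered via Kafka.
# Topic, broker address and message fields are illustrative placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pipeline-status-alerts",                              # placeholder topic
    bootstrap_servers="localhost:9092",                    # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    alert = message.value
    print(alert.get("pipeline"), alert.get("status"))      # e.g. "orders_job", "FAILED"
```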
Advanced Configurations
Logging and Resource Allocation: Set log levels, driver and executor cores, memory, and Java options.
Spark Submit Options: Include additional Spark submit options as needed.
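For orientation, the resource-related settings correspond to standard Spark configuration properties. The sketch below shows equivalent properties set on a SparkSession; the values are placeholders, and in Gathr these settings are supplied through the UI rather than code.

```python
# Sketch of the Spark properties behind the resource and logging settings.
# All values are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("advanced-config-sketch")
    .config("spark.driver.cores", "2")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")   # example Java option
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")   # log level
```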
Finalize and Save
Save Configuration: Ensure all configurations are complete, then click “SAVE” to finalize the data validation job.