Create Data Validation Jobs

Steps to Add Validation

  1. Initiate Data Validation: Create a data validation job using the “Add Validation” button on the listing page.

  2. Add Multiple Validations: Click the “+ Add” button to incorporate multiple validations within a single job.

  3. Name and Describe: Assign a unique name to the data validation job and optionally provide a description.

  4. Project Restriction: Data validation jobs created in one project are not accessible in others.

Configuring Entity Types

  • Entity Selection: Choose from Data Sources, Ingestion, ETL, or Data Assets as the entity type for validation. Select the specific entity to act as a data source.

  • Pipeline Integration: Pipelines within the project are available as entities. Select a pipeline to list its channel and emitter.

  • Runtime Configuration: Configure entities (source and target) at runtime.

  • Data Source Customization: Manually configure Data Sources if no pre-configured entity is available.

  • Project-Specific Data Assets: Data Assets with project scope are exclusive to their respective projects.

Validation Criteria

  • Add Validation Criteria: Click “+ Add Validate Criteria” to set up filtering and validation strategies.

Filter Criteria

  • Column Filtering: Select columns to filter from a dropdown list and click “VALIDATE CRITERIA”.
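
Conceptually, the filter criteria narrow each entity to the selected columns (and any row-level conditions) before a validation strategy runs. A minimal PySpark sketch of that idea; the entities, column names, and condition below are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("filter-criteria-sketch").getOrCreate()

    # Made-up stand-ins for Entity 1 and Entity 2.
    entity1 = spark.createDataFrame(
        [(1, "shipped", 120.0), (2, "open", 75.5)], ["order_id", "status", "amount"])
    entity2 = spark.createDataFrame(
        [(1, "shipped", 120.0), (2, "open", 80.0)], ["order_id", "status", "amount"])

    # Restrict both sides to the columns chosen for validation,
    # after applying a hypothetical row-level filter condition.
    cols = ["order_id", "amount"]
    entity1_filtered = entity1.filter(entity1.status == "shipped").select(*cols)
    entity2_filtered = entity2.filter(entity2.status == "shipped").select(*cols)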

Schema Viewing and Strategy Application

  • View Schemas: Access the schema for both Entity 1 and Entity 2.

  • Apply Validation Strategy: Select the type of validation strategy to apply.

Validation Strategies

Count

  • Record Count Comparison: Compare record counts between entities. The validation passes if counts match.
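
In Spark terms, the Count strategy reduces to a row-count comparison. A minimal PySpark sketch of the same check; the tables are made-up stand-ins, not the product's internals:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-strategy-sketch").getOrCreate()

    # Made-up stand-ins for Entity 1 and Entity 2.
    entity1 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
    entity2 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

    # The validation passes only when both record counts match.
    passed = entity1.count() == entity2.count()
    print("count validation:", "PASSED" if passed else "FAILED")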

Profile

  • Profile Metrics: View and compare aggregated metrics (e.g., Min, Max, Avg) across mapped columns.

  • Profile Options: Choose “None”, “Basic”, or “Advanced” according to your profiling needs.
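
As an illustration, profiling amounts to computing the same aggregates on each side of a mapped column pair and comparing them metric by metric. A PySpark sketch with a hypothetical amount column:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("profile-strategy-sketch").getOrCreate()

    entity1 = spark.createDataFrame([(10.0,), (20.0,), (30.0,)], ["amount"])
    entity2 = spark.createDataFrame([(10.0,), (20.0,), (31.0,)], ["amount"])

    # Compute the same metrics for one mapped column on each side.
    def profile(df, col):
        return df.agg(F.min(col).alias("min"),
                      F.max(col).alias("max"),
                      F.avg(col).alias("avg")).first()

    p1, p2 = profile(entity1, "amount"), profile(entity2, "amount")
    print("entity1:", p1)  # Row(min=10.0, max=30.0, avg=20.0)
    print("entity2:", p2)  # Row(min=10.0, max=31.0, avg=20.33...)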

Capture Difference

  • Record Value Comparison: Identify and capture differences in individual records.

  • Difference Evaluation: Capture records that are unique to either entity, as well as duplicate records.

  • Storage Options: Choose “None”, “Count”, or “Count with Data” to control how captured differences are stored.
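
Conceptually, capturing differences resembles a two-way set difference that preserves duplicate rows. A PySpark sketch using exceptAll; the entities and values are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("capture-difference-sketch").getOrCreate()

    entity1 = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["id", "val"])
    entity2 = spark.createDataFrame([(1, "a"), (3, "c")], ["id", "val"])

    # Records on one side but not the other; exceptAll keeps duplicate
    # rows, so repeated records also surface as differences.
    only_in_1 = entity1.exceptAll(entity2)  # (2, "b") twice
    only_in_2 = entity2.exceptAll(entity1)  # (3, "c")
    print(only_in_1.count(), "records only in Entity 1")
    print(only_in_2.count(), "records only in Entity 2")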

Data Storage Options

Store Type

  • HDFS Storage: Configure options for storing differences in HDFS, including connection name, path, and compression type.

  • S3 Storage: Configure options for S3 storage, including connection name, bucket name, and path.
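
The write itself is a compressed dump of the captured differences at the configured location. A hedged PySpark sketch; the paths, bucket, and compression type are placeholders standing in for what the platform assembles from the connection settings:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("store-differences-sketch").getOrCreate()

    # Stand-in for the differences captured by the previous strategy.
    diffs = spark.createDataFrame([(2, "b")], ["id", "val"])

    # HDFS target (requires a reachable HDFS; path is a placeholder).
    diffs.write.mode("overwrite").option("compression", "gzip") \
        .parquet("hdfs:///validation/diffs")

    # S3 target (requires S3 credentials; bucket and path are placeholders).
    diffs.write.mode("overwrite").option("compression", "gzip") \
        .parquet("s3a://example-bucket/validation/diffs")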

Schema Mapping

  • Column Mapping: Map schema columns for both entities. Use auto-map if column names are identical, or map columns manually by drag and drop (see the sketch after this list).

  • Save and Complete: Click “Done” and save to finalize the job.
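
Auto-map pairs columns whose names match exactly; a manual mapping is equivalent to renaming one side so the two schemas line up. A small PySpark sketch with hypothetical column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-mapping-sketch").getOrCreate()

    entity1 = spark.createDataFrame([(1, 9.5)], ["order_id", "amount"])
    entity2 = spark.createDataFrame([(1, 9.5)], ["id", "total"])

    # Hypothetical manual mapping: Entity 2 column -> Entity 1 column.
    mapping = {"id": "order_id", "total": "amount"}
    for src, dst in mapping.items():
        entity2 = entity2.withColumnRenamed(src, dst)

    assert entity1.columns == entity2.columns  # schemas now line up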

Application Configuration

Naming and Deployment

  • Application Name: Assign a unique name to the application.

  • Deployment Mode: Choose between “Local Mode” and “Registered Cluster” for deployment.

Runtime and Error Handling

  • Runtime Account: Select the account and associated name for runtime.

  • Connection Validation: Optionally skip connection validation before starting the application.

  • Auto Restart: Configure automatic restart on failure with specified retry count and wait time.

  • Error Logging: Enable options to capture raw data in error logs and configure error handlers.

  • Error Log Target: Choose between RabbitMQ and Kafka for error log targets and configure accordingly.
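
For intuition only, an error-log event published to Kafka might look like this kafka-python sketch; the broker, topic, and payload fields are placeholders, not the platform's actual wire format:

    import json
    from kafka import KafkaProducer

    # Placeholder broker and topic; in practice these come from the
    # configured error-log target.
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    # With raw-data capture enabled, the failing record travels with
    # the error event (field names here are hypothetical).
    producer.send("error-logs", {"job": "my-validation",
                                 "error": "type mismatch",
                                 "raw": {"order_id": "x17"}})
    producer.flush()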

Alerts and Notifications

  • Status Alerts: Enable status alerts to send notifications via Kafka for pipeline status changes.

  • Email Notifications: Configure email notifications for application failures and provide recipient emails.

Advanced Configurations

  • Logging and Resource Allocation: Set log levels, driver and executor cores, memory, and Java options.

  • Spark Submit Options: Include additional Spark submit options as needed.
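
These advanced settings correspond to standard Spark properties. A hedged sketch of the equivalent SparkSession configuration; the values are examples, not recommendations:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("data-validation-job")
             .master("local[*]")  # "Local Mode"; a registered cluster supplies its own master URL
             .config("spark.driver.cores", "2")
             .config("spark.driver.memory", "2g")
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "4g")
             .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
             .getOrCreate())
    spark.sparkContext.setLogLevel("WARN")  # log level

Additional spark-submit options map onto further .config(...) entries, or onto the matching command-line flags when the job is submitted to a cluster.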

Finalize and Save

  • Save Configuration: Ensure all configurations are complete, then click “SAVE” to finalize the data validation job.