Data Validation Introduction

When you move data from one place to another, unexpected problems can occur, causing only part of the data to transfer. Even if the process finishes, there might still be differences between the original and new locations.

Data Validation helps you check whether the data in both places matches and shows any differences between what you expected and what you got.

As part of an ETL (Extract, Transform, Load) solution, Data Validation helps compare two sets of data.

In Gathr, we call these data sets “entities.”

You can set up and run validations to compare results.
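To make the idea of comparing two entities concrete, here is a minimal sketch in Python. It is not how Gathr implements validation; the function name `compare_entities`, the report keys, and the sample rows are all hypothetical, and each entity is assumed to be a simple list of row tuples. It shows the two kinds of checks discussed on this page: a count for each entity and the rows that differ between them.

```python
from collections import Counter


def compare_entities(entity_1, entity_2):
    """Summarize how two collections of rows (tuples) differ.

    Hypothetical helper for illustration only; duplicates are respected
    because rows are compared as multisets.
    """
    rows_1 = Counter(entity_1)
    rows_2 = Counter(entity_2)
    only_in_1 = rows_1 - rows_2          # rows with no counterpart in entity 2
    only_in_2 = rows_2 - rows_1          # rows with no counterpart in entity 1
    matched = sum((rows_1 & rows_2).values())
    return {
        "count_1": len(entity_1),
        "count_2": len(entity_2),
        "matched": matched,
        "only_in_entity_1": sorted(only_in_1.elements()),
        "only_in_entity_2": sorted(only_in_2.elements()),
    }


# Sample data (made up): one row changed and one row missing in the target.
source = [(1, "alice"), (2, "bob"), (3, "carol")]
target = [(1, "alice"), (2, "bobby")]
report = compare_entities(source, target)
```

In this sketch, a non-empty `only_in_entity_1` or `only_in_entity_2` list means the two entities do not match, even when the transfer itself finished without errors.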

Create Data Validation jobs in Gathr to map the data structure of the chosen entities (like Data Sources, Ingestion, ETL, and Data Assets) according to the validation method you want. You can view and run these jobs on the Data Validation listing page to get a detailed comparison report.

You can create several validations and group them into a single job.


Navigate to Data Validation from the left navigation menu.



Data Validation Listing page

You can view existing data validations on the Data Validation listing page in Gathr. The following options are available on this page:

Filter By

Option available to filter the data validation jobs by Name, Owner, and Entity.

Option available to search for a specific application using the search bar.

Sort By

Option available to sort the listed applications in ascending or descending order by Updated Date, Created Date, Owner, or Name.

Save User Preferences

Option available to save user preferences based on the choices selected in the Filter By or Sort By fields.

Refresh List

Option available to refresh the listed data validation jobs.

Add Validation

Click here to read more about how to add validation(s).

Clicking the ellipsis of a listed Data Validation job shows the following options:

Play

Option to start/play the data validation job created.

Stop

Option to stop the data validation job.


View

Option to view the details of the created application. Clicking this option opens the following tabs.

Detail

Under this tab, the detailed configuration of Entity 1 and Entity 2 is available, including Component Name, Entity Type, Path, Database/Table, Number Of Columns, Format, Connection Name, and Connection Type.

The details of the Validation Strategy selected and the description of the created application are available.

The other options available are:

  • Edit button to change the Data Validation Name.

  • Ellipsis menu with the Cluster Change, Edit, and Delete options for the application.

  • Profile button to start a profile run of the job.

  • CHANGE VALIDATION STRATEGY button to update the job validation.

Validation Status

Under this tab, details including Data Count, Difference Count, and Profile are available.

Data Count

The total data count of Entity 1 and Entity 2, with percentages, shown in a statistical graphic.

Difference Count

A comparison of the matched and mismatched schemas of Entity 1 and Entity 2, displayed in a statistical graphic.

Profile

The profile section displays the profile run record for each version of the data asset, along with its credit utilization details.

Profile details can be filtered (ALL, PASS, or Fail) by selecting the required value from the drop-down list.

The profile details can be downloaded locally by clicking the download button.

Validation Report

Under this tab, the Count and Difference Count are displayed.

Count

The schema count of Entity 1 and Entity 2, with percentages, shown in a statistical graphic after the profile is run.

Difference Count

A comparison of the matched and mismatched schemas of Entity 1 and Entity 2, displayed in a statistical graphic derived from each profile run.

History Tab

The History tab shows the data validation job history, including status, start time, end time, duration of the job run, vendor type details, and validation status.

Error Search Tab

The Error Search tab provides a range of information. The error search feature can identify erroneous data and pinpoint the causes of data validation profiling failures.

Additionally, it provides a full stack trace for any errors detected. Using the Error Search tab, you can easily see the distribution of errors over time for a data validation job.

This feature also allows you to pan and zoom into a specific time range or duration and display results as a histogram distribution.
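As a rough illustration of what a histogram distribution over a selected time range means, the sketch below groups error timestamps into fixed-width buckets. It is an assumption-laden example, not Gathr's implementation: the function `error_histogram`, the 60-minute default bucket width, and the sample timestamps are all hypothetical.

```python
from collections import Counter
from datetime import datetime


def error_histogram(timestamps, start, end, bucket_minutes=60):
    """Count errors per fixed-width bucket within the range [start, end).

    Returns a dict mapping bucket index (0 = first bucket after `start`)
    to the number of errors in that bucket. Empty buckets are omitted.
    """
    width_seconds = bucket_minutes * 60
    buckets = Counter()
    for ts in timestamps:
        if start <= ts < end:  # "zoom": ignore errors outside the range
            index = int((ts - start).total_seconds() // width_seconds)
            buckets[index] += 1
    return dict(sorted(buckets.items()))


# Sample error times (made up) for a single day.
errors = [
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 40),
    datetime(2024, 1, 1, 11, 15),
    datetime(2024, 1, 1, 13, 0),   # falls outside the selected range
]
histogram = error_histogram(
    errors,
    start=datetime(2024, 1, 1, 10, 0),
    end=datetime(2024, 1, 1, 13, 0),
)
```

Narrowing `start` and `end` corresponds to zooming into a shorter duration, and a smaller `bucket_minutes` gives a finer-grained distribution.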

Mark As Favorite

Option to mark/unmark the existing application as a favorite.

Edit

Option available to edit the job.

Cluster Change

Option available to change the cluster on which the data validation job is saved.

Clicking this option opens the Change Cluster window. Provide the following details as required.

Application Deployment

Select one of the options available.

Local: If the Local option is selected, the application is deployed on the local server of the Gathr application. This mode of deployment is not recommended for production environments.

Registered Cluster: If this option is selected, compute clusters from the accounts registered in Gathr are used for application deployment.

Runtime Account and Name: Select the runtime account and the associated name from the dropdown list.

Click the UPDATE button to save the changes.

Delete

Option to delete the existing data validation job. The application cannot be deleted while the data profile job is running.
