Data Validation Introduction
During data movement, unintentional business logic can be introduced or errors can occur, resulting in only partial movement of data.
In such scenarios, even though the process completes, there might still be discrepancies between the source and target stores.
Data Validation allows users to compare data stores and helps in understanding if there is any mismatch between the expected and actual records in the target.
Thus, as part of an ETL solution, Data Validation solves the problem of comparing two data sources.
In the Gathr application, these data sources are termed entities.
The user can configure the validations and execute them to see the comparative results.
Create Data Validation jobs in Gathr to map the schema of the selected entities (Data Sources, Ingestion, ETL and Data Assets) based on the desired validation strategy. View and run jobs on the Data Validation listing page to get a comprehensive comparative report.
The user can create multiple validations and group them into a single job.
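To make the idea of comparing two entities concrete, here is a minimal Python sketch. It is not Gathr's implementation; the function name `compare_entities`, the result keys, and the row-level multiset comparison are assumptions chosen for illustration only.

```python
from collections import Counter


def compare_entities(source_rows, target_rows):
    """Compare two collections of records (dicts) as multisets of rows.

    Hypothetical helper for illustration; it assumes every record value
    is hashable and that a row matches only if all of its fields match.
    """
    # Normalise each record into a hashable, order-independent form.
    source = Counter(tuple(sorted(row.items())) for row in source_rows)
    target = Counter(tuple(sorted(row.items())) for row in target_rows)

    matched = source & target               # rows present in both stores
    missing_in_target = source - target     # expected but not found
    unexpected_in_target = target - source  # found but not expected

    return {
        "source_count": sum(source.values()),
        "target_count": sum(target.values()),
        "matched": sum(matched.values()),
        "missing_in_target": sum(missing_in_target.values()),
        "unexpected_in_target": sum(unexpected_in_target.values()),
    }


if __name__ == "__main__":
    source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
    target = [{"id": 1, "v": "a"}, {"id": 2, "v": "x"}]
    print(compare_entities(source, target))
```

A partial data movement shows up here as a non-zero `missing_in_target`, while unintended transformations show up as rows counted under both `missing_in_target` and `unexpected_in_target`.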
Let’s look at the details of the functionality below:
Data Validation Listing page
The existing data validations can be viewed on the Data Validation listing page of Gathr. On this page you can find the following options:
Filter By
Option available to filter the data validation job by Name, Owner and Entity.
Search Bar
Option available to search for a specific application by using the search bar.
Sort By
Option available to sort the applications in ascending or descending order by Updated Date, Created Date, Owner, or Name.
Save User Preferences
Option available to save user preferences based on the choices selected in Filter By or Sort By fields.
Refresh List
Option available to refresh the listed data validation jobs.
Add Validation
Click here to read more about how to add validation(s).
Upon clicking the ellipses of a listed Data Validation job, the following options are available:
Play
Option to start/play the data validation job created.
Stop
Option to stop the data validation job.
Options available on the ellipses of data validation job
View
Option to view the details of the created application. Upon clicking this option, the following tabs are available.
Detail
Under this tab, the detailed configuration of Entity 1 and Entity 2 is available, including Component Name, Entity Type, Path, Database/Table, Number Of Columns, Format, Connection Name and Connection Type.
The details of the selected Validation Strategy and the description of the created application are also available.
The other options available are:
- Edit button to change the Data Validation Name.
- Ellipses to change the cluster, edit, or delete the application.
- Profile button to start a profile run of the job.
- CHANGE VALIDATION STRATEGY button to update the validation strategy of the job.
Validation Status
Under this tab various details including Count, Difference Count and Profile are available.
Data Count
The total data count of Entity 1 and Entity 2 and their percentages, displayed as a statistical graphic.
Difference Count
The comparison of matched and mismatched schema of Entity 1 and Entity 2, displayed as a statistical graphic.
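As a rough illustration of what a matched versus mismatched schema comparison involves, the sketch below compares two column-to-type mappings. It is a sketch under stated assumptions, not how Gathr computes its Difference Count: the function name `schema_difference`, the dictionary representation of a schema, and the rule that a column matches only when both name and type agree are all hypothetical.

```python
def schema_difference(schema_1, schema_2):
    """Split the columns of two schemas into matched and mismatched sets.

    Each schema is assumed to be a dict of column name -> type name.
    A column counts as matched only if it exists in both schemas with
    the same type; everything else is reported as mismatched.
    """
    columns = sorted(set(schema_1) | set(schema_2))
    matched = [
        col for col in columns
        if col in schema_1 and col in schema_2 and schema_1[col] == schema_2[col]
    ]
    mismatched = [col for col in columns if col not in matched]

    total = len(columns)
    matched_pct = 100.0 * len(matched) / total if total else 0.0
    return {
        "matched": matched,
        "mismatched": mismatched,
        "matched_pct": matched_pct,
        "mismatched_pct": 100.0 - matched_pct if total else 0.0,
    }


if __name__ == "__main__":
    entity_1 = {"id": "int", "name": "string", "amount": "double"}
    entity_2 = {"id": "int", "name": "string", "amount": "decimal", "region": "string"}
    print(schema_difference(entity_1, entity_2))
```

The two percentages are the kind of values a matched/mismatched chart could be drawn from.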
Profile
The profile section displays the profile run record for each version of the data asset along with its credit utilization details.
The profile details can be filtered (ALL, PASS or FAIL) by selecting the required option from the drop-down list.
The profile details can be downloaded locally by clicking the download button.
Validation Report
Under this tab the Count and Difference Count are displayed.
Count
The schema count of Entity 1 and Entity 2 and their percentages, displayed as a statistical graphic upon running the profile.
Difference Count
The comparison of matched and mismatched schema of Entity 1 and Entity 2, displayed as a statistical graphic derived from each profile run.
History Tab
The history tab reflects the data validation job history including status, start time, end time, duration of job run, vendor type details and validation status.
Error Search Tab
The Error Search tab provides a range of information. The error search feature can identify erroneous data and pinpoint the causes of data validation profiling failures.
Additionally, it provides a full stack trace for each error detected. Using the Error Search tab, you can easily see the distribution of errors over time for a data validation job.
This feature also allows you to pan and zoom into a specific time range or duration and display results as a histogram distribution.
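The idea of restricting errors to a time range and showing them as a histogram can be sketched as follows. This is a generic illustration, not the Error Search implementation; the function name `error_histogram` and its parameters are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def error_histogram(timestamps, start, end, bucket):
    """Count error timestamps per fixed-width bucket within [start, end).

    Returns a dict mapping each bucket's start time to the number of
    errors that fall inside it; empty buckets are omitted.
    """
    counts = Counter()
    for ts in timestamps:
        if start <= ts < end:  # "zoom" into the selected time range
            index = (ts - start) // bucket  # timedelta // timedelta -> int
            counts[start + index * bucket] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    errors = [
        datetime(2024, 1, 1, 10, 5),
        datetime(2024, 1, 1, 10, 20),
        datetime(2024, 1, 1, 10, 50),
        datetime(2024, 1, 1, 11, 10),
        datetime(2024, 1, 1, 12, 30),  # outside the selected range
    ]
    histogram = error_histogram(
        errors,
        start=datetime(2024, 1, 1, 10, 0),
        end=datetime(2024, 1, 1, 12, 0),
        bucket=timedelta(hours=1),
    )
    for bucket_start, count in histogram.items():
        print(bucket_start.isoformat(), count)
```

Narrowing `start` and `end` corresponds to zooming, and a smaller `bucket` gives a finer-grained distribution.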
Mark As Favorite
Option to mark/unmark the existing application as favorite.
Edit
Option available to edit the job.
Cluster Change
Option available to change the cluster on which the data validation job is saved.
Upon clicking this option, the Change Cluster window opens. Provide the following details as required.
Application Deployment
Select one of the options available.
Local: If the Local option is selected, the application is deployed on the local server of the Gathr application. However, this deployment mode is not recommended for production environments.
Registered Cluster: If this option is selected, the compute clusters from the accounts registered in Gathr are utilized for application deployment.
Runtime Account and Name: Select the runtime account and the associated name from the dropdown list.
Click the UPDATE button to save the changes.
Delete
Option to delete the existing data validation job. The application cannot be deleted while the data profile job is running.