Data Quality

Actions Available

Several actions are available on each tab of the data asset view page, in addition to those on the listing page.


Edit Data Asset Name: Modify the name of the data asset to better suit your needs.

Additional Options: Access a range of actions including deletion, utilization in Ingestion or ETL Applications, marking as a favorite, and configuring the data asset.

Start Profiling: Initiate data profiling to gain insights into your data’s characteristics and quality.

Back to Data Assets Listing: Return to the list of all data assets for an overview of your data.


Data Quality

The data quality of the source is measured to assess the accuracy, completeness, consistency, and overall reliability of the data asset.

  • If the data quality is not available for a data asset, the following message is shown:

    Data Quality is not available for this Data Asset. Do a profile run (use the play button at the top-right section) to calculate the overall data quality.

  • If a new version of a data asset is created but no profile run has been done on it, the data quality of the most recent profiled version is displayed.

    To get the data quality of the latest version, do a profile run on it.

The overall data quality score is divided into the following bands:

Poor: Falls between 0-25% of the overall data quality score. A poor data asset cannot be trusted due to inaccuracies, inconsistencies, or a lack of credibility.

Average: Falls between 25-50% of the overall data quality score. An average data asset is insufficient in terms of quality, quantity, or relevance and lacks the necessary attributes to support effective analysis.

Fair: Falls between 50-75% of the overall data quality score. A fair data asset meets acceptable standards of accuracy, and is free from major errors and inconsistencies.

Good: Falls between 75-90% of the overall data quality score. A good data asset is accurate, and can be trusted for analysis or decision-making.

Excellent: Falls between 90-100% of the overall data quality score. An excellent data asset is of exceptionally high quality and stands out for its reliability.
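
The bands above can be sketched as a small lookup. This is only an illustration, not Gathr's implementation: the function name `quality_band` is hypothetical, and because the listed ranges share their endpoints (25, 50, 75, 90), the sketch assumes each lower bound is inclusive.

```python
# Lower bound of each band, taken from the ranges listed above.
# Assumption: a score equal to a shared endpoint (e.g. 25) belongs to the
# higher band. The source does not say which band owns the endpoint.
BANDS = [
    (90.0, "Excellent"),
    (75.0, "Good"),
    (50.0, "Fair"),
    (25.0, "Average"),
    (0.0, "Poor"),
]


def quality_band(score: float) -> str:
    """Map an overall data quality score (0-100) to its band label."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # BANDS is ordered from highest to lowest, so the first match wins.
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: every valid score is >= 0")
```

For example, under the inclusive-lower-bound assumption, `quality_band(82.5)` returns "Good" and `quality_band(25)` returns "Average".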

The percentage change in data quality is shown after the latest profile run of a data asset. The score can go down, go up, or remain unchanged relative to the previous run.


Data Completeness

A comprehensive source data analysis is conducted to ensure a reliable single source of truth.

  • If the data completeness is not available for a data asset, the following message is shown:

    Data Completeness is not available for this Data Asset. Do a profile run (use the play button at the top-right section) to calculate the data completeness.

  • If a new version of a data asset is created but no profile run has been done on it, the data completeness of the most recent profiled version is displayed.

    To get the data completeness of the latest version, do a profile run on it.

Data completeness is expressed as a percentage and measured based on the following factors:

Accuracy: Indicates the proportion of accurate versus inaccurate data (including redundant and null rows).

Uniqueness: Determines how much of the data is unique versus duplicated.

Completeness: Calculates the proportion of complete versus incomplete data (including null rows and empty strings).
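
To make two of these factors concrete, here is a minimal sketch of how uniqueness and completeness percentages could be computed over a set of rows. The exact formulas Gathr uses are not documented here, so the row-level definitions below are assumptions, and the function names are hypothetical. Accuracy is left out because it depends on validation rules that this page does not describe.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def completeness_pct(rows: Sequence[Row]) -> float:
    """Percentage of rows that contain no null and no empty-string values.

    Assumption: a row counts as incomplete if any of its fields is missing.
    """
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows if all(v is not None and v != "" for v in row)
    )
    return 100.0 * complete / len(rows)


def uniqueness_pct(rows: Sequence[Row]) -> float:
    """Percentage of distinct rows relative to the total number of rows.

    Assumption: two rows are duplicates when all of their fields match.
    """
    if not rows:
        return 0.0
    distinct = len({tuple(row) for row in rows})
    return 100.0 * distinct / len(rows)


sample = [("a", "1"), ("a", "1"), ("b", None), ("c", "")]
print(completeness_pct(sample))  # two of four rows have no missing values
print(uniqueness_pct(sample))    # three distinct rows out of four
```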


Profile

The profile section displays the assigned cluster and data asset scheduling details.

Configure Profiling

Select the data asset version on which profiling should run and configure deployment settings on Gathr.

Schedule Profiling

Scheduling profile runs enables you to automate the data asset profiling at a required frequency, reducing the need for manual intervention.


Click Profile Scheduling to set the frequency of profile runs. Once a schedule is set, the UN-SCHEDULE and RESCHEDULE buttons become available to manage it.


Profile History

The profile history is shown as a table with the following details of each data asset profile run:


Field Name | Description
Version | Version number of the data asset.
Status | The current state of the data asset.
Start Time | The timestamp record when the data asset profile run was started.
End Time | The timestamp record when the data asset profile run stopped.
Number of Columns | Number of columns in the data asset.
Number of Records | Number of records in the data asset.
Last Profile Run | The date and time when the last profile run got completed successfully.
Credit Points Used | Total credit points consumed for the data asset profiling.
Cluster Type | The cluster details assigned to the data asset for profile run.
Action | Option to view the data asset’s profiling results.

View Run Profile

The Profile Run window shows statistical insights on each variable, such as Avg, Min, Max, and Percentile.
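
As a rough illustration of the kind of per-variable statistics named above, the following sketch computes them for one numeric variable with Python's standard `statistics` module. It is not how Gathr computes its profile; the function name `summarize` and the choice of quartiles as the reported percentiles are assumptions made for the example.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return average, minimum, maximum, and quartiles for one variable.

    Requires at least two values, because statistics.quantiles needs
    two or more data points on older Python versions.
    """
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points between the four quartiles.
    p25, p50, p75 = statistics.quantiles(ordered, n=4)
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p25": p25,
        "p50": p50,
        "p75": p75,
    }


print(summarize([1, 2, 3, 4, 5]))
```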

You can also click the Frequency Distribution Details label to see the frequency distribution for each variable.

Frequency Distribution Details:

The frequency distribution of an attribute/field is the count of each individual value of that field across the whole data asset.

For Numeric type fields, it is shown as counts only.

For String/Date/Timestamp fields, you can view the counts along with their percentages.
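
The idea can be sketched with Python's `collections.Counter`. This is an illustrative example rather than Gathr's implementation: the function name `frequency_distribution` and the `with_percentage` flag are hypothetical, with the flag standing in for the numeric versus String/Date/Timestamp distinction described above.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def frequency_distribution(
    values: Iterable[Hashable], with_percentage: bool = False
) -> Dict[Hashable, Dict[str, float]]:
    """Count how often each distinct value occurs in a field.

    with_percentage=False mirrors the counts-only view for numeric fields;
    with_percentage=True adds each value's share of all records.
    """
    counts = Counter(values)
    total = sum(counts.values())
    result: Dict[Hashable, Dict[str, float]] = {}
    # most_common() yields values from the most to the least frequent.
    for value, count in counts.most_common():
        entry: Dict[str, float] = {"count": count}
        if with_percentage:
            entry["percentage"] = round(100.0 * count / total, 2)
        result[value] = entry
    return result


# Counts only, as for a numeric field.
print(frequency_distribution([10, 20, 10]))
# Counts with percentages, as for a string field.
print(frequency_distribution(["IN", "US", "IN", "IN"], with_percentage=True))
```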

The Frequency Distribution Graph is generated for every variable in the data asset.


You can filter or sort the variables whose data profile you want to see.
