A Dataset is a saved entity that contains a schema and rules.
Rules can be configured in a dataset to transform its schema or results.
Channels in pipelines can re-use a dataset's schema and rules; this transforms the data from the channel with the corresponding transformations.
A dataset is used to analyze data (mainly by Data Scientists) by viewing the profile results generated from the actual data in the source defined for it.
A Profile is the statistical analysis of the columns of a dataset.
A history of profile results is maintained under Profile History.
You can view the association or flow of a dataset between pipelines in the system; this is Dataset Lineage. You can expand the lineage view to any level (parent/child).
Dataset versions can be created based on schema and rule changes. The schema, rules, and lineage are then listed version-wise.
Datasets can be created externally, from a Channel, or from an Emitter.
A dataset can be created while configuring a Data Source or an Emitter within a data pipeline.
While configuring a Data Source, the option to create a dataset is available under Detect Schema (as shown in the image below).
Select Save As Dataset and enter a unique name for the dataset.
While configuring an Emitter, the option to save a dataset is under Configuration, as shown:
The name of the dataset has to be unique; otherwise the error message “Dataset name already exists” is shown.
For both Data Source and Emitter, the user has an option to override the credentials of the selected connection on the Configuration tab.
If opted, the user must provide other valid credentials to authenticate the connection.
Note: Dataset creation is supported on the following Data sources and Data Sinks:
- Batch DFS Data Source and Batch DFS Emitter
- Hive Data Source and Hive Emitter
- JDBC Data Source and JDBC Emitter
- S3 Data Source
Note: Dataset creation is also supported on GCS and BigQuery for Gathr Google cloud platform.
Dataset Version and Disassociation
Whenever a dataset is created, a version is created, and editing the components that enabled dataset versioning creates a new version. The only change you can make directly in a dataset is editing its rules.
The following changes will create a new version of a dataset:
- Modifying the rules of a dataset will create a new version.
- Changing the dataset schema at a pipeline emitter will create a new version (when the dataset was created from the emitter).
- Changing the configuration of a dataset at a pipeline emitter will disassociate the dataset.
When you create a dataset on an existing channel or edit the configuration of the above-mentioned components, the system prompts a relevant warning message before you accidentally overwrite an already created dataset:
The final confirmation of Dataset creation is reflected in Green:
Once a new version is created, you can view the version list under View Dataset > Summary. A sub-section opens in the same window with the version list and the schema details.
Schema Details consists of Alias and Datatype.
Alias is the name of the field and Datatype is the data type of the field value, as shown in the screenshot above.
Note:
- Changing the dataset name will create a new dataset.
- Dataset names must be unique across the workspace.
A dataset is a combination of schema and a set of rules on the schema. You can create and re-use a dataset so that the pipeline runs with a pre-defined schema and a set of rules that are already configured in the dataset.
You can create a dataset externally from the Dataset option within a project.
An external dataset is helpful for getting insights on the data and an analysis of each attribute of the dataset. The dataset can then be used in the desired pipelines.
Below are the steps to create and view datasets externally:
To create a new dataset, click the plus icon on the top right screen.
Under General, enter the name of the dataset to be created, followed by the Description, and click Next (as shown below).
- The dataset name has to be unique.
Note: The user can define the scope of the dataset by selecting either Project or Workspace. With Workspace scope, the created dataset can be used anywhere in the workspace; if the user selects Project, the created dataset is visible only in that specific project.
You can create an external dataset through the below-mentioned data sources: HDFS, Hive, JDBC, S3, File System, SFTP, and ADLS.
Note: Dataset creation is also supported on GCS and BigQuery for Gathr Google cloud platform.
Create a Dataset through ADLS
Provide values for the following properties:
Field | Description |
---|---|
Source | The source for which the dataset is to be created, i.e., ADLS. |
Connection Name | Connections are the service identifiers. Select the connection name from the available list of connections, from where you would like to read the data. |
Container | Name of the ADLS container from which the data should be read. |
ADLS Directory Path | Provide the directory path for the ADLS file system. |
Type of Data | Data can be fetched in the form of CSV/TEXT/JSON/XML/Fixed Length, Avro, ORC, and Parquet files. |
Header Included | Check-mark this option to include the header. |
Create a Dataset through HDFS
Enter values for the following properties.
Field | Description |
---|---|
Connection Name | Connections are the service identifiers. Select the connection name from the available list of connections, from where you would like to read the data. |
Override Credential | Option to override existing connection credentials. This option is unchecked by default. If check-marked, the user will need to provide any other valid credential details for the below mentioned fields through which the HDFS data source connection can be authenticated: Username, KeyTab Select Option and KeyTab File Path. Note: Make sure to provide valid credentials if Override Credentials option is chosen. |
HDFS File Path | HDFS path from where data is read. |
Type of Data | Data can be fetched in the form of CSV/TEXT/JSON/XML/Fixed Length and Parquet file. |
Create a Dataset through Hive
Proceed with entering values for the below properties.
Field | Description |
---|---|
Connection Name | Connections are the service identifiers. Select the connection name from the available list of connections, from where you would like to read the data. |
Override Credential | Option to override existing connection credentials. This option is unchecked by default. If check-marked, the user will need to provide any other valid credential details for the below mentioned fields through which the Hive data source connection can be authenticated: Username, KeyTab Select Option and KeyTab File Path. Note: Make sure to provide valid credentials if Override Credentials option is chosen. |
Query | Write a Hive-compatible SQL query to be executed in the component (see the illustrative sketch after this table). |
Refresh Table Metadata | Check this option if you want to refresh the table metadata or sync updated DFS partition files with the metastore. |
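The sketch below is illustrative only and not Gathr's implementation: it shows, with PySpark, what a Hive-compatible query and a metadata refresh amount to. The database/table name sales_db.orders is a hypothetical example.

```python
# A minimal sketch, assuming a Hive-enabled Spark session and a hypothetical
# table sales_db.orders. It is not Gathr's implementation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Query: a Hive-compatible SQL query, as it could be entered in the Query field.
df = spark.sql("SELECT order_id, amount FROM sales_db.orders WHERE amount > 100")

# Refresh Table Metadata: sync newly added DFS partition files with the metastore.
spark.sql("MSCK REPAIR TABLE sales_db.orders")
spark.catalog.refreshTable("sales_db.orders")
```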
Create a Dataset through JDBC
Create a dataset through JDBC by selecting JDBC and providing the below-mentioned properties.
Field | Description |
---|---|
Connection Name | Connections are the service identifiers. Select the connection name from the available list of connections, from where you would like to read the data. |
Override Credential | Option to override existing connection credentials. If check-marked, the user will need to provide any other valid credential details for the below mentioned fields through which the JDBC data source connection can be authenticated: Username and Password. Note: Make sure to provide valid credentials if Override Credentials option is chosen. |
Query | Write a JDBC compatible SQL query to be executed. |
Enable Query Partitioning | Tables will be partitioned and loaded into RDDs if this check-box is checked. This enables parallel reading of data from the table (see the illustrative sketch after this table). |
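The following sketch is illustrative only and not Gathr's implementation: it shows, with PySpark, how query partitioning enables parallel JDBC reads. The URL, table, column names, and credentials are hypothetical.

```python
# A minimal sketch, assuming a hypothetical MySQL database reachable at db-host.
# It is not Gathr's implementation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales_db")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    # Query partitioning: split a numeric column into ranges so that
    # multiple executors read slices of the table in parallel.
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
```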
Create a Dataset through S3
To create a dataset through the S3 Source, provide values for the following properties.
Field | Description |
---|---|
Connection Name | Connections are the service identifiers. Select the connection name from the available list of connections, from where you would like to read the data. |
Override Credential | Option to override existing connection credentials. If check-marked, the user will need to provide any other valid credential details for the below mentioned fields through which the S3 data source connection can be authenticated: AWS KeyId and Secret Access Key. Note: Make sure to provide valid credentials if Override Credentials option is chosen. |
S3 Protocol | Protocols available are S3, S3n, and S3a. |
Bucket Name | Buckets are storage units used to store objects, which consist of data and metadata that describes the data. |
Path | File or directory path from where data is to be read. |
Type of Data | Data can be fetched in the form of CSV/TEXT/JSON/XML/Fixed Length and Parquet file. |
Is Header Included | Check-mark if the header is included in the CSV file (see the illustrative sketch after this table). |
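The sketch below is illustrative only and not Gathr's implementation: it shows, with PySpark, what reading header-included CSV data over the s3a protocol looks like. The bucket name and path are hypothetical, and AWS credentials are assumed to be configured on the cluster.

```python
# A minimal sketch, assuming a hypothetical bucket "my-bucket" and cluster-level
# AWS credentials. It is not Gathr's implementation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")       # "Is Header Included" checked
    .option("inferSchema", "true")  # derive column data types from the data
    .csv("s3a://my-bucket/datasets/sales/")
)
df.printSchema()
```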
Create a Dataset through File System
Create a dataset through the File System option. Upload a file and select the Type of Data.
Create a Dataset through SFTP
Create a dataset through the SFTP option.
Create a Dataset through GCS
Create a Dataset through BigQuery
After the source is configured, the data from the source is represented as a Schema; this process is called Detect Schema. The schema is then divided into columns with Column Name, Column Alias, Data Type, and Sample Values.
Column Alias and Data Type are editable on this page.
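The following sketch is illustrative only and not Gathr's implementation: it shows, with PySpark on a hypothetical CSV file, what schema detection amounts to and what editing a column alias and data type corresponds to.

```python
# A minimal sketch, assuming a hypothetical file /data/sales.csv with a column
# "amt". It is not Gathr's implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/sales.csv")

# Column Name and Data Type, as listed on the Detect Schema page.
for field in df.schema.fields:
    print(field.name, field.dataType)

# Editing the Column Alias and Data Type corresponds to renaming and casting a column.
df = df.withColumnRenamed("amt", "amount")
df = df.withColumn("amount", F.col("amount").cast("double"))
```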
Once the schema is detected, you can click on Create from the top right corner of the UI, as shown in the image above. You will be notified that the Dataset has been successfully created.
As soon as you click Create, you are navigated to the Data window under Explore (the View Dataset page), where the schema is listed in the RULES window.
The rules are displayed alongside the DATA window, which shows all the records of the dataset, as shown below. The screen is divided into the following sections:
- Columns of Data
- Actions on Columns
- Rules
- Analyze
- Statistics
- Unique Values
Let us go through each section in detail.
Every column is divided into the following parts, and you can sort the data by using the operations.
1. Data Type:
Supported data types are:
- Date and Timestamp
- Numeric
- String
- Boolean
For example, in the image shown above, abc denotes the String data type.
2. Operations
This option is at the top right of the column: a gear icon that opens a slide window with the following operations, which can be performed on the respective column of the schema.
Whenever any operation is applied on a column (or columns), a set of rules is created and reflected in the Rules section on the right (an illustrative sketch of these operations follows this section).
a. Filter the values
Filter the values based on any of the shown filters, such as Equals, defining a range with Range, and so on. The Custom filter can be used for a custom value.
b. Transform
The Transform operation can be applied to transform the data, for example to Uppercase, Lowercase, or Trim the values, as shown in the image below.
c. Missing Value Replacement:
Replace the missing or null values with either Literal or Expression value.
Literal: Replaces Null and/or empty string values with a specified string literal.
Expression: Replaces Null and/or empty string values with a specified expression value.
d. Pivot:
Pivot the columns, where PIVOT is a relational operator that converts data from row level to column level.
e. Group By
Use this operation to group the columns together.
f. Rename Column
Rename the column.
g. Create New Column
Create a new column using this operation.
h. Remove Column
Remove the selected column.
3. Sort
Along with Filters, you can also sort the columns Alphabetically, from A to Z or Z to A.
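The sketch below is illustrative only and not Gathr's implementation: it shows rough PySpark equivalents of the column operations described above (Filter, Transform, Missing Value Replacement, Pivot, Group By, Rename Column, Create New Column, Remove Column). All column names and the file path are hypothetical; in Gathr these are applied as rules through the UI rather than as code.

```python
# A minimal sketch, assuming a hypothetical file /data/sales.csv with columns
# selling_price, cost, region, category, qty, remarks. Not Gathr's implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/sales.csv")

df = df.filter(F.col("selling_price") < 5)                       # Filter the values
df = df.withColumn("region", F.upper(F.trim(F.col("region"))))   # Transform (Uppercase, Trim)
df = df.fillna({"region": "UNKNOWN"})                            # Missing Value Replacement (Literal)
df = df.withColumnRenamed("qty", "quantity")                     # Rename Column
df = df.withColumn(                                              # Create New Column
    "profit_pct", (F.col("selling_price") - F.col("cost")) / F.col("cost") * 100
)
df = df.drop("remarks")                                          # Remove Column

# Group By + Pivot: row-level category values become columns of aggregated quantities.
pivoted = df.groupBy("region").pivot("category").sum("quantity")
```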
The actions that can be performed on all the columns together are as follows:
- Create Column: Create a column using this icon.
- Keep/Remove Columns: Keep or remove a column in the schema.
- Display Columns: Display the selected columns in the schema. (Note that hidden columns will still be a part of the schema, just not displayed.)
- Toggle Rules: Toggle the list of the displayed rules.
- Search Value: Search for a value in the schema.
Rules are conditions/actions applied on the columns to modify the dataset as per the requirements. You can view the applied rules in the right navigation panel.
When a dataset is first created, its version is 0.
Each time you create a rule and save the changes, the version increments by one (n+1).
Once you have applied the changes in the rules, click on the Save icon in the Rules navigation bar to save the changes.
To edit an existing dataset, open the dataset and click the Explore button; you can then configure new rules or reconfigure the existing rules.
For example, in the below image, you can see the two rules applied with two sub-sections: Analyze and Unique Values.
The two applied rules are:
- Selling Price less than 5
- Profit Percentage greater than 1
These rules can be deleted by clicking the cross next to each rule.
All the rules can be deleted by clicking on the Delete icon above Rules.
As per the rules applied on the Columns, the Analyze window and Unique Values window also change their values.
Under Analyze, you can view the Null values of the column on which the last rule was applied.
Under Statistics, you can view the mathematical statistics of the entire column (an illustrative sketch follows this list) in the form of:
- Minimum
- Maximum
- Mean
- Median
- Standard Deviation
- Mode
- Distinct
- Sum
- Range
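The sketch below is illustrative only and not Gathr's implementation: it shows how the statistics listed above can be computed for a numeric column with PySpark. The column name selling_price and the file path are hypothetical.

```python
# A minimal sketch, assuming a hypothetical file /data/sales.csv with a numeric
# column selling_price. It is not Gathr's implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/sales.csv")

stats = df.agg(
    F.min("selling_price").alias("minimum"),
    F.max("selling_price").alias("maximum"),
    F.mean("selling_price").alias("mean"),
    F.expr("percentile_approx(selling_price, 0.5)").alias("median"),
    F.stddev("selling_price").alias("standard_deviation"),
    F.countDistinct("selling_price").alias("distinct"),
    F.sum("selling_price").alias("sum"),
    (F.max("selling_price") - F.min("selling_price")).alias("range"),
)
stats.show()

# Mode: the most frequent value of the column.
mode = df.groupBy("selling_price").count().orderBy(F.desc("count")).limit(1)
mode.show()
```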
Under Unique Values, you get the unique values in the column; when you hover your mouse over a bar, it shows the count of that unique value.
You can also sort the counts in descending, ascending, or alphabetical order.
After applying the changed rules, a new version of the Dataset will be created.
Note:
- You can view a list of all the saved datasets in a workspace.
- Data is fetched from HDFS in all cases by using Fetch From Source (a hidden property).
The Dataset homepage shows a list of all the datasets created. The screenshot below shows the same, with the properties described in the table below:
Property | Description |
---|---|
Name | Name of the dataset. |
Description | Description of the dataset. |
Source Type | Data source type. |
Actions | When you click on the eye icon, a view-dataset window opens. This is the page where you can view the schema of the dataset, versions of the dataset, and other options explained below in hierarchical form. |
Delete | Option to delete a dataset. Datasets that are not being used in any pipeline or validation job can be deleted with this option. Note: An existing dataset that is being consumed by any pipeline or validation job cannot be deleted unless it is removed from its associated pipeline or validation job. |
Click the view-dataset eye icon under the Actions tab of the Dataset homepage. A window will open with the details of the dataset, with the following two tabs: Summary and Explore.
Note: You can view both the existing external datasets and the datasets created from the pipeline canvas.
The Summary button shows the summary of your dataset. The right panel is the description window, which can be edited and saved in the same window. The left panel has the Data Source or Emitter dataset details, where you can view schema details, connection details, and other properties, explained in the table below:
The screenshot under View Dataset shows the dataset details and the active version of a DFS Data Source under the About button. The below-mentioned table describes all the properties of the Summary tab.
Property | Description |
---|---|
Dataset Name | Name of the dataset. |
Last Read From Source | The last time the data was read from the data source. The date and time are mentioned here. |
Last Profile Run | Shows when the last dataset profile was generated. The profile is generated when you run the dataset profile from Run Profile. |
File Path | The file path and the format of the data type from which the data will be read while generating the profile. In a few components, the HDFS path is replaced by a Query/Database name or Table name, and the configured query is reflected here. For example, in the case of JDBC, the query mentioned here will run while generating the profile. In the case of HDFS, the configured path is where the data will be read from. Note: All the above properties should be re-configured when a new dataset version is created. |
Number of Columns | Number of columns in the data. |
Number of Records | The total number of records when the last profile was generated in the data source. |
Schema and Rules | This clickable button opens a window at the bottom of this pane and displays the versions and the corresponding rules applied on the dataset. In the Schema window, you can view the Alias and the Datatype of the schema. |
Profile History | Number of times the profile was generated. You can view the respective results. |
Version | The latest version of the dataset. |
Run Profile Status | Shows the current state of the profile, i.e., whether it is in the execution or stopped state. The Play and Stop buttons allow you to start and stop the profile. The menu button has two options, Schedule Job and Configure Job, which are explained below. |
Tags | Associate tags with the dataset. Tags can also be updated from the same window. |
Description | The description provided while creating the dataset. You can edit the description within this window, and a “Description updated successfully” message will pop up. |
Dataset Lineage | Opens the lineage view, which lists all the versions of the dataset and its complete life-cycle. |
Select Version | Select the version of the dataset that you want to view. |
Alias | Name of the field. |
Datatype | Datatype of the field (Int, String). |
When you hover the mouse over the Path details, the name of the connection is shown, as below:
Schema opens a new window beneath the Schema panel. Select a version of the dataset and it will list the aliases with their data types.
Profile History opens a new window beneath the View Schema panel. A tabular form of profile history is shown with details of the Dataset profile:
Property | Description |
---|---|
Version | Version number of the dataset. |
Number of Columns | Number of columns in the dataset. |
Number of Records | Number of records in the dataset. |
Last Profile Run | The date and time on which the profile was run. |
Action | View the profile results. |
Run Profile Status shows the current state of the profile execution, i.e., whether it is in Starting, Active, or Stopped mode.
You can Stop and Play the profile using the respective buttons as well.
When a profile is run, a pipeline gets submitted on the cluster. This pipeline follows the nomenclature explained below:
System prefix of the pipeline_Dataset name_Dataset version_Timestamp.
For example, SAx_DatasetProfileGenerator_IrisInputData_0_1559296535220
This pipeline will be submitted as a batch job in the cluster.
Along with executing the profile, you can also configure the job and schedule the job, as explained below:
Configure Job
The user can tune the job in this window by providing driver- and executor-related parameters. To know more, click Configure Job.
You can also configure error properties from this window to check the errors of this job.
Schedule Job
Schedule Job enables a dataset to run a job as per the defined cron expression (for example, in standard cron syntax, 0 2 * * * runs the job daily at 2:00 AM).
Once you define a cron expression, you have the option to schedule the job; once it is scheduled, UN-SCHEDULE and RESCHEDULE buttons become available.
Dataset Lineage window lists all the versions of the dataset and its complete life-cycle.
You can view the dataset lineage by selecting a version. Dataset lineage represents the association of a dataset with pipelines; it shows the channel or emitter within the data pipeline where the dataset is used.
An association is defined if the dataset schema and rules are used in a channel. This makes it possible to re-use the same entities in multiple pipeline channels, as Use Existing Dataset.
In the case of an emitter, only the schema part of the dataset is associated.
The complete life-cycle of the dataset is thus shown on the lineage page.
The lineage represents the flow of the dataset in the system across pipelines.
Initially, a basic lineage is shown. You then have the option to expand the dataset or pipeline lineage to get more parent/child associations and flows.
The lineage is visible on the Summary screen. The below example shows the lineage as follows:
- JDBC_UserDS is used in PipelineBB and PipelineAA as a channel.
- HDFS_EmitterDS01 is created by Save As Dataset in the emitter of PipelineAA and is used as a channel in PipelineCC and PipelineDD.
A dataset-to-pipeline arrow signifies that the dataset is used in the pipeline as a channel. A pipeline-to-dataset arrow signifies that the dataset is saved in an emitter of the pipeline.
The Explore tab generates details about the data in the data window. Under Explore are two tabs:
- Data
- Profile
Under the Data tab, you can view the rules and the dataset with its schema. This tab is explained under Rules.
The Profile pane lists all the variables in your dataset. This section also shows various statistical insights on each variable, such as Avg, Min, Max, Percentile, etc. You can also click the ‘Frequency Distribution Details’ label to see the frequency distribution corresponding to every variable.
Frequency Distribution Details:
The frequency distribution of any attribute/field is the count of individual values for that field in the whole dataset.
For Numeric type fields, it is shown in terms of counts only.
For String/Date/Timestamp fields, you can view the frequency/counts along with its percentage.
By default, only 10 distinct values are shown, but this can be changed by updating "sax.datasets.profile.frequency.distribution.count.limit" from the Superuser Configuration.
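The sketch below is illustrative only and not Gathr's implementation: it shows, with PySpark, what the frequency distribution amounts to. The column name and file path are hypothetical; the limit of 10 mirrors the default value of sax.datasets.profile.frequency.distribution.count.limit.

```python
# A minimal sketch, assuming a hypothetical file /data/sales.csv with a column
# "category". It is not Gathr's implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/sales.csv")

limit = 10          # default number of distinct values shown
total = df.count()  # total records, used for the percentage

freq = (
    df.groupBy("category")
      .count()                                            # count of each individual value
      .withColumn("percentage", F.col("count") * 100.0 / total)
      .orderBy(F.desc("count"))
      .limit(limit)
)
freq.show()
```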
As shown below, you can click on the bar of Frequency Distribution and it expands with a graph.
The Frequency Distribution Graph is generated for every variable in the dataset.
A dataset can be reused using the Re-use option.
You can re-use any dataset on any channel if the dataset schema matches the channel source schema. Otherwise, you will get a warning and a no-association error message.
You can choose any existing dataset from the listed datasets. Once you choose the selected dataset, the schema is automatically applied, and the results are populated during inspect.
For an emitter, only the schema of the dataset should match the emitter's output data; otherwise, the re-use will be converted to a disassociation. Re-use associates the dataset with the pipeline, which is reflected in the Data Lineage window.
How to use an Existing Dataset
To use an existing dataset, create a pipeline and, while configuring the Data Source, choose the option “Use Existing Dataset”, as shown in the image below.
We will take the example of a pipeline that has RMQ DS + DFS DS > Union > DFS Emitter. This example shows the different places in a data pipeline where you can use an existing dataset.
RabbitMQ Data Source: we will use an existing dataset.
DFS Data Source: we will use the “Save As Dataset” option.
DFS Data Sink: we will use “Save As Dataset”.
Now, using these options we will create a Data Pipeline and those datasets will be listed in the Dataset homepage (dataset list page). When we edit the pipeline, the option to re-use the existing dataset will be reflected for RMQ data source as well.
Under the configuration of the RabbitMQ Data Source, select “Use Existing Dataset”.
Then select the dataset and version.
In the Select Dataset section (note: the dataset source can be different from the current channel where it is used), you can see the selected dataset’s schema and rules to identify the dataset details that will be applied in the Data Source.
Click Next to continue the configuration of RMQ. Configure RabbitMQ as required and click Next to go to the Detect Schema page, where the schema will be applied.
If the source schema and dataset schema match and are applied successfully, you will see the Detect Schema results with sample values.
Note: All the Detect Schema results will be in read-only mode, since you cannot change the names or data types under the “Use Existing Dataset” option.
If there is a mismatch in the schema, you will see warning messages on the Detect Schema window. As a result, the dataset will not be associated with the pipeline, and that pipeline will not be a part of the lineage (the dataset lineage part is a later implementation). However, you can continue with the failed conditions.
After configuring RMQ with a successful match, you can view the inspect results. In the image below, you can see the results with applied dataset schema as well as rules.
Configure HDFS Data Source:
Under Schema Type, choose Fetch From Source, and then go to the Configuration page.
Go to Detect Schema; you can see the “Save As Dataset” option, as this is one of the channels where a dataset can be created through the pipeline.
Check the “Save As Dataset” option and give the dataset a name (e.g., “dfs_channel_dataset”). Save the channel.
Configure Union, then configure the DFS emitter of the example; you can see the “Save As Dataset” option, as this is one of the emitters where a dataset can be created through the pipeline.
Check the “Save As Dataset” option and give the dataset a name (e.g., “dfs_emitter_dataset”). Save the emitter.
Now, save the pipeline (let us say with the name “pipeline_with_reuse_datasets”). Go to the Datasets list. The user can see two datasets created, one from the DFS channel and the other from the DFS emitter.
Now edit the pipeline (“pipeline_with_reuse_datasets”).
Open the DFS channel component; you will see the “Use Existing Dataset” option selected, with the new dataset name, its version 0 selected, and its details. The reason is that you saved the “dfs_channel_dataset” dataset on the channel, and it has now become a part of the pipeline. The same will be reflected in the lineage association.
Now, under the Detect Schema window (with auto inspect on) of the DFS channel, you will not see the “Save As Dataset” option but will see read-only results with the applied “dfs_channel_dataset” schema. The reason is that the channel has now become a “Use Existing Dataset” case.
Note: This means that if a channel uses the “Use Existing Dataset” option, the “Save As Dataset” option is not visible (either while creating or editing the channel in the pipeline).
Now edit the DFS emitter; the “Save As Dataset” option will be shown with the saved name. This means the dataset is still associated with the pipeline and will be part of the lineage (later implementation).
Now remove one of the output fields (like matchid) in the DFS emitter to change its schema. Click Next; you will be warned with a message, and the emitter dataset will be disassociated and will not be a part of the lineage (later implementation).
Note: “Use Existing Dataset” is available in all channels except Data Generator.
When you upload or download a data pipeline, the Dataset corresponding to the pipeline is exported and imported with it.
The imported datasets can be viewed on the Dataset homepage.
Steps to export and import Datasets, with example:
Step 1: Create a pipeline ds_pl with 2 datasets:
- DFS channel dataset named: dfs_ch1
- DFS emitter dataset named: dfs_em1
Now on the Datasets page, two Datasets will be listed.
Step 2: To export the ds_pl pipeline, navigate to the Pipeline Download Version page.
This page shows the download version list:
- Click Download; it will ask you to save the pipeline.
- Click Save File to save the pipeline binary.
Step 3: To upload the ds_pl pipeline, navigate to the Data Pipeline homepage and choose the Upload Pipeline option.
Step 4: Click Upload Pipeline and select the ds_pl pipeline binary. A warning window will ask you to overwrite the pipeline or create a new pipeline.
Note: A dataset created from the local file system and used in the pipeline will not be imported.
Step 5: Choose New Pipeline and click on Proceed. Provide a new pipeline name as ds_pl1 and click Upload.
Step 6: You are then asked to give a new dataset name for the DFS channel; as dfs_ch1 already exists, we can name this one dfs_ch2.
Step 7: Then select the DFS emitter and give it a new dataset name, dfs_em2, since dfs_em1 already exists. Once the names are saved, click Upload.