Data Pipelines

Pipeline Configuration

The home page of the Data Pipeline section is explained below:

DPHomepage1.PNG

The numbered callouts in the figure are described below:

1. Pipelines: All data pipelines are listed under this tab.

2. Workflow: Workflows are defined under this tab.

3. Filter: Filter applications based on their status (Starting, Active, or Stopped). A Reset button restores the default settings.

4. Actions for Pipelines: Create a new pipeline, integrate pipelines, audit pipelines, download a sample project, or upload a pipeline from the same or a different workspace.

With a StreamAnalytix Data Pipeline, you can perform the following actions:

• Design, import, and export real-time data pipelines.

• Drag, drop and connect operators to create applications.

• Create Datasets and reuse them in different pipelines.

• Create Workflows using pipelines and apply control nodes and actions on them.

• Create models from the analytics operators.

• Monitor detailed metrics of each task and instance.

• Run PMML-based scripts in real-time on every incoming message.

• Explore real-time data using interactive charts and dashboards.

Creating a Spark Pipeline

To create a pipeline, go to the Data Pipeline page.

Make sure that the Local/Livy connection is established. The green arrow above the pipeline canvas denotes that the connection is established.

Drag components on the pipeline builder (pipeline grid-canvas) from the right panel and connect the components.

inspectconnectGreenarrow.PNG

Select and right-click a component to configure its properties (see the Data Sources, Processors, Data Science, and Emitters sections to understand the components).

Once the configuration is done, click the save (floppy) icon shown above to save the pipeline.

Enter the details of the pipeline in the Pipeline Definition window (shown below).

The pipeline definition options are explained below:

pipelinedefinition.PNG

 

pipelinedefinition2.PNG

 

When you click More, the configuration mentioned below appears (it is the same for Local and Livy).

Pipeline Name: Name of the pipeline that is created.

HDFS User: The YARN user with which you want to submit the pipeline.

Key Tab Option: You can either Specify a Key Tab Path or Upload the Key Tab File.

Key Tab File Path: If you have saved the keytab file to your local machine, specify its path.

Error Handler: Enable the error handler if you are using a Replay data source in the pipeline. Selecting the error handler checkbox displays a Configure link; clicking it opens the error handler configuration screen.

Publish Lineage to Cloudera Navigator: Publishes the pipeline's lineage to the Cloudera environment (only in a CDH-enabled environment).

Log Level: Controls the logs generated by the pipeline based on the selected log level.
• Trace: View information of the trace log level.
• Debug: View information of the debug and trace log levels.
• Info: View information of the trace, debug, and info log levels.
• Warn: View information of the trace, debug, warn, and info log levels.
• Error: View information of the trace, debug, warn, info, and error log levels.

Yarn Queue: The name of the YARN queue to which the application is submitted.

Version: Creates a new version of the pipeline. The current version is called the Working Copy, and subsequent versions are numbered incrementally (n+1).

Comment: Write notes specific to the pipeline.

Deployment Mode: Specifies the deployment mode of the pipeline.
• Cluster: In cluster mode, the driver runs on any of the worker nodes.
• Client: In client mode, the driver runs on the StreamAnalytix admin node.
• Local: In local mode, the driver and the executors run on the StreamAnalytix admin container.

Driver Cores: Number of cores to use for the driver process.

Driver Memory: Amount of memory to use for the driver process.

Application Cores: Number of cores allocated to the Spark application. It must be greater than the number of receivers in the pipeline. It also determines the number of executors for your pipeline: number of executors = Application Cores / Executor Cores.

Executor Cores: Number of cores to use on each executor.

Executor Memory: Amount of memory to use per executor process.

Dynamic Allocation Enabled: When the pipeline is running, Spark scales the number of executors up and down at runtime (only in a CDH-enabled environment).

Extra Driver Java Options: A string of extra JVM options to pass to the driver, for instance GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Executor Java Options: A string of extra JVM options to pass to executors, for instance GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Spark Submit Options: A string with --conf options for passing all the above configuration to a Spark application. For example: --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app' (see the example after this table).

Populate from Local Session: It is applicable for YARN deployment.

morePropertiesCDH.PNG
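The resource and JVM settings above correspond to standard Spark configuration properties. The sketch below is only an illustration of how such values map onto Spark settings when building a session by hand; the values, application name, and use of a local master are assumptions, not what StreamAnalytix itself generates.

from pyspark.sql import SparkSession

# Illustrative values only -- substitute the ones from your Pipeline Definition.
app_cores = 8        # Application Cores
executor_cores = 2   # Executor Cores
num_executors = app_cores // executor_cores  # 8 / 2 = 4 executors

spark = (
    SparkSession.builder
    # 'Local' deployment mode; Cluster/Client are normally chosen via spark-submit --deploy-mode.
    .master("local[*]")
    .appName("my_pipeline")                       # assumed pipeline name
    .config("spark.driver.cores", "2")            # Driver Cores
    .config("spark.driver.memory", "2g")          # Driver Memory
    .config("spark.executor.cores", str(executor_cores))
    .config("spark.executor.memory", "2g")        # Executor Memory
    .config("spark.executor.instances", str(num_executors))
    .config("spark.dynamicAllocation.enabled", "false")
    # Extra Driver/Executor Java Options, e.g. GC logging:
    .config("spark.driver.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .config("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .getOrCreate()
)

# Confirm one of the settings took effect.
print(spark.sparkContext.getConf().get("spark.executor.memory"))

The --conf strings under Extra Spark Submit Options are simply the command-line form of the same .config(...) calls shown here.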

Note:

• When you create a pipeline, the associated or used schema is saved as metadata under message configuration.

• Components are either batch-based or streaming-based.

Error Handler Configuration

The error handler configuration feature enables you to handle pipeline errors.

Error log targets are of three types: RabbitMQ, Kafka, and Logger.

Pipeline errors can be logged to a logger file, Kafka, or RMQ, so that you can refer to the log file or message queues to view those errors whenever required. All these errors are visible on graphs under the Application Errors tab of the pipeline's Summary view.

When you save a pipeline, specify whether you want to enable the error handler or not. Either Kafka or RMQ is selected as the error log target. Whenever an error occurs, the pipeline logs it to the configured error handler. By default, errors are always published to log files. If you want to disable logging errors to these queues, disable the error handler while saving or updating the pipeline. If the error handler is disabled, error graphs will not be visible and errors will not be available on the errors listing page.

Note: If RMQ or Kafka is down, the data pipeline will still work. For the error graph to be visible, the default Elasticsearch connection should be in a working state.

errorhandler.PNG

Error Log Target: Select the target where you want to move the data that failed to process in the pipeline. If you select RabbitMQ, the following fields appear.

Queue Name: Name of the RabbitMQ queue where errors are to be published in case of an exception.

Select Data Sources: The error handler will be configured against the selected Data Sources.

Select Processors/Emitters: The error handler will be configured against the selected processors/emitters.

Connection: Select the connection where the failed data will be pushed.

Enable TTL: The TTL value discards messages from the RabbitMQ error queue; the expected value is in minutes.

Back to Definition: Toggle back to the pipeline definition screen.

Apply Configuration: Toggle back to the pipeline definition screen after saving the error handler configuration.

Based on the error handler target, you will be asked for target-specific configuration, such as connection name, partitions, and topic administration in case of Kafka, and TTL in case of RabbitMQ.

Error Log Target: If you select Kafka, the following fields appear.

Topic Name: Name of the Kafka topic where errors are to be published in case of an exception.

Select Data Sources: The error handler will be configured against the selected Data Sources.

Select Processors/Emitters: The error handler will be configured against the selected processors/emitters.

Connection: Select the connection where the failed data will be pushed.

Partition: Number of partitions on the topic.

Replication Factor: The replication count of the topic.

Error Log Target: If you select Logger, the following fields appear.

Select Data Sources: The logger is configured against the selected Data Source.

Select Processors/Emitters: The logger is configured against the selected processors/emitters.

Once you save or update the pipeline and start it, you can view errors in the configured queue or topic as well as in the pipeline logs.
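For example, if Kafka is the configured error log target, the error records published to the topic can be inspected with any standard Kafka client. Below is a minimal sketch using the kafka-python package (pip install kafka-python); the topic name and broker address are placeholders, so substitute the values from your error handler configuration and connection.

from kafka import KafkaConsumer

# Assumed values -- use the topic and connection configured in the error handler.
consumer = KafkaConsumer(
    "pipeline_error_topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating after 10 s of inactivity
)

for record in consumer:
    # Each record value holds one error event published by the pipeline.
    print(record.offset, record.value.decode("utf-8"))

consumer.close()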

Note: By default, pipeline errors are logged to the Logger.

To view error graphs, go to the Data Pipeline tab on the Data Pipeline home page, click the three dots on the pipeline tile, and click View Summary.

Actions on Pipeline

To view the screen below, go to the Data Pipeline home page and click the three dots on a pipeline's widget.

As shown in the snippet below, the following actions can be performed on a pipeline:

ActionsOnPipeline.PNG


Monitor: This feature enables you to monitor the error metrics of a pipeline; the error search tab allows you to search errors using keywords and filters.

View Audit Activity: Opens a sliding window to view the audit activities performed on the pipeline.

History: This feature enables you to view the pipeline run details, i.e., start time, end time, status, and application ID.

Scheduling: A feature of batch data pipelines. You can recognize whether a pipeline is batch or streaming by this icon. Cron Expression: the details of the schedule with sub-expressions (shown below).

Stats Notification: Statistics of the pipeline can be emailed with details. Explained in detail below.

Test Suite: A test suite is a collection of test cases associated with the pipeline.

Create Version: Create a version of the pipeline while updating it or from the pipeline tile.

Download Version: Download a version of the pipeline.

Pipeline Configuration: Update the pipeline configuration.

Pipeline Submission Logs: Pipeline logs can be viewed by clicking either the Application ID or Pipeline Submission Logs.

Edit: Edit the pipeline.

Delete: Delete the pipeline.

Start/Stop Pipelines: Start and stop the pipeline.

Pipeline Management

StreamAnalytix provides tools, such as Auto Inspect/Connect, Import/Export, and Monitor, to manage pipelines at all stages of their lifecycle.

Note: For Livy and Local, if you do not explicitly create a connection to either, the Livy connection is initialized automatically when you start Auto Inspection.

Livy

Livy connection can be created at any time and from anywhere in the workspace.

However, you can disconnect from it after the pipeline has been created. It is not mandatory for Livy to be in initialized state for a pipeline to be edited and saved.

livymoresave.PNG

While connecting the Livy session, you will be prompted for Extra Java Driver Options, Extra Java Executor Options and Extra Spark Submit Options. All are non-mandatory fields.

A placeholder property is displayed in the text areas of Extra Java Driver Options and Extra Java Executor Options, i.e., -Djava.io.tmpdir:/tmp. You can configure this property to change the target folder for Spark client temporary files.

While saving the pipeline, these parameters can be populated by clicking the “Populate from Livy session” link on the Pipeline Definition window.

Logs will take you to the Livy Inspect Session logs. These logs are not live logs.

These parameters are disabled for a superuser.

Local Inspect

A Local Inspect connection can be created at any time and from anywhere in the workspace. If you do not create a connection explicitly, it is initialized automatically when you start Auto-Inspection. You can disconnect from it after the pipeline has been created; it is not mandatory for Local Inspect to be in an initialized state for a pipeline to be edited and saved.

localExtraJava.PNG

While connecting the Local Inspect session, you can specify additional extra Java options by clicking the Spark Parameters option. This parameter is optional.

In Spark Parameters, you can supply extra Java options for the inspect JVM process. These parameters are disabled for a superuser.

Auto Inspection

When your component is inspect ready, slide the Auto Inspect button.

During Auto inspection, the component’s data is verified with the schema as per the configuration of the component. This happens as soon as the component is saved.

Auto inspect is manual; therefore, while editing the pipeline, you can choose not to inspect it. When you edit a pipeline, auto inspect is off by default.

While editing, you can add or remove components in the pipeline without auto inspect.

You can auto inspect the properly configured components of the pipeline even if the pipeline is not entirely configured, as shown below:

AutoInsoectionInProgress.PNG

Also, auto inspection can be stopped at any point, even while the schema is being detected, and the pipeline can be saved even if the inspection hangs or fails.

When you stop Auto Inspection while it is inspecting a component, two buttons, Resume and Reload Inspection, appear.

stopInspection.PNG

Note: Once Auto Inspect is enabled, a Livy connection is initialized (if it is not already connected).

If one of the components is not configured, auto inspect will run for the configured components only. The auto inspect window will show a notification in red for the unconfigured component.

auroinspect1.PNG

A single component can be inspected by clicking the eye icon on top of the component, as shown below:

autoinspect1.PNG

 

If a pipeline is saved without inspection, the pipeline tile on the Data Pipeline home page will show a “not inspected” warning icon, as shown in the figure below:

inspect.PNG 

Once the data is loaded from the source, it is automatically played through the pipeline.

During auto inspect, if the schema changes due to any configuration changes, the next components are not reset; however, an exclamation mark is shown.

This is how the corresponding configuration, logic, and results from each operator are validated while the pipeline is being designed.

Data Preparation in StreamAnalytix

Data preparation is the process of collecting, cleaning, and organizing data for analysis. Prepare the data with operations and expressions to generate customized data. Graphs and statistics are also used to present the prepared data.

To perform data preparation, configure a Data Source (For example, Kafka). Once the Data Source is configured and saved, you can run inspection on the columns of the schema that are reflected in the Data Preparation window. This allows you to build the pipeline while you are interacting with the data.

In the example below, an Agriculture data file is uploaded. It holds information about crop, soil, climate, pests, season, average_yield, cost, high_production_state, crop_wastage and fertilizer.

By default, the data is displayed in the Summary view with the Profile Pane on the screen (as shown below).

Data Pane: The schema takes the form of columns and is divided into records (as shown below). You can apply operations on a column entry or any possible combination; corresponding to each operation, a new expression is added, which keeps building your pipeline as and when you apply more actions.

dataprep1.PNG

The data preparation window has the following options:

Create Column: Create a new column by clicking the Create Column button (createcolumn.PNG). You can specify the column name and expression, and a new column will be created with the given values.

Remove Column: Remove the columns that are no longer required. Click the select/remove button and the selected columns will be removed. (KeepRemovecolumn.PNG)

Profile Pane: Click the Profile Pane button (prfoile_pane_btton.JPG). The profile pane shows the distribution of data in each column and allows you to create an interactive data pipeline. Bar graphs corresponding to the values in each column are shown in tabular format.

Data Pane: Click the Data Pane button (profilepane.PNG). This pane shows the data in its original form. You can arrange the columns in ascending or descending order by clicking the column headers.

Query Execution Plan: The query execution plan contains the parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.

Display Columns: Select the columns that are to be displayed at a specific time. (displaycolumn.PNG)

Reload Inspect: Clicking this button reloads the data of the different columns.

Maximize: This button maximizes the auto inspect window.

Close: This button closes the auto inspect window.

Search value: Search for a value.

The data preparation window has two more views: Detailed View and Summary View.

The summary view is the landing view of data preparation. It shows the summary of all the grouped column-records present in the data set.

The detailed view shows details of a selected column value and its corresponding rows.

In the figure below (Summary view), you can see a unique-value count shown at the top of every column (for example, Order Number). This is the number of unique records; next to it is a scrubber, which has the following conditions:

dataprepDetailedView.PNG

The scrubber is only available when there are more than 10 unique values.

The scrubber's height can only accommodate 50 records; therefore, if there are more than 50 records, it shows the highest values in a batch.

Data Preparation steps in a pipeline include the following processors:

• Expression Filter

• Expression Evaluator

• Rename

• Select

• Drop

dataprepnewexample.PNG

Data Preparation steps appear on the left side of the inspect window, as a sliding panel.

While the operations are being performed, they keep getting added to the pipeline and are reflected in the data preparation steps.

Even when steps are added manually in a processor, they are reflected in the data preparation steps.

Expression Filter:

Expression Filter enables filter operations on the data column.

This processor gets automatically added on the canvas once you apply the filter operation.

For example, let us add the filter criteria as “crop” equals “cereals”.

filter.PNG 

VALIDATE: Validate the expression by clicking the Validate button. If the expression applied is valid, you will get the message “Valid Expression”; otherwise, “Failed to execute given expression”.

ADD EXPRESSION: Additional expressions can be added using the ADD EXPRESSION button. For example, if you want to filter out “wheat” from the given crops, you can write the following expression:

crop <=> 'wheat'

 

expression_filter.PNG 
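These filter criteria are Spark SQL expressions. Below is a rough, standalone PySpark sketch of applying the same expressions; the sample rows are invented to mirror the Agriculture example, and StreamAnalytix's own execution may differ.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("expression_filter_demo").getOrCreate()

# Sample rows shaped like the Agriculture data set used above (invented values).
df = spark.createDataFrame(
    [("cereals", "loamy"), ("wheat", "sandy"), ("rice", "clay")],
    ["crop", "soil"],
)

# Equivalent of the filter criteria "crop" equals "cereals".
df.filter("crop = 'cereals'").show()

# Equivalent of the expression crop <=> 'wheat' (Spark's null-safe equality).
df.filter("crop <=> 'wheat'").show()

spark.stop()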

Expression Evaluator

Expression Evaluator is used when you wish to perform transform operations on the data columns. It gets automatically added to the canvas when you perform such operations.

exprssion_evaluator.PNG 

VALIDATE: Click this button to validate the transform operation applied on the column. If the expression applied is valid, you will get the message “Valid Expression”; otherwise, you will get the message “Failed to execute given expression”.

ADD EXPRESSION: Additional expressions can be added using the ADD EXPRESSION button. For example, write the expression below to convert the column Soil to uppercase:

Expression: Upper(soil)

 expression_Evalutor_2.PNG
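Similarly, the transform expression is a Spark SQL expression. A minimal, illustrative PySpark equivalent of Upper(soil) is sketched below (the sample data is invented for the example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[*]").appName("expression_evaluator_demo").getOrCreate()

df = spark.createDataFrame([("cereals", "loamy"), ("rice", "clay")], ["crop", "soil"])

# Equivalent of the evaluator expression Upper(soil): replace the column
# with its uppercase value.
df.withColumn("soil", expr("upper(soil)")).show()

spark.stop()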

Query Execution Plan

To understand the Query Execution Plan, we can use an example pipeline. It reads data from RabbitMQ, performs some data quality checks using the Data Quality processor, and stores the data back to RabbitMQ.

Once Auto Inspection runs on the RabbitMQ emitter, the Query Execution Plan icon is visible on the inspect panel.

queryexecution.PNG

The Query Execution plan contains the parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.

1. Parsed logical plan

a. The raw interpretation of a query by Spark is termed the parsed logical plan.

2. Analyzed logical plan

a. The parsed logical plan is converted to the analyzed logical plan by applying a set of rules.

b. Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. In both cases, the relation may contain unresolved attribute references or relations. For example, in the SQL query SELECT col FROM sales, the type of the column, or even whether it is a valid column name, is not known until we look up the table sales. An attribute is called unresolved if its type is unknown or it does not match an input table (or an alias). Spark SQL uses Catalyst rules and a catalog object that tracks the tables in all data sources to resolve these attributes. The analyzed logical plan is then produced.

3. Optimized logical plan

a. The analyzed logical plan goes through a series of rules to resolve, and then the optimized logical plan is produced. The optimized logical plan allows Spark to plug in a set of optimization rules.

4. Physical plan

a. The optimized logical plan is converted to a physical plan for further execution. In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine. It then selects a plan using a cost model. Currently, cost-based optimization is only used to select join algorithms: for relations that are known to be small, Spark SQL uses a broadcast join, using a peer-to-peer broadcast facility available in Spark. The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule. We thus intend to implement richer cost-based optimization in the future.

The image below shows how a query execution plan for a query over StreamAnalytix is visualized.

query_execution_image.PNG 
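The same four plans can be printed for any Spark DataFrame using the explain API, which is useful for comparing with what the Query Execution Plan panel shows. A small standalone sketch follows (the sales data is invented, echoing the SELECT col FROM sales example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("explain_demo").getOrCreate()

sales = spark.createDataFrame([("p1", 10), ("p2", 25)], ["col", "amount"])
sales.createOrReplaceTempView("sales")

query = spark.sql("SELECT col FROM sales WHERE amount > 15")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
query.explain(extended=True)

spark.stop()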

Auto Connect Component

Drag, drop and connect a component with auto-connect feature.

Drag a component near an existing valid component; as the dragged component gets closer to the available component on the canvas, both are highlighted. Drop the component and it will be connected to the data pipeline.

Scheduling

To execute a workflow periodically, click on the Scheduling tab and select any of the following options to schedule it.

A pipeline can be scheduled:

• Minutes

• Hourly

• Daily

• Monthly

• Yearly

Scheduling_Options.PNG

If Monthly is selected, then you can choose from within the options shown below:

Scheduling_Options_Monthly.PNG
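The selected schedule is represented as a cron expression (see the Cron Expression detail under the Scheduling action). The strings below are ordinary Quartz-style examples for illustration only; the expressions StreamAnalytix actually generates may differ.

# Illustrative Quartz-style cron expressions for the scheduling options above;
# the expressions actually generated by the product may differ.
sample_schedules = {
    "Minutes (every 15 min)": "0 0/15 * * * ?",
    "Hourly":                 "0 0 * * * ?",
    "Daily (midnight)":       "0 0 0 * * ?",
    "Monthly (1st of month)": "0 0 0 1 * ?",
    "Yearly (Jan 1st)":       "0 0 0 1 1 ?",
}

for name, cron in sample_schedules.items():
    print(f"{name}: {cron}")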

 

Note: A pipeline can only be scheduled/unscheduled in the stopped state.

SCHEDULE/UN-SCHEDULE: This action will un-schedule/re-schedule any scheduled workflow and mark all coordinated actions that are in the running or waiting state as killed.

The History of the pipeline run can be viewed under the History tab:

HistoryOfScheduledRun.PNG

Download/Upload Pipeline

Download Pipeline

A pipeline can be downloaded along with its component configurations, such as schema, alert configuration, and registered entities.

Upload Pipeline

A pipeline can be uploaded with its component configurations such as schema, alert definitions, and registered entities.

To import a pipeline, click on the Upload Pipeline button on the pipeline page.

upload_pipe.PNG 

If any duplication exists in the pipeline components, a confirmation page will appear, as shown below.

overwriter_button.PNG 

Click the Overwrite or New Pipeline button to see the Exported Pipeline Details page, which provides the details of the pipeline and the environment from which it was downloaded.

Click the Overwrite button and all the existing components will be overwritten with the current configuration of the imported pipeline.

exported.PNG 

Click Proceed, update the pipeline configuration details, and expand the pipeline configurations by clicking the More button (which appears on the next screen).

As shown in the diagram below, click the Component tab to update the pipeline component configuration and connection details if required. Once done, click the Upload button to upload the pipeline with the configured changes.

update_conf.PNG 

While uploading the pipeline, if the components used in the pipeline are not configured in the environment into which it is uploaded, no connection will be available for that particular component. This will not abort the upload process, but the start button of the pipeline will remain disabled until a connection is configured for the respective component.

Once the connection is created, update the pipeline and select the corresponding connection to play the pipeline.

The New Pipeline button allows you to save the uploaded pipeline as a new pipeline.

The first page will be the exported pipeline's details page; after clicking the Proceed button, the pipeline configurations are displayed, where the pipeline's name and other configurations can be changed.

export.PNG 

update_1.PNG 

Stats Notification

This tab enables sending emails with the statistics of the data pipeline to a user, with customized details.

Click on Stats Notification from the pipeline tile of the Data Pipeline homepage.

This page allows you to enter the email address on which you will receive the emails. Choose the duration of the emails, as shown below:

statsnotification.PNG

The other tab, Metrics, shows all the metrics of the pipeline that will be sent in the email.

stats_3.PNG 

View Summary

Once a pipeline has been started and is in Active Mode, you can view its Summary.

To view the real-time monitoring graphs, click on the pipeline summary (go to the Data Pipeline page > pipeline tile > click the View Summary icon), as shown below:

To enable real-time monitoring graphs, go to Configuration > defaults > spark > EnableMonitoringGraphs and set its value to true. By default, this value is set to false.

In addition, you can view the Spark Job page from the application_1525…. link as shown below:

The View Summary tab provides the following tabs:

Monitoring

Under the Monitoring Tab, you will find the following graphs, as shown below:

• Query Monitor

• Error Monitor

• Application Error

Query Monitor

The graphs are plotted in panels; the number of panels equals the number of emitters used in the structured streaming pipeline. Each panel has three graphs, as described below. These graphs can be seen through the View Summary option.

Input vs. Processing Rate
The input rate specifies the rate at which data is arriving from a source (Data Source), e.g., Kafka.

The processing rate specifies the rate at which data from this source is being processed by Spark.

Here, the number of input rows defines the number of records processed in a trigger execution.

Batch Duration
The approximate time, in seconds, taken to process the overall micro-batch (trigger).

Aggregation State
Shows the aggregation state of the current streaming query. It shows information about the query operators that store the aggregation state. This graph is shown only if the pipeline performs some aggregation.
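These are the standard Spark Structured Streaming progress metrics. Outside the StreamAnalytix UI, the same numbers can be read from a streaming query's progress object; the standalone sketch below uses the built-in rate source purely for illustration.

from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.master("local[*]").appName("progress_demo").getOrCreate()

# Toy streaming query on the built-in 'rate' source, with an aggregation so that
# state metrics are produced as well.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
query = (
    stream.groupBy().count()
    .writeStream.outputMode("complete")
    .format("memory").queryName("rate_counts")
    .start()
)

time.sleep(15)  # let a few micro-batches (triggers) complete

progress = query.lastProgress  # metrics of the most recent micro-batch
if progress:
    print("Input rate (rows/sec):     ", progress.get("inputRowsPerSecond"))
    print("Processing rate (rows/sec):", progress.get("processedRowsPerSecond"))
    print("Rows in this trigger:      ", progress.get("numInputRows"))
    print("Trigger duration (ms):     ", progress.get("durationMs", {}).get("triggerExecution"))
    print("Aggregation state:         ", progress.get("stateOperators"))

query.stop()
spark.stop()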

Continuous Integration Continuous Delivery

StreamAnalytix supports continuous integration and continuous delivery of pipelines, in the form of test suites and by deploying pipelines to a destination environment.

Continuous Integration

Continuous integration is done by executing test cases (as part of a test suite) on the incremental state of a pipeline, which ensures the pipeline is not impacted by new changes.

A test suite is a collection of test cases associated with a pipeline.

You may want to create a test case to verify multiple permutations on continuous data which will then be deployed as soon as they pass the tests.

How to create a test case?

A test case comprises a set of source data files (one data file for each Data Source) and the inspection outputs of each component in the pipeline. These inspection outputs can be thought of as the expected data for the component.

Every test case has its source data associated with each Data Source. When a test suite is executed, all its associated test cases are executed. A test suite run is uniquely identified by a run ID. When a test case is executed as part of a run ID, each component is inspected again, and the output data can be referred to as the actual data.

How a test case is evaluated to pass or fail?

A test case is passed if both the conditions mentioned below are fulfilled:

1. The number of records is the same for the actual and expected data.

2. Every field value of every record in the actual data matches the expected data.

Let’s take an example to understand this.

Your source data may have integers or some blank values and you may want to perform certain tests to make sure that the invalid or null values are omitted. These test cases can be created with different source data files to test.
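The pass/fail rule amounts to an exact comparison of the expected and actual records. Below is a minimal sketch of that logic (not StreamAnalytix's actual implementation), treating each component's output as a list of field dictionaries:

def test_case_passes(expected, actual):
    """Return True if the actual inspection output matches the expected output.

    `expected` and `actual` are lists of dicts, one dict per record. This mirrors
    the two conditions above: same record count, and every field of every record
    identical.
    """
    if len(expected) != len(actual):
        return False
    return all(exp == act for exp, act in zip(expected, actual))


# Example: a blank/None value in the actual output causes a failure.
expected = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
actual = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
print(test_case_passes(expected, actual))  # False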

Note: When you upload a source file via Upload Data, that file is inspected and the output is saved. However, when you use Fetch From Source on a Data Source, StreamAnalytix internally generates a source data file using the data from the source.

To create a test case, you need to configure, save and inspect the pipeline.

Create Test Case

Once the pipeline is saved and inspected, a test case can be created. Go to the menu on the data pipeline canvas, as shown below:

testsuite.PNG

When you click Test Suite, a New Test Case window opens.

createtestcase.PNG

To create a new test case, enter details in the New Test Case window fields, as shown below:

Details: Details about the test case, which include:
• Name: Name of the test case. Two test cases cannot have the same name; within a pipeline, names have to be unique.
• Description: Enter a description for the test case (optional).

Do Not Compare: Select the columns of a component, or the component itself, which will not be compared and tested during inspection. The test case execution will skip data comparison for these component field(s) and component(s). For the components the test case runs on, the output file (schema) of the component is saved, and each column is either compared or not compared based on this setting.

For example: The test case is to check the entries of employees, where a Random User ID field is generated every time an action happens. This random UID will be different every time; therefore, it would fail the test. Here you can choose not to compare either the output field or the component itself.

NOTE: Only the output fields configured in the Emitter will be populated under the fields of a component.

deatilstest.PNG

 

donotcompare.PNG

 

Click Create, and a success notification message will be shown, along with a new entry in the table, as shown below:

savedtestcase.PNG

Test Suite properties

Every test case is listed in tabular form under the components of the data pipeline canvas.

This table has two windows (shown above), Definition and Run History, with properties defined below.

Definition

The Definition tab shows all the details and enables you to create a test case, execute it, run multiple test cases at the same time, and perform other actions as defined below:

Status: Signifies the last execution status of a test case. To view the status of the test case, i.e., the last run status, hover the mouse over the status icon. This shows the status of the test case with a RUN ID and the duration of execution. Possible values are 'Never Executed', 'In Progress', 'Success', 'Failure', and 'Error'. When the test case is running, In Progress is denoted by a yellow status icon. On completion, Success is indicated by a green status icon, and Failure/Error is indicated by a red exclamation mark.

Test Case Name: Name of the test case.

Description: A description of the test case. This is an optional property.

Source File(s): The source file used for test creation.

Coverage: The percentage of components covered under the test case configuration. (Shown below the table with an image and more detail.)

Actions: The actions that can be performed on the test case. (Click for more details.)

Refresh Table: Refreshes the test case list.

Run selected test cases: Once the test cases are created, you can run the selected ones, or all of them, with just one click.

Create new: Create new test cases.

Update existing: Edit the existing selected test case and re-configure the fields. The same can be done using Configure under Actions.

Search: Search for any test case using the search column.

Under the Coverage column, 'Click for more details' will show the complete configuration of the test case.

Coverage.PNG

Actions on test case

actions.PNG

Configure: Edit the configuration by clicking the Configure button of the test case.

Load Data: Allows you to load the pipeline's Data Source(s) data and run inspection. Any inconsistency caused by changes in the existing pipeline design will be shown by an alert icon in the Status column. You need to Load Data again and update the existing test case to sync it with the latest configuration of the pipeline.

Download: Download the uploaded sample data file(s) by clicking the Download icon. Click the test case Download button and a new window opens with every component's output data, which can be downloaded.

Delete: Delete any test case via the Delete button.

Run History

You can view the test case Run History. The table shows the run statistics along with the test case outputs. You can also view the reason for failed test cases.

runhistory.PNG

Stale Test Case

Multiple test cases can be configured and re-configured from the same window. There may be cases where you have re-configured the pipeline and a few components have been removed.

Removing a component from an existing pipeline and then creating a new test case may lead to inconsistencies in the test cases created earlier.

stale.png

In case of stale test cases, such inconsistencies are shown as a warning or error under the Status of the test case.

Click Load Data again and update the existing test case to sync it with the latest configuration of the pipeline.

Each entry corresponds to a single test suite run. A single test suite run may contain one or more test cases.

Download Test Results

Complete test suite results can be downloaded via Download Test Results, under Results > Run History. The test report is an HTML file with the run ID and test run details (shown below):

results.PNG
testsuiterunreport1.PNG

 

Individual test case results can be downloaded via Download Output Results, corresponding to the Duplicate IDs. The output result is in JSON file format.

Updated Test Cases

If you want to update the source file or the expected result, Update Test Case enables the same. Updating existing test case data will overwrite previously captured inspection data with new inspection data. You can also update the Do Not Compare section while editing the existing pipeline.

updateexisting.PNG

Version Control

Version control allows you to create a version of the test suite. StreamAnalytix supports two types of version control systems:

• GIT

• StreamAnalytix Metastore

Before understanding how version control works, we need to understand what a Working Copy is.

A Working Copy is the current version of the copy, or the first (0) version of the copy.

Pipeline editing is always performed on the Working Copy.

Note: Any version, once created, becomes history and cannot be edited. If you are editing the working copy and you switch to a previous version, the existing working copy is overwritten from the selected version.

There are three ways by which you can create a version:

The first way: when you re-save an edited pipeline, you get the option to Create Version.

createdefinition.PNG

The second way to version your pipelines is when you edit and save the pipeline. Among the options for saving the pipeline are the Create Version and Switch Version buttons.

switchandcreteversion.PNG

The third option to create a version is from the pipeline tile.

actions00016.PNG

With either of the last two options shown above, the same window opens to create a version.

createversion.PNG

Mention a description or a comment for the version and click Create Version. Once you save the version, the last version of the pipeline changes from 'Working Copy' to Version 1.

workingcopy.PNG
version1.PNG

 

A pipeline remains a Working Copy as long as no other version is created. Once you save a version, next to the title 'Create Subsystem version' is a notification of the next version that will be created.

When a new superuser logs in, they will have to configure the Version Control properties under Setup.

Switch Version

If you want to switch the version of a pipeline, click Switch Version on the pipeline editor page and choose a version. The pipeline will change as per the selected version.

It is the Working Copy that is loaded to a newer version. Editing is always performed on the Working Copy of the pipeline.

Note: If you want to switch to an older version, save the current state as a version.

Download Version

Download Version from the pipeline tile allows you to download any version of the pipeline.

versionList.PNG

Click on the Download arrow to download the pipeline.

Version: Version number of the pipeline.

Commit Message: Commit message description.

Artifacts: The artifact uploaded.

Commit Time: The time at which the pipeline was committed as a version.

Download: Download the pipeline.

Continuous Delivery

While you continuously integrate data in the pipelines, you can also deliver the same.

Continuous delivery requires configuration of three environments:

• Source Environment

• Test Environment

• Destination Environment

Source environment is where the pipeline is incrementally developed.

The CD scripts download the pipeline from source environment and upload it on the Test Environment.

On the test environment, an inspect session is created based on the user's selection, and then the test suite of the pipeline is executed.

If the test suite passes the execution, the pipeline is then promoted to the Destination environment.

Prerequisites

Pre-requisites for CD script

1. Installation of jq

2. SMTP server configured

Open the folder <StreamAnalytix_home>/bin/CD_script and run StartCD.sh to run the pipeline on the destination environment. Integration is done by creating Test Suites and delivery is performed by running the StartCD.sh on StreamAnalytix metastore.

Configure the properties mentioned below to deliver the test cases on the environment:

CD1.PNG

sax.testserver.user.token: The token of the workspace where the pipeline will be imported.

sax.testserver.url: StreamAnalytix URL for the test environment.

sax.testserver.testsuite.execute: Whether test suite execution is required or not. Possible values are 1 or 0.

sax.testserver.inspect.type: Inspect session to be used for test suite execution. Possible values: livy or local.

sax.destination.user.token: StreamAnalytix user token for the destination. It can be obtained by logging in as a workspace user > Manage Users > Edit User.

sax.destination.url: StreamAnalytix URL for the destination.

sax.source.pipeline.name: Name of the pipeline to promote.

sax.source.pipeline.version: Version of the pipeline to promote.

sax.promotion.skiponfailure: Specify whether pipeline promotion should be stopped on test suite failure.

sax.promotion.emailadress.list: Space-separated list of email IDs to which the report will be sent.

 

Configuring Cloudera Navigator Support in Data Lineage

Publishing Lineage to Cloudera Navigator

For publishing lineage to Cloudera Navigator, configure StreamAnalytix with the following properties.

Configuration Properties

To go to the configuration properties, go to Superuser > Configuration > Others > Cloudera.

These properties are required to enable publishing of lineage to Cloudera Navigator:

Navigator URL: The HTTP URL of the Cloudera Navigator UI.

Navigator API version: The Navigator SDK API version (read below).

Navigator Admin Username: The Cloudera Navigator admin user.

Navigator Admin User Password: The Cloudera Navigator admin password.

Autocommit enabled: Specifies whether auto commit of entities is required.

cloudera.PNG 

The Navigator API version can be extracted from the Cloudera Navigator UI by clicking the question mark located in the top-right corner of the web page, next to the user name.

Select ‘About’.

admin.PNG 

Mention the Cloudera Navigator version as the Navigator API version, as shown in the image below:

cloud_version.PNG 

NOTE: The configuration 'security.profile' should be CDH for publishing lineage to Cloudera Navigator. This is a deployment-time configuration and can be found in <StreamAnalytix_Installation_Dir>/conf/yaml/common.yaml.

Configuring Lineage in Data Pipeline

The next step to enable publishing Lineage to Cloudera Navigator is while saving the Pipeline.

Click on the checkbox shown below:

pipe_def.PNG 

You can publish lineage for both batch and streaming pipelines in StreamAnalytix.

The ‘Publish lineage to Cloudera Navigator’ checkbox is enabled by default for every pipeline.

You need to update the pipeline and start it.

Viewing a pipeline’s lineage on Navigator

The lineage of a pipeline is published to Cloudera Navigator once the pipeline is started.

Shown below is an example to view the lineage of the pipeline ‘sales_aggregation’.

sale_aggregat90j.PNG 

Where the RabbitMQ Data Source configuration is as follows:

rabbit.PNG 

The Cloudera Navigator URL is as follows:   with credentials (usually admin/admin).

Once the pipeline is active, go to the Cloudera Navigator UI and search for the pipeline name 'sales_aggregation'.

cloudera00017.PNG 

The components of the pipeline are listed as shown above. Please note the source and target RabbitMQ components are shown as two entities in Navigator – Rabbitmq and Rabbitmq_dataset. The Rabbitmq_dataset shows the schema of the data flowing in StreamAnalytix and the Rabbitmq shows the metadata of the actual RabbitMQ component.

Click on ‘Aggregation’ entity listed above and once the 'Aggregation' entity page opens, click on ‘Lineage’ tab, as shown below:

clouderat.PNG 

Similarly, you can go to Cloudera Navigator search page and search by RabbitMQ Data Source's queue name or exchange name and view the lineage by clicking on the RabbitMQ entity.

sale_data.PNG 

You can also view the schema of the message configured on the Data Source using the ‘Details’ tab of the RabbitMQ entity.

These are ‘Columns’ of the entity.

actios.PNG 

Publishing lineage to Cloudera Navigator helps you easily integrate with the existing entities.

Consider a pipeline that processes data from RabbitMQ or Kafka, enriches the data and inserts the resultant data in a hive table.

Now, an external job (not a StreamAnalytix pipeline) reads from this Hive table and does further processing of data. If lineage is published by this external job as well, you will be able to see a combined enterprise lineage on Cloudera Navigator.   

Viewing lineage of HDFS and Hive

The lineage of HDFS and Hive Data Sources and emitters is shown using native entities in Cloudera Navigator if the configured HDFS path or Hive table already exists.

an_person.PNG 

The lineage of native HDFS entity is shown in green and the lineage of native Hive entity is shown in yellow.

If the HDFS path or Hive table does not exist, the lineage is represented by creating a custom dataset, which is shown as a grey dataset.