Data Pipelines

Pipeline Configuration

The home page of the Data Pipeline section is explained below:

DPHomepage1.PNG

The numbered callouts in the figure are described below:

1. Pipelines: All data pipelines are listed under this tab.

2. Workflow: Workflows are defined under this tab.

3. Filter: Filter applications based on their status (Starting, Active, or Stopped). A Reset button restores the default settings.

4. Actions for Pipelines: Create a new pipeline, integrate pipelines, audit pipelines, download a sample project, or upload a pipeline from the same or a different workspace.

With a StreamAnalytix Data Pipeline, you can perform the following actions:

• Design, import, and export real-time data pipelines.

• Drag, drop and connect operators to create applications.

• Create Datasets and reuse them in different pipelines.

• Create Workflows using pipelines and apply control nodes and actions on them.

• Create models from the analytics operators.

• Monitor detailed metrics of each task and instance.

• Run PMML-based scripts in real-time on every incoming message.

• Explore real-time data using interactive charts and dashboards.

Creating a Spark Pipeline

To create a pipeline, go to the Data Pipeline page.

Make sure that the Local/Livy connection is established. The green arrow above the pipeline canvas denotes that the connection is established.

Drag components on the pipeline builder (pipeline grid-canvas) from the right panel and connect the components.

inspectconnectGreenarrow.PNG

Select and right-click a component to configure its properties (see the Data Sources, Processors, Data Science, and Emitters sections to understand the components).

Once the configuration is done, click the save (floppy) icon shown above to save the pipeline.

Enter the details of the pipeline in the Pipeline Definition window (shown below).

The pipeline definition options are explained below:

pipelinedefinition.PNG

 

pipelinedefinition2.PNG

 

When you click More, the configuration mentioned below appears (it is the same for Local and Livy).

Pipeline Name: Name of the pipeline that is created.

HDFS User: The YARN user with which you want to submit the pipeline.

Key Tab Option: You can either Specify a Key Tab Path or Upload the Key Tab File.

Key Tab File Path: If you have saved the keytab file to your local machine, specify its path.

Error Handler: Enable the error handler if you are using a Replay data source in the pipeline. Selecting the error handler checkbox displays a Configure link; clicking it opens the error handler configuration screen.

Publish Lineage to Cloudera Navigator: Publishes the pipeline's lineage to the Cloudera environment (only in a CDH-enabled environment).

Log Level: Controls the logs generated by the pipeline based on the selected log level.
• Trace: View information of the trace log level.
• Debug: View information of the debug and trace log levels.
• Info: View information of the trace, debug, and info log levels.
• Warn: View information of the trace, debug, warn, and info log levels.
• Error: View information of the trace, debug, warn, info, and error log levels.

Yarn Queue: The name of the YARN queue to which the application is submitted.

Version: Creates a new version of the pipeline. The current version is called the Working Copy, and subsequent versions are numbered incrementally (n+1).

Comment: Write notes specific to the pipeline.

Deployment Mode: Specifies the deployment mode of the pipeline.
• Cluster: In cluster mode, the driver runs on any of the worker nodes.
• Client: In client mode, the driver runs on the StreamAnalytix admin node.
• Local: In local mode, the driver and the executors run on the StreamAnalytix admin container.

Driver Cores: Number of cores to use for the driver process.

Driver Memory: Amount of memory to use for the driver process.

Application Cores: Number of cores allocated to the Spark application. It must be greater than the number of receivers in the pipeline. It also determines the number of executors for your pipeline: number of executors = Application Cores / Executor Cores.

Executor Cores: Number of cores to use on each executor.

Executor Memory: Amount of memory to use per executor process.

Dynamic Allocation Enabled: When the pipeline is running, Spark scales the number of executors up and down at runtime (only in a CDH-enabled environment).

Extra Driver Java Options: A string of extra JVM options to pass to the driver, for instance GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Executor Java Options: A string of extra JVM options to pass to executors, for instance GC settings or other logging. For example: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Extra Spark Submit Options: A string with --conf options for passing all the above configuration to a Spark application. For example: --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app' (see the example after this table).

Populate from Local Session: It is applicable for YARN deployment.

morePropertiesCDH.PNG
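The resource and JVM settings above correspond to standard Spark configuration properties. The sketch below is only an illustration of how such values map onto Spark settings when building a session by hand; the values, application name, and use of a local master are assumptions, not what StreamAnalytix itself generates.

from pyspark.sql import SparkSession

# Illustrative values only -- substitute the ones from your Pipeline Definition.
app_cores = 8        # Application Cores
executor_cores = 2   # Executor Cores
num_executors = app_cores // executor_cores  # 8 / 2 = 4 executors

spark = (
    SparkSession.builder
    # 'Local' deployment mode; Cluster/Client are normally chosen via spark-submit --deploy-mode.
    .master("local[*]")
    .appName("my_pipeline")                       # assumed pipeline name
    .config("spark.driver.cores", "2")            # Driver Cores
    .config("spark.driver.memory", "2g")          # Driver Memory
    .config("spark.executor.cores", str(executor_cores))
    .config("spark.executor.memory", "2g")        # Executor Memory
    .config("spark.executor.instances", str(num_executors))
    .config("spark.dynamicAllocation.enabled", "false")
    # Extra Driver/Executor Java Options, e.g. GC logging:
    .config("spark.driver.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .config("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .getOrCreate()
)

# Confirm one of the settings took effect.
print(spark.sparkContext.getConf().get("spark.executor.memory"))

The --conf strings under Extra Spark Submit Options are simply the command-line form of the same .config(...) calls shown here.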

Note:

• When you create a pipeline, the associated or used schema is saved as metadata under message configuration.

• Components are either batch-based or streaming-based.

Error Handler Configuration

The error handler configuration feature enables you to handle pipeline errors.

Error log targets are of three types: RabbitMQ, Kafka, and Logger.

Pipeline errors can be logged to a logger file, Kafka, or RMQ, so that you can refer to the log file or message queues to view those errors whenever required. All these errors are visible on graphs under the Application Errors tab of the pipeline's Summary view.

When you save a pipeline, specify whether you want to enable the error handler or not. Either Kafka or RMQ is selected as the error log target. Whenever an error occurs, the pipeline logs it to the configured error handler. By default, errors are always published to log files. If you want to disable logging errors to these queues, disable the error handler while saving or updating the pipeline. If the error handler is disabled, error graphs will not be visible and errors will not be available on the errors listing page.

Note: If RMQ or Kafka is down, the data pipeline will still work. For the error graph to be visible, the default Elasticsearch connection should be in a working state.

errorhandler.PNG

Error Log Target: Select the target where you want to move the data that failed to process in the pipeline. If you select RabbitMQ, the following fields appear.

Queue Name: Name of the RabbitMQ queue where errors are to be published in case of an exception.

Select Data Sources: The error handler will be configured against the selected Data Sources.

Select Processors/Emitters: The error handler will be configured against the selected processors/emitters.

Connection: Select the connection where the failed data will be pushed.

Enable TTL: The TTL value discards messages from the RabbitMQ error queue; the expected value is in minutes.

Back to Definition: Toggle back to the pipeline definition screen.

Apply Configuration: Toggle back to the pipeline definition screen after saving the error handler configuration.

Based on the error handler target, you will be asked for target-specific configuration, such as connection name, partitions, and topic administration in case of Kafka, and TTL in case of RabbitMQ.

Error Log Target: If you select Kafka, the following fields appear.

Topic Name: Name of the Kafka topic where errors are to be published in case of an exception.

Select Data Sources: The error handler will be configured against the selected Data Sources.

Select Processors/Emitters: The error handler will be configured against the selected processors/emitters.

Connection: Select the connection where the failed data will be pushed.

Partition: Number of partitions on the topic.

Replication Factor: The replication count of the topic.

Error Log Target: If you select Logger, the following fields appear.

Select Data Sources: The logger is configured against the selected Data Source.

Select Processors/Emitters: The logger is configured against the selected processors/emitters.

Once you save or update the pipeline and start it, you can view errors in the configured queue or topic as well as in the pipeline logs.
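For example, if Kafka is the configured error log target, the error records published to the topic can be inspected with any standard Kafka client. Below is a minimal sketch using the kafka-python package (pip install kafka-python); the topic name and broker address are placeholders, so substitute the values from your error handler configuration and connection.

from kafka import KafkaConsumer

# Assumed values -- use the topic and connection configured in the error handler.
consumer = KafkaConsumer(
    "pipeline_error_topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating after 10 s of inactivity
)

for record in consumer:
    # Each record value holds one error event published by the pipeline.
    print(record.offset, record.value.decode("utf-8"))

consumer.close()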

Note: By default, pipeline errors are logged to the Logger.

To view error graphs, go to the Data Pipeline tab on the Data Pipeline home page, click the three dots on the pipeline tile, and click View Summary.

Actions on Pipeline

To view the screen below, go to the Data Pipeline home page and click the three dots on a pipeline's widget.

As shown in the snippet below, the following actions can be performed on a pipeline:

ActionsOnPipeline.PNG


Monitor: This feature enables you to monitor the error metrics of a pipeline; the error search tab allows you to search errors using keywords and filters.

View Audit Activity: Opens a sliding window to view the audit activities performed on the pipeline.

History: This feature enables you to view the pipeline run details, i.e., start time, end time, status, and application ID.

Scheduling: A feature of batch data pipelines. You can recognize whether a pipeline is batch or streaming by this icon. Cron Expression: the details of the schedule with sub-expressions (shown below).

Stats Notification: Statistics of the pipeline can be emailed with details. Explained in detail below.

Test Suite: A test suite is a collection of test cases associated with the pipeline.

Create Version: Create a version of the pipeline while updating it or from the pipeline tile.

Download Version: Download a version of the pipeline.

Pipeline Configuration: Update the pipeline configuration.

Pipeline Submission Logs: Pipeline logs can be viewed by clicking either the Application ID or Pipeline Submission Logs.

Edit: Edit the pipeline.

Delete: Delete the pipeline.

Start/Stop Pipelines: Start and stop the pipeline.

Pipeline Management

StreamAnalytix provides tools, such as Auto Inspect/Connect, Import/Export, and Monitor, to manage pipelines at all stages of their lifecycle.

Note: For Livy and Local, if you do not explicitly create a connection to either, the Livy connection is initialized automatically when you start Auto Inspection.

Livy

Livy connection can be created at any time and from anywhere in the workspace.

However, you can disconnect from it after the pipeline has been created. It is not mandatory for Livy to be in initialized state for a pipeline to be edited and saved.

livymoresave.PNG

While connecting the Livy session, you will be prompted for Extra Java Driver Options, Extra Java Executor Options and Extra Spark Submit Options. All are non-mandatory fields.

A placeholder property is displayed in the text areas of Extra Java Driver Options and Extra Java Executor Options, i.e., -Djava.io.tmpdir:/tmp. You can configure this property to change the target folder for Spark client temporary files.

While saving the pipeline, these parameters can be populated by clicking the “Populate from Livy session” link on the Pipeline Definition window.

Logs will take you to the Livy Inspect Session logs. These logs are not live logs.

These parameters are disabled for a superuser.

Local Inspect

A Local Inspect connection can be created at any time and from anywhere in the workspace. If you do not create a connection explicitly, it is initialized automatically when you start Auto-Inspection. You can disconnect from it after the pipeline has been created; it is not mandatory for Local Inspect to be in an initialized state for a pipeline to be edited and saved.

localExtraJava.PNG

While connecting the Local Inspect session, you can specify additional extra Java options by clicking the Spark Parameters option. This parameter is optional.

In Spark Parameters, you can supply extra Java options for the inspect JVM process. These parameters are disabled for a superuser.

Auto Inspection

When your component is inspect ready, slide the Auto Inspect button.

During Auto inspection, the component’s data is verified with the schema as per the configuration of the component. This happens as soon as the component is saved.

Auto inspect is manual; therefore, while editing the pipeline, you can choose not to inspect it. When you edit a pipeline, auto inspect is off by default.

While editing, you can add or remove components in the pipeline without auto inspect.

You can auto inspect the properly configured components of the pipeline even if the pipeline is not entirely configured, as shown below:

AutoInsoectionInProgress.PNG

Also, auto inspection can be stopped at any point, even while the schema is being detected, and the pipeline can be saved even if the inspection hangs or fails.

When you stop Auto Inspection while it is inspecting a component, two buttons, Resume and Reload Inspection, appear.

stopInspection.PNG

Note: Once Auto Inspect is enabled, a Livy connection is initialized (if it is not already connected).

If one of the components is not configured, auto inspect will run for the configured components only. The auto inspect window will show a notification in red for the unconfigured component.

auroinspect1.PNG

A single component can be inspected by clicking the eye icon on top of the component, as shown below:

autoinspect1.PNG

 

If a pipeline is saved without inspection, the pipeline tile on the Data Pipeline home page will show a “not inspected” warning icon, as shown in the figure below:

inspect.PNG 

Once the data is loaded from the source, it is automatically played through the pipeline.

During auto inspect, if the schema changes due to any configuration changes, the next components are not reset; however, an exclamation mark is shown.

This is how the corresponding configuration, logic, and results from each operator are validated while the pipeline is being designed.

Data Preparation in StreamAnalytix

Data preparation is the process of collecting, cleaning, and organizing data for analysis. Prepare the data with operations and expressions to generate customized data. Graphs and statistics are also used to present the prepared data.

To perform data preparation, configure a Data Source (For example, Kafka). Once the Data Source is configured and saved, you can run inspection on the columns of the schema that are reflected in the Data Preparation window. This allows you to build the pipeline while you are interacting with the data.

In the example below, an Agriculture data file is uploaded. It holds information about crop, soil, climate, pests, season, average_yield, cost, high_production_state, crop_wastage and fertilizer.

By default, the data is displayed in the Summary view with the Profile Pane on the screen (as shown below).

Data Pane: The schema takes the form of columns and is divided into records (as shown below). You can apply operations on a column entry or any possible combination; corresponding to each operation, a new expression is added, which keeps building your pipeline as and when you apply more actions.

dataprep1.PNG

The data preparation window has the following options:

Create Column: Create a new column by clicking the Create Column button (createcolumn.PNG). You can specify the column name and expression, and a new column will be created with the given values.

Remove Column: Remove the columns that are no longer required. Click the select/remove button and the selected columns will be removed. (KeepRemovecolumn.PNG)

Profile Pane: Click the Profile Pane button (prfoile_pane_btton.JPG). The profile pane shows the distribution of data in each column and allows you to create an interactive data pipeline. Bar graphs corresponding to the values in each column are shown in tabular format.

Data Pane: Click the Data Pane button (profilepane.PNG). This pane shows the data in its original form. You can arrange the columns in ascending or descending order by clicking the column headers.

Query Execution Plan: The query execution plan contains the parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.

Display Columns: Select the columns that are to be displayed at a specific time. (displaycolumn.PNG)

Reload Inspect: Clicking this button reloads the data of the different columns.

Maximize: This button maximizes the auto inspect window.

Close: This button closes the auto inspect window.

Search value: Search for a value.

The data preparation window has two more views: Detailed View and Summary View.

The summary view is the landing view of data preparation. It shows the summary of all the grouped column-records present in the data set.

The detailed view shows details of a selected column value and its corresponding rows.

In the figure below (Summary view), you can see a unique-value count shown at the top of every column (for example, Order Number). This is the number of unique records; next to it is a scrubber, which has the following conditions:

dataprepDetailedView.PNG

The scrubber is only available when there are more than 10 unique values.

The scrubber's height can only accommodate 50 records; therefore, if there are more than 50 records, it shows the highest values in a batch.

Data Preparation steps in a pipeline include the following processors:

• Expression Filter

• Expression Evaluator

• Rename

• Select

• Drop

dataprepnewexample.PNG

Data Preparation steps appear on the left side of the inspect window, as a sliding panel.

While the operations are being performed, they keep getting added to the pipeline and are reflected in the data preparation steps.

Even when steps are added manually in a processor, they are reflected in the data preparation steps.

Expression Filter:

Expression Filter enables filter operations on the data column.

This processor gets automatically added on the canvas once you apply the filter operation.

For example, let us add the filter criteria as “crop” equals “cereals”.

filter.PNG 

VALIDATE: Validate the expression by clicking the Validate button. If the expression applied is valid, you will get the message “Valid Expression”; otherwise, “Failed to execute given expression”.

ADD EXPRESSION: Additional expressions can be added using the ADD EXPRESSION button. For example, if you want to filter out “wheat” from the given crops, you can write the following expression:

crop <=> 'wheat'

 

expression_filter.PNG 
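These filter criteria are Spark SQL expressions. Below is a rough, standalone PySpark sketch of applying the same expressions; the sample rows are invented to mirror the Agriculture example, and StreamAnalytix's own execution may differ.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("expression_filter_demo").getOrCreate()

# Sample rows shaped like the Agriculture data set used above (invented values).
df = spark.createDataFrame(
    [("cereals", "loamy"), ("wheat", "sandy"), ("rice", "clay")],
    ["crop", "soil"],
)

# Equivalent of the filter criteria "crop" equals "cereals".
df.filter("crop = 'cereals'").show()

# Equivalent of the expression crop <=> 'wheat' (Spark's null-safe equality).
df.filter("crop <=> 'wheat'").show()

spark.stop()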

Expression Evaluator

Expression Evaluator is used when you wish to perform transform operations on the data columns. It gets automatically added to the canvas when you perform such operations.

exprssion_evaluator.PNG 

VALIDATE: Click this button to validate the transform operation applied on the column. If the expression applied is valid, you will get the message “Valid Expression”; otherwise, you will get the message “Failed to execute given expression”.

ADD EXPRESSION: Additional expressions can be added using the ADD EXPRESSION button. For example, write the expression below to convert the column Soil to uppercase:

Expression: Upper(soil)

 expression_Evalutor_2.PNG
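Similarly, the transform expression is a Spark SQL expression. A minimal, illustrative PySpark equivalent of Upper(soil) is sketched below (the sample data is invented for the example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[*]").appName("expression_evaluator_demo").getOrCreate()

df = spark.createDataFrame([("cereals", "loamy"), ("rice", "clay")], ["crop", "soil"])

# Equivalent of the evaluator expression Upper(soil): replace the column
# with its uppercase value.
df.withColumn("soil", expr("upper(soil)")).show()

spark.stop()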

Query Execution Plan

To understand the Query Execution Plan, we can use an example pipeline. It reads data from RabbitMQ, performs some data quality checks using the Data Quality processor, and stores the data back to RabbitMQ.

Once Auto Inspection runs on the RabbitMQ emitter, the Query Execution Plan icon is visible on the inspect panel.

queryexecution.PNG

The Query Execution plan contains the parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.

1. Parsed logical plan

a. The raw interpretation of a query by Spark is termed the parsed logical plan.

2. Analyzed logical plan

a. The parsed logical plan is converted to the analyzed logical plan by applying a set of rules.

b. Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. In both cases, the relation may contain unresolved attribute references or relations. For example, in the SQL query SELECT col FROM sales, the type of the column, or even whether it is a valid column name, is not known until we look up the table sales. An attribute is called unresolved if its type is unknown or it does not match an input table (or an alias). Spark SQL uses Catalyst rules and a catalog object that tracks the tables in all data sources to resolve these attributes. The analyzed logical plan is then produced.

3. Optimized logical plan

a. The analyzed logical plan goes through a series of rules to resolve, and then the optimized logical plan is produced. The optimized logical plan allows Spark to plug in a set of optimization rules.

4. Physical plan

a. The optimized logical plan is converted to a physical plan for further execution. In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine. It then selects a plan using a cost model. Currently, cost-based optimization is only used to select join algorithms: for relations that are known to be small, Spark SQL uses a broadcast join, using a peer-to-peer broadcast facility available in Spark. The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule. We thus intend to implement richer cost-based optimization in the future.

The image below shows how a query execution plan for a query over StreamAnalytix is visualized.

query_execution_image.PNG 
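The same four plans can be printed for any Spark DataFrame using the explain API, which is useful for comparing with what the Query Execution Plan panel shows. A small standalone sketch follows (the sales data is invented, echoing the SELECT col FROM sales example above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("explain_demo").getOrCreate()

sales = spark.createDataFrame([("p1", 10), ("p2", 25)], ["col", "amount"])
sales.createOrReplaceTempView("sales")

query = spark.sql("SELECT col FROM sales WHERE amount > 15")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
query.explain(extended=True)

spark.stop()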

Auto Connect Component

Drag, drop and connect a component with auto-connect feature.

Drag a component near an existing valid component; as the dragged component gets closer to the available component on the canvas, both are highlighted. Drop the component and it will be connected to the data pipeline.

Scheduling

To execute a workflow periodically, click on the Scheduling tab and select any of the following options to schedule it.

A pipeline can be scheduled:

• Minutes

• Hourly

• Daily

• Monthly

• Yearly

Scheduling_Options.PNG

If Monthly is selected, then you can choose from within the options shown below:

Scheduling_Options_Monthly.PNG
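The selected schedule is represented as a cron expression (see the Cron Expression detail under the Scheduling action). The strings below are ordinary Quartz-style examples for illustration only; the expressions StreamAnalytix actually generates may differ.

# Illustrative Quartz-style cron expressions for the scheduling options above;
# the expressions actually generated by the product may differ.
sample_schedules = {
    "Minutes (every 15 min)": "0 0/15 * * * ?",
    "Hourly":                 "0 0 * * * ?",
    "Daily (midnight)":       "0 0 0 * * ?",
    "Monthly (1st of month)": "0 0 0 1 * ?",
    "Yearly (Jan 1st)":       "0 0 0 1 1 ?",
}

for name, cron in sample_schedules.items():
    print(f"{name}: {cron}")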

 

Note: A pipeline can only be scheduled/unscheduled in the stopped state.

SCHEDULE/UN-SCHEDULE: This action will un-schedule/re-schedule any scheduled workflow and mark all coordinated actions that are in the running or waiting state as killed.

The History of the pipeline run can be viewed under the History tab:

HistoryOfScheduledRun.PNG

Download/Upload Pipeline

Download Pipeline

A pipeline can be downloaded along with its component configurations, such as schema, alert configuration, and registered entities.

Upload Pipeline

A pipeline can be uploaded with its component configurations such as schema, alert definitions, and registered entities.

To import a pipeline, click on the Upload Pipeline button on the pipeline page.

upload_pipe.PNG 

If any duplication exists in the pipeline components, a confirmation page will appear, as shown below.

overwriter_button.PNG 

Click the Overwrite or New Pipeline button to see the Exported Pipeline Details page, which provides the details of the pipeline and the environment from which it was downloaded.

Click the Overwrite button and all the existing components will be overwritten with the current configuration of the imported pipeline.

exported.PNG 

Click Proceed, update the pipeline configuration details, and expand the pipeline configurations by clicking the More button (which appears on the next screen).

As shown in the diagram below, click the Component tab to update the pipeline component configuration and connection details if required. Once done, click the Upload button to upload the pipeline with the configured changes.

update_conf.PNG 

While uploading the pipeline, if the components used in the pipeline are not configured in the environment into which it is uploaded, no connection will be available for that particular component. This will not abort the upload process, but the start button of the pipeline will remain disabled until a connection is configured for the respective component.

Once the connection is created, update the pipeline and select the corresponding connection to play the pipeline.

The New Pipeline button allows you to save the uploaded pipeline as a new pipeline.

The first page will be the exported pipeline's details page; after clicking the Proceed button, the pipeline configurations are displayed, where the pipeline's name and other configurations can be changed.

export.PNG 

update_1.PNG 

Stats Notification

This tab enables sending emails with the statistics of the data pipeline to a user, with customized details.

Click on Stats Notification from the pipeline tile of the Data Pipeline homepage.

This page allows you to enter the email address on which you will receive the emails. Choose the duration of the emails, as shown below:

statsnotification.PNG

The other tab, Metrics, shows all the metrics of the pipeline that will be sent in the email.

stats_3.PNG 

View Summary

Once a pipeline has been started and is in Active Mode, you can view its Summary.

To view the real-time monitoring graphs, click on the pipeline summary (go to the Data Pipeline page > pipeline tile > click the View Summary icon), as shown below:

To enable real-time monitoring graphs, go to Configuration > defaults > spark > EnableMonitoringGraphs and set its value to true. By default, this value is set to false.

In addition, you can view the Spark Job page from the application_1525…. link as shown below:

The View Summary tab provides the following tabs:

Monitoring

Under the Monitoring Tab, you will find the following graphs, as shown below:

• Query Monitor

• Error Monitor

• Application Error

Query Monitor

The graphs are plotted in panels; the number of panels equals the number of emitters used in the structured streaming pipeline. Each panel has three graphs, as described below. These graphs can be seen through the View Summary option.

Input vs. Processing Rate
The input rate specifies the rate at which data is arriving from a source (Data Source), e.g., Kafka.

The processing rate specifies the rate at which data from this source is being processed by Spark.

Here, the number of input rows defines the number of records processed in a trigger execution.

Batch Duration
The approximate time, in seconds, taken to process the overall micro-batch (trigger).

Aggregation State
Shows the aggregation state of the current streaming query. It shows information about the query operators that store the aggregation state. This graph is shown only if the pipeline performs some aggregation.
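These are the standard Spark Structured Streaming progress metrics. Outside the StreamAnalytix UI, the same numbers can be read from a streaming query's progress object; the standalone sketch below uses the built-in rate source purely for illustration.

from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.master("local[*]").appName("progress_demo").getOrCreate()

# Toy streaming query on the built-in 'rate' source, with an aggregation so that
# state metrics are produced as well.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
query = (
    stream.groupBy().count()
    .writeStream.outputMode("complete")
    .format("memory").queryName("rate_counts")
    .start()
)

time.sleep(15)  # let a few micro-batches (triggers) complete

progress = query.lastProgress  # metrics of the most recent micro-batch
if progress:
    print("Input rate (rows/sec):     ", progress.get("inputRowsPerSecond"))
    print("Processing rate (rows/sec):", progress.get("processedRowsPerSecond"))
    print("Rows in this trigger:      ", progress.get("numInputRows"))
    print("Trigger duration (ms):     ", progress.get("durationMs", {}).get("triggerExecution"))
    print("Aggregation state:         ", progress.get("stateOperators"))

query.stop()
spark.stop()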

Continuous Integration Continuous Delivery

StreamAnalytix supports continuous integration and continuous delivery of pipelines, in the form of test suites and by deploying pipelines to a destination environment.

Continuous Integration

Continuous integration is done by executing test cases (as part of a test suite) on the incremental state of a pipeline, which ensures the pipeline is not impacted by new changes.

A test suite is a collection of test cases associated with a pipeline.

You may want to create a test case to verify multiple permutations on continuous data which will then be deployed as soon as they pass the tests.

How to create a test case?

A test case comprises a set of source data files (one data file for each Data Source) and the inspection outputs of each component in the pipeline. These inspection outputs can be thought of as the expected data for the component.

Every test case has its source data associated with each Data Source. When a test suite is executed, all its associated test cases are executed. A test suite run is uniquely identified by a run ID. When a test case is executed as part of a run ID, each component is inspected again, and the output data can be referred to as the actual data.

How a test case is evaluated to pass or fail?

A test case is passed if both the conditions mentioned below are fulfilled:

1. The number of records is the same for the actual and expected data.

2. Every field value of every record in the actual data matches the expected data.

Let’s take an example to understand this.

Your source data may have integers or some blank values and you may want to perform certain tests to make sure that the invalid or null values are omitted. These test cases can be created with different source data files to test.
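The pass/fail rule amounts to an exact comparison of the expected and actual records. Below is a minimal sketch of that logic (not StreamAnalytix's actual implementation), treating each component's output as a list of field dictionaries:

def test_case_passes(expected, actual):
    """Return True if the actual inspection output matches the expected output.

    `expected` and `actual` are lists of dicts, one dict per record. This mirrors
    the two conditions above: same record count, and every field of every record
    identical.
    """
    if len(expected) != len(actual):
        return False
    return all(exp == act for exp, act in zip(expected, actual))


# Example: a blank/None value in the actual output causes a failure.
expected = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
actual = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
print(test_case_passes(expected, actual))  # False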

Note: When you upload a source file via Upload Data, that file is inspected and the output is saved. However, when you use Fetch From Source on a Data Source, StreamAnalytix internally generates a source data file using the data from the source.

To create a test case, you need to configure, save and inspect the pipeline.

Create Test Case

Once the pipeline is saved and inspected, a test case can be created. Go to the menu on the data pipeline canvas, as shown below:

testsuite.PNG

When you click Test Suite, a New Test Case window opens.

createtestcase.PNG

To create a new test case, enter details in the New Test Case window fields, as shown below:

Details: Details about the test case, which include:
• Name: Name of the test case. Two test cases cannot have the same name; within a pipeline, names have to be unique.
• Description: Enter a description for the test case (optional).

Do Not Compare: Select the columns of a component, or the component itself, which will not be compared and tested during inspection. The test case execution will skip data comparison for these component field(s) and component(s). For the components the test case runs on, the output file (schema) of the component is saved, and each column is either compared or not compared based on this setting.

For example: The test case is to check the entries of employees, where a Random User ID field is generated every time an action happens. This random UID will be different every time; therefore, it would fail the test. Here you can choose not to compare either the output field or the component itself.

NOTE: Only the output fields configured in the Emitter will be populated under the fields of a component.

deatilstest.PNG

 

donotcompare.PNG

 

Click Create, and a success notification message will be shown, along with a new entry in the table, as shown below:

savedtestcase.PNG

Test Suite properties

Every test case is listed in tabular form under the components of the data pipeline canvas.

This table has two windows (shown above), Definition and Run History, with properties defined below.

Definition

The Definition tab shows all the details and enables you to create a test case, execute it, run multiple test cases at the same time, and perform other actions as defined below:

Status: Signifies the last execution status of a test case. To view the status of the test case, i.e., the last run status, hover the mouse over the status icon. This shows the status of the test case with a RUN ID and the duration of execution. Possible values are 'Never Executed', 'In Progress', 'Success', 'Failure', and 'Error'. When the test case is running, In Progress is denoted by a yellow status icon. On completion, Success is indicated by a green status icon, and Failure/Error is indicated by a red exclamation mark.

Test Case Name: Name of the test case.

Description: A description of the test case. This is an optional property.

Source File(s): The source file used for test creation.

Coverage: The percentage of components covered under the test case configuration. (Shown below the table with an image and more detail.)

Actions: The actions that can be performed on the test case. (Click for more details.)

Refresh Table: Refreshes the test case list.

Run selected test cases: Once the test cases are created, you can run the selected ones, or all of them, with just one click.

Create new: Create new test cases.

Update existing: Edit the existing selected test case and re-configure the fields. The same can be done using Configure under Actions.

Search: Search for any test case using the search column.

Under the Coverage column, 'Click for more details' will show the complete configuration of the test case.

Coverage.PNG

Actions on test case

actions.PNG

Configure: Edit the configuration by clicking the Configure button of the test case.

Load Data: Allows you to load the pipeline's Data Source(s) data and run inspection. Any inconsistency caused by changes in the existing pipeline design will be shown by an alert icon in the Status column. You need to Load Data again and update the existing test case to sync it with the latest configuration of the pipeline.

Download: Download the uploaded sample data file(s) by clicking the Download icon. Click the test case Download button and a new window opens with every component's output data, which can be downloaded.

Delete: Delete any test case via the Delete button.

Run History

You can view the test case Run History. The table shows the run statistics along with the test case outputs. You can also view the reason for failed test cases.

runhistory.PNG

Stale Test Case

Multiple test cases can be configured and re-configured from the same window. There may be cases where you have re-configured the pipeline and a few components have been removed.

Removing a component from an existing pipeline and then creating a new test case may lead to inconsistencies in the test cases created earlier.

stale.png

In case of stale test cases, such inconsistencies are shown as a warning or error under the Status of the test case.

Click Load Data again and update the existing test case to sync it with the latest configuration of the pipeline.

Each entry corresponds to a single test suite run. A single test suite run may contain one or more test cases.

Download Test Results

Complete test suite results can be downloaded via Download Test Results, under Results > Run History. The test report is an HTML file with the run ID and test run details (shown below):

results.PNG
testsuiterunreport1.PNG

 

Individual test case results can be downloaded via Download Output Results, corresponding to the Duplicate IDs. The output result is in JSON file format.

Updated Test Cases

If you want to update the source file or the expected result, Update Test Case enables the same. Updating existing test case data will overwrite previously captured inspection data with new inspection data. You can also update the Do Not Compare section while editing the existing pipeline.

updateexisting.PNG

Version Control

Version control allows you to create a version of the test suite. StreamAnalytix supports two types of version control systems:

• GIT

• StreamAnalytix Metastore

Before understanding how version control works, we need to understand what a Working Copy is.

A Working Copy is the current version of the copy, or the first (0) version of the copy.

Pipeline editing is always performed on the Working Copy.

Note: Any version, once created, becomes history and cannot be edited. If you are editing the working copy and you switch to a previous version, the existing working copy is overwritten from the selected version.

There are three ways by which you can create a version:

The first way: when you re-save an edited pipeline, you get the option to Create Version.

createdefinition.PNG

The second way to version your pipelines is when you edit and save the pipeline. Among the options for saving the pipeline are the Create Version and Switch Version buttons.

switchandcreteversion.PNG

The third option to create a version is from the pipeline tile.

actions00016.PNG

With either of the last two options shown above, the same window opens to create a version.

createversion.PNG

Mention a description or a comment for the version and click Create Version. Once you save the version, the last version of the pipeline changes from 'Working Copy' to Version 1.

workingcopy.PNG
version1.PNG

 

A pipeline remains a Working Copy as long as no other version is created. Once you save a version, next to the title 'Create Subsystem version' is a notification of the next version that will be created.

When a new superuser logs in, they will have to configure the Version Control properties under Setup.

Switch Version

If you want to switch the version of a pipeline, click Switch Version on the pipeline editor page and choose a version. The pipeline will change as per the selected version.

It is the Working Copy that is loaded to a newer version. Editing is always performed on the Working Copy of the pipeline.

Note: If you want to switch to an older version, save the current state as a version.

Download Version

Download Version from the pipeline tile allows you to download any version of the pipeline.

versionList.PNG

Click on the Download arrow to download the pipeline.

Version: Version number of the pipeline.

Commit Message: Commit message description.

Artifacts: The artifact uploaded.

Commit Time: The time at which the pipeline was committed as a version.

Download: Download the pipeline.

Continuous Delivery

While you continuously integrate data in the pipelines, you can also deliver the same.

Continuous delivery requires configuration of three environments:

• Source Environment

• Test Environment

• Destination Environment

Source environment is where the pipeline is incrementally developed.

The CD scripts download the pipeline from source environment and upload it on the Test Environment.

On the test environment, an inspect session is created based on the user's selection, and then the test suite of the pipeline is executed.

If the test suite passes the execution, the pipeline is then promoted to the Destination environment.

Prerequisites

Pre-requisites for CD script

1. Installation of jq

2. SMTP server configured

Open the folder <StreamAnalytix_home>/bin/CD_script and run StartCD.sh to run the pipeline on the destination environment. Integration is done by creating Test Suites and delivery is performed by running the StartCD.sh on StreamAnalytix metastore.

Configure the properties mentioned below to deliver the test cases on the environment:

CD1.PNG

sax.testserver.user.token: The token of the workspace where the pipeline will be imported.

sax.testserver.url: StreamAnalytix URL for the test environment.

sax.testserver.testsuite.execute: Whether test suite execution is required or not. Possible values are 1 or 0.

sax.testserver.inspect.type: Inspect session to be used for test suite execution. Possible values: livy or local.

sax.destination.user.token: StreamAnalytix user token for the destination. It can be obtained by logging in as a workspace user > Manage Users > Edit User.

sax.destination.url: StreamAnalytix URL for the destination.

sax.source.pipeline.name: Name of the pipeline to promote.

sax.source.pipeline.version: Version of the pipeline to promote.

sax.promotion.skiponfailure: Specify whether pipeline promotion should be stopped on test suite failure.

sax.promotion.emailadress.list: Space-separated list of email IDs to which the report will be sent.

 

Configuring Cloudera Navigator Support in Data Lineage

Publishing Lineage to Cloudera Navigator

For publishing lineage to Cloudera Navigator, configure StreamAnalytix with the following properties.

Configuration Properties

To go to the configuration properties, go to Superuser > Configuration > Others > Cloudera.

These properties are required to enable publishing of lineage to Cloudera Navigator:

Navigator URL: The HTTP URL of the Cloudera Navigator UI.

Navigator API version: The Navigator SDK API version (read below).

Navigator Admin Username: The Cloudera Navigator admin user.

Navigator Admin User Password: The Cloudera Navigator admin password.

Autocommit enabled: Specifies whether auto commit of entities is required.

cloudera.PNG 

The Navigator API version can be extracted from the Cloudera Navigator UI by clicking the question mark located in the top-right corner of the web page, next to the user name.

Select ‘About’.

admin.PNG 

Mention the Cloudera Navigator version as the Navigator API version, as shown in the image below:

cloud_version.PNG 

NOTE: The configuration 'security.profile' should be CDH for publishing lineage to Cloudera Navigator. This is a deployment-time configuration and can be found in <StreamAnalytix_Installation_Dir>/conf/yaml/common.yaml.

Configuring Lineage in Data Pipeline

The next step to enable publishing Lineage to Cloudera Navigator is while saving the Pipeline.

Click on the checkbox shown below:

pipe_def.PNG 

You can publish lineage for both batch and streaming pipelines in StreamAnalytix.

The ‘Publish lineage to Cloudera Navigator’ checkbox is enabled by default for every pipeline.

You need to update the pipeline and start it.

Viewing a pipeline’s lineage on Navigator

The lineage of a pipeline is published to Cloudera Navigator once the pipeline is started.

Shown below is an example to view the lineage of the pipeline ‘sales_aggregation’.

sale_aggregat90j.PNG 

Where the RabbitMQ Data Source configuration is as follows:

rabbit.PNG 

The Cloudera Navigator URL is as follows:   with credentials (usually admin/admin).

Once the pipeline is active, go to the Cloudera Navigator UI and search for the pipeline name 'sales_aggregation'.

cloudera00017.PNG 

The components of the pipeline are listed as shown above. Please note the source and target RabbitMQ components are shown as two entities in Navigator – Rabbitmq and Rabbitmq_dataset. The Rabbitmq_dataset shows the schema of the data flowing in StreamAnalytix and the Rabbitmq shows the metadata of the actual RabbitMQ component.

Click on ‘Aggregation’ entity listed above and once the 'Aggregation' entity page opens, click on ‘Lineage’ tab, as shown below:

clouderat.PNG 

Similarly, you can go to Cloudera Navigator search page and search by RabbitMQ Data Source's queue name or exchange name and view the lineage by clicking on the RabbitMQ entity.

sale_data.PNG 

You can also view the schema of the message configured on the Data Source using the ‘Details’ tab of the RabbitMQ entity.

These are ‘Columns’ of the entity.

actios.PNG 

Publishing lineage to Cloudera Navigator helps you easily integrate with the existing entities.

Consider a pipeline that processes data from RabbitMQ or Kafka, enriches the data and inserts the resultant data in a hive table.

Now, an external job (not a StreamAnalytix pipeline) reads from this Hive table and does further processing of data. If lineage is published by this external job as well, you will be able to see a combined enterprise lineage on Cloudera Navigator.   

Viewing lineage of HDFS and Hive

The lineage of HDFS and Hive Data Sources and emitters is shown using native entities in Cloudera Navigator if the configured HDFS path or Hive table already exists.

an_person.PNG 

The lineage of native HDFS entity is shown in green and the lineage of native Hive entity is shown in yellow.

If the HDFS path or Hive table does not exist, the lineage is represented by creating a custom dataset, which is shown as a grey dataset.