Register Entities

Register Entities allows you to register custom components, i.e., custom parsers, data sources, and processors, to be used in pipelines.

There are five types of entities:

Register Component: Upload a customized jar to create a customized component that can be used in data pipelines.

Functions: A rich library of pre-defined functions and user-defined functions.

Variables: Use variables in your pipelines at runtime as per the scope.

Models: Register Spark ML/H2O/Scikit models of supported model types that can be used in data pipelines.

Calendar: Create multiple holiday calendars which can then be used in Workflows.

Each entity is explained below.

Register Components

Use Register Component to register a custom component (Channel and Processor) by uploading a customized jar. Those custom components can be used in data pipelines.

The Register Components tab is available under the Register Entities sidebar option.

RegisterEntityhomepage_3.PNG

Download a sample jar from the Data Pipelines page, customize it as per your requirement, and upload the same on the Register Components page.

Custom Code Implementation

StreamAnalytix allows you to implement your custom code in the platform to extend functionalities for:

Channel: To read from any source.

Processor: To perform any operation on data-in-motion.

Custom code implementation allows importing custom components and versioning.

You can download a Maven based project that contains all the necessary StreamAnalytix dependencies for writing custom code and sample code for reference.

Pre-requisites for custom code development

1. JDK 1.7 or higher

2. Apache Maven 3.x

3. Eclipse or any other IDE

Steps for Custom Code Implementation

stepsforCustomCode_3.png

Build Custom Code

Provide all the dependencies required for the custom components in the pom.xml available in the project.

• Build project using mvn clean install.

• Use jar-with-dependencies.jar for component registration.

Register Custom Code

registercustomcomponent_3.jpg
regiistercustomcomp_3.png

 

The list of custom components is displayed on the page shown below; the properties are described in the following table:

Components: The icon of the component is displayed in this column, which identifies it as a Data Source or a Processor.

Name: Name of the custom component.

Config: Config link of the component. You can add configuration to a custom component or upload a jar.

Engine: The supported engine, which is Spark.

Scope: Whether the component can be used in a Local environment or a Global one.

Actions: Add Config (+), Upload Jar, and Delete.

Owner: Indicates whether the custom component was created by a Superuser or a workspace user.

Version: The version number of the custom component.

Perform the following operations on uploaded custom components:

• Change the scope of custom components (i.e., Global/Local).

• Change the icon of custom components.

• Add extra configuration properties.

• Update or delete registered custom components.

Version Support (Versioning) in component registration

Register multiple versions of a registered component and use any version in your pipeline.

versionlistRegisterComp_3.png

 

NOTE: If you have used a registered component in a pipeline, make sure that all the components registered with a single jar are of the same version. If you have registered a component with a fully qualified name (FQN), that component cannot be registered with a different jar in the same workspace. Registering the same FQN with the same jar creates a new version of that component.
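The versioning rule above can be pictured as a small registry keyed by FQN. The sketch below is illustrative only; the class and method names are hypothetical and do not reflect the product's internal data structures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model of the versioning behaviour: a component is keyed by its
// fully qualified name (FQN), and re-registering the same FQN adds a new
// version instead of a new component.
public class ComponentRegistry {

    // FQN -> (version number -> jar name)
    private final Map<String, TreeMap<Integer, String>> registry = new HashMap<>();

    public int register(String fqn, String jarName) {
        TreeMap<Integer, String> versions =
                registry.computeIfAbsent(fqn, k -> new TreeMap<>());
        int next = versions.isEmpty() ? 1 : versions.lastKey() + 1;
        versions.put(next, jarName);
        return next;
    }

    public static void main(String[] args) {
        ComponentRegistry r = new ComponentRegistry();
        System.out.println(r.register("com.example.MyProcessor", "custom-1.0.jar")); // -> 1
        System.out.println(r.register("com.example.MyProcessor", "custom-1.1.jar")); // -> 2
    }
}
```

Re-registering the same FQN simply bumps the version, which is why any version of the component remains selectable in a pipeline.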

Functions

The Functions entity enables you to enrich an incoming message with additional data that is not provided by the source.

functions1_3.PNG

System Defined Functions

StreamAnalytix provides a rich library of system-defined functions as explained in the Functions Guide.

Variables

Allows you to use variables in your pipelines at runtime as per the scope.

To add a variable, click on Add New Variable and provide details as explained below.

Variables.PNG

Name: Provide a name for the variable.

Value: Value assigned to the variable (it can be an expression).

Data Type: Select the data type of the variable. The options are:

• Number

• Decimal

• String

Scope: Select the scope of the variable. The following scopes are available:

Global: The variable is available throughout the application.

Workspace: The variable is available within all the topologies of the workspace.

Pipeline: The variable is available within the pipeline. If you select Pipeline, a drop-down lists all the pipelines of the same workspace.

For example, suppose you create the following variables: Name, Salary, and Average.

Then, by calling the following code, you get all the variables in varMap in your implementation class:

Map<String, ScopeVariable> varMap = (Map<String, ScopeVariable>) configMap.get(svMap);

To use the Name variable you created, call the following code; the returned object holds all the details of the scope variable: Name, Value, Data Type, and Scope.

ScopeVariable variable = varMap.get("Name");

String value = variable.getValue();
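Since the ScopeVariable class itself is not shown here, the following self-contained sketch uses a hypothetical stand-in with the same accessors to show the lookup pattern end to end. The actual class shipped with StreamAnalytix may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: ScopeVariable below is an illustrative stand-in whose fields
// mirror the table above (Name, Value, Data Type, Scope).
public class ScopeVariableExample {

    static class ScopeVariable {
        private final String name, value, dataType, scope;
        ScopeVariable(String name, String value, String dataType, String scope) {
            this.name = name; this.value = value;
            this.dataType = dataType; this.scope = scope;
        }
        String getName() { return name; }
        String getValue() { return value; }
        String getDataType() { return dataType; }
        String getScope() { return scope; }
    }

    public static void main(String[] args) {
        // In a real component this map would come from the configMap lookup
        // shown above; here it is populated directly for illustration.
        Map<String, ScopeVariable> varMap = new HashMap<>();
        varMap.put("Salary", new ScopeVariable("Salary", "50000", "Number", "Pipeline"));

        ScopeVariable variable = varMap.get("Salary");
        String value = variable.getValue();
        System.out.println(value); // -> 50000
    }
}
```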

Scope Variable

You can now add scope variables so that you can reuse and update them as and when needed across pipelines and pipeline components.

Scope variable support is added for the components below, at the listed fields, where the scope variable can be referenced with the @ notation.

Cobol (Data Source) --> copybookPath, dataPath

Native/Advance HDFS (Data Source) --> File Path

HIVE (Data Source) --> Query

JDBC (Data Source) --> Query

File Reader/Native (Data Source) --> File Path

File Writer (Emitter) --> File Path

Formats supported are:

@{Pipeline.filepath} = /user/hdfs

@{Workspace.filepath} = /user/hdfs

@{Global.filepath}/JSON/demo.json = /user/hdfs/JSON/demo.json

@{Pipeline.filepath + '/JSON/demo.json'} = /user/hdfs/JSON/demo.json

@{Workspace.filepath + "/JSON/demo.json"} = /user/hdfs/JSON/demo.json

@{Global.lastdecimal + 4} = 14.0 (arithmetic: adds 4 to the numeric value, here 10.0)
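The substitution behind these formats can be illustrated with a minimal resolver for plain @{Scope.name} lookups. This is a sketch only; the platform's resolver also evaluates concatenation and arithmetic expressions, which this example omits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: resolves @{Scope.name} placeholders against a map of
// variables keyed by "Scope.name", leaving unknown placeholders untouched.
public class ScopeVariableResolver {

    private static final Pattern PLACEHOLDER =
            Pattern.compile("@\\{(Global|Workspace|Pipeline)\\.([A-Za-z0-9_]+)\\}");

    public static String resolve(String input, Map<String, String> variables) {
        Matcher m = PLACEHOLDER.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Key variables by "Scope.name", e.g. "Global.filepath".
            String key = m.group(1) + "." + m.group(2);
            String value = variables.getOrDefault(key, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("Global.filepath", "/user/hdfs");
        System.out.println(resolve("@{Global.filepath}/JSON/demo.json", vars));
        // -> /user/hdfs/JSON/demo.json
    }
}
```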

Models

The home page of Register Models is shown below:

Models1_3.PNG

Model Name: Name of the registered model.

Model API: The chosen Machine Learning API, i.e., Spark ML/H2O/Scikit.

Model Type: Shows the type of the model.

Pipeline Model: Shows whether you trained the model using the Spark ML Pipeline API.

Model Location: Location of the registered model, which could be StreamAnalytix DB or HDFS.

Actions:

Delete: Deletes a model.

Edit: Allows you to edit Model Details/Model Definition.

Download: Downloads the ML model. Only models registered in StreamAnalytix DB are available for download.

NOTE: If you use a registered model in your data pipeline, make sure that the Spark and H2O versions are the same.

Register Spark ML Model

Steps to register a Decision Tree model built on the Spark Pipeline API for a classification use case.

Step 1: Train the model on a sample dataset and save it at any location on HDFS or the local file system.

Step 2: Click on the Register Model tab and provide a model name.

Step 3: Provide the model API, which is ML in our case.

Step 4: Provide the model type, i.e., Decision Tree.

Step 5: Mention whether the model to be registered is a Spark Pipeline model or not.

Step 6: Mention the model algorithm, i.e., Classification.

Step 7: Select the model source. If the model is stored locally, select Local; if it is stored on HDFS, select HDFS.

The table below shows the process flow with Local and HDFS.

Local:

Step 8: When you select Local, browse to the zip file of the model. Once the model is successfully uploaded, a green tick mark appears.

Step 9: Click Validate to validate the model.

Step 10: Once validation is successful, a new tab is shown: Save Model On. Choose HDFS or StreamAnalytix Database. If you choose HDFS, provide the HDFS Connection Name and the Model Path where the model will be saved.

Step 11: Register the saved model on the Model Path by clicking Register.

HDFS:

Step 8: When you select HDFS, provide the HDFS connection name.

Step 9: Provide the Model Path where the model is located.

Step 10: Validate the model.

Step 11: Register the model.

Name: Provide a valid name for the model.

Model API: Spark API on which the model is built.

Feature List: Lists the features used to train the model. You can specify the feature names either by entering them manually or by uploading a .csv file, in which case the header row of the file supplies the feature names.

Model Type: Types of models supported for the chosen Spark API.

Pipeline Model: If you trained the model using the Spark ML Pipeline API, select Pipeline Model; otherwise un-check this option and register.

Model Algorithm: Available for ML models only. Algorithm for the selected model type (only available for Decision Tree, Random Forest, and Gradient Boosted Tree, which populate the Classification and Regression algorithm types).

Model Source: If the model lies on the local file system, select Local; if it lies on HDFS, select HDFS.

HDFS: Select this option if the model is saved on HDFS.

Upload: Select this option if the model is saved on a local file system.

Validate: Validates the model located at the given location. Once validated, you will be able to save the model.

Save Model On: If you want to save the model on the local file system, select Local; otherwise select HDFS.

Connection Name: Choose the HDFS connection name.

Model Path: Provide the path where the model is located on HDFS.

register_3.PNG

 

register_model_2_3.PNG 

When the model is successfully registered with the application, it is available for use in your data pipelines.

Register H2O Models

StreamAnalytix provides its users with the option to register H2O POJO (Plain Old Java Object) and MOJO (Model Object, Optimized) models that are trained outside of StreamAnalytix. After successful registration, the user can use the model for making predictions.

In the left navigation pane, click Register Entities. Click the Models tab.

To register a new model, click the + icon on the right side of the screen.

Field

Description

Name

Name of the Model.

Model API

Choose H2O as the API.

Model Format

Select the model format: MOJO (Model Object, Optimized) or POJO (Plain Old Java Object).

Model Type

The MOJO or POJO models can be registered with the algorithms mentioned below. Choose the one that fits your use case.

1. Distributed Random Forest

2. Gradient Boosting Machine

3. Generalized Linear Modeling

4. Isolation Forest

NOTE: A different set of model types is available if you select POJO: KMeans, Naive Bayes, Deep Learning, Distributed Random Forest, Gradient Boosting Machine, and Generalized Linear Modeling.

Model Class

Available for the POJO model type only. Mention the POJO class name.

Model Source

Register the H2O model either by uploading the zip file of a MOJO model or the Java files of a POJO model. If the files are placed on HDFS, mention the HDFS connection and location.

In the Register Model window, mention the following properties:

H20_01_3.png

Click the Validate button.

Once the model is validated successfully, the Register button next to Validate is enabled.

Click Register. After the model is registered successfully, you can view the model in the models page.

h20_02_3.png

Once the model is registered, it can be used for scoring with the H2O processor in StreamAnalytix pipelines.

Register Scikit model

StreamAnalytix provides its users with the option to register a Scikit model that is trained outside of StreamAnalytix. After successful registration, the user can use the model for making predictions.

In the left navigation pane, click Register Entities. Click the Models tab to register a model.

On the right side of the screen, click the + icon to register a new model.

In the Register Model window, mention the name of the model, specify the model API. Here, we choose Scikit as API.

Scikit_model_01_3.png

Select one of the below model categories:

• Classification

• Clustering

• Pipeline

• Regression

The Feature list field enlists the features that are used to train the model. You can specify the feature names either by entering them manually or by uploading a .csv file. Here, the header row of the file will be used for feature names.
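The header-row convention can be sketched as follows. This is an illustrative example only, not the platform's actual parser (which may additionally handle quoting and encodings).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: when a .csv file is uploaded for the Feature list,
// only the header (first) row is needed to obtain the feature names.
public class FeatureListFromCsv {

    public static List<String> featureNames(String csvContent) {
        // Split off the header line; the data rows are ignored here.
        String header = csvContent.split("\\R", 2)[0];
        return Arrays.stream(header.split(","))
                     .map(String::trim)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String csv = "age,income,score\n34,52000,0.8\n41,61000,0.6\n";
        System.out.println(featureNames(csv)); // -> [age, income, score]
    }
}
```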

You can provide the model source either by uploading the scikit model or by provid­ing the HDFS file path.

Note: If you choose HDFS as the model source, provide the connection name and model path.

Click the Validate button. Once the model is validated successfully, the Register button next to Validate will be enabled. Click Register. After the model is registered successfully, you can view it on the Models page and use it in a data pipeline for making predictions.

Scikit_model_02_3.png

Now, you can use the registered Scikit model for making predictions by selecting the Scikit processor when creating a data pipeline. Refer to the image below:

scikitprocessor_3.png

 

 

Calendars

You can create holiday calendars from the Register Entities > Calendar > Calendar listing page. Click the + icon to create a calendar.

calendarlisting_3.PNG

On the calendar configuration page, select the holiday dates to create the calendar.

You can either select multiple dates from a complete-year or multi-year calendar view, or upload the holiday dates for one or multiple calendar years.

These calendars can then be used in Workflows.

Following is the configuration page, with its properties and descriptions:

Celender_3.PNG

 

Name: Name of the calendar.

Scope: Select Global or Local, which defines the scope of the calendar.

Timezone: Select the timezone for your calendar.

Date(s): Select the date(s) to be marked as holidays in your calendar.

Description: You can add a description of the calendar.

Upload: Upload a text file (.txt) that contains date(s) in the MM-DD-YYYY format. If the file has multiple dates, each entry should be on a new line.
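The expected upload format above (one MM-DD-YYYY date per line) can be parsed as in this illustrative sketch; the class and method names are hypothetical, not part of the product API.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch only: parses holiday dates in the MM-DD-YYYY format the calendar
// upload expects, one date per line, skipping blank lines.
public class HolidayFileParser {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM-dd-yyyy");

    public static List<LocalDate> parse(String fileContent) {
        List<LocalDate> holidays = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            if (!line.isBlank()) {
                holidays.add(LocalDate.parse(line.trim(), FORMAT));
            }
        }
        return holidays;
    }

    public static void main(String[] args) {
        String fileContent = "01-01-2024\n07-04-2024\n12-25-2024\n";
        System.out.println(parse(fileContent));
        // -> [2024-01-01, 2024-07-04, 2024-12-25]
    }
}
```

A malformed line (e.g. DD-MM-YYYY order) throws a DateTimeParseException, which is the behavior you would want when validating an upload.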