Installation Guide
The StreamAnalytix platform enables enterprises to analyze and respond to events in real time at Big Data scale. With its unique multi-engine architecture, StreamAnalytix provides an abstraction layer that offers the flexibility to execute data pipelines on a stream processing engine of choice, depending upon the application use case and weighing the advantages of Storm or Spark Streaming based upon processing methodology (CEP, ESP) and latency.
The objective of this document is to install StreamAnalytix and configure various infrastructure components that would interact in StreamAnalytix pipelines.
1. The user of this document is well acquainted with Linux systems and has a fair knowledge of UNIX commands.
2. The user has sudo rights or is working as the root user.
3. The user has the yum, rpm, unzip, tar and wget tools installed.
StreamAnalytix Webstudio provides a web interface to create, deploy and manage data processing and analytical flows. These data flows utilize services that are part of the big data cluster. Cluster managers like Ambari manage the majority of these services (for HDP). The Webstudio needs to be configured correctly to enable data pipelines to interact with the services.
StreamAnalytix web studio provides a simple way to configure these service properties as part of a post-deployment setup process.
Managed services such as YARN, Zookeeper can be configured by simply providing Ambari information in the setup screen. Properties for services that are not part of the managed cluster can be configured by entering the values manually.
Before beginning with the installation please see supported technology stack in Appendix 1.
An HDP, CDH or Apache based cluster with the version described in Appendix 1 must be available for StreamAnalytix to work properly. Livy/Local service is required to create pipelines in StreamAnalytix.
The pre-requisites mentioned in Appendix 1 must be deployed before proceeding further.
StreamAnalytix Webstudio can be configured manually during the deployment process. This requires changing a few configuration files by hand.
Alternatively, a simpler way is to start the Webstudio in embedded mode. This enables the user to configure StreamAnalytix from the UI. Once configured, restarting StreamAnalytix switches the Webstudio to cluster mode.
Embedded mode requires two services, Zookeeper and Qpid; these packages are bundled in the StreamAnalytix binary and do not need additional setup.
1. Extract the StreamAnalytix bundle and go to the extracted location in a terminal.
2. This location is referred to as the StreamAnalytix installation directory.
3. Run the command below to start the Webstudio.
```
cd bin/
./startServicesServer.sh -deployment.mode=embedded
```
Once the command executes, an EULA page opens.
4. Accept the license and click the Next button. The Upload License page opens.
5. Upload the license and confirm.
6. The StreamAnalytix login page is displayed.
1. Navigate to setup page on the sidebar.
2. Setup page contains various tabs - Cluster Configuration, StreamAnalytix, Database, Messaging Queue, Elasticsearch, Cassandra and Version Control.
StreamAnalytix enables automated configuration if the cluster is an HDP or CDH cluster.
Log in to StreamAnalytix, go to Setup, select Cluster Configuration and enter the login details for the cluster manager. On clicking Save, all the managed services are fetched.
| Property | Description |
|---|---|
| Cluster Manager | Select the cluster manager as Ambari. |
| URL | Provide the Ambari URL in the form http://<ambari_host>:<ambari_port> |
| User Name | Username for Ambari. |
| Password | Password for Ambari. |
| Cluster Name | Provide the Ambari cluster name. |
| Enable Kerberos | Allows the Kerberos configuration from the cluster manager to be applied in StreamAnalytix. |
Click Save; all managed services supported in StreamAnalytix are configured and the progress is displayed.
Note: Livy configuration is not supported through the setup simplification process; please make sure that StreamAnalytix is pointing to the correct Livy URL, which is as follows.
Livy URL for HDP: http://localhost:8999
For CDH Using Cloudera Manager
| Property | Description |
|---|---|
| Cluster Manager | Select the cluster manager as Cloudera Manager. |
| URL | Provide the Cloudera Manager URL, e.g. http://<cloudera_manager_host>:<cloudera_manager_port> |
| User Name | Username for Cloudera Manager. |
| Password | Password for Cloudera Manager. |
| Cluster Name | Provide the Cloudera Manager cluster name. |
| Enable Kerberos | Allows the Kerberos configuration from the cluster manager to be applied in StreamAnalytix. |
Click Save; all managed services supported in StreamAnalytix are configured and the progress is displayed.
Note: Livy configuration is not supported through the setup simplification process; please make sure that StreamAnalytix is pointing to the correct Livy URL, which looks as follows:
Livy URL for CDH/Apache: http://localhost:8998
Login to StreamAnalytix using Superuser credentials and go to Configuration from left navigation pane.
1. Select Web Studio tile and click on Zookeeper tab.
Provide the value of the following property:
| Property | Description |
|---|---|
| Host List | The comma-separated list of all the nodes of the Zookeeper cluster. This Zookeeper cluster will be used to store the StreamAnalytix configuration. For example: hostname1:2181,hostname2:2181 |
Save the changes by clicking on Save.
2. Select Processing Engine tab and click on Spark tab.
Provide values for the following properties to point StreamAnalytix to an external cluster:
| Property | Description |
|---|---|
| Spark Livy URL | Livy web URL on which StreamAnalytix will submit pipelines. |
| Spark cluster manager | Defines the Spark cluster manager, i.e. 'yarn' or 'standalone'. |
| spark.history.server | Defines the Spark history server URL. |
| Resource Manager Host | Defines the Resource Manager hostname. |
| Resource Manager Webapp Port | Defines the Resource Manager webapp port. |
| Resource Manager Port | Defines the Resource Manager RPC port. |
| ResourceManager High Availability | Check this if the Resource Manager is HA enabled. |
| ResourceManager HA Logical Names | Resource Manager HA logical IDs as defined in the HA configuration. |
| ResourceManager HA Hosts | Resource Manager HA hostnames. |
| ResourceManager HA ZK Address | Resource Manager HA Zookeeper quorum. |
Save the changes by clicking on Save.
Configure StreamAnalytix with HTTPS
1. Get keystore.jks and truststore.jks certificates.
2. Import these certificates in $JAVA_HOME/jre/lib/security/cacerts
Example:
```
keytool -import -alias cmagent_<hostname> -file <path of the file>/<filename> -keystore $JAVA_HOME/jre/lib/security/jssecacerts -storepass changeit
```
3. Update the below mentioned configuration under Configuration > Processing Engine > Spark. Change the Resource Manager web port from 8088 to 8190 or to the HTTPS port of the Resource Manager.
4. If any of the services (StreamAnalytix, Spark or Ambari) is running on HTTPS, you can configure the same under Configuration:
History and Resource Properties
3. Select Hadoop tab and click on HDFS tab. Provide values for the following properties:
| Property | Description |
|---|---|
| Hadoop Enable HA | Whether the Hadoop cluster is HA enabled. Keep this disabled if Hadoop is not running in HA mode. |
| File System URI | The filesystem FS URI. For example: hdfs://hostname:port (in case HA is not enabled), hdfs://<nameservice> (in case HA is enabled) |
| Hadoop User | The name of the user through which Hadoop services are running. |
| Hadoop DFS Name Services | Defines the nameservice ID of the Hadoop HA cluster. Configure this only when Hadoop is running in HA mode. |
| Hadoop Namenode 1 Details | Defines the RPC address of Namenode 1. For example: nn1,hostname:port. Configure this only when Hadoop is running in HA mode. |
| Hadoop Namenode 2 Details | Defines the RPC address of Namenode 2. For example: nn2,hostname:port. Configure this only when Hadoop is running in HA mode. |
Save the changes by clicking on Save.
4. Now log in to a terminal on the node where StreamAnalytix Webstudio is installed and follow the steps below:
a. Go to <<StreamAnalytix_installation_dir>>/conf
b. Edit the config.properties file and provide the value of the property 'zk.hosts':
| Property | Description |
|---|---|
| zk.hosts | The comma-separated list of all the nodes of the Zookeeper cluster. This Zookeeper cluster will be used to store the StreamAnalytix configuration. For example: hostname1:2181,hostname2:2181 |
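For reference, the resulting entry in config.properties would look like the sketch below (the hostnames are placeholders):
```
# Zookeeper quorum used to store the StreamAnalytix configuration
zk.hosts=hostname1:2181,hostname2:2181
```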
Note: Checkpointing is required to run pipeline on a standalone deployment.
StreamAnalytix Miscellaneous Properties
1. Go to Setup from the left navigation pane and click on the StreamAnalytix tab.
Provide values for following properties:
| Property | Description |
|---|---|
| StreamAnalytix Web URL | The URL at which the StreamAnalytix Webstudio is accessible. |
| Zookeeper StreamAnalytix Node | The Zookeeper StreamAnalytix node is where Webstudio-specific properties are managed. |
| Zookeeper Configuration Node | The Zookeeper configuration node is where all the YAML properties are managed. |
| Password Encryption Required | Enable Password Encryption Required to encrypt all password fields. |
| Spark Home | Spark Home is the path to the Spark installation on the machine where StreamAnalytix Studio is installed. |
| Spark Job Submit Mode | Spark Job Submit Mode is the mode in which Spark pipeline jobs are submitted. See Appendix 1 on deploying Livy and setting up the Spark 2 client. The options are: spark-submit, livy, job-server. |
| Hadoop User | Hadoop User is the StreamAnalytix user through which the pipeline will be uploaded to HDFS. |
Log in to StreamAnalytix, go to Setup, and select Database.
| Property | Description |
|---|---|
| Connection URL | Provide the JDBC connection URL. Supported database deployments are PostgreSQL, Oracle, MySQL and MSSQL. For example: jdbc:postgresql://<db_host>:<db_port>/streamanalytix |
| User | Provide the username. |
| Password | Provide the password. |
| Run Script | Select the Run Script option if all the SQL scripts need to be executed in the configured database. Run Script executes all the SQL scripts (DDL and DML) in the configured RDBMS database; however, with version 3.2 of SAX, follow the note below. |

Note: 1. Before selecting Run Script, the psql (PostgreSQL) or mysql (MySQL) client should be installed. 2. Manually run both the DDL and DML SQL scripts in the folder <SAX_HOME>/db_dump/<RDBMS_3.2>, since they are not executed automatically with the Run Script option.
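As an illustration of the manual step above, a PostgreSQL deployment could run the 3.2 scripts with psql; the script file names below are placeholders, so use the actual files present in that folder:
```
# run the DDL script, then the DML script, against the configured database
psql -U <db_user> -d streamanalytix -h <db_host> -f <SAX_HOME>/db_dump/<RDBMS_3.2>/<ddl_script>.sql
psql -U <db_user> -d streamanalytix -h <db_host> -f <SAX_HOME>/db_dump/<RDBMS_3.2>/<dml_script>.sql
```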
Login to StreamAnalytix and go to Setup and select Messaging Queue.
| Property | Description |
|---|---|
| Messaging Type | Select the messaging type. Supported types are RABBITMQ and ACTIVEMQ. |
| Host List | Comma-separated host list where RabbitMQ is deployed, in the format <rmq_host1>:<rmq_port>,<rmq_host2>:<rmq_port> |
| User | Provide the RabbitMQ login username. |
| Password | Provide the RabbitMQ login password. |
Login to StreamAnalytix and go to Setup and select Elasticsearch. Configure the following properties.
| Property | Description |
|---|---|
| Elasticsearch Connection URL | Provide the Elasticsearch connection URL in the form <es_host>:<es_connection_port> |
| Elasticsearch HTTP Port | Provide the HTTP port. |
| Elasticsearch Cluster Name | Provide the cluster name. |
| Enable Security | If security is enabled on Elasticsearch, set this to true. |
| Enable Authentication | If authentication is enabled on Elasticsearch, check the checkbox. |
| Enable SSL | If SSL is enabled, check the checkbox. |
| Keystore Password | Elasticsearch keystore password. |
Login to StreamAnalytix and go to Setup and select Version Control. Configure the following properties.
| Property | Description |
|---|---|
| Version Control System | StreamAnalytix Metastore: with the SAX metastore, pipelines are saved on the file system. GIT: with GIT, a pipeline is pushed to GIT after a version is created. If GIT is selected, the following properties are shown. |
| HTTP URL | HTTP URL of the remote GIT repository. |
| Username or Email | Username or email ID of the GIT user. |
| Password | HTTP password of the GIT user. |
| Branch | Branch name on which the push operation will be performed. |
| Repositories Local Path | Local path where GIT clone will place the files on the file system. |
| CLONE | This option clones your entire GIT repository into the location mentioned in Repositories Local Path. |
Please refer to post deployment steps to configure additional optional features in StreamAnalytix.
1. On the terminal, change directory to the StreamAnalytix installation directory and stop and restart StreamAnalytix with configuration reload using the following commands.
```
cd bin/
./stopServicesServer.sh
./startServicesServer.sh -config.reload=true
```
Accept the EULA, upload the license and login in the application.
(The credentials are mentioned above)
Note: If you install StreamAnalytix Webstudio for Apache (embedded mode), upload the license again (if prompted).
Also, before starting StreamAnalytix, if the underlying database is MySQL, then the MySQL connector jar should be placed in <StreamAnalytix_HOME>/server/tomcat/lib and <StreamAnalytix_HOME>/conf/thirdpartylib.
The MySQL jar version should be: mysql-connector-java-8.0.13.jar
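For example, assuming the jar has been downloaded to the current directory, it can be copied into both locations as sketched below:
```
cp mysql-connector-java-8.0.13.jar <StreamAnalytix_HOME>/server/tomcat/lib/
cp mysql-connector-java-8.0.13.jar <StreamAnalytix_HOME>/conf/thirdpartylib/
```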
The technology stack supported for StreamAnalytix deployment is as follows:
| Component Name | HDP 2.6.3 | HDP 2.6.5 | HDP 3.1.0 | Apache | CDH |
|---|---|---|---|---|---|
| Apache Zookeeper | 3.4.6 | 3.4.6 | 3.4.6 | 3.4.13 | 3.4.5 |
| Apache Hadoop | 2.7.3 | 2.7.3 | 3.1.1 | 2.8.5 | 2.6.0 |
| Apache YARN | 2.7.3 | 2.7.3 | 3.1.1 | 2.8.5 | 2.6.0 |
| Apache Kafka | 2.11-0.10.1 | 2.11-1.0.0 | 2.11-2.0.0 | 2.12-0.11.0 | 3.0.0(parcel)-0.11.0 |
| Apache HBase | 1.1.2 | 1.1.2 | 2.0.2 | 1.4.9 | 1.2.0 |
| Apache Spark | 2.2.0 | 2.3.0 | 2.3.2 | 2.4.0 | 2.3.0 |
| Apache Hive | 1.2.1000 | 1.2.1000 | 3.1.0 | 2.3.4 | 1.1.0 |
| Apache Solr | 5.5.2 | 5.5.5 | 7.4.0 | 7.0.0 | 4.10.3 |
| Apache Livy | 0.4.0 | 0.4.0 | 0.5.0 | 0.5.0 | 0.4.0 |
| Elasticsearch | 6.4.1 | 6.4.1 | 6.4.1 | 6.4.1 | 6.4.1 |
| Graphite | - | - | - | - | - |
| Apache Tomcat | 9.0.7 | 9.0.7 | 9.0.7 | 9.0.7 | 9.0.7 |
| Java | 1.8.0_77 | 1.8.0_121 | 1.8.0_201 | 1.8.0_201 | 1.8.0_201 |
| Postgres | 10.8 | 10.7 | 10.8 | 10.8 | 10.8 |
| MySQL | 5.7.19 | 5.7.19 | 5.7.19 | 5.7.24 | 5.7.19 |
| RabbitMQ | 3.6.10 | 3.6.10 | 3.6.10 | 3.6.10 | 3.6.10 |
Other services supported:
• Kerberos
Ambari managed HA services supported in StreamAnalytix Deployment
1. Ambari
2. HDFS
Set the Spark version to 2.2 for HDP 2.6.3.
Path -> Superuser > Configuration > Default > Spark > Spark Version
CDC Application Pre-requisites
CDC (Change Data Capture) is implemented using Debezium which is a distributed platform that turns your existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases. Debezium is built on top of Apache Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, from where your application consumes them. This makes it possible for your application to easily consume all the events correctly and completely.
StreamAnalytix supports four databases: MySQL, PostgreSQL, Oracle and MSSQL.
Before the Debezium MySQL connector can be used to monitor the changes committed on a MySQL server, the server must be set up to use row-level binary logging and have a database user with appropriate privileges.
Stop the mysqld service.
Modify the /etc/my.cnf file with the following parameters to enable MySQL binary logging, then save the file:
```
log-bin=mysql-bin
server-id=1
binlog_format=row
expire_logs_days=10
```
Start the mysqld service after the above changes are done.
For test purposes, we have created the database "test" with a table "books". Grant the required permissions to the user:
```
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'test' IDENTIFIED BY 'password';
```
Debezium MySQL jars and detailed documentation are available on
https://debezium.io/docs/connectors/mysql/.
Make sure Zookeeper, Kafka, and Kafka Connect are already installed. Download the connector's plugin archive, extract the JARs into your Kafka Connect environment, and add the directory with the JARs to Kafka Connect's classpath. Restart your Kafka Connect process to pick up the new JARs. This example uses an installation on HDP 3.1.0; please check the compatibility of the Debezium MySQL CDC packages with the installed Kafka version.
Extract the package to a directory and then copy the Debezium jar files under:
```
/usr/hdp/3.1.0.0-78/kafka/libs
```
Configure connect-standalone.properties or connect-distributed.properties depending on the cluster setup.
Make the below settings in /usr/hdp/3.1.0.0-78/kafka/config:
```
plugin.path=/usr/hdp/3.1.0.0-78/kafka/libs
rest.port=6669
```
You also need to create a new connector properties file, which will have all the information related to the MySQL server database configuration. More properties are available and can be enabled as per the requirements.
Below are the minimum required details:
Example: /usr/hdp/3.1.0.0-78/kafka/config/connector.properties
```
name=<mention any name for the connector>
connector.class=io.debezium.connector.mysql.MySqlConnector
database.user=<name of the MySQL user that has the required privileges>
database.history.kafka.bootstrap.servers=<list of Kafka brokers for this connector, host:port>
database.history.kafka.topic=<name of the DB history topic>
database.server.name=<any logical name of the MySQL server/cluster>
database.port=<MySQL DB port, i.e. 3306>
database.hostname=<the address of the MySQL server>
database.password=<the password for the MySQL user>
database.whitelist=<list of all databases hosted by this server that this connector will monitor; this is optional>
```
Once all the above settings are done correctly, start the Debezium connector using the command below:
```
nohup /usr/hdp/3.1.0.0-78/kafka/bin/connect-standalone.sh config/connect-standalone.properties config/connector.properties &
```
You can further monitor nohup.log for start-up messages or errors while starting the connector. Once the connector has started successfully, it is ready to record the data changes.
The MySQL connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. The name of the Kafka topics would be serverName.databaseName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property, databaseName is the name of the database where the operation occurred, and tableName is the name of the database table on which the operation occurred. For verification, make any changes into the MySQL DB, the changes should be reflected in the Kafka topic.
```
bin/kafka-console-consumer.sh --bootstrap-server <kafka bootstrap server>:6667 --topic test.test.books
```
For an SSL and Kerberos enabled environment, the following needs to be set in connect-standalone.properties:
```
# Kerberos
sasl.mechanism=GSSAPI
producer.security.protocol=SASL_PLAINTEXT
```
Also update the connector properties file, which has all the information related to the MySQL server database configuration (as mentioned in #3):
```
database.history.producer.security.protocol=SASL_PLAINTEXT
database.history.consumer.security.protocol=SASL_PLAINTEXT
```
To enable CDC for Postgres, a logical decoding output plugin needs to be installed and configured. For StreamAnalytix, the wal2json plugin is used. For more details please refer to https://debezium.io/docs/connectors/postgresql/#deploying-a-connector
This setup has been done with Postgres v10. Update the below configuration file for Postgres, /var/lib/pgsql/10/data/postgresql.conf:
```
# - Connection Settings -
listen_addresses = '*'
shared_preload_libraries = 'wal2json'   # (change requires restart)
wal_level = logical
max_wal_senders = 1
max_replication_slots = 3
```
Configure the PostgreSQL server to allow replication to take place between the server machine and the host on which the Debezium PostgreSQL connector is running, in pg_hba.conf:
```
local   replication   <youruser>                  trust   (1)
host    replication   <youruser>   127.0.0.1/32   trust   (2)
host    replication   <youruser>   ::1/128        trust   (3)
```
As mentioned, wal2json output plug-in is used for logical decoding. The wal2json output plug-in produces a JSON object per transaction.
Also note that the installation process requires the PostgreSQL utility pg_config. Verify that the PATH environment variable is set so that the utility can be found. If not, update the PATH environment variable appropriately:
```
PATH=/usr/pgsql-10/bin:$PATH
```
wal2json installation commands:
```
git clone https://github.com/eulerto/wal2json.git
cd wal2json
# Make sure your path includes the bin directory that contains the correct `pg_config`
PATH=/path/to/pg/bin:$PATH
USE_PGXS=1 make
USE_PGXS=1 make install
```
In case "USE_PGXS=1 make" does not work, install the Postgres dev tools first.
It is required to create a new connector properties file, which will have all the information related to the Postgres server database configuration. More properties are available and can be enabled as per the requirements. Below are the minimum required details:
```
plugin.name=wal2json
name=<mention any name for the connector>
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.user=<name of the Postgres user that has the required privileges>
database.server.name=<any logical name of the Postgres server/cluster>
database.port=<Postgres port, e.g. 5432>
database.hostname=<the address of the Postgres server>
database.password=<the password for the Postgres user>
database.dbname=<the name of the PostgreSQL database to connect to>
```
Once all the above settings are done correctly, start the Debezium connector using the command below:
```
nohup /usr/hdp/3.1.0.0-78/kafka/bin/connect-standalone.sh config/connect-standalone.properties config/connector.properties &
```
You can further monitor nohup.log for start-up messages or errors while starting the connector. Once the connector has started successfully, it is ready to record the data changes.
The name of the Kafka topics would be serverName.databaseName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property, databaseName is the name of the database where the operation occurred, and tableName is the name of the database table on which the operation occurred. For verification, make any changes in the configured DB table; the changes should be reflected in the Kafka topic.
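For example, the topic can be inspected with the console consumer in the same way as for MySQL; the broker port below is an assumption based on the earlier HDP example:
```
bin/kafka-console-consumer.sh --bootstrap-server <kafka bootstrap server>:6667 --topic <serverName>.<databaseName>.<tableName>
```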
For an SSL and Kerberos enabled environment, the following needs to be set in connect-standalone.properties:
```
# Kerberos
sasl.mechanism=GSSAPI
producer.security.protocol=SASL_PLAINTEXT
```
Also update the connector properties file, which has all the information related to the Postgres server database configuration (as mentioned in #3):
```
database.history.producer.security.protocol=SASL_PLAINTEXT
database.history.consumer.security.protocol=SASL_PLAINTEXT
```
The database operator must enable CDC for the table(s) that should be captured by the connector. For more detailed information, please refer to https://debezium.io/docs/connectors/sqlserver/#deploying-a-connector
Before using the Debezium SQL Server connector to monitor the changes committed on SQL Server, first enable CDC on the monitored database:
```
-- ====
-- Enable Database for CDC template
-- ====
USE MyDB
GO
EXEC sys.sp_cdc_enable_db
GO
```
Then enable CDC for each table that you plan to monitor:
```
-- =========
-- Enable a Table Specifying Filegroup Option Template
-- =========
USE MyDB
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'MyTable',
@role_name = N'MyRole',
@filegroup_name = N'MyDB_CT',
@supports_net_changes = 1
GO
```
Verify that the user has access to the CDC table:
```
-- Verify that the user of the connector has access; this query should not return an empty result
EXEC sys.sp_cdc_help_change_data_capture
GO
```
If the result is empty then please make sure that the user has privileges to access both the capture instance and CDC tables.
Extract the JARs into your Kafka Connect environment and add the directory with the JARs to Kafka Connect’s classpath. Restart your Kafka Connect process to pick up the new JARs.
To use the connector to produce change events for a SQL Server database or cluster, enable CDC on SQL Server to publish the CDC events in the database
Create a configuration file for the SQL Server Connector and use the Kafka Connect REST API to add that connector to your Kafka Connect cluster.
```
name=<mention any name for the connector>
connector.class=io.debezium.connector.sqlserver.SqlServerConnector
database.user=<name of the SQL Server user that has the required privileges>
database.server.name=<any logical name of the SQL Server instance/cluster>
database.dbname=<the name of the SQL Server database from which to capture changes>
database.port=<SQL Server port, i.e. 1433>
database.hostname=<the address of the SQL Server>
database.password=<the password for the SQL Server user>
database.whitelist=<a list of all tables whose changes Debezium should capture>
database.history.kafka.bootstrap.servers=<list of Kafka brokers for this connector, host:port>
database.history.kafka.topic=<name of the DB history topic>
```
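As mentioned above, the connector can also be added through the Kafka Connect REST API. The sketch below assumes Kafka Connect is running in distributed mode with its REST interface on the rest.port configured earlier (6669); the connector name, host names and credentials are placeholders:
```
curl -X POST -H "Content-Type: application/json" \
  --data '{ "name": "mssql-connector",
            "config": { "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
                        "database.hostname": "<mssql_host>", "database.port": "1433",
                        "database.user": "<user>", "database.password": "<password>",
                        "database.dbname": "<database>", "database.server.name": "<logical_name>",
                        "database.history.kafka.bootstrap.servers": "<kafka_broker>:6667",
                        "database.history.kafka.topic": "<history_topic>" } }' \
  http://<kafka_connect_host>:6669/connectors
```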
The name of the Kafka topics would be serverName.databaseName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property, databaseName is the name of the database where the operation occurred, and tableName is the name of the database table on which the operation occurred. For verification, make any changes into the configured DB table, the changes should be reflected in the Kafka topic.
For an SSL and Kerberos enabled environment, the following needs to be set in connect-standalone.properties:
```
# Kerberos
sasl.mechanism=GSSAPI
producer.security.protocol=SASL_PLAINTEXT
```
Also update the connector properties file, which has all the information related to the SQL Server database configuration (as mentioned in #3):
```
database.history.producer.security.protocol=SASL_PLAINTEXT
database.history.consumer.security.protocol=SASL_PLAINTEXT
```
As part of the CDC setup for Oracle, we need to enable Oracle LogMiner. Oracle LogMiner is part of Oracle Database and enables querying of online and archived redo log files through a SQL interface. Redo log files contain information about the history of activity on a database.
For setup in StreamAnalytix environment, Oracle VM VirtualBox manager is used.
```
# yum install VirtualBox-6.0
```
Please install the Linux kernel "header" files matching the current kernel for adding new hardware support to the system; the distribution packages containing the headers are the likely candidates. Then run:
```
# /usr/lib/virtualbox/vboxdrv.sh setup
```
Execute the following command to start it from the terminal, or use the launcher from the menu:
```
# VirtualBox
```
Identify the container database and the pluggable database in your environment.
Follow the below steps to configure LogMiner:
```
sqlplus sys/oracle@orcl12c as sysdba
shutdown immediate;
startup mount;
alter database archivelog;
alter database open;

ALTER SESSION SET CONTAINER=ORCL;
CREATE USER inventory IDENTIFIED BY oracle;
GRANT CONNECT TO inventory;
GRANT CONNECT, RESOURCE, DBA TO inventory;
CREATE TABLE inventory.customers(id number(10),first_name varchar2(20),last_name varchar2(20),email varchar2(20),modified_date timestamp);
ALTER SESSION SET CONTAINER=cdb$root;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SET db_recovery_file_dest_size = 50G SCOPE=BOTH SID='*';
CREATE USER c##cdc IDENTIFIED BY oracle CONTAINER=all;
GRANT create session, alter session, set container, select any dictionary, logmining, execute_catalog_role TO c##cdc CONTAINER=all;

ALTER SESSION SET CONTAINER=ORCL;
GRANT select on inventory.customers TO c##cdc;
ALTER SESSION SET CONTAINER=cdb$root;
EXECUTE DBMS_LOGMNR_D.BUILD(OPTIONS=> DBMS_LOGMNR_D.STORE_IN_REDO_LOGS);

sqlplus sys/oracle@orcl as sysdba
INSERT INTO inventory.customers VALUES (1,'NN','MM','nn@te',CURRENT_TIMESTAMP);
INSERT INTO inventory.customers VALUES (2,'NN','MM','nn@te',CURRENT_TIMESTAMP);
commit;
DELETE FROM inventory.customers;
commit;
```
After completing the above steps, go to StreamAnalytix UI and configure new Oracle connection.
Appendix 1: StreamAnalytix Pre-requisites
1. Verify that you have a /usr/java directory. If not, create one:
```
$ mkdir /usr/java
```
2. Download the Oracle 64-bit JDK (jdk-8u101-linux-x64.tar.gz) from the Oracle download site. Open a web browser and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
3. Copy the downloaded jdk.tar.gz file to the /usr/java directory.
4. Navigate to the /usr/java directory and extract the jdk.tar.gz file.
```
$ cd /usr/java
$ tar zxvf jdk-8u101-linux-x64.tar.gz
```
5. The JDK files will be extracted into the /usr/java/jdk1.8.0_101 directory.
6. Create a symbolic link (symlink) to the JDK:
```
$ ln -s /usr/java/jdk1.8.0_101 /usr/java/default
$ ln -s /usr/java/jdk1.8.0_101/bin/java /usr/bin/java
```
7. Set the JAVA_HOME and PATH environment variables.
```
$ export JAVA_HOME=/usr/java/default
$ export PATH=$JAVA_HOME/bin:$PATH
```
8. Run the below commands to notify the system that the new Java version is ready for use.
```
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/java/jdk1.8.0_101/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/java/jdk1.8.0_101/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/java/jdk1.8.0_101/bin/javaws" 1
```
9. Verify that Java is installed in your environment by running the following command.
```
$ java -version
```
10. You should see output similar to the following:
```
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.101-b01, mixed mode)
```
Access Rights to Livy user
Create a directory on HDFS if it does not exist, using the below commands:
```
<HADOOP_HOME>/bin/hadoop fs -mkdir /hadoop
# Now, give full permission to the directory:
<HADOOP_HOME>/bin/hadoop fs -chmod -R 777 /hadoop
```
Permission 777 is required on the directories configured in the below common.yaml properties:
```
livy.sample.data.hdfs.path
livy.custom.jar.hdfs.path
```
Run the commands:
```
<HADOOP_HOME>/bin/hadoop fs -chmod -R 777 /user/hdfs/sax/auto-detection/data/
<HADOOP_HOME>/bin/hadoop fs -chmod -R 777 /user/hdfs/sax/auto-detection/custom-jar/
```
Log in to Ambari, set the following properties in the 'Spark-2' Livy configuration and restart the Spark-2 service, as shown below:
```
livy.server.csrf_protection.enabled = false
# If the below property is set to 'true', make sure that Hive is installed and running properly. Otherwise set this property to 'false'.
livy.repl.enableHiveContext = false
```
Note: Please validate the memory of the node manager and container; it should be greater than 512 + 384 MB (384 is 75% of 512).
Erlang is required before installing RabbitMQ; use the below commands to install it:
```
yum -y install epel-release
yum -y install erlang socat
```
1. Run the following command to download the RabbitMQ RPM file:
```
wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.10/rabbitmq-server-3.6.10-1.el7.noarch.rpm
```
2. Run the below commands to install RabbitMQ:
```
rpm --import https://www.rabbitmq.com/rabbitmq-release-signing-key.asc
rpm -Uvh rabbitmq-server-3.6.10-1.el7.noarch.rpm
```
3. Start and enable RabbitMQ using the following commands:
```
systemctl start rabbitmq-server
systemctl enable rabbitmq-server
```
4. Check the status:
```
systemctl status rabbitmq-server
```
5. Run the following command to enable the RabbitMQ management plugin:
```
sudo rabbitmq-plugins enable rabbitmq_management
```
6. Open the below URL to access the RabbitMQ UI:
```
http://<IP>:15672
```
7. Run these commands to create the RabbitMQ user test with password test:
```
rabbitmqctl add_user test test
rabbitmqctl set_user_tags test administrator
rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
```
Log in with the credentials: test/test
NOTE:
For creating a user with guest/guest:
1. Create a file named rabbitmq.config in /etc/rabbitmq
2. Copy [{rabbit, [{loopback_users, []}]}]. into the file
3. Restart RabbitMQ (see the sketch below)
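A minimal sketch of the resulting file, based on the steps above; the restart command is an assumption consistent with the systemctl commands used earlier in this section:
```
# contents of /etc/rabbitmq/rabbitmq.config
[{rabbit, [{loopback_users, []}]}].
```
Then restart the broker with systemctl restart rabbitmq-server.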
In order to prevent PostgreSQL from being installed with an older version, you need to add the following line in the appropriate repository configuration file:
```
exclude=postgresql*
```
File path for making the above entry differs according to the OS:
• File path for CentOS machine: /etc/yum.repos.d/CentOS-Base.repo (in [base] and [updates] sections both)
• File path for RHEL machine: /etc/yum/pluginconf.d/rhnplugin.conf (in [main] section only)
A PGDG file is available for each distribution/architecture/database version combination.
Install the Postgres repository in the system; use one of the commands below as per the system architecture and operating system.
CentOS/RHEL version 6.x, 64-bit:
```
$ rpm -Uvh https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-6-x86_64/pgdg-redhat10-10-2.noarch.rpm
```
CentOS/RHEL version 7.x, 64-bit:
```
$ rpm -Uvh https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-2.noarch.rpm
```
Install the basic PostgreSQL 10 server using the below command:
```
$ yum install postgresql10-server postgresql10
```
After installing the PostgreSQL server, it is required to initialize it before using it. To initialize the database, run the below command:
```
$ service postgresql-10 initdb
```
NOTE: In case the above command gives any error, try one of the following commands:
```
$ /etc/init.d/postgresql-10 initdb
$ /usr/pgsql-10/bin/postgresql-10-setup initdb
```
Server configuration and Startup
To start the PostgreSQL server automatically on system boot, run the following command:
```
$ chkconfig postgresql-10 on
```
Configure Connection
1. Replace the following line in the /var/lib/pgsql/10/data/pg_hba.conf file:
```
host all all 127.0.0.1/32 ident
```
with the following line:
```
host all all all md5
```
2. Replace the following line in the /var/lib/pgsql/10/data/postgresql.conf file:
```
#listen_addresses = 'localhost'
```
with the following line:
```
listen_addresses = '*'
```
NOTE: Do not forget to uncomment the above line, it is commented by default.
• Start Server
Start the PostgreSQL service using the following command:
```
$ service postgresql-10 start
```
Verify PostgreSQL Installation
After completing the PostgreSQL 10 installation on the server, perform a basic sanity check to verify whether the installation has completed successfully. To verify, switch to the postgres user from the root user (first time only) with the following command:
```
$ su postgres
```
• Use the psql command to access the PostgreSQL console:
```
bash-4.3$ psql
```
Change the password of a PostgreSQL user with the following command (the default password is postgres):
```
postgres=# alter user <<username>> with password '<<newpassword>>';
postgres=# alter user postgres with password 'scott';
```
NOTE: Please keep the new password under quotes.
Create New User (Optional):
In PostgreSQL, the default root user is 'postgres'. If you want to create a new user with login permission, use the following commands:
```
$ sudo -u postgres psql postgres
postgres=# create role <<new_user_name>> login password '<<new_password>>';
```
1. In <<installationDir>>/db_dump/pgsql_1.2, you will find activiti.sql.
2. Create a new database for Activiti; you will need to point to this database while configuring the StreamAnalytix application.
3. Import <<installationDir>>/db_dump/pgsql_1.2/activiti.sql:
```
$ psql -U postgres -d <<activiti_db_name>> -h <<pgsql_host>> -f <<installationDir>>/db_dump/pgsql_1.2/activiti.sql
```
Livy 0.4.0 is packaged with HDP 2.6.3 and HDP 2.6.5; Livy 0.5.0 is packaged with the HDP 3.1.0 stack as general availability.
Verify LIVY installation by following the below steps:
1. Login into Ambari console.
2. Go to Dashboard and select Spark or Spark2 service from left Pane.
3. For Spark2, 'Livy for Spark2 Server' should be running.
If Livy is not installed, follow these steps to install Livy.
Note: When using Livy, StreamAnalytix pipelines should be saved/submitted in the same mode (cluster/client) that is configured on Livy.
To verify the Livy mode, follow the below steps:
1. Log in to Ambari.
2. Go to Dashboard and select the specific Spark version.
3. Click on the 'Config' tab.
4. Now, search for the 'livy.spark.master' property. If its value is 'yarn-cluster' or 'cluster', then Livy is configured to support cluster mode; otherwise the pipelines must be run in client mode.
In client mode, Livy picks up the application binary from the local file system. Set the below property, which tells Livy to add a local directory to its whitelist:
```
livy.file.local-dir-whitelist=/home/sax/StreamAnalytix/lib/
```
Make sure that the local directory (the directory where the application binaries reside) is mounted on NFS (see https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-centos-6).
Alternatively, StreamAnalytix can be installed on the node where Livy is deployed in order to support local mode.
4. Log in to Ambari.
5. Go to Hosts and select the node on which Livy is to be installed.
6. Now click on the '+Add' button to install 'Livy for Spark2 Server'.
Installation for CDH/Apache Cluster
To build Livy for a CDH/Apache cluster, use the following link.
For CDH, change the below configuration in <Livy install dir>/conf/livy.conf:
```
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
```
Note: If Livy is configured in cluster/client mode, then the StreamAnalytix pipelines should be saved compatibly, else the pipeline will not be submitted to Spark.
To enable Livy support, configure the below StreamAnalytix properties in the env-config.yaml file.
StreamAnalytix Properties:
```
#job-server, spark-submit, livy
```
In client mode, Livy picks up the application binary from the local file system. Set the below property, which tells Livy to add a local directory to its whitelist, else Livy does not accept it:
```
livy.file.local-dir-whitelist=/home/sax/StreamAnalytix/lib/
```
Kerberos Environment with Livy
Points to remember for a Kerberos-enabled installation that is configured with Livy:
1. The following property value should be false: livy.impersonation.enabled = false
2. During pipeline submission, if keytabs are uploaded, then it is mandatory to mount the /tmp/kerberos folder on the Livy node at the same location; if a keytab file path is provided instead, make sure all nodes have all the keytabs at the same location.
For example, you have a cluster of multiple nodes with StreamAnalytix on Node A and Livy on Node B. During the process, StreamAnalytix exports the uploaded keytabs to the /tmp/kerberos folder (on Node A) on playing a pipeline. Therefore, you should mount the /tmp/kerberos folder on the machine where Livy is running (Node B), since Livy will not otherwise find the keytabs uploaded on Node A. If you cannot mount the folder, then do not upload keytabs; instead supply the keytab file path and make sure that all the keytabs on the Livy node are at the same location.
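A minimal sketch of such a mount, assuming Node A exports /tmp/kerberos over NFS and the NFS client utilities are installed on the Livy node (the host name is a placeholder):
```
# on the Livy node (Node B)
mkdir -p /tmp/kerberos
mount -t nfs <nodeA_host>:/tmp/kerberos /tmp/kerberos
```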
Configure Livy in StreamAnalytix
To enable Livy support, configure the below StreamAnalytix properties in the env-config.yaml file.
Webstudio Properties:
```
# job-server, spark-submit, livy
job.submit.mode: "livy"
# Livy URL; by default Livy runs on 8999 (HDP) and 8998 (Apache + CDH)
livy.url: "http://localhost:<LIVY PORT>"
```
To install Elasticsearch, follow the steps mentioned below:
1. Download the Elasticsearch binary (.tar.gz) version 6.4.1 from the below URL:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.1.tar.gz
2. Extract the tar.gz using the below commands:
```
$ tar -xvf elasticsearch-6.4.1.tar.gz -C <<installationDir>>
$ cd <<installationDir>>/<<extractedDir>>
```
3. Open config/elasticsearch.yml:
```
cluster.name: ES641
node.name: <IP of the machine>
path.data: /home/sax/elasticsearch-6.4.1/data2
path.logs: /home/sax/elasticsearch-6.4.1/logs2
network.host: <IP of the machine>
http.port: 9200
discovery.zen.ping.unicast.hosts: ["IP"]
```
Place this at the end of the file:
```
action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,sax-meter*,sax_audit_*,*-sax-model-index,sax_error_*,true
```
NOTE: Run the following if Elasticsearch reports an error about vm.max_map_count:
```
sudo sysctl -w vm.max_map_count=262144
```
4. To start Elasticsearch in the background:
```
nohup ./bin/elasticsearch &
```
Enable SSL in Elasticsearch
To enable SSL, perform the following steps on each node in the cluster:
5. Manually download the X-Pack zip file from the below URL:
https://artifacts.elastic.co/downloads/packs/x-pack/x-pack-6.2.4.zip
6. Run $ES_HOME/bin/elasticsearch-plugin install on each node in your cluster:
```
$ $ES_HOME/bin/elasticsearch-plugin install file:///path/to/file/x-pack-6.2.4.zip
```
7. Confirm that you want to grant X-Pack additional permissions.
8. X-Pack will try to automatically create several indices within Elasticsearch. By default, Elasticsearch is configured to allow automatic index creation and no additional steps are required. However, if you have disabled automatic index creation in Elasticsearch, you must configure action.auto_create_index in elasticsearch.yml to allow X-Pack to create the following indices:
<<installationDir>>/<<extractedDir>>/config/elasticsearch.yml:
```
action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*
```
Generating Node Certificates
9. Create a certificate authority for your Elasticsearch cluster. Substitute <DOMAIN_NAME> with your machine's domain name and <node1> and <node2> with the node names or IP addresses of the machines which will be part of the Elasticsearch cluster:
```
$ keytool -genkeypair -keystore es-certificate.p12 -storetype PKCS12 -storepass elastic -alias esSSL1 -keyalg RSA -keysize 2048 -validity 99999 -dname "CN=DOMAIN_NAME, OU=My Team, O=My Company, L=My City, ST=My State, C=SA" -ext san=dns:DOMAIN_NAME,dns:localhost,ip:127.0.0.1,ip:node1,ip:node2
```
10. Copy the node certificate to the appropriate locations. Copy the generated .p12 file into an Elasticsearch configuration directory on each node, for example /home/es/config/certs.
11. Add the Elasticsearch certificate to the Java cacerts of each machine from which you connect to Elasticsearch (i.e. the YARN and StreamAnalytix nodes) using the below command:
```
$ keytool -importkeystore -srckeystore /path-to-p12-file/es-cer.p12 -destkeystore $JAVA_HOME/jre/lib/security/cacerts -srcstoretype pkcs12
```
The above command must be run with the root or a sudo account. It will prompt for the destination keystore password, if one has been set earlier, and the source keystore password, which is 'elastic' in our case.
Enable SSL between nodes in a Cluster
12. Enable TLS and specify the information required to access the node's certificate. Add the following information to the <<installationDir>>/<<extractedDir>>/config/elasticsearch.yml file on each node:
```
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/es-certificate.p12
xpack.security.transport.ssl.truststore.path: certs/es-certificate.p12
```
13. If you have secured the node's certificate with a password, add the password to your Elasticsearch keystore. The password set earlier was 'elastic'; enter the same when prompted.
```
$ bin/elasticsearch-keystore add xpack.security.transport.ssl.keystore.secure_password
$ bin/elasticsearch-keystore add xpack.security.transport.ssl.truststore.secure_password
```
Encrypting HTTP Client Communication
14. Enable TLS and specify the information required to access the node’s certificate.
Add the following information to the <<installationDir>>/<<extractedDir>>/config/elasticsearch.yml file on each node:
```
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/es-certificate.p12
xpack.security.http.ssl.truststore.path: certs/es-certificate.p12
```
15. If you have secured the node's certificate with a password, add the password to your Elasticsearch keystore. The password set earlier was 'elastic'; enter the same when prompted.
```
$ bin/elasticsearch-keystore add xpack.security.http.ssl.keystore.secure_password
$ bin/elasticsearch-keystore add xpack.security.http.ssl.truststore.secure_password
```
16. Configure additional properties in the <<installationDir>>/<<extractedDir>>/config/elasticsearch.yml file under the extracted folder. Uncomment the following properties; make sure there is a space at the start of each line (just remove the #, do not remove the space):
```
 cluster.name
 node.name
 path.data
 path.logs
```
Elasticsearch nodes join a cluster based on just one property named cluster.name.
For example, if you want to add the node to cluster ‘mass_deployment’, change the value of property ‘cluster.name’ to ‘mass_deployment’ as follows:
cluster.name: mass_deployment |
This should be same across all nodes of the cluster. This value will be required while configuring Elasticsearch in StreamAnalytix.
The node name should be unique for each ES node in a cluster. This is defined by the 'node.name' property.
For example, if the user wants to deploy three nodes for the cluster, the names can be 'node0', 'node1' and 'node2'.
• node.name: 'node0'
This should be unique for each node in the cluster.
• node.tag: 'node0'
This should be unique for each node and the same as node.name. Also, use the convention node0, node1, nodeN.
• path.data: /path/to/data/dir
The path of the directory where the node stores its data.
• discovery.zen.ping.unicast.hosts: ["<hostname/ip>"]
This property is used to perform discovery when a new node is started. The default list of hosts is ["127.0.0.1", "[::1]"].
• node.master: true
Set this property to create a dedicated master-eligible node.
• node.data: true
This property defines a data node, which holds data and performs data-related operations.
• bootstrap.memory_lock: true
Locks the memory for better performance of Elasticsearch.
• transport.tcp.port: 9300
• transport.bind_host: <hostname/IP>
• transport.host: <hostname/IP>
• network.host: <hostname/IP>
• http.port: 9200
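Taken together, a minimal elasticsearch.yml for one node of such a cluster might look like the sketch below; the host names, paths and the cluster name are placeholders based on the properties above:
```
cluster.name: mass_deployment
node.name: node0
path.data: /path/to/data/dir
path.logs: /path/to/logs/dir
network.host: <hostname/IP>
http.port: 9200
transport.tcp.port: 9300
discovery.zen.ping.unicast.hosts: ["<hostname/ip>"]
node.master: true
node.data: true
bootstrap.memory_lock: true
```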
Note: For viewing monitoring errors, add the following property in elasticsearch.yml:
```
action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,sax-meter*
```
17. Specify the heap size for Elasticsearch by adding the below lines to the file <<installationDir>>/<<extractedDir>>/config/jvm.options:
```
-Xms4g
-Xmx4g
```
18. Make sure to increase the limit on the number of open files descriptors for the user running Elasticsearch to 65,536 or higher. Run below command as root before starting Elasticsearch, or set nofile to 65536 in /etc/security/limits.conf.
```
$ ulimit -n 65536
```
19. Set the password of the built-in elastic user. You must explicitly set a bootstrap.password setting in the keystore before you start Elasticsearch.
For example, the following command prompts you to enter a new bootstrap password:
```
$ bin/elasticsearch-keystore add "bootstrap.password"
```
The password you set above will be required to log in to the Elasticsearch cluster URL using 'elastic' as the superuser.
20. Change the ownership of the Elasticsearch installation directory and start the Elasticsearch node by logging in as a non-root user. This is done to enable memory locking, as follows:
```
$ chown -R <<non-root_user>> <<installationDir>>
$ <<installationDir>>/<<extractedDir>>/bin/elasticsearch -d
```
21. To enable the Elasticsearch plugin, open the Google Chrome browser and install the extension 'elasticsearch-head'.
22. To access the Elasticsearch cluster, click on the 'elasticsearch-head' plugin icon in your browser, enter the cluster details as below and hit 'connect':
```
http://<es_http_node>:<es_http_port>
```
Couchbase-server-community-5.1.1
To install Couchbase, follow the steps mentioned below:
1. Download the rpm file from below URL.
2. To install Couchbase, run the below command:
```
rpm --install couchbase-server-community-5.1.1-centos7.x86_64.rpm
```
The command will install the Couchbase DB and start the service. After running the command above, you will receive the following URL in the output:
```
http://<<HOSTNAME>>:8091/
```
Open the URL in a browser, then follow the steps to create the cluster.
Step 1: Click on Setup New Cluster.
Step 2: Provide the Cluster Name, Username and Password, then click on Next.
Step 3: Accept the terms and conditions, and click on Finish with Defaults. You can also configure Disk, Memory and Service as per your requirements.
Step 4: Cluster setup is now complete. Log in with the username and password set up in the previous step.
Follow the below URL to enable LLAP on Hive
Appendix-2 Post Deployment Steps
1. Find the dashboard folder inside STREAMANALYTIX_HOME and untar the dashboard.tar.gz file using the command: tar xvf dashboard.tar.gz (on the machine where the StreamAnalytix admin UI is not installed).
2. Create a database named dashboardrepo in the Postgres DB.
3. Set the below lines in /dashboard/reportengine/config/ReportEngine.dat and replace <USER>, <PostGres_IP> and <PASSWORD> below with actual values.
```
RepositoryDB.SID=dashboardrepo
RepositoryDB.charsetEncoding=
RepositoryDB.connectionType=DB
RepositoryDB.dataCachePurgeFrequency=30
RepositoryDB.incrementSize=5
RepositoryDB.initialConnections=5
RepositoryDB.isBlank=false
RepositoryDB.isCubeRepository=false
RepositoryDB.isDefault=false
RepositoryDB.isReadOnly=true
RepositoryDB.isRepository=true
RepositoryDB.isSecureConnection=FALSE
RepositoryDB.isStaging=false
RepositoryDB.maxConnections=30
RepositoryDB.metaDataCachePurgeFrequency=BOOTUP
RepositoryDB.metadataCachingEnabled=true
RepositoryDB.password=<PASSWORD>
RepositoryDB.poolConnections=
RepositoryDB.port=5432
RepositoryDB.provider=POSTGRES
RepositoryDB.reSubmitIdleTime=30
RepositoryDB.server=<PostGres_IP>
RepositoryDB.timeZone=
RepositoryDB.url=jdbc:postgresql://<PostGres_IP>:5432/dashboardrepo
RepositoryDB.useRuntimeCredential=false
RepositoryDB.user=<USER>
```
4. Steps to change the default port of the Jakarta server:
a. Copy the sax-dashboard folder from [Dashboard_installation_path]/jakarta/webapps to [SAX Tomcat_Home]/webapps
b. Start the Intellicus Report Server and Web Server:
```
sudo ./reportserver.sh start
```
c. Enable Dashboard in StreamAnalytix; set the below properties in env-config.yaml.
Note: Properties for enabling the dashboard in StreamAnalytix need only be set on the machine that is hosting the StreamAnalytix web admin.
Location: STREAMANALYTIX_HOME/conf/yaml/env-config.yaml
```
intellicus:
  sax.url: http://<IP>:<PORT>/sax-dashboard
```
NOTE: Replace <IP> and <PORT> with the dashboard client IP and port.
Location: STREAMANALYTIX_HOME/conf/common/dashboard-int/ReportClient.properties
```
REPORT_ENGINE_IP=<INSTALLATION_MACHINE_IP>
```
Location: STREAMANALYTIX_HOME/conf/yaml/common.yaml
```
dashboard.enabled=true
```
d. Restart the StreamAnalytix admin server (Tomcat)
e. Log in to StreamAnalytix as the Admin.
Dashboard Synchronization Steps:
Perform the below synchronization steps in order to sync the existing users and other components with Dashboard.
NOTE:
• If the Dashboard is set up after the creation of multiple users, the migration steps are mandatory in order to sync the users with the dashboard.
• Make sure to apply step (e), i.e. log in with your StreamAnalytix admin credentials, before the sync.
1. Open a REST client in the browser.
2. Enter the below URL in the address bar:
http://<StreamAnalytix_IP>:<PORT>/StreamAnalytix/dashboard/sync
Use the HTTP method GET.
3. Use basic authentication with username superuser and password superuser.
4. Click on the SEND button.
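Equivalently, the sync can be triggered from a terminal; the sketch below assumes curl is available and uses the same URL and credentials as above:
```
curl -u superuser:superuser -X GET "http://<StreamAnalytix_IP>:<PORT>/StreamAnalytix/dashboard/sync"
```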
To install Kafka, follow the steps mentioned below
1. Download the Kafka binary (.tar.gz) version 0.10.2.1 from the below URL:
https://www.apache.org/dist/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
2. Extract the tar.gz using the below commands:
```
$ tar -xvf kafka_2.12-0.10.2.1.tgz -C <<installationDir>>
$ cd <<installationDir>>/<<extractedDir>>
```
To enable SSL on Kafka, follow the steps mentioned below:
Perform the following steps on each node in the cluster:
Generating Node Certificates:
3. Create a certificate authority for your Kafka cluster. Substitute <<Domain Name>> with your machine's domain name on all nodes, along with the keystore password and validity.
NOTE: The passwords should be the same.
```
$ keytool -genkeypair -keystore kafka.keystore -keyalg RSA -alias <<Domain Name>> -dname "CN=$(hostname -f)" -storepass <<password>> -keypass <<password>> -validity 32767
```
4. On all the nodes, rename the keystore file to a jks file:
```
$ mv kafka.keystore kafka.jks
```
5. Generate a self-signed certificate on all the nodes:
```
$ keytool -export -alias <<Domain name of host>> -keystore kafka.jks -rfc -file selfsigned.cer
```
6. Rename selfsigned.cer to selfsigned<hostname/ip>.pem:
```
$ mv selfsigned.cer selfsigned<hostname/ip>.pem
```
7. Copy the selfsigned<hostname/ip>.pem file from all the nodes to the Kafka server where the truststore file will be generated:
```
$ scp selfsigned<hostip/name>.pem <<IP_address of Kafka server>>:/path_of_certificate
```
8. Import the self-signed certificates into the truststore on the node where the truststore file will be generated:
```
$ keytool -keystore truststore.jks -import -alias <<Hostname_of_the_node>> -file selfsigned<<hostname/ip>>.pem
```
9. Copy the truststore file from that server to all the other nodes, at the same path:
```
$ scp truststore.jks <hostname/ip of kafka brokers>:/path_of_certificate
```
10. Place kafka.jks in the same path as the certificate. Change the file permissions of kafka.jks and truststore.jks on all nodes:
```
$ chmod 777 kafka.jks truststore.jks
```
Configure SSL on all nodes of the Kafka Cluster
1. Enable TLS and specify the information required to access the node's certificate.
Add the following information to the <<installationDir>>/<<extractedDir>>/config/server.properties file on each node:
```
listeners=SSL://<<hostname>>:9093
advertised.listeners=SSL://<<hostname>>:9093
ssl.keystore.location=<<kafka.jks file location>>
ssl.keystore.password=<<keystore password>>
ssl.key.password=<<key password>>
ssl.truststore.location=<<truststore.jks file location>>
ssl.truststore.password=<<truststore password>>
security.inter.broker.protocol=SSL
```
2. Configure more properties in the <<installationDir>>/<<extractedDir>>/config/server.properties file under the extracted folder.
Note: broker.id should be different for each Kafka broker.
```
broker.id=
log.dirs=
zookeeper.connect=<<IP address of zookeeper>>:2181
```
Start the Kafka servers on all nodes:
```
$ nohup bin/kafka-server-start.sh config/server.properties &
```
Python Configuration (Optional)
StreamAnalytix supports Python 2 by default. However, in order to support Python 3, configure it as mentioned below:
Python 2 (preferably 2.7) and Python 3 (preferably 3.5) should be installed on all nodes of the cluster.
Python 2 should be the default Python version on all nodes.
Install Python 3 with the root user and create soft links to the Python 3 binaries in /usr/bin, so that they are available to all users (see the sketch after this list).
Python 3 should be accessible with the command 'python3'.
All the required libraries should be installed for both Python versions (not mandatory).
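A minimal sketch of the soft-link step, assuming Python 3.5 was installed under /usr/local/bin (adjust the source path to your installation):
```
# make python3 available to all users
sudo ln -s /usr/local/bin/python3.5 /usr/bin/python3
```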
Airflow version: 1.10.1
1. Create a folder that will be used as the Airflow home (with the sax user):
```
sax> mkdir /home/sax/airflow_home
```
2. Create a dags folder:
```
sax> mkdir $AIRFLOW_HOME/dags
```
3. Log in with the root user, open the .bashrc file and add the following property in it:
```
export SLUGIFY_USES_TEXT_UNIDECODE=yes
```
4. Log in with the StreamAnalytix user, open the .bashrc file and add the following in it:
```
export AIRFLOW_HOME=/home/sax/airflow_home
```
5. Install Airflow using the following command (with the root user):
```
root> pip install apache-airflow==1.10.1
```
6. Initialize the Airflow database (with the StreamAnalytix user):
```
sax> airflow initdb
```
Note: Step 7 and Step 8 will be performed after Sub-Package Installation, Configuration and Plugin Installation is successfully completed.
7. Start the Airflow webserver with the StreamAnalytix user (the port number is taken from the configuration):
```
sax> airflow webserver
```
8. Start the Airflow scheduler:
```
sax> airflow scheduler
```
To install the sub-packages (with the root user):
```
root> pip install apache-airflow[hdfs]==1.10.1
root> yum install mariadb-devel
root> pip install apache-airflow[mysql]
root> pip install apache-airflow[mssql]==1.10.1
root> pip install apache-airflow[postgres]==1.10.1
root> pip install apache-airflow[ssh]==1.10.1
root> pip install apache-airflow[vertica]==1.10.1
root> pip install kafka-python==1.4.6
root> pip install holidays==0.9.10
```
Note: The supported file system and database are HDFS and MySQL.
For more details, please refer link:
https://airflow.apache.org/installation.html
For Kerberos support, install the following (with the root user):
```
root> yum install cyrus-sasl-devel.x86_64
root> pip install apache-airflow[kerberos]
```
Go to $AIRFLOW_HOME, open the airflow.cfg file, and change the following properties:
• default_timezone = system
• base_url = http://ipaddress:port
• web_server_host = ipaddress
• web_server_port = port (i.e. 9292)
• Add SMTP details for email under the [smtp] section in the config file. Uncomment and provide values for the following:
  • smtp_host
  • smtp_user
  • smtp_password
  • smtp_port
  • smtp_mail_from
• catchup_by_default = False
A consolidated example is sketched after this list.
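For reference, the corresponding fragment of airflow.cfg might look like the sketch below; the IP, port and SMTP values are placeholders:
```
[core]
default_timezone = system

[webserver]
base_url = http://<ipaddress>:9292
web_server_host = <ipaddress>
web_server_port = 9292

[smtp]
smtp_host = <smtp_host>
smtp_user = <smtp_user>
smtp_password = <smtp_password>
smtp_port = <smtp_port>
smtp_mail_from = <smtp_mail_from>

[scheduler]
catchup_by_default = False
```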
By default, Airflow uses SQLite as its database. It also allows the user to change the database. Following are the steps to configure Postgres as the database:
1. Create the Airflow user:
```
sudo -u postgres createuser --interactive
Enter name of role to add: airflow
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) n
```
2. Set a password for the Airflow user:
```
postgres=# ALTER USER airflow WITH PASSWORD '<password>';
```
3. Create the Airflow database:
```
postgres=# CREATE DATABASE airflow;
```
4. Open the airflow.cfg file and provide the Postgres details (username, password, ipaddress:port and database name):
sql_alchemy_conn = postgresql://username:password@ipaddress:port/databasename |
5. Now run the following command to set up the database:
sax> airflow initdb |
Steps to add the StreamAnalytix Airflow plugin to Airflow:
1. Create a plugins folder in the Airflow home if it does not exist, i.e. $AIRFLOW_HOME/plugins.
2. Untar <sax_home>/conf/common/airflow-plugin/sax_airflow_rest_api_plugin.tar.gz.
3. Copy sax_airflow_rest_api_plugin/* to the Airflow plugins folder (see the sketch below).
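A minimal sketch of these three steps as shell commands, assuming the archive is extracted into the current directory:
sax> mkdir -p $AIRFLOW_HOME/plugins
sax> tar -xzf <sax_home>/conf/common/airflow-plugin/sax_airflow_rest_api_plugin.tar.gz
sax> cp -r sax_airflow_rest_api_plugin/* $AIRFLOW_HOME/plugins/ |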
Token-based authentication is supported.
Provide the token in the request header. The same token key and value must be provided in the Airflow config file.
Add the following entry in the $AIRFLOW_HOME/airflow.cfg file:
[sax_rest_api]
# key and value to authenticate http request
sax_request_http_token_name = <sax_request_http_header_token>
sax_request_http_token_value = <token> |
Here,
• <sax_request_http_header_token>: Replace with the key used in the request header for the token.
• <token>: Replace with the token value.
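As an illustration, a client call to the plugin would then carry the configured header; the endpoint path below is a placeholder, not an actual route from this document:
curl -H "<sax_request_http_header_token>: <token>" http://<airflow_host>:<port>/<plugin_endpoint> |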
To configure Airflow in StreamAnalytix, refer to the Airflow section in the user guide.
Installing Jupyter and Sparkmagic on Centos/RHEL
You can install Jupyter using Docker or on the Host Machine.
To install using Docker, follow the link below:
https://hub.docker.com/r/streamanalytiximpetus/jupyter
To install on the host machine, follow the steps below:
Jupyter requires Python 2.7 to be installed. Please make sure to install Python 2.7 before proceeding.
In addition, the following libraries are required:
gcc (sudo yum install gcc)
python-devel (sudo yum install python-devel)
krb5-devel (sudo yum install krb5-devel) |
Log in as the root user and install pip using the following command:
root> yum install python-pip |
To install Jupyter, log in as the root user and use the following command:
root> pip install jupyter |
If the following error occurs while installing Jupyter:
ERROR: ipykernel requires Python version 3.4 or above.
then first run the following commands:
root> pip install ipython==5.7
root> pip install ipykernel==4.10 |
Now install Jupyter again.
As a root user, run the following command:
root> pip install jupyter_contrib_nbextensions |
As a ‘streamanalytix’ user, run the following command:
streamanalytix> jupyter notebook --generate-config |
It will create a jupyter_notebook_config.py file. You can uncomment and provide parameters in that file.
The config file is located under the user's home directory at the path mentioned below:
• ~/.jupyter/
Once the config file is generated, uncomment and change the following entries in the file:
c.NotebookApp.notebook_dir = u'/home/sax/notebooks' (default notebook directory)
Note: If you change the notebook directory path, the same needs to be updated in env.config.yaml (jupyter.dir).
c.NotebookApp.ip = <ip address of the machine where the Jupyter service will run>
c.NotebookApp.tornado_settings = {'headers': {'Content-Security-Policy': "frame-ancestors http://sax_ip_and_port 'self' "}} |
Run the following commands as the StreamAnalytix user:
streamanalytix> jupyter notebook password (to set a password)
streamanalytix> jupyter contrib nbextension install --user
streamanalytix> jupyter nbextension install --py widgetsnbextension (or jupyter nbextension install --py widgetsnbextension --user)
streamanalytix> jupyter nbextension enable widgetsnbextension --py (or jupyter nbextension enable widgetsnbextension --user --py)
streamanalytix> jupyter nbextension enable hide_input/main
streamanalytix> jupyter nbextension enable init_cell/main |
To start the Jupyter service, run the following command as the StreamAnalytix user:
streamanalytix> jupyter notebook |
Install StreamAnalytix Python Library (on node where Jupyter is running)
A Python library is provided to read a source, fetch data from it, and create a data frame in notebooks.
Dependent Libraries
Run the following commands as the root user. They install all the prerequisite Python libraries.
root> pip install numpy==1.14
root> pip install pandas==0.22
root> pip install scipy==1.1.0
root> pip install sklearn
root> pip install scikit-learn==0.19.1
root> pip install matplotlib
root> pip install pyspark==2.3.0 |
Note: If any additional Python library is needed, install it on all nodes.
Follow the steps below to install the StreamAnalytix Python library on the node where Jupyter is running:
Step 1: Go to ‘streamanalytix’ user’s home folder.
• ~/ |
Create a directory named .streamanalytix and create a sax.config file inside it (as shown below):
.streamanalytix/sax.config |
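For example, the directory and file can be created as follows (run as the 'streamanalytix' user):
streamanalytix> mkdir -p ~/.streamanalytix
streamanalytix> touch ~/.streamanalytix/sax.config |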
Add the following content to the sax.config file:
[DEFAULT]
SAX_URL = <sax_url>
SAX_DATA_SOURCE_GATEWAY = StreamAnalytix/notebook/sourceDetail
SAX_SSL_ENABLE = <ssl_enable>
SSL_CERTIFICATE_PATH = <certificate_path> |
Change the <sax_url> entry to the StreamAnalytix hostname/IP address and port (i.e. http://hostname:port).
By default, the user can keep <ssl_enable> as FALSE.
If SSL is enabled (i.e. the StreamAnalytix application is running over https), change <ssl_enable> to TRUE and change <certificate_path> to the location of the certificate that will be used to access the application.
Step 2: Open a terminal, login as root and change directory to <StreamAnalytix_installation_dir>/conf/jupyter/python/streamanalytix_script.
Step 3: Run below command.
root> python setup.py build |
This builds the library.
Step 4: Now run install command as root user.
root> python setup.py install |
It will install the required packages, if not already available, and install the streamanalytix Python library.
Step 5: Check the installation folder of streamanalytix using the command pip show StreamAnalytix.
Step 6: To check whether the streamanalytix library is available in the Python environment, go to the Python console and run the command import streamanalytix.
>>> import streamanalytix |
If StreamAnalytix is not properly installed, you will get an error.
Auto create Notebook using REST API
1. On the node where Jupyter is running, log in as the ‘streamanalytix’ user and navigate to the folder <<StreamAnalytix_Installation_Dir>>/conf/jupyter/python/autonotebook.
streamanalytix> cd <<StreamAnalytix_Installation_Dir>>/conf/jupyter/python/autonotebook |
2. Run the auto_create_notebook.py script using the following command:
streamanalytix> python auto_create_notebook.py & |
This starts the service on port 5000. To change the port, pass the port number as an argument, e.g. python auto_create_notebook.py port=5004.
By default, a log file named auto_create_notebook.log is created in the folder from which the script is started. To change the log file path, pass the log file path and name as an argument when starting the script.
Example:
streamanalytix> python auto_create_notebook.py logfile=/my/log/folder/auto_create_notebook.log port=5004 & |
Note: Port and Logfile are optional.
To configure Jupyter in StreamAnalytix, refer to the Jupyter section in the user guide.
Note: Make sure Livy is installed in order to use Sparkmagic.
To install Sparkmagic, log in as the root user and run the following command:
Reference: https://github.com/jupyter-incubator/sparkmagic
root> pip install sparkmagic |
root> jupyter nbextension enable --py --sys-prefix widgetsnbextension |
To validate the location of Sparkmagic, run the following command:
root> pip show sparkmagic |
Now, execute below command:
root> cd <location of spark magic> |
Then run the following commands to install the kernels; this activates the Scala, PySpark and Python kernels in Sparkmagic for further use:
root> jupyter-kernelspec install sparkmagic/kernels/sparkkernel
root> jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
root> jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
root> jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
root> jupyter serverextension enable --py sparkmagic |
Configuration for StreamAnalytix user
Log in as the StreamAnalytix user and follow the steps below:
1. Create the directory ~/.sparkmagic if it does not exist.
2. Create a config.json file at ~/.sparkmagic and add details as given in example_config.json:
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json
3. Provide the Livy URL under all kernel credential sections (i.e. kernel_python_credentials, etc.) in config.json; the default is localhost. A sketch is shown below.
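For instance, a kernel credentials entry in ~/.sparkmagic/config.json would look roughly like this; the Livy host is a placeholder and 8998 is Livy's usual default port:
"kernel_python_credentials": {
  "username": "",
  "password": "",
  "url": "http://<livy_host>:8998"
} |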
Configuration to add a custom jar in Notebook class path (StreamAnalytix user)
1. Upload spark-notebook.jar to the Hadoop file system of the cluster.
2. Provide the uploaded jar location in the file ~/.sparkmagic/config.json under the below properties:
• jars
• conf > spark.driver.extraClassPath
• conf > spark.executor.extraClassPath
3. Provide the ZooKeeper entries in the file ~/.sparkmagic/config.json under the following properties:
• spark.executor.extraJavaOptions
• spark.driver.extraJavaOptions
These Zookeeper entries are present at the following location:
<<Streamanalytix_Installation_Dir>>/conf/config.properties
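Putting these together, the relevant config.json entries might look like the sketch below; the HDFS path, class-path entries and the ZooKeeper system property are placeholders to be adapted from your own config.properties:
"session_configs": {
  "jars": ["hdfs:///<path>/spark-notebook.jar"],
  "conf": {
    "spark.driver.extraClassPath": "spark-notebook.jar",
    "spark.executor.extraClassPath": "spark-notebook.jar",
    "spark.driver.extraJavaOptions": "-D<zookeeper_property>=<zookeeper_host>:2181",
    "spark.executor.extraJavaOptions": "-D<zookeeper_property>=<zookeeper_host>:2181"
  }
} |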
SparkMagic installation and configuration ends here. To ensure the changes take effect, restart the Jupyter service.
After the installation is complete, make sure that the following services are running:
• Jupyter notebook on port 8888
• Auto create notebook service on port 5000
1. If the following error occurs while opening a PySpark or Scala notebook:
The code failed because of a fatal error:
Failed to register auto viz for notebook.
First, check the pandas version using the command pip show pandas. If it is 0.23, downgrade it to version 0.22 using the following commands:
root> pip uninstall pandas
root> pip install pandas==0.22 |
Now, open the config.json file at ~/.sparkmagic, search for the entry "use_auto_viz" and change its value to false.
2. If a notebook takes a long time to create the Spark session in PySpark and Scala notebooks and the session is not up within 60 seconds, open the config.json file at ~/.sparkmagic, search for the entry "livy_session_startup_timeout_seconds" and increase the number of seconds (e.g. 120).
3. Also make sure that the configuration given in the config.json file at ~/.sparkmagic is syntactically correct; otherwise the sparkmagic library will fail to parse the JSON and the PySpark and Scala notebooks will not be usable.
Installing Cloudera Navigator (optional for CDH Cluster only)
1. Open the Cloudera Manager UI and click on ‘Cloudera Management Service’.
2. On the page that opens, click Add Role Instances.
3. Select the hosts for the Navigator Audit Server, Navigator Metadata Server and Activity Monitor, then click Continue.
4. If you are using a MySQL database, you need to create the databases for the above services. For Postgres, they are created automatically, but this is not recommended for a production environment.
5. Connect to the databases, check the database connectivity, and then click Continue.
6. Next, start the services below in the specified order:
a. Audit server
b. Metadata server
c. Activity Monitor
7. Go to the Navigator Metadata Server and click on the ‘Cloudera Navigator’ shortcut.
8. Log in to Cloudera Navigator with username: admin and password: admin.
Configure StreamAnalytix for Kerberos (Optional)
• Make sure that you have an existing MIT Kerberos setup.
• In addition, a Kerberos-enabled CDH cluster must be available.
Steps to setup StreamAnalytix for Kerberos
1. Create two principals, one for the StreamAnalytix user and one for the Kafka user, using the kadmin utility. The principals will be “headless” principals. For example, if ‘sanalytix’ and ‘kafka’ are the StreamAnalytix and Kafka users respectively, run:
kadmin -q "addprinc -randkey sanalytix"
kadmin -q "addprinc -randkey kafka" |
2. Use the kadmin utility to create keytab files for the above principals, using:
kadmin -q "ktadd -k <keytab-name>.keytab <username>" |
Note: Also ensure that the keytabs are readable only by the StreamAnalytix user.
Example:
kadmin -q "ktadd -k sax.service.keytab sanalytix"
kadmin -q "ktadd -k sax-kafka.keytab kafka" |
3. Create a JAAS configuration file named keytab_login.conf with the following sections:
• com.sun.security.jgss.initiate (for HTTP client authentication)
• Client (for ZooKeeper)
• StormClient (for Storm)
• KafkaClient (for Kafka)
Each section in a JAAS configuration file that uses keytabs for Kerberos security has the same general format; the sample keytab_login.conf follows this pattern for each of the sections listed above.
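A minimal sketch of one such section, assuming a keytab-based Krb5LoginModule entry; the keytab path and realm below are placeholders:
Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/path/to/sax.service.keytab"
    storeKey=true
    useTicketCache=false
    principal="sanalytix@<REALM>";
}; |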
4. Now, move the keytabs and keytab_login.conf to the $SAX_HOME/conf/common/Kerberos folder and copy the files to the $SAX_HOME/conf/thirdpartylib folder.
Also copy Kafka’s server.properties file to $SAX_HOME/conf/common/Kerberos.
Note: Replace $SAX_HOME with the path of the StreamAnalytix home directory.
5. Add the StreamAnalytix user to the supergroup of the hdfs user on all nodes.
6. On the HBase master node, run kinit as the HBase user and grant the StreamAnalytix user the read, write and create privileges as follows:
sudo -u hbase kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase
sudo -u hbase $HBASE_HOME/bin/hbase shell |
Note: Replace $HBASE_HOME with the path to hbase installation folder. ‘hbase’ is the user through which HBase is deployed.
7. In the HBase shell, run grant ‘sanalytix’, ‘RWC’, where sanalytix is the StreamAnalytix user (see the example below).
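For instance, inside the HBase shell:
hbase(main):001:0> grant 'sanalytix', 'RWC' |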
8. Grant the cluster action permission on the Kafka cluster. Run the following command on a Kafka broker node:
sudo -u kafka $KAFKA_HOME/bin/kafka-acls.sh -config $KAFKA_HOME/config/server.properties -add -allowprincipals user:sanalytix -operations ALL -cluster |