Installation Guide
The StreamAnalytix platform enables enterprises to analyze and respond to events in real time at Big Data scale. With its unique multi-engine architecture, StreamAnalytix provides an abstraction layer that offers the flexibility to execute data pipelines on a stream processing engine of choice, depending on the application use case and weighing the advantages of Storm or Spark Streaming in terms of processing methodology (CEP, ESP) and latency.
The objective of this document is to install StreamAnalytix and configure the various infrastructure components that interact in StreamAnalytix pipelines.
This document assumes that:
1. The user is well acquainted with Linux systems and has a fair knowledge of UNIX commands.
2. The user has sudo rights or is working as the root user.
3. The user has installed the yum, rpm, unzip, tar and wget tools.
StreamAnalytix web studio provides a web interface to create, deploy and manage data processing and analytical flows. These data flows utilize services that are part of the big data cluster. Cluster managers like Ambari manage the majority of these services (for HDP). The web studio needs to be configured correctly to enable data pipelines to interact with the services.
StreamAnalytix web studio provides a simple way to configure these service properties as part of a post-deployment setup process.
Managed services such as YARN and Zookeeper can be configured by simply providing the Ambari information in the setup screen. Properties for services that are not part of the managed cluster can be configured by entering the values manually.
Before beginning the installation, please see the supported technology stack in Appendix-1.
An HDP, CDH or Apache based cluster with the version described in Appendix 1 must be available for StreamAnalytix to work properly. Livy/Local service is required to create pipelines in StreamAnalytix.
The pre-requisites mentioned in Appendix 1 must be deployed before proceeding further.
StreamAnalytix Webstudio can be manually configured during the deployment process. This requires changing a few configuration files manually.
Alternatively, a simpler way is to start the Webstudio in embedded mode. This enables the user to configure StreamAnalytix from the UI. Configuring and restarting StreamAnalytix can then switch the Webstudio to cluster mode.
Embedded mode requires two services, Zookeeper and Qpid; these packages are bundled in the StreamAnalytix binary and do not need additional setup.
1. Extract the StreamAnalytix bundle and go to the extracted location in a terminal.
2. This location is referred to as the StreamAnalytix installation directory.
3. Run the below commands to start the Webstudio:
cd bin/
./startServicesServer.sh -deployment.mode=embedded
Once the command executes, an EULA page opens.
4. Accept the license and click the Next button. The Upload License page opens.
5. Upload the license and confirm.
6. The StreamAnalytix login page is displayed.
1. Navigate to setup page on the sidebar.
2. Setup page contains various tabs - Cluster Configuration, StreamAnalytix, Database, Messaging Queue, Elasticsearch, Cassandra and Version Control.
StreamAnalytix enables automated configuration if the cluster is an HDP or CDH cluster.
Log in to StreamAnalytix and go to Setup, select Cluster Configuration, and enter the login details for the cluster manager. On clicking Save, all the managed services will be fetched.
| Property | Description |
|---|---|
| Cluster Manager | Select the cluster manager as Ambari. |
| URL | Provide the Ambari URL, in the form http://<ambari_host>:<ambari_port> |
| User Name | Username for Ambari |
| Password | Password for Ambari |
| Cluster Name | Provide the Ambari cluster name |
| Enable Kerberos | Allows the Kerberos configuration from the cluster manager to be configured in StreamAnalytix. |
Click Save; all managed services supported in StreamAnalytix will be configured and their progress displayed.
Note: Livy configuration is not supported through the setup simplification process; please make sure that StreamAnalytix is pointing to the correct Livy URL, which is as follows.
Livy URL for HDP -> http://localhost:8999
For CDH Using Cloudera Manager
| Property | Description |
|---|---|
| Cluster Manager | Select the cluster manager as Cloudera Manager. |
| URL | Provide the Cloudera Manager URL, in the form http://<cm_host>:<cm_port> |
| User Name | Username for Cloudera Manager |
| Password | Password for Cloudera Manager |
| Cluster Name | Provide the Cloudera Manager cluster name |
| Enable Kerberos | Allows the Kerberos configuration from the cluster manager to be configured in StreamAnalytix. |
Click Save; all managed services supported in StreamAnalytix will be configured and their progress displayed.
Note: Livy configuration is not supported through the setup simplification process; please make sure that StreamAnalytix is pointing to the correct Livy URL, which looks as follows:
Livy URL for CDH/Apache: http://localhost:8998
Login to StreamAnalytix using Superuser credentials and go to Configuration from left navigation pane.
1. Select Web Studio tile and click on Zookeeper tab.
Provide the value of the following property:
| Property | Description |
|---|---|
| Host List | The comma-separated list of all the nodes of the Zookeeper cluster. This Zookeeper cluster will be used to store the StreamAnalytix configuration. For example: hostname1:2181,hostname2:2181 |
Save the changes by clicking on Save.
2. Select Processing Engine tab and click on Spark tab.
Provide values for the following properties to point StreamAnalytix to an external cluster:
| Property | Description |
|---|---|
| Spark Livy URL | Livy web URL on which StreamAnalytix will submit pipelines. |
| Spark cluster manager | Defines the Spark cluster manager, i.e. 'yarn' or 'standalone'. |
| spark.history.server | Defines the Spark history server URL. |
| Resource Manager Host | Defines the Resource Manager hostname. |
| Resource Manager Webapp Port | Defines the Resource Manager webapp port. |
| Resource Manager Port | Defines the Resource Manager RPC port. |
| ResourceManager High Availability | Check this if the Resource Manager is HA enabled. |
| ResourceManager HA Logical Names | Resource Manager HA logical IDs as defined in the HA configuration. |
| ResourceManager HA Hosts | Resource Manager HA hostnames. |
| ResourceManager HA ZK Address | Resource Manager HA Zookeeper quorum. |
Save the changes by clicking on Save.
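For reference, a hypothetical set of values for a non-HA YARN cluster might look like the following; the hostnames are placeholders and the ports shown are common Hadoop/Livy defaults, so verify them against your own cluster:
Spark Livy URL: http://<livy_host>:8999
Spark cluster manager: yarn
spark.history.server: http://<history_host>:18080
Resource Manager Host: <rm_host>
Resource Manager Webapp Port: 8088
Resource Manager Port: 8032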
Configure StreamAnalytix with HTTPS
1. Obtain the keystore.jks and truststore.jks certificates.
2. Import these certificates into $JAVA_HOME/jre/lib/security/cacerts.
Example: keytool -import -alias cmagent_<hostname> -file <path of the file>/<filename> -keystore $JAVA_HOME/jre/lib/security/jssecacerts -storepass changeit
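To confirm the certificate was imported, a listing command such as the following can be used; the alias and store path must match whichever keystore you imported into:
$ keytool -list -keystore $JAVA_HOME/jre/lib/security/jssecacerts -storepass changeit | grep cmagent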
3. Update the configuration under Configuration > Processing Engine > Spark: change the Resource Manager web port from 8088 to 8190, or to the actual port of your Resource Manager.
4. If any of StreamAnalytix, Spark, or Ambari is running on HTTPS, configure the corresponding History and Resource Manager properties under Configuration.
3. Select the Hadoop tab and click on the HDFS tab. Provide values for the following properties:
| Property | Description |
|---|---|
| Hadoop Enable HA | Whether the Hadoop cluster is HA enabled. Keep this disabled if Hadoop is not running in HA mode. |
| File System URI | The filesystem URI. For example: hdfs://hostname:port (when HA is not enabled) or hdfs://nameservices (when HA is enabled) |
| Hadoop User | The name of the user through which Hadoop services are running |
| Hadoop DFS Name Services | Defines the nameservice ID of the Hadoop HA cluster. Configure this only when Hadoop is running in HA mode. |
| Hadoop Namenode 1 Details | Defines the RPC address of Namenode 1. For example: nn1,hostname:port. Configure this only when Hadoop is running in HA mode. |
| Hadoop Namenode 2 Details | Defines the RPC address of Namenode 2. For example: nn2,hostname:port. Configure this only when Hadoop is running in HA mode. |
Save the changes by clicking on Save.
4. Now log in to a terminal on the node where the StreamAnalytix web studio is installed and follow the steps below:
a. Go to <<StreamAnalytix_installation_dir>>/conf
b. Edit the config.properties file and provide the value of the property 'zk.hosts':
| Property | Description |
|---|---|
| zk.hosts | The comma-separated list of all the nodes of the Zookeeper cluster. This Zookeeper cluster will be used to store the StreamAnalytix configuration. For example: hostname1:2181,hostname2:2181 |
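As an illustration only (the hostnames below are placeholders), the edited line in config.properties would then look like this:
zk.hosts=hostname1:2181,hostname2:2181,hostname3:2181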
Note: Checkpointing is required to run pipelines on a standalone deployment.
StreamAnalytix Miscellaneous Properties
1. Go to Setup from the left navigation pane and click on the StreamAnalytix tab.
Provide values for the following properties:
| Property | Description |
|---|---|
| StreamAnalytix Web URL | Configure the StreamAnalytix Web URL, e.g. http://<sax_host>:<sax_port>/StreamAnalytix |
| Zookeeper StreamAnalytix Node | The Zookeeper StreamAnalytix node is where Webstudio-specific properties are managed. |
| Zookeeper Configuration Node | The Zookeeper configuration node is where all the YAML properties are managed. |
| Password Encryption Required | Enable Password Encryption Required to encrypt all password fields. |
| Spark Home | The path to the Spark installation on the machine where StreamAnalytix Studio is installed. |
| Spark Job Submit Mode | The mode in which Spark pipeline jobs are submitted. See Appendix-1 on deploying Livy and setting up the Spark 2 client. The options are: spark-submit, livy, job-server. |
| Hadoop User | The StreamAnalytix user through which pipelines will be uploaded to HDFS. |
Login to StreamAnalytix, go to Setup, and select Database.
| Property | Description |
|---|---|
| Connection URL | Provide the JDBC connection URL. Supported database deployments are PostgreSQL, Oracle, MySQL and MSSQL. For example: jdbc:postgresql://<db_host>:<db_port>/streamanalytix |
| User | Provide the username. |
| Password | Provide the password. |
| Run Script | Select the Run Script option if all the SQL scripts need to be executed in the configured database. Run Script executes all the SQL scripts (DDL and DML) in the configured RDBMS database; however, with version 3.2 of SAX, follow the note below. Note: 1. Before selecting Run Script, the psql (PostgreSQL) or mysql (MySQL) client should be installed. 2. Manually run both the DDL and DML SQL scripts in the folder <SAX_HOME>/db_dump/<RDBMS_3.2>, since they are not executed automatically with the Run Script option. |
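For reference, typical JDBC connection URL formats for the supported databases are shown below; hosts, ports, SIDs and database names are placeholders, and the PostgreSQL form is the one given above:
jdbc:postgresql://<db_host>:<db_port>/streamanalytix
jdbc:mysql://<db_host>:3306/streamanalytix
jdbc:oracle:thin:@<db_host>:1521:<SID>
jdbc:sqlserver://<db_host>:1433;databaseName=streamanalytix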
Login to StreamAnalytix and go to Setup and select Messaging Queue.
| Property | Description |
|---|---|
| Messaging Type | Select the messaging type. Supported types are RABBITMQ and ACTIVEMQ. |
| Host List | Comma-separated host list where RabbitMQ is deployed, in the format <rmq_host1>:<rmq_port>,<rmq_host2>:<rmq_port> |
| User | Provide the RabbitMQ login username. |
| Password | Provide the RabbitMQ login password. |
Login to StreamAnalytix and go to Setup and select Elasticsearch. Configure the following properties.
| Property | Description |
|---|---|
| Elasticsearch Connection URL | Provide the Elasticsearch connection URL in the form <es_host>:<es_connection_port> |
| Elasticsearch HTTP Port | Provide the HTTP port |
| Elasticsearch Cluster Name | Provide the cluster name |
| Enable Security | If security is enabled on Elasticsearch, set this to true. |
| Enable Authentication | If authentication is enabled on Elasticsearch, check the checkbox. |
| Enable SSL | If SSL is enabled, check the checkbox. |
| Keystore Password | Elasticsearch keystore password |
Login to StreamAnalytix and go to Setup and select Version Control. Configure the following properties.
| Property | Description |
|---|---|
| Version Control System | StreamAnalytix Metastore: pipelines are saved on the file system. GIT: the pipeline is pushed to GIT after a version is created. In the case of GIT, the following properties must be populated. |
| Clone All Branches | If selected, the user will be able to switch branches and push to the selected branch; otherwise the user will only be able to push to the cloned branch. |
| HTTP URL | HTTP URL of the remote GIT repository. |
| Username or Email | Username or email ID of the GIT user |
| Password | HTTP password of the GIT user |
| Branch | Branch name where the push operation will be performed. |
| Repository Local Path | Local path where GIT clone will place the files on the file system. |
Please refer to post deployment steps to configure additional optional features in StreamAnalytix.
1. In the terminal, change directory to the StreamAnalytix installation directory and stop and restart StreamAnalytix using the following commands:
cd bin/
./stopServicesServer.sh
./startServicesServer.sh -config.reload=true
Accept the EULA, upload the license and log in to the application.
(The credentials are mentioned above.)
Note: If you install StreamAnalytix Webstudio for Apache (embedded mode), upload the license again (if prompted).
Also, before starting StreamAnalytix, if the underlying database is MySQL, then the MySQL connector jar should be placed in <StreamAnalytix_HOME>/server/tomcat/lib and <StreamAnalytix_HOME>/conf/thirdpartylib.
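For example, assuming the MySQL Connector/J jar has already been downloaded (the jar name below is a placeholder for your actual driver version):
cp mysql-connector-java-<version>.jar <StreamAnalytix_HOME>/server/tomcat/lib/
cp mysql-connector-java-<version>.jar <StreamAnalytix_HOME>/conf/thirdpartylib/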
The technology stack supported for StreamAnalytix deployment is as follows:
| Component Name | Apache | HDP 2.6.5 | CDH 5.16.3 |
|---|---|---|---|
| Apache Spark | 2.3.0 | 2.3.0 | 2.3.0.cloudera2-1.cdh5.13.3.p0.316101 |
| Apache Hive | 2.1.1 | 1.2.1000 | hive-1.1.0+cdh5.16.1+1431 |
| Apache Kafka | 2.11.2 | 1.0.0 | 3.1.1-1.3.1.1.p0.2 |
| Elasticsearch | 6.4.1 | 6.2.4 / 6.4.1 | 6.4.1 |
| Apache Hbase | 1.1.2 | 1.1.2 | hbase-1.2.0+cdh5.16.1+482 |
| Cassandra | 3.11.3 | 3.11.3 | 3.11.3 |
| Apache Solr | 4.10.3 | 5.5.2 | solr-4.10.3+cdh5.16.1+532 |
| Mqtt | 1.4.10 | 1.4.10 | 1.4.10 |
| OpenJMS | 0.7.7 | 0.7.7 | 0.7.7 |
| Apache Livy | NA | 0.4.0 | 0.4.0 |
| RabbitMQ | 3.3.5 | 3.3.5 | 3.6.10 |
| Apache Yarn | 2.7.3 | 2.7.3 | hadoop-2.6.0+cdh5.16.1+2848 |
| Apache Hadoop | 2.7.3 | 2.7.3 | hadoop-2.6.0+cdh5.16.1+2848 |
| Apache Zookeeper | 3.4.10 | 3.4.6 | zookeeper-3.4.5+cdh5.16.1+155 |
| Apache Tomcat | 9.0.7 | 9.0.7 | 9.0.7 |
| Java | 1.8.x | 1.8.x | 1.8.x |
| Postgres | 10.x | 10.x | 10.x |
| MySQL | 5.x | 5.x | 5.x |
Other services supported:
• Ambari Metrics collector (for HDP only)
• Spark History Server
• Kerberos
Ambari-managed HA services supported in StreamAnalytix deployment:
1. Ambari
2. HDFS
Appendix-1 StreamAnalytix Pre-requisites
1. Verify that you have a /usr/java directory. If not, create one:
$ mkdir /usr/java
2. Download the Oracle 64-bit JDK (jdk-8u101-linux-x64.tar.gz) from the Oracle download site. Open a web browser and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
3. Copy the downloaded jdk.tar.gz file to the /usr/java directory.
4. Navigate to the /usr/java directory and extract the jdk.tar.gz file.
$ cd /usr/java
$ tar zxvf jdk-8u101-linux-x64.tar.gz
5. The JDK files will be extracted into a /usr/java/jdk1.8.0_101 directory.
6. Create a symbolic link (symlink) to the JDK:
$ ln -s /usr/java/jdk1.8.0_101 /usr/java/default
$ ln -s /usr/java/jdk1.8.0_101/bin/java /usr/bin/java
7. Set the JAVA_HOME and PATH environment variables.
$ export JAVA_HOME=/usr/java/default
$ export PATH=$JAVA_HOME/bin:$PATH
8. Run the below commands to notify the system that the new Java version is ready for use.
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/java/jdk1.8.0_101/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/java/jdk1.8.0_101/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/java/jdk1.8.0_101/bin/javaws" 1
9. Verify that Java is installed in your environment by running the following command.
$ java -version
10. You should see output similar to the following:
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.101-b01, mixed mode)
Access Rights for the Livy user
Create a directory on HADOOP if it does not exist, using below command:
<HADOOP_HOME>/bin/hadoop fs -mkdir /hadoop
Now, give full permission to the directory using the below command:
<HADOOP_HOME>/bin/hadoop fs -chmod -R 777 /hadoop
Permission 777 is required on the directories configured in the below common.yaml properties:
livy.sample.data.hdfs.path
livy.custom.jar.hdfs.path
Run the following commands:
<HADOOP_HOME>/bin/hadoop fs -chmod -R 777 /user/hdfs/sax/auto-detection/data/
<HADOOP_HOME>/bin/hadoop fs -chmod -R 777 /user/hdfs/sax/auto-detection/custom-jar/
Login to Ambari, set the following properties for the 'Spark-2' Livy configuration, and restart the Spark-2 service, as shown below:
livy.server.csrf_protection.enabled = false
# If the below property is set to 'true', make sure that Hive is installed and running properly. Otherwise set this property to 'false'.
livy.repl.enableHiveContext = false
Note: Please validate the memory of the node manager and container; it should be greater than 512 + 384 MB (384 is 75% of 512).
Erlang is required before installing RabbitMQ; use the below commands to install it:
$ sudo yum install epel-release
$ sudo yum install erlang
1. Run the following command to install the RabbitMQ server:
$ sudo yum install rabbitmq-server
2. Run the below command to start the RabbitMQ server:
$ sudo service rabbitmq-server start
3. Enable RabbitMQ management plugin using the following command:
$ sudo rabbitmq-plugins enable rabbitmq_management
Troubleshooting: In case of an error
-bash: rabbitmq-plugins: command not found
Go to the RabbitMQ server's "sbin" folder and execute the script from there.
Example:
$ cd /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin
Run the following command as root or with sudo
$ ./rabbitmq-plugins enable rabbitmq_management
4. Enable RabbitMQ Stomp adapter using the following command:
$ sudo rabbitmq-plugins enable rabbitmq_stomp
5. Configure the adapter plugin:
When no configuration is specified while enabling Stomp, the Stomp adapter will listen on all interfaces on port 61613 with the default login/password of guest/guest.
To change the default configuration, edit your configuration file (rabbitmq.config), which contains a tcp_listeners variable for the rabbitmq_stomp application.
For example, a complete configuration file which changes the listener port to 12345 would look like:
[ {rabbitmq_stomp, [{tcp_listeners, [12345]}]} ].
6. Enable RabbitMQ Stomp web plugin using the following command:
$ sudo rabbitmq-plugins enable rabbitmq_web_stomp
By default, the Web STOMP plugin exposes both a WebSocket and a SockJS endpoint on port 15674.
The WebSocket endpoint is available on the /ws path:
http://127.0.0.1:15674/ws
The SockJS endpoint is available on the /stomp prefix:
http://127.0.0.1:15674/stomp
7. Restart the RabbitMQ server.
$ sudo service rabbitmq-server restart
To prevent an older version of PostgreSQL from being installed, add the following line to the appropriate repository configuration file:
exclude=postgresql*
File path for making the above entry differs according to the OS:
• File path for CentOS machine: /etc/yum.repos.d/CentOS-Base.repo (in [base] and [updates] sections both)
• File path for RHEL machine: /etc/yum/pluginconf.d/rhnplugin.conf (in [main] section only)
A PGDG file is available for each distribution/architecture/database version combination.
Install the PostgreSQL repository on the system using one of the commands below, as per the system architecture and operating system.
CentOS/RHEL version 6.x, 64-bit:
$ rpm -Uvh https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-6-x86_64/pgdg-redhat10-10-2.noarch.rpm
CentOS/RHEL version 7.x, 64-bit:
$ rpm -Uvh https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-2.noarch.rpm
Install the basic PostgreSQL 10 server using the below command:
$ yum install postgresql10-server postgresql10
After installing the PostgreSQL server, it must be initialized before use. To initialize the database, run the below command:
$ service postgresql-10 initdb
NOTE: In case the above command gives an error, try one of the following commands:
$ /etc/init.d/postgresql-10 initdb
$ /usr/pgsql-10/bin/postgresql-10-setup initdb
Server configuration and Startup
To start the PostgreSQL server automatically on system boot, run the following command:
$ chkconfig postgresql-10 on
Configure Connection
1. Replace the following line in the /var/lib/pgsql/10/data/pg_hba.conf file:
host all all 127.0.0.1/32 ident
with the following line:
host all all all md5
2. Replace the following line in the /var/lib/pgsql/10/data/postgresql.conf file:
#listen_addresses = 'localhost'
with the following line:
listen_addresses = '*'
NOTE: Do not forget to uncomment the above line; it is commented by default.
• Start Server
Start the PostgreSQL service using the following command:
$ service postgresql-10 start
Verify PostgreSQL Installation
After completing the PostgreSQL 10 installation on the server, perform a basic sanity check to verify that the installation completed successfully. To do so, switch from the root user to the postgres user (first time only) with the following command:
$ su postgres
• Use psql command to access PostgreSQL console.
bash-4.3$ psql
Change the password of a PostgreSQL user with the following command (the default password is postgres):
postgres=# alter user <<username>> with password '<<newpassword>>';
postgres=# alter user postgres with password 'scott';
NOTE: Please keep the new password in quotes.
Create New User (Optional):
In PostgreSQL, the default root user is 'postgres'. If you want to create a new user with login permission, use the following commands:
$ sudo -u postgres psql postgres
postgres=# create role <<new_user_name>> login password '<<new_password>>';
1. In <<installationDir>>/db_dump/pgsql_1.2, you will find activiti.sql.
2. Create a new database for activiti; you will need to point to this database while configuring the StreamAnalytix application.
3. Import <<installationDir>>/db_dump/pgsql_1.2/activiti.sql:
$ psql -U postgres -d <<activiti_db_name>> -h <<pgsql_host>> -f <<installationDir>>/db_dump/pgsql_1.2/activiti.sql
Livy 0.4.0 is packaged with the HDP 2.6.3 stack as general availability.
Verify LIVY installation by following the below steps:
1. Login into Ambari console.
2. Go to Dashboard and select Spark or Spark2 service from left Pane.
3. For Spark2, 'Livy for Spark2 Server' should be running.
If Livy is not installed, follow these steps to install Livy.
Note: When using Livy, StreamAnalytix pipelines should be saved/submitted in the same mode (cluster/client) that is configured on Livy.
To verify the Livy mode, follow the below steps:
1. Login to Ambari.
2. Go to Dashboard and select the specific Spark version.
3. Click on the 'Config' tab.
4. Search for the 'livy.spark.master' property. If its value is 'yarn-cluster' or 'cluster', Livy is configured to support cluster mode; otherwise the pipelines must run in client mode.
In client mode, Livy picks up the application binary from the local file system. Set the below property, which tells Livy to add a local directory to its whitelist:
livy.file.local-dir-whitelist = /home/sax/StreamAnalytix/lib/
Make sure that the local directory (the directory where the application binaries reside) is mounted on NFS. See https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-centos-6
Alternatively, StreamAnalytix can be installed on the node where Livy is deployed in order to support local mode.
1. Login to Ambari.
2. Go to Hosts and select the node on which Livy is to be installed.
3. Click on the '+Add' button to install 'Livy for Spark2 Server'.
Installation for CDH/Apache Cluster
To build Livy for a CDH/Apache cluster, use the following link.
For CDH, change the below configuration in <Livy install dir>/conf/livy.conf:
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
Note: If Livy is configured in cluster/client mode, then StreamAnalytix pipelines should be saved compatibly, else the pipeline will not be submitted to Spark.
To enable Livy support, configure the below StreamAnalytix properties in the env-config.yaml file.
StreamAnalytix Properties:
# job-server, spark-submit, livy
job.submit.mode: "livy"
# Livy URL; by default Livy runs on 8998 (Apache and CDH) and 8999 (HDP)
livy.url: "http://localhost:<LIVY PORT>"
In client mode, Livy picks up the application binary from the local file system. Set the below property, which tells Livy to add a local directory to its whitelist, else Livy will not accept it.
livy.file.local-dir-whitelist = /home/sax/StreamAnalytix/lib/
Kerberos Environment with Livy
Points to remember for a Kerberos-enabled installation configured with Livy:
1. The following property value should be false:
livy.impersonation.enabled = false
2. During pipeline submission, if keytabs are uploaded then it is mandatory to mount the /tmp/kerberos folder on the Livy node at the same location; if a keytab file path is provided instead, make sure all nodes have all keytabs at the same location.
For example, suppose you have a cluster of multiple nodes with StreamAnalytix on Node A and Livy on Node B. On playing a pipeline, StreamAnalytix exports the uploaded keytabs to the /tmp/kerberos folder on Node A. Therefore, you should mount the /tmp/kerberos folder on the machine where Livy is running (Node B), since Livy will not otherwise find the keytabs uploaded on Node A. If you cannot mount the folder, do not upload keytabs; instead supply the keytab file path and make sure that all the keytabs on the Livy node are at the same location.
Configure Livy in StreamAnalytix
To enable Livy support, configure the below StreamAnalytix properties in the env-config.yaml file.
Webstudio Properties:
# job-server, spark-submit, livy
job.submit.mode: "livy"
# Livy URL; by default Livy runs on 8998 (Apache and CDH) and 8999 (HDP)
livy.url: "http://localhost:<LIVY PORT>"
To install Elasticsearch, follow the steps mentioned below:
1. Download Elasticsearch binary (.tar.gz) version 6.2.4 from the below Url:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz
2. Extract the tar.gz using below command:
$ tar -xvf elasticsearch-6.2.4.tar.gz -C <<installationDir>>
$ cd <<installationDir>>/<<extractedDir>>
Enable SSL in Elasticsearch
To enable SSL, perform the following steps on each node in the cluster:
3. Manually download the X-Pack zip file from the below URL:
https://artifacts.elastic.co/downloads/packs/x-pack/x-pack-6.2.4.zip
4. Run $ES_HOME/bin/elasticsearch-plugin install on each node in your cluster.
$ $ES_HOME/bin/elasticsearch-plugin install file:///path/to/file/x-pack-6.2.4.zip
5. Confirm that you want to grant X-Pack additional permissions.
6. X-Pack will try to automatically create several indices within Elasticsearch. By default, Elasticsearch is configured to allow automatic index creation and no additional steps are required. However, if you have disabled automatic index creation in Elasticsearch, you must configure action.auto_create_index in elasticsearch.yml to allow X-Pack to create the following indices:
<<installationDir>>/<<extractedDir>>/config/elasticsearch.yml:
action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*
Generating Node Certificates
7. Create a certificate authority for your Elasticsearch cluster. Substitute <DOMAIN_NAME> with your machine's domain name, and <node1> and <node2> with the node names or IP addresses of the machines that will be part of the Elasticsearch cluster:
$ keytool -genkeypair -keystore es-certificate.p12 -storetype PKCS12 -storepass elastic -alias esSSL1 -keyalg RSA -keysize 2048 -validity 99999 -dname "CN=<DOMAIN_NAME>, OU=My Team, O=My Company, L=My City, ST=My State, C=SA" -ext san=dns:<DOMAIN_NAME>,dns:localhost,ip:127.0.0.1,ip:<node1>,ip:<node2>
8. Copy the node certificate to the appropriate locations: copy the generated .p12 file into an Elasticsearch configuration directory on each node, for example /home/es/config/certs.
9. Add the Elasticsearch certificate to the Java cacerts of each machine from which you will connect to Elasticsearch (i.e. the YARN and StreamAnalytix nodes) using the below command:
$ keytool -importkeystore -srckeystore /path-to-p12-file/es-cer.p12 -destkeystore $JAVA_HOME/jre/lib/security/cacerts -srcstoretype pkcs12
The above command must be run as root or with sudo. It will prompt for the destination keystore password, if one has been set earlier, and for the source keystore password, which is 'elastic' in our case.
Enable SSL between nodes in a Cluster
10. Enable TLS and specify the information required to access the node's certificate. Add the following information to the <<installationDir>>/<<extractedDir>>/config/elasticsearch.yml file on each node:
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/es-certificate.p12
xpack.security.transport.ssl.truststore.path: certs/es-certificate.p12
11. If you have secured the node's certificate with a password, add the password to your Elasticsearch keystore. The password set earlier was 'elastic'; enter the same when prompted.
$ bin/elasticsearch-keystore add xpack.security.transport.ssl.keystore.secure_password
$ bin/elasticsearch-keystore add xpack.security.transport.ssl.truststore.secure_password
Encrypting HTTP Client Communication
12. Enable TLS and specify the information required to access the node’s certificate.
Add the following information to the <<installationDir>>/<<extractedDir>>/config/elasticsearch.yml file on each node:
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/es-certificate.p12
xpack.security.http.ssl.truststore.path: certs/es-certificate.p12
13. If you have secured the node's certificate with a password, add the password to your Elasticsearch keystore. The password set earlier was 'elastic'; enter the same when prompted.
$ bin/elasticsearch-keystore add xpack.security.http.ssl.keystore.secure_password
$ bin/elasticsearch-keystore add xpack.security.http.ssl.truststore.secure_password
14. Configure additional properties in the <<installationDir>>/<<extractedDir>>/config/elasticsearch.yml file under the extracted folder. Uncomment the following properties:
Note: Make sure there is a space at the start of each line (just remove the #, do not remove the space).
cluster.name
node.name
path.data
path.logs
Elasticsearch nodes join a cluster based on just one property named cluster.name.
For example, if you want to add the node to cluster ‘mass_deployment’, change the value of property ‘cluster.name’ to ‘mass_deployment’ as follows:
cluster.name: mass_deployment
This should be same across all nodes of the cluster. This value will be required while configuring Elasticsearch in StreamAnalytix.
The node name should be unique for each ES node in a cluster. This is defined by the ‘node.name’ property.
For example: If user wants to deploy three nodes for the cluster, the names can be ‘node0’, ‘node1’ and ‘node2’.
• node.name: 'node0'
This should be unique for each node in the cluster.
• node.tag: 'node0'
This should be unique for each node and the same as node.name. Also, use the convention node0, node1, nodeN.
• path.data: /path/to/data/dir
Path of the directory where Elasticsearch stores its data.
• discovery.zen.ping.unicast.hosts: ["<hostname/ip>"]
This property performs discovery when a new node is started. The default list of hosts is ["127.0.0.1", "[::1]"].
• node.master: true
Set this property to create a dedicated master-eligible node.
• node.data: true
This property defines data nodes; a data node holds data and performs data-related operations.
• bootstrap.memory_lock: true
Locks the memory for better performance of Elasticsearch.
• transport.tcp.port: 9300
• transport.bind_host: <hostname/IP>
• transport.host: <hostname/IP>
• network.host: <hostname/IP>
• http.port: 9200
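Putting the above together, a minimal single-node elasticsearch.yml might look like the following sketch; the cluster name, node name, paths and hostname are the placeholder examples used above and should be adjusted to your environment:
cluster.name: mass_deployment
node.name: node0
node.tag: node0
path.data: /path/to/data/dir
path.logs: /path/to/logs/dir
discovery.zen.ping.unicast.hosts: ["<hostname/ip>"]
node.master: true
node.data: true
transport.tcp.port: 9300
network.host: <hostname/IP>
http.port: 9200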
Note: To view monitoring data without errors, add the following property in elasticsearch.yml:
action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,sax-meter*
15. Specify the heap size for Elasticsearch by adding the below lines to the file <<installationDir>>/<<extractedDir>>/config/jvm.options:
-Xms4g
-Xmx4g
16. Make sure to increase the limit on the number of open files descriptors for the user running Elasticsearch to 65,536 or higher. Run below command as root before starting Elasticsearch, or set nofile to 65536 in /etc/security/limits.conf.
$ ulimit -n 65536
17. Set the password of the built-in elastic user. You must explicitly set a bootstrap.password setting in the keystore before you start Elasticsearch. For example, the following command prompts you to enter a new bootstrap password:
$ bin/elasticsearch-keystore add "bootstrap.password"
The password you set will be required to log in to the Elasticsearch cluster URL using 'elastic' as the superuser.
18. Change the ownership of the Elasticsearch installation directory and start the Elasticsearch node by logging in as a non-root user (this is required to enable memory locking):
$ chown -R <<non-root_user>> <<installationDir>>
$ <<installationDir>>/<<extractedDir>>/bin/elasticsearch -d
19. To enable the Elasticsearch plugin, open the Google Chrome browser and install the 'elasticsearch-head' extension.
20. To access the Elasticsearch cluster, click on the 'elasticsearch-head' plugin icon in your browser, enter the cluster details as below and hit 'connect':
http://<es_http_node>:<es_http_port>
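A quick way to verify that the cluster is reachable is a REST call such as the one below; the host and port are placeholders, and the -u option is only needed when security is enabled:
$ curl -u elastic http://<es_http_node>:<es_http_port>/_cluster/health?pretty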
Couchbase-server-community-5.1.1
To install Couchbase, follow the steps mentioned below:
1. Download the rpm file from the below URL:
wget https://packages.couchbase.com/releases/5.1.1/couchbase-server-community-5.1.1-centos7.x86_64.rpm
2. To install Couchbase, run the below command:
rpm --install couchbase-server-community-5.1.1-centos7.x86_64.rpm
The command will install the Couchbase DB and start the service. After running the command above, you will receive the following URL in output:
http://<<HOSTNAME>>:8091/
Open the URL in browser then follow the steps to create the cluster.
Step 1: Click on Setup New Cluster.
Step 2: Provide the Cluster Name, Username and Password, then click on Next.
Step 3: Accept the terms and conditions, and click on Finish with Defaults. You can also configure Disk, Memory and Service, as per your requirements.
Step 4: Cluster setup is now complete. Login with the username and password set up in the previous step.
Appendix-2 Post Deployment Steps
1. Find the dashboard folder inside STREAMANALYTIX_HOME and untar the dashboard.tar.gz file using the command tar xvf dashboard.tar.gz (on the machine where the StreamAnalytix admin UI is not installed).
2. Create a database named dashboardrepo in the PostgreSQL database; for example, see the command below.
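A minimal sketch using psql (assuming the postgres superuser):
sudo -u postgres psql -c "CREATE DATABASE dashboardrepo;"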
3. Set the below lines in /dashboard/reportengine/config/ReportEngine.dat, replacing <USER>, <PostGres_IP> and <PASSWORD> with actual values.
RepositoryDB.SID=dashboardrepo
RepositoryDB.charsetEncoding=
RepositoryDB.connectionType=DB
RepositoryDB.dataCachePurgeFrequency=30
RepositoryDB.incrementSize=5
RepositoryDB.initialConnections=5
RepositoryDB.isBlank=false
RepositoryDB.isCubeRepository=false
RepositoryDB.isDefault=false
RepositoryDB.isReadOnly=true
RepositoryDB.isRepository=true
RepositoryDB.isSecureConnection=FALSE
RepositoryDB.isStaging=false
RepositoryDB.maxConnections=30
RepositoryDB.metaDataCachePurgeFrequency=BOOTUP
RepositoryDB.metadataCachingEnabled=true
RepositoryDB.password=<PASSWORD>
RepositoryDB.poolConnections=
RepositoryDB.port=5432
RepositoryDB.provider=POSTGRES
RepositoryDB.reSubmitIdleTime=30
RepositoryDB.server=<PostGres_IP>
RepositoryDB.timeZone=
RepositoryDB.url=jdbc:postgresql://<PostGres_IP>:5432/dashboardrepo
RepositoryDB.useRuntimeCredential=false
RepositoryDB.user=<USER>
4. Steps to change the default port of the Jakarta server:
a. Copy the sax-dashboard folder from [Dashboard_installation_path]/jakarta/webapps to [SAX_Tomcat_Home]/webapps.
b. Start the Intellicus Report Server and Web Server:
sudo ./reportserver.sh start
c. To enable the Dashboard in StreamAnalytix, set the below properties in env-config.yaml.
Note: Properties for enabling the dashboard in StreamAnalytix need only be set on the machine that is hosting the StreamAnalytix web admin.
Location: STREAMANALYTIX_HOME/conf/yaml/env-config.yaml
intellicus:
  sax.url: http://<IP>:<PORT>/sax-dashboard
NOTE: Replace <IP> and <PORT> with dashboard client IP and port.
Location:
STREAMANALYTIX_HOME/conf/common/dashboard-int/ReportClient.properties
REPORT_ENGINE_IP=<INSTALLATION_MACHINE_IP>
STREAMANALYTIX_HOME/conf/yaml/common.yaml
dashboard.enabled=true
d. Restart the StreamAnalytix admin server (Tomcat)
e. Log in to StreamAnalytix as the Admin.
Dashboard Synchronization Steps:
Perform the below synchronization steps in order to sync the existing users and other components with Dashboard.
NOTE:
• If the Dashboard is set up after creation of multiple users, migration steps are mandatory in order to sync the users with dashboard.
• Make sure to apply step-(e) i.e. log in with your StreamAnalytix admin credentials, before sync.
1. Open REST Client on browser.
2. Enter the below URL in address bar:
http://<StreamAnalytix_IP>:<PORT>/StreamAnalytix/dashboard/sync
Use HTTP method as GET.
3. Use basic authentication and add username: superuser and password as superuser.
4. Click on SEND button.
To install Kafka, follow the steps mentioned below:
1. Download the Kafka binary (.tar.gz) version 0.10.2.1 from the below URL:
https://www.apache.org/dist/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
2. Extract the tar.gz using below command:
$ tar -xvf kafka_2.12-0.10.2.1.tgz -C <<installationDir>>
$ cd <<installationDir>>/<<extractedDir>>
To enable SSL on Kafka, follow the steps mentioned below:
Perform the following steps on each node in the cluster:
Generating Node Certificates:
3. Create a certificate authority for your Kafka cluster. Substitute <DOMAIN_NAME> with your machine's domain name on all nodes, providing a keystore password and validity.
NOTE: The passwords should be the same.
$ keytool -genkeypair -keystore kafka.keystore -keyalg RSA -alias <<Domain Name>> -dname "CN=$(hostname -f)" -storepass <<password>> -keypass <<password>> -validity 32767
4. On all the nodes, rename the keystore file to a .jks file:
$ mv kafka.keystore kafka.jks
5. Generate a self-signed certificate on all the nodes:
$ keytool -export -alias <<Domain name of host>> -keystore kafka.jks -rfc -file selfsigned.cer
6. Rename selfsigned.cer to selfsigned<hostname/ip>.pem:
$ mv selfsigned.cer selfsigned<hostname/ip>.pem
7. Copy the selfsigned .pem files from all the nodes to the Kafka server where the truststore file will be generated:
$ scp selfsigned<hostip/name>.pem <<IP_address of Kafka server>>:/path_of_certificate
8. Import the self-signed certificates into the truststore on the node where the truststore file will be generated:
$ keytool -keystore truststore.jks -import -alias <<Hostname_of_the_node>> -file selfsigned<<hostname/ip>>.pem
9. Copy the truststore file from that server to all the other nodes, at the same path:
$ scp truststore.jks <hostname/ip of kafka brokers>:/path_of_certificate
10. Place kafka.jks in the same path as the certificate. Change the file permissions of kafka.jks and truststore.jks on all nodes:
$ chmod 777 kafka.jks truststore.jks
Configure SSL on all nodes of the Kafka Cluster
1. Enable TLS and specify the information required to access the node's certificate. Add the following information to the <<installationDir>>/<<extractedDir>>/config/server.properties file on each node:
listeners=SSL://<<hostname>>:9093
advertised.listeners=SSL://<<hostname>>:9093
ssl.keystore.location=<<kafka.jks file location>>
ssl.keystore.password=<<keystore password>>
ssl.key.password=<<key password>>
ssl.truststore.location=<<truststore.jks file location>>
ssl.truststore.password=<<truststore password>>
security.inter.broker.protocol=SSL
2. Configure more properties in the <<installationDir>>/<<extractedDir>>/config/server.properties file under the extracted folder.
Note: broker.id should be different for each Kafka broker.
broker.id=
log.dirs=
zookeeper.connect=<<IP address of zookeeper>>:2181
Start the Kafka server on all nodes:
$ nohup bin/kafka-server-start.sh config/server.properties &
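Optionally, you can confirm that the SSL listener is up with an openssl handshake check; the hostname is a placeholder:
$ openssl s_client -connect <<hostname>>:9093 < /dev/null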
Airflow version: 1.10.1
1. Create a folder that will be used as the Airflow home (with the sax user):
sax> mkdir /home/sax/airflow_home
2. Create a dags folder inside the Airflow home:
sax> mkdir /home/sax/airflow_home/dags
3. Login with the root user, open the .bashrc file and add the following property to it:
export SLUGIFY_USES_TEXT_UNIDECODE=yes
4. Login with the StreamAnalytix user, open the .bashrc file and add the following to it:
export AIRFLOW_HOME=/home/sax/airflow_home
5. Install Airflow using the following command (with the root user):
root> pip install apache-airflow==1.10.1
6. Initialize the Airflow database (with the StreamAnalytix user):
sax> airflow initdb
Note: Steps 7 and 8 should be performed after the sub-package installation, configuration and plugin installation are successfully completed.
7. Start the Airflow webserver with the StreamAnalytix user (the web server port is configured in airflow.cfg, as described below):
sax> airflow webserver
8. Start the Airflow scheduler:
sax> airflow scheduler
To install the sub-packages (with the root user):
root> pip install apache-airflow[hdfs]
root> pip install apache-airflow[mysql]
Note: The supported file system and database are HDFS and MySQL.
For more details, please refer link:
https://airflow.apache.org/installation.html
Go to $AIRFLOW_HOME, open the airflow.cfg file, and change the following properties (an illustrative example follows this list):
• default_timezone = system
• base_url = http://ipaddress:port (i.e. http://172.29.59.97:9292)
• web_server_host = ipaddress
• web_server_port = port (i.e. 9292)
• Add SMTP details for email under the [smtp] section in the config file. Uncomment and provide values for the following: smtp_host, smtp_user, smtp_password, smtp_port, smtp_mail_from
• catchup_by_default = False
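For illustration, the edited lines in airflow.cfg might look like the following; the IP and port are the example values above, and the SMTP values are placeholders:
default_timezone = system
base_url = http://172.29.59.97:9292
web_server_host = 172.29.59.97
web_server_port = 9292
catchup_by_default = False
[smtp]
smtp_host = smtp.example.com
smtp_user = airflow@example.com
smtp_password = <password>
smtp_port = 587
smtp_mail_from = airflow@example.com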
Steps to add StreamAnalytix Airflow Plugin in Airflow
1. Create a plugins folder in the Airflow home (if it does not exist), i.e. $AIRFLOW_HOME/plugins
2. Untar <sax_home>/conf/common/airflow-plugin/sax_airflow_rest_api_plugin.tar.gz
3. Copy sax_airflow_rest_api_plugin/* to the Airflow plugins folder
Token-based authentication is supported. Provide the token in the request header; the same token key and value must be provided in the Airflow config file.
Add the following entry in the $AIRFLOW_HOME/airflow.cfg file:
[sax_rest_api]
# key and value to authenticate the http request
sax_request_http_token_name = <sax_request_http_header_token>
sax_request_http_token_value = <token>
Here,
• <sax_request_http_header_token>: replace with the key used in the request header for the token.
• <token>: replace with the token value.
To configure Airflow in StreamAnalytix, refer to the Configuration section in the user guide.
Installing Jupyter and Sparkmagic on Centos/RHEL
You can install Jupyter using Docker or on the Host Machine.
To Install using Docker, follow the below link:
https://hub.docker.com/r/streamanalytiximpetus/jupyter
To Install on Host Machine, follow the below steps:
Jupyter requires Python 2.7 to be installed. Please make sure to install Python 2.7 before proceeding.
In addition, the following libraries are required:
gcc (sudo yum install gcc)
python-devel (sudo yum install python-devel)
krb5-devel (sudo yum install krb5-devel)
Login with the root user and install pip using the following command:
root> yum install python-pip
To install Jupyter, login with root user and use the following command:
root> pip install jupyter
If the following error occurs while installing Jupyter:
ERROR: ipykernel requires Python version 3.4 or above.
then first run the following commands:
root> pip install ipython==5.7
root> pip install ipykernel==4.10
Now install Jupyter again.
As a root user, run the following command:
root> pip install jupyter_contrib_nbextensions
As a ‘streamanalytix’ user, run the following command:
streamanalytix> jupyter notebook --generate-config
It will create a jupyter_notebook_config.py file; you can uncomment and provide parameters in that file.
The config file is located in the Jupyter configuration folder (the path is mentioned below).
• ~/.jupyter/
Once the config file is generated, uncomment and change the following entries in the file:
c.NotebookApp.notebook_dir = u'/home/sax/notebooks' (the default notebook directory)
Note: If you change the notebook directory path, the same needs to be updated in env-config.yaml (jupyter.dir).
c.NotebookApp.ip = <ip address of the machine where the Jupyter service will run>
c.NotebookApp.tornado_settings = {'headers': { 'Content-Security-Policy': "frame-ancestors http://sax_ip_and_port 'self' "}}
Run the following commands as the StreamAnalytix user:
streamanalytix> jupyter notebook password   (to add a password)
streamanalytix> jupyter contrib nbextension install --user
streamanalytix> jupyter nbextension install --py widgetsnbextension (or jupyter nbextension install --py widgetsnbextension --user)
streamanalytix> jupyter nbextension enable widgetsnbextension --py (or jupyter nbextension enable widgetsnbextension --user --py)
streamanalytix> jupyter nbextension enable hide_input/main
streamanalytix> jupyter nbextension enable init_cell/main
To start Jupyter service, run the following command with sax user:
streamanalytix> jupyter notebook
Install StreamAnalytix Python Library (on node where Jupyter is running)
A Python library is provided to read a source, fetch data from it, and create a data frame in notebooks.
Dependent Libraries
Run the following commands as the root user; they install all the prerequisite Python libraries:
root> pip install numpy==1.14
root> pip install pandas==0.22
root> pip install scipy==1.1.0
root> pip install sklearn
root> pip install scikit-learn==0.19.1
root> pip install matplotlib
root> pip install pyspark==2.3.0
Note: If any additional Python library is needed, install it on all nodes.
Follow the steps below to Install Streamanalytix Python Library on node where Jupyter is running:
Step 1: Go to the 'streamanalytix' user's home folder (~/). Create a directory named .streamanalytix and create a sax.config file inside it:
.streamanalytix/sax.config
Add the following content to the sax.config file:
[DEFAULT]
SAX_URL = <sax_url>
SAX_DATA_SOURCE_GATEWAY = StreamAnalytix/notebook/sourceDetail
SAX_SSL_ENABLE = <ssl_enable>
SSL_CERTIFICATE_PATH = <certificate_path>
Change the <sax_url> entry to the StreamAnalytix hostname/IP address and port (i.e. http://hostname:port).
By default, the user can keep <ssl_enable> as FALSE.
If SSL is enabled (i.e. the StreamAnalytix application is running on https), change <ssl_enable> to TRUE and change <certificate_path> to the location of the certificate that will be used to access the application.
Step 2: Open a terminal, login as root and change directory to <StreamAnalytix_installation_dir>/conf/jupyter/python/streamanalytix_script.
Step 3: Run the below command:
root> python setup.py build
This will build the library.
Step 4: Now run the install command as the root user:
root> python setup.py install
It will install the required packages, if not available, and install the streamanalytix Python library.
Step 5: Check the installation folder of streamanalytix using the command pip show StreamAnalytix.
Step 6: To check whether the streamanalytix library is available in the Python environment, go to the Python console and run the command import streamanalytix.
>>> import streamanalytix
If StreamAnalytix is not properly installed, you will get an error.
Auto create Notebook using REST API
1. On node where Jupyter is running, login using the ‘streamanalytix’ user and navigate to folder <<StreamAnalytix_Installation_Dir>>/conf/jupyter/python/autonotebook.
streamanalytix> cd <<StreamAnalytix_Installation_Dir>>/conf/jupyter/python/autonotebook
2. Run the auto_create_notebook.py script using the following command:
streamanalytix> python auto_create_notebook.py &
It will start the service on port 5000. If you want to change the port, pass the port number as an argument, e.g.: python auto_create_notebook.py port=5004.
By default, it will create a log file named auto_create_notebook.log in the folder from which the script is started. If you want to change the log file path, pass the log file path and name as an argument when starting the script.
Example:
streamanalytix> python auto_create_notebook.py logfile=/my/log/folder/auto_create_notebook.log port=5004 &
Note: Port and Logfile are optional.
To configure Jupyter in StreamAnalytix, refer to the Configuration section in the user guide.
Note: Make sure Livy is installed to avail Sparkmagic.
To install Sparkmagic, login with the root user and run the following commands:
Reference: https://github.com/jupyter-incubator/sparkmagic
root> pip install sparkmagic
root> jupyter nbextension enable --py --sys-prefix widgetsnbextension
To validate the location of Sparkmagic, run the following command:
root> pip show sparkmagic
Now, change to the Sparkmagic location:
root> cd <location of spark magic>
Then run the following commands to install the kernels; this will activate the Scala, PySpark and Python kernels in Sparkmagic for further use:
root> jupyter-kernelspec install sparkmagic/kernels/sparkkernel
root> jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
root> jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
root> jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
root> jupyter serverextension enable --py sparkmagic
Configuration for StreamAnalytix user
Login with the StreamAnalytix user and follow the below steps:
1. Create the directory ~/.sparkmagic if it does not exist.
2. Create a config.json file at the path ~/.sparkmagic and add details as given in example_config.json (https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json).
3. Provide the Livy URL under all kernels (i.e. kernel_python_credentials, etc.) in config.json (the default is localhost).
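As a minimal sketch only (based on sparkmagic's example_config.json; check the keys and defaults against that file), the Livy URL entries might look like this, with the host and port chosen per the Livy notes above:
{
  "kernel_python_credentials": { "username": "", "password": "", "url": "http://<livy_host>:<livy_port>" },
  "kernel_scala_credentials": { "username": "", "password": "", "url": "http://<livy_host>:<livy_port>" }
}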
Configuration to add a custom jar to the Notebook classpath (StreamAnalytix user)
1. Upload spark-notebook.jar to the Hadoop file system of the cluster.
2. Provide the uploaded jar location in the file ~/.sparkmagic/config.json under the below properties:
• jars
• conf > spark.driver.extraClassPath
• conf > spark.executor.extraClassPath
3. Provide the ZooKeeper entries in the file ~/.sparkmagic/config.json under the following properties:
• spark.executor.extraJavaOptions
• spark.driver.extraJavaOptions
These Zookeeper entries are present at the following location:
<<Streamanalytix_Installation_Dir>>/conf/config.properties
SparkMagic installation and configuration ends here. To apply the changes, restart the Jupyter service.
After the installation is complete, make sure that the following services are running:
• Jupyter notebook on port 8888
• Auto create notebook service on port 5000
1. If the following error occurs while opening a PySpark or Scala notebook:
The code failed because of a fatal error:
Failed to register auto viz for notebook.
First, check the pandas version using the command pip show pandas. If it is 0.23, downgrade it to version 0.22 using the commands:
root> pip uninstall pandas
root> pip install pandas==0.22
Now, open the config.json file at the path ~/.sparkmagic, search for the entry "use_auto_viz" and change its value to false.
2. If the notebook takes time to create a Spark session in PySpark and Scala notebooks and the session is not up in 60 seconds, open the config.json file at the path ~/.sparkmagic, search for the entry "livy_session_startup_timeout_seconds" and increase the number of seconds (e.g. 120).
3. Also, make sure that the configuration given in the config.json file at the path ~/.sparkmagic is syntactically correct; otherwise the Sparkmagic library will fail to parse the JSON, and the PySpark and Scala notebooks will not be usable.
Installing Cloudera Navigator (optional for CDH Cluster only)
1. Open the Cloudera Manager UI and click on 'Cloudera Management Service'.
2. This opens the Cloudera Management Service page. Click on Add Role Instances.
3. Select the hosts for the Navigator Audit Server, Navigator Metadata Server and Activity Monitor, and click Continue.
4. You need to create the databases for the above services if you are going with a MySQL database. For Postgres, they get created automatically, but this is not recommended for a production environment.
5. Connect to the databases, check the database connectivity, and then click Continue.
6. Next, you need to start the services below, in the specified order:
a. Audit server
b. Metadata server
c. Activity Server
7. Go to the Navigator Metadata Server and click on the 'Cloudera Navigator' shortcut.
8. Login to Cloudera Navigator with username: admin and password: admin.
Configure StreamAnalytix for Kerberos (Optional)
• Make sure that you have an existing MIT Kerberos setup.
• In addition, a Kerberos-enabled CDH cluster must be set up.
Steps to setup StreamAnalytix for Kerberos
1. Create two principals, one for the StreamAnalytix user and one for the Kafka user, using the kadmin utility. The principals will be "headless" principals. For example, if 'sanalytix' and 'kafka' are the StreamAnalytix and Kafka users respectively, then run:
kadmin -q "addprinc -randkey sanalytix"
kadmin -q "addprinc -randkey kafka"
2. Use the kadmin utility to create keytab files for the above principals, using:
kadmin -q "ktadd -k <keytab-name>.keytab <username>"
Note: Also ensure that the keytabs are readable only by the StreamAnalytix user.
Example:
kadmin -q "ktadd -k sax.service.keytab sanalytix"
kadmin -q "ktadd -k sax-kafka.keytab kafka"
3. Create a JAAS configuration file named keytab_login.conf with the following sections:
• com.sun.security.jgss.initiate (for HTTP client authentication)
• Client (for Zookeeper)
• StormClient (for Storm)
• KafkaClient (for Kafka)
Each section in a JAAS configuration file, when using keytabs for Kerberos security, has the same general format.
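The original sample keytab_login.conf is not reproduced here; as a rough illustration only, a keytab-based section for the Krb5LoginModule generally has the following shape (the section name, keytab path, principal and realm are placeholders that must match your environment):
Client {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   keyTab="/path/to/sax.service.keytab"
   storeKey=true
   useTicketCache=false
   principal="sanalytix@<REALM>";
};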
4. Now move the keytabs and keytab_login.conf to the $SAX_HOME/conf/common/Kerberos folder and copy the files to the $SAX_HOME/conf/thirdpartylib folder.
Also copy Kafka's server.properties file to $SAX_HOME/conf/common/Kerberos.
Note: Replace $SAX_HOME with the path of the StreamAnalytix home directory.
5. Add the StreamAnalytix user to the supergroup of the hdfs user on all nodes.
6. On HBase master node, use kinit using HBase user and grant the StreamAnalytix user the read, write and create privileges as follows:
sudo -u hbase kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase
sudo -u hbase $HBASE_HOME/bin/hbase shell
Note: Replace $HBASE_HOME with the path to the HBase installation folder. 'hbase' is the user through which HBase is deployed.
7. In the HBase shell, run grant 'sanalytix', 'RWC', where sanalytix is the StreamAnalytix user.
8. Grant cluster action permission on the Kafka cluster. Run the following command on a Kafka broker node:
sudo -u kafka $KAFKA_HOME/bin/kafka-acls.sh -config $KAFKA_HOME/config/server.properties -add -allowprincipals user:sanalytix -operations ALL -cluster