Gathr Prerequisites
💡 Before beginning the installation, see Component Versions Supported →
Hardware and Software Configurations
- Machine Type: 8 VCPUs, 32 GiB memory 
- Disk Space: 100 GB (Min) 
Operating System
- Linux: RHEL9/OEL9 
- sudo Access: Required during installation. 
- Internet Access: Optional (preferred during the installation) 
Prerequisites
The prerequisites listed below, and explained further in the sub-topics, must be deployed before proceeding further:
Note: For RHEL9, use the yum package manager; for OEL9, use the dnf package manager to install, update, or remove packages.
💡 The component versions that are mentioned for each prerequisite are for representational purpose. For the Gathr supported component version details, see Component Versions Supported →
- Java 
- Zookeeper 
- PostgreSQL 
- Elasticsearch 
- RabbitMQ 
- Python 
- Anaconda 
- Docker 
- Kubernetes 
- Tesseract Installation for Text Parsing 
Java 8
Java is a basic prerequisite and it is required for Gathr installation.
Update Existing Packages on Machine
Install the latest updates on your machine using the below command:
sudo yum update
Java Installation Command
To install Java 8 on your machine, follow the below steps:
sudo yum install java-1.8.0-openjdk
sudo yum install java-1.8.0-openjdk-devel
For Multiple Java on System
If the machine has multiple JDKs installed, you can use the alternatives command to set the default Java:

sudo alternatives --config java
A list of all installed Java versions will be printed on the screen. Enter the number of the version you want to use as the default and press Enter.
Set Java Home in User bashrc
- Edit the .bashrc file to set the Java home:

  vi ~/.bashrc

- Once the .bashrc file opens, append the Java home at the end:

  export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64
  export PATH=$JAVA_HOME/bin:$PATH

- Reload the file and check that the Java home is set correctly:

  source ~/.bashrc
  echo $JAVA_HOME
  echo $PATH

- Type jps to check the status of running Java processes.
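The JAVA_HOME setup above can be sanity-checked in a throwaway shell before touching ~/.bashrc; the JDK directory below is the example path from the step above and may differ on your machine:

```shell
#!/bin/sh
# Sketch: verify that JAVA_HOME and PATH are set as the steps above intend.
# The JDK directory is an assumed example; substitute your actual path.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64
export PATH=$JAVA_HOME/bin:$PATH
# JAVA_HOME must be non-empty and PATH must start with its bin directory.
[ -n "$JAVA_HOME" ] || { echo "JAVA_HOME not set"; exit 1; }
case "$PATH" in
  "$JAVA_HOME/bin:"*) echo "PATH ok" ;;
  *) echo "PATH missing JAVA_HOME/bin"; exit 1 ;;
esac
```

If both checks pass, sourcing the same exports from ~/.bashrc will behave identically.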
Zookeeper 3.9.1
Zookeeper is a basic prerequisite and it is required for Gathr installation.
Install Zookeeper
Steps to Install Zookeeper:
- Download the apache-zookeeper-3.9.1-bin.tar.gz package from: https://archive.apache.org/dist/zookeeper
- Copy it into <installation_dir> (e.g., /opt/gathr/services/).
- Untar the bundle using the below command:

  tar -xvzf apache-zookeeper-3.9.1-bin.tar.gz

- Run the below command as root or a sudo user to grant permissions to the respective service ID:

  chown -R serviceId:serviceId apache-zookeeper-3.9.1-bin

- In the zookeeper folder, create a folder with the name datadir:

  mkdir -p apache-zookeeper-3.9.1-bin/datadir

- Create a copy of <installation_dir>/apache-zookeeper-3.9.1-bin/conf/zoo_sample.cfg and rename it to zoo.cfg using the below command:

  cp apache-zookeeper-3.9.1-bin/conf/zoo_sample.cfg apache-zookeeper-3.9.1-bin/conf/zoo.cfg

- Open the zoo.cfg file using the below command:

  vi apache-zookeeper-3.9.1-bin/conf/zoo.cfg

- Update the IP address in the zoo.cfg file by adding the below property:

  server.1=<ip of machine where zk is being installed>:2888:3888

- Update the dataDir path in zoo.cfg:

  dataDir=<installation_dir>/apache-zookeeper-3.9.1-bin/datadir

- Execute the below command to start Zookeeper:

  <installation_dir>/apache-zookeeper-3.9.1-bin/bin/zkServer.sh start

- To check the Zookeeper status, run the below command:

  <installation_dir>/apache-zookeeper-3.9.1-bin/bin/zkServer.sh status

  Run the jps command to see if the QuorumPeerMain process is running.
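The zoo.cfg edits above can also be scripted. The sketch below builds a minimal zoo.cfg non-interactively; the IP address, installation path, and the tickTime/initLimit/syncLimit defaults are placeholder values standing in for your environment and the shipped zoo_sample.cfg:

```shell
#!/bin/sh
# Sketch: write a minimal zoo.cfg with the server and dataDir entries
# described in the steps above. Paths and IP are assumed placeholders.
ZK_HOME=./apache-zookeeper-3.9.1-bin   # assumed installation path
ZK_IP=10.0.0.5                         # placeholder machine IP
mkdir -p "$ZK_HOME/conf" "$ZK_HOME/datadir"
cat > "$ZK_HOME/conf/zoo.cfg" <<EOF
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=$ZK_HOME/datadir
server.1=$ZK_IP:2888:3888
EOF
grep "^server.1=" "$ZK_HOME/conf/zoo.cfg"
```

On a real node you would start from zoo_sample.cfg instead of writing the file from scratch, then start the service with zkServer.sh start.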
PostgreSQL 14
PostgreSQL is the recommended database for various storage operations in Gathr.
Install PostgreSQL using the below steps:
- Download the Postgres RPM using below command: - sudo yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-$(rpm -E %{rhel})-x86_64/pgdg-redhat-repo-latest.noarch.rpm
- Install Postgresql 14: - sudo yum install postgresql14 postgresql14-server
- Initialize the database:

  sudo /usr/pgsql-14/bin/postgresql-14-setup initdb

  You will get a confirmation stating 'Initializing database ... OK'.
- Start the PostgreSQL Service: - sudo systemctl start postgresql-14
- Enable the PostgreSQL Service: - sudo systemctl enable postgresql-14
- Check the status of the PostgreSQL Service: - sudo systemctl status postgresql-14
- Configure PostgreSQL:

  - Log in to the Postgres console using the below command:

    sudo -u postgres psql

  - Reset the admin database password:

    psql -c "ALTER USER postgres WITH PASSWORD 'your-password';"

  - Identify the PostgreSQL primary configuration file using the below command:

    sudo -u postgres psql -c 'SHOW config_file;'

  - Create a backup of the Postgres primary configuration file:

    sudo cp /var/lib/pgsql/14/data/postgresql.conf /var/lib/pgsql/14/data/postgresql.conf.bak

  - Open the PostgreSQL configuration file to allow remote access:

    sudo vi /var/lib/pgsql/14/data/postgresql.conf

  - Change the listen address from localhost:

    listen_addresses = 'localhost'

    to all addresses, as follows:

    listen_addresses = '*'

  - Now, edit the PostgreSQL access policy configuration file:

    sudo vi /var/lib/pgsql/14/data/pg_hba.conf

  - Update the PostgreSQL access policy configuration file as shown in the below example:

    ```
    # TYPE  DATABASE        USER            ADDRESS                 METHOD
    # "local" is for Unix domain socket connections only
    local   all             all                                     peer
    # IPv4 local connections:
    host    all             all             0.0.0.0/0               md5
    # IPv6 local connections:
    host    all             all             ::1/128                 md5
    # Allow replication connections from localhost, by a user with the
    # replication privilege.
    local   replication     all                                     peer
    host    replication     all             127.0.0.1/32            md5
    host    replication     all             ::1/128                 md5
    ```

  - Restart the PostgreSQL service using the below command:

    sudo systemctl restart postgresql-14
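The listen_addresses change above can be applied with sed instead of vi. The sketch below works on a scratch copy of the file rather than the live /var/lib/pgsql/14/data/postgresql.conf:

```shell
#!/bin/sh
# Sketch: flip listen_addresses from 'localhost' to '*' in a scratch copy,
# mirroring the manual edit described above.
CONF=./postgresql.conf.sample
printf "listen_addresses = 'localhost'\nport = 5432\n" > "$CONF"
cp "$CONF" "$CONF.bak"   # keep a backup first, as in the steps above
sed -i "s/^listen_addresses = 'localhost'/listen_addresses = '*'/" "$CONF"
grep "^listen_addresses" "$CONF"
```

After editing the real file the same way, restart the service with systemctl restart postgresql-14 for the change to take effect.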
 
RabbitMQ 3.11.16 (Optional)
RabbitMQ is an optional component; however, it is important because Gathr uses it for pipeline error handling.
Download the below files from RabbitMQ:
rabbitmq-server-generic-unix-3.11.16.tar.xz 
esl-erlang_25.0.3-1~centos~7_amd64.rpm 
Install the Erlang package:

sudo yum localinstall esl-erlang_25.0.3-1~centos~7_amd64.rpm -y
Then extract RabbitMQ tar:
tar -xf rabbitmq-server-generic-unix-3.11.16.tar.xz
Export PATH:
export PATH=$PWD/rabbitmq_server-3.11.16/sbin:$PATH
Starting the Server
To start the server, run:

sbin/rabbitmq-server

This displays a short banner message concluding with "completed with [n] plugins.", indicating that the RabbitMQ broker has started successfully. To start the server in detached mode, use:

rabbitmq-server -detached

This will run the node process in the background.
Stopping the Server
To stop a running node, use the below command:
sbin/rabbitmqctl shutdown
The command will wait for the node process to stop. If the target node is not running, it will exit with an error.
The node can be instructed to use more conventional system directories for configuration, node data directory, log files, plugins and so on.
In order to make the node use operating system defaults, locate the following line:
PREFIX=${RABBITMQ_HOME}
In the sbin/rabbitmq-defaults script, change this line to:
SYS_PREFIX=
Do not modify any other line in this script.
To create a test user, use the below commands:

sudo rabbitmqctl add_user test test
sudo rabbitmqctl set_user_tags test administrator
sudo rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
Enable the management plugin:

rabbitmq-plugins enable rabbitmq_management
Systemd unit file:
Replace $DATA_PATH with actual Path. Create a file:
/tmp/rabbitmq-server.service
[Unit]
Description=RabbitMQ broker
After=syslog.target network.target
[Service]
Type=notify
User=rabbitmq
Group=rabbitmq
UMask=0027
NotifyAccess=all
TimeoutStartSec=600
LimitNOFILE=32768
Restart=on-failure
RestartSec=10
#WorkingDirectory=/var/lib/rabbitmq
ExecStart=$DATA_PATH/rabbitmq_server-3.11.16/sbin/rabbitmq-server
ExecStop=$DATA_PATH/rabbitmq_server-3.11.16/sbin/rabbitmqctl shutdown
# See rabbitmq/rabbitmq-server-release#51
SuccessExitStatus=69
[Install]
WantedBy=multi-user.target
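The $DATA_PATH substitution in the unit file above can be scripted with sed rather than edited by hand. In this sketch the install location is an assumed placeholder, and only the two Exec lines are used as a stand-in for the full unit file:

```shell
#!/bin/sh
# Sketch: render the unit file's Exec lines with a concrete DATA_PATH.
DATA_PATH=/opt/gathr/services            # assumed install location
cat > /tmp/rabbitmq-server.service.tmpl <<'EOF'
ExecStart=$DATA_PATH/rabbitmq_server-3.11.16/sbin/rabbitmq-server
ExecStop=$DATA_PATH/rabbitmq_server-3.11.16/sbin/rabbitmqctl shutdown
EOF
sed "s|\$DATA_PATH|$DATA_PATH|g" /tmp/rabbitmq-server.service.tmpl \
  > /tmp/rabbitmq-server.service
grep Exec /tmp/rabbitmq-server.service
```

The rendered file is then copied into the systemd unit directory and picked up with systemctl daemon-reload.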
Start/Stop/enable/Status commands for RabbitMQ:
- Status: sudo systemctl status rabbitmq-server.service
- Start: sudo systemctl start rabbitmq-server.service
- Enable: sudo systemctl enable rabbitmq-server.service
- Stop: sudo systemctl stop rabbitmq-server.service
Elasticsearch 6.8.22
Elasticsearch is required to run features like Audit Trail, Error Search, Monitoring Graphs, and various search operations in Gathr.
To install Elasticsearch, open an SSH terminal and follow these steps:
- Download the Elasticsearch binary (.tar.gz, version 6.8.22) using the below command:

  wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.8.22.tar.gz

- Extract the tar.gz using the below commands:

  tar -xvf elasticsearch-6.8.22.tar.gz -C <<installationDir>>
  cd <<installationDir>>/<<extractedDir>>

- Open config/elasticsearch.yml using the below command:

  vi config/elasticsearch.yml

  Once the elasticsearch.yml file opens, update the below configurations:

  - cluster.name: ES6822
  - node.name: <IP of the machine>
  - path.data: /<installation-directory>/elasticsearch-6.8.22/data
  - path.logs: /<installation-directory>/elasticsearch-6.8.22/logs
  - network.host: <IP of the machine>
  - http.port: 9200
  - discovery.zen.ping.unicast.hosts: ["<IP of the machine>"]

  Place this at the end of the file:

  action.auto_create_index: .security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,sax-meter*,sax_audit_*,*-sax-model-index,true,sax_error_*,ns*,gathr_*,*_accumulator_*,query_monitoring_*

  Run the below command to set vm.max_map_count:

  sudo sysctl -w vm.max_map_count=262144

- Run the below command to start Elasticsearch in the background:

  nohup ./bin/elasticsearch &

- Check if the Elasticsearch process is running correctly by following either of these steps:

  - Run the jps command; it will show the Elasticsearch process in the output.
  - Or, hit the below URL in a web browser to check if the Elasticsearch service is running:

    http://<IP of the machine>:9200/
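The elasticsearch.yml settings listed above can be generated in one step. In the sketch below the machine IP and extraction directory are assumed placeholders; substitute your real values before using it:

```shell
#!/bin/sh
# Sketch: write the elasticsearch.yml settings from the table above.
# ES_HOME and ES_IP are placeholder values for this illustration.
ES_HOME=./elasticsearch-6.8.22      # assumed extraction directory
ES_IP=10.0.0.5                      # placeholder machine IP
mkdir -p "$ES_HOME/config"
cat > "$ES_HOME/config/elasticsearch.yml" <<EOF
cluster.name: ES6822
node.name: $ES_IP
path.data: $ES_HOME/data
path.logs: $ES_HOME/logs
network.host: $ES_IP
http.port: 9200
discovery.zen.ping.unicast.hosts: ["$ES_IP"]
EOF
grep "^cluster.name" "$ES_HOME/config/elasticsearch.yml"
```

Remember to also append the action.auto_create_index line from the steps above before starting the service.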
Python 3.8.8
Gathr requires Python for features like Workflows, Scikit-Based Models Registration, Python Processor, Scikit Processor, OpenAI Embeddings Generator Processor, Pinecone Vector Emitter, Pinecone Vector Lookup, Binary To Text Parser, and so on.
Python 3.8.8 should be installed on all the nodes of the cluster that are hosting any services required to run Gathr, like, HDFS, Spark, Yarn, Airflow, Gathr Webstudio, and so on.
Python 3.8.8 Installation
To install Python 3.8.8, open an SSH terminal and follow these steps:
- Required packages for installation. Use the following commands to install the prerequisites for Python 3.8.8:

  sudo yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel gcc libffi-devel
  sudo yum groupinstall "Development Tools"

  If the command sudo yum groupinstall "Development Tools" fails to download the development tools, run the below command:

  sudo yum install @development

  Once the prerequisites for installing Python 3.8.8 are met, proceed to the next step.

- Download the Python 3.8.8 tar bundle using the below commands:

  cd /opt
  wget https://www.python.org/ftp/python/3.8.8/Python-3.8.8.tgz

  Now, extract the downloaded package:

  tar xzf Python-3.8.8.tgz

- Compile the Python source. Use the below set of commands to compile the Python source code on your system:

  cd Python-3.8.8
  sudo ./configure --enable-optimizations
  sudo make altinstall

  Now, remove the downloaded source archive file from your system:

  sudo rm /opt/Python-3.8.8.tgz

- Check the Python 3.8 version using the below command:

  python3.8 -V

- Create softlinks so that the default python3 points to python3.8:

  sudo ln -sf /usr/local/bin/python3.8 /usr/bin/python3
  sudo ln -sf /usr/local/bin/pip3.8 /usr/bin/pip3

- Check the Python 3 version using the below command:

  python3 -V

- Install the below required Python packages with Python 3.8:

  pip3 install urllib3==1.26.15   (optional)
  pip3 install h2o==3.28.1.3
  pip3 install scikit-learn==0.24.1
  pip3 install numpy==1.19.2
  pip3 install pandas==1.2.4
  pip3 install matplotlib==3.4.2
  pip3 install mlflow==1.0.0
  pip3 install hdfs==2.5.8
  pip3 install scipy==1.6.2
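The package pins above can equally be kept in a requirements file and installed in one command; the file name gathr-requirements.txt is an assumption for this sketch:

```shell
#!/bin/sh
# Sketch: capture the pinned packages in a requirements file so they can
# all be installed at once with pip3 install -r.
cat > gathr-requirements.txt <<'EOF'
h2o==3.28.1.3
scikit-learn==0.24.1
numpy==1.19.2
pandas==1.2.4
matplotlib==3.4.2
mlflow==1.0.0
hdfs==2.5.8
scipy==1.6.2
EOF
wc -l < gathr-requirements.txt
```

Then install with: pip3 install -r gathr-requirements.txt (add urllib3==1.26.15 to the file if you need the optional pin).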
Anaconda Installation
Below are the steps to install Anaconda:
- To use GUI (Graphical User Interface) packages with Linux, install the following extended dependencies for Qt:

  sudo yum install libXcomposite libXcursor libXi libXtst libXrandr alsa-lib mesa-libEGL libXdamage mesa-libGL libXScrnSaver
- To download the installer, open a terminal and use the following command: - curl -O https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
- To install, run the following command as mentioned below: - bash Anaconda3-2023.09-0-Linux-x86_64.sh
- Press Enter to review the license agreement.
- Then press and hold Enter to scroll.
- Enter yes to agree to the license agreement.

- Press Enter to accept the default install location, use CTRL+C to cancel the installation, or enter another file path to specify an alternate installation directory (e.g., /anaconda3, which is used in the steps below).

- Enter no; conda will then not modify your shell scripts to initialize the Anaconda Distribution (initialization is done manually in the steps below).

Change the permission of the /anaconda3 folder using the below command:
```
sudo chmod 777 /anaconda3
```
Once the installation is done, go to the below folder:
```
<anaconda installed folder>/bin
```
Run the below command:
```
conda init
```
It will make entries in .bashrc. Repeat this step for each user that needs to initialize conda.
Switch to the sax user and initialize conda:
```
cd /anaconda3/bin
./conda init
```

If conda activates the base environment after login, run the below command:
```
conda config --set auto_activate_base false
```
Then open a new terminal with the sax user.

Repeat the conda init and auto_activate_base steps above for the root user.


The installation is done; now check the conda version with conda --version.

To know more about Anaconda, refer to the official Anaconda documentation.
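conda init works by appending a managed block to the user's ~/.bashrc between ">>> conda initialize >>>" and "<<< conda initialize <<<" markers, so a quick grep shows whether a given user has already been initialized. The sketch below uses a scratch file standing in for a real rc file:

```shell
#!/bin/sh
# Sketch: check whether conda init has already stamped a shell rc file.
# RC is a scratch stand-in here; point it at ~/.bashrc for a real check.
RC=./bashrc.scratch
printf '# >>> conda initialize >>>\n# managed by conda init\n# <<< conda initialize <<<\n' > "$RC"
if grep -q '>>> conda initialize >>>' "$RC"; then
  echo "conda initialized"
else
  echo "conda not initialized"
fi
```

This is useful when repeating the initialization step for multiple users (sax, root), to avoid running conda init twice for the same account.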
Tesseract Installation for Text Parsing
Install the below packages:
- sudo yum install epel-release
- sudo yum install tesseract-langpack-eng
- sudo yum install tesseract-langpack-hin
echo 'export TESSDATA_PREFIX=/usr/share/tesseract/tessdata/' >> ~/.bashrc
source ~/.bashrc
Make sure the tesseract path (export TESSDATA_PREFIX=/usr/share/tesseract/tessdata/) is correct.
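The TESSDATA_PREFIX export above can be verified before relying on it; the sketch below appends the line to a scratch rc file and sources it, standing in for the ~/.bashrc edit:

```shell
#!/bin/sh
# Sketch: append TESSDATA_PREFIX to a scratch rc file and confirm the
# shell picks it up. Swap RC for ~/.bashrc on a real machine.
RC=./bashrc.tess
echo 'export TESSDATA_PREFIX=/usr/share/tesseract/tessdata/' >> "$RC"
. "$RC"
[ "$TESSDATA_PREFIX" = "/usr/share/tesseract/tessdata/" ] && echo "TESSDATA_PREFIX ok"
```

On the real machine you can additionally confirm the directory exists with ls "$TESSDATA_PREFIX".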

Generic JDBC Elasticsearch Connector
Steps to follow for the first-time deployment of the Generic JDBC Elasticsearch component:
- Create the folder /genericjdbc/ inside the sax installation directory, i.e., $SAX_HOME/lib/genericjdbc/
- Copy the Generic JDBC Elasticsearch connector jar from $SAX_HOME/conf/thirdpartylib/genericjdbc-elasticsearch-1.0.1.jar to $SAX_HOME/lib/genericjdbc/
- Update the "sax.local.inspect.classpath": "$SAX_HOME/lib/genericjdbc/*.jar" property in the common.yaml file.
- Add the connector jar from $SAX_HOME/lib/genericjdbc/ to the $SAX_HOME/server/tomcat/lib/ folder.
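The file-copy steps above can be sketched as a script. Here SAX_HOME points at a scratch tree and the jar is an empty stand-in file, so the sketch only demonstrates the folder layout and copy order, not a real deployment:

```shell
#!/bin/sh
# Sketch of the Generic JDBC connector deployment steps above.
SAX_HOME=./sax_home                      # scratch stand-in for the real $SAX_HOME
mkdir -p "$SAX_HOME/conf/thirdpartylib" "$SAX_HOME/lib/genericjdbc" \
         "$SAX_HOME/server/tomcat/lib"
: > "$SAX_HOME/conf/thirdpartylib/genericjdbc-elasticsearch-1.0.1.jar"  # stand-in jar
cp "$SAX_HOME/conf/thirdpartylib/genericjdbc-elasticsearch-1.0.1.jar" \
   "$SAX_HOME/lib/genericjdbc/"
cp "$SAX_HOME/lib/genericjdbc/"*.jar "$SAX_HOME/server/tomcat/lib/"
ls "$SAX_HOME/server/tomcat/lib/"
```

The common.yaml classpath property still needs to be updated by hand, as described above.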
Read-only user for Postgres database required for Data asset AI support
Below are the steps to create a readonly user in SQL Server:
- Log in with the SQL admin user.
- CREATE LOGIN readonly WITH PASSWORD = 'ReadOnly@123';
- USE gathr_database;
- CREATE USER readonly FOR LOGIN readonly;
- GRANT SELECT TO readonly;
Below are the steps to create a readonly user in Postgres Server:
- Log in with the postgres admin user.
- CREATE ROLE readonly WITH LOGIN PASSWORD 'readonly';
- GRANT CONNECT ON DATABASE gathr_database TO readonly;
- GRANT USAGE ON SCHEMA public TO readonly;
- \c <db_name>
- GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
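The Postgres statements above can be collected into a SQL script and run in one shot; the file name readonly_user.sql is an assumption for this sketch:

```shell
#!/bin/sh
# Sketch: collect the read-only grants into a script that can be run with:
#   sudo -u postgres psql -d gathr_database -f readonly_user.sql
cat > readonly_user.sql <<'EOF'
CREATE ROLE readonly WITH LOGIN PASSWORD 'readonly';
GRANT CONNECT ON DATABASE gathr_database TO readonly;
GRANT USAGE ON SCHEMA public TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
EOF
grep -c GRANT readonly_user.sql
```

Running the script with -d gathr_database replaces the interactive \c <db_name> step, since psql connects to that database directly.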
Kubernetes
💡 The component versions that are mentioned in this sub-topic are for representational purpose only. For the Gathr supported component version details, see Component Versions Supported →
Kubernetes is required to register container images in the Gathr application and to connect integrated development environments, such as Jupyter Lab or Visual Studio Code, on the sandbox.
Below are the setup details for Kubernetes cluster:
Requirements
- A Kubernetes cluster with access to the kube-apiserver endpoint (https://kube-apiserver:kube-apiserver_port_number). The default API port is 443.
- Connectivity between the access node and the API server endpoint URL. To check accessibility, run the following command on the access node:

  curl https://kube-apiserver:kube-apiserver_port_number/version --insecure

- A Kubernetes service account, an account to access Kubernetes, or a kubeconfig file that is created by using the service account and a token.
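The /version endpoint returns a small JSON document, so the server version can be extracted from the curl output without extra tooling. In the sketch below a canned response stands in for the live curl output, and the version string is purely illustrative:

```shell
#!/bin/sh
# Sketch: pull gitVersion out of a /version response without jq.
# RESPONSE is a canned sample; on a live cluster use:
#   RESPONSE=$(curl -s https://kube-apiserver:port/version --insecure)
RESPONSE='{"major":"1","minor":"28","gitVersion":"v1.28.2"}'
VERSION=$(printf '%s' "$RESPONSE" | sed 's/.*"gitVersion":"\([^"]*\)".*/\1/')
echo "$VERSION"
```

A non-empty version string confirms both connectivity and that the endpoint really is a kube-apiserver.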
A Kubernetes cluster should be available, typically comprising a master node and multiple worker nodes.
The cluster and its nodes are managed from the master node using the 'kubeadm' and 'kubectl' commands.
To install and deploy Kubernetes, a kubeadm-based multi-node cluster is recommended.
On the Master Node following components will be installed:
- API Server 
- Scheduler 
- Controller Manager 
- etcd 
- Kubectl utility 
On the Worker Nodes following components will be installed:
- Kubelet 
- Kube-Proxy 
- Pod 
For detailed information about setting up a Kubernetes cluster, refer to the official Kubernetes documentation on kubeadm-based cluster setup.
Verify Kubernetes Installation:
On Kubernetes master and worker nodes, check Start/Stop/Restart services:
systemctl status kubelet
systemctl status docker
systemctl status nfs-server
Run below commands on Kubernetes master to get status of cluster and pods:
kubectl get nodes
kubectl get pods --all-namespaces
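The node status check above can be scripted to flag NotReady nodes. In this sketch the NODES variable holds canned sample output standing in for a live cluster:

```shell
#!/bin/sh
# Sketch: count NotReady nodes from `kubectl get nodes --no-headers` output.
# NODES is canned sample output; on a live cluster use:
#   NODES=$(kubectl get nodes --no-headers)
NODES='master1 Ready control-plane 10d v1.28.2
worker1 Ready <none> 10d v1.28.2
worker2 NotReady <none> 10d v1.28.2'
NOT_READY=$(printf '%s\n' "$NODES" | awk '$2 != "Ready" {n++} END {print n+0}')
echo "NotReady nodes: $NOT_READY"
```

A non-zero count is a cue to run kubectl describe node on the affected nodes, as covered in the debugging steps below.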
Debugging Kubernetes Pods:
Run below commands on Kubernetes master:
- Get the pod info and verify the events, volume mounts, environment variables, endpoints, etc.:

  kubectl describe pod <pod-name>

- You can also watch the logs of a pod using:

  kubectl logs -f <pod-name>

- Enter a bash/sh terminal of the pod and look at the configurations, volume maps, etc.:

  kubectl exec -it <pod-name> bash

- If a pod is evicted, look at the nodes' CPU/memory/disk pressure. Describe a node:

  kubectl describe node <node-name>

- If disk pressure is True (which evicts pods), also check the events listed at the bottom of the node description. You can also watch the kube-scheduler logs for more details.
Troubleshooting Cluster
Run below commands on Kubernetes master:
List the cluster nodes:
kubectl get nodes
To get detailed information about the overall health of the cluster:
kubectl cluster-info dump
To check logs on Master Node:
- API Server, responsible for serving the API
/var/log/kube-apiserver.log
- Scheduler, responsible for making scheduling decisions:
/var/log/kube-scheduler.log
- Controller that manages replication controllers:
/var/log/kube-controller-manager.log
To check logs on Worker Nodes:
- Kubelet, responsible for running containers on the node:
/var/log/kubelet.log
- Kube Proxy, responsible for service load balancing:
/var/log/kube-proxy.log
Firewall Settings
Check whether the firewall is stopped:
firewall-cmd --state
systemctl status firewalld