Gathr Installation using Docker Swarm

This document describes the prerequisites and steps to deploy Gathr inside Docker for a Spark Standalone cluster.

Prerequisites

  • Verify Docker and Swarm initialization.

  • Ensure that the latest version of Docker is installed on the Gathr deployment nodes.

Verify installation:

$ docker --version


  • Follow the steps below to initialize the Swarm cluster and add worker nodes.

  • To create a swarm, run the below command:

$ docker swarm init
  • The output of docker swarm init provides the connection command to join new worker nodes to the swarm.

(Optional) If you want to add a worker node later and need to fetch the join command again, run the below command:

$ docker swarm join-token worker
  • To add a worker to this swarm, run the below command:

Example:

$ docker swarm join --token SWMTKN-1-65joev0gpm0ngnfpz9zzcujl92ghfq1arl9bzribqn17ser0ln-2y69tqfo83m8ov36dhfx502vf localhost:2377
  • Verify the list of nodes in the swarm by running the below command from a manager node:
$ docker node ls

(Optional) To remove a worker node from the swarm, run the below command on that node:

$ docker swarm leave
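After a worker leaves, the manager still lists that node with a Down status. If needed, the stale entry can be removed from a manager node; the node name is whatever docker node ls reports:

$ docker node rm <node-name>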

System Requirements

  • The base OS must be RHEL 9 / OEL 9.

  • A minimum of 8 CPU cores and 16 GB RAM is recommended for running the Gathr container.

  • The Docker image uses Debian Linux 11 as its base operating system.

PostgreSQL - Version 14.x

  • A PostgreSQL 14.x server must be running.
  • Admin credentials (username & password) are required.
  • PostgreSQL must be accessible from the Docker host.
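As a quick sanity check, connectivity can be verified with the psql client (assuming it is installed on the Docker host; the host, port, and user below are placeholders matching the sample values later in this guide):

$ psql -h <db-host> -p 5432 -U postgres -c "SELECT version();"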

Zookeeper - Version 3.9.3

  • A Zookeeper 3.9.3 server should be available.
  • Zookeeper services must be accessible from the Docker host.
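A simple reachability check from the Docker host (assuming a netcat variant that supports -z, such as nmap-ncat on RHEL 9; replace the host with your Zookeeper node):

$ nc -zv <zk-host> 2181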

Elasticsearch - Version 8.11.0

  • Elasticsearch 8.11.0 must be running.
  • Elasticsearch services should be accessible from the Docker host.
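Connectivity can be checked with curl. Note that Elasticsearch 8.x enables security (TLS and authentication) by default, so your cluster may require https and credentials; the plain-HTTP form below assumes security is disabled:

$ curl -s http://<es-host>:9200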

RabbitMQ - Version 3.11.16

  • A RabbitMQ 3.11.16 server must be available.
  • RabbitMQ services should be accessible from the Docker host.
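If the RabbitMQ management plugin is enabled, a quick check against its HTTP API is also possible; the demo credentials below are the sample values used later in this guide, not real ones:

$ curl -u demo:demo http://<rabbitmq-host>:15672/api/overview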

Port Availability

  • Ports 8090 and 9595 must be free on the machine where Gathr is to be deployed.
  • Check the port status using:
$ netstat -anp | egrep "8090|9595"
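On RHEL 9, netstat belongs to the optional net-tools package and may not be installed by default; ss provides an equivalent check for listening sockets:

$ ss -tlnp | egrep "8090|9595"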

User Permissions for Docker

Docker commands must be runnable by the application user (a non-root user).

  • Ensure the application user is part of the docker group.

  • Verify the user’s group membership:

$ id <username>

  • Confirm Docker commands are working:
$ docker ps
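If docker ps fails with a permission error, the user is most likely not in the docker group yet. A typical fix (requires root or sudo; the user must log out and back in for the new group membership to take effect):

$ sudo usermod -aG docker <username>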
Spark Standalone Cluster - Version 3.5.2

  • A Spark Standalone cluster (v3.5.2) is required for pipeline execution.

  • This cluster must be accessible from the Docker container node.

Local Volume for Gathr Data

  • A local directory is required to store Gathr data. For an HA deployment, ensure that this directory is accessible on all HA nodes via NFS (see the mount sketch after the command below).

$ mkdir -p /path/to/gathr-volume
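For an HA deployment, a minimal sketch of mounting the shared directory over NFS on each HA node; the server name and export path are placeholders for your own NFS setup:

$ sudo mount -t nfs <nfs-server>:/exports/gathr-volume /path/to/gathr-volume

To make the mount persistent across reboots, add a matching entry to /etc/fstab.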

(Optional) OpenAI API Key Requirement

  • An OpenAI API key is required to use the Gathr IQ feature.

  • Internet access is mandatory for this feature to function.

Steps for Fresh Deployment

  • Download the Gathr Docker tar bundle from the URL shared by the Gathr team and extract it into a directory.
Gathr-Bundle.tar.gz 

(Bundle name may vary)

The tar bundle includes the following files:

  • docker-compose1.yml
  • docker-compose2.yml
  • .env
  • gathr_image.tar

Use the same .env, docker-compose1.yml, and docker-compose2.yml files that are shipped with the image.

  • Load the Docker image:
$ docker load -i gathr_image.tar

  • Verify that the image is loaded:
$ docker images
  • Update the variables in the .env file. The file includes helpful comments to guide you through the properties.
$ vi .env 

Sample .env File:

#---------------------------------------------------------
#               GATHR DEPLOYMENT DETAILS
#---------------------------------------------------------

## ─────────────────────── ZOOKEEPER (Mandatory) ───────────────────────
# Provide External Zookeeper String (Host1:Port,Host2:Port,Host3:Port)
ZK_CONNECTION_STRING=localhost1:2181

## ─────────────────────── POSTGRESQL (Mandatory) ───────────────────────
# PostgreSQL database hostname/IP
DB_HOST=localhost

# PostgreSQL database port
DB_PORT=5432

# Database username (This user should have admin permission)
DB_USER=postgres

# Database password
DB_PASSWORD=postgres

# Gathr database name
DB_NAME=sax_db

## ─────────────────────── ELASTICSEARCH (Optional) ───────────────────────
# Elasticsearch host
ES_HOST=localhost

# REST API port
ES_CONNECT_PORT=9200

# Transport communication port
ES_TRANSPORT_PORT=9300

# Elasticsearch cluster name
ES_CLUSTER_NAME=gathrcluster

## ─────────────────────── RABBITMQ (Optional) ───────────────────────
# RabbitMQ host
RABBITMQ_HOST=localhost

# RabbitMQ connection port
RABBITMQ_PORT=5672

# RabbitMQ management UI port
RABBITMQ_MANAGEMENT_PORT=15672

# RabbitMQ username
RABBITMQ_USERNAME=demo

# RabbitMQ password
RABBITMQ_PASSWORD=demo

## ─────────────────────── KAFKA (Optional) ───────────────────────
# Provide External Kafka Brokers string (host1:port,host2:port,host3:port) and Kafka ZK String (host1:port,host2:port,host3:port)
KAFKA_ZK_STRING=localhost:2181
KAFKA_BROKER_LIST=localhost:9092

## ─────────────────────── SOLR (Optional) ───────────────────────
# Provide External Solr ZK String (host1:port,host2:port,host3:port/solr-chroot)
SOLR_ZK_STRING=localhost:2181/solr

## ─────────────────────── SPARK (Mandatory in case of Standalone) ───────────────────────
# Spark master node(s)
SPARK_MASTER=spark://localhost:7077

# Spark UI host(s)
SPARK_UI_HOST=localhost

# Spark UI monitoring port
SPARK_UI_PORT=8080

# Spark REST API endpoints
SPARK_REST_MASTER=localhost:6066

# Spark installation path (Provide the same path as your external SPARK_HOME)
SPARK_HOME=/opt/spark

# Enable Spark history server (If you want to redirect spark application logs to external History Server)
SPARK_HISTORY_ENABLED=false

# Spark history server URL
SPARK_HISTORY_URL=localhost:18080

# Spark log directory
SPARK_HISTORY_LOG_DIRECTORY=hdfs://localhost:8020/spark-logs

## ─────────────────────── JAVA ───────────────────────
# This path should be same as the JAVA_HOME where Spark is running.
JAVA_HOME=/usr/share/openjdk

## ─────────────────────── GATHR CONFIGS (Mandatory) ───────────────────────
# Gathr service user
GATHR_SERVICE_USER=g.one

# Service user UID
GATHR_SERVICE_UID=2999

# Service group
GATHR_SERVICE_GROUP=devops

# Service group GID
GATHR_SERVICE_GID=2999

# Gathr hostname/IP (Provide HAProxy Host in case of Gathr HA)
GATHR_HOSTNAME=localhost

# Gathr service port (Provide HAProxy port in case of Gathr HA)
GATHR_PORT=8090

# Frontail logs UI port (Provide HAProxy Frontail port in case of Gathr HA)
FRONTAIL_PORT=9595

# CPU allocation for Gathr container
GATHR_CPU=8

# RAM allocation for Gathr container
GATHR_RAM=16g

# Authentication method (DB, LDAP, OKTA) - For a fresh deployment it will always be DB.
GATHR_AUTH_SOURCE=DB

# Backup retention count
BACKUPS_TO_KEEP=1

# Data storage path (Path where Gathr will store the data externally)
GATHR_DATA_VOLUME_PATH=/opt/shared

## ─────────────────────── HADOOP CONFIGS (Mandatory if you want to submit the pipelines on YARN Cluster) ───────────────────────
# When the property below is set to true, you must create a directory named "hadoop-conf" inside the GATHR_DATA_VOLUME_PATH (as mentioned above) and copy all files and folders from your external Hadoop server’s HADOOP_CONF_DIR into that directory.
IS_HADOOP_EXTERNAL=false

# Provide the same path as your external HADOOP_HOME
HADOOP_HOME=/opt/hadoop

# Specify the user under which the Hadoop services are running
HADOOP_USER=hdfs

## ─────────────────────── EXTERNALIZATION FOR GATHR (Optional) ───────────────────────
# Enable external configurations for Gathr
ENABLE_EXTERNALIZATION=false

# Provide s3 bucket name where external configurations are placed.
S3_BUCKET_FOR_EXTERNALIZATION=gathrbucket

# Provide the path in the S3 bucket used for externalization. For example, if your full S3 path is s3://gathr-bucket/others/externalization/ then others/externalization/ is your S3_PATH_FOR_EXTERNALIZATION.
S3_PATH_FOR_EXTERNALIZATION=test/externalization/

# AWS Access Key
AWS_ACCESS_KEY=AWSAccessKey

# AWS Secret Key
AWS_SECRET_KEY=AWSSecretKey

## ─────────────────────── TIMEZONE (Default is UTC) ───────────────────────
# Timezone for Gathr container
TZ=Asia/Kolkata

  • Deploy the Gathr container using the below command:

For the first node

$ export $(cat .env) > /dev/null 2>&1; docker stack deploy -c docker-compose1.yml gathr-ul-merge

Here, gathr-ul-merge is the stack name.

The output of the above command will be as below:

# export $(cat .env) > /dev/null 2>&1; docker stack deploy -c docker-compose1.yml gathr-ul-merge
Creating service gathr-ul-merge_saasul1
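The export $(cat .env) idiom relies on the shell ignoring comment lines, hence the redirection to /dev/null. An equivalent, slightly cleaner way to load the file before deploying, shown here as an alternative rather than the shipped method:

$ set -a; . ./.env; set +a
$ docker stack deploy -c docker-compose1.yml gathr-ul-merge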

To verify the status of all services, run the below command (Swarm mode only):

$ docker service ls

Output:

# docker service ls
ID             NAME                     MODE         REPLICAS   IMAGE                                     PORTS
zr3o1kdjj580   gathr-ul-merge_saasul1   replicated   1/1        172.26.78.4:5000/ul-saas-jdk17:build-73   

To verify the container status on an individual node, run the below command (on node1):

$ docker ps

Output:

# docker ps
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS                                                                                                                                 NAMES
085aa374e147   localhost/ul-saas-jdk17:build-73   "/bin/bash /opt/entr…"   51 minutes ago   Up 51 minutes   0.0.0.0:5090->5090/tcp, :::5090->5090/tcp, 0.0.0.0:5595->5595/tcp, :::5595->5595/tcp, 0.0.0.0:50053->50051/tcp, :::50053->50051/tcp   gathr-ul-merge_saasul1.1.o4j3o27oy39wjyhlkrs1pk8jf

To check container logs:

$ docker logs -f <container_id or container_name>

Access the Gathr web UI at:

http://<node1-IP>:8090/
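You can confirm the UI is responding before opening a browser; an HTTP 200 (or a redirect code such as 302) indicates the service is up:

$ curl -s -o /dev/null -w "%{http_code}\n" http://<node1-IP>:8090/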

For the second node

$ export $(cat .env) > /dev/null 2>&1; docker stack deploy -c docker-compose2.yml gathr-ul-merge

Here, gathr-ul-merge is the stack name. The output of the above command will be as below:

# export $(cat .env) > /dev/null 2>&1; docker stack deploy -c docker-compose2.yml gathr-ul-merge
Creating service gathr-ul-merge_saasul2

To verify the status of all services, run the below command (Swarm mode only):

$ docker service ls

Output:

# docker service ls
ID             NAME                     MODE         REPLICAS   IMAGE                                     PORTS
zr3o1kdjj580   gathr-ul-merge_saasul1   replicated   1/1        172.26.78.4:5000/ul-saas-jdk17:build-73   
kse63fro902w   gathr-ul-merge_saasul2   replicated   1/1        172.26.78.4:5000/ul-saas-jdk17:build-73   

To verify the container status on an individual node, run the below command (on node2):

$ docker ps
# docker ps
CONTAINER ID   IMAGE                                     COMMAND                   CREATED          STATUS          PORTS                                                                                                                                 NAMES
6520aae4be13   172.26.78.4:5000/ul-saas-jdk17:build-73   "/bin/bash /opt/entr…"    52 minutes ago   Up 52 minutes   0.0.0.0:5090->5090/tcp, :::5090->5090/tcp, 0.0.0.0:5595->5595/tcp, :::5595->5595/tcp, 0.0.0.0:50053->50051/tcp, :::50053->50051/tcp   gathr-ul-merge_saasul2.1.kmulgmp5ge4v4jfd92tk4scg2

To check container logs:

$ docker logs -f <container_id or container_name>

Access the Gathr web UI at:

http://<node2-IP>:8090/

Useful Commands for Swarm

  • To list all stacks in the swarm:
docker stack ls

  • To list the services of a particular stack (gathr-ul-merge):
docker stack services gathr-ul-merge

  • To list the tasks related to the services of a particular stack (gathr-ul-merge):
docker stack ps gathr-ul-merge

  • To delete a particular service:
docker service rm <service-name>

  • To force-recreate a Docker container:
docker service update --force <service-name>

  • To remove the entire Gathr stack (all containers):
docker stack rm gathr-ul-merge

  • To check the logs of a container:
docker logs <container-name> -f

  • To log in to a container terminal:
docker exec -it <container-name> /bin/bash
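In Swarm mode you can also stream the logs of a service from any manager node, without first locating the container:

docker service logs -f <service-name>

For example, docker service logs -f gathr-ul-merge_saasul1.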

Steps for Upgrade

Upgrade means deploying a Gathr version higher than the current version.

  • Stop the containers of the stack.
$ docker stack rm gathr-ul-merge

  • Change the image tag in the docker-compose1.yml file.
$ vim docker-compose1.yml

  • In the line below, replace the tag with the one you want to upgrade to:
image: 172.26.78.4:5000/ul-saas-jdk17:build-73

  • Deploy the first container
$ export $(cat .env) > /dev/null 2>&1; docker stack deploy -c docker-compose1.yml gathr-ul-merge

Once the UI is up, repeat the image-tag change and deployment steps for docker-compose2.yml.
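To avoid editing both files by hand, the image tag can also be updated with sed; build-74 below is a hypothetical target tag, so substitute the actual tag shared by the Gathr team:

$ sed -i 's|ul-saas-jdk17:build-73|ul-saas-jdk17:build-74|' docker-compose1.yml docker-compose2.yml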

Steps for Rollback

Rollback means deploying a Gathr version lower than the current version.

The steps are the same as for an upgrade.

  • Stop the containers of the stack.
$ docker stack rm gathr-ul-merge

  • Change the image tag in the docker-compose1.yml file.
$ vim docker-compose1.yml

  • In the line below, replace the tag with the one you want to roll back to:
image: 172.26.78.4:5000/ul-saas-jdk17:build-73

  • Deploy the first container
$ export $(cat .env) > /dev/null 2>&1; docker stack deploy -c docker-compose1.yml gathr-ul-merge

Once the UI is up, repeat the image-tag change and deployment steps for docker-compose2.yml.
