Deploy Gathr Using Docker Containers
- Prerequisites
- 1. Verify Docker and Docker Compose
- 2. System Requirements
- 3. PEM File Requirement for HAProxy
- 4. Ansible Server
- 5. Shared Location
- 6. Port Availability
- 7. User Permissions for Docker
- 8. Local Volume for Gathr Data
- 9. (Optional) OpenAI API Key Requirement
- 10. (Optional) Data Intelligence (DI)
- 11. (Optional) For MLflow
- Steps to Install Gathr using Playbook
- Post Deployment Validation
Deploy Gathr and its essential components, such as Elasticsearch, RabbitMQ, Zookeeper, PostgreSQL, Spark Standalone, and HDFS/YARN, using Docker containers.
Prerequisites
1. Verify Docker and Docker Compose
Ensure that the latest versions of Docker and Docker Compose are installed on the Gathr deployment node.
Verify installation:
docker --version
docker compose version
2. System Requirements
- Operating System: The base OS must be RHEL 9 / OEL 9.
- Note: The Docker image includes Debian Linux 11 as its operating system.
Minimum Hardware Specifications:
| Service | CPU | RAM | Disk | Optional |
|---|---|---|---|---|
| Zookeeper | 2 CPU | 4 GB | 20 GB | No |
| Postgres | 2 CPU | 4 GB | 20 GB | No |
| Gathr | 8 CPU | 32 GB | 100 GB | No |
| HAProxy | 1 CPU | 1 GB | 10 GB | No |
| Elasticsearch | 2 CPU | 4 GB | 20 GB | Yes |
| RabbitMQ | 1 CPU | 2 GB | 10 GB | Yes |
| Spark Standalone | Custom | Custom | 100 GB | Yes |
| HDFS & YARN | Custom | Custom | 100 GB | Yes |
Additional Requirements:
- HAProxy and Gathr must run on separate machines.
- The vm.max_map_count value should be set to 262144 in /etc/sysctl.conf.
- Python3 and pip3 are required on the server.
- OpenJDK 17 should be installed on all nodes.
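As a reference, the vm.max_map_count setting can be applied persistently like this (a minimal sketch; run as root):

```shell
# Persist the setting in /etc/sysctl.conf and apply it immediately.
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p

# Verify the running value (should print: vm.max_map_count = 262144).
sysctl vm.max_map_count
```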
3. PEM File Requirement for HAProxy
- The HAProxy server should be up and running with SSL if you are deploying Klera Analytics. A non-SSL HAProxy will work if Klera Analytics is not used.
- A valid .pem file is required to enable SSL on HAProxy.
4. Ansible Server
- Ansible should be installed on the server from which you will run the playbook.
- SELinux should be disabled on all servers.
- Passwordless SSH must be enabled from the Ansible server to the Gathr and HAProxy machines.
- Note: Passwordless SSH should be configured for the root user on the remote server.
Commands to set up passwordless SSH:
Copy ssh key to remote server:
ssh-copy-id root@remote_host
Test ssh passwordless connection:
ssh root@remote_host
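If the Ansible server does not already have an SSH key pair, generate one before running ssh-copy-id (a sketch; the RSA key type and default path are common conventions, not Gathr requirements):

```shell
# Generate a key pair non-interactively, but only if one does not exist yet.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```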
5. Shared Location
- Create a shared path that is accessible by the Gathr nodes.
- The same path must be mounted on the Spark nodes and the Gathr Analytics machine.
- All deployment nodes should be on the same network.
6. Port Availability
The following ports must be available on the servers where the respective services are deployed.
| Service | Ports (Default) | Optional |
|---|---|---|
| Zookeeper | 2181, 2888, 3888, 8081 | No |
| Postgres | 5432 | No |
| Gathr | 8090, 9595 | No |
| HAProxy | 8090, 9596 | No |
| Elasticsearch | 9200, 9300 | Yes |
| RabbitMQ | 5672, 15672 | Yes |
| Spark Standalone | 7077, 8080, 8081, 6066, 18080 | Yes |
| HDFS & YARN (Non-HA) | 8020, 9870, 9864, 9866, 9867, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes |
| HDFS & YARN (HA) | 8485, 8480, 8020, 50070, 8019, 9866, 9867, 50075, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes |
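Before deployment, you can confirm that a required port is not already in use (a quick check, assuming the ss utility from iproute2 is available; the ports shown are the Gathr and HAProxy defaults from the table above):

```shell
# List listening TCP sockets and look for clashes with the default
# Gathr/HAProxy ports; adjust the pattern for the services you deploy.
ss -ltn | grep -E ':(8090|9595|9596)\b' && echo "Port conflict detected" || echo "Ports appear free"
```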
7. User Permissions for Docker
- Create an application user on the server with password-based authentication.
- Note: The UID and GID for this user must be the same across all machines.
- The docker and docker compose commands must be accessible to this application user.
Ensure the application user is part of the docker group.
Verify the user’s group membership:
id <username>
Confirm Docker commands are working:
docker ps
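If docker ps fails with a permission error, adding the user to the docker group usually resolves it (a sketch; <username> is a placeholder, and a fresh login is needed for the group change to take effect):

```shell
# Add the application user to the docker group (run as root).
usermod -aG docker <username>
# Log out and back in (or run `newgrp docker`) so the new membership applies.
```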
8. Local Volume for Gathr Data
A local directory is required to store Gathr data. Note: This path should be accessible on the Gathr nodes.
mkdir -p /shared/path/gathr-volume
9. (Optional) OpenAI API Key Requirement
- An OpenAI API key is required to use the Gathr IQ feature.
- Internet access is mandatory for this feature to function.
10. (Optional) Data Intelligence (DI)
Shared Storage for Logs and Data (Optional for single-node setups)
- Configure a shared mount (EFS/NFS) on the Docker-installed machine to store and manage DI application logs and data efficiently.
Logstash Configuration
- Install and configure Logstash (version 8.11.0) on the chosen node, ensuring it is seamlessly integrated with the designated Elasticsearch instance.
- Additionally, configure Logstash to process logs from the DI log directory for efficient log management.
- Logstash service should run on the same node where DI application logs are stored.
Access Control for Gathr User
- Make sure the Gathr user has read access to the DI Docker logs.
- If proper access is not available, set up ACL permissions to grant the Gathr user read access to the logs generated by the DI Docker environment.
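One way to grant such read access is with POSIX ACLs (a sketch; the log path and the gathr username are assumptions that depend on your Docker data root and user naming):

```shell
# Grant the gathr user recursive read (and directory-traverse) access to the
# container log directory; the default ACL applies the rule to new files too.
setfacl -R -m u:gathr:rX /var/lib/docker/containers
setfacl -R -d -m u:gathr:rX /var/lib/docker/containers
```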
Load Balancer for Multi-DI Deployments (Optional for single node DI Docker)
- Configure a private load balancer to efficiently handle multi-DI Docker deployments, ensuring it listens to the port specified in the Gathr configuration and routes traffic to the deployed container instances.
AWS CloudWatch Integration (Applicable for AWS Deployments Only)
- Set up and configure AWS CloudWatch along with the CloudWatch Agent to enable seamless log monitoring and analysis in the AWS environment.
11. (Optional) For MLflow
Ensure you have a private Docker registry. The Gathr team will provide the MLflow-related images; load these images into your registry.
A Kubernetes (K8s) cluster should be up and running.
Artifact Storage Configuration:
- MLflow generates artifacts when a model is registered; these can be stored in either S3 or NFS.
- If you use S3 or Ceph for artifact storage, ensure you have the following credentials:
- S3 Access Key
- S3 Secret Key
- S3 Endpoint URL
If using NFS, make sure you have the name of the PVC (Persistent Volume Claim) that the Gathr pod uses, so the path can be shared between the Gathr and MLflow pods.
(Optional) Private Docker Registry Access should be available to pull from the private docker repository.
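Loading the Gathr-provided MLflow images into a private registry typically looks like this (a sketch; the archive name, image tag, and registry host are placeholders, not values supplied by Gathr):

```shell
# Load the image archive provided by the Gathr team, retag it for your
# private registry, and push it there.
docker load -i mlflow-images.tar
docker tag mlflow:latest registry.example.com/mlflow:latest
docker push registry.example.com/mlflow:latest
```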
Steps to Install Gathr using Playbook
1. Download the playbook bundle shared by the Gathr team on the Ansible server.
2. Untar the bundle using the below command:
tar -xzf GathrBundle.tar.gz
3. Go to the playbook path:
cd /path/to/playbook
4. (Optional) If you want to add any host entries inside the Gathr container, create a file named hosts and place it inside the packages folder of the playbook. Example:
vim packages/hosts
Add entries like:
10.0.0.1 gathr-node1
10.0.0.2 gathr-node2
Save the file.
5. (Optional) Copy the haproxy.pem file inside the packages/ folder. This PEM file will be used to enable SSL on HAProxy.
6. Update the properties in the gathr.properties file. Ensure all properties are correctly filled. A sample gathr.properties file is attached, with useful comments for each property to guide you in providing the appropriate values.
7. Once the above file is updated, run config.sh to reload the Ansible variables:
./config.sh gathr.properties
8. Run the playbook. You can run the playbook using one of the following methods:
- To install all components at once, run the following command:
ansible-playbook -i hosts gathr_one.yaml -v
- To install components individually, use the respective commands:
To install PostgreSQL:
ansible-playbook -i hosts postgres.yml -v
To install Zookeeper:
ansible-playbook -i hosts zookeeper.yml -v
To install Gathr:
ansible-playbook -i hosts gathr_saas_ul_HA.yml -v
To install HAProxy:
ansible-playbook -i hosts haproxy.yml -v
To install RabbitMQ:
ansible-playbook -i hosts rabbitmq.yml -v
To install Elasticsearch:
ansible-playbook -i hosts elasticsearch.yml -v
To install Spark Standalone:
ansible-playbook -i hosts spark.yml -v
To install HDFS and YARN:
ansible-playbook -i hosts hadoop.yml -v
Note about Gathr Analytics: An integrated Gathr Analytics installation can be performed by enabling klera.yml inside gathr_one.yml; the Gathr Analytics installation script is then called after the Gathr installation finishes. Provide KLERA_DATABASE_NAME=<klera database name> in the gathr.properties file so that an empty database is created for the Klera applications. Before starting the Gathr Analytics deployment, please follow the Gathr Analytics deployment document.
- To install Klera:
ansible-playbook -i hosts klera.yml -v
Post Deployment Validation
Access the Gathr UI at:
https://<haproxy_hostname>:8090/
After Gathr is up, you will see a license agreement page. Select the "I accept" check box and click the "Accept" button.
The Upload License page will appear. Before uploading the license, ensure the Gathr Analytics service is also up and running.
Upload a valid Gathr license and click "Confirm".
Click "Continue" on the welcome page. The login page will appear.
You can now login with the default superuser credentials:
- Email: super@mailinator.com
- Password: superuser
Gathr is now successfully deployed.
If you have any feedback on Gathr documentation, please email us!