Deploy Gathr Using Docker Containers
In this article
- Prerequisites
- 1. Verify Docker and Docker Compose
- 2. System Requirements
- 3. PEM File Requirement for HAProxy
- 4. Ansible Server
- 5. Shared Location
- 6. Port Availability
- 7. User Permissions for Docker
- 8. Local Volume for Gathr Data
- 9. (Optional) OpenAI API Key Requirement
- 10. (Optional) Data Intelligence (DI)
- 11. (Optional) For MLflow
- Steps to Install Gathr using Playbook
- Post Deployment Validation
Deploy Gathr and its essential components, such as Elasticsearch, RabbitMQ, Zookeeper, PostgreSQL, Spark Standalone, and HDFS/YARN, using Docker containers.
Prerequisites
1. Verify Docker and Docker Compose
Ensure that the latest versions of Docker and Docker Compose are installed on the Gathr deployment node.
Verify installation:
docker --version
docker compose version
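To confirm the Docker daemon is working end to end, and not just that the client binaries are present, a quick smoke test with the small hello-world image can help:
docker run --rm hello-world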
2. System Requirements
- Operating System: The base OS must be RHEL 9 / OEL 9.
- Note: The Docker image includes Debian Linux 11 as its operating system.
Minimum Hardware Specifications:
| Service | CPU | RAM | Disk | Optional | 
|---|---|---|---|---|
| Zookeeper | 2 CPU | 4 GB | 20 GB | No | 
| Postgres | 2 CPU | 4 GB | 20 GB | No | 
| Gathr | 8 CPU | 32 GB | 100 GB | No | 
| HAProxy | 1 CPU | 1 GB | 10 GB | No | 
| Elasticsearch | 2 CPU | 4 GB | 20 GB | Yes | 
| RabbitMQ | 1 CPU | 2 GB | 10 GB | Yes | 
| Spark Standalone | Custom | Custom | 100 GB | Yes | 
| HDFS & YARN | Custom | Custom | 100 GB | Yes | 
Additional Requirements:
- HAProxy and Gathr must run on separate machines.
- The vm.max_map_count value should be set to 262144 in /etc/sysctl.conf.
- Python3 and pip3 are required on the server.
- OpenJDK 17 should be installed on all nodes.
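To satisfy the vm.max_map_count requirement, a typical approach is to set the value for the current session and persist it in /etc/sysctl.conf:
sudo sysctl -w vm.max_map_count=262144
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p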
3. PEM File Requirement for HAProxy
- The HAProxy server should be up and running with SSL if you are deploying Klera Analytics. A non-SSL HAProxy will work if you are not using Klera Analytics.
- A valid .pem file is required to enable SSL on HAProxy.
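HAProxy expects the certificate chain and the private key concatenated into a single .pem file. As a sketch, with placeholder filenames:
cat server.crt intermediate.crt server.key > haproxy.pem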
4. Ansible Server
- Ansible should be installed on the server from which you will run the playbook.
- SELinux should be disabled on all the servers.
- Passwordless SSH must be enabled from the Ansible server to the Gathr and HAProxy machines.
- Note: Passwordless SSH should be configured for the root user on the remote server.
Commands to set up passwordless SSH:
Copy ssh key to remote server:
ssh-copy-id root@remote_host
Test ssh passwordless connection:
ssh root@remote_host
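If the Ansible server does not yet have an SSH key pair, generate one before running ssh-copy-id. SELinux can be checked with getenforce and disabled for the current session with setenforce; making the change permanent requires editing /etc/selinux/config and rebooting. A sketch:
ssh-keygen -t rsa -b 4096
getenforce
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config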
5. Shared Location
- Create a shared path that is accessible by the Gathr nodes.
- The same path must be mounted on the Spark nodes and the Gathr Analytics node.
- All deployment nodes should be on the same network.
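As an illustration, an NFS export can back the shared path; the server name and export path below are placeholders, and the mount should also be added to /etc/fstab on each node so it persists across reboots:
sudo mount -t nfs <nfs-server>:/export/gathr /shared/path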
6. Port Availability
The ports below must be available on the servers where the respective services are deployed.
| Service | Ports (Default) | Optional | 
|---|---|---|
| Zookeeper | 2181, 2888, 3888, 8081 | No | 
| Postgres | 5432 | No | 
| Gathr | 8090, 9595 | No | 
| HAProxy | 8090, 9596 | No | 
| Elasticsearch | 9200, 9300 | Yes | 
| RabbitMQ | 5672, 15672 | Yes | 
| Spark Standalone | 7077, 8080, 8081, 6066, 18080 | Yes | 
| HDFS & YARN (Non-HA) | 8020, 9870, 9864, 9866, 9867, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes | 
| HDFS & YARN (HA) | 8485, 8480, 8020, 50070, 8019, 9866, 9867, 50075, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes | 
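To verify that a port is free before deploying a service, a check along these lines can be used (PostgreSQL's default port 5432 shown as an example); empty output means nothing is listening on the port:
ss -tlnp | grep :5432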
7. User Permissions for Docker
- Create an application user on the server with password-based authentication.
- Note: The UID and GID for this user must be the same across all machines.
- The docker and docker compose commands should be accessible to this application user.
Ensure the application user is part of the docker group.
Verify the user’s group membership:
id <username>
Confirm Docker commands are working:
docker ps
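If the user is not yet in the docker group, it can be added as shown below (<username> is a placeholder; log out and back in for the new group membership to take effect):
sudo usermod -aG docker <username>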
8. Local Volume for Gathr Data
A local directory is required to store Gathr data. Note: This path should be accessible on the Gathr nodes.
mkdir -p /shared/path/gathr-volume
9. (Optional) OpenAI API Key Requirement
- An OpenAI API key is required to use the Gathr IQ feature.
- Internet access is mandatory for this feature to function.
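A quick way to confirm both the key and outbound connectivity is to call OpenAI's public model-listing endpoint; OPENAI_API_KEY below is assumed to hold your key:
curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"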
10. (Optional) Data Intelligence (DI)
- Shared Storage for Logs and Data (optional for single-node setups): Configure a shared mount (EFS/NFS) on the Docker-installed machine to store and manage DI application logs and data efficiently.

- Logstash Configuration: Install and configure Logstash (version 8.11.0) on the chosen node, ensuring it is integrated with the designated Elasticsearch instance. Additionally, configure Logstash to process logs from the DI log directory for efficient log management. The Logstash service should run on the same node where the DI application logs are stored.
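A minimal Logstash pipeline along the lines below can forward DI logs to the designated Elasticsearch instance; the log path, Elasticsearch host, and index name are assumptions to adapt to your environment:
sudo tee /etc/logstash/conf.d/di-logs.conf >/dev/null <<'EOF'
input {
  file { path => "/shared/path/di/logs/*.log" }
}
output {
  elasticsearch {
    hosts => ["http://<elasticsearch_host>:9200"]
    index => "di-logs-%{+YYYY.MM.dd}"
  }
}
EOF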
 
- Access Control for Gathr User: Make sure the Gathr user has access to read the DI Docker logs. If it does not, set up ACL permissions to grant the Gathr user the necessary access to read logs generated by the DI Docker environment.
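If ACLs are needed, a command along these lines grants the Gathr application user read access; the user name gathr and the Docker logs path (the common default) are assumptions:
sudo setfacl -R -m u:gathr:rX /var/lib/docker/containers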
 
- Load Balancer for Multi-DI Deployments (optional for single-node DI Docker): Configure a private load balancer to efficiently handle multi-DI Docker deployments, ensuring it listens on the port specified in the Gathr configuration and routes traffic to the deployed container instances.
 
- AWS CloudWatch Integration (applicable for AWS deployments only): Set up and configure AWS CloudWatch along with the CloudWatch Agent to enable seamless log monitoring and analysis in the AWS environment.
 
11. (Optional) For MLflow
- Ensure you have a private Docker registry. The Gathr team will provide the MLflow-related images; load them into your registry.
- A Kubernetes (K8s) cluster should be up and running.
- Artifact Storage Configuration: MLflow generates artifacts when a model is registered; these can be stored in either S3 or NFS.
- If you use S3 or Ceph for artifact storage, ensure you have the following credentials:
  - S3 Access Key
  - S3 Secret Key
  - S3 Endpoint URL
- If using NFS, make sure you have the name of the PVC (Persistent Volume Claim) that the Gathr pod is using, so the path can be shared between the Gathr and MLflow pods.
- (Optional) Private Docker registry access should be available to pull from the private Docker repository.
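When S3 or Ceph (via its S3 gateway) backs the artifact store, MLflow's S3 client reads the standard environment variables below; the values are placeholders:
export AWS_ACCESS_KEY_ID=<s3-access-key>
export AWS_SECRET_ACCESS_KEY=<s3-secret-key>
export MLFLOW_S3_ENDPOINT_URL=https://<s3-endpoint-url>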
Steps to Install Gathr using Playbook
- Download the playbook bundle shared by the Gathr team to the Ansible server.
- Untar the bundle using the below command:
tar -xzf GathrBundle.tar.gz
- Go to the playbook path:
cd /path/to/playbook
- (Optional) If you want to add any host entries inside the Gathr container, create a file named hosts and place it inside the packages folder of the playbook. Example:
vim packages/hosts
Add entries like:
10.0.0.1 gathr-node1
10.0.0.2 gathr-node2
Save the file.
- (Optional) Copy the haproxy.pem file into the packages/ folder. This PEM file will be used to enable SSL on HAProxy.
- Update the properties in the gathr.properties file. Please ensure all the properties are correctly filled. A sample gathr.properties file is attached, with useful comments for each property to guide you in providing the appropriate values.
- Once the above file is updated, run config.sh to reload the Ansible variables. Use the below command:
./config.sh gathr.properties
- Run the playbook. You can run the playbook using one of the following methods:
- To install all components at once, run the following command to install everything in a single execution:
ansible-playbook -i hosts gathr_one.yaml -v
- To install components individually, use the respective commands:
- To install PostgreSQL:
ansible-playbook -i hosts postgres.yml -v
- To install Zookeeper:
ansible-playbook -i hosts zookeeper.yml -v
- To install Gathr:
ansible-playbook -i hosts gathr_saas_ul_HA.yml -v
- To install HAProxy:
ansible-playbook -i hosts haproxy.yml -v
- To install RabbitMQ:
ansible-playbook -i hosts rabbitmq.yml -v
- To install Elasticsearch:
ansible-playbook -i hosts elasticsearch.yml -v
- To install Spark Standalone:
ansible-playbook -i hosts spark.yml -v
- To install HDFS and YARN:
ansible-playbook -i hosts hadoop.yml -v
 
- Note about Gathr Analytics: A Gathr Analytics integrated installation can be performed by enabling klera.yml inside gathr_one.yml; the Gathr Analytics installation script is then called after the Gathr installation finishes. Provide KLERA_DATABASE_NAME=<klera database name> in the gathr.properties file so that an empty database is created for the Klera applications. Before starting the Gathr Analytics deployment, please follow the Gathr Analytics deployment document. To install Klera:
ansible-playbook -i hosts klera.yml -v
 
Post Deployment Validation
- Access the Gathr UI at: https://<haproxy_hostname>:8090/
- After Gathr is up, you will see a license agreement page. Select the “I accept” check box and click the “Accept” button.
- An Upload License page will appear. Before uploading the license, ensure the Gathr Analytics service is also up and running. Upload the valid Gathr license and click “Confirm”.
- Click “Continue” on the welcome page. The Login Page will appear. You can now log in with the default superuser credentials:
- Email: super@mailinator.com
- Password: superuser
 
- Gathr is now successfully deployed. 
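To confirm from the command line that the HAProxy endpoint is serving the UI, a quick check can be run from any machine on the network (-k skips certificate verification, which is useful with self-signed certificates):
curl -k -I https://<haproxy_hostname>:8090/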
If you have any feedback on Gathr documentation, please email us!