Deploy Gathr Using Docker Containers
- Prerequisites
- 1. Verify Docker and Docker Compose
- 2. System Requirements
- 3. PEM File Requirement for HAProxy
- 4. Ansible Server
- 5. Shared Location
- 6. Port Availability
- 7. User Permissions for Docker
- 8. Local Volume for Gathr Data
- 9. (Optional) OpenAI API Key Requirement
- 10. (Optional) Data Intelligence (DI)
- 11. (Optional) For MLflow
- Steps to Install Gathr using Playbook
- Post Deployment Validation
Deploy Gathr and its essential components, such as Elasticsearch, RabbitMQ, Zookeeper, PostgreSQL, Spark Standalone, and HDFS/YARN, using Docker containers.
Prerequisites
1. Verify Docker and Docker Compose
Ensure that the latest versions of Docker and Docker Compose are installed on the Gathr deployment node.
Verify installation:
docker --version
docker compose version
2. System Requirements
- Operating System: The base OS must be RHEL 9 / OEL 9.
- Note: The Docker image includes Debian Linux 11 as its operating system.
Minimum Hardware Specifications:
| Service | CPU | RAM | Disk | Optional |
|---|---|---|---|---|
| Zookeeper | 2 CPU | 4 GB | 20 GB | No |
| Postgres | 2 CPU | 4 GB | 20 GB | No |
| Gathr | 8 CPU | 32 GB | 100 GB | No |
| HAProxy | 1 CPU | 1 GB | 10 GB | No |
| Elasticsearch | 2 CPU | 4 GB | 20 GB | Yes |
| RabbitMQ | 1 CPU | 2 GB | 10 GB | Yes |
| Spark Standalone | Custom | Custom | 100 GB | Yes |
| HDFS & YARN | Custom | Custom | 100 GB | Yes |
Additional Requirements:
- HAProxy and Gathr must run on separate machines.
- The vm.max_map_count value should be set to 262144 in /etc/sysctl.conf.
- Python3 and pip3 are required on the server.
- OpenJDK 17 should be installed on all nodes.
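As a reference, the vm.max_map_count setting can be applied persistently like this (a minimal sketch; run as root):

```shell
# Persist the setting in /etc/sysctl.conf and apply it immediately.
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p

# Verify the running value (should print: vm.max_map_count = 262144).
sysctl vm.max_map_count
```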
3. PEM File Requirement for HAProxy
- The HAProxy server should be up and running with SSL if you are deploying Klera Analytics. A non-SSL HAProxy will work if Klera Analytics is not used.
- A valid .pem file is required to enable SSL on HAProxy.
4. Ansible Server
- Ansible should be installed on the server from which you will run the playbook.
- SELinux should be disabled on all servers.
- Passwordless SSH must be enabled from the Ansible server to the Gathr and HAProxy machines.
- Note: Passwordless SSH should be configured for the root user on the remote server.
Commands to set up passwordless SSH:
Copy ssh key to remote server:
ssh-copy-id root@remote_host
Test ssh passwordless connection:
ssh root@remote_host
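If the Ansible server does not already have an SSH key pair, generate one before running ssh-copy-id (a sketch; the RSA key type and default path are common conventions, not Gathr requirements):

```shell
# Generate a key pair non-interactively, but only if one does not exist yet.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```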
5. Shared Location
- Create a shared path that is accessible by the Gathr nodes.
- The same path must be mounted on the Spark nodes and the Gathr Analytics machine.
- All deployment nodes should be on the same network.
6. Port Availability
The following ports must be available on the servers where the respective services are deployed.
| Service | Ports (Default) | Optional |
|---|---|---|
| Zookeeper | 2181, 2888, 3888, 8081 | No |
| Postgres | 5432 | No |
| Gathr | 8090, 9595 | No |
| HAProxy | 8090, 9596 | No |
| Elasticsearch | 9200, 9300 | Yes |
| RabbitMQ | 5672, 15672 | Yes |
| Spark Standalone | 7077, 8080, 8081, 6066, 18080 | Yes |
| HDFS & YARN (Non-HA) | 8020, 9870, 9864, 9866, 9867, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes |
| HDFS & YARN (HA) | 8485, 8480, 8020, 50070, 8019, 9866, 9867, 50075, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes |
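Before deployment, you can confirm that a required port is not already in use (a quick check, assuming the ss utility from iproute2 is available; the ports shown are the Gathr and HAProxy defaults from the table above):

```shell
# List listening TCP sockets and look for clashes with the default
# Gathr/HAProxy ports; adjust the pattern for the services you deploy.
ss -ltn | grep -E ':(8090|9595|9596)\b' && echo "Port conflict detected" || echo "Ports appear free"
```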
7. User Permissions for Docker
- Create an application user on the server with password-based authentication.
- Note: The UID and GID for this user must be the same across all machines.
- The docker and docker compose commands must be accessible to this application user.
Ensure the application user is part of the docker group.
Verify the user’s group membership:
id <username>
Confirm Docker commands are working:
docker ps
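If docker ps fails with a permission error, adding the user to the docker group usually resolves it (a sketch; <username> is a placeholder, and a fresh login is needed for the group change to take effect):

```shell
# Add the application user to the docker group (run as root).
usermod -aG docker <username>
# Log out and back in (or run `newgrp docker`) so the new membership applies.
```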
8. Local Volume for Gathr Data
A local directory is required to store Gathr data. Note: This path should be accessible on the Gathr nodes.
mkdir -p /shared/path/gathr-volume
9. (Optional) OpenAI API Key Requirement
- An OpenAI API key is required to use the Gathr IQ feature.
- Internet access is mandatory for this feature to function.
10. (Optional) Data Intelligence (DI)
Shared Storage for Logs and Data (Optional for single-node setups)
- Configure a shared mount (EFS/NFS) on the Docker-installed machine to store and manage DI application logs and data efficiently.
Logstash Configuration
- Install and configure Logstash (version 8.11.0) on the chosen node, ensuring it is seamlessly integrated with the designated Elasticsearch instance.
- Additionally, configure Logstash to process logs from the DI log directory for efficient log management.
- Logstash service should run on the same node where DI application logs are stored.
Access Control for Gathr User
- Make sure the Gathr user has read access to the DI Docker logs.
- If proper access is not available, set up ACL permissions to grant the Gathr user read access to the logs generated by the DI Docker environment.
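One way to grant such read access is with POSIX ACLs (a sketch; the log path and the gathr username are assumptions that depend on your Docker data root and user naming):

```shell
# Grant the gathr user recursive read (and directory-traverse) access to the
# container log directory; the default ACL applies the rule to new files too.
setfacl -R -m u:gathr:rX /var/lib/docker/containers
setfacl -R -d -m u:gathr:rX /var/lib/docker/containers
```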
Load Balancer for Multi-DI Deployments (Optional for single node DI Docker)
- Configure a private load balancer to efficiently handle multi-DI Docker deployments, ensuring it listens to the port specified in the Gathr configuration and routes traffic to the deployed container instances.
AWS CloudWatch Integration (Applicable for AWS Deployments Only)
- Set up and configure AWS CloudWatch along with the CloudWatch Agent to enable seamless log monitoring and analysis in the AWS environment.
11. (Optional) For MLflow
Ensure you have a private Docker registry. The Gathr team will provide the MLflow-related images; load these images into your registry.
A Kubernetes (K8s) cluster should be up and running.
Artifact Storage Configuration:
- MLflow generates artifacts when a model is registered; these can be stored in either S3 or NFS.
- If you use S3 or Ceph for artifact storage, ensure you have the following credentials:
- S3 Access Key
- S3 Secret Key
- S3 Endpoint URL
If using NFS, make sure you have the name of the PVC (Persistent Volume Claim) that the Gathr pod uses, so the path can be shared between the Gathr and MLflow pods.
(Optional) Private Docker Registry Access should be available to pull from the private docker repository.
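Loading the Gathr-provided MLflow images into a private registry typically looks like this (a sketch; the archive name, image tag, and registry host are placeholders, not values supplied by Gathr):

```shell
# Load the image archive provided by the Gathr team, retag it for your
# private registry, and push it there.
docker load -i mlflow-images.tar
docker tag mlflow:latest registry.example.com/mlflow:latest
docker push registry.example.com/mlflow:latest
```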
Steps to Install Gathr using Playbook
1. Download the playbook bundle shared by the Gathr team on the Ansible server.
2. Untar the bundle using the below command:
tar -xzf GathrBundle.tar.gz
3. Go to the playbook path:
cd /path/to/playbook
4. (Optional) If you want to add any host entries inside the Gathr container, create a file named hosts and place it inside the packages folder of the playbook. Example:
vim packages/hosts
Add entries like:
10.0.0.1 gathr-node1
10.0.0.2 gathr-node2
Save the file.
5. (Optional) Copy the haproxy.pem file inside the packages/ folder. This PEM file will be used to enable SSL on HAProxy.
6. Update the properties in the gathr.properties file. Ensure all properties are correctly filled. A sample gathr.properties file is attached, with useful comments for each property to guide you in providing the appropriate values.
7. Once the above file is updated, run config.sh to reload the Ansible variables:
./config.sh gathr.properties
8. Run the playbook. You can run the playbook using one of the following methods:
- To install all components at once, run the following command:
ansible-playbook -i hosts gathr_one.yaml -v
- To install components individually, use the respective commands:
To install PostgreSQL:
ansible-playbook -i hosts postgres.yml -v
To install Zookeeper:
ansible-playbook -i hosts zookeeper.yml -v
To install Gathr:
ansible-playbook -i hosts gathr_saas_ul_HA.yml -v
To install HAProxy:
ansible-playbook -i hosts haproxy.yml -v
To install RabbitMQ:
ansible-playbook -i hosts rabbitmq.yml -v
To install Elasticsearch:
ansible-playbook -i hosts elasticsearch.yml -v
To install Spark Standalone:
ansible-playbook -i hosts spark.yml -v
To install HDFS and YARN:
ansible-playbook -i hosts hadoop.yml -v
Note about Gathr Analytics: An integrated Gathr Analytics installation can be performed by enabling klera.yml inside gathr_one.yml; the Gathr Analytics installation script is then called after the Gathr installation finishes. Provide KLERA_DATABASE_NAME=<klera database name> in the gathr.properties file so that an empty database is created for the Klera applications. Before starting the Gathr Analytics deployment, please follow the Gathr Analytics deployment document.
- To install Klera:
ansible-playbook -i hosts klera.yml -v
Post Deployment Validation
Access the Gathr UI at:
https://<haproxy_hostname>:8090/
After Gathr is up, you will see a license agreement page. Select the "I accept" check box and click the "Accept" button.
The Upload License page will appear. Before uploading the license, ensure the Gathr Analytics service is also up and running.
Upload a valid Gathr license and click "Confirm".
Click "Continue" on the welcome page. The login page will appear.
You can now login with the default superuser credentials:
- Email: super@mailinator.com
- Password: superuser
Gathr is now successfully deployed.
If you have any feedback on Gathr documentation, please email us!