Deploy Gathr Using Docker Containers

Deploy Gathr and its essential components, such as Elasticsearch, RabbitMQ, Zookeeper, PostgreSQL, Spark Standalone, and HDFS/YARN, using Docker containers.

Prerequisites

1. Verify Docker and Docker Compose

Ensure that the latest versions of docker and docker compose are installed on the Gathr deployment node.

Verify installation:

docker --version
docker compose version

2. System Requirements

  • Operating System: The base OS must be RHEL 9 / OEL 9.
  • Note: The Docker image includes Debian Linux 11 as its operating system.

Minimum Hardware Specifications:

| Service          | CPU    | RAM    | Disk   | Optional |
|------------------|--------|--------|--------|----------|
| Zookeeper        | 2 CPU  | 4 GB   | 20 GB  | No       |
| Postgres         | 2 CPU  | 4 GB   | 20 GB  | No       |
| Gathr            | 8 CPU  | 32 GB  | 100 GB | No       |
| HAProxy          | 1 CPU  | 1 GB   | 10 GB  | No       |
| Elasticsearch    | 2 CPU  | 4 GB   | 20 GB  | Yes      |
| RabbitMQ         | 1 CPU  | 2 GB   | 10 GB  | Yes      |
| Spark Standalone | Custom | Custom | 100 GB | Yes      |
| HDFS & YARN      | Custom | Custom | 100 GB | Yes      |

Additional Requirements:

  • HAProxy and Gathr must run on separate machines.
  • Set vm.max_map_count to 262144 in /etc/sysctl.conf.
  • Python3 and pip3 are required on the server.
  • OpenJDK8 / OpenJDK17 should be installed on all nodes.
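The vm.max_map_count requirement above can be applied persistently as follows (a sketch; requires root):

```shell
# Persist the kernel setting in /etc/sysctl.conf
echo "vm.max_map_count=262144" >> /etc/sysctl.conf

# Apply it immediately without a reboot
sysctl -p

# Verify the active value
sysctl vm.max_map_count
```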

3. PEM File Requirement for HAProxy

  • The HAProxy server should be up and running with SSL when deploying Klera Analytics; a non-SSL HAProxy works if Klera Analytics is not used.
  • A valid .pem file is required to enable SSL on HAProxy.

4. Ansible Server

  • Ansible should be installed on the server from where you will run the playbook.
  • SELinux should be disabled on all the servers.
  • Passwordless SSH must be enabled from the Ansible server to the Gathr and HAProxy machines.
  • Note: Passwordless SSH should be configured for the root user on the remote servers.

Commands to set up passwordless SSH:

Copy ssh key to remote server:

ssh-copy-id root@remote_host

Test ssh passwordless connection:

ssh root@remote_host
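If the Ansible server does not yet have an SSH key pair, generate one before running ssh-copy-id (a sketch; the ed25519 key type and default key path are assumptions):

```shell
# Generate a key pair with no passphrase at the default location
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

# Copy the public key to each remote node (Gathr, HAProxy)
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@remote_host
```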

5. Shared Location

  • Create a shared path that is accessible to the Gathr nodes.
  • The same path must be mounted on the Spark nodes and Gathr Analytics.
  • All deployment nodes should be on the same network.
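One common way to provide such a shared path is an NFS mount on every node (a sketch; the server name and export path are placeholders):

```shell
# Install the NFS client (RHEL/OEL 9)
dnf install -y nfs-utils

# Mount the shared export at the same path on every node
mkdir -p /shared/path
mount -t nfs nfs-server.example.com:/export/gathr /shared/path

# Persist the mount across reboots
echo "nfs-server.example.com:/export/gathr /shared/path nfs defaults 0 0" >> /etc/fstab
```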

6. Port Availability

The following ports should be available on the servers where the respective services are deployed.

| Service | Ports (Default) | Optional |
|---------|-----------------|----------|
| Zookeeper | 2181, 2888, 3888, 8081 | No |
| Postgres | 5432 | No |
| Gathr | 8090, 9595 | No |
| HAProxy | 8090, 9596 | No |
| Elasticsearch | 9200, 9300 | Yes |
| RabbitMQ | 5672, 15672 | Yes |
| Spark Standalone | 7077, 8080, 8081, 6066, 18080 | Yes |
| HDFS & YARN (Non-HA) | 8020, 9870, 9864, 9866, 9867, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes |
| HDFS & YARN (HA) | 8485, 8480, 8020, 50070, 8019, 9866, 9867, 50075, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes |
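Port availability can be checked before deployment with a small loop like the one below (a sketch; the bash-specific /dev/tcp probe is one approach, and the port list should be adjusted to the services planned for each host):

```shell
# Report whether each required port is already in use on this host.
# A successful TCP connect means something is listening on the port.
for port in 2181 5432 8090 9595; do
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    echo "port $port is in use"
  else
    echo "port $port appears free"
  fi
done
```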

7. User Permissions for Docker

  • Create an application user on the server with password-based authentication.
  • Note: The UID and GID for this user must be the same across all machines.
  • The docker and docker compose commands should be accessible by this application user.

Ensure the application user is part of the docker group.

Verify the user’s group membership:

id <username>

Confirm Docker commands are working:

docker ps
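If `docker ps` fails with a permission error, the application user can be added to the docker group (requires root; `gathruser` is a placeholder name):

```shell
# Add the user to the docker group so it can talk to the Docker daemon
usermod -aG docker gathruser

# The user must log out and back in (or run `newgrp docker`)
# for the new group membership to take effect
```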

8. Local Volume for Gathr Data

A local directory is required to store Gathr data. Note: This path should be accessible on the Gathr nodes.

mkdir -p /shared/path/gathr-volume

9. (Optional) OpenAI API Key Requirement

  • An OpenAI API key is required to use the Gathr IQ feature.
  • Internet access is mandatory for this feature to function.

10. (Optional) Data Intelligence (DI)

  • Shared Storage for Logs and Data (Optional for single-node setups)

    • Configure a shared mount (EFS/NFS) on the Docker-installed machine to store and manage DI application logs and data efficiently.
  • Logstash Configuration

    • Install and configure Logstash (version 6.8.23) on the chosen node, ensuring it is seamlessly integrated with the designated Elasticsearch instance.
    • Additionally, configure Logstash to process logs from the DI log directory for efficient log management.
    • Logstash service should run on the same node where DI application logs are stored.
  • Access Control for Gathr User

    • Make sure the Gathr user has access to read DI Docker logs.
    • If the required access is missing, set up ACL permissions to grant the Gathr user read access to the logs generated by the DI Docker environment.
  • Load Balancer for Multi-DI Deployments (Optional for single node DI Docker)

    • Configure a private load balancer to efficiently handle multi-DI Docker deployments, ensuring it listens to the port specified in the Gathr configuration and routes traffic to the deployed container instances.
  • AWS CloudWatch Integration (Applicable for AWS Deployments Only)

    • Set up and configure AWS CloudWatch along with the CloudWatch Agent to enable seamless log monitoring and analysis in the AWS environment.
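The ACL permissions for DI log access mentioned above can be granted with `setfacl` (a sketch; `gathr` as the user name is a placeholder, and the path shown is Docker's default container log location):

```shell
# Grant the gathr user read and traverse access to Docker's
# container log directory, recursively
setfacl -R -m u:gathr:rX /var/lib/docker/containers

# Set the same rule as a default ACL so future log files inherit it
setfacl -R -d -m u:gathr:rX /var/lib/docker/containers

# Verify the resulting ACL entries
getfacl /var/lib/docker/containers
```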

11. (Optional) For MLflow

  • Ensure you have a private Docker registry. The Gathr team will provide the MLflow-related images; load those images into your registry.

  • K8s Cluster should be up and running.

  • Artifact Storage Configuration:

    • MLflow generates artifacts when a model is registered, which can be stored in either S3 or NFS.
    • If you use S3 or Ceph for artifact storage, ensure you have the following credentials:
      • S3 Access Key
      • S3 Secret Key
      • S3 Endpoint URL
  • If using NFS, make sure you have the name of the PVC (Persistent Volume Claim) that the Gathr pod uses to share the path between the Gathr and MLflow pods.

  • (Optional) Private Docker Registry Access should be available to pull from the private docker repository.
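Loading the provided MLflow images into a private registry can look like the following (a sketch; the archive name, registry host, and tag are placeholders):

```shell
# Load the image archive supplied by the Gathr team
docker load -i mlflow-images.tar

# Re-tag the loaded image for the private registry
docker tag mlflow:latest registry.example.com/gathr/mlflow:latest

# Push it so the K8s cluster can pull it
docker push registry.example.com/gathr/mlflow:latest
```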

Steps to Install Gathr using Playbook

  1. Download the playbook bundle shared by the Gathr team to the Ansible server.

  2. Untar the bundle using the command below:

    tar -xzf GathrBundle.tar.gz
    
  3. Go to the Playbook Path.

    cd /path/to/playbook
    
  4. (Optional) If you want to add host entries inside the Gathr container, create a file named hosts and place it inside the packages folder of the playbook.

    Example:

    vim packages/hosts
    

    Add entries like:

    10.0.0.1 gathr-node1
    10.0.0.2 gathr-node2
    

    Save the file.

  5. (Optional) Copy the haproxy.pem file into the packages/ folder. This .pem file will be used to enable SSL on HAProxy.

  6. Update the properties in the gathr.properties file. Please ensure all the properties are correctly filled.

    We have attached a sample gathr.properties file with useful comments for each property to guide you in providing the appropriate values.

  7. Once the above file is updated, run config.sh to reload the Ansible variables, using the command below:

    ./config.sh gathr.properties
    
  8. Run the playbook

    You can run the playbook using one of the following methods:

    • To install all components at once: Run the following command to install all components in a single execution:

      ansible-playbook -i hosts gathr_one.yaml -v
      
    • To install components individually: If you prefer to install components one by one, use the respective commands:

      • To install PostgreSQL:

        ansible-playbook -i hosts postgres.yml -v
        
      • To install Zookeeper:

        ansible-playbook -i hosts zookeeper.yml -v
        
      • To install Gathr:

        ansible-playbook -i hosts gathr_saas_ul_HA.yml -v
        
      • To install HAProxy:

        ansible-playbook -i hosts haproxy.yml -v
        
      • To install RabbitMQ:

        ansible-playbook -i hosts rabbitmq.yml -v
        
      • To install Elasticsearch:

        ansible-playbook -i hosts elasticsearch.yml -v
        
      • To install Spark Standalone:

        ansible-playbook -i hosts spark.yml -v
        
      • To install HDFS and YARN:

        ansible-playbook -i hosts hadoop.yml -v
        

    Note about Gathr Analytics: An integrated Gathr Analytics installation can be performed by enabling klera.yml inside gathr_one.yaml; the Gathr Analytics installation script is then called after the Gathr installation finishes.

    Provide KLERA_DATABASE_NAME=<klera database name> in the gathr.properties file so that an empty database is created for the Klera applications.

    Before starting the Gathr Analytics deployment, please follow the Gathr Analytics deployment document.

    • To install Klera:
      ansible-playbook -i hosts klera.yml -v
      

Post Deployment Validation

  1. Access the Gathr UI at: https://<haproxy_hostname>:8090/

    After Gathr is up, a license agreement page appears. Select the “I accept” check box and click the “Accept” button.

  2. An Upload License page will appear. Before uploading the license, ensure the Gathr Analytics service is also up and running.

    Upload the valid Gathr license and click on “confirm”.

  3. Click “Continue” on the welcome page. The login page will appear.

    You can now login with the default superuser credentials:

    • Email: super@mailinator.com
    • Password: superuser
  4. Gathr is now successfully deployed.
