Deploy Gathr on AWS Servers

Deploy Gathr and its essential components, such as Elasticsearch, RabbitMQ, Zookeeper, PostgreSQL, Spark Standalone, and HDFS/YARN, on AWS servers.

Prerequisites

Ensure that the following prerequisites are met before you set up and deploy these components.

1. AWS Console Access


2. Verify Docker and Docker Compose

Ensure that the latest versions of Docker and Docker Compose are installed on the Gathr deployment node.

Verify installation:

docker --version

docker compose version


3. System Requirements

  • Operating System: The base OS must be RHEL 9 / OEL 9 / Amazon Linux 2023.
  • Note: The Docker image includes Debian Linux 11 as its operating system.

Minimum Hardware Specifications:

Service          | CPU    | RAM    | Disk   | Optional
Zookeeper        | 2 CPU  | 4 GB   | 20 GB  | No
Postgres         | 2 CPU  | 4 GB   | 20 GB  | No
Gathr            | 8 CPU  | 32 GB  | 100 GB | No
HAProxy          | 1 CPU  | 1 GB   | 10 GB  | Yes
Elasticsearch    | 2 CPU  | 4 GB   | 20 GB  | Yes
RabbitMQ         | 1 CPU  | 2 GB   | 10 GB  | Yes
Spark Standalone | Custom | Custom | 100 GB | Yes
HDFS & YARN      | Custom | Custom | 100 GB | Yes

Additional Requirements:

  • HAProxy and Gathr must run on separate machines.
  • The vm.max_map_count value must be set to 262144 in /etc/sysctl.conf.
  • Python3 and pip3 are required on the server.
  • OpenJDK8 / OpenJDK17 should be installed on all nodes.
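The vm.max_map_count requirement above can be applied with a small helper such as the one below. This is a minimal sketch, not part of the Gathr bundle: the function name ensure_sysctl_setting is hypothetical, and it assumes GNU sed (as shipped on RHEL 9, OEL 9, and Amazon Linux 2023) and a root shell.

```shell
# Hypothetical helper: set key=value in a sysctl-style config file,
# replacing an existing entry or appending a new one (safe to re-run).
ensure_sysctl_setting() {
  key="$1"; value="$2"; conf="$3"
  if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf" 2>/dev/null; then
    # Entry exists: rewrite it in place (GNU sed syntax).
    sed -i -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$conf"
  else
    # No entry yet: append one.
    printf '%s=%s\n' "$key" "$value" >> "$conf"
  fi
}

# Example usage (as root):
#   ensure_sysctl_setting vm.max_map_count 262144 /etc/sysctl.conf
#   sysctl -p                  # reload settings from /etc/sysctl.conf
#   sysctl vm.max_map_count    # print the active value
```

Re-running the helper does not create duplicate entries, which keeps /etc/sysctl.conf clean if the playbook host is prepared more than once.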

4. PEM File Requirement for HAProxy

  • A valid haproxy.pem file is required to start HAProxy with SSL.
  • Note: HAProxy can be skipped if you use an AWS load balancer to balance traffic between the Gathr nodes.

5. Ansible Server

  • Ansible should be installed on the server from which you will run the playbook.
  • SELinux should be disabled on all the servers.
  • Passwordless SSH must be enabled from the Ansible server to the Gathr and HAProxy machines.
  • Note: Passwordless SSH should be configured for the root user on the remote server.

Commands to set up passwordless SSH:

Copy the SSH key to the remote server:

ssh-copy-id root@remote_host

Test the passwordless SSH connection:

ssh root@remote_host
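When several machines are involved, the same test can be run non-interactively. The sketch below is illustrative only: the function name check_passwordless_ssh and the host names are hypothetical. It assumes each host key is already in known_hosts (which is the case after a successful ssh-copy-id), because BatchMode also refuses interactive host-key prompts.

```shell
# Hypothetical helper: report whether key-based SSH works for root on each host.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
  failed=0
  for host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" true 2>/dev/null; then
      echo "OK   ${host}"
    else
      echo "FAIL ${host}"
      failed=1
    fi
  done
  return "$failed"
}

# Example usage (host names are placeholders):
#   ssh-keygen -t ed25519     # only if the Ansible server has no key pair yet
#   check_passwordless_ssh gathr-node1 gathr-node2
```

The function returns a non-zero status if any host fails, so it can gate the playbook run in a wrapper script.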

6. Shared Location

  • Create a shared path (using EFS) that is accessible from all Gathr nodes.
  • Mount the same path on the Spark nodes and on Gathr Analytics.
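One way to mount the EFS file system at the same path on every node is an /etc/fstab entry. The sketch below only builds that entry; the function name efs_fstab_line, the file system ID, the region, and the mount point are all placeholders, and the mount options are the NFS options AWS documents for EFS. Treat it as a starting point and adapt it to your environment.

```shell
# Hypothetical helper: print an /etc/fstab line that mounts an EFS file system
# over NFSv4.1 using the mount options AWS documents for EFS.
efs_fstab_line() {
  fs_id="$1"; region="$2"; mount_point="$3"
  printf '%s.efs.%s.amazonaws.com:/ %s nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0\n' \
    "$fs_id" "$region" "$mount_point"
}

# Example usage (as root; all values are placeholders):
#   mkdir -p /shared/path
#   efs_fstab_line fs-12345678 us-east-1 /shared/path >> /etc/fstab
#   mount /shared/path
```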

7. Port Availability

The following ports must be available on the servers where the respective services are deployed.

Service | Ports (Default) | Optional
Zookeeper | 2181, 2888, 3888, 8081 | No
Postgres | 5432 | No
Gathr | 8090, 9595 | No
HAProxy | 8090, 9596 | Yes
Elasticsearch | 9200, 9300 | Yes
RabbitMQ | 5672, 15672 | Yes
Spark Standalone | 7077, 8080, 8081, 6066, 18080 | Yes
HDFS & YARN (Non-HA) | 8020, 9870, 9864, 9866, 9867, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes
HDFS & YARN (HA) | 8485, 8480, 8020, 50070, 8019, 9866, 9867, 50075, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes
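To check whether any of the listed ports is already taken on a server, the output of ss can be filtered as sketched below. The function name ports_in_conflict is hypothetical; it assumes the ss utility from iproute2 with the -H (no header) option, which is available on the supported operating systems.

```shell
# Hypothetical helper: read `ss -ltnH` output on stdin and print every
# requested port that already has a TCP listener (one port per line).
ports_in_conflict() {
  awk -v wanted="$*" '
    BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
    {
      # Field 4 is "address:port"; the port is the text after the last colon.
      m = split($4, a, ":"); port = a[m]
      if (port in want) busy[port] = 1
    }
    END { for (p in busy) print p }
  ' | sort -n
}

# Example usage on a Gathr node (no output means both ports are free):
#   ss -ltnH | ports_in_conflict 8090 9595
```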

8. User Permissions for Docker

  • Create an application user on the server with password-based authentication.

    Note: The UID and GID for this user must be the same across all machines.

  • The docker and docker compose commands should be accessible by this application user.

Ensure the application user is part of the docker group.

Verify the user’s group membership:

id <username>

Confirm Docker commands are working:

docker ps
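The group check above can also be scripted. The following is a sketch under assumptions: the function name user_in_group and the user name gathr are hypothetical, and the usermod line is the standard way to add a user to a supplementary group on Linux, not a command taken from the Gathr bundle.

```shell
# Hypothetical helper: succeed if the given user belongs to the given group.
user_in_group() {
  # id -nG prints the user's group names separated by spaces.
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}

# Example usage (the user name is a placeholder):
#   user_in_group gathr docker || sudo usermod -aG docker gathr
#   # The user must log out and back in for the new group to take effect.
```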

9. Local Volume for Gathr Data

A local directory is required to store Gathr data. Note: This path should be accessible on the Gathr nodes.

mkdir -p /shared/path/gathr-volume

10. (Optional) OpenAI API Key Requirement

  • An OpenAI API key is required to use the Gathr IQ feature.
  • Internet access is mandatory for this feature to function.

11. (Optional) Data Intelligence (DI)

  • Shared Storage for Logs and Data (Optional for single-node setups)

    • Configure a shared mount (EFS/NFS) on the Docker-installed machine to store and manage DI application logs and data efficiently.
  • Logstash Configuration

    • Install and configure Logstash (version 6.8.23) on the chosen node, ensuring it is seamlessly integrated with the designated Elasticsearch instance.
    • Additionally, configure Logstash to process logs from the DI log directory for efficient log management.
    • Logstash service should run on the same node where DI application logs are stored.
  • Access Control for Gathr User

    • Make sure the Gathr user can read the DI Docker logs.
    • If it cannot, set up ACL permissions that grant the Gathr user read access to the logs generated by the DI Docker environment.
  • Load Balancer for Multi-DI Deployments (Optional for single node DI Docker)

    • Configure a private load balancer to efficiently handle multi-DI Docker deployments, ensuring it listens to the port specified in the Gathr configuration and routes traffic to the deployed container instances.
  • AWS CloudWatch Integration (Applicable for AWS Deployments Only)

    • Set up and configure AWS CloudWatch along with the CloudWatch Agent to enable seamless log monitoring and analysis in the AWS environment.
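For the access-control item above, read access can be granted with POSIX ACLs. This is a sketch only: the function name grant_log_read_access, the user name, and the log directory are hypothetical, and it assumes the setfacl tool (acl package) and a file system mounted with ACL support.

```shell
# Hypothetical helper: give a user read access to a log directory tree.
# rX = read, plus traverse on directories (and on files already executable).
grant_log_read_access() {
  user="$1"; log_dir="$2"
  # Apply to everything that exists now...
  setfacl -R -m "u:${user}:rX" "$log_dir" &&
  # ...and set a default ACL so newly created log files inherit it.
  setfacl -R -d -m "u:${user}:rX" "$log_dir"
}

# Example usage (as root; user and path are placeholders):
#   grant_log_read_access gathr /path/to/di/logs
#   getfacl /path/to/di/logs     # inspect the resulting ACL
```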

12. (Optional) For MLflow

  • Ensure you have a private Docker registry. The Gathr team will provide the MLflow-related images; load those images into your registry.

  • A Kubernetes (K8s) cluster should be up and running.

  • Artifact Storage Configuration:

    • MLflow generates artifacts when a model is registered. These can be stored in either S3 or NFS.
    • If you use S3 or Ceph for artifact storage, ensure you have the following credentials:
      • S3 Access Key
      • S3 Secret Key
      • S3 Endpoint URL
    • If you use NFS, make sure you have the name of the PVC (Persistent Volume Claim) that the Gathr pod uses to share the path between the Gathr and MLflow pods.

  • (Optional) Access to the private Docker registry should be available so that images can be pulled from it.
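Loading the provided images into a private registry usually means loading the archive, re-tagging each image, and pushing it. The sketch below illustrates that flow; the function name push_to_registry, the registry address, the archive name, and the image names are all hypothetical, since the actual image names come from the Gathr team.

```shell
# Hypothetical helper: re-tag local images for a private registry and push them.
push_to_registry() {
  registry="$1"; shift
  for image in "$@"; do
    # Keep only the last path component, e.g. gathr/mlflow:1.0 -> mlflow:1.0
    target="${registry}/${image##*/}"
    docker tag "$image" "$target" || return 1
    docker push "$target" || return 1
  done
}

# Example usage (all names are placeholders):
#   docker load -i mlflow-images.tar
#   docker login registry.example.com:5000
#   push_to_registry registry.example.com:5000 gathr/mlflow:1.0
```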

Steps to Install Gathr using Playbook

  1. On the Ansible server, download the playbook bundle shared by the Gathr team.

  2. Untar the bundle using the command below:

    tar -xzf GathrBundle.tar.gz
    
  3. Go to the Playbook Path.

    cd /path/to/playbook
    
  4. (Optional) To add host entries inside the Gathr container, create a file named hosts and place it inside the packages folder of the playbook.

    Example:

    vim packages/hosts
    

    Add entries like:

    10.0.0.1 gathr-node1
    10.0.0.2 gathr-node2
    

    Save the file.

  5. (Optional) Copy the haproxy.pem file into the packages/ folder. This PEM file is used to enable SSL on HAProxy.

  6. Update the properties in the gathr.properties file. Ensure that all properties are filled in correctly.

    A sample gathr.properties file is provided, with comments on each property to guide you toward appropriate values.

  7. Once the file is updated, run config.sh to reload the Ansible variables, using the command below:

    ./config.sh gathr.properties
    
  8. Run the playbook

    You can run the playbook using one of the following methods:

    • To install all components at once, run the following command:

      ansible-playbook -i hosts gathr_one.yaml -v
      
    • To install components individually, use the respective commands:

      • To install PostgreSQL:

        ansible-playbook -i hosts postgres.yml -v
        
      • To install Zookeeper:

        ansible-playbook -i hosts zookeeper.yml -v
        
      • To install Gathr:

        ansible-playbook -i hosts gathr_saas_ul_HA.yml -v
        
      • To install HAProxy (Exclude it if using AWS LoadBalancer):

        ansible-playbook -i hosts haproxy.yml -v
        
      • To install RabbitMQ:

        ansible-playbook -i hosts rabbitmq.yml -v
        
      • To install Elasticsearch:

        ansible-playbook -i hosts elasticsearch.yml -v
        
      • To install Spark Standalone:

        ansible-playbook -i hosts spark.yml -v
        
      • To install HDFS and YARN:

        ansible-playbook -i hosts hadoop.yml -v
        
      • To install Klera:

        ansible-playbook -i hosts klera.yml -v
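When installing components individually, a small wrapper can run the chosen playbooks in sequence and stop at the first failure. This is a sketch, not part of the bundle: the function name run_playbooks is hypothetical, and the order shown in the usage comment simply follows the order in which the commands are listed above.

```shell
# Hypothetical helper: run playbooks one after another against one inventory,
# stopping as soon as one of them fails.
run_playbooks() {
  inventory="$1"; shift
  for playbook in "$@"; do
    echo "==> ${playbook}"
    ansible-playbook -i "$inventory" "$playbook" -v || {
      echo "Playbook ${playbook} failed; stopping." >&2
      return 1
    }
  done
}

# Example usage from the playbook path:
#   ansible -i hosts all -m ping      # optional connectivity check first
#   run_playbooks hosts postgres.yml zookeeper.yml gathr_saas_ul_HA.yml
```

Stopping at the first failure avoids installing Gathr on top of a database or Zookeeper deployment that did not come up.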
        

Post Deployment Validation

  1. Navigate to the Gathr UI using the URL http://<GATHR_HOST>:8090/. When the interface loads, a license agreement page will display:

    License Agreement

    Check the “I accept” box and click the “Accept” button.

  2. The Upload License page will then be visible:

    Upload License

    Upload your valid Gathr license and click “Confirm”.

    Confirm License

  3. Proceed by clicking “Continue” on the welcome page:

    Welcome Page

    This will lead you to the Login Page:

    Login Page

    Log in using the default superuser credentials:

    • Email: super@mailinator.com

    • Password: superuser

  4. The deployment of Gathr is now complete.

    Deployment Success
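Before opening the browser, it can help to confirm from the command line that the Gathr UI answers on port 8090. The following is a minimal sketch assuming curl is installed; the function name wait_for_url and the retry defaults are hypothetical, and <GATHR_HOST> must be replaced with your host.

```shell
# Hypothetical helper: poll a URL until it responds without an HTTP error,
# or give up after a number of attempts.
wait_for_url() {
  url="$1"; attempts="${2:-30}"; delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP 4xx/5xx as failure; --max-time: do not hang on one try.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "UP after ${i} attempt(s): ${url}"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "DOWN after ${attempts} attempt(s): ${url}" >&2
  return 1
}

# Example usage:
#   wait_for_url "http://<GATHR_HOST>:8090/" 30 10
```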

References

Steps for Creating VPC using Subnets

NOTE: Creating a VPC is required only if you do not plan to launch the AMI in an existing VPC. If you use an existing VPC, make sure that it is set up as described below.

  1. Click the Services drop-down and search for VPC.

    Navigate to VPC

  2. Click Start VPC Wizard and select VPC with Public and Private Subnets.

    Start VPC Wizard

  3. Make sure that the Public and Private subnets are in the same Availability Zone.

    • A public subnet with Internet gateway access, for the Gathr web interface.

    • A private subnet for the Gathr application.

    Public and Private Subnets

  4. Create a new Elastic IP for the NAT Gateway.

    Create Elastic IP

  5. Click Create VPC.

    Create VPC

  6. VPC is now created.

    VPC Successfully Created


Creating Security Groups

The following ports need to be opened in the VM security group.

Service | Ports (Default) | Optional
Zookeeper | 2181, 2888, 3888, 8081 | No
Postgres | 5432 | No
Gathr | 8090, 9595 | No
HAProxy | 8090, 9596 | No
Elasticsearch | 9200, 9300 | Yes
RabbitMQ | 5672, 15672 | Yes
Spark Standalone | 7077, 8080, 8081, 6066, 18080 | Yes
HDFS & YARN (Non-HA) | 8020, 9870, 9864, 9866, 9867, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes
HDFS & YARN (HA) | 8485, 8480, 8020, 50070, 8019, 9866, 9867, 50075, 8141, 8088, 8030, 8025, 8050, 8042, 8040, 19888, 10020, 10033 | Yes

Create the following security groups:

  1. SAX-WebServerSecurityGroup with following permissions:

    Inbound:

    WebServer Inbound Security Groups

    Outbound:

    • Allow all traffic to 0.0.0.0/0

    WebServer Outbound Security Groups

  2. SAX-SAXEMR-SecurityGroup with following permissions:

    Inbound:

    SAXEMR Inbound Security Groups

    Outbound:

    • Allow all traffic to 0.0.0.0/0

    SAXEMR Outbound Security Groups
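If you prefer the AWS CLI to the console, inbound rules for the ports in the table can be added in a loop. This is a sketch under assumptions: the function name open_tcp_ports, the group ID, and the CIDR are placeholders, and the exact inbound rules for each security group are defined by the console screenshots of this guide, which are not reproduced here.

```shell
# Hypothetical helper: allow inbound TCP traffic from one CIDR block
# on each listed port of an existing security group.
open_tcp_ports() {
  group_id="$1"; cidr="$2"; shift 2
  for port in "$@"; do
    aws ec2 authorize-security-group-ingress \
      --group-id "$group_id" --protocol tcp --port "$port" --cidr "$cidr" || return 1
  done
}

# Example usage (group ID and CIDR are placeholders):
#   open_tcp_ports sg-0123456789abcdef0 10.0.0.0/16 8090 9595
```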


Setup Roles for EMR

You need to create three IAM roles: “EMR_AutoScaling_DefaultRole”, “EMR_DefaultRole”, and “EMR_EC2_DefaultRole”. These roles will be available as configuration values when you create an EMR cluster in Gathr.

There are two ways of creating the EMR roles:

  1. Create EMR Cluster - This will automatically create the required EMR roles.

    • If you have never created an EMR cluster, create one in AWS console. It will create the necessary IAM roles in your AWS account.
  2. Create the EMR roles manually

    • Create IAM Role: “EMR_AutoScaling_DefaultRole”

      Add the following policies:

      EMR AutoScaling Default Role

      Trust Relationship:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": [
                "application-autoscaling.amazonaws.com",
                "elasticmapreduce.amazonaws.com",
                "ec2.amazonaws.com"
              ]
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
      
    • Create IAM Role: “EMR_DefaultRole”

      Add the following policies:

      EMR Default Role

      Trust Relationship:

      {
        "Version": "2008-10-17",
        "Statement": [
          {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
              "Service": "elasticmapreduce.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
      
    • Create IAM Role: “EMR_EC2_DefaultRole”

      Add the following policies:

      EMR EC2 Default Role

      Trust Relationship:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }
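The manual role creation above can also be done with the AWS CLI by saving each trust relationship to a file. The sketch below assumes that approach; the function name create_role and the file names are hypothetical, and the policies shown in the screenshots still have to be attached separately.

```shell
# Hypothetical helper: create an IAM role from a trust-policy JSON file.
create_role() {
  role_name="$1"; trust_file="$2"
  [ -f "$trust_file" ] || { echo "missing trust policy: ${trust_file}" >&2; return 1; }
  aws iam create-role --role-name "$role_name" \
    --assume-role-policy-document "file://${trust_file}"
}

# Example usage (file names are placeholders holding the JSON shown above):
#   create_role EMR_DefaultRole emr-default-trust.json
#   create_role EMR_EC2_DefaultRole emr-ec2-trust.json
#   # Roles assumed by EC2 also need an instance profile when created via CLI:
#   aws iam create-instance-profile --instance-profile-name EMR_EC2_DefaultRole
#   aws iam add-role-to-instance-profile \
#     --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
```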
      

Setup Role for Gathr WebStudio EC2

Create IAM Role “GathrWebstudio_EC2Role” with the following inline JSON policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "kms:ListKeyPolicies",
        "kms:ListRetirableGrants",
        "kms:ListAliases",
        "kms:ListGrants",
        "iam:GetPolicyVersion",
        "iam:GetPolicy",
        "s3:ListAllMyBuckets",
        "iam:ListRoles",
        "sts:AssumeRole",
        "elasticmapreduce:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "iam:PassRole",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_EC2_DefaultRole",
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_DefaultRole",
        "arn:aws:iam::<AWS_Account_ID>:role/EMR_AutoScaling_DefaultRole",
        "arn:aws:s3:::<S3_Metadata_Bucket_Name>/*",
        "arn:aws:s3:::<S3_Metadata_Bucket_Name>"
      ]
    }
  ]
}

Trust Relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
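The inline policy above contains two placeholders, <AWS_Account_ID> and <S3_Metadata_Bucket_Name>, that must be replaced before the policy is applied. A minimal sketch of that substitution follows; the function name render_policy, the file names, the bucket name, and the policy name are hypothetical.

```shell
# Hypothetical helper: fill in the account ID and bucket name placeholders
# of a policy template and print the result to stdout.
render_policy() {
  account_id="$1"; bucket="$2"; template="$3"
  sed -e "s|<AWS_Account_ID>|${account_id}|g" \
      -e "s|<S3_Metadata_Bucket_Name>|${bucket}|g" "$template"
}

# Example usage (file, bucket, and policy names are placeholders):
#   account_id=$(aws sts get-caller-identity --query Account --output text)
#   render_policy "$account_id" my-gathr-metadata policy-template.json > policy.json
#   aws iam put-role-policy --role-name GathrWebstudio_EC2Role \
#     --policy-name GathrWebstudioInline --policy-document file://policy.json
```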

Launch EC2 Instances for Gathr WebStudio

  1. Choose the AMI:

    Choose AMI

    • Select an AMI for one of the supported operating systems: RHEL 9, OEL 9, or Amazon Linux 2023.
  2. Choose Instance Type:

    • Select instance type m5.2xlarge or larger.

    Choose Instance Type

  3. Configure Instance:

    • VPC: Select a pre-created VPC from drop down

    • Subnet: Select pre-created subnet from drop down

    • Auto-assign IP: Enable

    • IAM role: Select “GathrWebstudio_EC2Role” which you created earlier

    Configure Instance

    Click Next on Network Interface.

  4. Add Storage:

    • On Add Storage tab, provide 100 GB or more storage.

    Add Storage

  5. Add Tags:

    • Provide a Name to the EC2 instance.

    Add Tags

  6. Configure Security Group:

    • Select previously created security groups: ‘SAX-WebServerSecurityGroup’ and ‘SAX-SAXEMR-SecurityGroup’.

    Configure Security Group

  7. Review and Launch:

    • Review the settings and launch the instance, providing the PEM file when prompted.
  8. Associate Elastic IP address (Optional):

    • Select ‘eth0’ as the network interface.

    • Select Private IP of the instance.
