Apache Airflow Installation

Gathr supports Airflow versions 1.10.5 (Airflow1) and 2.1.2 (Airflow2).

This topic covers the steps for a fresh installation of Airflow1 and Airflow2, as well as the steps to upgrade from Airflow1 to Airflow2.

Note: When utilizing Airflow services in Gathr, users can select the required version, i.e., Airflow1 or Airflow2, simply by providing the correct Airflow server URL in the Airflow configuration. So, if Airflow1 and Airflow2 are installed on different nodes, either one can be pointed to from the Gathr application.

Airflow1 Installation

Given below are the steps to do a fresh installation of Airflow1 (Version: 1.10.5).

Note: Gathr supports Apache Airflow with the default Python, i.e., Python 2.7.

1. Create a folder that will be used as the Airflow home (with sax user):

sax> mkdir /home/sax/airflow_home


2. Create a dags folder:

sax> mkdir /home/sax/airflow_home/dags


3. Login with the root user, open the .bashrc file, and add the following property in it:

export SLUGIFY_USES_TEXT_UNIDECODE=yes


4. Login with the Gathr user, open the .bashrc file, and add the following in it:

export AIRFLOW_HOME=/home/sax/airflow_home


5. Install Airflow using the following command (with root user):

root > pip install apache-airflow==1.10.5


6. Initialize the Airflow database (with Gathr user):

sax> airflow initdb


Note: Steps 7 and 8 should be performed only after Sub-Packages Installation, Configuration, and Plugin Installation are successfully completed.

7. Start the Airflow webserver (with Gathr user):

sax> airflow webserver


8. Start the Airflow scheduler (with Gathr user):

sax> airflow scheduler
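In practice, the webserver and scheduler are usually left running in the background. A minimal sketch using nohup (the log file names and locations are illustrative):

sax> nohup airflow webserver > /home/sax/airflow_home/webserver.log 2>&1 &

sax> nohup airflow scheduler > /home/sax/airflow_home/scheduler.log 2>&1 &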


Sub Packages Installation

Install the sub-packages with the root user:

root> pip install apache-airflow[hdfs]==1.10.5

root> yum install mariadb-devel

root> pip install apache-airflow[mysql]==1.10.5

root> pip install apache-airflow[mssql]==1.10.5

root> pip install apache-airflow[postgres]==1.10.5

root> pip install apache-airflow[ssh]==1.10.5

root> pip install apache-airflow[vertica]==1.10.5

root> pip install kafka-python==1.4.6

root> pip install holidays==0.9.10
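To confirm that the sub-packages were installed against the expected Python, you can list the installed Airflow packages and check the Airflow version (output varies by environment):

root> pip freeze | grep -i airflow

sax> airflow version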


For more details, refer to:

https://airflow.apache.org/installation.html

Kerberos Support

root>yum install cyrus-sasl-devel.x86_64

root>pip install apache-airflow[kerberos]==1.10.5


Configuration

Go to $AIRFLOW_HOME, open the airflow.cfg file, and change the following properties:

default_timezone = system

base_url = http://ipaddress:port

web_server_host = ipaddress

web_server_port = port (i.e. 9292)

Add SMTP details for email under the [smtp] section of the config file.

Uncomment and provide values for the following:

• smtp_host

• smtp_user

• smtp_password

• smtp_port

• smtp_mail_from

catchup_by_default = False

dag_dir_list_interval = 5

executor = LocalExecutor
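For reference, a filled-in fragment of airflow.cfg with the above properties might look as follows. The IP address, port, and SMTP values are illustrative, and the section placement shown follows a stock airflow.cfg of this version:

[core]
default_timezone = system
executor = LocalExecutor

[webserver]
base_url = http://192.168.1.10:9292
web_server_host = 192.168.1.10
web_server_port = 9292

[smtp]
smtp_host = smtp.example.com
smtp_user = airflow@example.com
smtp_password = <smtp_password>
smtp_port = 587
smtp_mail_from = airflow@example.com

[scheduler]
catchup_by_default = False
dag_dir_list_interval = 5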

If the environment is Kerberos Security enabled, then add the following configurations:

security = kerberos

[kerberos]

ccache = cache file path

principal = user principal

reinit_frequency = 3600

kinit_path = path to kinit command (i.e. kinit)

keytab = keytab file (i.e. /etc/security/keytabs/service.keytab)
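For reference, a filled-in example of the Kerberos settings (the cache path, principal, and keytab below are illustrative):

[core]
security = kerberos

[kerberos]
ccache = /tmp/airflow_krb5ccache
principal = airflow@EXAMPLE.COM
reinit_frequency = 3600
kinit_path = kinit
keytab = /etc/security/keytabs/service.keytab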

Database Configuration

By default, Airflow uses SQLite as the database. It also allows the user to change the database.

The following are the steps to configure Postgres as the database:

1. Create Airflow user.

sudo -u postgres createuser --interactive

Enter name of role to add: airflow

Shall the new role be a superuser? (y/n) n

Shall the new role be allowed to create databases? (y/n) n

Shall the new role be allowed to create more new roles? (y/n) n


2. Set password for Airflow user

postgres=# ALTER USER airflow WITH PASSWORD 'airflow';


3. Create Airflow database

postgres=# CREATE DATABASE airflow;


4. Open the airflow.cfg file and provide the Postgres details (i.e., username, password, ipaddress:port, and databasename):

sql_alchemy_conn = postgresql://username:password@ipaddress:port/databasename
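For example, with the 'airflow' user and database created above and Postgres running locally on its default port, the value would look as follows (all values are illustrative):

sql_alchemy_conn = postgresql://airflow:airflow@localhost:5432/airflow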


5. Now run the following command to set up the database:

sax> airflow initdb


Plugin Installation

Steps to add Gathr Airflow Plugin in Airflow:

1. Create a plugins folder in the Airflow home (if it does not exist), i.e., $AIRFLOW_HOME/plugins.

2. Untar <sax_home>/conf/common/airflow-plugin/sax_airflow_rest_api_plugin.tar.gz

3. Copy sax_airflow_rest_api_plugin/* to the Airflow plugins folder.

Authentication

Token-based authentication is supported.

Provide the token in the request header. The same token key and value must be provided in the Airflow config file.

Add the following entry in the $AIRFLOW_HOME/airflow.cfg file:

[sax_rest_api]


# key and value to authenticate http request

sax_request_http_token_name = <sax_request_http_header_token>

sax_request_http_token_value = <token>


Here,

<sax_request_http_header_token>: Replace with the key used in the request header for the token.

<token>: Replace with the token value.
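For illustration, with a header name and token value of your choosing (both values below are placeholders), the entry would look as follows; Gathr then sends the matching header with every request to the plugin:

[sax_rest_api]
sax_request_http_token_name = sax-token
sax_request_http_token_value = 1a2b3c4d5e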

To configure Airflow in Gathr, see the Workflows topic in the Gathr User’s Guide.

If Airflow is running on HTTP and Gathr is running on HTTPS, then the user needs to do the following configuration:

Add the following in the airflow.cfg file:

[sax_rest_api]

# cert file if sax is running with https (else do not provide this key)

sax_cert_file = certificate path

Create the certificate that is required to connect with Gathr (SAX). Place this certificate file on the machine where Airflow is running and provide its path.
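One common way to obtain the certificate, assuming the Gathr HTTPS endpoint is reachable from the Airflow machine and that the plugin accepts a PEM file, is to export it with openssl (host, port, and output path below are placeholders):

sax> echo | openssl s_client -connect <gathr-host>:<https-port> -showcerts 2>/dev/null | openssl x509 -outform PEM > /home/sax/gathr_cert.pem

The resulting file path would then be provided as the value of sax_cert_file.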

Note:

1. If you are using Kafka Operator in the Workflow, make sure the SSL certificates are enabled on Kafka. To know more about this, see the Create a Workflow > Nodes > Actions > Kafka Alert Operator section in the Workflows topic of the Gathr User’s Guide.

2. The HDFS sensor in Airflow will only work when Airflow is installed on one of the nodes of the cluster. It will not work when it is pointed to from a node that is not part of the cluster. (This applies specifically to the HDFS sensor in a Kerberos setup only.)

To know more about this, see the Create a Workflow > Nodes > Actions > HDFS Sensor section in the Workflows topic of the Gathr User’s Guide.

Airflow Error

While starting the Airflow webserver, the following error may occur:

“Error: No module named 'airflow.www'”

Reason:

If both Python2 and Python3 are installed on the machine where Airflow is deployed, the default gunicorn library (used by Airflow) may have switched to Python3 instead of Python2.

Workaround:

Run the following command:

sax> whereis gunicorn


The output could be as follows:

gunicorn: /usr/bin/gunicorn /usr/local/bin/gunicorn
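To confirm which Python interpreter each gunicorn executable is bound to, inspect its shebang line (the outputs shown in the comments are illustrative):

sax> head -1 /usr/bin/gunicorn        # e.g. #!/usr/bin/python2.7
sax> head -1 /usr/local/bin/gunicorn  # e.g. #!/usr/bin/python3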


Open the file /usr/lib/python2.7/site-packages/airflow/bin/cli.py:

root> vi /usr/lib/python2.7/site-packages/airflow/bin/cli.py

In the method def webserver(args), search for:

run_args = ['gunicorn',
            '-w', str(num_workers),
            '-k', str(args.workerclass),
            '-t', str(worker_timeout),
            '-b', args.hostname + ':' + str(args.port),
            '-n', 'airflow-webserver',
            '-p', str(pid),
            '-c', 'python:airflow.www.gunicorn_config',]

Replace it with

run_args = ['/usr/bin/gunicorn',
            '-w', str(num_workers),
            '-k', str(args.workerclass),
            '-t', str(worker_timeout),
            '-b', args.hostname + ':' + str(args.port),
            '-n', 'airflow-webserver',
            '-p', str(pid),
            '-c', 'python:airflow.www.gunicorn_config',]

Airflow2 Installation/Upgrade

Given below are the steps for a fresh installation of Airflow2 (Version: 2.1.2), as well as for an upgrade from Airflow1 to Airflow2.

Prerequisites

• Default Python must be 2.7.

• Python and Python2 must point to Python 2.7.

• Python 3.7.9 must be installed. Python3 must point to Python 3.7.x.

• pip and pip2 must point to pip2.7.

• pip3 must point to pip3.7.

• Make sure that the version of the SQLite database is greater than 3.15.0 (for Airflow2 only).
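The prerequisites can be verified with the below commands (the expected outputs in the comments are illustrative):

sax> python --version     # Python 2.7.x
sax> python2 --version    # Python 2.7.x
sax> python3 --version    # Python 3.7.9
sax> pip --version        # pip ... (python 2.7)
sax> pip3 --version       # pip ... (python 3.7)
sax> python3 -c "import sqlite3; print(sqlite3.sqlite_version)"   # must be greater than 3.15.0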

Remove Airflow 1.x

Note: This section is only applicable for upgrade from Airflow1 to Airflow2.

If you have an installation of Airflow 1.10.5 with Python 2.7, first follow these steps to uninstall Airflow 1.10.5. If not, skip these steps:

1. Unschedule all workflows on Gathr.

2. Run the below command and note the Airflow installation location (i.e., /usr/lib/python2.7/site-packages):

sax> pip2 show apache-airflow


3. Uninstall Airflow 1.10.5 using below command:

root> pip2 uninstall apache-airflow


4. Go to Airflow installation location (i.e., /usr/lib/python2.7/site-packages) and remove all the folders related to Airflow.

5. Locate the airflow executable and delete it. The first command will show the executable path (i.e., /usr/bin/airflow):

root> whereis airflow

root> rm -rf /usr/bin/airflow

6. Go to AIRFLOW_HOME and take a backup of the airflow.cfg file using the below commands:

sax> cd $AIRFLOW_HOME

sax> mv airflow.cfg airflow.cfg.bck


7. Go to AIRFLOW_HOME and remove the contents of the dags and plugins folders using the below commands:

sax> cd $AIRFLOW_HOME/dags

sax> rm -rf *

sax> cd $AIRFLOW_HOME/plugins

sax> rm -rf *


Airflow2 Installation/Upgrade Steps

Note: Skip steps 1-4 if you are upgrading from Airflow1 to Airflow2.

1. Create a folder that will be used as the Airflow home using the below command:

sax> mkdir /home/sax/airflow_home


2. Create a dags folder using the below command:

sax> mkdir /home/sax/airflow_home/dags


3. Login with the root user, open the .bashrc file, and append the below statement to it:

export SLUGIFY_USES_TEXT_UNIDECODE=yes


4. Login with the sax user, open the .bashrc file, and add the Airflow home as an environment variable:

export AIRFLOW_HOME=/home/sax/airflow_home


5. Install Airflow using the following command:

root> pip3 install apache-airflow==2.1.2


6. Initialize the Airflow database using below command:

sax> airflow db init


To configure a different database, please see Database Configuration.

To know more about how to get started with Apache Airflow, see below reference:

https://airflow.apache.org/docs/apache-airflow/stable/start/index.html

Airflow Providers Installation

The next step is to install the Airflow providers.

Use below commands to install the Airflow providers:

root> yum install mariadb-devel

(On Ubuntu, run sudo apt-get install libmysqlclient-dev and sudo apt-get install libmariadbclient-dev instead.)

root>pip3 install apache-airflow-providers-apache-hdfs==1.0.1

root>pip3 install apache-airflow-providers-postgres==1.0.2

root>pip3 install apache-airflow-providers-mysql==1.1.0

root>pip3 install apache-airflow-providers-microsoft-mssql==1.1.0

root>pip3 install apache-airflow-providers-sftp==1.2.0

root>pip3 install apache-airflow-providers-ssh==1.3.0

root>pip3 install apache-airflow-providers-vertica==1.0.1

root>pip3 install kafka-python==2.0.2

root>pip3 install holidays==0.9.10

root>pip3 install apache-airflow-providers-http==1.1.1

root>pip3 install gssapi==1.7.0
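After installation, the registered providers can be verified from the Airflow CLI:

sax> airflow providers list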


To know more about Apache Airflow installation, see below reference:

https://airflow.apache.org/installation.html

Kerberos Support

Use the below commands to install the Kerberos-related packages:

root>yum install cyrus-sasl-devel.x86_64

root>pip3 install apache-airflow[kerberos]==2.1.2


Configuration

Go to $AIRFLOW_HOME and open airflow.cfg file. Change the following properties in the file:

base_url = http://ipaddress:port (i.e. http://172.29.59.97:9292)

web_server_host = ipaddress

web_server_port = port (i.e. 9292)

Add SMTP details for email under the [smtp] section of the config file. Uncomment and provide values for the following properties:

- smtp_host

- smtp_user

- smtp_password

- smtp_port

- smtp_mail_from

catchup_by_default = False

dag_dir_list_interval = 5

executor = LocalExecutor

Note: If you are upgrading Airflow, open the airflow.cfg.bck file, copy the value of the fernet_key property, and add it to the airflow.cfg file for fernet_key.

If the environment is Kerberos Security enabled, then add the following configurations:

security = kerberos

[kerberos]

ccache = cache file path

principal = user principal

reinit_frequency = 3600

kinit_path = path to kinit command (i.e. kinit)

keytab = keytab file (i.e. /etc/security/keytabs/service.keytab)


Database Configuration

Steps for Airflow Upgrade

1. Copy the value of the sql_alchemy_conn property from the airflow.cfg.bck file.

2. Provide the copied value in the airflow.cfg file for the sql_alchemy_conn property.

3. Run below command:

sax>airflow db upgrade
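After the upgrade completes, you can verify that Airflow is able to reach the configured database:

sax> airflow db check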


Steps for Fresh Installation

Airflow uses SQLite as the default database. It also allows the user to change to a preferred database.

Steps to configure Postgres as the preferred database are given below:

1. Create ‘airflow’ user using below command:

sudo -u postgres createuser --interactive


Enter name of role to add: airflow

Shall the new role be a superuser? (y/n) n

Shall the new role be allowed to create databases? (y/n) n

Shall the new role be allowed to create more new roles? (y/n) n

2. Set password for 'airflow' user using below command:

postgres=# ALTER USER airflow WITH PASSWORD 'airflow';


3. Create Airflow database using below command:

postgres=# CREATE DATABASE airflow;


4. Grant privileges on the Airflow database using the below command:

postgres=# GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;


5. Open the airflow.cfg file and provide the Postgres details (i.e., username, password, ipaddress:port, and databasename):

sql_alchemy_conn = postgresql://username:password@ipaddress:port/databasename


6. Generate a new fernet key for the fresh installation and update this value in Airflow.

• Open the python3 terminal and import the fernet module by executing the below command:

from cryptography.fernet import Fernet


• Generate the fernet key using the below command:

fernet_key = Fernet.generate_key()


• Print the newly generated fernet key on the console using the below command:

print(fernet_key.decode()) # <your fernet_key>


Note: Store the generated fernet key securely.

• Update this fernet key in the airflow.cfg file, which is present in the following path:

/home/sax/airflow_home/

The commands to update the fernet key in the config file are:

vi airflow_home/airflow.cfg

/fernet

fernet_key = <paste the fernet key generated in the above steps>

:wq!
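Alternatively, the key can be generated and printed in a single command instead of an interactive python3 session (a one-liner sketch using the same cryptography package):

sax> python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"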


To know more about usage of fernet in Airflow, see below reference:

https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html

7. Now, run the below command to set up the database:

sax> airflow db init


If the SQLite version is lower than 3.15.0, then the below commands can be used to build and install a newer SQLite (for Airflow2 only):

root> wget https://www.sqlite.org/src/tarball/sqlite.tar.gz

root> tar xzf sqlite.tar.gz

root> cd sqlite/

root> export CFLAGS="-DSQLITE_ENABLE_FTS3 \
-DSQLITE_ENABLE_FTS3_PARENTHESIS \
-DSQLITE_ENABLE_FTS4 \
-DSQLITE_ENABLE_FTS5 \
-DSQLITE_ENABLE_JSON1 \
-DSQLITE_ENABLE_LOAD_EXTENSION \
-DSQLITE_ENABLE_RTREE \
-DSQLITE_ENABLE_STAT4 \
-DSQLITE_ENABLE_UPDATE_DELETE_LIMIT \
-DSQLITE_SOUNDEX \
-DSQLITE_TEMP_STORE=3 \
-DSQLITE_USE_URI \
-O2 \
-fPIC"

root> export PREFIX="/usr/local"

root> LIBS="-lm" ./configure --disable-tcl --enable-shared --enable-tempstore=always --prefix="$PREFIX"

root> make

root> make install
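The rebuilt library is installed under /usr/local/lib. If Python still reports the old SQLite version afterwards, pointing the loader at the new library and re-checking usually resolves it (a common post-install step; adjust the path to your environment):

root> export LD_LIBRARY_PATH="/usr/local/lib"
root> python3 -c "import sqlite3; print(sqlite3.sqlite_version)"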


Create Admin User

Run the below command to create an admin user in Airflow:

sax> airflow users create --firstname <firstname> --lastname <lastname> --password <password> --role Admin --username <firstname> --email <user’s email ID>


You can use the same command to create multiple Airflow users with different roles.
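The created users and their roles can be listed for verification:

sax> airflow users list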

Gathr supports the default authentication method, which is Airflow DB authentication.

Plugin Installation

Steps to add Gathr Airflow Plugin in Airflow:

1. Create a plugins folder in the Airflow home (if it does not exist), i.e., $AIRFLOW_HOME/plugins.

2. Go to the folder <sax_home>/conf/common/airflow-plugin/airflow2/ and copy the content from this folder to the Airflow plugins folder.

Start the Airflow webserver using the below command:

sax> airflow webserver -p <port_number>


Start the Airflow scheduler using the below command:

sax> airflow scheduler
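Once the webserver and scheduler are both running, Airflow2's health endpoint can be used to confirm their status (the host and port are those configured earlier):

sax> curl http://<ipaddress>:<port_number>/health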