Kudu ETL Target

Apache Kudu is a column-oriented data store in the Apache Hadoop ecosystem. It enables fast analytics on fast (rapidly changing) data. Kudu is engineered to take advantage of modern hardware and in-memory processing, which lowers query latency significantly compared to similar tools.

Target Configuration

Each configuration property available in the Kudu emitter is explained below.

Connection Name

Connections are the service identifiers. Select a connection name from the list if you have already created and saved connection details for Kudu, or create one as explained in the topic Kudu Connection →


Table Administration

If checked, the target table will be created in Kudu.


Primary Keys

Select the fields that will form the primary key of the table.


Partition List

Select the fields on which the table will be partitioned.


Buckets

The number of buckets used for hash partitioning.


Replication

The replication factor used to make additional copies of data. The value must be 1, 3, 5, or 7.
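
Taken together, the Table Administration, Primary Keys, Partition List, Buckets, and Replication options map onto a single Kudu table-creation call. The following is a minimal sketch using the Kudu Java client from Scala; the master address, table name, and column names are illustrative assumptions, not values from this product:

    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import scala.collection.JavaConverters._

    // Connect to the Kudu master (address is a placeholder).
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

    // Primary Keys: columns marked key(true) form the table's primary key.
    val columns = Seq(
      new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
      new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()
    ).asJava
    val schema = new Schema(columns)

    // Partition List + Buckets: hash-partition on "id" into 4 buckets.
    // Replication: 3 tablet replicas (the value must be 1, 3, 5, or 7).
    val options = new CreateTableOptions()
      .addHashPartitions(Seq("id").asJava, 4)
      .setNumReplicas(3)

    client.createTable("example_table", schema, options)
    client.close()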


Checkpoint Storage Location

Select the checkpointing storage location. Available options are HDFS, S3, and EFS.

Checkpoint Connections

Select the connection. The connections listed correspond to the selected storage location.

Checkpoint Directory

The path where the Spark application stores the checkpointing data.

For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.

For S3, enter an absolute path, for example: s3://BucketName/checkpointingDir
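
For illustration, the checkpoint directory corresponds to Spark Structured Streaming's standard checkpointLocation option. A minimal sketch, assuming a rate source and an S3 path (the bucket name is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
    val df = spark.readStream.format("rate").load()

    // checkpointLocation: an absolute s3:// path for S3, or a relative
    // path such as /user/hadoop/checkpointingDir for HDFS and EFS.
    val query = df.writeStream
      .format("console")
      .option("checkpointLocation", "s3://BucketName/checkpointingDir")
      .start()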

Time-Based Check Point

Select the checkbox to enable a time-based checkpoint on each pipeline run; that is, on each run the checkpoint location provided above is appended with the current time in milliseconds.
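
In effect, enabling this option amounts to suffixing the configured location with the run's start time. A sketch of the idea (the base path is a placeholder):

    // Append the current time in millis so each pipeline run
    // checkpoints into a fresh subdirectory.
    val baseDir = "s3://BucketName/checkpointingDir"
    val timeBasedDir = s"$baseDir/${System.currentTimeMillis()}"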

Output Fields

Select the fields whose values you want to persist in the table.
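
In Spark terms this amounts to a column projection before the write. Reusing the rate-source df from the checkpoint sketch above:

    // Persist only the selected fields; "timestamp" and "value" are the
    // rate source's columns, standing in for your pipeline's fields.
    val out = df.select("timestamp", "value")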

Output Mode

The output mode to be used while writing the data to the streaming sink. The available modes are listed below; a sketch of how they map to Spark follows the list.

  • Append Mode: only the new rows in the streaming data are written to the sink.

  • Complete Mode: all the rows in the streaming data are written to the sink every time there are updates.

  • Update Mode: only the rows that were updated in the streaming data are written to the sink every time there are updates.

  • Save Mode: specifies how to handle the existing data in the target.
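
These modes correspond to Spark Structured Streaming's outputMode setting. A minimal sketch, reusing the df stream from the checkpoint example above:

    // Choose one of "append", "complete", or "update" per the
    // descriptions above (the checkpoint path is a placeholder).
    df.writeStream
      .format("console")
      .outputMode("append")
      .option("checkpointLocation", "/tmp/checkpoints")
      .start()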

Enable Trigger

Trigger defines how frequently a streaming query should be executed.

Processing Time

This field appears only when the Enable Trigger checkbox is selected.

Processing Time is the trigger time interval, specified in minutes or seconds.
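
This corresponds to Spark's processing-time trigger. A minimal sketch, again reusing df from the checkpoint example (the interval and paths are illustrative):

    import org.apache.spark.sql.streaming.Trigger

    // Fire a micro-batch every 10 seconds.
    df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .option("checkpointLocation", "/tmp/checkpoints-trigger")
      .start()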


Post Action

Post actions are available with batch data sources.

To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Post-Actions →


Notes

Optionally, enter notes in the Notes → tab and save the configuration.
