Kudu ETL Target
Apache Kudu is a column-oriented data store of the Apache Hadoop ecosystem. It enables fast analytics on fast (rapidly changing) data. The emitter is engineered to take advantage of hardware and in-memory processing, and it significantly lowers query latency compared to similar tools.
Target Configuration
Each configuration property available in the Kudu emitter is explained below.
Connection Name
Connections are service identifiers. Select a connection name from the list if you have already created and saved connection details for Kudu, or create one as explained in the topic - Kudu Connection →
Table Administration
If checked, the table will be created.
Primary Keys
Select the fields that will be the primary keys of the table.
Partition List
Select the fields on which the table will be partitioned.
Buckets
The number of buckets used for partitioning.
Replication
Replication factor used to make additional copies of data. The value must be 1, 3, 5, or 7.
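To illustrate how the table settings above map onto Kudu, here is a minimal sketch using the Apache Kudu Java client from Scala. The master address, table name, column names, bucket count, and replication factor are all placeholders; the emitter performs table creation itself when Table Administration is checked, so this only shows what the settings mean in Kudu terms.

```scala
import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import scala.collection.JavaConverters._

object KuduTableSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master address; use your Kudu connection's master list.
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
    try {
      // Primary-key columns must be listed first and be non-nullable.
      val columns = List(
        new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).nullable(true).build()
      )
      val schema = new Schema(columns.asJava)

      // Hash-partition on the key into 4 buckets; keep 3 replicas of the data.
      val options = new CreateTableOptions()
        .addHashPartitions(List("id").asJava, 4)
        .setNumReplicas(3)

      if (!client.tableExists("example_table")) {
        client.createTable("example_table", schema, options)
      }
    } finally {
      client.shutdown()
    }
  }
}
```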
Checkpoint Storage Location
Select the checkpointing storage location. Available options are HDFS, S3, and EFS.
Checkpoint Connections
Select the connection. The connections listed correspond to the selected storage location.
Checkpoint Directory
The path where the Spark application stores the checkpointing data.
For HDFS and EFS, enter a relative path like /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.
For S3, enter an absolute path like: S3://BucketName/checkpointingDir
Time-Based Check Point
Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is suffixed with the current time in milliseconds.
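To make the behavior concrete, here is a minimal Spark structured streaming sketch that suffixes a checkpoint directory with the current time in milliseconds, which is what a time-based checkpoint amounts to. The path, rate source, and console sink are placeholders chosen only to keep the example runnable; the emitter applies its own prefixing as described above.

```scala
import org.apache.spark.sql.SparkSession

object TimeBasedCheckpointSketch {
  // With the time-based option enabled, the configured location is suffixed
  // with the current time in milliseconds, so each run checkpoints afresh.
  def resolveCheckpointDir(base: String, timeBased: Boolean): String =
    if (timeBased) s"$base/${System.currentTimeMillis()}" else base

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-sketch")
      .master("local[*]")
      .getOrCreate()

    // Toy source and sink so the sketch runs end to end.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

    val query = stream.writeStream
      .format("console")
      .option("checkpointLocation",
        resolveCheckpointDir("/tmp/user/hadoop/checkpointingDir", timeBased = true))
      .start()

    query.awaitTermination(10000) // run briefly for the demo, then exit
    spark.stop()
  }
}
```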
Output Fields
Select the fields whose values you want to persist in the table.
Output Mode
Output mode to be used while writing the data to the streaming sink; a sketch follows this list.
- Append Mode: only the new rows in the streaming data are written to the sink.
- Complete Mode: all the rows in the streaming data are written to the sink every time there are updates.
- Update Mode: only the rows that were updated in the streaming data are written to the sink every time there are updates.
- Save Mode: specifies how to handle the existing data.
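The sketch below shows where the output mode fits in a Spark structured streaming write. The rate source, console sink, and grouping column are stand-ins; update mode is used here because the query aggregates, so only changed counts are emitted each trigger.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object OutputModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("output-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    val rate = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    // A running aggregation: "update" emits only the rows whose counts changed
    // in this trigger, "complete" would re-emit every row, and "append" is for
    // queries that only ever add new rows.
    val counts = rate.groupBy((rate("value") % 10).as("bucket")).agg(count("*").as("n"))

    val query = counts.writeStream
      .format("console")
      .outputMode("update")
      .start()

    query.awaitTermination(10000)
    spark.stop()
  }
}
```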
Enable Trigger
Trigger defines how frequently a streaming query should be executed.
Processing Time
This field appears only when the Enable Trigger checkbox is selected.
Processing Time is the trigger time interval in minutes or seconds.
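For illustration, here is how a processing-time trigger is expressed in Spark structured streaming; the 30-second interval and the rate/console source and sink are arbitrary placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ProcessingTimeTriggerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("trigger-sketch")
      .master("local[*]")
      .getOrCreate()

    val rate = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

    // Fire the query every 30 seconds rather than as fast as data arrives.
    val query = rate.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination(65000) // let a couple of intervals pass, then exit
    spark.stop()
  }
}
```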
Post Action
Post actions are available with batch data sources.
To understand how to provide SQL queries or Stored Procedures that will be executed during pipeline run, see Post-Actions →
Notes
Optionally, enter notes in the Notes → tab and save the configuration.