Kudu Emitter
Apache Kudu is a column-oriented data store in the Apache Hadoop ecosystem. It enables fast analytics on fast (rapidly changing) data. The emitter is engineered to take advantage of hardware and in-memory processing, lowering query latency significantly compared with similar tools.
Kudu Emitter Configuration
To add a Kudu emitter to your pipeline, drag the emitter onto the canvas and connect it to a Data Source or processor.
The configuration settings are as follows:
| Field | Description | 
|---|---|
| Connection Name | Connection URL used to create the Kudu connection. |
| Table Administration | If checked, the table will be created. | 
| Primary Keys | Fields to use as the primary key of the table. |
| Partition List | Fields on which the table will be partitioned. |
| Buckets | Number of buckets used for partitioning. |
| Replication | Replication factor used to make additional copies of data. The value must be 1, 3, 5, or 7. |
| Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. | 
| Checkpoint Connections | Select the connection. Connections are listed corresponding to the selected storage location. | 
| Checkpoint Directory | The path where the Spark application stores checkpointing data. For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself. For S3, enter an absolute path such as S3://BucketName/checkpointingDir. |
| Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on every run the checkpoint location provided above is appended with the current time in milliseconds. |
| Output Fields | Fields whose values you want to persist in the table. |
| Output Mode | Output mode to be used while writing the data to the streaming sink. Append: only the new rows in the streaming data are written to the sink. Complete: all the rows in the streaming data are written to the sink every time there are updates. Update: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
| Save Mode | Save mode specifies how to handle the existing data. | 
| Enable Trigger | Trigger defines how frequently a streaming query should be executed. | 
| Processing Time | Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval, in minutes or seconds. |
| ADD CONFIGURATION | Enables adding additional configuration properties for the Kudu emitter. |
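Two of the rules above, the allowed replication factors and the time-based checkpoint suffix, can be sketched in plain Python. This is an illustrative sketch of the documented behavior; the helper names are hypothetical and not part of Gathr's API.

```python
import time
from typing import Optional

# Per the Replication field above, only these factors are accepted.
VALID_REPLICATION_FACTORS = {1, 3, 5, 7}

def validate_replication(factor: int) -> int:
    """Reject replication factors other than 1, 3, 5, or 7."""
    if factor not in VALID_REPLICATION_FACTORS:
        raise ValueError(
            f"Replication must be one of {sorted(VALID_REPLICATION_FACTORS)}, got {factor}"
        )
    return factor

def time_based_checkpoint_dir(base_dir: str, now_millis: Optional[int] = None) -> str:
    """Append the current time in milliseconds to the checkpoint location,
    mirroring the Time-Based Check Point option."""
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    return f"{base_dir.rstrip('/')}/{now_millis}"
```

For example, with Time-Based Check Point enabled, a checkpoint directory of `/user/hadoop/checkpointingDir` would become something like `/user/hadoop/checkpointingDir/1700000000000` on a given run.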
Click the Next button and enter any notes in the space provided.
Click DONE to save the configuration.
If you have any feedback on Gathr documentation, please email us!