Data Cleansing Processor
The Data Cleansing Processor cleanses a dataset using metadata. To add a Data Cleansing Processor to your pipeline, drag the processor onto the canvas and right-click it to configure the following fields:
| Field | Description | 
|---|---|
| Columns included while Extract Schema | Column names mentioned here will be used in the data cleansing process. | 
| Connection Type | Select the connection type from which the metadata files will be read. The available connection types are RDS and S3. | 
| Connection Name | Select the connection name used to fetch the metadata file. | 
| S3 Protocol | Select the S3 protocol from the drop-down list. When the S3 connection type is selected, the supported protocols vary by version: <br> - HDP versions: S3a <br> - CDH versions: S3a <br> - Apache versions: S3n <br> - GCP: S3n and S3a <br> - EMR: S3, S3n, and S3a <br> - AWS Databricks: S3a | 
| Bucket Name | Provide the bucket name if the S3 connection type is selected. | 
| Path | Provide the path or sub-directories of the bucket mentioned above to which the data is written, if the S3 connection type is selected. | 
| Schema Name | Select the schema name from the drop-down list if the RDS connection type is selected. | 
| Table Name | Select the table name from the drop-down list if the RDS connection type is selected. Note: The metadata should be in tabular form. | 
| Feed ID | Provide the feed ID used to filter the metadata. | 
| Remove Duplicate | Check this option to remove duplicate records. | 
| Include Extra Input Columns | Check this option to include extra input columns. Further configurations can be added by clicking the ADD CONFIGURATION button. | 
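The cleansing behavior configured above can be illustrated with a minimal sketch. This is not Gathr's implementation or API; the function name, the metadata layout (rows of `feed_id`/`column` pairs), and all field names are illustrative assumptions that mirror the Feed ID, Remove Duplicate, and Include Extra Input Columns options:

```python
# Hypothetical sketch of the processor's cleansing steps; names and the
# metadata shape are illustrative, not part of Gathr's actual API.

def cleanse(records, metadata, feed_id,
            remove_duplicates=True, include_extra_columns=False):
    """Filter metadata by feed ID, keep only the listed columns,
    optionally drop duplicates and keep extra input columns."""
    # Metadata rows are assumed to look like {"feed_id": ..., "column": ...}.
    allowed = {m["column"] for m in metadata if m["feed_id"] == feed_id}

    cleansed, seen = [], set()
    for rec in records:
        # Keep metadata-listed columns; extras only if the option is set.
        row = {k: v for k, v in rec.items()
               if k in allowed or include_extra_columns}
        key = tuple(sorted(row.items()))
        if remove_duplicates and key in seen:
            continue  # skip duplicate record
        seen.add(key)
        cleansed.append(row)
    return cleansed

metadata = [{"feed_id": "f1", "column": "name"},
            {"feed_id": "f1", "column": "age"},
            {"feed_id": "f2", "column": "city"}]
records = [{"name": "a", "age": 1, "city": "x"},
           {"name": "a", "age": 1, "city": "x"}]
cleanse(records, metadata, "f1")  # -> [{"name": "a", "age": 1}]
```

With Remove Duplicate enabled, the second identical record is dropped; with Include Extra Input Columns enabled, the `city` column would also survive even though the metadata for feed `f1` does not list it.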