Token Splitter Processor
The Token Splitter processor takes a text document as input and splits the text into smaller chunks that fit within a maximum token limit while preserving semantic coherence.
The processor uses the AutoTokenizer class from the Hugging Face Transformers library. It can load any compatible tokenizer and create a tokenizer object, which is then used to generate the token-based chunks.
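The documentation does not show the processor's internals, but the token-based chunking it describes can be sketched as follows. The helper below operates on a list of token ids; in practice the ids would come from a tokenizer loaded with `AutoTokenizer.from_pretrained`. The function name and parameters are illustrative, not part of the product.

```python
def chunk_by_tokens(token_ids, chunk_size, chunk_overlap):
    """Split a list of token ids into windows of at most chunk_size tokens,
    with chunk_overlap tokens repeated between consecutive windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # last window already reached the end of the text
    return chunks

# With a real Hugging Face tokenizer the calls would look like:
#   ids = tokenizer.encode(text, add_special_tokens=False)
#   texts = [tokenizer.decode(c) for c in chunk_by_tokens(ids, 256, 32)]
```

Counting in tokens rather than characters matters because downstream models enforce their limits on tokens, and token counts vary with the tokenizer's vocabulary.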
Below are the configuration details of the processor.
Drop Non-Chunked Columns
All columns except the chunked ones will be dropped from the output.
Input Column
Select the column containing the text you want to split.
Output Column
Specify the name of the column where the split text chunks will be stored.
Upload Tokenizer File
Upload a .zip file containing the tokenizer files. Example: all-mpnet-base-v2.zip (containing the JSON and text files to be used for tokenization).
Chunk
Size
Set the maximum number of tokens for each chunk.
Overlap
Define the number of tokens to repeat between consecutive chunks.
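To illustrate how Size and Overlap interact, the sketch below uses whitespace-separated words as stand-in tokens (the real processor counts tokens produced by the uploaded tokenizer). With Size 4 and Overlap 1, each new chunk starts 3 tokens after the previous one, so one token is shared between neighbors.

```python
words = "the quick brown fox jumps over the lazy dog".split()  # 9 stand-in tokens

size, overlap = 4, 1   # Chunk Size = 4, Chunk Overlap = 1
step = size - overlap  # each new chunk starts 3 tokens after the last

chunks = []
start = 0
while start < len(words):
    chunks.append(" ".join(words[start:start + size]))
    if start + size >= len(words):
        break  # final chunk reached the end of the text
    start += step
```

This yields three chunks, with "fox" repeated between the first two and "the" repeated between the last two; the overlap keeps context that would otherwise be cut at a chunk boundary.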
ADD CONFIGURATION
An option to add further configurations as key-value pairs is available.