Token Splitter Processor
The Token Splitter processor takes a text document as input and splits the text into smaller chunks that fit within a maximum token limit while preserving semantic coherence.
The processor uses the AutoTokenizer class from the Hugging Face Transformers library. It can load any compatible tokenizer and create a tokenizer object, which is then used to generate the token-based chunks.
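The documentation does not show the processor's internals, but the token-based chunking it describes can be sketched as follows. The helper below operates on a list of token ids; in practice the ids would come from a tokenizer loaded with `AutoTokenizer.from_pretrained`. The function name and parameters are illustrative, not part of the product.

```python
def chunk_by_tokens(token_ids, chunk_size, chunk_overlap):
    """Split a list of token ids into windows of at most chunk_size tokens,
    with chunk_overlap tokens repeated between consecutive windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # last window already reached the end of the text
    return chunks

# With a real Hugging Face tokenizer the calls would look like:
#   ids = tokenizer.encode(text, add_special_tokens=False)
#   texts = [tokenizer.decode(c) for c in chunk_by_tokens(ids, 256, 32)]
```

Counting in tokens rather than characters matters because downstream models enforce their limits on tokens, and token counts vary with the tokenizer's vocabulary.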
Below are the configuration details of the processor.
Drop Non-Chunked Columns
All columns except the chunked ones will be dropped from the output.
Input Column
Select the column containing the text you want to split.
Output Column
Specify the name of the column where the split text chunks will be stored.
Upload Tokenizer File
Upload a .zip file containing the tokenizer files. Example: all-mpnet-base-v2.zip (containing the JSON and text files to be used for tokenization).
Chunk
Size
Set the maximum number of tokens for each chunk.
Overlap
Define the number of tokens to repeat between consecutive chunks.
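To illustrate how Size and Overlap interact, the sketch below uses whitespace-separated words as stand-in tokens (the real processor counts tokens produced by the uploaded tokenizer). With Size 4 and Overlap 1, each new chunk starts 3 tokens after the previous one, so one token is shared between neighbors.

```python
words = "the quick brown fox jumps over the lazy dog".split()  # 9 stand-in tokens

size, overlap = 4, 1   # Chunk Size = 4, Chunk Overlap = 1
step = size - overlap  # each new chunk starts 3 tokens after the last

chunks = []
start = 0
while start < len(words):
    chunks.append(" ".join(words[start:start + size]))
    if start + size >= len(words):
        break  # final chunk reached the end of the text
    start += step
```

This yields three chunks, with "fox" repeated between the first two and "the" repeated between the last two; the overlap keeps context that would otherwise be cut at a chunk boundary.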
ADD CONFIGURATION
An option to add further configurations as key-value pairs is available.