TokenTextSplitter
Finally, TokenTextSplitter splits a raw text string by first converting the text into BPE tokens, then splitting these tokens into chunks and converting the tokens within each chunk back into text.
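The encode, chunk, decode flow described above can be sketched without any dependencies. This is only an illustration, not the library's implementation: `encode`, `decode`, and `splitByTokens` are hypothetical helpers, and the "tokenizer" here simply splits on whitespace as a stand-in for real BPE encoding.

```typescript
// Toy stand-in for a BPE tokenizer (assumption): one "token" per
// whitespace-separated word. A real BPE encoder produces integer IDs.
function encode(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Inverse of the toy encoder: join tokens back into text.
function decode(tokens: string[]): string {
  return tokens.join(" ");
}

// Split text into chunks of at most `chunkSize` tokens, where
// consecutive chunks share `chunkOverlap` tokens.
function splitByTokens(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const tokens = encode(text);
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < tokens.length; start += step) {
    // Decode each window of tokens back into a text chunk.
    chunks.push(decode(tokens.slice(start, start + chunkSize)));
    // Stop once the current window has reached the end of the token list.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

const chunks = splitByTokens("foo bar baz 123 qux", 2, 0);
console.log(chunks); // [ 'foo bar', 'baz 123', 'qux' ]
```

Because the limits are counted in tokens rather than characters, a chunk never exceeds `chunkSize` tokens, which is the property that matters when the text is later passed to a model with a token limit.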
To use the TokenTextSplitter, first install the required library:
- npm: npm install -S @dqbd/tiktoken
- Yarn: yarn add @dqbd/tiktoken
- pnpm: pnpm add @dqbd/tiktoken
Then, you can use it like so:
import { TokenTextSplitter } from "langchain/text_splitter";

const text = "foo bar baz 123";

// chunkSize and chunkOverlap are measured in tokens, not characters.
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 10,
  chunkOverlap: 0,
});

// Returns one Document per chunk.
const output = await splitter.createDocuments([text]);