
Getting Started: Text Splitters

Language Models are often limited by the amount of text that you can pass to them. Therefore, it is necessary to split the text into smaller chunks. LangChain provides several utilities for doing so.

Using a Text Splitter can also help improve the results from vector store searches, as smaller chunks may sometimes be more likely to match a query. Testing different chunk sizes (and chunk overlaps) is a worthwhile exercise to tailor the results to your use case.

interface TextSplitter {
  chunkSize: number;

  chunkOverlap: number;

  createDocuments(
    texts: string[],
    metadatas?: Record<string, any>[]
  ): Promise<Document[]>;

  splitDocuments(documents: Document[]): Promise<Document[]>;
}

Text Splitters expose two methods, createDocuments and splitDocuments. Both return a list of documents, but they differ in their input: createDocuments takes a list of raw text strings (with optional per-text metadata) and splits each string into chunks, while splitDocuments takes a list of existing documents and splits each document into chunks.
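For example, here is a minimal sketch of both methods using RecursiveCharacterTextSplitter; the chunk size and overlap values are arbitrary, and the import paths may differ between LangChain versions:

import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Configure chunk size and overlap to suit your use case.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});

// createDocuments: raw strings in (with optional per-text metadata), Documents out.
const docsFromStrings = await splitter.createDocuments(
  ["First long piece of text...", "Second long piece of text..."],
  [{ source: "a.txt" }, { source: "b.txt" }]
);

// splitDocuments: existing Documents in, smaller chunked Documents out.
const docsFromDocuments = await splitter.splitDocuments([
  new Document({ pageContent: "A long document...", metadata: { source: "c.txt" } }),
]);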


Advanced

If you want to implement your own custom Text Splitter, you only need to subclass TextSplitter and implement a single method, splitText. The method takes a string and returns a list of strings, which are used as the chunks.

abstract class TextSplitter {
  abstract splitText(text: string): Promise<string[]>;
}
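
For instance, a minimal sketch of a custom splitter that chunks text at blank lines; ParagraphTextSplitter is a hypothetical name, and this assumes the TextSplitter base class is exported from langchain/text_splitter in your version:

import { TextSplitter } from "langchain/text_splitter";

// Splits text into chunks at blank lines, i.e. one chunk per paragraph.
class ParagraphTextSplitter extends TextSplitter {
  async splitText(text: string): Promise<string[]> {
    return text
      .split(/\n\s*\n/)
      .map((paragraph) => paragraph.trim())
      .filter((paragraph) => paragraph.length > 0);
  }
}

// The inherited createDocuments and splitDocuments methods now use this splitText.
const splitter = new ParagraphTextSplitter();
const docs = await splitter.createDocuments(["First paragraph.\n\nSecond paragraph."]);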