Text Chunker for RAG
Split large text into overlapping chunks by character count, word count, or sentence boundary — output as JSON for embeddings and RAG pipelines.
How RAG text chunking works
Retrieval-augmented generation (RAG) pipelines embed documents in pieces, not all at once — a whole PDF rarely fits in an embedding model's context window, and even when it does, a single vector for an entire document is too coarse to retrieve precisely. This tool splits your text into smaller, overlapping chunks and outputs a ready-to-use JSON array, so each chunk can be embedded and stored individually in a vector database (Pinecone, Weaviate, pgvector, Chroma, and similar stores all expect this kind of pre-split input).
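As a rough illustration of what a ready-to-use JSON array of chunks might look like, the sketch below serializes a list of already-split chunks. The tool's exact output schema is not described here, so the function name `chunks_to_json` and the field names `index`, `text`, and `char_count` are hypothetical.

```python
import json


def chunks_to_json(chunks: list[str]) -> str:
    """Serialize chunks as a JSON array with one record per chunk.

    The record fields are illustrative assumptions, not the tool's
    documented schema.
    """
    records = [
        {"index": i, "text": chunk, "char_count": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
    # ensure_ascii=False keeps non-ASCII text readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(chunks_to_json(["First chunk of text.", "Second chunk of text."]))
```

Each record can then be embedded and written to a vector store on its own, with `index` preserving the original order.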
Three splitting strategies are supported. "By character count" slides a fixed-width window across the raw text: it is the simplest and most predictable option, but it can cut a sentence in half. "By word count" does the same on whitespace-separated words, which keeps chunk sizes closer to token counts for most tokenizers. "By sentence boundary" packs whole sentences into each chunk up to a target character size, so chunks never split mid-sentence; this is usually the best choice for prose, articles, and documentation.
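The three strategies can be sketched in a few lines each. This is a minimal illustration, not the tool's actual implementation: the function names are hypothetical, overlap is left out for clarity, and the sentence splitter is a naive assumption (a sentence ends at `.`, `!`, or `?` followed by whitespace).

```python
import re


def chunk_by_chars(text: str, size: int) -> list[str]:
    """Fixed-width character windows; may cut a sentence in half."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_words(text: str, size: int) -> list[str]:
    """Fixed-size groups of whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_by_sentences(text: str, target_chars: int) -> list[str]:
    """Pack whole sentences into chunks of up to target_chars characters."""
    # Naive sentence splitter (assumption): split on whitespace that
    # follows ., ! or ?. Abbreviations like "e.g." will be split too.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > target_chars:
            # Adding this sentence would exceed the target: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One two. Three four. Five."
    print(chunk_by_chars(sample, 10))
    print(chunk_by_words(sample, 2))
    print(chunk_by_sentences(sample, 20))
```

In this sketch a single sentence longer than `target_chars` is kept whole rather than cut, so a chunk can exceed the target in that case.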
The overlap setting repeats a small amount of trailing content from one chunk at the start of the next. This matters because a fact or reference that spans a chunk boundary would otherwise be truncated in both chunks; with overlap, any passage no longer than the overlap appears in full in at least one of them. A common starting point is a chunk size of 300–800 characters (or 100–300 words) with an overlap of 10–20% of the chunk size. Tune both values based on your embedding model's context window and how granular you need retrieved passages to be. Everything runs locally in your browser, so no text is uploaded anywhere.
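To make the overlap arithmetic concrete, here is a minimal character-window sketch, assuming the window advances by `size - overlap` characters each step. The function name `chunk_with_overlap` and its parameters are illustrative, not taken from the tool.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Slide a window of `size` characters, repeating `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # size=4, overlap=1: each chunk starts with the previous chunk's last character.
    print(chunk_with_overlap("abcdefghij", 4, 1))
```

With a chunk size of 500 characters and a 15% overlap (75 characters), the window advances 425 characters per step, so a 2,000-character text yields 5 chunks starting at offsets 0, 425, 850, 1275, and 1700.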
Private & free — this tool runs entirely in your browser.