Amazon Bedrock Knowledge Bases: What It Is and When to Use It
Definition
Amazon Bedrock Knowledge Bases is the managed retrieval-augmented generation (RAG) component of Amazon Bedrock. It ingests your documents from S3 and other sources, chunks them, embeds them with a foundation embedding model, stores the vectors in your chosen vector database, and exposes two APIs — Retrieve (search only) and RetrieveAndGenerate (search + generate a grounded answer). Instead of writing a chunker, embedder, vector index, and prompt template yourself, you configure a Knowledge Base once and call a single API for grounded answers. It is AWS's opinionated answer to "how do I let an LLM answer questions about my company documents?"
How It Works
A Knowledge Base has four configurable pieces (a configuration sketch follows this list):
- Data source — S3 prefix, Confluence space, SharePoint site, Salesforce, or a web crawler. You can attach multiple sources to one knowledge base. Syncs can be on-demand or incremental.
- Chunking strategy — how documents are split into passages:
- Default chunking (~300 tokens).
- Fixed-size chunking with your own token count and overlap.
- Hierarchical chunking — creates parent and child chunks so the retriever matches on small chunks but returns larger parent context.
- Semantic chunking — splits based on embedding similarity between sentences to keep topically related content together.
- No chunking — treat each file as one chunk (for short passages).
- Embedding model — Titan Text Embeddings V2, Cohere Embed English/Multilingual, or supported third-party models. The embedding dimension must match the vector store configuration.
- Vector store — OpenSearch Serverless (default quick-create), Amazon Aurora PostgreSQL with pgvector, Pinecone, Redis Enterprise Cloud, or MongoDB Atlas. Bedrock can create an OpenSearch Serverless collection for you, or you can bring an existing vector index.
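Configuration runs through the bedrock-agent control-plane API. A minimal sketch, assuming boto3 and placeholder identifiers (knowledge base ID, bucket ARN), of attaching an S3 data source with fixed-size chunking:

```python
import boto3

# Control-plane client for Knowledge Bases configuration (not the runtime client).
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Placeholder identifiers: substitute your own knowledge base ID and bucket ARN.
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KBEXAMPLE123",
    name="product-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-docs-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            # Other strategies: "HIERARCHICAL", "SEMANTIC", or "NONE".
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,         # chunk size in tokens
                "overlapPercentage": 20,  # overlap between adjacent chunks
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])
```

Once the data source exists, an ingestion job (start_ingestion_job) runs the chunk, embed, and index pipeline, and can be re-run to pick up changes.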
At query time, RetrieveAndGenerate embeds the user question with the same embedding model, performs a top-k similarity search, and feeds the retrieved passages plus the user question into a generation model (Claude, Nova, Llama, Mistral, etc.) using a managed prompt template. You can override the template, filter retrieved chunks by metadata, and attach Guardrails. The response includes citations to the source documents and S3 locations.
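At the API level, that whole flow reduces to a single call. A minimal sketch with boto3; the knowledge base ID and model ARN below are placeholders:

```python
import boto3

# Runtime client: handles Retrieve and RetrieveAndGenerate calls.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBEXAMPLE123",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])           # grounded answer
for citation in response["citations"]:      # source chunks and S3 locations
    for ref in citation["retrievedReferences"]:
        print(ref["location"])

# Pass response["sessionId"] back as sessionId= on the next call
# to keep conversational context across turns.
```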
Advanced features include query reformulation (the LLM rewrites the question before search), reranking (Cohere Rerank or Amazon Rerank re-orders results), contextual grounding (Guardrails feature that verifies generated answers match retrieved context), and metadata filtering (filter chunks by structured fields like department or date range).
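A hedged sketch of the search-only path with a metadata filter; the attribute names (department, year) are hypothetical and would come from your own metadata files:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Retrieve: similarity search only, no generation step.
response = runtime.retrieve(
    knowledgeBaseId="KBEXAMPLE123",  # placeholder
    retrievalQuery={"text": "data retention requirements"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            # Only consider chunks whose metadata attributes match the filter.
            "filter": {
                "andAll": [
                    {"equals": {"key": "department", "value": "legal"}},
                    {"greaterThanOrEquals": {"key": "year", "value": 2023}},
                ]
            },
        }
    },
)

for result in response["retrievalResults"]:
    print(result["score"], result["location"])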
Key Features and Limits
- Managed ingestion — incremental sync detects added/changed/deleted documents and updates the vector index accordingly.
- Multiple vector store options — OpenSearch Serverless for zero-setup; Aurora pgvector for SQL-native teams; Pinecone/Redis/MongoDB for existing vendor relationships.
- RetrieveAndGenerate sessions — maintain conversational context across turns with session IDs.
- Structured data knowledge bases — a newer mode that connects to Redshift or Athena and generates SQL to answer questions over a data warehouse (text-to-SQL with grounding on schema metadata).
- Citations in response — every generated answer includes references to the source chunks and S3 URIs.
- Metadata filtering — attach JSON metadata files alongside documents to filter retrieval by tags, dates, authors, or access levels (see the sidecar example after this list).
- Guardrails integration — attach a Bedrock Guardrail to block unsafe content and verify grounding.
- Chunking limits — maximum chunk size is bounded by the embedding model's input token limit, so chunks can run up to a few thousand tokens depending on the model.
- Ingestion source limits — S3 file size up to 50 MB, supported file types include PDF, HTML, DOC/DOCX, TXT, MD, CSV, XLSX, JSON.
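To illustrate the metadata filtering item above: each S3 document can carry a sidecar file named <object-key>.metadata.json whose attributes become filterable fields at retrieval time. A sketch of uploading one (bucket, key, and attribute names are hypothetical):

```python
import json
import boto3

s3 = boto3.client("s3")

bucket = "my-docs-bucket"           # placeholder bucket
doc_key = "policies/retention.pdf"  # the document itself

# Sidecar metadata file: same key with ".metadata.json" appended.
metadata = {
    "metadataAttributes": {
        "department": "legal",
        "year": 2024,
        "access_level": "internal",
    }
}

s3.put_object(
    Bucket=bucket,
    Key=f"{doc_key}.metadata.json",
    Body=json.dumps(metadata).encode("utf-8"),
)
```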
Common Use Cases
- Internal knowledge assistants — employees ask natural-language questions grounded in Confluence, SharePoint, or S3-stored PDFs.
- Customer support copilots — answers grounded in product docs, release notes, and historical tickets.
- Regulated-industry Q&A — legal, healthcare, or financial services where every answer needs a citation back to a source.
- Developer documentation bots — queries against API references, RFCs, and runbooks.
- Sales enablement — pitch-deck and case-study retrieval during live calls.
- Data-warehouse Q&A (Structured mode) — natural-language questions against Redshift or Athena for business users.
- Multi-tenant SaaS — per-tenant Knowledge Bases with metadata filtering for document-level isolation.
Pricing Model
Knowledge Bases itself has no flat fee — you pay for the underlying components:
- Embedding model invocations — per 1,000 tokens for every ingested chunk and every retrieved query.
- Vector store — OpenSearch Serverless is billed per OCU-hour (historically a minimum of 2 OCUs for indexing plus 2 for search, though smaller fractional-OCU configurations are now available for lighter workloads), Aurora by DB-instance-hour or ACUs, Pinecone/Redis/MongoDB at vendor rates.
- Generation model invocations — per 1,000 input + output tokens during RetrieveAndGenerate. Input tokens include the retrieved passages, which makes prompt size a real cost driver.
- Reranking — billed per query (priced per 1,000 queries) if you enable a rerank model.
- Guardrails — per 1,000 characters of text processed.
- Ingestion — no per-document fee beyond the embedding tokens.
Cost optimization tips: choose a cheap embedding model (Titan Text Embeddings V2 is a good default), pick a smaller generation model (Claude Haiku or Nova Lite) when quality allows, tune numberOfResults to avoid padding prompts with unnecessary chunks, and use hierarchical chunking when documents are long.
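A rough illustration of why numberOfResults drives generation cost; the chunk size and traffic figures below are invented for the arithmetic, not benchmarks:

```python
# Back-of-envelope: retrieved context added to every generation prompt.
avg_chunk_tokens = 300      # assumed average chunk size
number_of_results = 10      # top-k passages retrieved per query
queries_per_day = 5_000     # assumed traffic

retrieved_tokens_per_query = avg_chunk_tokens * number_of_results
daily_input_tokens = retrieved_tokens_per_query * queries_per_day

print(f"{retrieved_tokens_per_query} retrieved tokens per prompt")
print(f"{daily_input_tokens / 1e6:.1f}M extra input tokens per day")
# Dropping numberOfResults from 10 to 4 cuts this overhead by 60%.
```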
Pros and Cons
Pros
- Removes the entire RAG pipeline boilerplate — ingest, chunk, embed, index, retrieve, prompt.
- Multiple vector store options; zero-setup OpenSearch Serverless path.
- Citations out of the box — critical for regulated use cases.
- Integrates with Agents, Guardrails, Prompt Management.
- IAM and VPC integration for enterprise security.
Cons
- Opinionated — limited control over the retrieval prompt structure and ranking logic compared with a hand-built pipeline.
- OpenSearch Serverless minimum OCUs can be expensive for small workloads.
- Ingestion-time chunking is re-run on every reindex — costly for very large corpora if you keep changing chunking strategies.
- File type support is narrower than what a custom loader (LangChain / LlamaIndex) can handle.
- Structured data mode is newer and less mature than text retrieval.
Comparison with Alternatives
| | Bedrock Knowledge Bases | Custom pgvector | OpenSearch kNN | Amazon Kendra |
| --- | --- | --- | --- | --- |
| Setup | Fully managed | DIY | DIY | Fully managed |
| Customization | Medium | Full | Full | Limited |
| Generation | Built-in | Separate model call | Separate model call | Separate model call |
| Citations | Yes | DIY | DIY | Yes |
| Pricing | Embed + store + generate | Embed + RDS + generate | Embed + OS + generate | Per-hour, connector-based |
| Best for | Managed RAG on AWS | Max control | Hybrid keyword + vector | Enterprise search with connectors |
Rule of thumb: if you want the shortest path to a grounded chatbot on AWS, Knowledge Bases. If you need custom reranking, complex metadata filtering, or hybrid search, build on OpenSearch or pgvector. If your problem is enterprise search more than generative Q&A, Kendra.
Exam Relevance
- AI Practitioner (AIF-C01) — heavy coverage of Knowledge Bases as AWS's managed RAG solution, chunking strategies, when to use RAG vs fine-tuning, and citation-based grounding.
- Machine Learning Engineer Associate (MLA-C01) — Knowledge Bases configuration, vector store choice, chunking strategies, and integration with Agents for tool-using assistants.
- Solutions Architect Associate (SAA-C03) — designing GenAI architectures: API Gateway → Lambda → Bedrock Knowledge Base + Guardrails, S3 as ingestion source, VPC endpoints for private inference.
Classic exam trap: RAG vs fine-tuning. Use RAG (Knowledge Bases) when facts change frequently or need citation; use fine-tuning when you need the model to learn a style, format, or domain language that cannot be delivered through retrieved context.
Frequently Asked Questions
Q: Which chunking strategy should I use?
A: Start with default (~300 tokens) for mixed content. Use hierarchical chunking for long structured documents (books, runbooks) where you want small retrieval chunks but large generation context. Use semantic chunking when documents have poor section boundaries and you need the chunker to detect topic shifts. Use fixed-size with explicit overlap when you already know the optimal chunk size for your domain. Re-chunking forces re-embedding, which costs tokens, so benchmark on a representative subset before changing strategies on millions of documents.
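For the hierarchical case specifically, the ingestion configuration takes parent and child token budgets. A sketch of the relevant fragment, with illustrative (not recommended) sizes:

```python
# Fragment of a vectorIngestionConfiguration for hierarchical chunking:
# small child chunks are matched at retrieval time, but the larger
# parent chunk is returned as generation context.
hierarchical_config = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # parent chunks
                {"maxTokens": 300},   # child chunks
            ],
            "overlapTokens": 60,
        },
    }
}
```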
Q: Which vector store should I pick?
A: OpenSearch Serverless for the shortest path — Bedrock creates it for you and you're done. Aurora PostgreSQL + pgvector if your team already operates Postgres and wants SQL access to the data. Pinecone for very large-scale or when you already use it. Redis Enterprise for sub-10ms retrieval latency. MongoDB Atlas if you're already on Atlas. All options work with the same Knowledge Bases APIs; the trade-offs are operational cost and ecosystem fit.
Q: Should I use RAG with Knowledge Bases or fine-tune a model?
A: RAG when your content changes frequently (docs, tickets, policies), when you need citations for compliance, and when the content is too large to fit in a prompt. Fine-tuning when you need the model to reliably output a specific format (JSON schema, legal clause style), when you want to teach a domain-specific tone, or when RAG prompts are blowing past context limits because the content is too homogeneous to retrieve against. The two combine well — fine-tune style, RAG facts.
This article reflects AWS features and pricing as of 2026. AWS services evolve rapidly — always verify against the official Amazon Bedrock Knowledge Bases documentation before making production decisions.