Case study

RAG-Anything (fork)

Production Service Layer on Top of HKUDS/RAG-Anything

A production service layer built on top of HKUDS/RAG-Anything, running as a fork. The upstream raganything/ package (the LightRAG core, MinerU parsing, the entity/relation extraction prompts) is not my work. The pieces that are: a multi-tenant FastAPI service (rag-service/), an MCP retrieval server (mcp-server/), an AWS Bedrock provider for LLM/vision/embedding, ARQ-driven async ingestion, per-token cost tracking, and three optimization commits inside the upstream package that live only in the fork (PRs against upstream are planned but not yet submitted).

The case study is structured to make that distinction obvious: every claim below is the upstream’s work, my work on top, or a clearly labeled modification inside the upstream package.

Problem

A small team building agentic systems needs a retrieval layer with three properties: multi-tenant by default (KBs scoped per team or per agent, no cross-talk), agent-callable (MCP-style retrieval tools, not “import a Python library”), and provider-flexible (LLMs and embeddings selectable per environment — Bedrock for production, OpenRouter for experimentation).

HKUDS/RAG-Anything solves the hardest part of this — multimodal document parsing, hybrid vector + knowledge-graph retrieval, the LightRAG core. What it doesn’t ship is the production wrapper: tenancy, async ingestion, cost tracking, MCP surface, Bedrock provider, observability. Those are the gaps a team has to close before the library becomes a service that other agents can talk to.

This project closes those gaps in a fork rather than in a separate repo, partly so the optimization work stays close to the upstream code, partly so upstream changes can be merged in cleanly when they land. The trade-off is real: the fork lags the upstream’s update pace, and a few of my own optimization commits sit waiting on PRs that I haven’t filed yet.

System

The fork adds two top-level directories that don’t exist upstream:

  • rag-service/ — a FastAPI service (~31 modules, ~6,800 LOC) that wraps RAG-Anything for production use. Lifespan-managed, provider-selectable (Bedrock or OpenRouter for LLM and vision; Bedrock or Voyage for embeddings), with a unified /api/v1/query endpoint accepting modes naive | local | global | hybrid | mix | bypass, per-tenant CRUD on /api/v1/knowledge-bases, and document upload with status tracking and cancellation. A request sketch follows this list.
  • mcp-server/ — a FastMCP-based server (~535 LOC, stdio + HTTP transports) exposing seven retrieval tools to agentic clients. Tenant ID and service token injected via headers; the server doesn’t carry credentials of its own.
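
To make the service surface concrete, here is roughly what a query call looks like. This is a minimal sketch: the endpoint path, headers, and mode names come from this writeup, but the JSON field names (kb_id, query, mode) are assumptions, not the service’s published schema.

```python
# Hypothetical client call; JSON field names are illustrative assumptions.
import httpx

headers = {
    "X-Service-Token": "change-me",  # verified by ServiceTokenMiddleware
    "X-Tenant-ID": "team-a",         # scopes every downstream lookup
}

resp = httpx.post(
    "http://localhost:8000/api/v1/query",  # assumed local deployment
    headers=headers,
    json={
        "kb_id": "docs",                     # assumed field name
        "query": "How does ingestion work?",
        "mode": "hybrid",  # naive | local | global | hybrid | mix | bypass
    },
    timeout=60.0,
)
resp.raise_for_status()
print(resp.json())
```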

Inside the upstream raganything/ package, five files are modified (~+312 / −52 lines) for content-hash deduplication, repeated-content noise filtering, text-file PDF-roundtrip skip, and chunking adjustments.

The whole thing runs as a six-service Docker stack. GPU compose (Dockerfile.gpu + docker-compose.gpu.yml) is the preferred deployment because MinerU parsing is GPU-heavy; CPU compose is the fallback for environments without an NVIDIA runtime.

Architecture

                        AGENTIC CLIENTS
                  (Eleanor-AI and similar)

                            │  MCP (stdio / HTTP)
                            ▼
   ┌────────────────────────────────────────────────────┐
   │   mcp-server/  (~535 LOC, my work)                 │
   │                                                    │
   │   Tools:                                           │
   │   • list_knowledge_bases                           │
   │   • retrieve_hybrid     (vector + graph fusion)    │
   │   • retrieve_naive      (pure vector)              │
   │   • retrieve_local      (entity-focused graph)     │
   │   • retrieve_global     (graph overview)           │
   │   • retrieve_adaptive   (mix mode)                 │
   │   • get_raw_context                                │
   │                                                    │
   │   Headers: X-Tenant-ID, X-Service-Token            │
   └────────────────────────┬───────────────────────────┘
                            │  HTTP
                            ▼
   ┌────────────────────────────────────────────────────┐
   │   rag-service/  (~6,800 LOC, my work)              │
   │                                                    │
   │   FastAPI lifespan-managed service                 │
   │   Middleware:                                      │
   │   • ServiceTokenMiddleware (X-Service-Token →      │
   │       RAG_SERVICE_SECRET)                          │
   │   • Tenant injection (X-Tenant-ID)                 │
   │   • Correlation ID (UUID per request)              │
   │   • Structured JSON logging                        │
   │                                                    │
   │   Routers:                                         │
   │   • /api/v1/query  (naive/local/global/hybrid/mix) │
   │   • /api/v1/knowledge-bases  (CRUD)                │
   │   • /api/v1/documents  (upload/status/cancel)      │
   │                                                    │
   │   Provider abstraction:                            │
   │   • LLM_PROVIDER       = bedrock | openrouter      │
   │   • VISION_PROVIDER    = bedrock | openrouter      │
   │   • EMBEDDING_PROVIDER = bedrock | voyage          │
   │                                                    │
   │   Bedrock provider (added in fork):                │
   │   • bedrock_complete   → Converse API              │
   │   • bedrock_embed      → Titan / Cohere            │
   │   • Multimodal: image_url data URIs                │
   │                                                    │
   │   Cost tracker:                                    │
   │   • 40+ models, per-million-token pricing          │
   │   • Thread-safe accumulation per request           │
   └─────┬──────────────┬───────────────┬───────────────┘
         │              │               │
         │ async (ARQ)  │ ingest        │ retrieve
         ▼              ▼               ▼
   ┌──────────┐   ┌────────────────────────────┐
   │  Redis   │   │  raganything/  (UPSTREAM)  │
   │  + ARQ   │   │  + fork modifications:     │
   │  worker  │   │  • content-hash dedup      │
   │  (12h    │   │  • noise filter            │
   │  job     │   │  • text-file bypass        │
   │  timeout)│   │  • chunk-size tuning       │
   └──────────┘   └─────┬───────────┬──────────┘
                        │           │
                        ▼           ▼
                  ┌──────────┐ ┌──────────┐ ┌──────────┐
                  │  Qdrant  │ │  Neo4j   │ │  MinIO   │
                  │ (vectors)│ │ (graph)  │ │ (raw     │
                  │          │ │          │ │  docs)   │
                  └──────────┘ └──────────┘ └──────────┘
                  Per-KB collection / workspace / prefix

Key design decisions

Multi-tenancy in three layers

Tenancy isn’t a single feature; it’s the same scope applied at three different layers, each enforced independently:

  • Header injection. X-Tenant-ID is extracted by middleware on every KB/document request. The tenant flows through every downstream call as request state.
  • Storage path isolation. MinIO objects are laid out as {tenant_id}/{kb_id}/raw/{doc_id}/{filename}, plus per-tenant and per-KB _metadata.json and a documents.json registry. Even at the object-store level, two tenants can’t see each other’s documents because they’re in different paths.
  • Workspace naming. Vector collections (Qdrant) and graph workspaces (Neo4j) are named {tenant_id}_{kb_id} (normalized). Each KB gets its own logical collection / workspace, so cross-KB query bleed is blocked at the storage layer, not by application logic.

The reason for three layers: any single layer can fail or be misconfigured without the other two letting data leak. If middleware ever drops the header, the storage-path layout still scopes the bucket lookup. If the storage path is misconfigured, the workspace name still scopes the vector/graph query. Defense in depth at the boundary level, not just the API level.
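
A minimal sketch of how the three scopes compose, assuming illustrative helper names; the real rag-service/ modules differ in detail, and the exact workspace normalization rule is an assumption:

```python
# Sketch only: middleware + path + workspace scoping under assumed names.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def tenant_injection(request: Request, call_next):
    # Layer 1: header injection; every KB/document request must carry it.
    tenant_id = request.headers.get("X-Tenant-ID")
    if tenant_id is None:
        return JSONResponse({"detail": "X-Tenant-ID required"}, status_code=400)
    request.state.tenant_id = tenant_id
    return await call_next(request)

def object_key(tenant_id: str, kb_id: str, doc_id: str, filename: str) -> str:
    # Layer 2: MinIO path isolation (layout quoted from this writeup).
    return f"{tenant_id}/{kb_id}/raw/{doc_id}/{filename}"

def workspace(tenant_id: str, kb_id: str) -> str:
    # Layer 3: Qdrant collection / Neo4j workspace name.
    # The normalization here is assumed, not the fork's actual rule.
    return f"{tenant_id}_{kb_id}".lower().replace("-", "_")
```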

MCP server as the agentic surface

Agents don’t import the rag-service Python client; they call MCP tools. Seven tools cover the retrieval surface:

  Tool                    Purpose
  list_knowledge_bases    Discovery
  retrieve_hybrid         Vector + graph fusion (recommended default)
  retrieve_naive          Pure vector
  retrieve_local          Entity-focused graph traversal
  retrieve_global         Graph overview
  retrieve_adaptive       Mix mode
  get_raw_context         Document fetch by ID

This is the same MCP-first pattern used in Eleanor-AI and the Multi-Agent Platform: tools are MCP, not function-calling. Adding a new retrieval mode means adding an MCP tool, not modifying agent code. Agentic clients don’t need to know the rag-service exists — they see retrieval tools, the same shape as any other MCP server.

The MCP server is small (~535 LOC, two files) because all the actual logic lives in the rag-service it proxies to. It’s a credential-injection + tenant-scoping layer, not a parallel implementation.
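
A sketch of that proxy shape, assuming FastMCP’s tool decorator and its get_http_headers() helper for reading client-supplied headers over the HTTP transport; the tool body and names are illustrative, not the actual mcp-server/ code:

```python
# Sketch only: one of the seven tools, as a pass-through proxy.
import os

import httpx
from fastmcp import FastMCP
from fastmcp.server.dependencies import get_http_headers  # assumed helper

mcp = FastMCP("rag-retrieval")
RAG_SERVICE_URL = os.environ.get("RAG_SERVICE_URL", "http://rag-service:8000")

@mcp.tool()
async def retrieve_hybrid(kb_id: str, query: str) -> str:
    """Vector + graph fusion retrieval against one knowledge base."""
    incoming = get_http_headers()  # client-supplied, lower-cased keys
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{RAG_SERVICE_URL}/api/v1/query",
            headers={
                # Forwarded, not stored: the server carries no credentials.
                "X-Tenant-ID": incoming.get("x-tenant-id", ""),
                "X-Service-Token": incoming.get("x-service-token", ""),
            },
            json={"kb_id": kb_id, "query": query, "mode": "hybrid"},
        )
        resp.raise_for_status()
        return resp.text

if __name__ == "__main__":
    mcp.run()  # stdio by default; HTTP transport is also supported
```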

AWS Bedrock provider

The upstream RAG-Anything supports OpenAI-compatible providers. Bedrock isn’t OpenAI-compatible (different request shape, different auth, different multimodal encoding). Adding it required a real provider implementation rather than a config tweak.

bedrock_complete wraps the Converse API with full multimodal support — image_url data URIs in OpenAI-shape messages, JPEG/PNG/GIF/WebP, system prompts, tool use. bedrock_embed supports both Titan and Cohere embedding models via invoke_model. Provider selection is per-domain (LLM_PROVIDER, VISION_PROVIDER, EMBEDDING_PROVIDER), with explicit Bedrock model IDs (BEDROCK_LLM_MODEL_ID, BEDROCK_VISION_MODEL_ID, BEDROCK_EMBEDDING_MODEL_ID) so a deployment can mix Bedrock for LLM + Voyage for embeddings without re-coding.
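
A sketch of the Converse wrapping with boto3, under assumptions: the model ID, inference parameters, and function shape here are illustrative, not the fork’s actual provider module:

```python
# Sketch only: text path of a Converse-API completion function.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def bedrock_complete(
    prompt: str,
    system: str | None = None,
    model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed
) -> str:
    kwargs = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 2048, "temperature": 0.0},
    }
    if system:
        kwargs["system"] = [{"text": system}]
    resp = client.converse(**kwargs)
    # resp["usage"] carries inputTokens/outputTokens for the cost tracker.
    return resp["output"]["message"]["content"][0]["text"]
```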

Why Bedrock matters: it puts LLM access on AWS IAM, so no third-party API key sits in container env vars. Useful for any deployment where managing per-service LLM API keys is operational overhead or risk — not specific to security workloads, just convenient anywhere RAG is being used at production scale.

Per-token cost tracking across 40+ models

Every Bedrock and OpenRouter call extracts the usage dict from the response and records it through a thread-safe cost tracker. pricing_config.json carries per-million-token input/output pricing for 40+ models across bedrock, openrouter, and voyage. Each record_operation(operation_type, provider, model_id, in_tokens, out_tokens) call produces a UsageMetrics(input_cost, output_cost, total_cost, duration) row.
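
The tracker’s shape, sketched around the signature quoted above; the class layout, pricing-file schema, and duration argument are assumptions:

```python
# Sketch only: thread-safe per-call cost accounting.
import json
import threading
from dataclasses import dataclass

@dataclass
class UsageMetrics:
    input_cost: float
    output_cost: float
    total_cost: float
    duration: float

class CostTracker:
    def __init__(self, pricing_path: str = "pricing_config.json"):
        with open(pricing_path) as f:
            # Assumed schema: provider -> model_id -> {"input": $, "output": $}
            # per million tokens.
            self._pricing = json.load(f)
        self._lock = threading.Lock()
        self.records: list[UsageMetrics] = []

    def record_operation(self, operation_type, provider, model_id,
                         in_tokens, out_tokens, duration=0.0) -> UsageMetrics:
        # operation_type/provider/model_id would be stored alongside the
        # row in the real tracker; elided here for brevity.
        price = self._pricing[provider][model_id]
        row = UsageMetrics(
            input_cost=in_tokens * price["input"] / 1_000_000,
            output_cost=out_tokens * price["output"] / 1_000_000,
            total_cost=0.0,
            duration=duration,
        )
        row.total_cost = row.input_cost + row.output_cost
        with self._lock:  # thread-safe accumulation per request
            self.records.append(row)
        return row
```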

The cost story matters because RAG indexing burns LLM budget invisibly: entity extraction, chunking, embedding generation, summarization. Without per-call accounting, “this RAG service is expensive” is an opinion. With it, “this RAG service costs $X/document for KB Y at provider Z” is a fact you can reason about. The cost tracker makes the noise-filter and chunk-tuning optimizations (below) measurable rather than vibe-based.

Async ingestion via ARQ

Document upload returns immediately. ARQ (a Redis-backed job queue) picks up process_document_task and calls an internal /process-internal endpoint that runs the upstream RAGAnything ingestion pipeline and updates the MinIO registry. Job timeout is 12 hours — large multimodal PDFs can take a while.

Status tracking moves through UPLOADED → PROCESSING → PROCESSED with sub-phases QUEUED, DOWNLOADING, PARSING, EXTRACTING, EMBEDDING, STORING, COMPLETE and percentage progress. The upload endpoint never blocks; the client polls for status.
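
The queue wiring, sketched under assumptions: the task body and host names are illustrative, while the task name, the internal endpoint, and the 12-hour timeout come from this writeup:

```python
# Sketch only: ARQ worker that drives the ingestion pipeline.
import httpx
from arq.connections import RedisSettings

async def process_document_task(ctx, tenant_id: str, kb_id: str, doc_id: str):
    # Calls back into the service's internal processing endpoint, which
    # runs the upstream RAGAnything pipeline and updates the MinIO registry.
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(
            "http://rag-service:8000/process-internal",  # assumed host/port
            json={"tenant_id": tenant_id, "kb_id": kb_id, "doc_id": doc_id},
        )
        resp.raise_for_status()

class WorkerSettings:
    functions = [process_document_task]
    redis_settings = RedisSettings(host="redis")  # assumed compose hostname
    job_timeout = 12 * 60 * 60  # 12 hours; large multimodal PDFs take a while
```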

Optimizations made in the fork

These are commits inside the upstream raganything/ package, authored by me and not yet submitted as PRs against upstream; for now they live only in the fork.

  • Content-hash deduplication. doc_id = MD5(content), so re-uploading the same file under a different name is a no-op. The response carries was_duplicate=true so the caller knows. Also threaded original_filename through ingestion so the knowledge graph shows real document names instead of temp paths. (Sketched after this list.)
  • Repeated-content noise filter. Strips repeated headers, footers, and CTA blocks before LLM extraction. On long PDFs this saves thousands of redundant embedding and entity-extraction calls — measurable in the cost tracker. Includes a regression test (tests/test_repeated_content_filter.py).
  • Text-file PDF-roundtrip skip. The auto parse method now detects plain-text inputs and bypasses MinerU entirely. The prior behavior rendered the text to PDF and then parsed it back, which was both slow and lossy.
  • Chunk-size and concurrency tuning. CHUNK_TOKEN_SIZE=2048 (vs upstream ~1024) cuts entity-extraction LLM calls roughly in half. Concurrency knobs LLM_MODEL_MAX_ASYNC=4, EMBEDDING_FUNC_MAX_ASYNC=16, and MAX_PARALLEL_INSERT=8 push more throughput through each worker.
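
The dedup idea from the first bullet, in miniature. Names here are illustrative; the fork’s actual change lives inside raganything/ and also threads original_filename through the pipeline:

```python
# Sketch only: content-addressed doc IDs make re-uploads no-ops.
import hashlib

_seen: set[str] = set()  # stands in for the real document registry

def compute_doc_id(content: bytes) -> str:
    # Same bytes -> same doc_id, regardless of filename.
    return hashlib.md5(content).hexdigest()

def ingest(content: bytes, original_filename: str) -> dict:
    doc_id = compute_doc_id(content)
    if doc_id in _seen:
        return {"doc_id": doc_id, "was_duplicate": True}
    _seen.add(doc_id)
    # ... parse, extract, embed; original_filename flows through so the
    # knowledge graph shows real document names instead of temp paths.
    return {"doc_id": doc_id, "was_duplicate": False}
```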

These currently sit in the fork because that’s what I run against day-to-day. Shipping them through upstream’s review cadence is slower than iterating in the fork, so they stay local until I’m ready to PR.

Impact

  Layer                              Scale
  rag-service/                       31 Python files, ~6,800 LOC (entirely my work)
  mcp-server/                        2 Python files, ~535 LOC (entirely my work)
  Upstream package modifications     5 files, +312 / −52 lines (commits added in fork)
  Retrieval tools exposed via MCP    7
  Concurrent retrieval modes         6 (naive / local / global / hybrid / mix / bypass)
  Provider matrix                    3 LLM × 3 vision × 2 embedding
  Cost-tracked models                40+
  Tenancy enforcement layers         3 (header, storage path, workspace name)

The service is consumed by other agentic systems I’ve built, including Eleanor-AI and the Multi-Agent / MCP-Style Platform. It runs as a single Docker stack (rag-service, neo4j, qdrant, minio, redis, arq-worker) on a shared host. The GPU compose is the preferred deployment; the CPU compose works but parsing is meaningfully slower for image- and table-heavy PDFs.

Tradeoffs and what I’d do differently

Living in a fork is a real tax. Every upstream release has to be merged in, and those merges can conflict with the local optimization commits. The right long-term move is to land the optimizations upstream so that all this fork carries is the production wrapper (rag-service/, mcp-server/, the Bedrock provider), kept separate from the upstream’s core library. PRs are on the to-do list.

Tenancy is enforced, not isolated. Three layers of scoping (header, path, workspace) make cross-tenant data leakage hard, but Qdrant and Neo4j are still shared instances, so every tenant’s KBs are co-resident on the same nodes. For a small-team deployment this is fine; a true multi-tenant SaaS would need separate Qdrant collections per tenant tier and probably separate Neo4j databases for high-value tenants. Not on the next-up list because the deployment shape doesn’t justify it yet.

MinerU + GPU dependency is a real constraint. Document parsing for image- and table-heavy PDFs needs MinerU, which needs a GPU to be tolerable; CPU parsing works but is dramatically slower. The right answer for CPU-only environments is to swap the parser entirely: RAG-Anything supports Docling, a reasonable CPU-friendly alternative for many document types.

The cost tracker pays for itself. This is the single highest-leverage piece of infrastructure I added. RAG systems burn LLM budget in places nobody looks (chunking, entity extraction, summarization), and without per-token accounting the optimization work is guesswork. Once the cost tracker existed, the noise filter and chunk-tuning commits had measurable ROI: I could see exactly how many fewer entity-extraction calls each one avoided. That feedback loop turned this from “RAG is expensive” to “RAG costs $X per indexed document, optimization Y saves Z%.”


Architecture and aggregate scale only. Specific KB contents, prompts, queries, and tenant configurations are not part of this writeup.