Multi-Agent / MCP-Style Platform
Agent Orchestration Architecture
A multi-agent orchestration platform — not a library — that lets users build hierarchical agent systems with a real frontend, real auth, real persistence, and real tool integration. Two orchestration modes (delegation and synthesis), MCP-first tool architecture, page-refresh-survivable streaming, and encrypted per-user credentials.
This is the general-purpose sibling of Eleanor-AI. Same problem space, different optimization: Eleanor is purpose-built for security-context operation; this platform is domain-agnostic — build any kind of team for any kind of work, with persistent team workspaces and broad tool composition.
Problem
Off-the-shelf agent-orchestration libraries solve a real problem — how do you compose multiple LLM calls into a coordinated workflow? — but they’re libraries, not products. To get a usable product out of one, you have to build, from scratch:
- A REST API surface for non-Python clients.
- A frontend so non-developers can use it.
- Authentication, session management, rate limiting.
- Persistent storage for conversations and agent definitions.
- Streaming infrastructure that survives reconnects.
- An MCP integration layer.
- Credential management that doesn’t dump every user’s API keys into shared environment variables.
- Observability — token usage, costs, traces.
Each of those is a real piece of infrastructure. The platform is the “all of them, finished” version, behind a single deployable stack.
System
Agents are built on the OpenAI Agents SDK, wrapped with multi-LLM access via LiteLLM (Claude, GPT-4o, Bedrock, OpenRouter) so the SDK’s agent primitives — agent definitions, tool calls, hand-offs, streaming — work across providers from a single platform layer.
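At the SDK level, the general pattern looks roughly like the following sketch. It uses the SDK's LiteLLM extension directly; the agent name, instructions, and model string are illustrative, and the platform builds agents from stored definitions rather than inline code:

```python
from agents import Agent, Runner
from agents.extensions.models.litellm_model import LitellmModel

# Illustrative agent; the platform resolves model, prompt, and keys per user.
researcher = Agent(
    name="Researcher",
    instructions="You are a concise research assistant.",
    model=LitellmModel(model="anthropic/claude-3-5-sonnet-20241022"),
)

result = Runner.run_sync(researcher, "Summarize the tradeoffs of WebSocket vs SSE.")
print(result.final_output)
```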
A user signs in to the React dashboard or hits the REST API. They define agents (model, system prompt, tool list, budget), connect MCP servers from a marketplace (Brave Search, GitHub, custom internal MCPs) using their own API keys, and either chat directly or compose agents into a hierarchy.
Hierarchies operate in one of two modes:
- Orchestration mode — the manager calls each specialist as a tool, collects their outputs, and synthesizes a final answer.
- Delegation mode — the manager hands off control to a specialist via transfer_to(). The specialist takes over and owns the task through to completion.
These are different problems. Synthesis is right when you need parallel execution and aggregation (“ask all three specialists, give me the merged answer”). Delegation is right when one specialist owns the rest of the conversation (“this is a billing question, route it to billing”). The platform supports both because the right choice depends on what the agent system is for.
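In terms of the underlying SDK primitives, delegation maps to handoffs and orchestration maps to agent-as-tool. A hedged sketch of the two shapes (agent names and instructions are made up; the platform constructs these from stored agent definitions):

```python
from agents import Agent

billing = Agent(name="Billing", instructions="Resolve billing questions end to end.")
research = Agent(name="Research", instructions="Dig up background material.")

# Delegation: the manager can hand the conversation off to a specialist.
# The SDK exposes each handoff to the model as a transfer_to_<agent> tool.
router = Agent(
    name="ManagerDelegation",
    instructions="If this is a billing question, hand it to Billing.",
    handoffs=[billing],
)

# Orchestration / synthesis: specialists are wrapped as tools; the manager
# calls them, collects their outputs, and writes the merged answer itself.
synthesizer = Agent(
    name="ManagerOrchestration",
    instructions="Consult both specialists, then synthesize one answer.",
    tools=[
        billing.as_tool(tool_name="ask_billing", tool_description="Ask the billing specialist."),
        research.as_tool(tool_name="ask_research", tool_description="Ask the research specialist."),
    ],
)
```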
Conversations stream token-by-token over WebSocket and persist to PostgreSQL. A user can refresh the page mid-stream and the conversation picks up where it left off; switch tabs and another tab can listen to the same stream via PostgreSQL LISTEN/NOTIFY.
Documents uploaded through the platform don’t sit in PostgreSQL — they’re forwarded to the RAG-Anything fork for parsing, embedding, and knowledge-graph extraction. The matching rag-mcp server is wired into the platform’s MCP tool layer, so agents can query the user’s uploaded knowledge bases through the same MCP-call mechanism they use for any other tool. From the user’s perspective, “upload this PDF and ask my agent about it” is one workflow, not three.
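At the agent-engine level, a knowledge base is attached the same way as any other MCP server. A sketch of that shape using the SDK's MCP support (the URL, server name, and agent are illustrative assumptions; in the platform this wiring goes through the marketplace and tool layer, not inline code):

```python
from agents import Agent, Runner
from agents.mcp import MCPServerSse

async def ask_knowledge_base(question: str) -> str:
    # Hypothetical rag-mcp endpoint; the platform resolves this per user.
    async with MCPServerSse(params={"url": "http://rag-mcp:8000/sse"}) as rag:
        agent = Agent(
            name="Analyst",
            instructions="Use the knowledge-base tools to answer from uploaded documents.",
            mcp_servers=[rag],
        )
        result = await Runner.run(agent, question)
        return result.final_output
```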
Each team also gets its own persistent Docker workspace — a long-lived sandbox the team’s agents share. Agents can run shell commands and scripts (bash, Python, anything in the workspace image), read and write files, and accumulate state across conversations. A team’s workspace is essentially “the room the team works in”: files written during one session are still there in the next, so agents can build up notes, scratch repos, datasets, draft documents, or whatever shape the team’s work takes. File uploads through the chat UI land directly in the workspace. Workspaces don’t share with each other — each team has its own.
Everything is observable through Langfuse: token counts, costs per turn, traces per conversation, latency distributions.
Architecture
┌────────────────────────────────────────────────────┐
│ FRONTEND (React, 18 pages) │
│ │
│ • Agent CRUD / hierarchy builder │
│ • Chat UI with streaming + refresh survival │
│ • MCP marketplace (one-click tool enable) │
│ • Per-user credential vault (encrypted) │
│ • Workspace management │
└────────────────────────┬───────────────────────────┘
│ REST + WebSocket
▼
┌────────────────────────────────────────────────────┐
│ FASTAPI BACKEND (101 endpoints) │
│ │
│ Auth: JWT (HS256) or API key │
│ Rate limiting on auth endpoints │
│ Audit logging with PII masking │
│ 87 Pydantic schemas for request/response │
└─────┬──────────────┬───────────────┬───────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│AGENT │ │ TOOL LAYER │ │ STREAMING │
│ENGINE │ │ (MCP-first) │ │ INFRA │
│ │ │ │ │ │
│OpenAI │ │ MCP servers: │ │ WebSocket │
│Agents SDK│ │ • SSE │ │ + PostgreSQL │
│ │ │ • HTTP │ │ LISTEN/NOTIFY│
│Multi-LLM │ │ • stdio │ │ │
│via │ │ • ToolHive │ │ Survives: │
│LiteLLM │ │ • rag-mcp ───┼─┼─► to RAG │
│ │ │ (knowledge │ │ • refresh │
│Two modes:│ │ bases) │ │ • tab switch │
│• orch │ │ │ │ • mobile bg │
│• delegate│ │ Per-user │ │ │
│ │ │ Fernet- │ │ │
│ │ │ encrypted │ │ │
│ │ │ credentials │ │ │
└─────┬────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────┼─────────────────┘
▼
┌────────────────────────────────────────────────────┐
│ POSTGRESQL (16 tables + associations) │
│ 21 Alembic migrations │
│ Conversations, agents, MCPs, credentials, │
│ teams, workspaces, audit log │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ PER-TEAM WORKSPACE CONTAINERS │
│ One Docker container per team, volume-persisted. │
│ Bash + scripting environment shared by team's │
│ agents. Files / artifacts / state survive across │
│ conversations; chat file uploads land here. │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ RAG-ANYTHING (fork) │
│ Document upload → ingestion (Qdrant + Neo4j + │
│ MinIO); rag-mcp server → agentic retrieval │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ LANGFUSE (observability) │
│ Traces, costs, tokens per turn │
└────────────────────────────────────────────────────┘
Key design decisions
MCP-first, not MCP-as-an-afterthought
Most agent libraries grew up with function-calling or hand-defined tool registries; MCP support is grafted on. This platform inverts that: tools are MCP servers from the start. Internal tools (workspace file ops, document upload, agent CRUD) are exposed as in-process MCP servers so they go through the same registration, capability negotiation, and error-handling paths as external tools.
The benefit shows up in two places: (1) adding a new tool means adding an MCP server, not modifying core engine code; (2) the marketplace is a real first-class feature, not a UI veneer over hand-written integrations. Users enable a tool with one click and a credential paste, and it works.
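With the MCP Python SDK's FastMCP helper, a first-party server is a handful of lines. A sketch of what an internal workspace-file-ops server could look like (tool names and the workspace root are assumptions, not the platform's actual internal servers):

```python
from pathlib import Path
from mcp.server.fastmcp import FastMCP

# Hypothetical internal server exposing workspace file operations as MCP tools.
WORKSPACE_ROOT = Path("/workspace")
workspace = FastMCP("workspace")

@workspace.tool()
def read_file(path: str) -> str:
    """Read a file from the team workspace."""
    return (WORKSPACE_ROOT / path).read_text()

@workspace.tool()
def write_file(path: str, content: str) -> str:
    """Write a file into the team workspace."""
    target = WORKSPACE_ROOT / path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {len(content)} characters to {path}"

if __name__ == "__main__":
    workspace.run()  # stdio by default; could equally be mounted in-process
```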
Two orchestration modes
This is the design call I’d most strongly defend. Most multi-agent libraries pick one — usually orchestration-as-synthesis — and force every problem into that mold. But synthesis fails for cases where one specialist owns the rest of the conversation (you don’t want the manager to “synthesize” a billing decision; you want the billing agent to actually take over). Delegation fails for cases that need parallel execution.
The platform supports both, and the choice is per-agent-relationship: a manager can delegate to specialist A and orchestrate over specialists B and C. The two modes share the same underlying LLM-tool-call mechanism; the difference is whether the specialist’s output ends the conversation or returns control upstream.
Persistent streaming
WebSocket connections die. Pages refresh. Tabs background on mobile, then foreground 30 minutes later. A streaming chat interface that drops state on any of these is a worse user experience than a non-streaming one.
The platform pipes every token to PostgreSQL as it arrives, then to the WebSocket. A reconnect (refresh, network blip, mobile resume) issues a LISTEN on the conversation channel and replays the persisted tokens, so the user sees the stream continue from where they left off; another tab opening the same conversation can listen to the same channel and see the live stream too.
The implementation cost is modest — a tokens table, a publisher in the agent loop, a LISTEN consumer in the WebSocket handler — and the resilience benefit is enormous. Of everything the libraries leave out, this was the most-requested feature; adding it once means it works everywhere.
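A minimal sketch of the two halves, assuming asyncpg, a hypothetical conversation_tokens table, and per-conversation channel names (the platform's actual schema and WebSocket handler are not shown):

```python
import json
import asyncpg

async def publish_token(conn: asyncpg.Connection, conversation_id: str, seq: int, token: str) -> None:
    # Persist first, so reconnecting clients can replay what they missed...
    await conn.execute(
        "INSERT INTO conversation_tokens (conversation_id, seq, token) VALUES ($1, $2, $3)",
        conversation_id, seq, token,
    )
    # ...then notify live listeners (the original WebSocket, other tabs).
    await conn.execute(
        "SELECT pg_notify($1, $2)",
        f"conv_{conversation_id}", json.dumps({"seq": seq, "token": token}),
    )

async def attach_listener(conn: asyncpg.Connection, conversation_id: str, on_token) -> None:
    # Full replay of persisted tokens, then live LISTEN (see the tradeoffs
    # section on replaying from the last acknowledged token instead).
    for row in await conn.fetch(
        "SELECT seq, token FROM conversation_tokens WHERE conversation_id = $1 ORDER BY seq",
        conversation_id,
    ):
        on_token(row["seq"], row["token"])
    await conn.add_listener(
        f"conv_{conversation_id}",
        lambda _c, _pid, _chan, payload: on_token(**json.loads(payload)),
    )
```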
Per-user encrypted credentials
MCP tools that talk to external services (GitHub, Brave Search, OpenAI API key for a sub-agent) need credentials. The wrong way is one shared .env for the whole deployment. The right way is per-user credentials, encrypted at rest, decrypted only at the moment of tool invocation.
The platform uses Fernet (AES-128-CBC + HMAC-SHA256 from the cryptography library) with a key held in the server environment, not in the database. A database dump exposes ciphertext only; a server compromise exposes the key but not the credential vault unless the database is also compromised. It’s not perfect (true split-key would require a secrets manager), but it’s the correct shape for a self-hostable platform: defense against the most common database-leak scenario, with a clean upgrade path to a secrets manager when the deployment justifies it.
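The shape of that, as a sketch (the environment-variable name and helper names are assumptions):

```python
import os
from cryptography.fernet import Fernet

# The key lives in the server environment, never in the database.
_fernet = Fernet(os.environ["CREDENTIAL_ENCRYPTION_KEY"])

def encrypt_credential(plaintext: str) -> bytes:
    # What gets written to the credentials table: ciphertext only.
    return _fernet.encrypt(plaintext.encode())

def decrypt_credential(ciphertext: bytes) -> str:
    # Called only at the moment of tool invocation, never at read/list time.
    return _fernet.decrypt(ciphertext).decode()
```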
Persistent team workspaces
Most multi-agent libraries treat each conversation as stateless — the agent reasons, calls some tools, returns an answer, forgets. That works for one-shot question-answering. It breaks down for any team that’s actually doing real work over time: an analyst building up a dossier across a week of conversations, a research team accumulating notes and drafts, a product team iterating on a doc. The state has nowhere to live.
The platform gives every team its own Docker workspace container with a Docker volume mounted at the writeable workspace dir. The container image carries bash and a standard set of scripting tools. Agents shell into the workspace through MCP tools and treat it as a real filesystem — read, write, run scripts, install packages, persist artifacts. The workspace survives container restarts; what the team built up yesterday is there today.
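A sketch of that lifecycle using the Docker SDK for Python (image, naming, and mount point are illustrative assumptions; the platform's actual provisioning code is not shown):

```python
import docker
from docker.errors import NotFound

client = docker.from_env()

def ensure_workspace(team_id: str):
    """Create or reuse the team's long-lived workspace container."""
    name = f"workspace-{team_id}"
    try:
        return client.containers.get(name)
    except NotFound:
        volume = client.volumes.create(name=f"{name}-data")
        return client.containers.run(
            "python:3.12-slim",            # assumed base image (bash + Python)
            name=name,
            command="sleep infinity",      # keep it alive between tool calls
            volumes={volume.name: {"bind": "/workspace", "mode": "rw"}},
            working_dir="/workspace",
            detach=True,
        )

def run_in_workspace(team_id: str, cmd: str) -> str:
    # The MCP "run command" tool is roughly a docker exec against this container.
    container = ensure_workspace(team_id)
    _exit_code, output = container.exec_run(["bash", "-lc", cmd], workdir="/workspace")
    return output.decode()
```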
This is a different shape from Eleanor’s ephemeral, hardened per-orchestrator containers. Eleanor’s containers are the agent runtime, kept short-lived for security. The Multi-Agent platform’s workspace containers are an agent tool — a place the team’s work persists and accumulates. Different problem, different design call.
The result is an agent system that doesn’t lose context the moment a session closes. An agent can pick up a half-finished script, refer back to last week’s analysis notes, or build incrementally on prior work. That’s what makes hierarchical teams useful for actual work, not just demo-shaped questions.
LLM self-correction on tool errors
MCP tools fail in predictable ways: wrong argument types, missing required fields, malformed JSON. The naive response is to crash the agent loop. The right response is to format the error as an LLM-readable message and let the agent retry.
The platform catches every MCP tool error, structures it ({"error": "Type mismatch on argument 'count': expected int, got string '5'"}), and feeds it back to the agent as a tool result. The agent reads the error, understands what went wrong, and retries with corrected arguments. Across observed runs, the tool error rate that surfaces to the user is approximately zero — the LLM corrects its own arguments faster than a human could intervene.
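The wrapper is roughly this shape (a sketch; the session object and function name are assumptions rather than the platform's actual code):

```python
import json

async def call_tool_with_self_correction(session, tool_name: str, arguments: dict) -> str:
    """Turn MCP tool failures into LLM-readable tool results instead of crashes.

    `session` is assumed to be an MCP client session.
    """
    try:
        result = await session.call_tool(tool_name, arguments)
        return str(result)
    except Exception as exc:
        # Returned to the agent as the tool result; the model reads the error
        # and retries with corrected arguments on its next step.
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
```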
This is small in code (a try/except wrapper) and large in reliability impact. The platform’s “MCP works reliably” reputation is mostly this one pattern.
Impact
| Metric | Value |
|---|---|
| First-party Python LOC | ~95K |
| TypeScript / TSX LOC | ~23K |
| REST endpoints | 101 |
| Pydantic schemas | 87 |
| Database tables | 16 first-class + association tables |
| Frontend pages | 18 |
| Test files | 233 (unit / integration / e2e / QA) |
| Hierarchy depth demonstrated | 3 levels (manager → team leads → 8 specialists) |
| MCP error rate visible to users | ~0 (LLM self-correction handles tool errors) |
The size is the point. Each of the agent libraries this platform competes with is a few thousand lines of code. The platform is ~120K LOC because it includes everything you'd build around a library to make it usable as a product — frontend, auth, streaming, observability, marketplace, credential management, workspace isolation. Every line of that is a line a downstream user doesn't have to write.
The platform is domain-agnostic — any kind of team can be built for any kind of work. In production it's used for hierarchical multi-step analysis (managers fanning work out to specialist agents), parallel-orchestration patterns where multiple specialists run independently and a manager merges results, and chat-based workflows via the Discord integration. The platform doesn't impose a domain on team composition; it provides the substrate (agents + tools + workspaces + persistence + streaming) and lets teams take whatever shape the work requires.
Tradeoffs and what I’d do differently
No RBAC. The platform is currently flat-ownership: a user owns the agents they create, can share them by API key, and that’s it. For a multi-team deployment, real role-based access (admin / team_lead / user / viewer) is the missing piece. Designed-in but not yet implemented.
Single-process Uvicorn. Database pool exhaustion at ~25–30 concurrent users is the real ceiling. The fix is multi-worker deployment with Redis-backed sessions, which is a known migration but requires moving WebSocket session state out of process memory. The right time to do this is when there’s a deployment that needs it; today there isn’t.
Chat endpoint has no rate limiting. Auth endpoints do (5/min register, 10/min login). The chat endpoint relies on the database connection pool as an implicit limiter — once you exhaust the pool, requests start 500-ing — which is the wrong UX. Per-user rate limits with backoff are on the next-up list.
No tool lazy-loading. All MCP tools for an agent are instantiated at agent creation time. This is fine for agents with 5–10 tools; it gets memory-expensive for agents with 30+ tools that are only intermittently called. Lazy instantiation on first tool call would let agents declare large tool catalogs without paying the memory cost up front.
Stream reconnect uses full replay. The LISTEN/NOTIFY recovery currently replays the entire conversation from the start, which works fine for short conversations and gets slow for long ones. Incremental replay from the last-acknowledged token is the right shape; it’s straightforward but not yet built.
The most useful design choice in retrospect: treating MCP errors as LLM input instead of crashing. It’s a small wrapper that turned the platform’s tool-use reliability story from “mostly works” to “works.” Most of the platform’s other infrastructure (auth, persistence, streaming) is the table stakes you have to ship to be taken seriously; the MCP error-handling pattern is the smaller, less-obvious thing that did most of the reliability work.
Architecture and aggregate metrics only. Implementation specifics, prompts, and platform-specific configuration are not part of this writeup.