
Multi-Agent Chat Threading

Persistent context, real-time LLM switching & auto-summarization

Mathew Mozaffari · Full Stack Assessment 2026

Architecture Overview

Client (HTTP)
FastAPI App
Routers — threads.py, messages.py
Services — thread, message, summary
LLM Layer — registry + OpenRouter client
OpenRouter API Claude 3.5 Sonnet / GPT-4o
Database — PostgreSQL via SQLAlchemy async
Threads  |  Messages  |  Summaries

Data Model

Thread

id UUID, PK
title String
system_prompt Text
active_model String
created_at DateTime
updated_at DateTime

Message

id UUID, PK
thread_id FK → Thread
role Enum (user/assistant)
content Text
model_used String, nullable
created_at DateTime

Summary

id UUID, PK
thread_id FK → Thread
content Text
message_count Int
last_message_id FK → Message
created_at DateTime
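The production schema lives in SQLAlchemy's async ORM (per the architecture slide); the shape of the three tables can be sketched with plain dataclasses, field names taken directly from the tables above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"


@dataclass
class Thread:
    title: str
    system_prompt: str
    active_model: str                  # current LLM for this thread
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=datetime.utcnow)
    updated_at: datetime = field(default_factory=datetime.utcnow)


@dataclass
class Message:
    thread_id: UUID                    # FK -> Thread
    role: Role
    content: str
    model_used: Optional[str] = None   # nullable; unset for user messages
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=datetime.utcnow)


@dataclass
class Summary:
    thread_id: UUID                    # FK -> Thread
    content: str
    message_count: int                 # messages covered at summary time
    last_message_id: UUID              # FK -> Message
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=datetime.utcnow)
```

A sketch of the column layout only; constraints, indexes, and relationships belong to the actual SQLAlchemy models.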

LLM Orchestration

  • All models go through OpenRouter — unified gateway, OpenAI-compatible API
  • Model Registry pattern — adding a new model = one line in a Python dict
  • Per-thread model selection via active_model field on Thread
  • Per-message override via optional model parameter on message creation
  • Real-time switching via PATCH /api/threads/{id}
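The registry pattern and resolution order above can be sketched as follows; the registry entries and the `resolve_model` helper name are illustrative, but the precedence (per-message override, then the thread's `active_model`) is as described:

```python
from typing import Optional

# Model Registry: adding a new model = one new entry (values illustrative)
MODEL_REGISTRY: dict[str, dict] = {
    "anthropic/claude-3.5-sonnet": {"max_tokens": 4096},
    "openai/gpt-4o": {"max_tokens": 4096},
}


def resolve_model(thread_active_model: str, override: Optional[str] = None) -> str:
    """Pick the model for one completion: the per-message override wins,
    otherwise fall back to the thread's active_model."""
    model = override or thread_active_model
    if model not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model: {model}")
    return model
```

Because every model goes through OpenRouter's OpenAI-compatible API, switching models changes only the model string in the request, not the payload shape.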

Context Window Assembly

1. System Prompt: Constant per thread — sets persona & behavior rules
2. Latest Summary: "Summary of earlier conversation: ..." (if one exists)
3. Unsummarized Messages: Everything after the last summary — full user/assistant pairs
Assembled payload → sent to active LLM via OpenRouter
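The three-step assembly above, sketched as a pure function (message dicts follow the OpenAI-compatible shape OpenRouter accepts; the summary prefix wording is taken from the slide):

```python
from typing import Optional


def assemble_context(
    system_prompt: str,
    latest_summary: Optional[str],
    unsummarized: list[dict],  # [{"role": "user" | "assistant", "content": ...}]
) -> list[dict]:
    """Build the messages payload:
    1. system prompt, 2. latest summary (if any), 3. unsummarized messages."""
    messages = [{"role": "system", "content": system_prompt}]
    if latest_summary:
        messages.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {latest_summary}",
        })
    messages.extend(unsummarized)
    return messages
```

Because the assembled payload is model-agnostic, the same context can be handed to whichever model the thread is currently using.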

Auto-Summarization

  • Threshold trigger: every 20 messages (configurable via SUMMARY_THRESHOLD)
  • Uses a fast/cheap model for compression — keeps costs low
  • Previous summary included for continuity — chained summarization
  • Original messages preserved — summaries layer on top, never destructive
  • Prompt: "Summarize concisely, preserving key facts, decisions, and context"
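The trigger and chaining described above can be sketched as below; `SUMMARY_THRESHOLD` defaults to 20 per the slide, while the helper names and the exact layout of the chained prompt are assumptions:

```python
SUMMARY_THRESHOLD = 20  # configurable threshold from the slide above


def should_summarize(total_messages: int, summarized_count: int) -> bool:
    """Fire once every SUMMARY_THRESHOLD messages past the last summary."""
    return total_messages - summarized_count >= SUMMARY_THRESHOLD


def build_summary_prompt(previous_summary, new_messages: list[dict]) -> str:
    """Chained summarization: include the prior summary so facts carry
    forward. Originals are never deleted -- summaries only layer on top."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in new_messages)
    parts = []
    if previous_summary:
        parts.append(f"Previous summary:\n{previous_summary}")
    parts.append(f"New messages:\n{transcript}")
    parts.append("Summarize concisely, preserving key facts, decisions, and context.")
    return "\n\n".join(parts)
```

The prompt itself would then be sent to the fast/cheap compression model rather than the thread's active model.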

Thread Creation + First Message

POST /api/threads
{
  "title": "Live Test",
  "system_prompt": "You are a concise assistant.",
  "model": "anthropic/claude-3.5-sonnet"
}

POST /api/threads/{id}/messages
{ "content": "What is the capital of France?" }
→ Claude: "Paris is the capital city of France."

Model Switching

PATCH /api/threads/{id}
{ "model": "openai/gpt-4o" }

POST /api/threads/{id}/messages
{ "content": "What is its population?" }
→ GPT-4o: "approximately 2.1 million people" ← Context preserved!
GPT-4o understood that "its" referred to Paris from Claude's earlier response: seamless cross-model context.

Full Conversation Thread

user What is the capital of France?
assistant [Claude] Paris is the capital city of France.
user What is its population?
assistant [GPT-4o] approximately 2.1 million people
user Name 3 famous landmarks there.
assistant [Claude] Eiffel Tower, Notre-Dame, Louvre
Two different models, one continuous conversation thread.

Auto-Summary Output

GET /api/threads/{id}/summaries
// After 20+ messages, the system auto-generates:
[
  {
    "id": "c83a1f2e-7b04-4d6a-9e53-af81d2c0b147",
    "thread_id": "9f4e6a3d-1c82-4b7f-a5d0-3e8b9c7f2a14",
    "content": "The user asked about France. Claude identified Paris as the capital. GPT-4o provided a population estimate of ~2.1 million. The user then requested famous landmarks; Claude listed the Eiffel Tower, Notre-Dame, and the Louvre.",
    "message_count": 20,
    "last_message_id": "e27d4b8a-5f13-4a92-b6c1-8d09e3f7a5b2",
    "created_at": "2026-03-25T14:32:07.841291"
  }
]
Triggered automatically after 20 messages — compresses history while preserving key facts, decisions & cross-model context.

Next Steps & Enhancements

WebSocket streaming for real-time responses
Background task queue (Celery/ARQ) for summarization
Connection pooling & read replicas for scalability
Authentication & rate limiting
Frontend chat UI (React/Next.js)
Multi-agent roles (coder + reviewer agents)
Vector embeddings for semantic search over history

Thank You

Python FastAPI SQLAlchemy PostgreSQL OpenRouter httpx

https://github.com/developer-at-speer/superq-assessment

Mathew Mozaffari · 2026