
Multi-Agent Chat Threading

Persistent context, real-time LLM switching & auto-summarization

Mathew Mozaffari · Full Stack Assessment 2026

Architecture Overview

Client (HTTP)
  ↓
FastAPI App
  ├─ Routers — threads.py, messages.py
  ├─ Services — thread, message, summary
  ├─ LLM Layer — registry + OpenRouter client → OpenRouter API (Claude 3.5 Sonnet / GPT-4o)
  └─ Database — SQLite via SQLAlchemy async (Threads | Messages | Summaries)

Data Model

Thread

id UUID, PK
title String
system_prompt Text
active_model String
created_at DateTime
updated_at DateTime

Message

id UUID, PK
thread_id FK → Thread
role Enum (user/assistant)
content Text
model_used String, nullable
created_at DateTime

Summary

id UUID, PK
thread_id FK → Thread
content Text
message_range_start Int
message_range_end Int
created_at DateTime
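The three tables above can be sketched as plain Python dataclasses (field names and nullability come from the slides; the actual app maps these with SQLAlchemy's async ORM, so this is an illustrative shape, not the real mapping):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Thread:
    title: str
    system_prompt: str
    active_model: str                   # current model for new messages
    id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


@dataclass
class Message:
    thread_id: str                      # FK -> Thread.id
    role: str                           # "user" | "assistant"
    content: str
    model_used: Optional[str] = None    # nullable: set on assistant replies
    id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=_now)


@dataclass
class Summary:
    thread_id: str                      # FK -> Thread.id
    content: str
    message_range_start: int            # first message index covered
    message_range_end: int              # last message index covered
    id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=_now)
```

Keeping `model_used` on each assistant message is what lets a thread record which model produced which reply after a mid-thread switch.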

LLM Orchestration

  • All models go through OpenRouter — unified gateway, OpenAI-compatible API
  • Model Registry pattern — adding a new model = one line in a Python dict
  • Per-thread model selection via active_model field on Thread
  • Per-message override via optional model parameter on message creation
  • Real-time switching via PATCH /api/threads/{id}

Context Window Assembly

1. System Prompt — constant per thread; sets persona & behavior rules
2. Latest Summary — "Summary of earlier conversation: ..." (if one exists)
3. Unsummarized Messages — everything after the last summary; full user/assistant pairs

Assembled payload → sent to active LLM via OpenRouter
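The three-step assembly above can be sketched as a single function (the function name and OpenAI-style message dicts are assumptions; the ordering and the summary prefix are from the slide):

```python
from typing import Optional


def build_context(system_prompt: str,
                  latest_summary: Optional[str],
                  unsummarized: list[dict]) -> list[dict]:
    """Assemble the chat payload in order:
    1) system prompt, 2) latest summary (if any),
    3) every message after the last summary."""
    context = [{"role": "system", "content": system_prompt}]
    if latest_summary:
        context.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {latest_summary}",
        })
    context.extend(unsummarized)
    return context
```

Because the payload is rebuilt from the thread state on every request, switching `active_model` mid-thread needs no migration step: the next model simply receives the same assembled context.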

Auto-Summarization

  • Threshold trigger: every 20 messages (configurable via SUMMARY_THRESHOLD)
  • Uses a fast/cheap model for compression — keeps costs low
  • Previous summary included for continuity — chained summarization
  • Original messages preserved — summaries layer on top, never destructive
  • Prompt: "Summarize concisely, preserving key facts, decisions, and context"

Thread Creation + First Message

POST /api/threads { "title": "Live Test", "system_prompt": "You are a concise assistant.", "model": "anthropic/claude-3.5-sonnet" }
POST /api/threads/{id}/messages { "content": "What is the capital of France?" } → Claude: "Paris is the capital city of France."

Model Switching

PATCH /api/threads/{id} { "model": "openai/gpt-4o" }
POST /api/threads/{id}/messages { "content": "What is its population?" } → GPT-4o: "approximately 2.1 million people" ← Context preserved!
GPT-4o resolved "its" to Paris from Claude's earlier response — seamless cross-model context.

Full Conversation Thread

user What is the capital of France?
assistant [Claude] Paris is the capital city of France.
user What is its population?
assistant [GPT-4o] approximately 2.1 million people
user Name 3 famous landmarks there.
assistant [Claude] Eiffel Tower, Notre-Dame, Louvre
Two different models, one continuous conversation thread.

Next Steps & Enhancements

WebSocket streaming for real-time responses
Background task queue (Celery/ARQ) for summarization
PostgreSQL for production scalability
Authentication & rate limiting
Frontend chat UI (React/Next.js)
Multi-agent roles (coder + reviewer agents)
Vector embeddings for semantic search over history

Thank You

Python FastAPI SQLAlchemy SQLite OpenRouter httpx

https://github.com/developer-at-speer/superq-assessment

Mathew Mozaffari · 2026