
Multi-Agent Chat Threading

Persistent context, real-time LLM switching & auto-summarization

Mathew Mozaffari · Full Stack Assessment 2026

Architecture Overview

Client (HTTP)
FastAPI App
Routers — threads.py, messages.py
Services — thread, message, summary
LLM Layer — registry + OpenRouter client
OpenRouter API Claude 3.5 Sonnet / GPT-4o
Database — PostgreSQL via SQLAlchemy async
Threads  |  Messages  |  Summaries

Data Model

Thread

id UUID, PK
title String
system_prompt Text
active_model String
created_at DateTime
updated_at DateTime

Message

id UUID, PK
thread_id FK → Thread
role Enum (user/assistant)
content Text
model_used String, nullable
created_at DateTime

Summary

id UUID, PK
thread_id FK → Thread
content Text
message_count Int
last_message_id FK → Message
created_at DateTime
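The production schema lives in SQLAlchemy's async ORM (per the architecture slide); the shape of the three tables can be sketched with plain dataclasses, field names taken directly from the tables above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"


@dataclass
class Thread:
    title: str
    system_prompt: str
    active_model: str                  # current LLM for this thread
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=datetime.utcnow)
    updated_at: datetime = field(default_factory=datetime.utcnow)


@dataclass
class Message:
    thread_id: UUID                    # FK -> Thread
    role: Role
    content: str
    model_used: Optional[str] = None   # nullable; unset for user messages
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=datetime.utcnow)


@dataclass
class Summary:
    thread_id: UUID                    # FK -> Thread
    content: str
    message_count: int                 # messages covered at summary time
    last_message_id: UUID              # FK -> Message
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=datetime.utcnow)
```

A sketch of the column layout only; constraints, indexes, and relationships belong to the actual SQLAlchemy models.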

LLM Orchestration

  • All models go through OpenRouter — unified gateway, OpenAI-compatible API
  • Model Registry pattern — adding a new model = one line in a Python dict
  • Per-thread model selection via active_model field on Thread
  • Per-message override via optional model parameter on message creation
  • Real-time switching via PATCH /api/threads/{id}
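The registry pattern and resolution order above can be sketched as follows; the registry entries and the `resolve_model` helper name are illustrative, but the precedence (per-message override, then the thread's `active_model`) is as described:

```python
from typing import Optional

# Model Registry: adding a new model = one new entry (values illustrative)
MODEL_REGISTRY: dict[str, dict] = {
    "anthropic/claude-3.5-sonnet": {"max_tokens": 4096},
    "openai/gpt-4o": {"max_tokens": 4096},
}


def resolve_model(thread_active_model: str, override: Optional[str] = None) -> str:
    """Pick the model for one completion: the per-message override wins,
    otherwise fall back to the thread's active_model."""
    model = override or thread_active_model
    if model not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model: {model}")
    return model
```

Because every model goes through OpenRouter's OpenAI-compatible API, switching models changes only the model string in the request, not the payload shape.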

Context Window Assembly

1. System Prompt: Constant per thread — sets persona & behavior rules
2. Latest Summary: "Summary of earlier conversation: ..." (if one exists)
3. Unsummarized Messages: Everything after the last summary — full user/assistant pairs
Assembled payload → sent to active LLM via OpenRouter
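The three-step assembly above, sketched as a pure function (message dicts follow the OpenAI-compatible shape OpenRouter accepts; the summary prefix wording is taken from the slide):

```python
from typing import Optional


def assemble_context(
    system_prompt: str,
    latest_summary: Optional[str],
    unsummarized: list[dict],  # [{"role": "user" | "assistant", "content": ...}]
) -> list[dict]:
    """Build the messages payload:
    1. system prompt, 2. latest summary (if any), 3. unsummarized messages."""
    messages = [{"role": "system", "content": system_prompt}]
    if latest_summary:
        messages.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {latest_summary}",
        })
    messages.extend(unsummarized)
    return messages
```

Because the assembled payload is model-agnostic, the same context can be handed to whichever model the thread is currently using.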

Auto-Summarization

  • Threshold trigger: every 20 messages (configurable via SUMMARY_THRESHOLD)
  • Uses a fast/cheap model for compression — keeps costs low
  • Previous summary included for continuity — chained summarization
  • Original messages preserved — summaries layer on top, never destructive
  • Prompt: "Summarize concisely, preserving key facts, decisions, and context"
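The trigger and chaining described above can be sketched as below; `SUMMARY_THRESHOLD` defaults to 20 per the slide, while the helper names and the exact layout of the chained prompt are assumptions:

```python
SUMMARY_THRESHOLD = 20  # configurable threshold from the slide above


def should_summarize(total_messages: int, summarized_count: int) -> bool:
    """Fire once every SUMMARY_THRESHOLD messages past the last summary."""
    return total_messages - summarized_count >= SUMMARY_THRESHOLD


def build_summary_prompt(previous_summary, new_messages: list[dict]) -> str:
    """Chained summarization: include the prior summary so facts carry
    forward. Originals are never deleted -- summaries only layer on top."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in new_messages)
    parts = []
    if previous_summary:
        parts.append(f"Previous summary:\n{previous_summary}")
    parts.append(f"New messages:\n{transcript}")
    parts.append("Summarize concisely, preserving key facts, decisions, and context.")
    return "\n\n".join(parts)
```

The prompt itself would then be sent to the fast/cheap compression model rather than the thread's active model.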

Thread Creation + First Message

POST /api/threads
{
  "title": "Live Test",
  "system_prompt": "You are a concise assistant.",
  "model": "anthropic/claude-3.5-sonnet"
}

POST /api/threads/{id}/messages
{ "content": "What is the capital of France?" }
→ Claude: "Paris is the capital city of France."

Model Switching

PATCH /api/threads/{id}
{ "model": "openai/gpt-4o" }

POST /api/threads/{id}/messages
{ "content": "What is its population?" }
→ GPT-4o: "approximately 2.1 million people" ← Context preserved!
GPT-4o understood that "its" referred to Paris from Claude's earlier response: seamless cross-model context.

Full Conversation Thread

user What is the capital of France?
assistant [Claude] Paris is the capital city of France.
user What is its population?
assistant [GPT-4o] approximately 2.1 million people
user Name 3 famous landmarks there.
assistant [Claude] Eiffel Tower, Notre-Dame, Louvre
Two different models, one continuous conversation thread.

Auto-Summary Output

GET /api/threads/{id}/summaries
// After 20+ messages, the system auto-generates:
[
  {
    "id": "c83a1f2e-7b04-4d6a-9e53-af81d2c0b147",
    "thread_id": "9f4e6a3d-1c82-4b7f-a5d0-3e8b9c7f2a14",
    "content": "The user asked about France. Claude identified Paris as the capital. GPT-4o provided a population estimate of ~2.1 million. The user then requested famous landmarks; Claude listed the Eiffel Tower, Notre-Dame, and the Louvre.",
    "message_count": 20,
    "last_message_id": "e27d4b8a-5f13-4a92-b6c1-8d09e3f7a5b2",
    "created_at": "2026-03-25T14:32:07.841291"
  }
]
Triggered automatically after 20 messages — compresses history while preserving key facts, decisions & cross-model context.

Next Steps & Enhancements

WebSocket streaming for real-time responses
Background task queue (Celery/ARQ) for summarization
Connection pooling & read replicas for scalability
Authentication & rate limiting
Frontend chat UI (React/Next.js)
Multi-agent roles (coder + reviewer agents)
Vector embeddings for semantic search over history

Thank You

Python FastAPI SQLAlchemy PostgreSQL OpenRouter httpx

https://github.com/developer-at-speer/superq-assessment

Mathew Mozaffari · 2026