Conversation History Compaction for LLM API Calls

Overview

As users have longer conversations with an AI chatbot, every message gets sent back to the LLM on each turn -- driving up costs and slowing response times. This article explains a compaction strategy that summarizes older messages into a condensed context block while keeping recent messages intact, significantly reducing token usage without losing conversational continuity.

  • Why it matters: Without compaction, a 40-message conversation can cost 5-10x more per API call than necessary. At scale, this directly impacts operational budgets and response latency.
  • What you will learn: How to implement threshold-based compaction triggers, store compressed summaries, and correctly inject them into subsequent API calls to maintain a seamless user experience.

Multi-turn conversations accumulate tokens with every exchange: a 40-message conversation can easily reach 15,000-20,000 input tokens, and every one of those tokens is re-sent on each subsequent API call. The sections below walk through the full pattern: detecting when to compact, generating the summary, storing it, and injecting it back into later requests.

The Token Growth Problem

Every API call to an LLM includes the full conversation history. In a typical support chatbot:

  • Messages 1-10: Initial greeting and problem description (~2,000 tokens)
  • Messages 11-25: Diagnostic questions and answers (~5,000 tokens)
  • Messages 26-40: Resolution steps and follow-ups (~6,000 tokens)

By message 40, you are sending 13,000+ input tokens per request even though only the last few exchanges are contextually relevant. At scale, this multiplies cost significantly.
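To see how quickly this compounds, note that each turn re-sends everything before it, so the total input tokens billed over a conversation grow quadratically with turn count. A rough illustration (the per-exchange figure is a hypothetical average chosen to match the ~13,000-token endpoint above):

// Rough illustration: cumulative input tokens across a conversation,
// assuming ~325 tokens of new content per exchange (hypothetical figure).
Integer tokensPerExchange = 325;
Integer historyTokens = 0;
Integer totalInputTokens = 0;
for (Integer turn = 1; turn <= 40; turn++) {
    historyTokens += tokensPerExchange;   // history grows linearly...
    totalInputTokens += historyTokens;    // ...but billed input grows quadratically
}
System.debug(historyTokens);     // ~13,000 tokens sent on the final call alone
System.debug(totalInputTokens);  // ~266,500 tokens billed across all 40 calls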

Compaction Architecture

The pattern uses three components:

  1. A threshold trigger -- when total messages exceed a configurable limit (e.g., 40), invoke compaction
  2. A summary field -- a text field (e.g., Context_Summary__c) on the conversation record that stores the compressed context
  3. A compacted flag -- a boolean (e.g., Is_Compacted__c) on each message record, marking which messages have been absorbed into the summary

public static void checkAndCompact(Id conversationId) {
    Integer messageCount = [
        SELECT COUNT()
        FROM Chat_Message__c
        WHERE Conversation__c = :conversationId
    ];

    if (messageCount > COMPACTION_THRESHOLD) {
        compactOlderMessages(conversationId);
    }
}

Building the Compaction Request

To generate the summary, send the older messages to the LLM with a specific compaction prompt. The response becomes the new context summary.

private static void compactOlderMessages(Id conversationId) {
    List<Chat_Message__c> allMessages = [
        SELECT Role__c, Content__c, Is_Compacted__c, CreatedDate
        FROM Chat_Message__c
        WHERE Conversation__c = :conversationId
        ORDER BY CreatedDate ASC
    ];

    // Keep the most recent N messages uncompacted
    Integer keepRecent = 10;
    Integer compactUpTo = allMessages.size() - keepRecent;
    if (compactUpTo <= 0) {
        return; // Nothing old enough to compact
    }

    // Build the compaction payload from the older messages
    List<Map<String, String>> olderMessages = new List<Map<String, String>>();
    for (Integer i = 0; i < compactUpTo; i++) {
        olderMessages.add(new Map<String, String>{
            'role' => allMessages[i].Role__c,
            'content' => allMessages[i].Content__c
        });
    }

    // Ask the LLM to summarize the older messages
    String summary = callLLMForSummary(olderMessages);

    // Persist: save the summary, then flag the absorbed messages
    update new Conversation__c(
        Id = conversationId,
        Context_Summary__c = summary
    );

    List<Chat_Message__c> toFlag = new List<Chat_Message__c>();
    for (Integer i = 0; i < compactUpTo; i++) {
        toFlag.add(new Chat_Message__c(
            Id = allMessages[i].Id,
            Is_Compacted__c = true
        ));
    }
    update toFlag;
}
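
The callLLMForSummary helper is referenced but not shown above. A minimal sketch might look like the following, assuming a generic HTTP callout to a chat-completions-style endpoint; the named credential, request shape, and response field are illustrative placeholders, not any specific provider's API:

private static String callLLMForSummary(List<Map<String, String>> olderMessages) {
    // Compaction prompt: instruct the model to preserve the facts that matter
    String instruction = 'Summarize the conversation below into a compact context block. '
        + 'Preserve names, IDs, decisions made, and unresolved questions. '
        + 'Omit greetings and small talk.';

    Map<String, Object> body = new Map<String, Object>{
        'messages' => new List<Object>{
            new Map<String, String>{ 'role' => 'system', 'content' => instruction },
            // Flatten the older messages into a single user turn for the summarizer
            new Map<String, String>{
                'role' => 'user',
                'content' => JSON.serialize(olderMessages)
            }
        }
    };

    HttpRequest req = new HttpRequest();
    req.setEndpoint('callout:LLM_API/v1/chat/completions'); // hypothetical named credential
    req.setMethod('POST');
    req.setHeader('Content-Type', 'application/json');
    req.setBody(JSON.serialize(body));

    HttpResponse res = new Http().send(req);
    Map<String, Object> parsed =
        (Map<String, Object>) JSON.deserializeUntyped(res.getBody());
    return (String) parsed.get('summary'); // response shape is provider-specific
}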

Injecting the Summary into API Calls

This is where the most common bug occurs. After compaction, the older messages are flagged and excluded from the API payload. If you forget to inject the summary into the system prompt, the AI loses all prior context -- the conversation appears to start mid-stream with no history.

public static String buildSystemPrompt(Id conversationId) {
    String basePrompt = getBaseSystemPrompt();

    Conversation__c conv = [
        SELECT Context_Summary__c
        FROM Conversation__c
        WHERE Id = :conversationId
    ];

    if (String.isNotBlank(conv.Context_Summary__c)) {
        basePrompt += '\n\nPRIOR CONVERSATION SUMMARY:\n'
            + conv.Context_Summary__c
            + '\n\nThe above summarizes earlier conversation. '
            + 'Recent messages below reflect the current state. '
            + 'If there is a conflict, trust the recent messages.\n';
    }

    return basePrompt;
}

// Only send non-compacted messages as conversation history
List<Chat_Message__c> recentMessages = [
    SELECT Role__c, Content__c
    FROM Chat_Message__c
    WHERE Conversation__c = :conversationId
    AND Is_Compacted__c = false
    ORDER BY CreatedDate ASC
];
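
Putting the two halves together, the final request combines the summary-bearing system prompt with only the non-compacted messages. A sketch of the assembly step (the message-map structure mirrors the payload format used earlier in this article):

// Assemble the API payload: summary-aware system prompt + recent history only
List<Map<String, String>> payload = new List<Map<String, String>>();
payload.add(new Map<String, String>{
    'role' => 'system',
    'content' => buildSystemPrompt(conversationId)
});
for (Chat_Message__c msg : recentMessages) {
    payload.add(new Map<String, String>{
        'role' => msg.Role__c,
        'content' => msg.Content__c
    });
}
// payload now carries the compressed context (inside the system prompt)
// plus only the messages where Is_Compacted__c = false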

Token Estimation

Most LLMs tokenize at roughly 4 characters per token for English text. A quick estimation method:

public static Integer estimateTokens(String text) {
    return String.isBlank(text) ? 0 : (text.length() / 4);
}

This is imprecise but sufficient for budget tracking and threshold decisions. For exact counts, use the token count returned in the API response.
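The estimate can also drive the compaction trigger directly: instead of counting messages, compact when the estimated size of the history crosses a token budget. A sketch, where the 12,000-token budget is an arbitrary example to be tuned per model context window:

public static Boolean exceedsTokenBudget(List<Chat_Message__c> messages) {
    Integer budget = 12000; // example budget; tune for your model's context window
    Integer estimated = 0;
    for (Chat_Message__c msg : messages) {
        estimated += estimateTokens(msg.Content__c);
        if (estimated > budget) {
            return true; // stop early once the budget is exceeded
        }
    }
    return false;
}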

Design Considerations

  • Compaction frequency: Do not compact on every message. Batch compaction at threshold crossings (40, 80, 120) to minimize API calls dedicated to summarization.
  • Summary quality: The compaction prompt matters. Instruct the LLM to preserve key facts: names, IDs, decisions made, and unresolved questions.
  • Audit trail: Compacted messages remain in the database with Is_Compacted__c = true. They are excluded from API calls but available for review and compliance.
  • Conflict resolution: Always instruct the AI that recent messages override the summary, since the summary may contain outdated information that was later corrected.
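
The first consideration above, compacting only at threshold crossings, can be expressed as a small refinement of the earlier trigger. The tracking field Last_Compacted_Count__c below is hypothetical, introduced only for this sketch:

// Compact only when the count crosses a new multiple of the threshold,
// so summarization runs at 40, 80, 120, ... rather than on every message.
public static void checkAndCompactBatched(Conversation__c conv, Integer messageCount) {
    Integer threshold = 40;
    Integer lastCompactedAt = conv.Last_Compacted_Count__c == null
        ? 0
        : Integer.valueOf(conv.Last_Compacted_Count__c);

    if (messageCount >= lastCompactedAt + threshold) {
        compactOlderMessages(conv.Id);
        conv.Last_Compacted_Count__c = messageCount;
        update conv;
    }
}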