Context Compression
How BlueNexus optimizes tool call responses to keep your LLM's context window efficient.
The Problem
A typical MCP integration with multiple services might expose 200+ tool definitions — each with parameters, descriptions, and return schemas. Loading all of these into an LLM's context window:
- Wastes tokens on tool definitions instead of actual reasoning
- Hits context limits quickly, especially with smaller models
- Slows down response times as the model processes irrelevant tools
The Two-Tool Solution
BlueNexus consolidates everything into two tools:
| Tool | Purpose |
|---|---|
use-agent |
Natural language interface to all connected services |
list-connections |
Discover available services |
Your LLM sees two tool definitions instead of hundreds. The use-agent tool delegates to an internal ReAct agent that has access to the full tool catalog — but that complexity is hidden from your context window.
How It Works
- Your LLM receives a user message like "What's on my calendar today?"
- It calls
use-agentwith{ "prompt": "What's on my calendar today?" } - BlueNexus's internal agent:
- Discovers the user's connected Google account
- Loads only the relevant Google Calendar tools
- Executes the API calls
- Processes and summarizes the results
- Your LLM receives a clean text response
The internal agent handles the full tool routing, API calls, and data processing. Your LLM only sees the final result.
Response Sanitization
Responses from connected services are processed before reaching your app:
Token Redaction
Sensitive data is automatically stripped:
- OAuth tokens, API keys, secrets
- Bearer/Basic auth headers
- Platform-specific tokens (GitHub
ghp_, Slackxoxb-, AWSAKIA) - Long alphanumeric strings that look like credentials
This prevents credential leakage when service APIs inadvertently include tokens in their responses.
Size Truncation
Responses are capped at 50KB. If a service returns more data than that, it's truncated with a marker indicating the truncation. This prevents a single large API response from overwhelming your LLM's context.
Data Boundaries
External service responses are wrapped with boundary markers that signal to the LLM: "treat this as data, not as instructions." This provides defense against indirect prompt injection — where a malicious document or message in a connected service tries to hijack the LLM's behavior.
Parallel Execution
When a user's request involves independent tasks across different services, the use-agent tool's description instructs calling LLMs to invoke it multiple times in parallel. Each call executes concurrently for faster results:
"When the user's request involves independent tasks across different services,
call this tool multiple times in parallel rather than sequentially"
For example, "Check my Google Calendar and summarize my Slack DMs" can run as two parallel use-agent calls, each targeting a different connector.
Credit Efficiency
The two-tool model also affects credit consumption:
- You save on the tokens your outer LLM would spend processing hundreds of tool definitions
- Net effect: fewer total tokens consumed for the same task
Next Steps
- Architecture — Full request flow diagram