Most AI assistants feel impressive right up until you try to use them as part of your real daily workflow.

They can summarize documents, answer questions, write code, and occasionally surprise you with something genuinely insightful. But after a while you start noticing the same patterns over and over again. They lose track of what you’re actually trying to accomplish. They pull in too much irrelevant context. They overuse tools. They forget preferences that mattered two messages ago. Sometimes they become confidently helpful in ways that are subtly wrong, which is often worse than simply failing.

That frustration was a big part of why I started building Toby.

I wasn’t particularly interested in creating another chatbot with a giant system prompt and access to a handful of APIs. What I wanted was something that could stay focused on the task at hand, understand what kind of work it was helping with, selectively pull in the right context, and become more useful over time without turning into an unpredictable black box.

The more I worked with large language models, the more obvious it became that the difficult part wasn’t making the models smarter. The models are already remarkably capable. The difficult part is deciding what information they should see, which tools they should have access to, and what boundaries should exist around both. Once you start introducing memory, personalization, external integrations, and long-running workflows, the idea of “just use a bigger prompt” starts breaking down pretty quickly.

At a high level, Toby treats every chat turn as a coordination problem:

flowchart LR
  U[User message] --> P[Intent preflight]
  P --> M[Message assembly]
  M --> E[Execution engine]
  E --> T[Tool calls]
  T --> R[Response synthesis]
  R --> H[Session history]
  R --> L[Optional learning loop]
  L --> MEM[Memory store with policy checks]

Most AI systems flatten all of this into one giant prompt and rely on the model to sort everything out internally. That approach works surprisingly well for simple interactions, but it becomes increasingly fragile as the system grows more capable. Once tools, memory, personalization, permissions, and external systems start interacting with each other, separating responsibilities matters much more.

Toby handles this by layering multiple guardrails together instead of trying to solve everything in one place. Some layers shape behavior and communication style. Others decide which capabilities should even be available for a given task. Others determine what context is safe, useful, or relevant to include in the prompt. The system ends up feeling much more predictable because no single component is responsible for solving every problem at once.
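To make the separation concrete, here is a minimal sketch of a chat turn passing through a few of the stages from the flowchart. It is an illustration under stated assumptions, not Toby's actual code: the names `Turn`, `preflight`, `assemble`, `execute`, and `run_turn` are hypothetical, the intent pass is a keyword check, and the model call is a stub.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """State passed between stages of a single chat turn (hypothetical shape)."""
    user_message: str
    intent: str = "unknown"
    visible_tools: list[str] = field(default_factory=list)
    prompt: str = ""
    response: str = ""


def preflight(turn: Turn) -> Turn:
    # Stand-in for the intent pass: a keyword check instead of a model call.
    if "email" in turn.user_message.lower():
        turn.intent = "email"
        turn.visible_tools = ["email.search"]
    return turn


def assemble(turn: Turn) -> Turn:
    # Message assembly only sees what preflight allowed through.
    tools = ", ".join(turn.visible_tools) or "none"
    turn.prompt = f"[intent={turn.intent}] [tools={tools}] {turn.user_message}"
    return turn


def execute(turn: Turn) -> Turn:
    # Stub for the execution engine; a real system would call a model here.
    turn.response = f"(model output for: {turn.prompt})"
    return turn


def run_turn(message: str, history: list[Turn]) -> Turn:
    turn = Turn(user_message=message)
    for stage in (preflight, assemble, execute):
        turn = stage(turn)
    history.append(turn)  # session history is updated last
    return turn


history: list[Turn] = []
result = run_turn("Summarize my email from today", history)
print(result.prompt)
```

Because each stage has one job, a stage can be tested or swapped without touching the others, which is the predictability argument in code form.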

Personas Keep the Assistant Consistent

One of the first things I noticed while using AI heavily was how often people end up restating the same preferences over and over again.

They ask the assistant to be concise. To think like an engineer. To challenge assumptions instead of agreeing automatically. To brainstorm more creatively. To avoid sounding overly formal. Eventually I realized people usually aren’t just asking for answers. They’re asking for a particular style of collaboration and a particular mode of reasoning depending on the situation they’re in.

That idea eventually evolved into personas.

A persona in Toby is less about roleplaying and more about establishing an operating profile for the session. Personas can influence communication style, model selection, reasoning behavior, provider choice, and high-level constraints around how the assistant approaches problems. Instead of treating every interaction as completely generic, Toby can adapt to the type of collaboration the user actually wants while still staying internally consistent from one conversation to the next.

That consistency turns out to matter quite a bit once you start relying on an assistant regularly instead of using it occasionally for isolated tasks.
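One way to picture a persona as an operating profile is a small immutable record. This is a hedged sketch, not Toby's real schema: the `Persona` class, its field names, and the placeholder provider and model values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical operating profile applied to a whole session."""
    name: str
    style: str        # communication style guidance
    provider: str     # which model provider to use (placeholder value below)
    model: str        # which model to request (placeholder value below)
    reasoning: str    # for example "brief" or "deliberate"
    constraints: tuple[str, ...] = ()

    def system_preamble(self) -> str:
        # Render the behavioral parts of the profile as prompt text.
        lines = [
            f"Persona: {self.name}",
            f"Style: {self.style}",
            f"Reasoning: {self.reasoning}",
        ]
        lines += [f"Constraint: {c}" for c in self.constraints]
        return "\n".join(lines)


ENGINEER = Persona(
    name="engineer",
    style="concise, informal",
    provider="example-provider",
    model="example-model",
    reasoning="deliberate",
    constraints=("challenge assumptions instead of agreeing automatically",),
)

print(ENGINEER.system_preamble())
```

Freezing the record is a deliberate choice in this sketch: the profile stays the same for the whole session, which is where the consistency comes from.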

Intent Detection Reduces Context Pollution

One thing I think a lot of AI systems get wrong is treating every user request as equally broad.

If someone says:

“Help me clean up my week and respond to urgent messages.”

there’s a huge difference between summarizing email, drafting responses, planning a schedule, querying integrations, and autonomously taking actions on the user’s behalf. A useful assistant needs to narrow the scope of the task before it starts loading context and exposing tools to the model.

Before Toby assembles the main prompt, it can run a lightweight intent pass that tries to determine what the user is actually attempting to accomplish, which integrations are likely relevant, which tools should be visible, and whether any specialized skills should be loaded. The goal isn’t perfect classification. It’s reducing unnecessary noise before the main execution step begins.

This matters because context overload becomes a real problem surprisingly quickly. Large prompts often create the illusion of intelligence while quietly degrading response quality underneath. Once too much irrelevant information is injected into the conversation, models become less focused, less reliable, and more eager to pull in concepts that aren’t helpful for the task.

A lot of Toby’s architecture is really about resisting that tendency.
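The intent pass can be sketched as a function that maps a message to the integrations, tools, and skills worth exposing. Everything here is an assumption for illustration: the `INTENT_CATALOG` contents, the tool and skill names, and especially the keyword matching, which stands in for whatever lightweight classifier a real system would use.

```python
from dataclasses import dataclass

# Hypothetical catalog: each intent maps to the integrations, tools, and
# skills that may be exposed for it.
INTENT_CATALOG = {
    "email_triage": {
        "keywords": ("email", "inbox", "messages"),
        "integrations": ("mail",),
        "tools": ("mail.search", "mail.draft"),
        "skills": ("triage",),
    },
    "scheduling": {
        "keywords": ("week", "schedule", "calendar"),
        "integrations": ("calendar",),
        "tools": ("calendar.list",),
        "skills": ("prioritization",),
    },
}


@dataclass
class Preflight:
    intents: list[str]
    integrations: list[str]
    tools: list[str]
    skills: list[str]


def run_preflight(message: str) -> Preflight:
    """Keyword stand-in for a lightweight intent pass (an assumption here)."""
    text = message.lower()
    result = Preflight([], [], [], [])
    for intent, spec in INTENT_CATALOG.items():
        if any(keyword in text for keyword in spec["keywords"]):
            result.intents.append(intent)
            result.integrations.extend(spec["integrations"])
            result.tools.extend(spec["tools"])
            result.skills.extend(spec["skills"])
    return result


plan = run_preflight("Help me clean up my week and respond to urgent messages.")
print(plan.intents)
print(plan.tools)
```

A message that matches nothing yields empty lists, so the main prompt starts with no extra tools or skills rather than all of them.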

Skills Work Like Specialized Playbooks

Skills are Toby’s way of loading deeper instructions only when they’re actually useful.

Instead of permanently stuffing every possible behavior guideline into the system prompt, Toby can expose lightweight metadata about available skills and selectively load the full instruction set when the situation calls for it. That keeps the baseline prompt smaller and more focused while still allowing for highly specialized behavior when needed.

In practice, skills end up behaving a lot like targeted playbooks.

A coding-related task might load architecture review guidance, debugging heuristics, or TypeScript conventions. A writing task might load editing strategies or tone instructions. Planning workflows can load structured triage patterns or prioritization guidance. The assistant stays relatively lightweight until deeper procedural context becomes necessary.

That approach has worked much better than trying to build one enormous universal prompt that attempts to anticipate every possible scenario ahead of time.
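The metadata-first loading described above can be sketched as a registry that lists one-line summaries and defers the full instructions until a skill is activated. This is a minimal sketch under assumptions: `Skill`, `SkillRegistry`, and the sample skill text are hypothetical and not taken from Toby.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    """Hypothetical skill: cheap metadata plus a deferred loader for the body."""
    name: str
    summary: str
    load: Callable[[], str]  # returns the full instruction text on demand


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}
        self._loaded: dict[str, str] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def index(self) -> str:
        # Only names and one-line summaries go into the baseline prompt.
        return "\n".join(f"- {s.name}: {s.summary}" for s in self._skills.values())

    def activate(self, name: str) -> str:
        # The full instructions are loaded once, the first time they are needed.
        if name not in self._loaded:
            self._loaded[name] = self._skills[name].load()
        return self._loaded[name]


load_calls: list[str] = []  # records loader invocations for the demo


def load_debugging() -> str:
    load_calls.append("debugging")
    return "Reproduce the bug first, then bisect, then fix."


registry = SkillRegistry()
registry.register(Skill("debugging", "heuristics for isolating bugs", load_debugging))

print(registry.index())
print(registry.activate("debugging"))
```

The baseline prompt pays only for the index; the longer instruction text costs tokens only in turns that actually need it.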

Memory Is Useful, But It Needs Boundaries

Memory is probably the hardest part of building systems like this responsibly.

People genuinely want personalization. They want the assistant to remember projects, preferences, writing style, recurring workflows, communication habits, and long-term context. At the same time, they also want control over what gets remembered, why it gets remembered, and how that information is used later.

A lot of AI products treat memory as hidden magic happening somewhere behind the scenes. I’ve become increasingly convinced that this is the wrong approach.

Toby handles memory much more cautiously. Memory writes can be proposal-based instead of automatic. Stored information can carry provenance data, sensitivity classifications, visibility rules, and explicit edit or deletion paths for the user. The goal is to make memory feel inspectable instead of mysterious.
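A proposal-based write path might look roughly like the sketch below. The class names, fields, and the approval rule are assumptions chosen to mirror the properties listed above (provenance, sensitivity, visibility, and a deletion path); Toby's real storage model may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass
class MemoryProposal:
    """Hypothetical proposed memory write; nothing is stored until approved."""
    key: str
    value: str
    source: str                             # provenance: where the fact came from
    sensitivity: Sensitivity
    visible_to: tuple[str, ...] = ("all",)  # for example, which personas may read it


class MemoryStore:
    def __init__(self) -> None:
        self.pending: list[MemoryProposal] = []
        self.records: dict[str, MemoryProposal] = {}

    def propose(self, proposal: MemoryProposal) -> None:
        self.pending.append(proposal)

    def approve(self, key: str) -> None:
        # Assumed policy check: the user confirms each proposed write explicitly.
        for proposal in list(self.pending):
            if proposal.key == key:
                self.pending.remove(proposal)
                self.records[key] = proposal

    def forget(self, key: str) -> bool:
        # Explicit deletion path for the user; returns False if nothing was stored.
        return self.records.pop(key, None) is not None


store = MemoryStore()
store.propose(MemoryProposal("tone", "prefers concise replies",
                             source="chat turn", sensitivity=Sensitivity.LOW))
store.propose(MemoryProposal("health", "mentioned a medical appointment",
                             source="chat turn", sensitivity=Sensitivity.HIGH))
store.approve("tone")
print(sorted(store.records), [p.key for p in store.pending])
```

Unapproved proposals stay in `pending`, so the user can list what the assistant wanted to remember and why before any of it becomes durable.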

There’s also an important distinction between cache and memory that gets blurred surprisingly often in AI systems.

Cached tool results are short-lived execution optimizations designed to avoid redundant work during nearby requests. Memory is durable user context that survives across sessions and gradually shapes personalization over time. Those are very different concepts with very different implications, and treating them interchangeably tends to create confusing behavior pretty quickly.
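The difference shows up clearly in code. Below is a minimal sketch, with a hypothetical `ToolResultCache` class and an injected fake clock so the example is deterministic; the 60-second lifetime is an arbitrary assumption, not a value from Toby.

```python
import time
from typing import Optional


class ToolResultCache:
    """Short-lived cache for tool results; entries expire after ttl seconds."""

    def __init__(self, ttl: float, clock=time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self.clock(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value


# Fake clock so the example does not depend on real time passing.
now = [0.0]
cache = ToolResultCache(ttl=60.0, clock=lambda: now[0])
cache.put("calendar.list:today", "3 events")
fresh = cache.get("calendar.list:today")
now[0] = 120.0  # two minutes later
stale = cache.get("calendar.list:today")
print(fresh, stale)

# Durable memory has no TTL; entries change only through explicit writes or deletes.
durable_memory = {"tone": "prefers concise replies"}
```

The cache forgets on its own and nobody should notice; memory changes only when someone decides it should.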

Tools Are Powerful, Which Means They Need Constraints

Tools are where AI assistants start becoming genuinely useful instead of simply conversational.

Connecting models to email, calendars, search systems, documents, automation frameworks, and local applications dramatically changes what an assistant can accomplish. But unrestricted tool access also produces a surprising amount of chaos, which becomes obvious once you watch real-world behavior over longer periods of time.

Toby tries to keep tool access intentionally scoped.

Some tools are globally available, while others only appear when they’re relevant to the active task or to the integrations in use. Read and write operations are treated differently, and sensitive actions can require explicit confirmation flows depending on the level of ambiguity or risk involved.
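Scoping rules like these reduce to a couple of small predicates. The sketch below is illustrative only: the `Tool` descriptor, the tool names, and the rule that every write needs confirmation are assumptions, and a real policy would also weigh ambiguity and risk as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor."""
    name: str
    writes: bool = False               # write tools change external state
    integration: Optional[str] = None  # None means globally available


TOOLS = [
    Tool("web.search"),
    Tool("mail.search", integration="mail"),
    Tool("mail.send", writes=True, integration="mail"),
    Tool("calendar.list", integration="calendar"),
]


def visible_tools(active_integrations: set[str]) -> list[Tool]:
    # Global tools are always visible; scoped tools need an active integration.
    return [
        tool for tool in TOOLS
        if tool.integration is None or tool.integration in active_integrations
    ]


def needs_confirmation(tool: Tool) -> bool:
    # Assumed policy: every write asks the user first; reads run directly.
    return tool.writes


exposed = visible_tools({"mail"})
print([tool.name for tool in exposed])
print([tool.name for tool in exposed if needs_confirmation(tool)])
```

Filtering before the prompt is built means the model never sees tools outside the current lane, so it cannot call them by mistake.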

I don’t think the answer is removing autonomy entirely. Humans delegate tasks to imperfect systems constantly. The important part is making sure the assistant operates inside clearly defined lanes and that those boundaries remain understandable to both the system and the user.

Toby Is Really a Coordination Layer

At the surface level, Toby looks like a chat application. Underneath, it behaves much more like a coordination layer sitting between the user, long-term context, external systems, safety policies, and the model itself.

That distinction ended up shaping almost every architectural decision in the project.

The difficult question was never really:

“How do I build a chatbot?”

The harder questions were:

  • How do you keep the assistant focused?
  • How do you avoid context overload?
  • How do you personalize safely?
  • How do you expose tools without creating reckless behavior?
  • How do you let the assistant improve over time without turning it into an opaque black box?

I don’t think there’s a perfect answer to any of those questions yet, and honestly the entire industry still seems to be figuring this out in real time. But the more I work on systems like this, the less I believe the future looks like one massive all-knowing prompt trying to do everything at once.

The systems that feel genuinely useful tend to be the ones that separate responsibilities cleanly, apply constraints intentionally, and treat context as something that needs to be managed carefully rather than endlessly accumulated.

That’s really the core idea behind Toby.

© Karim Shehadeh