No. 016

Building AI Agents: What You Actually Need to Think About

In this post, I'll talk about AI agents—keeping in mind people who are working on different projects, building their profile, or just getting started with agents.

By now, you've probably explored various courses and YouTube videos, and you know what different companies are doing with agents. So I want to offer a few perspectives on specific things.


Agents and RL: What Does That Actually Mean?

Recently, someone said they're interested in LLM agents and reinforcement learning, especially the "self-improving" setup where you build scalable environments plus automated verifiers. A useful way to think about a verifier is as an evaluative signal—closer to a reward function or a critic-like scorer—rather than literally the Bellman value function itself.

In classical RL, the value function V(s) (or Q(s,a)) estimates expected long-term return and is tied to the environment's dynamics via the Bellman relationship. In agent training, a verifier instead checks whether a step, tool call, intermediate state, or final outcome satisfies some criterion (correctness, safety, constraint satisfaction, progress), and turns that into a score or label. That signal can then be used to (a) select better trajectories at inference time, and/or (b) train the policy with RL-style updates or preference-style optimization.
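For reference, the standard Bellman relation for the state-value function under a policy π (textbook form, shown only to ground the contrast) is:

```latex
% Bellman expectation equation: the value of a state is the expected immediate
% reward plus the discounted value of the successor state, taken over the
% policy and the environment's transition dynamics.
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right]
```

A verifier has no such recursive tie to the environment's dynamics; it only judges what it is shown.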

In practice, for industrial applications, it's going to be more about meta-prompting and algorithms than about fully tuning LLM weights. Test-time RL exists, but practically, we're working at the prompting and algorithm level.
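As a minimal sketch of what "working at the prompting and algorithm level" can mean, here's a verifier used purely as a scorer to pick the best of n sampled trajectories, with no weight updates. `generate_trajectory` and `verify` are hypothetical stand-ins for your own LLM rollout and checking logic:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    steps: list[str]   # tool calls / intermediate reasoning
    answer: str        # final outcome

def best_of_n(
    task: str,
    generate_trajectory: Callable[[str], Trajectory],  # hypothetical: one LLM rollout
    verify: Callable[[str, Trajectory], float],        # hypothetical: score in [0, 1]
    n: int = 8,
) -> Trajectory:
    """Sample n trajectories and keep the one the verifier scores highest.

    The verifier here is just an evaluative signal (a scorer), not a value
    function: it never models the environment's dynamics, it only judges
    whether a given trajectory satisfies the task's criteria.
    """
    candidates = [generate_trajectory(task) for _ in range(n)]
    return max(candidates, key=lambda t: verify(task, t))
```

The same verifier signal could later feed preference-style training, but nothing in this loop touches model weights.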


Beyond Simple Agents: Critical Design Patterns

People are curious about agents. They learn LangGraph, LLMs, prompting, tools, connectors, different frameworks. But what we need to look at is critical design patterns, not just simple things like resume filtering agents, blog writing agents, summarization agents, or data analysis agents.

One thing I've noticed: agents for data analysis are sometimes flawed by design. People use function calls—the agent calls functions, tools return aggregated values, and based on prompts or the LLM's parametric knowledge, it comes up with conclusions. But this way of analysis might not meet what the business needs or what the problem itself needs.

People who develop data analysis agents sometimes have gaps in their statistical thinking: how to look at things statistically, how to set up statistical fallbacks, how to instruct agents to keep certain metrics in mind. Meanwhile, other things like summarization, calendars, or rule-based triggers for other systems are very typical.
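To make the statistical-fallback point concrete, here's a minimal sketch (the thresholds are illustrative assumptions): before the agent turns an aggregated tool result into a conclusion, the raw values get checked for sample size and spread, and the agent reports an interval rather than a bare point estimate.

```python
import statistics

def check_aggregate(values: list[float], min_n: int = 30) -> dict:
    """Basic sanity checks before an agent turns an aggregate into a conclusion.

    Thresholds are illustrative assumptions; pick them with the business, not
    inside the prompt.
    """
    n = len(values)
    if n < min_n:
        return {"ok": False, "reason": f"only {n} samples; need at least {min_n}"}

    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    # Rough 95% interval on the mean, so the agent reports a range, not a point.
    margin = 1.96 * stdev / (n ** 0.5)
    return {
        "ok": True,
        "mean": mean,
        "ci_95": (mean - margin, mean + margin),
        "note": "report the interval alongside the mean",
    }
```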


The Real Use Case: Agents in Sandboxes

The other side—agents within sandboxes that perform long-running tasks—is very important and gaining traction.

Simple use cases will eventually be available as wrappers, as simple functions. But for real, critical problems like coding, the agent (LLM with prompts, skills, and context) runs in a specific sandbox and manages its context dynamically.

It's not like people typically think—you just append all past conversations and it'll be fine. No. We need cognitive architecture, which is crucial for building agents.


The Layers of Agent Architecture

Let's look at the different layers required for agents. The base skeleton is the LLM and prompts. But around that, different kinds of context or memory are required: working memory, long-term memory, short-term memory, sometimes episodic memory.

I don't want to throw around rigid terminology, but conceptually, there needs to be working memory for the agent to keep taking actions. When the need arises, the agent has to look back at the past, the long-term memory: "What happened in the past? What critical things have been recorded?"

For example, with meeting assistant agents, people are betting on capabilities like the assistant drawing on past conversations and learnings. Maintaining different levels of memory and context is essential.
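As a conceptual sketch of that layering (class names like `WorkingMemory` and `EpisodicStore` are my own illustrative choices, not any framework's API): recent turns stay in working memory, older material is recorded in a long-term store and recalled only when the current request calls for it.

```python
from collections import deque

class WorkingMemory:
    """Holds only the recent turns the agent needs to keep acting."""
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)

    def add(self, turn: str) -> None:
        self.turns.append(turn)

class EpisodicStore:
    """Long-term memory: crude keyword lookup standing in for real retrieval."""
    def __init__(self):
        self.episodes: list[str] = []

    def record(self, episode: str) -> None:
        self.episodes.append(episode)

    def recall(self, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(e.lower().split())), e) for e in self.episodes]
        return [e for score, e in sorted(scored, reverse=True)[:k] if score > 0]

def build_context(query: str, wm: WorkingMemory, lt: EpisodicStore) -> str:
    """Compose the prompt context: recent turns plus only the relevant past."""
    relevant_past = lt.recall(query)
    return "\n".join(["# Relevant past"] + relevant_past +
                     ["# Recent conversation"] + list(wm.turns) +
                     ["# Current request", query])
```

In a real system the keyword recall would be replaced by embeddings or a proper index; the point is the separation of layers, not the retrieval method.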

One might wonder about infinite context. But compared with dumping everything in at once, meticulously maintaining a context inventory is actually helpful for nuanced tasks: coding, planning, document synthesis, decision-making in law or insurance.

That's why the concept of skills is emerging. Teams don't want to keep everything dumped into the prompt; skills are arranged so agents can decide when to refer to what.
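A rough sketch of that arrangement, with the layout as an assumption for illustration rather than any specific framework's format: each skill has a one-line description that always sits in context, while the full body is loaded only when the agent decides it needs it.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    description: str   # one line; this is all the model sees up front
    body: str          # full instructions, examples, tool notes

class SkillRegistry:
    def __init__(self, skills: list[Skill]):
        self.skills = {s.name: s for s in skills}

    def index(self) -> str:
        """A compact menu the agent sees on every turn."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self.skills.values())

    def load(self, name: str) -> str:
        """Pulled into context only when the agent decides it needs this skill."""
        return self.skills[name].body

# Usage: put registry.index() in the system prompt; expose load() as a tool.
registry = SkillRegistry([
    Skill("sql_review", "How to sanity-check generated SQL before running it",
          "...full checklist and examples go here..."),
    Skill("meeting_summary", "House style for summarizing meetings",
          "...full template goes here..."),
])
```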


Conversation Mining: Learning from History

Another important experiment to look at: how can conversation mining help agents learn from past history?

Many enterprises are deploying agents for customer support or strategic planning. The knowledge is actually sitting in thousands of conversation traces. How can we utilize them?

One might wonder, "Can we train a model and use it as an agent?" Sometimes the setup might not be there. So how can we use these conversations as data, as a substrate for agents?

That involves either building a knowledge graph, or what people now call context graphs, which some claim will be a trillion-dollar economy. That's worth exploring. MCP tools already exist, but if you're more interested in developing tools and connectors, that's fine too.
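As a small-scale starting point, here's a sketch of mining conversation traces into a co-occurrence graph, so an agent can later ask what tends to come up alongside a given topic. The extraction step is deliberately naive (capitalized tokens as a stand-in for a real entity or intent extractor):

```python
from collections import defaultdict
from itertools import combinations

def extract_topics(conversation: str) -> set[str]:
    """Naive stand-in for entity/intent extraction: capitalized tokens."""
    tokens = conversation.replace(",", " ").replace(".", " ").split()
    return {t for t in tokens if t[:1].isupper() and len(t) > 2}

def build_context_graph(conversations: list[str]) -> dict[str, dict[str, int]]:
    """Co-occurrence graph: topic -> {neighboring topic -> times seen together}."""
    graph: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for convo in conversations:
        topics = sorted(extract_topics(convo))
        for a, b in combinations(topics, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def related(graph: dict[str, dict[str, int]], topic: str, k: int = 5) -> list[str]:
    """What tends to come up alongside this topic across past conversations?"""
    neighbors = graph.get(topic, {})
    return [t for t, _ in sorted(neighbors.items(), key=lambda kv: -kv[1])[:k]]
```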

I'm biased towards the art and science of agents, especially cognitive architecture. What else can we explore there?


Sandboxes: The Kitchen for Agents

Sandboxes are something we need to really think about. Most people build projects that run as scripts, where the agent calls functions that are themselves just scripts.

In reality, it's not going to be one agent. Products will have hundreds of replicable agents. If I'm a subscriber, I'll have my own agent doing its own work inside a specific sandbox, delivering results.

It's like building the kitchen for those agents. Think about how modular kitchen setups vary from place to place.

By default, sandboxes give you the basics: isolation, connectivity, resource sharing. Technologies like gVisor exist, but these days much of this is abstracted away. You can get it from E2B or Modal, or build it yourself with Kubernetes or open-source alternatives.

But within that sandbox, how do you set up everything else? That's the modular kitchen. A kitchen comes with gas connections and a stove, similar to how a sandbox comes with connectivity, isolation, networking, and resource allocation. But equipping that sandbox with everything it needs (a server to deploy the applications agents build, database connections, specific tools to deal with Notion or other services) is where the real work is.
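To make the kitchen analogy concrete, here's a sketch using the plain Docker CLI as a stand-in for whatever sandbox layer you actually use (E2B, Modal, or your own Kubernetes setup will look different): an isolated, resource-capped container with a mounted workspace that holds the tools and data the agent is allowed to touch.

```python
import subprocess

def run_in_sandbox(task_script: str, workdir: str) -> subprocess.CompletedProcess:
    """Run an agent-produced script in an isolated, resource-capped container.

    Uses the Docker CLI as a stand-in for a managed sandbox provider; the
    image name and mount layout are illustrative assumptions.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",             # isolation: no outbound network by default
        "--memory", "512m",              # resource allocation
        "--cpus", "1",
        "-v", f"{workdir}:/workspace",   # the 'kitchen': tools, data, scratch space
        "-w", "/workspace",
        "python:3.12-slim",              # assumed base image
        "python", task_script,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=300)
```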

If someone can learn to build agents and also maintain and improve sandbox environments, they can build many things: video generation, unique ad generation for multiple users, personal cloud setups. I was reading about Lovable's sandbox concepts with Modal; you can build something similar at small scale.


Behavioral Agents for Market Research

On the other side, there are personality-driven, behavioral-driven agents for consumer research and market research.

Say we want to run consumer research to understand how people would buy a product—mimicking human preferences. Two aspects here:

First, the science side: understanding or encoding behavioral and persona-related information into agents.

Second, the infrastructure and engineering side: how to run questionnaires or surveys across thousands of agents. Say you want to replicate a survey of how people buy protein powder, or how people buy books. You want to study different segments: school-goers, college students, office executives, working-class people. You need a sample of 1,000, running at least 25 questions.

One option is just doing it in a for loop, which is very naive. But if you're building this as a product for CPG or consumer product companies, you need parallel agents running at scale: interview multiple LLM agents (or generative social agents) in parallel, aggregate the results, and do the analysis.
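A minimal sketch of the parallel version, where `ask_persona` is a hypothetical async wrapper around whatever LLM client you use: each persona agent answers the questionnaire concurrently, and results come back as rows you can aggregate.

```python
import asyncio

async def ask_persona(persona: str, question: str) -> str:
    # Hypothetical: replace with a real LLM call conditioned on the persona.
    await asyncio.sleep(0)   # placeholder for network latency
    return f"[{persona}] answer to: {question}"

async def run_survey(personas: list[str], questions: list[str],
                     max_concurrency: int = 50) -> list[dict]:
    """Interview many persona agents in parallel and collect tabular results."""
    sem = asyncio.Semaphore(max_concurrency)   # don't hammer the API

    async def interview(persona: str) -> dict:
        answers = {}
        async with sem:
            for q in questions:                # questions stay sequential per persona
                answers[q] = await ask_persona(persona, q)
        return {"persona": persona, **answers}

    return await asyncio.gather(*(interview(p) for p in personas))

# Usage: results = asyncio.run(run_survey(personas, questions))
# then aggregate the rows per segment and per question.
```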


Agents in Data Science and Enterprise Tasks

Another aspect: agents as part of typical data science, machine learning, or enterprise tasks.

Most of the time, people see agents as part of summarization or automation, with straightforward logic. But where people build agents for recommendations, or LLMs for insurance, banking, medical, or drug-related work, there's unstructured data, but also structured data from warehouses that needs specific transformations.

In that case, it's not just automation or if-else prompting. There needs to be more knowledge about how to look at data, how to compare scenarios, what validations or hypotheses to consider. It shouldn't be at surface level, especially for business-specific applications.

That's where people who know the domain get leverage—people who know how to look at different numbers, how to build guardrails and validations.


Validation: The Critical Missing Piece

One specific thing agents need is validation. Sometimes people test at face value: "Oh, this output looks fine." But we need proper testing.

I think even testing jobs might evolve into helping develop evaluation pipelines. Previously, evaluation pipelines were mostly RAG-specific: retrieval precision and recall, hallucination detection.

But now agents are taking different kinds of actions. We need to validate traces along with the rationale behind them. In this situation, did the agent take the right direction? Was the rationale sound?

That involves observability to collect traces, and a validation strategy in place so agents don't mess up in production.
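A rough sketch of trace-level validation under these assumptions (the rule functions are illustrative, not a standard): each step in a trace carries its action, arguments, and stated rationale, and a set of rules flags steps that look wrong.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TraceStep:
    action: str           # e.g. a tool name
    args: dict
    rationale: str        # why the agent says it took this step
    output: str = ""

@dataclass
class Finding:
    step_index: int
    rule: str
    detail: str

# Rules are plain functions: (index, step, full trace) -> optional Finding.
Rule = Callable[[int, TraceStep, list[TraceStep]], Finding | None]

def missing_rationale(i: int, step: TraceStep, trace: list[TraceStep]) -> Finding | None:
    if len(step.rationale.strip()) < 10:
        return Finding(i, "missing_rationale", "step has no meaningful rationale")
    return None

def unconfirmed_side_effect(i: int, step: TraceStep, trace: list[TraceStep]) -> Finding | None:
    # Illustrative rule: flag side-effecting actions not preceded by a confirm step.
    if step.action in {"delete_record", "send_email"}:
        prior = {s.action for s in trace[:i]}
        if "confirm_with_user" not in prior:
            return Finding(i, "unconfirmed_side_effect", f"{step.action} without confirmation")
    return None

def validate_trace(trace: list[TraceStep], rules: list[Rule]) -> list[Finding]:
    """Run every rule over every step; an empty list means the trace passed."""
    findings = []
    for i, step in enumerate(trace):
        for rule in rules:
            if (f := rule(i, step, trace)) is not None:
                findings.append(f)
    return findings
```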


An Open Field for Different People

It's a very open field for different kinds of people. If you're creative, good at programming, or strong on domain or data-specific work, agents are the operational layer where that shows up. All the other pieces are equally important:

  • Context and logic—developing skills and decision-making capacity
  • Using past and historic data
  • Validation
  • Making agents learn automatically

Those last things are more research or applied research—it's not like frameworks are already available. Most of the time, you're building from scratch.


The field is open. Pick your angle.