Your First Eval

At Minded, we strongly believe in the Agent Development Lifecycle (ADLC), our agentic take on the Software Development Lifecycle (SDLC), a.k.a. how to ship reliable agents fast.

In this guide we’ll vibe-code an agent and set the stage for the last-mile prompt engineering needed to go live in production. By the end, you’ll have a fresh-from-the-oven agent.

The ADLC flow

Here’s how an Agent Squad teams up to take an agent from 0 to 1.

Step by Step

Define Minimum Viable Agent (MVA)

Owner: Agent Product Manager

Define the absolute minimum the agent needs to do before going live.

For the most part, that means answering the following questions.

If possible, hand off to human reps anything that’s outside of that scope.

| Aspect | Example response for an ecommerce support agent |
| --- | --- |
| Initial use case | Order status |
| Interface | Zendesk messaging |
| How to interact with it | Sending an SMS to the support number |
| Handoff mechanism | Assign the `CS Human` tag on Zendesk upon escalation |
| Metric | Deflection rate - % of support tickets solved by the agent |
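The deflection rate metric above is simple to compute once tickets record who resolved them. Here’s a minimal sketch; the `resolved_by` field is an assumed stand-in, not a real Zendesk schema:

```python
def deflection_rate(tickets: list[dict]) -> float:
    """Percent of support tickets fully solved by the agent (no human handoff).

    Assumes each ticket dict carries a "resolved_by" field set to "agent"
    or "human" -- an illustrative schema, not a real Zendesk payload.
    """
    if not tickets:
        return 0.0
    solved_by_agent = sum(1 for t in tickets if t["resolved_by"] == "agent")
    return 100.0 * solved_by_agent / len(tickets)
```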

Prepare E2E evals of real-world cases

Owner: Domain Expert

Prepare end-to-end examples of how humans solve real-world cases. Each example should ideally include (one way to record them is sketched after this list):

  1. Screen recording of the different apps operated throughout the resolution

  2. Explanation of the decision making process behind the resolution

  3. If you’re using a knowledge base, list all the relevant articles from it that were used for this resolution

  4. List all other data sources used throughout the resolution (such as admin panels)
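One lightweight way to capture these examples is a structured record per case. Below is a minimal sketch using a Python dataclass; the field names are our own illustrative convention, not a required format:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One end-to-end example of how a human solved a real-world case.

    Field names are illustrative conventions, not a required schema.
    """
    name: str                  # short label, e.g. "order-status-lookup"
    recording_url: str         # screen recording of the apps operated
    reasoning: str             # the expert's decision-making process
    kb_articles: list[str] = field(default_factory=list)   # knowledge base articles used
    data_sources: list[str] = field(default_factory=list)  # other systems touched (admin panels, etc.)

case = EvalCase(
    name="order-status-lookup",
    recording_url="https://example.com/recordings/case-001",
    reasoning="Looked the order up in the admin panel, saw it was in transit, "
              "and replied with the carrier tracking link.",
    kb_articles=["shipping-times", "tracking-links"],
    data_sources=["Zendesk", "Shopify admin"],
)
```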

Best practices for collecting evals

  1. Observe a domain expert doing the job by “shadowing” them, and notice how they operate multiple systems and interact with end users

  2. When observing the domain expert, ask them WHY they did what they did

  3. How did they collect data for each of their decisions?

  4. How did they know which data to collect and which system to operate?

  5. Do they follow a documented procedure? Which one?

  6. When do they need to escalate to a supervisor?

  7. Focus on a niche use case initially and aim to go deep, observing multiple levels of back-and-forth between the end user and the domain expert

Identify data sources for each eval

Owner: Agent Engineer

This step is critical to build agents that take actions and make decisions:

  1. Identify the data sources the AI Agent needs in order to reproduce the decision-making process and/or actions of the end-to-end evals from the previous step. The most obvious data sources are the interface (often the customer support CRM, such as Zendesk) and the admin panel (often the human operator dashboard, such as Retool); however, these data sources can get tricky when they come from third parties.

  2. Map each data source to an API endpoint and ensure the relevant fields are available from the API (see the sketch after this list)

  3. When an API is not available, consider alternatives such as Computer Use, and make sure authentication can be enabled for an AI Agent.
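For example, mapping the “order status” data source to an API endpoint might look like the sketch below. The base URL, token, and field names are hypothetical placeholders, not a real Zendesk or Retool API:

```python
import os

import requests

# Hypothetical admin-panel API -- substitute your real endpoint and auth.
API_BASE = "https://admin.example.com/api/v1"
API_TOKEN = os.environ["ADMIN_API_TOKEN"]

def get_order_status(order_id: str) -> dict:
    """Fetch the fields the agent needs to answer an order-status question.

    Verifies that the fields the eval relies on (status, carrier, tracking
    URL) are actually exposed by the API before the agent depends on them.
    """
    resp = requests.get(
        f"{API_BASE}/orders/{order_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    order = resp.json()
    # Fail fast if the API doesn't expose a field the eval needs.
    required = ("status", "carrier", "tracking_url")
    missing = [f for f in required if f not in order]
    if missing:
        raise KeyError(f"API is missing fields needed by the eval: {missing}")
    return {f: order[f] for f in required}
```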

Run the agent locally

Owner: Agent Engineer

Now is the time to make sure all the tools and knowledge are accessible to the agent (a toy sketch of this wiring follows the list):

  1. Set up the local development environment (see the getting started guide)

  2. Add tools per each data source

  3. Add knowledge per each knowledge source

  4. Add a prompt node to the agent that uses each of the above tools and knowledge bases, and make sure each of them is actually called by the agent
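What this wiring looks like in practice depends on the platform. Purely as a mental model, here is a toy stand-in; the `AgentConfig` class and its methods are invented for illustration and are not the Minded SDK (follow the getting started guide for the real setup):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentConfig:
    """Toy stand-in for an agent's wiring -- illustration only, not the Minded SDK."""
    name: str
    tools: dict[str, Callable] = field(default_factory=dict)
    knowledge: dict[str, str] = field(default_factory=dict)
    prompt: str = ""

    def add_tool(self, fn: Callable) -> None:
        self.tools[fn.__name__] = fn

    def add_knowledge(self, name: str, url: str) -> None:
        self.knowledge[name] = url

agent = AgentConfig(name="ecommerce-support")
agent.add_tool(get_order_status)   # one tool per data source (sketched earlier)
agent.add_knowledge("help-center", url="https://help.example.com")
agent.prompt = (
    "Answer order-status questions with get_order_status; "
    "consult help-center for policy; escalate anything out of scope."
)
```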

Prompt Engineering

Owner: Agent Engineer

Next you’re going to solve each of the above evals by iterating: run, inspect, fix, and retry.

You’ll probably run a similar process for each eval example (a minimal harness sketch follows the list):

  1. Invoke the trigger - ideally in a way that lets the agent actually pull the data it needs for the example

  2. Run the scenario step by step; whenever you run into an issue, troubleshoot it, solve it, and retry

    1. Evaluate the steps. Evaluation normally covers:

      1. Agent tone of voice

      2. Tool accuracy - are the input parameters configured correctly?

      3. Did the agent invoke the right tool at the right time, with the right parameters?

      4. Did the agent take necessary actions?

      5. Did it retrieve the right knowledge base articles? Did it parse them correctly?

    2. If there were any issues, there are usually two ways to go about it: solve it with code, or solve it with a prompt (see our prompt engineering best practices)

    3. After you’ve applied a fix, retry the last step until you reach the desired state

    4. If you weren’t able to solve it, consider escalating similar cases in the future, to let humans take care of cases where the agent is not perfect yet

    5. Repeat until the example reaches its final step successfully

  3. Now that the example runs fully, it’s time to review post-run logs, actions, and analytics
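Here is a minimal harness sketch for that loop. `run_scenario` is a hypothetical helper that executes the agent on one eval case and returns a transcript of steps; the step and field names are illustrative assumptions:

```python
MAX_RETRIES = 5

def tools_called(transcript: list[dict]) -> set[str]:
    """Names of the tools the agent invoked during one run (assumed step schema)."""
    return {step["tool"] for step in transcript if step["type"] == "tool_call"}

def run_eval(case, expected_tools: set[str]) -> bool:
    """Retry one eval case until it passes or the retry budget runs out."""
    for attempt in range(1, MAX_RETRIES + 1):
        transcript = run_scenario(case)                # hypothetical helper
        missing = expected_tools - tools_called(transcript)
        if not missing:
            print(f"{case.name}: passed on attempt {attempt}")
            return True
        print(f"{case.name}: attempt {attempt} missing tool calls: {missing}")
        # ...apply a code or prompt fix here, then retry the step...
    print(f"{case.name}: still failing; consider escalating these cases to humans")
    return False
```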

Agent Warm-up

Owner: Agent Engineer

Warm-up is everything that happens from the moment the agent wakes up to the first node execution. This is the window where we inject critical context, like `user_id`, or qualify the trigger to ensure the agent is laser-focused on the task it needs to resolve.

  1. Qualify the trigger - review the data schema of the trigger and decide whether some of those data points indicate the agent is not qualified for certain scenarios. Remember, it’s usually best to launch a reliable agent at small scale (see MVA above).

  2. Enrich context - some context should or could be introduced to the agent upfront. It could be parameters like `user_id`, or tool results (for example, pulling order details for an ecommerce support agent). See the sketch below.
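As a concrete illustration, a warm-up step might qualify the trigger payload and then enrich the context before the first node runs. The payload fields, supported intents, and helpers below are illustrative assumptions:

```python
# Warm-up sketch: qualify the trigger, then enrich context before the
# first node executes. Field names and helpers are illustrative assumptions.

SUPPORTED_INTENTS = {"order_status"}   # keep the MVA scope deliberately small

def warm_up(trigger: dict) -> dict | None:
    """Return enriched context for a qualified trigger, or None to hand off."""
    # 1. Qualify: hand off to a human for anything outside the MVA scope.
    if trigger.get("intent") not in SUPPORTED_INTENTS:
        return None
    if not trigger.get("user_id"):
        return None

    # 2. Enrich: inject the context the agent needs on its very first step.
    context = {
        "user_id": trigger["user_id"],
        "channel": trigger.get("channel", "sms"),
    }
    if order_id := trigger.get("order_id"):
        context["order"] = get_order_status(order_id)   # from the earlier sketch
    return context
```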

QA

Owner: Domain Expert

This is the final OK before going live. This is where the expert “stress tests” the agent on more edge cases and evaluates the results.

If the agent is good enough, then awesome: time to deploy. Otherwise, it’s time for more evals 💪🏻

Publish

Owner: Agent Engineer

Hit deploy on the platform and follow our deployment instructions for more details.
