Your First Eval
At Minded, we strongly believe in the Agent Development Lifecycle (ADLC), our agentic take on the Software Development Lifecycle (SDLC), a.k.a. how to ship reliable agents fast.
In this guide we’ll vibe-code an agent and set the stage for the last-mile prompt engineering needed to go live in production - you’ll get a fresh-from-the-oven agent.
The ADLC flow
Here’s how an Agent Squad teams up to go from 0 to 1:

Step by Step
Define Minimum Viable Agent (MVA)
Owner: Agent Product Manager
Define the absolute minimum the agent needs to do before going live.
For the most part, that means answering the following questions.
If possible, hand off to human reps anything that’s outside of that scope.
Initial use case: Order status
Interface: Zendesk messaging
How to interact with it: Sending an SMS to the support number
Handoff mechanism: Assign the “CS Human” tag on Zendesk upon escalation
Metric: Deflection rate - the % of support tickets solved by the agent
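To make this concrete, it can help to write the MVA down as a small config. Here’s a minimal sketch in Python; the field names are illustrative, not part of any platform API:

```python
# Illustrative MVA definition - field names are hypothetical,
# not part of any specific platform API.
MVA = {
    "use_case": "order_status",              # the one thing the agent must do
    "interface": "zendesk_messaging",        # where conversations happen
    "entry_point": "sms_to_support_number",  # how end users reach the agent
    "handoff": {                             # escalation mechanism
        "action": "assign_tag",
        "tag": "CS Human",
        "system": "zendesk",
    },
    "metric": "deflection_rate",             # % of tickets solved by the agent
}
```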
Prepare E2E evals of real world cases
Owner: Domain Expert
Prepare end-to-end examples of how humans solve real-world cases. Each example should ideally include the following (one way to record them is sketched after the list):
Screen recording of the different apps operated throughout the resolution
Explanation of the decision making process behind the resolution
If you’re using a knowledge base, the relevant articles from it used for this resolution
All other data sources used throughout the resolution (such as admin panels)
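One lightweight way to record these examples is as structured records, so they can later drive automated evals. A sketch in Python, with made-up field names (store these however works for your team):

```python
# A single end-to-end eval case, as a plain record.
# Field names and values are illustrative.
eval_case = {
    "title": "Order status request for a delayed shipment",
    "recording_url": "https://drive.example.com/rec-123",  # screen recording of the resolution
    "decision_process": "Looked up the order in the admin panel, "
                        "checked the carrier status, replied with the new ETA.",
    "kb_articles": ["Shipping SLAs", "Carrier tracking"],  # relevant knowledge base articles
    "data_sources": ["zendesk", "admin_panel"],            # systems operated during resolution
}
```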
Best practices for collecting evals
Observe a domain expert doing the job by “shadowing” them, and notice how they operate multiple systems and interact with end users.
When observing the domain expert, ask them WHY they did what they did:
How did they collect data for each of their decisions?
How did they know which data to collect and which system to operate?
Do they follow a documented procedure? Which one?
When do they need to escalate to a supervisor?
Focus on a niche use case initially and aim to go deep - observe multiple rounds of back-and-forth between the end user and the domain expert.
Identify data sources for each eval
Owner: Agent Engineer
This step is critical to build agents that take actions and make decisions:
Identify the data sources available to the AI Agent so it can reproduce the decision-making process and/or actions of the end-to-end evals from the previous step. The most obvious data sources are the interface (often that’s the customer support CRM, such as Zendesk) and the admin panel (often that’s the human operator dashboard, such as Retool); however, these data sources can get tricky when they come from 3rd parties.
Map each data source into an API endpoint and ensure the relevant fields are available from the API
When an API is not available, consider alternatives such as Computer Use, and make sure authentication can be enabled for an AI Agent.
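As a sanity check, it can be worth scripting a quick probe for each data source to confirm the fields your evals rely on are actually exposed by the API. A rough sketch using the requests library; the endpoint, field names, and auth scheme below are placeholders:

```python
import requests

# Hypothetical mapping from data source to API endpoint and the fields
# the evals depend on - replace with your real endpoints and fields.
DATA_SOURCES = {
    "orders_admin_panel": {
        "endpoint": "https://api.example.com/orders/{order_id}",
        "required_fields": ["status", "carrier", "estimated_delivery"],
    },
}

def check_fields(source: str, order_id: str, token: str) -> list[str]:
    """Return the required fields missing from the API response."""
    cfg = DATA_SOURCES[source]
    resp = requests.get(
        cfg["endpoint"].format(order_id=order_id),
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    return [f for f in cfg["required_fields"] if f not in payload]
```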
Run the agent locally
Owner: Agent Engineer
Now is the time to make sure all the tools and knowledge are accessible to the agent:
Set up the local development environment (see the getting started guide)
Add a tool for each data source (a platform-agnostic sketch follows this list)
Add knowledge for each knowledge source
Add a prompt node to the agent that uses each of the above tools and knowledge bases, and make sure each of them is actually called by the agent
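What “adding a tool” looks like depends on your platform, but a tool usually boils down to a function plus a schema the agent can read. A platform-agnostic sketch (all names are illustrative; your framework’s actual registration API will differ):

```python
# Platform-agnostic sketch of a tool definition - the real registration
# API depends on your agent framework.
def get_order_status(order_id: str) -> dict:
    """Fetch order status from the (hypothetical) orders API mapped earlier."""
    ...  # call the endpoint identified in the previous step

order_status_tool = {
    "name": "get_order_status",
    "description": "Look up the current status of an order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
    "fn": get_order_status,
}
```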
Prompt Engineering
Owner: Agent Engineer
Next you’re going to solve each of the above evals, iterating again and again.
You’ll probably run a similar process for each of the eval examples:
Invoke the trigger - ideally in a way that lets the agent actually pull the data it needs for the example
Run the scenario in a step-by-step manner; whenever you run into an issue, troubleshoot, solve it, and retry
Evaluate the steps. Evaluation normally covers:
Agent tone of voice
Tools accuracy - are the input parameters configured correctly?
Did the agent invoke the right tool at the right time? With the right parameters?
Did the agent take necessary actions?
Did it retrieve the right knowledge base articles? Did it parse them correctly?
If there were any issues, there are usually two ways to go about it - solve it with code, or solve it with a prompt (see our prompt engineering best practices)
After you’ve applied a fix, retry the last step until you reach the desired state
If you weren’t able to solve it, consider escalating similar cases in the future, to let humans take care of cases where the agent is not perfect yet
Repeat until the example reaches its final step successfully
Now that the example runs fully, it’s time to review post-run logs, actions, and analytics.
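This retry loop is worth scripting. A rough harness sketch, assuming a hypothetical run_agent entry point that returns the tool calls and retrieved articles, and eval cases shaped like the records from the evals step:

```python
# Rough eval-loop harness. run_agent() and eval_cases are hypothetical -
# wire them up to however you invoke the agent locally and store your evals.
def run_eval(case: dict) -> list[str]:
    """Run one eval case and return human-readable failures (empty = pass)."""
    result = run_agent(trigger=case["trigger"])      # invoke the trigger
    failures = []
    for expected in case["expected_tool_calls"]:     # right tool, right time, right params?
        if expected not in result.tool_calls:
            failures.append(f"missing tool call: {expected}")
    for article in case["kb_articles"]:              # right KB articles retrieved?
        if article not in result.retrieved_articles:
            failures.append(f"missing KB article: {article}")
    return failures

# Retry after each code or prompt fix until every case passes.
for case in eval_cases:
    failures = run_eval(case)
    print(case["title"], "PASS" if not failures else failures)
```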
Agent Warm up
Owner: Agent Engineer
Warm up is everything that happens from the moment the agent wakes up to the first node execution. This is the window where we inject critical context, like user_id, or qualifying triggers to ensure the agent is laser-focused on the task it needs to resolve.
Qualify the trigger - review the data schema of the trigger and make up your mind whether some of those data points indicate the agent is not qualified for certain scenarios. Remember: it’s usually best to launch a reliable agent at small scale (see MVA above).
Enrich context - some context should or could be introduced to the agent upfront. It could be parameters such as user_id, or some tools (for example, pulling order details for an ecommerce support agent)
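As a sketch, a warm-up step often amounts to a small hook that runs before the first node: it disqualifies out-of-scope triggers and pre-fetches context. Everything below is illustrative (it reuses the hypothetical get_order_status tool from earlier):

```python
# Illustrative warm-up hook - runs between the trigger and the first node.
# All names here are hypothetical; adapt to your platform.
def warm_up(trigger: dict) -> dict | None:
    """Return enriched context for the agent, or None to hand off to a human."""
    # Qualify the trigger: keep the launch scope small (see MVA above).
    if trigger.get("channel") != "sms" or trigger.get("topic") != "order_status":
        return None  # out of scope - escalate to a human rep

    # Enrich context upfront: inject user_id and pre-fetch order details
    # so the agent starts laser-focused on its task.
    return {
        "user_id": trigger["user_id"],
        "order": get_order_status(trigger["order_id"]),  # tool from the earlier step
    }
```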
QA
Owner: Domain Expert
This is the final OK before going live. This is where the expert “stress tests” the agent on more edge cases and evaluates the results.
If the agent is good enough - awesome, time to deploy! Otherwise, it’s time for more evals 💪🏻
Publish
Owner: Agent Engineer
Hit deploy on the platform and follow our deployment instructions for more details.