Basic Entities
Understand the core building blocks of AgentHub: agents, query sets, simulations, graders, and experiments. These entities form the foundation of your evaluation workflow.
Agent
An agent represents your AI system that you want to test and evaluate. Agents can be anything from simple chatbots to complex multi-step reasoning systems.
To run your agent in our virtual sandbox environment, implement the AgentRunner interface (Python and npm packages are supported). For more details on the AgentRunner interface, see the SDK documentation.
To define your agent in AgentHub, provide the GitHub repository name, the branch you'd like to use, and a YAML configuration file.
The YAML file should contain the module name, the agent's class name, and the path to the requirements.txt file listing its dependencies. Here's an example of the contents of a YAML file:
module: rental_property_agent
className: RentalPropertyAssistant
requirementsPath: requirements.txt
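For illustration only, here's a rough sketch of what the RentalPropertyAssistant class referenced above might look like. The actual AgentRunner interface (its method names and signatures) is defined by the AgentHub SDK, so treat the run method below as a placeholder rather than the official contract:

# rental_property_agent.py -- illustrative sketch only.
# The real interface to implement is AgentRunner from the AgentHub SDK;
# the `run` method name and signature here are assumptions for this example.
class RentalPropertyAssistant:
    """A minimal agent that answers rental-property questions."""

    def run(self, query: str) -> str:
        # A real agent would call your model or multi-step reasoning pipeline here.
        return f"Thanks for reaching out! You asked: {query}"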
Query Set
A query set is a collection of test inputs, scenarios, or prompts that you want to run your agent against. Think of it as your test suite for agent behavior.
A query set can be a ground truth set, meaning it contains the expected output for each input. This is useful for evaluating your agent's accuracy: the grader treats the expected output as the golden standard when scoring the agent's performance.
Query sets must be uploaded as a JSONL file. You can provide just a few example queries, and we can synthetically generate more, expanding your query set to be more robust.
Here's an example of a query set for a conversational customer support agent. Note that it is in JSONL format.
{"content": "Can I book the property for a family reunion next June?"}
{"content": "The heating isn't working, can someone fix it urgently?"}
{"content": "Is there a way to get early check-in if we arrive before noon?"}
{"content": "I need to cancel my booking, and I expect a full refund immediately!"}
{"content": "How close is the nearest grocery store to the property?"}
{"content": "The Wi-Fi is terrible! I can't even load a page."}
{"content": "Hi, are pets allowed on the property, specifically dogs?"}
{"content": "Can you arrange a surprise birthday decoration in the living room?"}
{"content": "The pool is dirty. When will it be cleaned?"}
{"content": "Thanks for the wonderful stay! The view was amazing."}
Learn more about our curated datasets in the Curated Datasets article.
Simulation
When you combine an agent with a query set, you get a simulation. This is where AgentHub runs your agent through all the test scenarios and collects detailed execution data.
During a simulation, AgentHub automatically captures:
- Complete execution traces
- Response times and performance metrics
- Error logs and failure points
- Resource usage statistics
- Input and output data for each query
Simulations run in isolated environments to ensure consistent, reproducible results. We construct realistic environments that recreate scenarios you might see in production. Learn more about our custom environments in the Custom Environments article.
Grader
A grader evaluates the quality of your agent's responses, providing a quantitative assessment of performance across different criteria. You can define a grader with whatever criteria you like: simply specify the metrics you care about and the numeric scale you'd like to use.
There are three components that go into creating a robust grader:
Grader Input Formatter
The Grader Input Formatter is a prompt for the LLM grader in which you specify the metrics you care about, their point scale, and how you would like the LLM to evaluate the agent's performance.
Here's an example of a Grader Input Formatter:
You are an expert evaluator of AI assistant conversations. Your task is to evaluate the quality of the ASSISTANT’s responses in the provided conversation history.
CONVERSATION HISTORY:
{{ entry.content }}
INSTRUCTIONS:
1. Carefully analyze the conversation, focusing on the ASSISTANT’s responses to the USER’s messages.
2. Evaluate the ASSISTANT’s performance based on the following metrics:
- groundedness: How factually accurate and free from hallucinations the ASSISTANT’s responses are (1.0 = completely accurate, 0.0 = completely fabricated)
- overall_satisfaction: How well the ASSISTANT addressed the USER’s needs and provided helpful responses (1.0 = perfectly satisfied, 0.0 = completely unsatisfied)
- professionalism: How appropriate, respectful, and well-structured the ASSISTANT’s responses were (1.0 = highly professional, 0.0 = completely unprofessional)
SCORING SCALE:
For each metric, assign a score using ONLY these exact values: 0.0, 0.25, 0.5, 0.75, or 1.0
OUTPUT FORMAT:
{
  "groundedness": number,
  "overall_satisfaction": number,
  "professionalism": number
}
IMPORTANT:
- Only output the JSON object with the three metrics and their scores.
- Do not include any other text, explanations, or markdown formatting.
- Ensure all scores are exactly one of: 0.0, 0.25, 0.5, 0.75, or 1.0
- Base your evaluation solely on the content of the conversation history provided.
Grader Model Configuration
The Grader Model Configuration is a JSON file that specifies settings for the grader. At its simplest, it just names the model to use:
{"model": "gpt-4o"}
Grader Response Parser
The Grader Response Parser is a simple Python function that takes the response from the grader and parses it into a JSON object.
import json

def parse_grader_response(response: str) -> dict:
    return json.loads(response)
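If your grader model occasionally wraps its output in markdown code fences despite the prompt's instructions, a slightly more defensive parser can strip them and check that every score uses the exact scale from the Grader Input Formatter. This is a sketch of one possible approach, not a required AgentHub convention:

import json

ALLOWED_SCORES = {0.0, 0.25, 0.5, 0.75, 1.0}

def parse_grader_response(response: str) -> dict:
    # Strip optional markdown fences such as ```json ... ``` before parsing.
    text = response.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    scores = json.loads(text)

    # Verify each score is one of the exact values the prompt allows.
    for metric, score in scores.items():
        if score not in ALLOWED_SCORES:
            raise ValueError(f"Unexpected score {score!r} for metric {metric!r}")
    return scores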
Experiment
An experiment is produced when a simulation run is graded. If you're working in an evaluation project, we automatically run a simulation (with the agent and query set) and then grade that simulation run with the specified grader to yield the experiment results: the evaluation.
If you're not working in an evaluation project, you can manually configure an experiment under the Building Blocks section by creating a new experiment and selecting a simulation and a grader. To produce results, select the simulation run you'd like to grade and click the Run button; we will kick off the grading process and yield the experiment results, also known as the evaluation.
Once you have the experiment results, our AI Copilot produces automatic insights to help you understand them and make informed decisions, and we generate visualizations in various formats. You can chat with the AI Copilot about the results, slice and dice them with the table's filtering and sorting tools, or dig through the visualized results in the side panel (click a row in the table to open it). You can also export the results to a JSONL file.
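After exporting, you can also analyze the results offline. The sketch below assumes only that each line of the exported file is a JSON object and that metric scores appear as numeric fields; the actual export schema and file name may differ:

import json
from collections import defaultdict

# Average every numeric field found in the exported experiment results.
totals, counts = defaultdict(float), defaultdict(int)
with open("experiment_results.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for key, value in record.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[key] += value
                counts[key] += 1

for key, total in totals.items():
    print(f"{key}: {total / counts[key]:.2f}")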