Agent Skills
Discover and share powerful Agent Skills for AI assistants
llm-evaluation - Agent Skill - Agent Skills
Home/ Skills / llm-evaluation Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Use the skills CLI to install this skill with one command. Auto-detects all installed AI assistants.
Method 1 - skills CLI
npx skills i wshobson/agents/plugins/llm-application-dev/skills/llm-evaluation CopyMethod 2 - openskills (supports sync & update)
npx openskills install wshobson/agents CopyAuto-detects Claude Code, Cursor, Codex CLI, Gemini CLI, and more. One install, works everywhere.
Installation Path
Download and extract to one of the following locations:
Claude Code Cursor OpenCode Gemini CLI Codex CLI
~/.claude/skills/llm-evaluation/ Back No setup needed. Let our cloud agents run this skill for you.
Select Model
Claude Haiku 4.5 $0.10 Claude Sonnet 4.5 $0.20 Claude Opus 4.5 $0.50 Claude Sonnet 4.5 $0.20 /task
Best for coding tasks
Try NowEnvironment setup included
LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
Measuring LLM application performance systematically
Comparing different models or prompts
Detecting performance regressions before deployment
Validating improvements from prompt changes
Building confidence in production systems
Establishing baselines and tracking progress over time
Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
BLEU : N-gram overlap (translation)
ROUGE : Recall-oriented (summarization)
METEOR : Semantic similarity
BERTScore : Embedding-based similarity
Perplexity : Language model confidence
Classification:
Accuracy : Percentage correct
Precision/Recall/F1 : Class-specific performance
Confusion Matrix : Error patterns
AUC-ROC : Ranking quality
Retrieval (RAG):
MRR : Mean Reciprocal Rank
NDCG : Normalized Discounted Cumulative Gain
Precision@K : Relevant in top K
Recall@K : Coverage in top K
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
Accuracy : Factual correctness
Coherence : Logical flow
Relevance : Answers the question
Fluency : Natural language quality
Safety : No harmful content
Helpfulness : Useful to the user
3. LLM-as-Judge
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
Pointwise : Score individual responses
Pairwise : Compare two responses
Reference-based : Compare to gold standard
Reference-free : Judge without ground truth
Quick Start
from dataclasses import dataclass
from typing import Callable
import numpy as np
@dataclass
class Metric
Detailed patterns and worked examples
Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.
:
name: str
fn: Callable
@ staticmethod
def accuracy ():
return Metric( "accuracy" , calculate_accuracy)
@ staticmethod
def bleu ():
return Metric( "bleu" , calculate_bleu)
@ staticmethod
def bertscore ():
return Metric( "bertscore" , calculate_bertscore)
@ staticmethod
def custom (name: str , fn: Callable):
return Metric(name, fn)
class EvaluationSuite :
def __init__ (self, metrics: list[Metric]):
self .metrics = metrics
async def evaluate (self, model, test_cases: list[ dict ]) -> dict :
results = {m.name: [] for m in self .metrics}
for test in test_cases:
prediction = await model.predict(test[ "input" ])
for metric in self .metrics:
score = metric.fn(
prediction = prediction,
reference = test.get( "expected" ),
context = test.get( "context" )
)
results[metric.name].append(score)
return {
"metrics" : {k: np.mean(v) for k, v in results.items()},
"raw_scores" : results
}
# Usage
suite = EvaluationSuite([
Metric.accuracy(),
Metric.bleu(),
Metric.bertscore(),
Metric.custom( "groundedness" , check_groundedness)
])
test_cases = [
{
"input" : "What is the capital of France?" ,
"expected" : "Paris" ,
"context" : "France is a country in Europe. Paris is its capital."
},
]
results = await suite.evaluate( model = your_model, test_cases = test_cases)