Source: wshobson/agents Original Plugin: llm-application-dev
LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Unigram matching with stemming and synonyms (translation)
- BERTScore: Embedding-based semantic similarity
- Perplexity: How well the model predicts the text (lower is better)
Classification (a scikit-learn sketch follows this list):
- Accuracy: Percentage of correct predictions
- Precision/Recall/F1: Per-class performance
- Confusion Matrix: Error patterns across classes
- AUC-ROC: Discrimination quality across decision thresholds
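These classification metrics map directly onto scikit-learn; a minimal sketch, assuming the model's free-text outputs have already been parsed into discrete labels:
PYTHON
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
def classification_metrics(y_true, y_pred, labels=None):
    """Compute standard classification metrics for parsed LLM outputs."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Per-class precision, recall, and F1 as a nested dict
        "report": classification_report(y_true, y_pred, labels=labels, output_dict=True),
        # AUC-ROC additionally requires predicted probabilities (sklearn.metrics.roc_auc_score)
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels).tolist()
    }
# Usage: gold labels vs. labels extracted from model responses
print(classification_metrics(["pos", "neg", "pos"], ["pos", "neg", "neg"]))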
Retrieval (RAG), with a minimal sketch after this list:
- MRR: Mean Reciprocal Rank of the first relevant result
- NDCG: Normalized Discounted Cumulative Gain (rank-weighted relevance)
- Precision@K: Fraction of the top K results that are relevant
- Recall@K: Fraction of all relevant items retrieved in the top K
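A minimal sketch of these retrieval metrics for a single query; retrieved_ids is assumed to be rank-ordered and relevant_ids is the gold set of relevant document IDs:
PYTHON
import math
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (average over queries for MRR)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved_ids[:k] if d in relevant_ids) / k
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in relevant_ids if d in retrieved_ids[:k]) / len(relevant_ids)
def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """NDCG@k with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(retrieved_ids[:k], start=1) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, len(relevant_ids)) + 1))
    return dcg / ideal if ideal else 0.0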
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
3. LLM-as-Judge
Use a strong LLM as a judge to score or compare another model's outputs.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
Quick Start
PYTHON
from llm_eval import EvaluationSuite, Metric
# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])
# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]
# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)
print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
Automated Metrics Implementation
BLEU Score
PYTHON
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4
    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )
# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
ROUGE Score
PYTHON
from rouge_score import rouge_scorer
def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
BERTScore
PYTHON
from bert_score import score
def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using pre-trained BERT."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )
    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
Custom Metrics
PYTHON
def calculate_groundedness(response, context):
    """Check if the response is grounded in (entailed by) the provided context."""
    # Use an NLI model with the context as premise and the response as hypothesis
    from transformers import pipeline
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
    result = nli({"text": context, "text_pair": response})[0]
    # Return the model's confidence that the context entails the response
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0
def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify
    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score
def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or fact-checking API
    pass
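One way to fill in the factuality stub is retrieval plus NLI, reusing the entailment check from calculate_groundedness. A sketch under the assumption that knowledge_base is a hypothetical retriever exposing a search(query, k) method that returns passage strings:
PYTHON
from transformers import pipeline
def retrieval_nli_factuality(claim, knowledge_base, top_k=5):
    """Score a claim by its best entailment against retrieved evidence passages."""
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
    passages = knowledge_base.search(claim, k=top_k)  # hypothetical retriever interface
    best = 0.0
    for passage in passages:
        result = nli({"text": passage, "text_pair": claim})[0]
        if result["label"] == "ENTAILMENT":
            best = max(best, result["score"])
    # Highest entailment confidence across the retrieved evidence
    return best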
LLM-as-Judge Patterns
Single Output Evaluation
PYTHON
import json
from openai import OpenAI
client = OpenAI()
def llm_judge_quality(response, question):
    """Use a strong judge model (here GPT-4) to rate response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)
Question: {question}
Response: {response}
Provide ratings in JSON format:
{{
  "accuracy": <1-10>,
  "helpfulness": <1-10>,
  "clarity": <1-10>,
  "reasoning": "<brief explanation>"
}}
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
Pairwise Comparison
PYTHON
def compare_responses(question, response_a, response_b):
    """Compare two responses using the LLM judge (reuses client and json from above)."""
    prompt = f"""Compare these two responses to the question and determine which is better.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Which response is better and why? Consider accuracy, helpfulness, and clarity.
Answer with JSON:
{{
  "winner": "A" or "B" or "tie",
  "reasoning": "<explanation>",
  "confidence": <1-10>
}}
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
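Pairwise LLM judges often show position bias, preferring whichever response is shown first. Running the comparison in both orders and keeping only consistent verdicts is a cheap mitigation; a sketch built on compare_responses above:
PYTHON
def compare_responses_debiased(question, response_a, response_b):
    """Run the pairwise judge in both orders to reduce position bias."""
    first = compare_responses(question, response_a, response_b)
    second = compare_responses(question, response_b, response_a)
    # Map the swapped-order verdict back to the original A/B labels
    remap = {"A": "B", "B": "A", "tie": "tie"}
    if first["winner"] == remap[second["winner"]]:
        return {"winner": first["winner"], "consistent": True}
    # The judge disagreed with itself across orderings: treat as a tie and flag for review
    return {"winner": "tie", "consistent": False}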
Human Evaluation Frameworks
Annotation Guidelines
PYTHON
class AnnotationTask:
    """Structure for human annotation task."""
    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context
    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
Inter-Rater Agreement
PYTHON
from sklearn.metrics import cohen_kappa_score
def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement between two annotators (Cohen's kappa)."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)
    # Landis & Koch interpretation bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"
    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
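Cohen's kappa only covers two raters. With three or more annotators, Fleiss' kappa is a common alternative; a sketch using statsmodels, where ratings is an (items x raters) array of categorical scores:
PYTHON
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
def calculate_multi_rater_agreement(ratings):
    """Fleiss' kappa for an (n_items, n_raters) array of categorical ratings."""
    # aggregate_raters converts raw ratings into an (n_items, n_categories) count table
    table, _ = aggregate_raters(np.asarray(ratings))
    return fleiss_kappa(table)
# Usage: three raters scoring four responses on a 1-5 scale
print(calculate_multi_rater_agreement([[5, 5, 4], [3, 3, 3], [2, 1, 2], [4, 5, 4]]))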
A/B Testing
Statistical Testing Framework
PYTHON
from scipy import stats
import numpy as np
class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}
    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)
    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]
        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)
        # Effect size (Cohen's d), using sample standard deviations
        pooled_std = np.sqrt((np.std(a_scores, ddof=1)**2 + np.std(b_scores, ddof=1)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std
        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }
    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
Regression Testing
Regression Detection
PYTHON
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold
    def check_for_regression(self, new_results):
        """Detect if new results show regression."""
        regressions = []
        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)
            if new_score is None:
                continue
            # Relative change vs. baseline (assumes higher scores are better)
            relative_change = (new_score - baseline_score) / baseline_score
            # Flag if the drop exceeds the allowed threshold
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })
        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
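A typical deployment of the detector is as a CI gate: load a stored baseline, evaluate the candidate build, and fail the job when any tracked metric drops past the threshold. A sketch, where baseline.json and the metric names are assumptions about your setup:
PYTHON
import json
import sys
def regression_gate(new_results, baseline_path="baseline.json", threshold=0.05):
    """Exit non-zero (failing CI) if any tracked metric regressed past the threshold."""
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"accuracy": 0.91, "groundedness": 0.85}
    report = RegressionDetector(baseline, threshold).check_for_regression(new_results)
    if report["has_regression"]:
        for r in report["regressions"]:
            print(f"REGRESSION {r['metric']}: {r['baseline']:.3f} -> {r['current']:.3f}")
        sys.exit(1)
    print("No regressions detected")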
Benchmarking
Running Benchmarks
PYTHON
import numpy as np
class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset
    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}
        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])
            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)
        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
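BenchmarkRunner only assumes that each metric object exposes a name attribute and a calculate(prediction, reference, context) method, so existing scoring functions can be wrapped in a small adapter. A sketch reusing calculate_rouge from earlier in this skill:
PYTHON
class FunctionMetric:
    """Adapter exposing the name/calculate interface that BenchmarkRunner expects."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
    def calculate(self, prediction, reference, context=None):
        return self.fn(reference, prediction)
# Wrap the ROUGE-L helper defined earlier
rouge_l = FunctionMetric("rougeL", lambda ref, hyp: calculate_rouge(ref, hyp)["rougeL"])
# results = BenchmarkRunner(benchmark_dataset).run_benchmark(model, metrics=[rouge_l])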
Resources
- references/metrics.md: Comprehensive metric guide
- references/human-evaluation.md: Annotation best practices
- references/benchmarking.md: Standard benchmarks
- references/a-b-testing.md: Statistical testing guide
- references/regression-testing.md: CI/CD integration
- assets/evaluation-framework.py: Complete evaluation harness
- assets/benchmark-dataset.jsonl: Example datasets
- scripts/evaluate-model.py: Automated evaluation runner
Best Practices
- Multiple Metrics: Use diverse metrics for comprehensive view
- Representative Data: Test on real-world, diverse examples
- Baselines: Always compare against baseline performance
- Statistical Rigor: Use proper statistical tests for comparisons
- Continuous Evaluation: Integrate into CI/CD pipeline
- Human Validation: Combine automated metrics with human judgment
- Error Analysis: Investigate failures to understand weaknesses
- Version Control: Track evaluation results over time
Common Pitfalls
- Single Metric Obsession: Optimizing for one metric at the expense of others
- Small Sample Size: Drawing conclusions from too few examples
- Data Contamination: Testing on training data
- Ignoring Variance: Not accounting for statistical uncertainty
- Metric Mismatch: Using metrics not aligned with business goals