Features
Evaluate your AI with evidence. Preset and custom rubrics, LLM-as-a-Judge, HITL, and insights — all in one place.
What you get
Datasets & Inputs
Upload a CSV with column mapping, or use sample datasets to get started. Dataset size scales with your plan (1 GB to unlimited).
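To illustrate what column mapping means in practice, here is a minimal sketch in Python. It is not the product's API: the function name `load_rows` and the target field names (`input`, `output`, `expected`) are hypothetical, chosen only to show how arbitrary CSV headers could be renamed to the fields an evaluation expects.

```python
import csv
import io


def load_rows(csv_text: str, column_mapping: dict[str, str]) -> list[dict[str, str]]:
    """Read CSV text and rename columns according to column_mapping.

    column_mapping maps a CSV header name to a target field name.
    Both the function and the field names are illustrative assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = set(reader.fieldnames or [])
    missing = set(column_mapping) - headers
    if missing:
        raise ValueError(f"CSV is missing mapped columns: {sorted(missing)}")
    # Keep only the mapped columns, renamed to their target field names.
    return [
        {target: row[source] for source, target in column_mapping.items()}
        for row in reader
    ]


sample_csv = "question,model_answer,gold\nCapital of France?,Paris,Paris\n"
rows = load_rows(
    sample_csv,
    {"question": "input", "model_answer": "output", "gold": "expected"},
)
```

Failing early on a missing column is a design choice for the sketch: a mapping error is easier to fix before an evaluation run than after it.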
Evaluation Rubrics
Five preset packs: General Quality, RAG/Retrieval, Safety & Compliance, Conversational, Content Generation — on all plans including Try Free. Define custom rubrics on Starter+.
LLM-as-a-Judge
A judge LLM evaluates your outputs against your rubrics and returns scores, pass/fail, reasoning, and evidence for every row.
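As a rough picture of the per-row result described above, the sketch below models one judge verdict as a Python dataclass and computes a pass rate over a run. The class name `JudgeResult`, its field names, and the `pass_rate` helper are assumptions made for illustration; the actual result schema may differ.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    # Hypothetical shape of one judge verdict for one row and one rubric criterion.
    row_id: int
    criterion: str
    score: float
    passed: bool
    reasoning: str
    evidence: str


def pass_rate(results: list[JudgeResult]) -> float:
    """Fraction of results marked as passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


results = [
    JudgeResult(1, "faithfulness", 0.9, True, "Answer matches the source.", "Quoted sentence 2."),
    JudgeResult(2, "faithfulness", 0.3, False, "Claim is not in the source.", "No supporting passage."),
]
rate = pass_rate(results)
```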
Statistical Metrics
ROUGE, METEOR, BERTScore, BLEU, F1 Score, Exact Match, and Toxicity detection. No LLM API key required.
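Two of the listed metrics, Exact Match and F1 Score, are simple enough to sketch without any model or API key. The Python below shows one common formulation (case-insensitive exact match and token-overlap F1, as used in question-answering benchmarks). It is an illustration under those assumptions, not necessarily how the platform computes them; the function names and the whitespace-only normalization are choices made for the example.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Lowercase and split on whitespace. Real implementations often also
    # strip punctuation and articles; this sketch keeps it minimal.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the prediction "the cat sat" with the reference "the cat" gives precision 2/3 and recall 1, so F1 is 0.8, while exact match is false.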
Human-in-the-Loop (HITL)
Available on Growth+. Add human annotators and reviewers to the loop for nuanced assessments that automated metrics can miss.
Insights & Analysis
Evaluation analytics, metric breakdowns, and pass/fail distributions on all plans. Run comparison and quality tracking over time are coming soon.
Teams & Collaboration
Invite teammates and share rubrics and runs. 1 seat (Free), 2 (Starter), 10 (Growth), unlimited (Enterprise).
Security & Enterprise
Cloud-hosted. Enterprise customers can control data management and run deployments within their own infrastructure.