Features

Evaluate your AI with evidence. Preset and custom rubrics, LLM-as-a-Judge, HITL, and insights — all in one place.

What you get

Datasets & Inputs

Upload a CSV with column mapping, or start from a sample dataset. Dataset size scales with your plan (1 GB to unlimited).
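For a concrete picture of column mapping, here is a minimal Python sketch. The file name, the headers (question, model_answer, gold_answer), and the mapped-to field names are all hypothetical; in the app you do the same thing by pointing your CSV's headers at the fields an evaluation expects.

```python
import csv

# Hypothetical mapping from your CSV's headers to the fields an
# evaluation run expects. All names here are illustrative, not
# TryEval's actual schema.
COLUMN_MAP = {
    "question": "input",
    "model_answer": "output",
    "gold_answer": "reference",
}

with open("my_dataset.csv", newline="", encoding="utf-8") as f:
    rows = [
        {COLUMN_MAP[k]: v for k, v in row.items() if k in COLUMN_MAP}
        for row in csv.DictReader(f)
    ]

print(rows[0])  # e.g. {'input': '...', 'output': '...', 'reference': '...'}
```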

Evaluation Rubrics

Five preset packs: General Quality, RAG/Retrieval, Safety & Compliance, Conversational, and Content Generation, available on all plans including Try Free. Define custom rubrics on Starter+.
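As a sketch of what a custom rubric can capture, here is one possible shape: weighted criteria, a score scale, and a pass threshold. Every field name below is hypothetical; TryEval's real rubric schema may differ.

```python
# A minimal sketch of a custom rubric. All field names are hypothetical.
custom_rubric = {
    "name": "Support Reply Quality",
    "criteria": [
        {"id": "accuracy", "description": "Answer is factually correct", "weight": 0.5},
        {"id": "tone", "description": "Polite, on-brand tone", "weight": 0.3},
        {"id": "brevity", "description": "No unnecessary padding", "weight": 0.2},
    ],
    "scale": {"min": 1, "max": 5},
    "pass_threshold": 3.5,  # weighted score at or above this counts as a pass
}
```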

LLM-as-a-Judge

A judge LLM evaluates your outputs against your rubrics and returns scores, pass/fail, reasoning, and evidence for every row.
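To make that flow concrete, here is a minimal sketch of judge-based scoring using the rubric shape sketched above. It is independent of TryEval's internals: judge stands in for any chat-completion call instructed to reply with strict JSON, and every name is illustrative.

```python
import json

def build_prompt(rubric: dict, row: dict) -> str:
    # Ask the judge to score each criterion and cite its evidence.
    return (
        "You are an evaluator. Score the OUTPUT against each rubric "
        "criterion from 1 to 5. Reply with strict JSON containing "
        '"scores" (an object keyed by criterion id), "reasoning", '
        'and "evidence".\n\n'
        f"RUBRIC: {json.dumps(rubric)}\n"
        f"INPUT: {row['input']}\n"
        f"OUTPUT: {row['output']}\n"
    )

def evaluate_row(judge, rubric: dict, row: dict) -> dict:
    """Score one dataset row and attach a pass/fail verdict."""
    verdict = json.loads(judge(build_prompt(rubric, row)))
    weights = {c["id"]: c["weight"] for c in rubric["criteria"]}
    verdict["weighted_score"] = sum(
        verdict["scores"][cid] * w for cid, w in weights.items()
    )
    verdict["passed"] = verdict["weighted_score"] >= rubric["pass_threshold"]
    return verdict
```

The pass/fail rule here (weighted score at or above the rubric's threshold) is one reasonable convention, not necessarily TryEval's aggregation.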

Statistical Metrics

ROUGE, METEOR, BERTScore, BLEU, F1 Score, Exact Match, and Toxicity detection. No LLM API key required.
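Two of these, Exact Match and token-level F1, are simple enough to show from scratch, which also illustrates why no LLM API key is needed. The normalization below (lowercase, strip punctuation) is one common convention, not necessarily TryEval's exact recipe.

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of token precision and recall over the overlap.
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris.", "paris"))                            # 1.0
print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
```

ROUGE, METEOR, BERTScore, BLEU, and toxicity detection follow the same no-API-key pattern, though in practice they come from established libraries rather than hand-rolled code.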

Human-in-the-Loop (HITL)

Add human annotators and reviewers to the loop for nuanced assessments that automated metrics can miss. Available on Growth+.

Insights & Analysis

Evaluation analytics, metric breakdowns, and pass/fail distributions on all plans. Run comparison and quality tracking over time are coming soon.

Teams & Collaboration

Invite teammates and share rubrics and runs. 1 seat (Try Free), 2 (Starter), 10 (Growth), unlimited (Enterprise).

Security & Enterprise

Cloud-hosted by default. Enterprise customers can control data management and run deployments within their own infrastructure.

Judge personas & presets: coming soon (Growth+).

Supports OpenAI, Anthropic, Google Gemini, and Groq as judge providers.

Built for cross-functional teams

Product, engineering, QA, and research align on the same rubrics, runs, and scorecards.

Get started

Run an evaluation in your browser — no sign-up required. Your data stays local, API calls go to your provider, and nothing is stored by TryEval.