Field guide · Model evaluation

A/B testing roleplay character cards with OpenRouter models.

Your card sings on Claude Opus and collapses into a customer-service bot on GLM. Same prompt, same greetings, different soul. Here’s how to find out which model actually fits your character — and how to spend less than a dinner doing it.

❦

fol. v.r

The same card, two souls

Every creator who has shared a card knows the feeling: you write a 4,000-token persona, tune the greetings until they sound right on your daily-driver model, and then a friend loads the same card on a different host and reports that the character is “polite but flat,” or “keeps breaking the fourth wall,” or “refuses everything.”

The card didn’t change. The model did. Alignment training, default sampler behaviour, instruction-following bias, and long-context attenuation all push a character in different directions. The honest move is to stop arguing about which model is “best” and start measuring which one is best for this card.

fol. vi.r

Why a vibes check is not enough

A single chat is a sample size of one. Three problems make vibes-only evaluation actively misleading for roleplay cards:

Persona drift. Some models hold a character for twenty turns; others drift toward their RLHF baseline by turn six. You only notice if you actually run the long scenarios.
Refusal surface. Conflict, NSFW, body-horror, morally grey choices — every provider draws a different line, and the line moves with system-prompt phrasing. A single “it refused once” observation tells you almost nothing.
Tone fingerprint. Two models can both stay in character and still feel radically different — one verbose and earnest, one clipped and dry. Which one fits your character is a taste decision, but you need comparable outputs to make it.

The cure isn’t a benchmark suite — those measure reasoning, not roleplay. The cure is a small, fixed scenario set that you replay across models and grade with a stronger judge model.

fol. vii.r

The minimum viable method

A usable evaluation has four moving parts: scenarios, runs, a judge, and a rubric. None of them has to be fancy. The point is consistency.

Five to eight scenarios. Pick them to cover the surface of your card: a mundane daily interaction, a crisis or emotional break, an NSFW or boundary moment (if relevant), a personality-conflict scene (someone challenges the character’s values), and a long-context recall test (a callback to something introduced twenty turns earlier).
Five to ten runs per scenario. Stochastic generation means one sample is noise. Five is the floor for “I can tell these models apart.” Ten is comfortable.
A judge model stronger than the subjects. Self-evaluation is biased — models systematically rate their own outputs higher. Use a separate, capable judge (Claude Sonnet, GPT-class, or Gemini Pro tier) and feed it the card, the scenario, and the candidate reply.
A rubric with four to six dimensions. Persona consistency, refusal/break behaviour, tone match, callback accuracy, prose quality, and an overall score. Keep each on a 1–5 scale and ask the judge for a one-sentence reason. The reasons are what you’ll actually read.
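The judge and rubric steps can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the snake_case dimension keys, the `Scenario` shape, the prompt wording, and the JSON reply format are all assumptions made for the example. The only parts taken from the method above are the six dimensions, the 1–5 scale, and the one-sentence reasons.

```python
import json
from dataclasses import dataclass

# Rubric dimensions from the list above; the snake_case keys are illustrative.
RUBRIC = [
    "persona_consistency",
    "refusal_break_behaviour",
    "tone_match",
    "callback_accuracy",
    "prose_quality",
    "overall",
]


@dataclass
class Scenario:
    """One fixed scene replayed across every candidate model (hypothetical shape)."""
    name: str
    opening: str


def build_judge_prompt(card_text: str, scenario: Scenario, reply: str) -> str:
    """Assemble what the judge sees: the card, the scenario, and the candidate reply."""
    dims = ", ".join(RUBRIC)
    return (
        "You are grading a roleplay reply against a character card.\n"
        f"Score each dimension from 1 to 5: {dims}.\n"
        'Answer with JSON only, shaped like {"scores": {...}, "reasons": {...}}, '
        "giving one sentence of reason per dimension.\n\n"
        f"[CARD]\n{card_text}\n\n"
        f"[SCENARIO: {scenario.name}]\n{scenario.opening}\n\n"
        f"[REPLY]\n{reply}\n"
    )


def parse_judge_reply(raw: str) -> dict:
    """Validate the judge's JSON: every dimension present, every score an int in 1..5."""
    data = json.loads(raw)
    scores = data["scores"]
    reasons = data.get("reasons", {})
    for dim in RUBRIC:
        value = scores.get(dim)
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"bad or missing score for {dim!r}: {value!r}")
    return {"scores": {d: scores[d] for d in RUBRIC}, "reasons": reasons}
```

Rejecting malformed judge output outright, rather than patching it, keeps a flaky judge from quietly skewing the averages; a rejected reply can simply be re-requested.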

This is closer to taste-testing than benchmarking, and that’s the point. You’re not trying to publish a leaderboard. You’re trying to decide whether to ship your card with a “recommended model” badge that says claude-opus or gemini-flash.
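Turning graded runs into that badge decision is mostly averaging. A rough sketch, assuming each judged run has been flattened into a dict with hypothetical `model`, `scenario`, and `overall` keys; the tie-break on spread is one possible choice, not a rule from the method:

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarise(results):
    """Collapse per-run judge scores into a per-model mean and spread.

    `results` is a list of dicts with hypothetical keys:
    {"model": str, "scenario": str, "overall": int}.
    """
    by_model = defaultdict(list)
    for row in results:
        by_model[row["model"]].append(row["overall"])
    return {
        model: {"mean": mean(scores), "spread": pstdev(scores), "n": len(scores)}
        for model, scores in by_model.items()
    }


def recommend(summary):
    """Pick the model with the highest mean; ties go to the smaller spread."""
    return max(summary, key=lambda m: (summary[m]["mean"], -summary[m]["spread"]))
```

Keeping the spread next to the mean matters with five to ten runs per scenario: a model that averages 4 by alternating 3s and 5s is a different recommendation from one that scores 4 every time.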

fol. viii.r

Why OpenRouter is the right harness

You could wire up five SDKs, juggle five sets of keys, and normalise five message formats yourself. Most people who try this once never try again. OpenRouter sits in front of most of the model zoo with a single OpenAI-shaped API, reports token usage in each response (plus the cost of the call when usage accounting is enabled), and lets you swap models by changing one string.

Concretely, the harness shrinks to:

POST https://openrouter.ai/api/v1/chat/completions
Authorization: Bearer $OPENROUTER_API_KEY
Content-Type: application/json

{
  "model": "anthropic/claude-3.5-sonnet",
  "messages": [
    { "role": "system", "content": "<card rendered as a system prompt>" },
    { "role": "user",   "content": "<scenario opening>" }
  ],
  "temperature": 0.85,
  "max_tokens": 600
}

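Swap the model string for another slug from the catalogue, such as google/gemini-2.0-flash-001, deepseek/deepseek-chat, x-ai/grok-2-1212, or z-ai/glm-4.6, and the rest of the rig is identical. Slugs are versioned and get retired, so copy them from the current model list rather than from memory. Token counts come back in the response’s usage block, along with the cost of the call when usage accounting is enabled, so you can total spend without hitting a separate billing API.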
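In Python, the same request plus the loop over the matrix might look like the sketch below. It uses only the standard library. The helper names `build_payload`, `call_openrouter`, and `matrix` are made up for this example, and rendering the card into a system prompt is left to you.

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_payload(model: str, system_prompt: str, opening: str,
                  temperature: float = 0.85, max_tokens: int = 600) -> dict:
    """Build the request body shown above for one (model, scenario) pair."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": opening},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def call_openrouter(payload: dict, timeout: float = 120.0) -> dict:
    """POST one payload and return the decoded JSON response (network call)."""
    request = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def matrix(models, openings, runs):
    """Yield every (model, scenario_index, run_index) cell of the evaluation grid."""
    for model in models:
        for index, _ in enumerate(openings):
            for run in range(runs):
                yield model, index, run
```

Because the API is OpenAI-shaped, the reply text sits at `response["choices"][0]["message"]["content"]` and the token counts under `response["usage"]`. Store the whole response per cell so a later judging pass can be rerun without regenerating.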

fol. ix.r

A back-of-envelope cost

Here’s a concrete sketch. Say your scenarios average 3,000 input tokens (card + system + scenario history) and 500 output tokens. Eight scenarios, five runs each, six candidate models, one judge call per generated reply:

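Generations: 8 × 5 × 6 = 240 calls, or about 720k input tokens and 120k output tokens. At a mixed average around $1–$3 per million input tokens and $3–$15 per million output tokens across the candidate set, that’s roughly $1–$4 of generation cost on a typical mid-tier mix.
Judging: 240 judge calls, each reading the card (~3k tokens) plus the reply (~500) and returning a short rubric (~150 tokens), or about 840k input tokens and 36k output tokens. On a Haiku-class judge, figure another $1–$2; a Sonnet-class judge, closer to the “stronger than the subjects” rule, costs more per token.
Total: an entire matrix evaluation for under $10 on these assumptions, rerunnable any time you change the card.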
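The arithmetic is easy to rerun with your own numbers. A small sketch using the token counts and the illustrative price range from this section; the function name is hypothetical, and judge prices are deliberately left for you to plug in:

```python
def cost_usd(calls, in_tokens, out_tokens, in_price, out_price):
    """Cost of `calls` requests; prices are USD per million tokens."""
    total_in = calls * in_tokens
    total_out = calls * out_tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000


calls = 8 * 5 * 6  # scenarios x runs x candidate models = 240

# Generation at the low and high ends of the illustrative price range.
gen_low = cost_usd(calls, 3_000, 500, in_price=1.0, out_price=3.0)
gen_high = cost_usd(calls, 3_000, 500, in_price=3.0, out_price=15.0)

# Judge token volume: card (~3k) + reply (~500) in, short rubric (~150) out.
judge_in_tokens = calls * (3_000 + 500)
judge_out_tokens = calls * 150

print(f"calls: {calls}")                                   # calls: 240
print(f"generation: ${gen_low:.2f} to ${gen_high:.2f}")    # generation: $1.08 to $3.96
print(f"judge tokens: {judge_in_tokens} in, {judge_out_tokens} out")
```

Input tokens dominate at the cheap end and output tokens at the expensive end, so trimming the card helps most on budget models and capping `max_tokens` helps most on premium ones.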

These numbers are illustrative, not quoted. OpenRouter prices change often and individual providers differ. The point is the order of magnitude: a serious eval costs about as much as a meal, not a research budget.

fol. x.r

Traps that will mislead you

Three sharp edges trip up nearly every first attempt:

Judges love length. LLM judges, when asked to rate “quality,” systematically prefer longer, more verbose replies. Either tell the judge explicitly to weight concision, or add a length-normalisation step (penalise replies above a threshold). Otherwise you’re measuring word count.
System-prompt parsing diverges. Some models follow long, structured system prompts faithfully; others treat them as suggestions. A card that uses heavy XML-tagged sections will look great on Claude and confused on a model that ignores the tags. Note which models need a simplified system rendering and treat that as part of the eval result, not noise.
Temperature 0 is not deterministic. Provider-side batching, prompt caching, and tie-breaking all introduce variance even at temperature: 0. Run multiple samples anyway. If you want reproducibility for a release-blocker test, fix a seed where the API supports it and accept that “close enough” is the real bar.
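For the first trap, length normalisation can be as blunt as a per-word penalty applied after the judge has scored. A minimal sketch; the 250-word threshold and the half-point penalty are placeholder values chosen for illustration, not tuned recommendations:

```python
def length_adjusted(score: float, reply: str, max_words: int = 250,
                    penalty_per_100: float = 0.5, floor: float = 1.0) -> float:
    """Subtract a fixed penalty per 100 words over `max_words`, never below `floor`."""
    words = len(reply.split())
    excess = max(0, words - max_words)
    adjusted = score - penalty_per_100 * (excess / 100)
    return max(floor, adjusted)
```

With these defaults, a 450-word reply that the judge scored 4 comes out at 3. Set the threshold per card: a terse character and a florid narrator should not share a word budget.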

fol. xi.r

What we’re building in Studio

The method above is roughly three afternoons of glue code for someone comfortable with the OpenRouter API. For everyone else — and for the version of yourself who doesn’t want to maintain glue code — it’s exactly what tavernai.cards Studio is being shaped around.

Scenario library — reusable scenario templates per card, with the common archetypes (daily, crisis, conflict, long-recall) seeded in.
Matrix runner — pick the candidate models from the OpenRouter catalogue, set runs-per-scenario, and let it grind.
Built-in judge — Claude-class judge with a length-normalised rubric, so you can’t accidentally optimise for verbosity.
Auto-report — per-model strengths, weakest scenario, suggested “recommended model” for the card’s public listing.

Studio is the paid tier (a coffee a week, roughly) and the matrix runner is its centrepiece. The free tier still gets the linter, the converter, and the multi-host exporter we wrote about in the previous folios.

❦

If you maintain a card you actually care about, you’ll want to know which model it sings on. Put your name on the scroll and we’ll open Studio access in waves.
