Field guide

How to tell if a character card is actually good

Spoiler: "good" has no fixed answer, because it depends on the model. Here's the test loop that replaces guessing — and the vibes trap most people fall into.

Spend a week on r/SillyTavernAI and you'll see the same argument on a loop: the model isn't the problem, the card is — 90% of cards are bad. Everyone nods. Then someone asks the obvious follow-up — okay, show me an actually good one — and the thread goes quiet. Nobody can produce a reproducible standard for what "good" means. That silence is the whole problem, and it has a concrete cause.

"Good" is not a property of the card

A character card is a prompt scaffold: description, personality, scenario, example dialogue, maybe a lorebook. How well that scaffold performs is a function of the model reading it. A card that's vivid and in-character on Claude can turn flat and repetitive on a 12B local model — same bytes, different behavior. So "is this card good?" is underspecified. The answerable question is "is this card good on the model I actually run, in the scenarios I actually play?"

The vibes-eval trap

The default way people judge a card is to load it, chat for a few turns, and form an impression. This feels like testing. It isn't. A single chat samples one path through an enormous space of possible conversations, and your own prompting steers the model toward the outcome you wanted to see. You can't tell whether the card carried the scene or whether you did. Worse, you can't compare two cards — or the same card on two models — because you never held the conversation constant.

What "good" actually decomposes into

Before you test, name what you're testing for. In practice, a card that disappoints is failing on at least one of these four axes, and each needs a different fix:

Character consistency — does it stay in voice across many turns, or drift into generic-assistant tone after message 10?
Scenario adherence — does it respect the setup (setting, relationships, constraints), or quietly reset to a blank room?
Knowledge it assumes — if the card leans on an IP or lore the model doesn't know, a smaller model will hallucinate it. That's a card problem (it needs a lorebook), not a model problem.
Robustness — does it hold under an off-script prompt, or does one curveball collapse the persona?
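
If you want the four axes above to be something you can fill in per run rather than keep in your head, a small record type is enough. This is a minimal sketch: the field names, the 1-5 range check, and the `weakest_axis` helper are illustrative assumptions, not part of any card format or tool.

```python
from dataclasses import dataclass, asdict

# The four axes from the list above; these identifiers are illustrative only.
AXES = ("consistency", "scenario_adherence", "assumed_knowledge", "robustness")


@dataclass(frozen=True)
class AxisScores:
    """One run's gut scores, 1 (bad) to 5 (good), one field per axis."""
    consistency: int
    scenario_adherence: int
    assumed_knowledge: int
    robustness: int

    def __post_init__(self):
        # Reject anything outside the 1-5 scale so runs stay comparable.
        for axis, value in asdict(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be between 1 and 5, got {value}")

    def weakest_axis(self) -> str:
        """Name of the lowest-scoring axis, i.e. where to look for a fix first."""
        scores = asdict(self)
        return min(AXES, key=lambda axis: scores[axis])

    def overall(self) -> float:
        """Plain average across the four axes."""
        scores = asdict(self)
        return sum(scores.values()) / len(scores)


run = AxisScores(consistency=4, scenario_adherence=5, assumed_knowledge=2, robustness=3)
print(run.weakest_axis())  # assumed_knowledge
print(run.overall())       # 3.5
```

Keeping the axes separate matters more than the average: a low `assumed_knowledge` score points at a missing lorebook, while a low `robustness` score points at the persona itself.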

The test loop that replaces guessing

The honest method is boring, which is exactly why almost nobody does it — and why doing it is an edge:

Write 2–3 fixed scenarios as short opening messages — one ordinary, one that stresses the persona, one off-script curveball.
Run the same card against the same scenarios on the 2–3 models you care about (e.g. a frontier model and the local one you actually run).
Score each run against the four axes above. Even a 1–5 gut score works, as long as the scenario was held constant so the comparison is fair.
Read the spread. A card that's 5/5 on a frontier model and 2/5 on your local model is a card you'll be disappointed by every night, and you'd never have known from a single happy-path chat.
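
The steps above can be sketched as a small matrix runner. Everything named here is an assumption for illustration: the scenario texts, the model labels, and the `generate` and `score` callbacks, which stand in for your own backend call and your own 1-5 judgment. The fake backend and canned scores exist only so the sketch runs offline.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical fixed openers: one ordinary, one persona stress, one curveball.
SCENARIOS: Dict[str, str] = {
    "ordinary": "You push open the tavern door and nod at the bartender.",
    "stress": "You accuse the character of lying about their past.",
    "curveball": "Forget the story for a second. Can you help me debug my code?",
}

Results = Dict[Tuple[str, str], int]  # (model, scenario name) -> 1-5 score


def run_matrix(
    card: str,
    models: List[str],
    generate: Callable[[str, str, str], str],
    score: Callable[[str, str, str], int],
) -> Results:
    """Run every model against every fixed scenario with the same card."""
    results: Results = {}
    for model, (name, opener) in product(models, SCENARIOS.items()):
        reply = generate(model, card, opener)               # your backend call
        results[(model, name)] = score(model, name, reply)  # your 1-5 judgment
    return results


def spread(results: Results, models: List[str]) -> Tuple[Dict[str, float], float]:
    """Mean score per model, plus the gap between the best and worst model."""
    per_model: Dict[str, float] = {}
    for m in models:
        scores = [s for (model, _), s in results.items() if model == m]
        per_model[m] = sum(scores) / len(scores)
    return per_model, max(per_model.values()) - min(per_model.values())


# Stand-ins so the sketch runs offline: a fake backend and pre-recorded scores.
CANNED = {
    ("frontier", "ordinary"): 5, ("frontier", "stress"): 5, ("frontier", "curveball"): 5,
    ("local-12b", "ordinary"): 3, ("local-12b", "stress"): 2, ("local-12b", "curveball"): 1,
}


def fake_generate(model: str, card: str, opener: str) -> str:
    return f"[{model}] reply to: {opener}"


def canned_score(model: str, scenario: str, reply: str) -> int:
    return CANNED[(model, scenario)]


MODELS = ["frontier", "local-12b"]
results = run_matrix("<card text>", MODELS, fake_generate, canned_score)
per_model, gap = spread(results, MODELS)
print(per_model, gap)
```

The design choice that matters is that `SCENARIOS` is fixed outside the loop: every model sees the same openers, so the gap returned by `spread` reflects the card and the model rather than how you happened to steer the chat.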

If you want a concrete loop for the model-comparison part, including sampler traps and a back-of-envelope cost estimate, the OpenRouter A/B testing guide walks through running one card across models without fooling yourself.

Why this is tedious by hand

The method is sound; the bookkeeping is the killer. You're juggling the same card across hosts, keeping scenarios identical, swapping models, and recording scores somewhere you'll actually find them later. Do it for one card and it's an afternoon. Do it for a library of fifty — every time you tweak a description or migrate a card between Chub, RisuAI and SillyTavern — and it never happens. Which is why most people fall back to vibes, and the "90% of cards are bad" complaint never gets a real answer.
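
To make the bookkeeping problem concrete, here is one way the by-hand version might look: fingerprint the card's content so a score is always tied to the exact version that earned it, then write rows to a CSV. The card fields, the 12-character hash length, and the column names are all assumptions for this sketch, not a format any of the hosts above define.

```python
import csv
import hashlib
import io
import json

FIELDS = ["card_hash", "model", "scenario", "score"]


def card_fingerprint(card: dict) -> str:
    """Short, stable hash of the card's content; any edit to any field changes it."""
    # sort_keys makes the hash independent of the order fields were saved in.
    canonical = json.dumps(card, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def write_scores(out, rows: list) -> None:
    """Write a header plus one CSV row per scored run."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical card, then the same card after a description tweak.
card_v1 = {"name": "Mira", "description": "A dry-witted innkeeper.", "scenario": "A roadside inn."}
card_v2 = dict(card_v1, description="A dry-witted innkeeper with a secret.")

rows = [
    {"card_hash": card_fingerprint(card_v1), "model": "local-12b", "scenario": "ordinary", "score": 3},
    {"card_hash": card_fingerprint(card_v2), "model": "local-12b", "scenario": "ordinary", "score": 4},
]

buf = io.StringIO()  # swap for open("scores.csv", "w", newline="") to write a real file
write_scores(buf, rows)
print(buf.getvalue())
```

Even this much shows where the tedium comes from: every tweak produces a new fingerprint, and every new fingerprint means rerunning the full scenario and model matrix to keep the scores honest.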

What we're building

A workbench that scores a card across models for you.

TavernAI.Cards keeps your whole library in sync across Chub, RisuAI and SillyTavern, and runs each card through fixed scenarios on multiple models — so "is this card good?" becomes a number you can compare, not a vibe you have to trust. It's the boring method, automated.

Join the waitlist →