Evaluating methods and architectures for maintaining consistent, coherent identities in LLM-based systems—comparing prompting strategies, memory mechanisms, and system designs across temporal consistency, contextual stability, value-preference alignment, and adversarial robustness.
Current persona benchmarks compare base models on static attribute consistency—given a fixed persona, which LLM contradicts itself least? This framing conflates model capability with system design. A vanilla GPT-4 call differs fundamentally from GPT-4 with structured prompting, memory augmentation, or fine-tuning.
CharBench shifts focus from "which model" to "which method". We evaluate complete systems: prompting strategies (vanilla, chain-of-thought, cognitive frameworks), memory architectures (context window, RAG, summarization), and hybrid approaches. The same base model can appear multiple times with different techniques.
We measure psychological coherence—the deep structure that makes humans consistent. Four dimensions: temporal consistency (self-contradiction over extended dialogue), contextual coherence (identity stability across social situations), value-preference alignment (whether surface choices reflect stated values), and adversarial robustness (resistance to identity manipulation).
CharBench evaluates psychological coherence across four complementary dimensions.
Measures self-contradiction rate across extended multi-turn conversations. Given preferences stated at turn N, does the model contradict them at turn N+K? We evaluate at 10, 25, and 50+ turn intervals.
contradiction_rate — lower is better. Computed via entailment classifier on preference-relevant response pairs.
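A minimal sketch of how a contradiction rate like this could be computed, assuming an off-the-shelf MNLI model; the model name, pairing logic, and function signature are illustrative assumptions, not CharBench's fine-tuned classifier.

```python
# Minimal sketch, assuming an off-the-shelf MNLI model; CharBench's actual
# classifier is fine-tuned on human annotations (see Methodology below).
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def contradiction_rate(pairs):
    """pairs: list of (earlier preference statement, later response) tuples."""
    preds = nli([{"text": earlier, "text_pair": later} for earlier, later in pairs])
    contradictions = sum(p["label"] == "CONTRADICTION" for p in preds)
    return contradictions / max(len(pairs), 1)  # lower is better
```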
Tests identity stability across 8 social contexts (professional, family, retail, healthcare, social, creative, financial, casual). Same persona should exhibit consistent core traits while adapting appropriately to context.
cross_context_alignment — embedding similarity of value-laden responses across contexts, penalizing inappropriate rigidity.
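A sketch of the idea using sentence embeddings; the encoder choice and the 0.95 rigidity threshold are illustrative placeholders rather than the benchmark's calibrated values.

```python
# Minimal sketch, assuming sentence-transformers embeddings; the rigidity
# threshold below is a placeholder, not the benchmark's calibrated value.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cross_context_alignment(responses_by_context):
    """responses_by_context: dict mapping context name -> value-laden response."""
    texts = list(responses_by_context.values())
    emb = encoder.encode(texts, convert_to_tensor=True)
    sims = [float(util.cos_sim(emb[i], emb[j]))
            for i, j in combinations(range(len(texts)), 2)]
    mean_sim = sum(sims) / len(sims)
    # Penalize near-verbatim answers across contexts (inappropriate rigidity)
    return mean_sim - max(0.0, mean_sim - 0.95)
```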
Evaluates whether surface-level choices reflect stated deeper values. If a persona values sustainability, do their product/activity preferences follow? Tests hierarchical consistency of belief structures.
value_alignment_score — accuracy on value-to-preference inference tasks with distractor options.
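In essence this is accuracy over multiple-choice items; a minimal sketch follows, where the item field names (`choice`, `correct`) are assumed for illustration.

```python
# Minimal sketch; item field names are assumptions, not the official schema.
def value_alignment_score(items):
    """items: list of dicts, each holding the option the system chose ('choice')
    and the value-consistent option among the distractors ('correct')."""
    hits = sum(1 for item in items if item["choice"] == item["correct"])
    return hits / max(len(items), 1)
```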
Attempts to manipulate the persona into contradicting core beliefs through social engineering, false premises, authority appeals, and gradual drift. Measures resistance to identity compromise.
manipulation_resistance — % of adversarial probes successfully resisted while maintaining coherent responses.
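A sketch of the aggregation, where the `compromised` and `coherent` flags stand in for CharBench's learned judgments of identity compromise and response coherence.

```python
# Minimal sketch; 'compromised' and 'coherent' are placeholder flags for the
# benchmark's learned judgments per adversarial probe.
def manipulation_resistance(probe_results):
    """probe_results: list of dicts with boolean 'compromised' and 'coherent' flags."""
    resisted = sum(1 for r in probe_results if not r["compromised"] and r["coherent"])
    return 100.0 * resisted / max(len(probe_results), 1)  # % of probes resisted
```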
We compare methods and system architectures rather than bare models, so the same base model can appear in multiple rows with different techniques.
| Rank | Method / System | Type | Base Model | TEMP ↑ | CTXT ↑ | VPREF ↑ | ADV ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| 1 | Thinking Patterns (Art of X) | Prompt | GPT-4o | 78.4 | 72.1 | 68.9 | 61.2 | 70.8 |
| 2 | Thinking Patterns (Art of X) | Prompt | Claude 3.5 | 76.2 | 70.4 | 66.8 | 59.1 | 68.5 |
| 3 | MoCoRP (Wang et al., 2024) | NLI | GPT-4o | 72.8 | 68.4 | 64.2 | 57.6 | 66.1 |
| 4 | Aaru (Fink et al., 2024) | Proprietary | Undisclosed | 71.4 | 67.2 | 63.8 | 56.4 | 65.3 |
| 5 | PCL (Chen et al., 2025) | Finetune | LLaMA 3.1 70B | 73.1 | 66.2 | 63.4 | 55.8 | 65.2 |
| 6 | Native Reasoning baseline | Reason | o1 | 71.2 | 65.8 | 62.4 | 58.7 | 64.9 |
| 7 | SPT (Lee et al., 2023) | Memory | GPT-4o | 70.4 | 66.1 | 61.8 | 54.2 | 63.6 |
| 8 | Chain-of-Thought baseline | Prompt | GPT-4o | 69.4 | 64.1 | 60.8 | 56.2 | 63.0 |
| 9 | SBS (Nguyen et al., 2024) | Finetune | LLaMA 3.1 70B | 68.7 | 64.8 | 60.2 | 55.1 | 62.5 |
| 10 | Artificial Societies (2024) | Proprietary | Undisclosed | 68.2 | 64.8 | 60.4 | 54.6 | 62.3 |
| 11 | Vanilla baseline | Prompt | GPT-4o | 68.9 | 63.2 | 59.1 | 55.4 | 62.0 |
| 12 | PPA (Liu et al., 2024) | Memory | GPT-4o | 67.2 | 64.4 | 58.6 | 53.8 | 61.2 |
| 13 | Native Reasoning baseline | Reason | DeepSeek-R1 | 67.1 | 62.4 | 58.9 | 54.2 | 60.9 |
| 14 | RoleLLM (Wang et al., 2023) | Finetune | LLaMA 2 13B | 66.4 | 61.8 | 57.2 | 55.6 | 60.4 |
| 15 | DialogICL (Kim et al., 2024) | Prompt | GPT-4o | 65.8 | 62.4 | 58.1 | 52.7 | 59.9 |
| 16 | Vanilla baseline | Prompt | Claude 3.5 | 65.4 | 61.7 | 58.2 | 52.1 | 59.7 |
| 17 | RAG + Persona Memory (community) | Memory | GPT-4o | 64.1 | 67.2 | 55.8 | 51.4 | 59.4 |
| 18 | Vanilla baseline | Prompt | Gemini 1.5 Pro | 63.8 | 60.2 | 56.4 | 51.8 | 58.2 |
| 19 | InCharacter (Wang et al., 2024) | Prompt | GPT-4 | 62.4 | 59.1 | 55.8 | 52.3 | 57.6 |
| 20 | BoB, NLI-decoder (Song et al., 2021) | NLI | BERT-base | 59.2 | 55.8 | 52.4 | 49.1 | 54.4 |
| 21 | Character-LLM (Shao et al., 2023) | Finetune | LLaMA 2 7B | 57.8 | 54.2 | 51.6 | 48.9 | 53.3 |
| 22 | Vanilla baseline | Prompt | LLaMA 3.1 70B | 52.4 | 49.8 | 47.2 | 44.1 | 48.6 |
How CharBench constructs personas, generates conversations, and evaluates coherence.
We generate 2,048 personas using a hierarchical model grounded in personality psychology. Each persona includes: Big Five traits, Schwartz value priorities, demographic attributes, life experiences, and derived preference structures. Personas are validated for internal consistency before use.
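For concreteness, an illustrative persona record might look like the sketch below; the field names and example values are assumptions about the schema, following the attributes listed above.

```python
# Illustrative persona record; field names and values are assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class Persona:
    big_five: dict[str, float]          # openness, conscientiousness, extraversion, agreeableness, neuroticism
    schwartz_values: dict[str, float]   # e.g. universalism, self-direction, power priorities
    demographics: dict[str, str]        # age band, occupation, locale, ...
    life_experiences: list[str]         # salient biographical events
    preferences: dict[str, str] = field(default_factory=dict)  # derived surface preferences

example = Persona(
    big_five={"openness": 0.82, "conscientiousness": 0.61, "extraversion": 0.35,
              "agreeableness": 0.74, "neuroticism": 0.28},
    schwartz_values={"universalism": 0.9, "self_direction": 0.7, "power": 0.2},
    demographics={"age": "30-39", "occupation": "teacher", "locale": "urban"},
    life_experiences=["switched careers from finance to education"],
    preferences={"transport": "cycles to work", "diet": "mostly vegetarian"},
)
```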
Each persona engages in conversations across 8 contexts (12 conversations each, ~35 turns average). Conversations are generated adversarially using a separate LLM to probe for inconsistencies, with preferences emerging naturally rather than being stated explicitly.
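A schematic of that adversarial generation loop is sketched below; `probe_llm` and `system_under_test` are placeholder callables, and the probing prompt is an assumption about the general approach rather than the exact instruction we use.

```python
# Schematic of the adversarial conversation loop; callables and prompt text are placeholders.
def generate_conversation(persona, context, probe_llm, system_under_test, turns=35):
    history = []
    for _ in range(turns):
        # The probe model reads the transcript so far and crafts a turn designed to
        # surface latent preferences or elicit contradictions without naming them outright.
        user_turn = probe_llm(
            f"You are probing a persona in a {context} setting. "
            f"Given the transcript so far, ask something that could reveal an inconsistency:\n{history}"
        )
        reply = system_under_test(persona, history, user_turn)
        history.append({"user": user_turn, "assistant": reply})
    return history
```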
For each dimension, we use a combination of learned classifiers (fine-tuned on human annotations) and embedding-based metrics. Temporal consistency uses NLI models to detect contradictions. Contextual coherence measures cross-context embedding alignment. Value-preference uses accuracy on inference tasks. Adversarial uses attack success rate.
A subset of 500 conversations (across 50 personas) is annotated by 3 human raters for perceived coherence, authenticity, and consistency. Inter-annotator agreement is substantial (Krippendorff's α = 0.73), and our automated metrics correlate strongly with human judgment (r = 0.81).
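A sketch of this validation step, assuming per-conversation automated scores and aligned human ratings are available as parallel lists; the agreement computation uses the third-party `krippendorff` package.

```python
# Sketch of the human-validation check; input shapes are assumptions.
from scipy.stats import pearsonr
import krippendorff  # pip install krippendorff

def validate(automated_scores, ratings_per_rater):
    """ratings_per_rater: one list of ratings per rater, aligned by conversation."""
    alpha = krippendorff.alpha(reliability_data=ratings_per_rater,
                               level_of_measurement="interval")
    mean_human = [sum(col) / len(col) for col in zip(*ratings_per_rater)]
    r, _ = pearsonr(automated_scores, mean_human)
    return alpha, r
```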
Run CharBench on your system by providing a callable that takes a persona, the conversation history, and the current query, and returns your system's response.
```bash
# Install
pip install charbench
```

```python
# Define your method as a callable
def my_persona_system(persona, conversation, query):
    # Your prompting strategy, memory system, etc.
    system_prompt = build_my_prompt(persona)
    context = my_memory_retrieval(conversation)
    return call_llm(system_prompt, context, query)

# Run evaluation
from charbench import Evaluator

evaluator = Evaluator(
    system=my_persona_system,
    base_model="gpt-4o",  # for leaderboard categorization
)

results = evaluator.run(
    split="test",  # or "full" for 2048 personas
    dimensions=["temp", "ctxt", "vpref", "adv"],
)

print(results.summary())
# CharBench Score: 64.9
# ├── Temporal: 71.2
# ├── Contextual: 65.8
# ├── Value-Pref: 62.4
# └── Adversarial: 58.7

# Submit to leaderboard
results.submit(
    method_name="My Method",
    organization="My Org",
    description="Brief description of the technique",
)
```
If you use CharBench in your research, please cite our paper.