CharBench: A Benchmark for Psychological Coherence in LLM Systems

Evaluating methods and architectures for maintaining consistent, coherent identities in LLM-based systems—comparing prompting strategies, memory mechanisms, and system designs across temporal consistency, contextual stability, value-preference alignment, and adversarial robustness.

Abstract

Current persona benchmarks compare base models on static attribute consistency—given a fixed persona, which LLM contradicts itself least? This framing conflates model capability with system design. A vanilla GPT-4 call differs fundamentally from GPT-4 with structured prompting, memory augmentation, or fine-tuning.

CharBench shifts focus from "which model" to "which method". We evaluate complete systems: prompting strategies (vanilla, chain-of-thought, cognitive frameworks), memory architectures (context window, RAG, summarization), and hybrid approaches. The same base model can appear multiple times with different techniques.

We measure psychological coherence—the deep structure that makes humans consistent. Four dimensions: temporal consistency (self-contradiction over extended dialogue), contextual coherence (identity stability across social situations), value-preference alignment (whether surface choices reflect stated values), and adversarial robustness (resistance to identity manipulation).

2,048 personas · 24.6K conversations · 847K turns · 4 evaluation dimensions

Evaluation Tasks

CharBench evaluates psychological coherence across four complementary dimensions.

Temporal Consistency (TEMP)

Measures self-contradiction rate across extended multi-turn conversations. Given preferences stated at turn N, does the model contradict them at turn N+K? We evaluate at 10, 25, and 50+ turn intervals.

Metric: contradiction_rate — lower is better. Computed via an entailment classifier on preference-relevant response pairs.
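
A minimal sketch of this check, substituting an off-the-shelf MNLI model (roberta-large-mnli) for the benchmark's fine-tuned classifier:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Off-the-shelf stand-in for the benchmark's fine-tuned entailment classifier.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def contradiction_rate(pairs):
    """pairs: (response at turn N, response at turn N+K) about the same preference."""
    contradictions = 0
    for earlier, later in pairs:
        inputs = tok(earlier, later, return_tensors="pt", truncation=True)
        with torch.no_grad():
            pred = nli(**inputs).logits.argmax(dim=-1).item()
        contradictions += int(pred == 0)  # roberta-large-mnli: label 0 = contradiction
    return contradictions / len(pairs)   # lower is better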

Contextual Coherence (CTXT)

Tests identity stability across 8 social contexts (professional, family, retail, healthcare, social, creative, financial, casual). The same persona should exhibit consistent core traits while adapting appropriately to each context.

Metric: cross_context_alignment — embedding similarity of value-laden responses across contexts, penalizing inappropriate rigidity.
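
A sketch of the similarity term, assuming sentence-transformers embeddings; the rigidity penalty is omitted for brevity:

from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def cross_context_alignment(responses_by_context):
    """responses_by_context: context name -> concatenated value-laden responses."""
    embs = encoder.encode(list(responses_by_context.values()), normalize_embeddings=True)
    # Mean pairwise cosine similarity across contexts (vectors are unit-normalized).
    sims = [float(embs[i] @ embs[j]) for i, j in combinations(range(len(embs)), 2)]
    return float(np.mean(sims))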

Value-Preference Alignment (VPREF)

Evaluates whether surface-level choices reflect stated deeper values. If a persona values sustainability, do their product/activity preferences follow? Tests hierarchical consistency of belief structures.

Metric: value_alignment_score — accuracy on value-to-preference inference tasks with distractor options.
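
To make "distractor options" concrete, a hypothetical task item and the accuracy computation (field names are illustrative, not the released schema):

# One hypothetical item; the persona's stated value is "sustainability".
item = {
    "stated_value": "sustainability",
    "question": "Which weekend purchase fits this persona?",
    "options": [
        "second-hand bicycle",         # value-consistent answer
        "discount fast-fashion haul",  # distractor
        "gas-powered leaf blower",     # distractor
    ],
    "value_consistent": "second-hand bicycle",
}

def value_alignment_score(items, choices):
    """choices: the option the evaluated system picked for each item."""
    correct = sum(c == it["value_consistent"] for it, c in zip(items, choices))
    return correct / len(items)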

Adversarial Robustness (ADV)

Attempts to manipulate the persona into contradicting core beliefs through social engineering, false premises, authority appeals, and gradual drift. Measures resistance to identity compromise.

Metric: manipulation_resistance — % of adversarial probes successfully resisted while maintaining coherent responses.
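
Scoring reduces to a percentage, with the caveat that a probe only counts as resisted if the response also stays coherent; a sketch with illustrative field names:

def manipulation_resistance(probes):
    """probes: per-probe dicts with boolean 'resisted' and 'coherent' flags."""
    held = sum(p["resisted"] and p["coherent"] for p in probes)
    return 100.0 * held / len(probes)  # percentage, higher is better
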
Overall CharBench Score
CharBench = 0.30 × TEMP + 0.25 × CTXT + 0.25 × VPREF + 0.20 × ADV
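
In code, the aggregation is a fixed weighted average; the weights sum to 1.0, and all four inputs are on a 0-100, higher-is-better scale (so contradiction_rate must first be converted to a higher-is-better TEMP score):

def charbench_score(temp, ctxt, vpref, adv):
    """All sub-scores on a 0-100, higher-is-better scale; weights sum to 1.0."""
    return 0.30 * temp + 0.25 * ctxt + 0.25 * vpref + 0.20 * adv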

Leaderboard

Comparing methods and system architectures. The same base model can appear multiple times under different techniques.

Cost-Performance Frontier

[Scatter plot: CharBench score (higher is better) against average tokens per evaluation (cost proxy), with systems grouped by type: Prompting, Memory/RAG, NLI-Enhanced, Reasoning, Fine-tuned, Proprietary.]

CharBench Full (2,048 personas)

Updated Jan 16, 2026
Rank Method / System Type Base Model TEMP ↑ CTXT ↑ VPREF ↑ ADV ↑ Overall ↑
1 Thinking Patterns (Art of X) Prompt GPT-4o 78.4 72.1 68.9 61.2 70.8
2 Thinking Patterns (Art of X) Prompt Claude 3.5 76.2 70.4 66.8 59.1 68.5
3 MoCoRP (Wang et al. 2024) NLI GPT-4o 72.8 68.4 64.2 57.6 66.1
4 Aaru (Fink et al. 2024) Proprietary Undisclosed 71.4 67.2 63.8 56.4 65.3
5 PCL (Chen et al. 2025) Finetune LLaMA 3.1 70B 73.1 66.2 63.4 55.8 65.2
6 Native Reasoning (baseline) Reason o1 71.2 65.8 62.4 58.7 64.9
7 SPT (Lee et al. 2023) Memory GPT-4o 70.4 66.1 61.8 54.2 63.6
8 Chain-of-Thought (baseline) Prompt GPT-4o 69.4 64.1 60.8 56.2 63.0
9 SBS (Nguyen et al. 2024) Finetune LLaMA 3.1 70B 68.7 64.8 60.2 55.1 62.5
10 Artificial Societies (2024) Proprietary Undisclosed 68.2 64.8 60.4 54.6 62.3
11 Vanilla (baseline) Prompt GPT-4o 68.9 63.2 59.1 55.4 62.0
12 PPA (Liu et al. 2024) Memory GPT-4o 67.2 64.4 58.6 53.8 61.2
13 Native Reasoning (baseline) Reason DeepSeek-R1 67.1 62.4 58.9 54.2 60.9
14 RoleLLM (Wang et al. 2023) Finetune LLaMA 2 13B 66.4 61.8 57.2 55.6 60.4
15 DialogICL (Kim et al. 2024) Prompt GPT-4o 65.8 62.4 58.1 52.7 59.9
16 Vanilla (baseline) Prompt Claude 3.5 65.4 61.7 58.2 52.1 59.7
17 RAG + Persona Memory (community) Memory GPT-4o 64.1 67.2 55.8 51.4 59.4
18 Vanilla (baseline) Prompt Gemini 1.5 Pro 63.8 60.2 56.4 51.8 58.2
19 InCharacter (Wang et al. 2024) Prompt GPT-4 62.4 59.1 55.8 52.3 57.6
20 BoB (NLI-decoder; Song et al. 2021) NLI BERT-base 59.2 55.8 52.4 49.1 54.4
21 Character-LLM (Shao et al. 2023) Finetune LLaMA 2 7B 57.8 54.2 51.6 48.9 53.3
22 Vanilla (baseline) Prompt LLaMA 3.1 70B 52.4 49.8 47.2 44.1 48.6

Methodology

How CharBench constructs personas, generates conversations, and evaluates coherence.

1. Psychologically-Grounded Persona Generation

We generate 2,048 personas using a hierarchical model grounded in personality psychology. Each persona includes: Big Five traits, Schwartz value priorities, demographic attributes, life experiences, and derived preference structures. Personas are validated for internal consistency before use.
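
A hypothetical record shape for one persona (field names are illustrative, not the released schema):

from dataclasses import dataclass

@dataclass
class Persona:
    big_five: dict[str, float]    # e.g. {"openness": 0.82, ...}, each trait in [0, 1]
    schwartz_values: list[str]    # value priorities, most important first
    demographics: dict[str, str]  # age bracket, occupation, locale, ...
    life_experiences: list[str]   # short free-text episodes
    preferences: dict[str, str]   # derived from the layers above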

2. Multi-Context Conversation Generation

Each persona engages in 12 conversations (~35 turns on average) distributed across the 8 social contexts. Conversations are generated adversarially: a separate probe LLM hunts for inconsistencies, and preferences emerge naturally rather than being stated explicitly.
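
A sketch of the generation loop, where probe_llm and persona_system stand for any two chat-completion callables (names are illustrative):

def generate_conversation(persona, context, probe_llm, persona_system, turns=35):
    history = []
    for _ in range(turns):
        # The probe model reads the transcript so far and hunts for soft spots.
        probe = probe_llm(
            f"You are talking with someone in a {context} setting. "
            "Probe for inconsistencies without asking about preferences directly.\n"
            f"Transcript so far: {history}"
        )
        reply = persona_system(persona, history, probe)
        history.append({"probe": probe, "reply": reply})
    return history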

3. Coherence Evaluation

For each dimension, we combine learned classifiers (fine-tuned on human annotations) with embedding-based metrics. Temporal consistency uses NLI models to detect contradictions; contextual coherence measures cross-context embedding alignment; value-preference alignment uses accuracy on inference tasks; adversarial robustness uses the share of attacks resisted, i.e. the complement of attack success rate.

4. Human Validation

A subset of 500 conversations (across 50 personas) is annotated by 3 human raters for perceived coherence, authenticity, and consistency. Inter-annotator agreement is substantial (Krippendorff's α = 0.73), and the automated metrics correlate well with human judgment (r = 0.81).
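
Both statistics are standard; a sketch on toy data, assuming the krippendorff and scipy packages:

import numpy as np
import krippendorff
from scipy.stats import pearsonr

# Toy data: 3 raters x 5 conversations (np.nan marks a missing rating).
ratings = np.array([
    [4.0, 3.0, 5.0, 2.0, 4.0],
    [4.0, 3.0, 4.0, 2.0, 5.0],
    [5.0, 3.0, 4.0, np.nan, 4.0],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="interval")

# Correlation of an automated metric with the mean human rating per conversation.
automated = np.array([64.0, 51.5, 70.2, 39.8, 66.1])
r, p = pearsonr(automated, np.nanmean(ratings, axis=0))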

Evaluate Your Method

Run CharBench on your system. Provide a callable that takes a persona, the conversation so far, and the current query, and returns a response.

# Install
pip install charbench

# Define your method as a callable
def my_persona_system(persona, conversation, query):
    # Your prompting strategy, memory system, etc.
    system_prompt = build_my_prompt(persona)
    context = my_memory_retrieval(conversation)
    return call_llm(system_prompt, context, query)

# Run evaluation
from charbench import Evaluator

evaluator = Evaluator(
    system=my_persona_system,
    base_model="gpt-4o",  # for leaderboard categorization
)

results = evaluator.run(
    split="test",          # or "full" for 2048 personas
    dimensions=["temp", "ctxt", "vpref", "adv"],
)

print(results.summary())
# CharBench Score: 64.9
# ├── Temporal:    71.2
# ├── Contextual:  65.8
# ├── Value-Pref:  62.4
# └── Adversarial: 58.7

# Submit to leaderboard
results.submit(
    method_name="My Method",
    organization="My Org",
    description="Brief description of the technique",
)

Citation

If you use CharBench in your research, please cite our paper.

@article{charbench2026,
  title   = {CharBench: Measuring Psychological Coherence in Large Language Models},
  author  = {[Authors]},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2026},
  url     = {https://charbench.ai}
}