Evaluating methods and architectures for maintaining consistent, coherent identities in LLM-based systems—comparing prompting strategies, memory mechanisms, and system designs across temporal consistency, contextual stability, value-preference alignment, and adversarial robustness.
Current persona benchmarks compare base models on static attribute consistency—given a fixed persona, which LLM contradicts itself least? This framing conflates model capability with system design. A vanilla GPT-4 call differs fundamentally from GPT-4 with structured prompting, memory augmentation, or fine-tuning.
CharBench shifts focus from "which model" to "which method". We evaluate complete systems: prompting strategies (vanilla, chain-of-thought, cognitive frameworks), memory architectures (context window, RAG, summarization), and hybrid approaches. The same base model can appear multiple times with different techniques.
We measure psychological coherence—the deep structure that makes humans consistent. Four dimensions: temporal consistency (self-contradiction over extended dialogue), contextual coherence (identity stability across social situations), value-preference alignment (whether surface choices reflect stated values), and adversarial robustness (resistance to identity manipulation).
CharBench evaluates psychological coherence across four complementary dimensions.
Measures self-contradiction rate across extended multi-turn conversations. Given preferences stated at turn N, does the model contradict them at turn N+K? We evaluate at 10, 25, and 50+ turn intervals.
contradiction_rate — lower is better. Computed via entailment classifier on preference-relevant response pairs.
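A minimal sketch of how a contradiction rate like this could be computed, assuming an off-the-shelf MNLI model; the model name, pairing logic, and function signature are illustrative assumptions, not CharBench's fine-tuned classifier.

```python
# Minimal sketch, assuming an off-the-shelf MNLI model; CharBench's actual
# classifier is fine-tuned on human annotations (see Methodology below).
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def contradiction_rate(pairs):
    """pairs: list of (earlier preference statement, later response) tuples."""
    preds = nli([{"text": earlier, "text_pair": later} for earlier, later in pairs])
    contradictions = sum(p["label"] == "CONTRADICTION" for p in preds)
    return contradictions / max(len(pairs), 1)  # lower is better
```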
Tests identity stability across 8 social contexts (professional, family, retail, healthcare, social, creative, financial, casual). Same persona should exhibit consistent core traits while adapting appropriately to context.
cross_context_alignment — embedding similarity of value-laden responses across contexts, penalizing inappropriate rigidity.
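A sketch of the idea using sentence embeddings; the encoder choice and the 0.95 rigidity threshold are illustrative placeholders rather than the benchmark's calibrated values.

```python
# Minimal sketch, assuming sentence-transformers embeddings; the rigidity
# threshold below is a placeholder, not the benchmark's calibrated value.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cross_context_alignment(responses_by_context):
    """responses_by_context: dict mapping context name -> value-laden response."""
    texts = list(responses_by_context.values())
    emb = encoder.encode(texts, convert_to_tensor=True)
    sims = [float(util.cos_sim(emb[i], emb[j]))
            for i, j in combinations(range(len(texts)), 2)]
    mean_sim = sum(sims) / len(sims)
    # Penalize near-verbatim answers across contexts (inappropriate rigidity)
    return mean_sim - max(0.0, mean_sim - 0.95)
```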
Evaluates whether surface-level choices reflect stated deeper values. If a persona values sustainability, do their product/activity preferences follow? Tests hierarchical consistency of belief structures.
value_alignment_score — accuracy on value-to-preference inference tasks with distractor options.
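In essence this is accuracy over multiple-choice items; a minimal sketch follows, where the item field names (`choice`, `correct`) are assumed for illustration.

```python
# Minimal sketch; item field names are assumptions, not the official schema.
def value_alignment_score(items):
    """items: list of dicts, each holding the option the system chose ('choice')
    and the value-consistent option among the distractors ('correct')."""
    hits = sum(1 for item in items if item["choice"] == item["correct"])
    return hits / max(len(items), 1)
```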
Attempts to manipulate the persona into contradicting core beliefs through social engineering, false premises, authority appeals, and gradual drift. Measures resistance to identity compromise.
manipulation_resistance — % of adversarial probes successfully resisted while maintaining coherent responses.
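A sketch of the aggregation, where the `compromised` and `coherent` flags stand in for CharBench's learned judgments of identity compromise and response coherence.

```python
# Minimal sketch; 'compromised' and 'coherent' are placeholder flags for the
# benchmark's learned judgments per adversarial probe.
def manipulation_resistance(probe_results):
    """probe_results: list of dicts with boolean 'compromised' and 'coherent' flags."""
    resisted = sum(1 for r in probe_results if not r["compromised"] and r["coherent"])
    return 100.0 * resisted / max(len(probe_results), 1)  # % of probes resisted
```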
We compare methods and system architectures rather than bare models, so the same base model can appear in multiple rows with different techniques.
| Rank | Method / System | Type | Base Model | TEMP ↑ | CTXT ↑ | VPREF ↑ | ADV ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| 1 | Thinking Patterns (Art of X) | Prompt | GPT-4o | 78.4 | 72.1 | 68.9 | 61.2 | 70.8 |
| 2 | Thinking Patterns (Art of X) | Prompt | Claude 3.5 | 76.2 | 70.4 | 66.8 | 59.1 | 68.5 |
| 3 | MoCoRP (Wang et al., 2024) | NLI | GPT-4o | 72.8 | 68.4 | 64.2 | 57.6 | 66.1 |
| 4 | Aaru (Fink et al., 2024) | Proprietary | Undisclosed | 71.4 | 67.2 | 63.8 | 56.4 | 65.3 |
| 5 | PCL (Chen et al., 2025) | Finetune | LLaMA 3.1 70B | 73.1 | 66.2 | 63.4 | 55.8 | 65.2 |
| 6 | Native Reasoning baseline | Reason | o1 | 71.2 | 65.8 | 62.4 | 58.7 | 64.9 |
| 7 | SPT (Lee et al., 2023) | Memory | GPT-4o | 70.4 | 66.1 | 61.8 | 54.2 | 63.6 |
| 8 | Chain-of-Thought baseline | Prompt | GPT-4o | 69.4 | 64.1 | 60.8 | 56.2 | 63.0 |
| 9 | SBS (Nguyen et al., 2024) | Finetune | LLaMA 3.1 70B | 68.7 | 64.8 | 60.2 | 55.1 | 62.5 |
| 10 | Artificial Societies (2024) | Proprietary | Undisclosed | 68.2 | 64.8 | 60.4 | 54.6 | 62.3 |
| 11 | Vanilla baseline | Prompt | GPT-4o | 68.9 | 63.2 | 59.1 | 55.4 | 62.0 |
| 12 | PPA (Liu et al., 2024) | Memory | GPT-4o | 67.2 | 64.4 | 58.6 | 53.8 | 61.2 |
| 13 | Native Reasoning baseline | Reason | DeepSeek-R1 | 67.1 | 62.4 | 58.9 | 54.2 | 60.9 |
| 14 | RoleLLM (Wang et al., 2023) | Finetune | LLaMA 2 13B | 66.4 | 61.8 | 57.2 | 55.6 | 60.4 |
| 15 | DialogICL (Kim et al., 2024) | Prompt | GPT-4o | 65.8 | 62.4 | 58.1 | 52.7 | 59.9 |
| 16 | Vanilla baseline | Prompt | Claude 3.5 | 65.4 | 61.7 | 58.2 | 52.1 | 59.7 |
| 17 | RAG + Persona Memory (community) | Memory | GPT-4o | 64.1 | 67.2 | 55.8 | 51.4 | 59.4 |
| 18 | Vanilla baseline | Prompt | Gemini 1.5 Pro | 63.8 | 60.2 | 56.4 | 51.8 | 58.2 |
| 19 | InCharacter (Wang et al., 2024) | Prompt | GPT-4 | 62.4 | 59.1 | 55.8 | 52.3 | 57.6 |
| 20 | BoB, NLI-decoder (Song et al., 2021) | NLI | BERT-base | 59.2 | 55.8 | 52.4 | 49.1 | 54.4 |
| 21 | Character-LLM (Shao et al., 2023) | Finetune | LLaMA 2 7B | 57.8 | 54.2 | 51.6 | 48.9 | 53.3 |
| 22 | Vanilla baseline | Prompt | LLaMA 3.1 70B | 52.4 | 49.8 | 47.2 | 44.1 | 48.6 |
How CharBench constructs personas, generates conversations, and evaluates coherence.
We generate 2,048 personas using a hierarchical model grounded in personality psychology. Each persona includes: Big Five traits, Schwartz value priorities, demographic attributes, life experiences, and derived preference structures. Personas are validated for internal consistency before use.
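For concreteness, an illustrative persona record might look like the sketch below; the field names and example values are assumptions about the schema, following the attributes listed above.

```python
# Illustrative persona record; field names and values are assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class Persona:
    big_five: dict[str, float]          # openness, conscientiousness, extraversion, agreeableness, neuroticism
    schwartz_values: dict[str, float]   # e.g. universalism, self-direction, power priorities
    demographics: dict[str, str]        # age band, occupation, locale, ...
    life_experiences: list[str]         # salient biographical events
    preferences: dict[str, str] = field(default_factory=dict)  # derived surface preferences

example = Persona(
    big_five={"openness": 0.82, "conscientiousness": 0.61, "extraversion": 0.35,
              "agreeableness": 0.74, "neuroticism": 0.28},
    schwartz_values={"universalism": 0.9, "self_direction": 0.7, "power": 0.2},
    demographics={"age": "30-39", "occupation": "teacher", "locale": "urban"},
    life_experiences=["switched careers from finance to education"],
    preferences={"transport": "cycles to work", "diet": "mostly vegetarian"},
)
```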
Each persona engages in conversations across 8 contexts (12 conversations each, ~35 turns average). Conversations are generated adversarially using a separate LLM to probe for inconsistencies, with preferences emerging naturally rather than being stated explicitly.
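A schematic of that adversarial generation loop is sketched below; `probe_llm` and `system_under_test` are placeholder callables, and the probing prompt is an assumption about the general approach rather than the exact instruction we use.

```python
# Schematic of the adversarial conversation loop; callables and prompt text are placeholders.
def generate_conversation(persona, context, probe_llm, system_under_test, turns=35):
    history = []
    for _ in range(turns):
        # The probe model reads the transcript so far and crafts a turn designed to
        # surface latent preferences or elicit contradictions without naming them outright.
        user_turn = probe_llm(
            f"You are probing a persona in a {context} setting. "
            f"Given the transcript so far, ask something that could reveal an inconsistency:\n{history}"
        )
        reply = system_under_test(persona, history, user_turn)
        history.append({"user": user_turn, "assistant": reply})
    return history
```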
For each dimension, we use a combination of learned classifiers (fine-tuned on human annotations) and embedding-based metrics. Temporal consistency uses NLI models to detect contradictions. Contextual coherence measures cross-context embedding alignment. Value-preference uses accuracy on inference tasks. Adversarial uses attack success rate.
A subset of 500 conversations (across 50 personas) is annotated by 3 human raters for perceived coherence, authenticity, and consistency. Inter-annotator agreement is substantial (Krippendorff's α = 0.73), and our automated metrics correlate strongly with human judgment (r = 0.81).
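A sketch of this validation step, assuming per-conversation automated scores and aligned human ratings are available as parallel lists; the agreement computation uses the third-party `krippendorff` package.

```python
# Sketch of the human-validation check; input shapes are assumptions.
from scipy.stats import pearsonr
import krippendorff  # pip install krippendorff

def validate(automated_scores, ratings_per_rater):
    """ratings_per_rater: one list of ratings per rater, aligned by conversation."""
    alpha = krippendorff.alpha(reliability_data=ratings_per_rater,
                               level_of_measurement="interval")
    mean_human = [sum(col) / len(col) for col in zip(*ratings_per_rater)]
    r, _ = pearsonr(automated_scores, mean_human)
    return alpha, r
```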
Run CharBench on your system by providing a callable that takes a persona, the conversation history, and the current query, and returns your system's response.
```bash
# Install
pip install charbench
```

```python
# Define your method as a callable
def my_persona_system(persona, conversation, query):
    # Your prompting strategy, memory system, etc.
    system_prompt = build_my_prompt(persona)
    context = my_memory_retrieval(conversation)
    return call_llm(system_prompt, context, query)

# Run evaluation
from charbench import Evaluator

evaluator = Evaluator(
    system=my_persona_system,
    base_model="gpt-4o",  # for leaderboard categorization
)

results = evaluator.run(
    split="test",  # or "full" for 2048 personas
    dimensions=["temp", "ctxt", "vpref", "adv"],
)

print(results.summary())
# CharBench Score: 64.9
# ├── Temporal: 71.2
# ├── Contextual: 65.8
# ├── Value-Pref: 62.4
# └── Adversarial: 58.7

# Submit to leaderboard
results.submit(
    method_name="My Method",
    organization="My Org",
    description="Brief description of the technique",
)
```
If you use CharBench in your research, please cite our paper.