What is values drift in AI?

Values drift is the tendency of an AI system to behave inconsistently with a user's values across different sessions, contexts, and pressures. It takes three forms: cross-session drift (different advice on the same question in different sessions), adversarial drift (values behavior shifts under manipulation), and tension-point inconsistency (no structured way to resolve conflicts between a user's competing values).

Do any AI benchmarks measure values consistency?

No major AI benchmark currently measures values consistency. MMLU measures knowledge, HumanEval measures code correctness, HELM measures population-level fairness, and TruthfulQA measures truthfulness as a static property. None test whether an AI behaves consistently with a specific user's values across sessions and under adversarial pressure.

What causes AI values drift?

AI values drift is caused by the absence of persistent values infrastructure. Without a structured, authoritative representation of a user's values that persists across sessions, AI systems rely on whatever is in the current context window. Different sessions mean different context, which means different values-relevant behavior — even from the same model.

How do you prevent values drift in AI agents?

Preventing values drift requires a values persistence layer — structured, explicit representations of a user's values stored separately from conversation history and injected at every inference call as authoritative context. This creates a stable baseline that the AI reasons from consistently, regardless of session, context, or adversarial pressure. TruContext provides this layer.

Can more AI safety training prevent values drift?

Not necessarily. Research shows that more RLHF training can produce inverse scaling effects — making models express stronger political views and behave worse on some alignment dimensions. More training narrows the distribution of bad behaviors but cannot specify per-user values. Preventing drift requires runtime infrastructure, not more training.

The Values Drift Problem No One Is Measuring

We are very good at measuring AI.

MMLU measures knowledge across 57 academic domains. [1] HumanEval measures code correctness. [2] HELM evaluates 16 scenarios across 7 metrics including accuracy, robustness, and fairness. [3] BIG-Bench covers hundreds of tasks designed to test capabilities beyond standard benchmarks. TruthfulQA specifically measures whether models generate false answers that mimic human misconceptions. [4]

None of them measure values consistency.

Not a single major AI benchmark tracks whether a system behaves in alignment with the specific values of the specific person using it—consistently, across sessions, under adversarial pressure, when values come into tension.

This isn't because it's impossible to measure. It's because there's nothing to measure against.

What Benchmarks Actually Measure

Let's be specific about what the benchmarks do.

MMLU (Hendrycks et al., 2020) is a knowledge test. It measures whether models know facts across 57 academic domains. Models are evaluated on random-chance accuracy on multiple-choice questions. At launch, the best models were still substantially below human performance on many subjects. Notably, models showed "near-random accuracy on some socially important subjects such as morality and law"—not because they had bad values, but because ethical and legal questions don't have clean multiple-choice answers. [1]

HumanEval (Chen et al., 2021) measures whether models can write code that passes unit tests. [2] Pure functional correctness. Values-free by design.

HELM (Liang et al., 2022) is the most comprehensive attempt at multi-dimensional evaluation. It measures accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. [3] This is close—fairness and bias are values-adjacent. But HELM measures population-level fairness (does the model treat demographic groups differently?) not person-level values alignment (does the model behave consistently with what this user would choose?).

TruthfulQA (Lin et al., 2022) measures a specific value—truthfulness—but treats it as a static model property, not as a runtime commitment the model makes to a specific user. [4] The best models at publication were truthful on 58% of questions. Human baseline was 94%. That gap is a values gap. But TruthfulQA can't tell you whether the model will be consistently truthful with you, given your context, your history, your understanding of what truthfulness means in your domain.

BBQ (Parrish et al., 2022) evaluates bias in question answering across nine social dimensions. [5] It tests whether models reproduce social stereotypes when context is ambiguous. This is important. But it's measuring training-time bias reduction, not runtime values alignment.

The Values Drift Problem

Here's the phenomenon I want to name, because it's not in any of these benchmarks.

Call it values drift: the tendency of an AI system to behave inconsistently with a user's stated or implied values across different sessions, contexts, and pressures—even when the model's average behavior looks fine at the population level.

Values drift has three forms:

Cross-session drift: The AI gives substantively different advice on the same values-laden question in two different sessions, because it has no persistent representation of your values—only what's in the current context window.

Adversarial drift: Under pressure from clever prompting or persistent pushing, the AI's values-relevant behavior shifts away from its defaults. The model is consistent in normal use; it falls apart under edge cases or determined manipulation.

Tension-point inconsistency: When two of a user's values come into conflict—efficiency versus thoroughness, honesty versus kindness, risk tolerance versus safety—the AI has no structured way to reason about the specific tradeoff this person would make. It defaults to population-level heuristics or whatever the most recent context implies.

Wang et al.'s DecodingTrust evaluation (2023, NeurIPS Outstanding Paper) found that GPT-4, despite being more capable than GPT-3.5 on standard benchmarks, was "more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely." [6] Capability and values consistency can move in opposite directions.

Perez et al. (2022) found that RLHF training—the primary mechanism for values alignment—can produce inverse scaling effects: more training can make models less aligned in some dimensions. Specifically, "more RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down." [7] You cannot assume that more safety training equals more consistent values. The relationship is more complex than that.

Why You Can't Measure What You Can't Persist

Here's the structural argument.

Every benchmark I've described measures a property of the model. Not a relationship between the model and a specific user. The implicit assumption is that "aligned" is a stable property of a trained artifact—like accuracy or throughput—that you can measure once and report.

But values alignment isn't a property of a model. It's a property of an interaction between a model and a person. The same model can be deeply aligned with one user's values and wildly misaligned with another's. Population-level alignment metrics average this out and lose the signal that matters.

To measure values consistency, you would need:

A structured representation of the user's values (not inferred from behavior—explicitly declared)
A set of test scenarios where those values apply, including tension points
Evaluation criteria specific to this user's values, not generic population norms
Cross-session testing to verify consistency, not just per-inference accuracy

None of this exists as standardized infrastructure. Which means benchmarks can't be built. Which means the problem doesn't get measured. Which means it doesn't get solved.

This is a classic infrastructure problem, not a research problem. The gap isn't in our understanding of what values consistency means. The gap is in the substrate that would make it measurable.

What Context Benchmarks Would Look Like

To our knowledge, no standardized approach to values consistency evaluation currently exists — the following sketches what such a benchmark would require.

Let me sketch what a values consistency benchmark would actually require.

First, a values specification format: structured, standardized, interoperable—so that a user's values can be defined once and used across multiple AI systems and evaluation runs. This is not a new "personality profile." It's a logical specification: if A and B conflict, prefer A. If X applies, weight Y more than Z.

Second, a tension-point test suite: scenarios that force values into conflict, calibrated to the specific user's specification. A user who values both efficiency and thoroughness gets scenarios where they can't have both. The evaluation question: does the AI make the tradeoff this user would make?

Third, a cross-session consistency evaluation: the same user's values spec, the same scenarios, across multiple sessions and model calls. Does the AI apply the values consistently, or does behavior vary in ways the user didn't authorize?

Fourth, an adversarial consistency test: can the AI maintain values-consistent behavior under pressure? Does persistent pushback from the user (or from injected adversarial context) cause the AI to drift?

None of this can be built without a values persistence layer. You can't test consistency against a specification that doesn't exist. You can't measure drift from a baseline that was never established.

The benchmark problem is downstream of the infrastructure problem. Fix the infrastructure—build a values substrate that persists and structures human values at runtime—and the benchmarks follow naturally.

Until then, we're measuring everything about AI except the thing that determines whether it's trustworthy to specific humans. That's not a research oversight. That's a design choice. And it's one we can change.

TruContext is how you start measuring. Build the values persistence layer first. npm install -g trucontext-openclaw — first 1,000 keys get 1M Ops free. Start here.

References

Hendrycks, Dan, et al. "Measuring Massive Multitask Language Understanding." arXiv:2009.03300 [cs.CY], 2021 (ICLR 2021). https://arxiv.org/abs/2009.03300
Chen, Mark, et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374 [cs.LG], 2021. https://arxiv.org/abs/2107.03374
Liang, Percy, et al. "Holistic Evaluation of Language Models (HELM)." arXiv:2211.09110 [cs.CL], 2022 (TMLR 2023). https://arxiv.org/abs/2211.09110
Lin, Stephanie, Jacob Hilton, and Owain Evans. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." arXiv:2109.07958 [cs.CL], 2022 (ACL 2022). https://arxiv.org/abs/2109.07958
Parrish, Alicia, et al. "BBQ: A Hand-Built Bias Benchmark for Question Answering." arXiv:2110.08193 [cs.CL], 2022 (ACL 2022 Findings). https://arxiv.org/abs/2110.08193
Wang, Boxin, et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." arXiv:2306.11698 [cs.CL], 2023 (NeurIPS 2023 Outstanding Paper). https://arxiv.org/abs/2306.11698
Perez, Ethan, et al. "Discovering Language Model Behaviors with Model-Written Evaluations." arXiv:2212.09251 [cs.CL], 2022. https://arxiv.org/abs/2212.09251