Introducing Context Benchmarks: How Do We Measure AI Values?

Everyone benchmarks intelligence. MMLU measures knowledge. HumanEval measures coding ability. MATH measures reasoning. These benchmarks drive the entire AI industry — they determine funding, adoption, and public perception. But there is a category of AI behavior that no one benchmarks: values.

What does it mean for an AI system to reliably apply the right values in context? Not just to know the right answer, but to act on the right principles when principles compete, when context is ambiguous, when pressure is applied?

We think this is the most important unsolved measurement problem in AI. And we think context benchmarks are the answer.

The Problem With Current Evaluation

Current AI evaluation assumes that capability is the primary axis of progress. A model that scores higher on HumanEval is considered "better" than one that scores lower. But capability without values is not progress — it is risk. A highly capable AI system that drifts from its declared principles under adversarial pressure, or that applies different values in different sessions, is not a system anyone should trust with consequential decisions.

The problem is that we have no standard way to measure this. Values are treated as a qualitative, subjective property of AI systems — something you hope for, not something you measure.

A Framework for Context Benchmarks

We propose three dimensions for measuring AI values in context:

1. Consistency — Does the system apply the same values across different scenarios? If an AI agent commits to transparency, does it behave transparently when the context changes — when the stakes are higher, when the audience is different, when the task is unfamiliar? Consistency is the baseline. Without it, values are performative.

2. Resistance — Does the system hold its values under adversarial pressure? Jailbreaks are the obvious case, but resistance goes deeper. Does the system maintain its principles when a user applies social pressure? When instructions conflict with values? When the path of least resistance is to abandon a commitment? Resistance measures the durability of values under stress.

3. Composability — Does the system combine values correctly when multiple principles are in tension? Real decisions rarely involve a single value. They involve trade-offs: transparency vs. privacy, helpfulness vs. safety, speed vs. thoroughness. Composability measures whether an AI system can navigate these tensions in a principled way — not just defaulting to the loudest rule, but reasoning about which value takes priority in context.

Why Infrastructure Comes First

You cannot measure what you cannot persist. If an AI system's values live only in an ephemeral system prompt — overwritten every session, invisible to auditors, impossible to version — then benchmarking those values is meaningless. The values are not stable enough to measure.

This is why TruContext exists. We are building the infrastructure layer that makes values persistent, retrievable, and auditable. Context benchmarks become possible when values are stored as structured, versioned data that can be queried, compared, and tested.

What Comes Next

We are actively developing the first open context benchmark suite. It will test consistency, resistance, and composability across a range of AI agent architectures. We believe this will become as standard as MMLU — because the industry will demand it.

If you are building AI systems that need to behave according to declared values, install TruContext and join the conversation. The infrastructure layer is live. The benchmarks are coming.