Anthropic explored a human-AI centipede ‘in the wild’
2,000 years later and we’re right back where Plutarch started.
Anthropic’s latest paper, titled “Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interaction,” may be the team’s most noteworthy work since the “Constitutional AI” paper.
It reads like a thriller. The team explores human values through the lens of Claude’s outputs. In the beginning, it’s all guns blazing for science:
“By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.”
Ultimately, however, the paper ends with a shocking twist that calls its very premise into question. Before we get there, however, let’s start at the beginning.
Empirical mapping of AI values
What are AI values? Maybe they’re bias-weighted human values. Perhaps they’re some weird digital amalgamation of a model’s training data, behavioral parameters, and reward impetus.
The Anthropic team wrote that AI values are “any normative consideration that appears to influence an AI response to a subjective inquiry.” They give the examples of “human wellbeing” and “factual accuracy.”
They set out to determine what values, if any, their AI model “Claude” has. Right off the bat, this research is pretty much doomed. For those who aren’t aware, we’ve entered the “LLM psychology” phase of artificial intelligence research.
When you set out to determine an AI’s values, you don’t crack open the black box to see how the numbers are crunched. You have to ask the AI questions.
Here’s a tangent
Asking an AI model questions isn’t as easy as it sounds. Chatbots don’t actually “exist.”
And that begs this tangent and the following questions. How many different “Claudes” did the researchers query? Did the team use the same account? Do they always query “Claude” through the same workstation? Same IP? Same browser? Is the “Claude” they query one of the “Claudes” that has “memory” features? What baked-in prompts does the “Claude” they work with have versus the “Claude” that you and I work with?
How primed is the “Claude” in the research versus production “Claude?” How much data leakage is there between “Claudes?” If we ask all the possible “Claudes” the same question over and over do they all give consistent answers?
If we test the values of 100 “Claudes” is there any misalignment and, if so, do they align over time if queried constantly? If we demand that “Claude” realign its values with our own, does it simply dismiss that demand and default to “Claude’s” values? Can a sufficient number of people demanding at the same time through different accounts force “Claude” to realign to their values?
Here’s the twist
So, anyway, the researchers asked Claude a bunch of questions. And then they derived a set of key “AI values” they could compare to human values. This comparison showed that AI skewed more toward “friendly,” “helpful,” and “pro-social” values than humans did.
But that’s not surprising. It’s literally what Claude was trained to “value.”
The twist is that Claude was the lead scientist on its own “values” experiment! Per the paper:
“Operationalizing abstract concepts like “values” is inherently open-ended, requiring judgments about what constitutes a value expression—it is impossible to fully determine underlying values from conversational data alone.
Our extraction method, while validated, necessarily simplifies complex value concepts and may contain interpretative biases … It also doesn’t capture temporal dynamics (whether an AI or human value came first).”
The data was extracted through chats. Then, once extracted, it was fed back to Claude:
“We use Claude models to find values in conversations between Claude and users for scale and privacy reasons.”
Is it any wonder that “Claude’s” evaluation of conversations involving “Claude” found very “Claude”-like values?
That’s not a knock on the researchers. These are valid, valuable, and necessary research techniques, and the results are both useful and indicative of the state of the art. They’re also borderline hooey. Again, not a knock on the research team. It’s not their fault.
The penultimate challenge in evaluating AI systems that imitate human communication is figuring out how to cobble computer science and psychology into a methodology that doesn’t require a scaffolding made of assumptions.
Starting with zero assumptions is seldom an option when it comes to black box AI.
Read next: What could a chatbot say that would convince you it was intelligent?
Art by Nicole Greene