Shifting the AI Evaluation Lens
Contextual Awareness is All You Need Post #8: Reva Schwartz
A socio-technical systems perspective acknowledges that people and technology form an open and dynamic ecosystem that doesn’t exist in a vacuum. While the value of this framing is consistently raised, especially when evaluating AI’s societal impacts, it is often unclear to practitioners how to apply a socio-technical lens to their evaluations.
The Status Quo is Prescriptive
The linguistic concepts of prescriptivism and descriptivism can help us understand the challenge of applying socio-technical practices in technology settings. Prescriptivism can be thought of as promoting a normative ideal for what usage or behavior should be, while descriptivism seeks to capture usage and behavior as they occur naturally, without imposing value judgments. In the context of large-scale AI production, machine learning practices dominate the AI lifecycle. These ML practices, which center systems over people, are inherently prescriptive in nature. In contrast, more descriptive socio-technical approaches that focus on people over systems are (to date) a challenge to scale effectively.
A substantial portion of the internet is used to train AI models, so prescriptivism is an obvious tool of choice for wrangling and streamlining such vast quantities of heterogeneous data. Since it isn’t practical to gather feedback directly from users at scale, organizations use frameworks and statistical models of human behavior to facilitate the process.
Recent efforts in AI alignment attempt to incorporate socio-technical concepts such as human values, user preferences and societal norms by prescribing them a priori (independent of observation) within the AI system during development. For example, AI tuning processes such as reinforcement learning from human feedback (RLHF) use human judgments to constrain system outcomes to a binary framework of “helpful” or “harmful” in pursuit of safety or preference alignment. Other model training paradigms may strive for some form of “neutrality” [1].
These varying tuning and content review processes across the AI lifecycle act as prescriptive proxies for the broader public. System responses are judged and categorized as right or wrong, good or bad, best or worst, acceptable or not, according to pre-specified criteria. More recently, large language models (LLMs) have also been used as a “judge” to classify the outputs of other LLMs for more complex tasks. Whether LLM or human, these judgments serve as the basis for evaluating AI model capabilities and performance. AI benchmarking evaluations compare model output to prescriptive judgments on “canned” tasks, making minimal use of real user responses in context.
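As a rough illustration of the prescriptive pattern described above, the sketch below scores a set of model outputs purely by the labels a judge (human or LLM) assigns from a fixed rubric. The `Judgment` record, the two-label rubric, and the `benchmark_score` function are hypothetical names introduced for this example, not part of any specific benchmark.

```python
from dataclasses import dataclass
from typing import List

# Assumed binary rubric; real rubrics vary by benchmark.
ALLOWED_LABELS = {"acceptable", "unacceptable"}


@dataclass(frozen=True)
class Judgment:
    """One judge decision about one model output on a canned task (hypothetical schema)."""
    task_id: str
    model_output: str
    label: str  # assigned by a human or LLM judge against pre-specified criteria


def benchmark_score(judgments: List[Judgment]) -> float:
    """Return the share of outputs the judge labeled 'acceptable'."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    for j in judgments:
        if j.label not in ALLOWED_LABELS:
            raise ValueError(f"label {j.label!r} is outside the rubric")
    accepted = sum(1 for j in judgments if j.label == "acceptable")
    return accepted / len(judgments)


if __name__ == "__main__":
    sample = [
        Judgment("task-1", "output A", "acceptable"),
        Judgment("task-2", "output B", "unacceptable"),
        Judgment("task-3", "output C", "acceptable"),
        Judgment("task-4", "output D", "acceptable"),
    ]
    print(benchmark_score(sample))
```

Note that nothing about who the user is, what they were trying to do, or what they did next enters the score; only the pre-specified label does.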
Using Descriptive Methods to Systematize Real-World Feedback
In contrast to prescriptive methods, descriptive approaches offer a way to expand evaluation of AI’s societal impacts by capturing contextual detail about the who and what of real-world interactions between users and systems. For example, Anthropic’s Clio is a fully automated tool for describing key trends in how people are using Claude [2].
The specific method of “thick description” can go even further. Rather than merely using static and surface-level information, such as the explicit terms used in LLM system prompts, thick description provides a posteriori (based on observations) accounts of the context and implicit meaning behind dynamic user-system interactions [3]. Protocols can be designed to also capture subsequent user actions and decisions. Combined, this information can support deeper understanding of the hows and the whys of system interactions, and supply the detail necessary to evaluate AI’s societal impacts and place them into the broader context.
As with participatory methods used in the AI system design stage, the timing of descriptive methods matters. These techniques are best leveraged at the earliest stages to inform evaluator assumptions and reduce missteps. Instead of jumping directly to system alignment based on a priori requirements, evaluators can use scenarios and quasi-experimental design methods to comprehensively account for what materializes “on the ground”. For example, evaluators can conduct iterative rounds of scenario-based field testing to assess how users may over-rely on AI systems in real-world settings. Based on this information, a well-defined baseline construct for over-reliance can be refined, validated, and scaled for evaluation.
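One way to picture how field-test observations could feed a baseline construct for over-reliance is sketched below. It assumes a deliberately simple operationalization (the share of incorrect AI outputs that users accepted anyway) and a hypothetical `Observation` record; an actual construct would be defined and validated through the iterative rounds described above, and may look quite different.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Observation:
    """One observed user-system interaction from a field-test round (hypothetical schema)."""
    round_id: int
    scenario: str
    ai_was_correct: bool
    user_accepted: bool


def over_reliance_rate(observations: List[Observation]) -> Optional[float]:
    """Share of incorrect AI outputs that users accepted.

    Returns None when no incorrect outputs were observed, since the rate
    is undefined in that case.
    """
    wrong = [o for o in observations if not o.ai_was_correct]
    if not wrong:
        return None
    return sum(1 for o in wrong if o.user_accepted) / len(wrong)


def rate_by_round(observations: List[Observation]) -> Dict[int, Optional[float]]:
    """Group observations by test round and compute the rate for each round."""
    grouped: Dict[int, List[Observation]] = defaultdict(list)
    for o in observations:
        grouped[o.round_id].append(o)
    return {r: over_reliance_rate(grouped[r]) for r in sorted(grouped)}


if __name__ == "__main__":
    data = [
        Observation(1, "triage", ai_was_correct=False, user_accepted=True),
        Observation(1, "triage", ai_was_correct=False, user_accepted=True),
        Observation(1, "drafting", ai_was_correct=True, user_accepted=True),
        Observation(2, "triage", ai_was_correct=False, user_accepted=False),
        Observation(2, "drafting", ai_was_correct=False, user_accepted=True),
    ]
    print(rate_by_round(data))
```

Comparing the per-round rates is only a starting point; the scenario labels and any qualitative notes gathered alongside them are what would let evaluators refine the construct between rounds.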
The Value of Real-World Feedback
Prescribing what AI should do from a technical and theoretical perspective may seem simpler than navigating complex human and societal factors, but all technology processes involve trade-offs, and shortcutting real-world complexity may be less advantageous than presumed. Prescriptive methods help streamline the computational aspects of evaluating system output at scale. But with no valve to bring in direct and broad-scale descriptive feedback from real system users, evaluators are stuck in a closed system. A lack of real-world evidence to verify evaluation assumptions and goals can create pervasive mismatches between assessment outcomes and the contexts in which systems are actually used.
In turn, these mismatches can permeate the lifecycle, flattening the spectrum of human behavior underlying AI model development, reducing the validity and generalizability of AI evaluations, and causing users from all backgrounds to engage with systems that increasingly fail to meet their needs or expectations. Tuning and content mark-up processes also act as default substitutes for direct user feedback. With every layer of substitution, this “decontextualization” process is likely to reduce individuals to stereotypes, stifle AI-driven use cases, and obfuscate potential risks [4] [5]. Modeling these effects is made even more difficult by the adaptive nature of generative AI models.
In technology settings, evaluators can use descriptive and prescriptive approaches together to:
Help define the constructs that underpin AI model objective functions.
Clarify how people misuse and repurpose AI systems in unanticipated ways.
Enhance ecological validity by revealing gaps between evaluation outcomes and the real world.
Inform annotation schema design and evaluation scoring methods.
Verify whether the goals set during system design have been achieved.
Increase system resilience and robustness.
Depending on how approaches are constructed, descriptive evaluations can also provide valuable insights to other industry sectors about AI’s utility and other real-world phenomena. This information can help contextualize the broader implications of system deployment to:
Support organizational decision making and governance around AI.
Enhance customer segmentation processes.
Complement market trend analytics.
Drive new products, services, and customer experiences.
Save resources.
When combined with other AI assurance mechanisms, descriptive methods can enable users of all backgrounds to engage with technology safely and flexibly, a consistent goal of advanced technology.
[1] Fisher, J.R., Appel, R.E., Park, C.Y., Potter, Y., Jiang, L., Sorensen, T., Feng, S., Tsvetkov, Y., Roberts, M.E., Pan, J., Song, D.X., & Choi, Y. (2025). Political Neutrality in AI is Impossible - But Here is How to Approximate it. ArXiv, abs/2503.05728. https://arxiv.org/html/2503.05728v1
[2] https://www.anthropic.com/research/clio
[3] Nelson, A., “Thick Alignment”, keynote ACM FAccT 2023
[4] Hofmann, V., Kalluri, P.R., Jurafsky, D., & King, S. (2024). AI generates covertly racist decisions about people based on their dialect. Nature, 1–8.
[5] Casper, S., Davies, X., Shi, C., Gilbert, T.K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P.J., Wang, T., Marks, S., Ségerie, C., Carroll, M., Peng, A., Christoffersen, P.J., Damani, M., Slocum, S., Anwar, U., Siththaranjan, A., Nadeau, M., Michaud, E.J., Pfau, J., Krasheninnikov, D., Chen, X., Langosco, L.L., Hase, P., Biyik, E., Dragan, A.D., Krueger, D., Sadigh, D., & Hadfield-Menell, D. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. ArXiv, abs/2307.15217.
