Using AI without losing touch with what matters
For those involved in evaluation, AI is no longer a concern of the future – it’s a present one. Whether we realize it or not, tools like ChatGPT, Copilot, and others are already influencing how data is collected, processed, and synthesized in our work.
At OpenCities, we’ve started integrating AI tools into our workflows. They’ve made some processes leaner, some faster. But we’ve also learned where they fall short — and where they risk undermining the values at the heart of good evaluation: trust, transparency, and technical rigour.
This piece reflects on the practical benefits and ethical boundaries of using AI in evaluation. We believe AI is the future, and it can help us work better. But let’s be clear: AI isn’t a substitute for human insight. It can’t replace the experience or judgement that evaluators bring to their work. As we explore what AI can do, we must stay grounded in what matters most: our responsibility to our clients and the communities they serve.
1. Trust is Earned, Not Automated
Our starting point is simple: Evaluation is a human-to-human process. It’s built on trust between evaluator and client, between facilitator and participant, between teams and their constituents. While AI might help us crunch data faster or write neater summaries, it can’t build relationships, read a room, or navigate complex dynamics.
And it certainly can’t replace the responsibility that consultants bear to deliver thoughtful, high-quality work. Clients aren’t paying for shortcuts – they’re investing in expertise, contextual understanding, and care. That’s why we see AI not as a solution in itself, but as a tool to augment human insight.
2. Three Ways We’re Using AI — With Caution and Care
When used thoughtfully, AI can help us work smarter – saving time, surfacing patterns, and supporting reflection. But we understand that AI is not suitable for every use case. That’s why our approach at OpenCities is shaped by caution, care, and a commitment to quality.
Below, we share three ways we’re currently using AI in our work – where it helps, where it doesn't, and what to watch out for.
Summaries
We’ve used tools like Otter.ai or even Microsoft Teams to generate transcripts and quick summaries of interviews and workshops. These tools save hours – especially for long sessions – and are useful for indexing and searching large volumes of qualitative material.
But AI transcriptions aren’t perfect. They mishear, drop nuance, and sometimes garble emotion. So, while they’re a useful first pass, we always cross-check with our own notes to capture the tone, context, and moments that matter most to our clients.
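Neither of these tools needs to be scripted, but the same first-pass idea can be, for teams that want more control over where recordings go. The sketch below is purely illustrative and is not how Otter.ai or Teams works: it assumes the openai Python package and an API key, and uses a hosted speech-to-text endpoint as a stand-in for whichever transcription service a team actually prefers. The output is a raw draft to be checked against our own notes, never the final record.

```python
# Illustrative sketch only: a scripted "first pass" transcript.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# the Whisper endpoint stands in for whichever speech-to-text tool a team uses.
from openai import OpenAI

client = OpenAI()

def first_pass_transcript(audio_path: str) -> str:
    """Return a raw machine transcript to be cross-checked against human notes."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",  # assumed model name; swap for your provider's
            file=audio_file,
        )
    return result.text

if __name__ == "__main__":
    draft = first_pass_transcript("workshop_recording.mp3")  # hypothetical file
    print(draft[:500])  # skim the opening before deeper manual review
```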
Synthesis
Tools like MaxQDA can now apply codes, cluster themes, and draft summary paragraphs for large volumes of text – from stakeholder interviews to open-ended survey responses and policy documents. This can be especially useful when we need a rapid “first cut” of the data to identify points of concurrence, points of contention, or simply points that need further evidencing.
But we’ve found that AI tools miss subtlety. They don’t understand our clients the way we do, or the decisions those clients need to make. They don’t understand the difference between what’s said and what’s meant. They don’t know what’s politically sensitive. Simply put, they flatten differences that matter.
So, we treat AI-based synthesis like scaffolding, not structure. It can help define themes – but it can’t define their meaning. That’s why our synthesis is always manual at the final stage. Because nothing can replace hard graft when it comes to interpretation and triangulation.
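MaxQDA does this inside its own interface, so the snippet below isn’t our workflow; it’s a minimal open-source sketch of the same “scaffolding, not structure” idea. It assumes the sentence-transformers and scikit-learn packages, an arbitrary embedding model, and an arbitrary cluster count, and it simply groups similar open-ended responses so a human coder has a rough first cut to react to. The groupings mean nothing until an evaluator reads the underlying quotes.

```python
# Illustrative first-cut clustering of open-ended responses.
# Assumes `sentence-transformers` and `scikit-learn` are installed; the model
# name and cluster count are arbitrary choices, not recommendations.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

responses = [
    "The training improved coordination between departments.",
    "Sessions were too short to cover the material.",
    "We finally have a shared vocabulary across teams.",
    "Scheduling made it hard for field staff to attend.",
]

# Embed each response as a vector, then group similar responses together.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode(responses)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)

# Print the rough groupings for a human coder to review, merge, or discard.
for cluster_id in sorted(set(labels)):
    print(f"\nDraft theme {cluster_id}:")
    for text, label in zip(responses, labels):
        if label == cluster_id:
            print(" -", text)
```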
Spot-Checks
One emerging use we’re exploring is having AI “sense check” outputs. AI can act as a kind of thinking partner, helping to expand – not narrow – our thinking. For example, we’ll often prompt AI to review our draft inputs with questions like the following (a rough scripted version is sketched after the list):
· Are our responses clearly aligned to the evaluation questions?
· Are we consistent across the document?
· Are our findings well-evidenced?
· Are there perspectives or sources of evidence we may have missed?
· Are our recommendations actionable and useful to the client?
These small-scale applications help tighten quality — a sort of AI-enhanced proofreading for coherence and clarity. But again, this only works when you already know what good looks like.
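To make this concrete, here is a minimal, hypothetical sketch of what such a spot-check can look like when scripted rather than typed into a chat window. It assumes the openai package and an API key; the model name, the draft file, and the exact wording of the checklist are all placeholders, and the output is treated as prompts for our own review, never as verdicts.

```python
# Illustrative "sense check" prompt, scripted for repeatability.
# Assumes the `openai` package and an OPENAI_API_KEY; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

CHECKLIST = """Review the draft below as a critical colleague. For each question,
answer briefly and point to specific passages:
1. Are the responses clearly aligned to the evaluation questions?
2. Is the document internally consistent?
3. Are the findings well-evidenced?
4. Are there perspectives or sources of evidence that may be missing?
5. Are the recommendations actionable and useful to the client?"""

def spot_check(draft_text: str) -> str:
    """Return the model's review notes; a human decides what, if anything, to act on."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; substitute whatever your team uses
        messages=[
            {"role": "system", "content": "You are a careful evaluation quality reviewer."},
            {"role": "user", "content": f"{CHECKLIST}\n\n---\n{draft_text}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("draft_findings.txt") as f:  # hypothetical draft file
        print(spot_check(f.read()))
```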
3. The Lines We Won’t Cross
While we’re open to exploring what AI can do, we’re just as clear on what it shouldn’t do.
There are boundaries we don’t cross — not because the technology can’t, but because it shouldn’t replace the care, discernment, and responsibility that good evaluation demands.
Here’s what we don’t use AI for:
· Analyzing Sensitive Data — Especially when privacy, confidentiality, or participant trust is at stake.
· Interpreting Emotion or Intent — AI misses tone and hesitation – the subtle clues that often matter most.
· Drafting Sensitive Conclusions or Recommendations — These require political judgement, contextual awareness, and careful phrasing.
· Writing Final Reports without Full Human Revision — Because delivering work to clients is a matter of accountability, not automation.
These are moments that call for human presence. They’re where experience makes the difference, and where trust is either built or broken.
4. How We Make AI Work for Us
Producing meaningful results with AI requires careful guidance by skilled humans. We’re not just passive users; we’re facilitators and interpreters of AI’s outputs. Evaluators seeking to strengthen their analysis with this new technology must invest more, not less, expertise, time, and attention in things like strategic segmentation of data, iterative cycles of inquiry and prompt refinement, testing and verification, and careful interpretation.
In fact, pioneering practitioners tend to describe a more deliberate and layered analytical process, not a shortcut. Every step still relies on human judgement.
That’s why it’s troubling that AI is entering the scene just as careful analysis seems to be increasingly undervalued — evaluations are often under pressure to cut corners, shrink budgets, rush timelines, and skip the hard thinking.
We see that trend — and we’re pushing back.
We use AI not to replace deep thinking, but to enhance it. AI can lighten the load. But it’s our insight, judgement, and care that makes the work count.
5. Ethical Questions Every Evaluator Should Ask Before Using AI
As this space evolves, we’re encouraging teams to ask:
· Transparency: Have we told our clients or participants how we’re using AI?
Transparency isn’t just about disclosure. It’s about jointly defining boundaries. That means clearly explaining: (i) what tools we’re using and why; (ii) what data is being fed into them and how it’s handled; and (iii) how we’re ensuring compliance with data protection and ethical standards. Trust is built through openness. If AI is being used in any part of the analysis, the people affected by that analysis deserve to know.
· Bias: Could the tool reinforce stereotypes or erase minority voices?
AI can reduce some human bias — like recall errors or confirmation bias when collecting data — but it also inherits bias from the data it’s trained on. And much of that data reflects deep-rooted inequities: racial, gendered, geographic. Left unchecked, AI can amplify dominant voices and miss marginal ones. We must interrogate not just what the tool says, but who it’s speaking for — and who might be missing.
· Oversight: Who reviews and takes responsibility for the outputs?
Even the most advanced AI systems are prone to errors — whether it’s confidently stating incorrect facts (hallucination) or misinterpreting tone. That’s why human oversight is essential at every step. Evaluators must take responsibility for verifying accuracy, interpreting meaning through a human lens, and making final judgments based on contextual, relational, and strategic understanding. There’s no “handing off” responsibility to a tool. If something’s wrong, that’s on us — not the model.
· Value: Does this tool help us deliver better work — or just faster work?
There are times when we need to work fast. The sort of predictive modelling AI can handle could save lives in a crisis. But rarely are the stakes so high during evaluations – at least in the immediate term. When we have time, we should use it. Time allows us to go deeper, reflect longer, and interpret more carefully. When space is available, we owe it to our clients – and the work – to take it.
Conclusion: AI as Partner, Not Proxy
There’s a lot of hype about AI. And yes, it can help – especially with grunt work. But evaluation isn’t just about processing information. It’s about sense-making. About relationships. About reading between the lines. And no machine can do that better than we can.
We see AI as a junior assistant: helpful in parts, but not in charge. The job of the evaluator remains the same — to listen well, synthesize carefully, and build the kind of trust that no algorithm can replicate.
In that sense, the real question isn’t whether we can use AI in evaluation – it’s how we do so without losing the values that make evaluation matter.