Observe.AI at ACL 2026: Advancing Agentic AI for Contact Centers
We are excited to share that six research papers from Observe.AI have been accepted to ACL 2026, ranked 1st in the NLP conference. Together, these works reflect a broader research direction we are deeply invested in: building LLM-powered and agentic AI systems that can operate reliably in real enterprise workflows, especially in complex contact-center environments.
Contact centers are one of the richest and most operationally important sources of human conversation data. Every interaction captures customer intent, agent behavior, policy interpretation, compliance requirements, business outcomes, and opportunities for improvement. As LLMs become part of this ecosystem, the challenge is no longer limited to generating fluent responses. The real opportunity is to build AI systems that can reason with business context, use tools effectively, extract the right information, evaluate consistently, learn from errors, and support high-stakes operational decisions at scale.
At Observe.AI, our research is focused on this transition: from Large Language Models (LLMs) as response engines to agentic, auditable, and production-ready AI systems. The six accepted papers span automated quality assurance, tool-aware planning, classification refinement, fairness evaluation, conversational entity extraction, and synthetic dialogue generation. Each paper addresses a different building block required to make AI systems more reliable in real-world contact-center workflows.
Entity Exchange in the Wild: A Diagnostic Study of LLMs for Conversational Entity Extraction
Entity extraction is a critical building block for contact-center automation. Extracted entities such as order IDs, dates, account details, product names, locations, and issue types can drive CRM updates, workflow triggers, compliance checks, and downstream analytics. But in real conversations, entities are rarely provided cleanly. Customers revise themselves, agents confirm or correct values, information appears across multiple turns, and the final value may only become clear through conversational context.
This paper studies entity extraction as a conversation-understanding problem rather than a simple span extraction task. It introduces a diagnostic framework around three dimensions: how an entity becomes relevant, how its value evolves during the conversation, and how the final value is articulated. The work evaluates 16 LLMs on 6,387 real-world conversations across 12 entity types, showing that extraction quality can vary significantly depending on conversational dynamics.
One of the key findings is that performance can vary by 50–60% within the same model depending on how the entity is exchanged in the conversation. Error-aware prompting improves extraction across all axes, with average gains of up to 6.4%. The broader lesson is that reliable automation benefits from models that can track how information is negotiated over time, not just identify isolated facts from text.
Refining Contact Center Evaluation Plans via Backward Questions
Automated Quality Assurance in contact centers depends on more than checking whether an agent followed a script. In practice, QA often reflects nuanced, organization-specific judgment: what counts as a complete resolution, whether the agent showed ownership, whether a response met the spirit of a policy, or whether the customer’s issue was handled appropriately. This paper addresses the challenge of making that implicit QA expertise explicit enough for LLMs to apply consistently.
The work introduces Backward Question-based Refinement, a method that examines model errors by asking what question the model appears to have answered instead of the intended QA question. These backward questions help reveal hidden ambiguity in evaluation plans and show where criteria can be clarified. By using stronger LLMs to diagnose these gaps and refine the evaluation plan, the approach improves the performance of smaller, more cost-efficient models on contact-center QA tasks.
Across 12 QA questions, the method improves performance on 8 of them, with gains of up to 27.8% Macro F1. The work highlights a practical path for translating expert QA judgment into clearer, repeatable, and model-friendly evaluation logic—an important step toward scalable and trustworthy AI-assisted quality programs.
Multi-Agent Error-Driven Class Description Refinement for LLM-Based Classification
Classification is a foundational capability for contact-center intelligence. It powers use cases such as call reason detection, objection mining, escalation tracking, compliance monitoring, customer issue categorization, and trend analysis. However, real-world classification often depends not only on how a prompt is written, but also on how clearly the categories themselves are defined. Categories may overlap, class descriptions may be underspecified, and the distinction between two labels may require business context.
This paper focuses on improving the class descriptions that define the taxonomy. It proposes a multi-agent, error-driven refinement framework that studies misclassified examples, identifies commonly confused class pairs, and rewrites descriptions to make class boundaries clearer and more contrastive. Instead of treating classification only as an instruction-following problem, the work treats taxonomy design as an important part of LLM system performance.
Across contact-center and public benchmark datasets, the approach delivers up to 20.71% accuracy improvement over static class descriptions. The broader implication is that high-quality enterprise AI classification benefits from both strong prompting and well-structured business definitions. For production systems, this creates a pathway for classification taxonomies to improve as new examples, edge cases, and customer patterns emerge.
Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance Systems
As LLMs begin to support agent evaluation, coaching, and quality assurance, accuracy is only one part of the picture. These systems operate in workflows where consistency, transparency, and trust are essential. This makes it important to understand how model judgments behave when conversational details change, especially when those details should not affect the evaluation of the underlying agent behavior.
This paper introduces a counterfactual fairness evaluation framework for LLM-based contact-center QA. The core idea is to test whether a model’s judgment remains stable when non-essential attributes are altered while the underlying agent behavior remains the same. The study evaluates 18 LLMs on 3,000 real-world contact-center transcripts across 13 QA dimensions, using metrics such as Counterfactual Flip Rate and Mean Absolute Score Difference.
The findings show measurable variation in model judgments under counterfactual changes, with flip rates ranging from 5.4% to 13.0%. The paper also explores fairness-aware prompting and finds that it can improve consistency in some settings, while leaving room for further research and system-level evaluation. Overall, this work contributes a practical audit framework for assessing fairness and consistency before LLM-based QA systems are applied in workforce-facing workflows.
Mirage: A Diagnostic Framework for Evaluating the Realism of Synthetic Contact Center Dialogue Generation
Synthetic data is an important frontier for contact-center AI. It can help address privacy constraints, data sparsity, domain coverage, long-tail issue simulation, and safer experimentation. However, for synthetic conversations to be useful, they must do more than sound realistic. They must preserve the behavioral, operational, and interactional properties that make real conversations valuable for training and evaluating downstream systems.
This paper investigates how synthetic contact-center dialogues compare with real conversations and where current generation strategies can be improved. It benchmarks multiple generation strategies using structured supervision such as intent summaries, topic flows, and QA forms, and evaluates synthetic data through downstream AutoQA performance as well as a diagnostic framework spanning 17 metrics. These metrics cover emotional arcs, linguistic complexity, interaction style, and broader conversational properties.
The findings show that prompts optimized on real transcripts consistently outperform those optimized on synthetic transcripts for AutoQA, indicating that real conversations continue to capture important interaction patterns that are difficult to fully reproduce synthetically. The work provides a more rigorous way to evaluate synthetic dialogue quality not just by fluency, but by whether the data can support reliable model training, evaluation, and product development.
Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
The next generation of contact-center AI is increasingly agentic. Users do not only ask for a summary; they ask complex business questions that require multiple steps, multiple data sources, and multiple tools. A single query may require pulling structured metrics from a warehouse, retrieving transcript evidence through RAG, comparing customer segments, identifying patterns, and synthesizing a business recommendation. For such workflows, the quality of the final answer depends on whether the AI agent can plan correctly.
This paper evaluates LLMs on tool-aware planning for contact-center analytics. It introduces a benchmark where models must decompose complex business questions into executable sub-questions, assign the right tools, and preserve dependencies across steps. The framework evaluates whether models can produce plans that are logically sound and execution-ready within a real analytics stack involving structured and unstructured data.
The study evaluates 14 LLMs and finds that planning performance varies meaningfully across models, especially for compound queries and longer workflows. While the best metric-wise score reaches 84.8%, the strongest one-shot “A+ tier” exact planning rate is 49.75%, showing that tool-aware planning is a valuable evaluation axis in addition to final answer quality. The work highlights the importance of testing enterprise AI agents on planning, tool selection, dependency reasoning, and execution readiness.
Looking Ahead
The progress of AI in enterprise environments will not be defined by model capability alone. It will be defined by how well AI systems can be designed, evaluated, governed, and integrated into the workflows where real decisions are made.
For contact centers, this means building systems that can understand the complexity of human conversations, reason over business context, support agents and supervisors, and produce outputs that teams can trust and act on. It also means investing in the less visible but essential parts of AI: planning, evaluation, fairness, data quality, taxonomy design, and operational reliability.
Our ACL 2026 research reflects this belief. Each paper addresses a specific technical challenge, but together they point toward a larger direction: AI systems that are not only intelligent, but dependable in the environments where they matter most.
At Observe.AI, we are excited to continue advancing this work with the broader research community, our customers, and our product teams. As LLMs evolve into agentic systems, our focus remains clear: building AI that helps enterprises understand conversations more deeply, improve performance more systematically, and create better experiences for both customers and frontline teams.
We are proud of the researchers, engineers, and collaborators behind these papers, and we look forward to sharing more from this work at ACL 2026.
Subscribe to our newsletter.


.png)
.png)
.png)

.jpg)






