We run production AI agents that autonomously resolve customer service tickets across 100,000+ Zendesk accounts. These agents take a customer issue, plan a multi-step resolution, execute real actions (refunds, order modifications, escalations) through live APIs, and close the ticket without a human in the loop.

The agent core uses a proprietary iterative architecture: the agent decomposes goals into plans, pulls reusable skills from a registry, executes, evaluates the outcome, and refines. Each iteration feeds back into the next attempt. We have a working self-learning mechanism where successful resolution patterns are synthesized into new skills and fed back into the registry, so the system improves from its own execution history.
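As a rough illustration of the loop described above (plan, pull skills, execute, evaluate, refine, then store what worked), here is a minimal Python sketch. It is not the proprietary architecture: `SkillRegistry`, `compose`, `resolve`, `plan_fn`, and `evaluate_fn` are all hypothetical names chosen for this example, and the "synthesis" step is reduced to registering a successful plan as one composite skill.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

State = Dict[str, object]
Skill = Callable[[State], State]


@dataclass
class SkillRegistry:
    """Hypothetical in-memory registry mapping skill names to callables."""
    skills: Dict[str, Skill] = field(default_factory=dict)

    def register(self, name: str, skill: Skill) -> None:
        self.skills[name] = skill

    def get(self, name: str) -> Optional[Skill]:
        return self.skills.get(name)


def compose(steps: List[Skill]) -> Skill:
    """Chain skills so a successful plan can be stored as a single new skill."""
    def composite(state: State) -> State:
        for step in steps:
            state = step(state)
        return state
    return composite


def resolve(goal: State,
            plan_fn: Callable[[State, List[str]], List[str]],
            evaluate_fn: Callable[[State], bool],
            registry: SkillRegistry,
            max_iterations: int = 3) -> Dict[str, object]:
    """Plan, execute, evaluate, and refine until success or the budget runs out."""
    feedback: List[str] = []  # each failed attempt informs the next plan
    for attempt in range(1, max_iterations + 1):
        plan = plan_fn(goal, feedback)  # ordered list of skill names
        steps = [registry.get(name) for name in plan]
        missing = [name for name, step in zip(plan, steps) if step is None]
        if missing:
            feedback.append(f"attempt {attempt}: unknown skills {missing}")
            continue
        state = compose(steps)(dict(goal))  # execute on a copy of the goal
        if evaluate_fn(state):
            # Simplified self-learning: store the winning plan as a new skill.
            learned = "learned:" + "+".join(plan)
            registry.register(learned, compose(steps))
            return {"resolved": True, "attempts": attempt,
                    "state": state, "learned": learned}
        feedback.append(f"attempt {attempt}: evaluation failed for plan {plan}")
    return {"resolved": False, "attempts": max_iterations,
            "state": None, "learned": None}
```

In this sketch the planner sees the accumulated feedback on every retry, which is the simplest way to make each iteration depend on the previous one. A real system would need a far more selective rule for what gets registered as a skill.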

On multi-step tool-use benchmarks (GAIA-class), our agents perform at parity with the best published results. Our internal evaluation suite runs 158+ scenario-based tests drawn from real Zendesk tickets, scored continuously through Braintrust with regression detection on every deploy.

What we need help with

Pushing the architecture further. The iterative planner works, but there are open questions we have not solved yet: how to handle plan decomposition when the goal is ambiguous, how to manage interference between memory tiers under concurrent sessions, how to make skill acquisition more selective (the agent acquires skills too eagerly today), and how to design multi-agent delegation patterns where one agent hands off subtasks to specialized agents via A2A (the Agent-to-Agent protocol).

Domain-specialized agent models. We are building toward training our own models, specialized for customer service resolution via RL on production trajectories. The data pipeline is already being instrumented (resolution outcomes, escalation patterns, user satisfaction signals). The next step is the RL training infrastructure itself: reward curricula, rollout systems, and the feedback loops that turn a capable base model into a specialist that matches or beats frontier models on our task distribution at significantly lower inference cost. This is a 6-12 month build, and we need someone who can own both the science and the systems.

Hardening evaluation. We run 158+ scenario evals continuously with regression detection, but multi-turn evaluation and automated trajectory analysis (pinpointing where reasoning diverged) are still early. We need quality gates that block deploys when agent performance drops, and we need them integrated into CI, not run as an afterthought.
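To make the idea of a deploy-blocking quality gate concrete, here is a minimal Python sketch of the comparison a CI step might run. It assumes per-scenario scores between 0 and 1 are already available as plain dictionaries. The function names, the scenario names in the usage note, and the 0.02 tolerance are hypothetical, and the sketch does not use or represent the Braintrust API.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return one message per scenario that is missing or scores worse than
    the baseline by more than `tolerance`."""
    failures: List[str] = []
    for scenario, base_score in sorted(baseline.items()):
        score = current.get(scenario)
        if score is None:
            failures.append(f"{scenario}: missing from current run")
        elif score < base_score - tolerance:
            failures.append(f"{scenario}: {base_score:.2f} -> {score:.2f}")
    return failures


def gate(baseline: Dict[str, float],
         current: Dict[str, float],
         tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    failures = find_regressions(baseline, current, tolerance)
    for failure in failures:
        print("REGRESSION", failure)
    return 1 if failures else 0
```

A CI job would end with something like `sys.exit(gate(baseline, current))`, so any regression fails the pipeline. For example, a baseline of `{"refund": 0.90, "escalation": 0.80}` against a current run of `{"refund": 0.91, "escalation": 0.70}` blocks the deploy because of the escalation scenario.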

Guardrails at enterprise scale. The threat surface for autonomous agents includes tool misuse, cascading action chains, prompt injection, and hallucination loops that burn tokens before anyone notices. We need multi-layered defenses with supervisor patterns, capabilities-based access control, and output validation that works across thousands of concurrent sessions without adding meaningful latency.
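One layer of such a defense, capabilities-based access control combined with a per-session action budget, could be sketched in Python as follows. This is an assumption-laden illustration rather than a description of the existing system: `ToolGateway`, `Session`, `ToolDenied`, and the capability strings in the usage note are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


class ToolDenied(Exception):
    """Raised when a tool call is blocked by the gateway."""


@dataclass
class Session:
    capabilities: FrozenSet[str]  # what this agent session is allowed to do
    max_actions: int = 10         # hard cap that limits cascading action chains
    actions_taken: int = 0


@dataclass
class ToolGateway:
    """Hypothetical single choke point through which all tool calls pass."""
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    required: Dict[str, str] = field(default_factory=dict)

    def register(self, name: str, capability: str,
                 fn: Callable[..., object]) -> None:
        self.tools[name] = fn
        self.required[name] = capability

    def call(self, session: Session, name: str, **kwargs: object) -> object:
        if name not in self.tools:
            raise ToolDenied(f"unknown tool: {name}")
        if self.required[name] not in session.capabilities:
            raise ToolDenied(f"missing capability: {self.required[name]}")
        if session.actions_taken >= session.max_actions:
            raise ToolDenied("action budget exhausted")
        session.actions_taken += 1  # count only calls that are let through
        return self.tools[name](**kwargs)
```

With a refund tool registered under a capability such as `"payments:write"`, a session holding only `"orders:read"` raises `ToolDenied` before the tool ever runs. The checks are dictionary and set lookups, so they add very little latency per call. Prompt-injection defenses and output validation would be separate layers.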

What we are looking for

5+ years building production ML/AI systems, with hands-on experience in agent architectures (planning, tool dispatch, memory, error recovery). If you have only used LangChain tutorials, this is not the right fit.
Strong evaluation instincts. You understand why public benchmarks diverge from production performance and you have built internal evals to close that gap.
Optional: experience with, or genuine depth in, RL for language models (reward shaping, online/offline tradeoffs, reward hacking as a diagnostic signal). We are building toward domain-specialized training and need someone who can lead that work.
Python and PyTorch fluency. Familiarity with at least one agent framework, combined with the judgment to know when to build custom.

The intelligent heart of customer experience

Zendesk software was built to bring a sense of calm to the chaotic world of customer service. Today we power billions of conversations with brands you know and love.

Zendesk believes in offering our people a fulfilling and inclusive experience. Our hybrid way of working enables us to purposefully come together in person, at one of our many Zendesk offices around the world, to connect, collaborate, and learn, whilst also giving our people the flexibility to work remotely for part of the week.

As part of our commitment to fairness and transparency, we inform all applicants that artificial intelligence (AI) or automated decision systems may be used to screen or evaluate applications for this position, in accordance with Company guidelines and applicable law.

Zendesk is an equal opportunity employer, and we’re proud of our ongoing efforts to foster global diversity, equity, & inclusion in the workplace. Individuals seeking employment and employees at Zendesk are considered without regard to race, color, religion, national origin, age, sex, gender, gender identity, gender expression, sexual orientation, marital status, medical condition, ancestry, disability, military or veteran status, or any other characteristic protected by applicable law. We are an AA/EEO/Veterans/Disabled employer. If you are based in the United States and would like more information about your EEO rights under the law, please click here.

Zendesk endeavors to make reasonable accommodations for applicants with disabilities and disabled veterans pursuant to applicable federal and state law. If you are an individual with a disability and require a reasonable accommodation to submit this application, complete any pre-employment testing, or otherwise participate in the employee selection process, please send an e-mail to peopleandplaces@zendesk.com with your specific accommodation request.
