GRPN · LEADERSHIP INSIGHTS · 2025-02

My AI Environment Contributions at Groupon: Beyond a Claude Code Setup

Claude Code is where it starts. But the real work is building the environment around it — the agents, the evaluation loops, the governance. Here's what I shipped.

Roland Abou Younes · Engineering Director · 2025-02 · 8 min read

Claude Code is the entry point, not the destination. Installing it takes twenty minutes. Building the environment that makes it useful for Groupon-specific work took the better part of a year. The gap between those two timelines is where the real engineering lives.

Six weeks into broad rollout, one of our engineers used an agent to refactor the API layer for a mid-tier service. The refactor looked good. Tests passed, the code was cleaner than what it replaced, and the PR review flagged nothing. The problem surfaced four days later in staging: the refactored service was publishing events in a new schema, while three downstream consumers still expected the old format. The agent had no way to know this. The service it touched had no documentation of the consumers depending on it, and the contract those consumers relied on lived implicitly in a schema definition three repositories away. The agent made a locally correct decision — the refactored API was cleaner by any measure internal to that service — and a globally wrong one. The fix took a week and involved engineers from two other teams.

That incident is what made the context injection work feel urgent. The problem was not the agent's reasoning — it was what the agent did not know going in. Every significant service at Groupon sits inside a dependency graph that has been built up over fifteen years of platform evolution. Some of that structure is formally documented in our architecture registry. A lot of it lives in the heads of engineers who have been here long enough to know which services talk to which other services, what the implicit contracts are, and where the landmines sit. A new engineer spends three to six months learning enough of this to avoid the class of mistake our agent made in six weeks. The question we had to answer was: how do you give an agent enough of that graph to make decisions that respect it?

The system we built queries our internal architecture registry at the start of any agentic task scoped to a specific service. The query resolves the service's direct dependents — the services that call it — and its direct dependencies — the services it calls — one level in each direction. For services flagged as high-traffic or payment-adjacent, the resolution extends two levels outward. The query also surfaces any schema contracts registered between services in the graph, any active migrations that might be affecting interfaces in the vicinity, and any services marked as stability-critical that the task's service communicates with. All of that gets formatted into a structured preamble — not a wall of JSON, but a readable summary organized by service relationship type — and injected into the agent's context before it begins.
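For illustration, here is a minimal sketch of that resolution step. The `ArchitectureRegistry` client and its method names are hypothetical stand-ins for our internal registry API, which I will not reproduce here; the shape of the query is the point.

```python
# Sketch of the context-resolution step. The registry client and its
# methods (callers_of, contracts_between, etc.) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ServiceContext:
    service: str
    dependents: list = field(default_factory=list)        # services that call it
    dependencies: list = field(default_factory=list)      # services it calls
    contracts: list = field(default_factory=list)         # registered schema contracts
    migrations: list = field(default_factory=list)        # active migrations in the vicinity
    critical_neighbors: list = field(default_factory=list)

def resolve_context(registry, service: str) -> ServiceContext:
    # High-traffic and payment-adjacent services get a two-level view.
    depth = 2 if registry.is_high_traffic(service) or registry.is_payment_adjacent(service) else 1
    ctx = ServiceContext(service=service)
    ctx.dependents = registry.callers_of(service, depth=depth)
    ctx.dependencies = registry.callees_of(service, depth=depth)
    graph = [service, *ctx.dependents, *ctx.dependencies]
    ctx.contracts = registry.contracts_between(graph)
    ctx.migrations = registry.active_migrations(graph)
    ctx.critical_neighbors = [s for s in graph if registry.is_stability_critical(s)]
    return ctx
```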

The formatting matters more than it sounds. An early version of the system dumped the raw registry output into context. The agents read it but did not consistently use it. When we analyzed the cases where agents still made globally incorrect decisions, the pattern was not that the context was missing — it was that the relevant constraint was buried in the middle of a large block of data and the agent effectively skimmed past it. The current format puts the highest-salience constraints first: downstream consumers of the service being modified, any registered schema contracts, any active migrations. Lower-salience information — the full dependency tree, historical change logs — appears at the end. Since restructuring the format, the rate of globally incorrect agentic decisions on tasks covered by the registry has dropped by a factor we track internally and report quarterly.
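Continuing the sketch above, the formatting step is roughly a salience-ordered template. The section titles here are illustrative, not our literal preamble text:

```python
def format_preamble(ctx: ServiceContext) -> str:
    # Highest-salience constraints first: the things an agent must not skim past.
    sections = [
        ("Downstream consumers (changes here can break them)", ctx.dependents),
        ("Registered schema contracts (must be honored)", ctx.contracts),
        ("Active migrations in the vicinity", ctx.migrations),
        ("Stability-critical neighbors", ctx.critical_neighbors),
        # Lower-salience detail goes last.
        ("Full dependency list", ctx.dependencies),
    ]
    lines = [f"Architecture context for {ctx.service}:"]
    for title, items in sections:
        if items:
            lines.append(f"\n{title}:")
            lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)
```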

The evaluation framework is the other piece I spent the most time on. The framing I use internally is: passing linting is not a bar; it is a floor. The evaluation runs against every significant agent output before it touches production — significant meaning anything that modifies a shared service, adds a dependency, or changes an interface. We designed the evaluation around failure modes we had actually observed in the first six months of usage, not failure modes we theorized might happen.

There are five checks in the current framework. The first checks API contract compliance: it compares the agent's output against the registered schema contracts for any interfaces the code touches and flags deviations. The second checks SDK version compatibility: Groupon's internal SDK has had three major versions in the past four years, and agents without explicit context about current version conventions occasionally write code that is syntactically valid for an older version. The third checks SQL partition logic: a pattern we saw repeatedly in early agent-written queries was technically correct SQL that scanned the wrong partition because the agent did not know how our data warehouse partitioning conventions worked. The fourth checks test coverage against a minimum threshold for any new code path added — not a blanket percentage target, but a check that covers the specific paths the agent created. The fifth is the one I am most proud of: a behavioral delta check that runs the existing test suite against the modified code and flags any change in test execution time greater than fifteen percent. Agents occasionally introduce inefficiencies — unnecessary loops, redundant fetches — that pass correctness tests but degrade performance in ways that show up only at scale.
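To make the fifth check concrete, here is a minimal sketch of the behavioral delta logic. The `make test` runner and the timing mechanics are assumptions for illustration; the fifteen percent threshold is the one described above:

```python
import subprocess
import time

DELTA_THRESHOLD = 0.15  # flag any test-suite slowdown beyond fifteen percent

def run_suite(workdir: str) -> float:
    # Assumed test entry point; substitute the project's actual runner.
    start = time.monotonic()
    subprocess.run(["make", "test"], cwd=workdir, check=True)
    return time.monotonic() - start

def behavioral_delta_check(baseline_dir: str, modified_dir: str) -> bool:
    # Run the existing suite against both trees and compare wall-clock time.
    baseline = run_suite(baseline_dir)
    modified = run_suite(modified_dir)
    delta = (modified - baseline) / baseline
    if delta > DELTA_THRESHOLD:
        print(f"FLAG: test suite slowed by {delta:.0%} (threshold {DELTA_THRESHOLD:.0%})")
        return False
    return True
```

Correctness tests answer "does it still work"; this check answers "does it still work the same way" — which is how it catches the redundant fetches and unnecessary loops that only hurt at scale.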

The subtle bug the evaluation caught that I remember most clearly involved a data migration script an agent wrote for moving legacy merchant records into a new schema. The SQL logic was correct, the data types matched, and the first pass over sample data looked clean. What the evaluation flagged was that the script's batch size parameter was hardcoded to a value that worked for the test dataset but would cause memory issues at the scale of the actual migration — about thirty times more records. This was not a logical error in any classical sense. The tests would not have caught it. The behavioral delta check caught it because the test suite run time spiked when the evaluation ran against a larger sample. We added a configurable batch size parameter and a check that validates it against the expected record count before the migration runs. The agent's code was useful — the bug it shipped with was the kind of thing that would have been found in production at the worst possible moment.
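A sketch of the guard we added, with hypothetical parameter names and a hypothetical memory budget — the real values depend on the warehouse and the record shape:

```python
def validate_batch_size(batch_size: int, expected_records: int,
                        max_records_in_memory: int = 50_000) -> None:
    # Runs before the migration starts: check the configured batch size
    # against the real record count, not the sample the tests used.
    if batch_size > max_records_in_memory:
        raise ValueError(
            f"batch_size={batch_size} exceeds the per-batch memory budget "
            f"of {max_records_in_memory} records")
    batches = -(-expected_records // batch_size)  # ceiling division
    print(f"Migration plan: {expected_records} records in {batches} batches of {batch_size}")
```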

The governance model is the piece that changes most often, because it is the piece most dependent on accumulated evidence rather than first principles. The current structure has three tiers. Agents act autonomously on any task scoped entirely to the test environment and assessed below the risk threshold — which in practice means changes to non-shared services, new utility functions, test additions, and documentation. They propose for human review on any change touching a production service, modifying a shared interface, or assessed above the threshold. They escalate immediately — meaning no action, immediate notification — on anything touching authentication, billing, customer PII, or any service marked as payment-critical in the registry.

The risk threshold is not a single number. It is a composite score derived from four dimensions: blast radius (how many services or users would be affected if this went wrong), reversibility (how easily the change can be rolled back), novelty (whether the change is within the pattern distribution of changes we have previously seen agents make correctly), and data sensitivity (whether the change involves any customer or merchant data). A change that scores high on reversibility and low on everything else gets a lower composite score than a change that scores moderate on all four. The threshold for autonomous action is a composite score below fifteen. Anything above thirty escalates — no action, immediate notification. Between fifteen and thirty, the agent proposes and waits for human review.
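As a sketch, the routing logic looks roughly like this. The per-dimension weighting is an assumption (each dimension scored zero to ten and summed, with reversibility inverted so that an easily rolled-back change contributes less risk); the fifteen/thirty thresholds and the protected-domain bypass are as described above:

```python
AUTONOMOUS_MAX = 15  # below this: act autonomously
ESCALATE_MIN = 30    # above this: escalate, no action

def composite_score(blast_radius: int, reversibility: int,
                    novelty: int, data_sensitivity: int) -> int:
    # Hypothetical weighting: each dimension scored 0-10, summed.
    # Reversibility is inverted: easy rollback means less risk.
    irreversibility = 10 - reversibility
    return blast_radius + irreversibility + novelty + data_sensitivity

def route(score: int, touches_protected_domain: bool) -> str:
    # Auth, billing, customer PII, and payment-critical services
    # bypass scoring entirely and escalate immediately.
    if touches_protected_domain:
        return "escalate"
    if score < AUTONOMOUS_MAX:
        return "act"
    if score > ESCALATE_MIN:
        return "escalate"
    return "propose"  # between fifteen and thirty: propose and wait
```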

The categories themselves are reviewed quarterly. The review process involves the engineering leads for the teams that have been using agents most actively, the security team, and one person from legal. We look at every escalation from the prior quarter — what was flagged, whether the human review agreed with the escalation or found it unnecessary, and whether any autonomous decisions led to incidents. Where we see consistent patterns of unnecessary escalation, we consider moving the category toward autonomous. Where we see incidents in autonomous categories, we reconsider the threshold calibration or move the category up.

What we got wrong initially was treating governance as a policy problem rather than an evidence problem. The first governance framework we shipped was based on first-principles reasoning about what kinds of changes seemed risky. It was not bad reasoning, but it was disconnected from what actually went wrong in practice. The multi-location merchant normalization bug described in our merchant analytics work was not a governance failure — it surfaced in a context the governance framework did not cover. But the API contract failure from the refactoring incident was exactly the kind of thing governance should have caught, and it did not because the framework did not yet have visibility into contract compliance. We added it after the incident. The framework is better for having been tested against reality rather than theory.

The other thing we got wrong initially was thinking governance was primarily about preventing bad outcomes. It is also about building confidence. The teams that use agents most aggressively now are the teams that went through the most careful governance review early on — not because the governance slowed them down, but because the evidence it produced gave them legitimate confidence about where the agents were reliable. The governance model is not a brake on agent usage. It is the mechanism by which agent usage earns trust.

Three things remain human-only in our current setup, and I expect them to stay that way for the foreseeable future. The first is any decision that sets a precedent — architectural choices that will constrain how a system evolves over the next three to five years. Agents can analyze options and surface tradeoffs, but the person making the call needs to be able to defend it and own it in a way that matters institutionally. The second is anything involving a novel integration with an external partner — Groupon's merchant relationships involve commercial and contractual dimensions that require human judgment about trust and business context that we have not figured out how to encode. The third is security architecture: not security-adjacent code, which agents handle well under evaluation, but decisions about how Groupon's security model itself should evolve. Those decisions are small in number and large in consequence, and they belong to people.
