GRPN · LEADERSHIP INSIGHTS · 2025-04

How Subagents Collapse a Day of Analysis into 20 Minutes

The merchant performance review used to take most of a Monday. We rebuilt the workflow around a three-agent pipeline and it now runs in under twenty minutes. What we learned about where institutional knowledge lives — and how to encode it.

Groupon Engineering Team · 2025-04 · 4 min read

Monday morning had a shape. The analyst who owned merchant performance reviews came in, pulled the weekend transaction data from Redshift, and spent the first two hours making the data usable. This was not glamorous work and it was not simple work. The raw query returned somewhere between forty and sixty thousand rows depending on the weekend, structured around individual transactions rather than merchant-level summaries. Getting from transactions to merchants meant applying a taxonomy that had been built over several years and revised dozens of times — a controlled vocabulary for merchant categories that handled edge cases like multi-location restaurant groups, seasonal merchants who activated and deactivated on irregular schedules, and merchants who had changed category classifications after acquisitions. The taxonomy lived in a separate database. The join logic to apply it correctly was documented in a Confluence page that three people knew about and nobody kept current.
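
As a rough illustration, the rollup might look something like this in pandas; the schema, the parent-merchant convention, and the function itself are assumptions for the sketch, not the production pipeline:

```python
import pandas as pd

# A minimal sketch of the transaction-to-merchant rollup. Table and
# column names are illustrative, not the production schema.
def normalize(transactions: pd.DataFrame, taxonomy: pd.DataFrame) -> pd.DataFrame:
    # Multi-location groups roll up to a parent merchant before
    # aggregation, one of the edge cases the Confluence page documented.
    taxonomy = taxonomy.assign(
        canonical_id=taxonomy["parent_merchant_id"].fillna(taxonomy["merchant_id"])
    )
    merged = transactions.merge(
        taxonomy[["merchant_id", "canonical_id", "category"]],
        on="merchant_id", how="left",
    )
    return (
        merged.groupby("canonical_id")
        .agg(revenue=("amount", "sum"),
             txn_count=("amount", "count"),
             category=("category", "first"))
        .reset_index()
    )
```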

By ten o'clock the analyst had a normalized dataset. Then came the cohort segmentation: separating merchants by acquisition channel (direct sales, self-serve, partner referral), by cohort vintage (first six months, six to eighteen months, eighteen months plus), and by revenue tier. This mattered because a retention signal that looked alarming in aggregate often turned out to be concentrated in a specific cohort with a known explanation — a batch of merchants acquired through a channel that historically showed lower second-year retention, or a seasonal dip that affected all merchants in a particular category at the same time of year. Segmenting before interpreting was not optional. It was the difference between a useful analysis and a misleading one.
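
A sketch of what that segmentation might look like, using the bucket boundaries described above; the column names are illustrative:

```python
import pandas as pd

# Illustrative cohort segmentation; column names are assumptions, the
# buckets mirror the ones described above.
def add_segments(merchants: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    age_months = (as_of - merchants["activated_at"]).dt.days / 30.4
    return merchants.assign(
        vintage=pd.cut(age_months, bins=[0, 6, 18, float("inf")],
                       labels=["0-6mo", "6-18mo", "18mo+"]),
        revenue_tier=pd.qcut(merchants["trailing_revenue"], q=3,
                             labels=["low", "mid", "high"]),
    )

# Interpretation happens per (channel, vintage, tier) cell, never in
# aggregate:
#   add_segments(m, as_of).groupby(["channel", "vintage", "revenue_tier"],
#                                  observed=True)
```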

The retention model ran against the segmented data. This step was mostly automated even before the agent pipeline — the model itself was not the bottleneck. The bottleneck was the cross-reference that came after: matching merchants whose retention trajectory had diverged from their cohort baseline against the support ticket database to see whether the divergence correlated with service issues, contract disputes, or nothing documented. That cross-reference required pulling a separate dataset, doing a manual join on merchant ID, and reading enough tickets to form a judgment about which divergences were explicable and which were genuinely anomalous. Two to three hours, depending on how many merchants had flagged.
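
The join itself was never the hard part, but for concreteness, a minimal sketch of the cross-reference, with illustrative column names:

```python
import pandas as pd

# Sketch of the merchant-to-ticket cross-reference. `divergent` holds
# merchants whose trajectory diverged from cohort baseline, `tickets`
# the recent support tickets; column names are illustrative.
def cross_reference(divergent: pd.DataFrame, tickets: pd.DataFrame) -> pd.DataFrame:
    # A left join keeps merchants with no tickets: their ticket_id stays
    # NaN, the "nothing documented" bucket that needs the closest read.
    joined = divergent.merge(
        tickets[["merchant_id", "ticket_id", "ticket_category", "opened_at"]],
        on="merchant_id", how="left",
    )
    joined["unexplained"] = joined["ticket_id"].isna()
    return joined
```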

The structured summary for the director came last. A document that laid out what the data showed, which merchants needed attention and why, and what the data suggested about cause. This part required the most judgment and the most time — not because the writing was difficult but because the analyst had to decide what counted as signal versus noise, which anomalies were worth escalating, and which findings had actionable implications versus which were interesting but not urgent. By the time the summary landed in the director's inbox, it was mid-afternoon at the earliest. Decisions that could have been made on Monday morning were being made on Monday afternoon, or deferred to Tuesday.

The three-agent pipeline replaced that sequence without replacing the judgment it required. What it required instead was encoding that judgment explicitly, which turned out to be most of the work.

The first agent handles Redshift querying and normalization. Its instructions specify the exact taxonomy join logic — including the edge case handling for multi-location merchants that had previously lived only in the analyst's head — and direct it to produce a merchant-level summary with explicit fields for category, acquisition channel, cohort vintage, and revenue tier. Critically, the agent's instructions also specify what to do when a merchant does not fit cleanly into the taxonomy: flag it with a confidence score rather than forcing a category assignment. The original manual process handled ambiguous merchants inconsistently, depending on how rushed the analyst was. The agent handles them consistently, every time, and surfaces them for review rather than silently categorizing them.
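
A hypothetical sketch of that output contract; the field names and the 0.8 threshold are ours for illustration, not the agent's actual instructions:

```python
from dataclasses import dataclass

# Hypothetical shape of the first agent's output contract: an ambiguous
# merchant carries a confidence score and a review flag instead of a
# forced category. The 0.8 threshold is an assumption.
@dataclass
class CategoryAssignment:
    merchant_id: str
    category: str | None   # None when nothing in the taxonomy fits
    confidence: float      # 0.0 to 1.0
    needs_review: bool

def assign_category(merchant_id: str, candidates: list) -> CategoryAssignment:
    """candidates: (category, score) pairs from the taxonomy match, best first."""
    if not candidates:
        return CategoryAssignment(merchant_id, None, 0.0, True)
    category, score = candidates[0]
    # Below threshold: keep the best guess but surface it for review
    # rather than silently categorizing.
    return CategoryAssignment(merchant_id, category, score, score < 0.8)
```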

The second agent takes the normalized output and runs the retention model. Its instructions go beyond the mechanical model-running: they specify what constitutes a meaningful divergence from cohort baseline (more than one standard deviation, excluding known seasonal patterns for the relevant quarter), and they direct the agent to query the support ticket database and surface any tickets from the past ninety days associated with merchants that flagged. The agent does not interpret those tickets — that is the third agent's job — but it formats them alongside the retention data so the context is preserved. The cross-reference that used to take two hours now happens in parallel with the retention calculation and adds no time to the pipeline.
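
The divergence rule is simple enough to state in code. A sketch, assuming illustrative column names and a per-category table of seasonal adjustments:

```python
import pandas as pd

# Sketch of the divergence rule stated above: more than one standard
# deviation from the cohort baseline, after removing the known seasonal
# component for the quarter. Column names and the per-category seasonal
# adjustments are illustrative.
def flag_divergence(df: pd.DataFrame,
                    seasonal_adjust: dict[str, float]) -> pd.DataFrame:
    adj = df["category"].map(seasonal_adjust).fillna(0.0)
    df = df.assign(adj_retention=df["retention"] - adj)

    baseline = df.groupby("cohort")["adj_retention"].agg(["mean", "std"])
    df = df.join(baseline, on="cohort")
    df["divergence_flag"] = (df["adj_retention"] - df["mean"]).abs() > df["std"]
    return df
```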

The third agent produces the structured summary. Its instructions specify the output format the director needs, the priority ordering for merchant flags (revenue impact first, trajectory second, then unexplained anomalies), and the language conventions the organization uses for this type of analysis. It is instructed not to hedge on findings where the data is clear and to flag explicitly where interpretation is uncertain. The first version of these instructions produced summaries that were technically accurate but verbose in ways the analyst recognized immediately as covering up uncertainty. The fix was to make the confidence reporting explicit: the agent now distinguishes between "data is clear, interpretation is high confidence" and "data is ambiguous, flag for human review" instead of producing text that blurred those distinctions.
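
A sketch of the ordering and the confidence distinction; the field names and the 0.8 cutoff are assumptions for illustration:

```python
# Sketch of the third agent's flag ordering and explicit confidence
# labels; field names and the 0.8 cutoff are assumptions.
PRIORITY = {"revenue_impact": 0, "trajectory": 1, "unexplained_anomaly": 2}

def order_flags(flags: list[dict]) -> list[dict]:
    # Revenue impact first, trajectory second, unexplained anomalies
    # last; larger revenue impact sorts first within a bucket.
    return sorted(
        flags,
        key=lambda f: (PRIORITY[f["reason"]], -f.get("revenue_impact", 0.0)),
    )

def confidence_label(flag: dict) -> str:
    # The explicit distinction that replaced hedged, verbose prose.
    if flag["data_clear"] and flag["interpretation_confidence"] >= 0.8:
        return "data is clear, interpretation is high confidence"
    return "data is ambiguous, flag for human review"
```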

Beyond multi-location merchants, we encountered several other failure modes in the first month. The normalization agent made incorrect assumptions about merchant deactivation: it treated any merchant without a transaction in the prior thirty days as churned, which produced a dramatically inflated churn signal for seasonal businesses. The fix required adding a seasonality flag to the merchant taxonomy and instructing the agent to apply it before flagging absences. A more subtle problem: the retention model agent occasionally produced outputs that were mathematically valid but did not match how the organization defined retention. The organization's definition had evolved over time and was not consistently documented. Resolving it required the analyst and the data team lead to sit down and produce a written definition that the agent's instructions could reference. The agent forced a clarification that had been deferred for years because everyone thought someone else had documented it.
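
The corrected churn rule, sketched with assumed field names standing in for the taxonomy flag:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Sketch of the corrected churn rule: a thirty-day absence only counts
# outside a merchant's known off-season. The seasonality fields are
# assumptions standing in for the taxonomy flag described above.
@dataclass
class Merchant:
    last_transaction_at: date
    seasonal: bool
    off_season_months: set[int]   # e.g. {11, 12, 1} for a summer business

def is_churn_candidate(m: Merchant, today: date) -> bool:
    if m.last_transaction_at >= today - timedelta(days=30):
        return False  # recently active
    if m.seasonal and today.month in m.off_season_months:
        return False  # expected absence, not a churn signal
    return True
```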

The institutional knowledge problem is the real problem the pipeline solved, and it is worth being specific about what that means. Institutional knowledge in analytical workflows is not exotic expertise — it is the accumulation of decisions made about ambiguous cases that were never written down because writing them down was slower than just handling them. The analyst who ran the Monday review knew that a particular restaurant group with locations in three cities should be treated as a single merchant for retention purposes. She knew which quarter's data to exclude when calculating seasonal baselines for spa merchants because a platform outage had distorted that quarter's numbers. She knew that a specific acquisition channel had higher sixty-day churn and lower twelve-month churn than the numbers suggested because the merchants it brought in took longer to ramp. None of that was documented. It was in her head and she applied it by feel.

Encoding it into the agent's specification required a methodology: we ran the agent against six months of historical data and had the analyst review the output, flagging every case where the agent's treatment of a merchant differed from how she would have handled it. Each flag became a documented edge case. The agent's instructions grew by about forty percent through that process. The result was not an agent that perfectly replicated the analyst's judgment — it was an agent that made the analyst's implicit knowledge explicit, and in doing so made it auditable, transferable, and improvable by anyone who came after.
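
In code terms, the methodology was a replay-and-diff harness. A sketch, with illustrative interfaces:

```python
# Sketch of the replay-and-diff harness: run the agent over historical
# weeks and diff its treatment of each merchant against the analyst's
# recorded handling. Interfaces are illustrative.
def review_run(agent, weeks, analyst_decisions):
    """analyst_decisions: {(week, merchant_id): treatment} from the manual era."""
    mismatches = []
    for week in weeks:
        for merchant_id, treatment in agent.run(week).items():
            expected = analyst_decisions.get((week, merchant_id))
            if expected is not None and expected != treatment:
                # Every mismatch became a documented edge case in the spec.
                mismatches.append((week, merchant_id, expected, treatment))
    return mismatches
```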

Running the analysis twice a week instead of once changed the decisions that got made and when they got made. The original Monday review was retrospective by design: it looked at the prior week's data and produced a report on what had happened. The additional Thursday run introduced something closer to a real-time view: directors reviewing merchant health on Thursday morning could see the impact of interventions made earlier in the week. Merchants who had been flagged Monday and contacted by account management on Tuesday showed up in Thursday's analysis with updated trajectory data. The feedback loop between intervention and signal went from one week to three days. That is not a marginal improvement. Two decisions were made differently in the first month of Thursday runs, both involving merchants flagged Monday, contacted Tuesday, and showing improved trajectory by Thursday. Under the old cadence, those merchants would have had to wait a full week to show up as recovering rather than at-risk. Under the new cadence, account management had confirmation of the intervention's impact before the week was out.

The analyst who ran this process every Monday now runs it twice a week and spends the hours the analysis used to consume on the questions the analysis surfaces. That part is true but undersells what actually changed. What changed is that her job shifted from executing a repeatable analytical process to owning the specification that governs how the process executes. She reviews the agent's outputs, maintains the edge case documentation, flags when the analysis produces something that doesn't look right, and drives the quarterly reviews where we assess whether the taxonomy and the retention definition still match how the business actually works. The analytical judgment did not get automated. It got moved upstream — from the execution of the analysis to the specification that makes the execution possible. That is a different job. Some analysts prefer it. Some don't. It is worth being honest about both.
