GRPN · LEADERSHIP INSIGHTS · 2025-05

How I Built My AI Chief of Staff

I built an AI chief of staff because I was drowning in preparation work — not the decisions that required me, but the synthesis that preceded them. Six months to get the architecture right. Here is what it does and where it still fails.

Dušan Šenkypl · CEO, Groupon · 2025-05 · 7 min read

In October 2024 I walked into a board call unprepared on a number that I should have known. Not a number I had forgotten — a number I had never seen, because it lived in a dashboard I was not checking regularly enough and nobody had flagged it to me. The number was merchant churn in a specific cohort, running four points above the model. The board had the data. I did not. It was a five-minute conversation but the kind that clarifies something: the problem was not that I lacked access to information. The problem was that the volume of available information had exceeded what I could process manually without making systematic choices about what to ignore — and I was making those choices implicitly, which meant I was making them badly.

A human chief of staff would have caught it. They would have reviewed the dashboards before the call, noticed the deviation, put it in the prep document, and flagged it as something I should be ready to address. I did not have a chief of staff. I had an EA who managed my calendar and travel, and a team that gave me the information they thought was relevant, which is a filtered view that reflects their judgment about what I care about, not an unfiltered view that lets me apply my own judgment. Those are different things. I decided to build the unfiltered view.

The monitoring layer watches six sources. The choice of six was not arbitrary — I surveyed everything I was reading manually and stripped it down to what was actually changing my decisions. Merchant health metrics from our analytics platform are the most important. I care about activation rate in the first thirty days, average deal velocity by merchant tier, and the support-to-revenue ratio — not because those are the only metrics, but because they are the leading indicators that tend to precede larger problems by two to four weeks. The merchant health agent does not report the raw numbers. It compares the current period against the trailing ninety-day trend and against the cohort average for comparable merchants. A number that looks bad in isolation is often normal for that merchant type. A number that looks fine in isolation but is running three standard deviations off the cohort trend is a signal. The agent knows the difference.
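The cohort comparison described above can be sketched as a z-score check. This is a minimal illustration under my own assumptions — the metric name, threshold, and data shapes are invented for the example, not taken from Groupon's production system:

```python
from statistics import mean, stdev

def flag_anomalies(merchant_metrics, cohort_metrics, threshold=3.0):
    """Flag metrics running more than `threshold` standard deviations
    off the cohort distribution, even if they look fine in isolation."""
    flags = []
    for name, value in merchant_metrics.items():
        cohort = cohort_metrics.get(name, [])
        if len(cohort) < 2:
            continue  # not enough cohort data to establish a baseline
        mu, sigma = mean(cohort), stdev(cohort)
        if sigma == 0:
            continue  # a flat cohort gives no usable spread
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flags.append((name, value, round(z, 2)))
    return flags

# A 30-day activation rate of 0.31 against a cohort clustered near 0.62
# is a strong signal, even though 0.31 might be normal for another tier.
cohort = {"activation_rate_30d": [0.60, 0.62, 0.61, 0.63, 0.59, 0.64]}
print(flag_anomalies({"activation_rate_30d": 0.31}, cohort))
```

The point of the cohort baseline is that "bad in isolation" and "anomalous for this merchant type" are different tests, and only the second one is worth surfacing.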

Engineering velocity signals from the deployment pipeline tell me things my engineering leadership does not always escalate. Cycle time is the metric I track — how long from commit to production. When cycle time spikes, something is wrong: either the build is broken, or there is a manual step that has become a bottleneck, or a team is carrying too much WIP and nothing is completing. I had a period in Q1 where cycle time for one team roughly doubled over three weeks and nobody mentioned it in the weekly sync. The agent caught it on day three. By the time the team lead brought it to me formally, I already had the context and the conversation was faster. That matters at scale — when I know what is happening, I can ask a better question than "what's going on," which is what you ask when you are starting from zero.

Support escalation volume and category sits in the monitoring layer because it is the fastest signal I have on product quality and merchant operations. When escalation volume spikes in a category, it usually means something broke or a process changed in a way that created confusion. The agent tracks both volume and category distribution. A spike in general support volume is less interesting than a spike concentrated in billing disputes from a specific merchant tier, which is a different problem requiring a different response.

Competitor pricing signals from a market data feed give me directional information on positioning — not to make immediate pricing decisions, but to notice when the gap is moving in a direction that should change how we are thinking about merchant acquisition.

Executive calendar for context on upcoming decisions sounds administrative but is actually load-bearing: the synthesis agent uses it to prioritize what to surface. If I have a merchant partnership call at ten, it surfaces the relevant merchant data first. The briefing is not static — it is ordered by what I am about to do.

The sixth source is the top threads in internal communication channels flagged as needing director attention. I monitor this because important things often circulate in Slack before they reach formal reporting, and the lag between "this is a problem" being discussed internally and "this is a problem" reaching me in a structured format can be days. I cannot read all of Slack. The agent identifies threads with anomalous engagement patterns in the director-flagged channels and summarizes what is being debated. It has surfaced three material issues in six months that I would not have seen for another week through normal channels.
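The distinction between a general volume spike and a spike concentrated in one category and tier can be sketched as a distribution-shift check. The record shape, thresholds, and category names are illustrative assumptions, not the production schema:

```python
from collections import Counter

def escalation_signal(current, baseline, volume_factor=1.5, share_jump=0.15):
    """current, baseline: lists of (category, merchant_tier) escalation
    records for the current and prior comparison windows. A general volume
    spike is noted; a spike concentrated in one category/tier pair is
    reported as a distinct, more actionable signal."""
    signals = []
    if baseline and len(current) > volume_factor * len(baseline):
        signals.append(("volume_spike", len(current), len(baseline)))
    cur_counts = Counter(current)
    base_counts = Counter(baseline)
    for key, n in cur_counts.items():
        before = base_counts[key] / max(len(baseline), 1)
        now = n / len(current)
        if now - before > share_jump:
            signals.append(("concentrated_spike", key, round(now, 2)))
    return signals
```

A billing-dispute share that jumps from 20% to 67% of escalations in one tier fires the concentrated signal even when total volume barely moves — which matches the observation that the concentrated case is the one requiring a different response.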

The synthesis agent runs at 6:45 every morning. The briefing has a fixed structure that has not changed since I designed it: what changed overnight and why it matters, what decisions are pending today with the relevant context pre-loaded, what I can safely delegate and to whom, and what requires my attention before noon. A typical briefing looks like this: two or three merchant health signals, one of which is flagged as anomalous with a comparison to cohort trend; an engineering velocity note if cycle time moved significantly; zero to two support escalation summaries if volume spiked in a specific category; any competitor signals that crossed a threshold; three to five pending decisions with their decision context and a draft recommendation; a list of three to eight requests that have been routed to other owners with the rationale. The format is consistent because I read it the same way every day. When something is out of place in the structure, I notice faster than if the format varied. Consistency is a feature, not laziness.
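The fixed briefing structure lends itself to a simple fixed-schema renderer. This is my sketch of that shape — the section names and field layout are inferred from the description above, not the author's code:

```python
from dataclasses import dataclass, field

@dataclass
class Briefing:
    """Fixed briefing structure: the sections never change, only their
    contents. Items within each section are assumed to be pre-ordered by
    relevance to the day's calendar (next meeting first)."""
    changed_overnight: list = field(default_factory=list)  # (signal, why it matters)
    pending_decisions: list = field(default_factory=list)  # (decision, context, draft rec)
    delegations: list = field(default_factory=list)        # (request, owner, rationale)
    before_noon: list = field(default_factory=list)        # needs attention before noon

    def render(self) -> str:
        sections = [
            ("WHAT CHANGED OVERNIGHT", self.changed_overnight),
            ("PENDING DECISIONS", self.pending_decisions),
            ("DELEGATED", self.delegations),
            ("NEEDS ATTENTION BEFORE NOON", self.before_noon),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"## {title}")
            if items:
                lines.extend(f"- {item}" for item in items)
            else:
                lines.append("- none")  # empty sections stay visible
        return "\n".join(lines)
```

Rendering "- none" for empty sections is deliberate: a section that silently disappears is exactly the kind of format variation that slows down a reader who scans the same structure every morning.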

The triage system handles incoming requests throughout the day. Classification is by urgency — requires response today, requires response this week, or can be batched — and by owner: me, or a direct report, or a team with no executive involvement needed. The system attaches context before routing: if a request is going to my head of merchant success, it includes the relevant merchant data so she has background without needing to pull it herself. When a request returns to me, it arrives with a draft response and the reasoning. I override the draft more than I accept it — probably sixty percent of the time I rewrite it substantially. But having the draft changes my starting point from zero to somewhere, which is faster even when the draft is wrong. The value is not in the quality of the draft. It is in the frame.
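The two-axis classification — urgency and owner — plus context attachment can be sketched as a small routing function. The field names, owner mapping, and context fetcher are hypothetical stand-ins, not the real system's interfaces:

```python
def triage(request, owners, fetch_context):
    """Classify a request by urgency and owner, attaching context before
    routing so the owner does not start from zero. `owners` maps topics
    to people; `fetch_context` pulls the relevant background data."""
    urgency = (
        "today" if request.get("deadline") == "today"
        else "this_week" if request.get("deadline") == "week"
        else "batch"
    )
    owner = owners.get(request["topic"], "ceo")  # default: comes back to me
    routed = {
        "request": request,
        "urgency": urgency,
        "owner": owner,
        "context": fetch_context(request["topic"]),
    }
    if owner == "ceo":
        # requests returning to the CEO arrive with a draft to react to
        routed["draft_response"] = f"Draft re: {request['subject']} (review before sending)"
    return routed
```

The draft is attached only on the return path to the CEO, which matches the observation that its value is the frame, not the wording: even a sixty-percent-wrong draft moves the starting point from zero to somewhere.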

The edge cases reveal where the system is not a chief of staff and will not become one without something that does not exist yet. When a request involves a relationship I have not encoded — a board member I need to read carefully, a partner where the history is complicated — the agent has no model of that. It routes based on the surface content of the request, not on the relational context. I have been burned once: a sensitive message from a partner who was already frustrated got a quick routing decision and a draft that was technically correct but tonally wrong for the situation. I caught it before it went out. But it was close enough that I added a rule: anything from that partner comes to me first, unprocessed. I now maintain a list of about fifteen people where the triage agent is not allowed to draft. The list grows slowly and deliberately.
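The do-not-draft list works best as a hard gate evaluated before any classification or drafting happens, so a tonally sensitive message can never reach the draft stage. A minimal sketch, with placeholder addresses standing in for the real list:

```python
# Illustrative placeholders -- the real list holds ~15 specific people.
DO_NOT_DRAFT = {"partner.x@example.com", "board.member.y@example.com"}

def pre_route(message):
    """Gate checked before triage: messages from anyone on the
    do-not-draft list bypass the agent entirely and land in the
    inbox unprocessed, with no routing decision and no draft."""
    if message["sender"] in DO_NOT_DRAFT:
        return {"route": "ceo_inbox_raw", "draft": None, "reason": "do_not_draft"}
    return {"route": "triage", "draft": "pending"}
```

Putting the check first, rather than inside the triage logic, means a future change to classification rules cannot accidentally reintroduce drafting for these relationships.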

The second failure mode is anything where the right answer requires understanding what is not in the data. The agent sees what is measurable. It does not see that a team is demoralized because of a difficult quarter, or that a merchant relationship is fragile because of a conversation that happened in person last week. Those things change what a number means. When cycle time spikes and the team is demoralized, the response is different from when cycle time spikes because the build system has a flaky test. The numbers look identical. The context is everything. I have learned to treat agent-flagged items as the start of a question, not the end of one.

The third failure mode is high-stakes decisions where the speed advantage does not justify the risk of a wrong call. I do not route personnel decisions through this system. I do not route board communications without reviewing them personally from the beginning. I do not use the draft for any communication that could be read as a signal about company direction. The system handles the volume of work that is important but not singular. The singular work gets full attention without the draft.

If I were building this from scratch today I would do two things differently. First, I would spend more time at the beginning mapping the decisions I actually make, not the information I receive. The monitoring layer I built was initially shaped by what was easy to monitor, not by what would have changed my decisions. Those are different sets. I corrected for this over six months but it would have been cheaper to get it right at the start. Second, I would build the feedback loop earlier. Right now I have a lightweight mechanism to flag when the agent's routing or draft was wrong, but the signal does not consistently improve the model — it mostly improves my list of exceptions. A tighter feedback loop where flagged errors actually update the agent's behavior would be more valuable than the exception list. I have not built that yet. It is the next version.
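One shape the tighter feedback loop could take — my speculation, since the author says it is not built yet — is logging corrections in a structured form and promoting repeated corrections from one-off exceptions to behavior-change candidates:

```python
from collections import Counter

def record_feedback(log, item_id, error_type, correction):
    """Append a structured correction so a later update pass can consume
    it, instead of only growing a static exception list. error_type is
    one of 'routing', 'draft_tone', 'draft_content' (illustrative)."""
    entry = {"item": item_id, "error": error_type, "fix": correction}
    log.append(entry)
    return entry

def behavior_updates(log, min_repeats=3):
    """A correction repeated across several distinct items is a candidate
    for a behavior change rather than another entry on the exception list."""
    counts = Counter((e["error"], e["fix"]) for e in log)
    return [key for key, n in counts.items() if n >= min_repeats]
```

The threshold is the interesting design choice: one flagged error is an exception, three identical corrections are a pattern worth folding into the agent's default behavior.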

What changed in how I make decisions is subtler than I expected. I expected to feel more informed. What actually changed is that I spend less time in a reactive posture. Most decision-making in a CEO role is reactive: something surfaces, you respond. The preparation work was invisible — it happened in the hour before calls, in late-night email reviews, in the mental overhead of tracking what needed tracking. Moving that preparation work to the agent did not free up hours in any obvious way. It freed up attention. I am reading the same briefings in the same amount of time. I am processing them differently because I am not also trying to remember what I was supposed to check.
