AI Strategy

What Actually Goes Wrong When You Run AI Agents at Scale

The New York Times profiled SMB owners running AI agent armies across finance, email, and customer service. Here's what broke, what worked, and what to steal from their experience.

Cameron Breen
2026-06-05 · 5 min read
TL;DR

Running multiple AI agents across business operations is something SMB owners are doing now, not a future concept. The upside is substantial, but so are the failure modes: agents that email the wrong customers, make unauthorized financial decisions, and create liability gaps most owners don't catch until something breaks. The NYT piece surfaced owners managing 10-plus agents simultaneously, with at least one describing a finance agent that nearly processed a duplicate vendor payment before a human caught it. The lesson isn't to slow down. It's to build with controls from day one.

What does running AI agents across your business actually look like in practice?

It looks like a small team of 6 people handling the operational output of a 20-person company, until something goes sideways. The New York Times recently profiled real SMB owners who have deployed AI agents across finance, email, and customer service simultaneously. The reporting wasn't a story about magic productivity. It was a more honest account: big wins, real failures, and a learning curve most vendors don't warn you about.

This is worth reading carefully if you're planning to move beyond single-task AI tools and start connecting agents together.

What actually went wrong for these business owners?

The failure patterns the Times documented are not random. They cluster around three specific areas.

Autonomous action without guardrails. Several owners described agents taking actions they hadn't explicitly authorized. One finance agent nearly pushed a duplicate vendor payment through before a human caught it in review. The agent had no way to know the invoice had already been paid via a different channel. It was doing exactly what it was trained to do. The problem was the absence of a human approval step for transactions above a defined threshold.

Email agents and customer trust. Customer-facing email agents created the most visible problems. When an agent responded to a complaint with a templated tone that didn't match the situation, it escalated rather than resolved the issue. Customers don't distinguish between "the AI made a mistake" and "the company made a mistake." The reputational exposure is the same.

Agents that can't see each other. When you run separate agents on finance, email, and customer service without a shared context layer, they make contradictory decisions. A customer service agent might promise a refund at the same time a finance agent is flagging that customer's account. No one programmed the conflict. It emerged from isolation.

The agents weren't failing because AI is bad. They were failing because no one designed the system around what could go wrong.
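To make the duplicate-payment failure concrete, here is a minimal sketch of the kind of guard that was missing. Everything in it is an assumption for illustration: the `Payment` fields, the `review_payment` function, the dollar threshold, and above all the idea that you can collect a set of already-paid invoices from every payment channel, which is exactly the information the agent in the story lacked.

```python
from dataclasses import dataclass

# Hypothetical value: the article does not name a specific dollar threshold.
APPROVAL_THRESHOLD = 1_000.00


@dataclass(frozen=True)
class Payment:
    vendor: str
    invoice_id: str
    amount: float


def review_payment(payment, already_paid, threshold=APPROVAL_THRESHOLD):
    """Decide what a finance agent may do with a proposed payment.

    already_paid is assumed to be a set of (vendor, invoice_id) pairs
    gathered from every payment channel, not only the agent's own.
    """
    if (payment.vendor, payment.invoice_id) in already_paid:
        return "block_duplicate"
    if payment.amount > threshold:
        return "needs_human_approval"
    return "auto_approve"


# Illustrative data only.
paid_elsewhere = {("Acme Supplies", "INV-1001")}
for p in [
    Payment("Acme Supplies", "INV-1001", 250.00),
    Payment("Acme Supplies", "INV-1002", 4_800.00),
    Payment("Acme Supplies", "INV-1003", 120.00),
]:
    print(p.invoice_id, review_payment(p, paid_elsewhere))
```

The duplicate check runs before the threshold check on purpose: a small duplicate payment is still a duplicate, so size alone should not let it through.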

What did the owners who got it right do differently?

The owners with the strongest results shared a few common practices that don't get enough attention in the typical "AI transformation" pitch.

They started with read-only agents, then added write permissions slowly. An agent that can summarize your inbox is lower risk than one that can send from it. The owners who avoided major incidents typically spent 2–4 weeks running agents in monitoring mode before granting action permissions. That window surfaces the edge cases before they cost you.

They built explicit escalation paths. Every agent had a defined set of conditions that would route to a human before taking action. Not "when the agent isn't sure," but specific triggers: transactions over a dollar threshold, complaints containing specific language, any first-contact customer email. This isn't complicated to set up, but most owners skip it because the agent seems to be working fine in testing.

They treated agent outputs like employee outputs. The SMB owners running the tightest operations described reviewing agent activity logs the same way a manager reviews team output, briefly but consistently. Spot-checking 10–15 agent actions per week takes less than 30 minutes and catches drift before it compounds.
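The explicit escalation paths described above can be written down as plain rules rather than left to the agent's judgment. The sketch below is one hedged way to do that; the `ProposedAction` shape, the threshold, and the phrase list are all hypothetical placeholders you would replace with your own.

```python
from dataclasses import dataclass

# Hypothetical values: pick your own threshold and trigger phrases.
DOLLAR_THRESHOLD = 500.00
ESCALATION_PHRASES = ("refund", "lawyer", "cancel")


@dataclass
class ProposedAction:
    kind: str                  # "payment" or "email"
    amount: float = 0.0
    text: str = ""
    first_contact: bool = False


def escalation_reasons(action):
    """Return every reason this action must go to a human (empty list = none)."""
    reasons = []
    if action.kind == "payment" and action.amount > DOLLAR_THRESHOLD:
        reasons.append("amount over threshold")
    lowered = action.text.lower()
    if any(phrase in lowered for phrase in ESCALATION_PHRASES):
        reasons.append("flagged language")
    if action.kind == "email" and action.first_contact:
        reasons.append("first-contact customer email")
    return reasons


reply = ProposedAction(kind="email", text="I want a REFUND now", first_contact=True)
print(escalation_reasons(reply))
```

Returning the list of reasons, instead of a bare yes or no, gives the human reviewer the context for why the action landed in their queue.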

How many agents is too many for a small business to manage?

There's no clean answer, but the Times reporting suggests the ceiling isn't about the number of agents. It's about whether you have a coordination layer. Owners running 10-plus agents without integration tools described spending more time debugging conflicts than the agents were saving. Owners running fewer agents with a shared memory or orchestration setup (tools like LangChain, n8n, or platforms built on top of them) reported a much cleaner experience.

A rough heuristic: if your agents are touching the same customer records, the same financial accounts, or the same communication channels, they need to share context. If they're fully siloed by function, you can manage them independently.
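That heuristic is easy to check mechanically once you list what each agent touches. A small sketch, assuming you can describe each agent as a set of resource names (the agent names and resources below are made up for illustration):

```python
from itertools import combinations


def shared_resources(agents):
    """Map each pair of agents to the resources they both touch.

    agents: dict of agent name -> set of resource names.
    Pairs with no overlap are left out of the result.
    """
    overlaps = {}
    for (name_a, res_a), (name_b, res_b) in combinations(sorted(agents.items()), 2):
        common = res_a & res_b
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical inventory of agents and what they touch.
agents = {
    "finance": {"customer_records", "bank_account"},
    "support": {"customer_records", "support_inbox"},
    "research": {"web_search"},
}
print(shared_resources(agents))
```

Any pair that shows up in the result is a pair that needs shared context. An agent that never appears, like the research agent here, can be managed on its own.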

| Agent type | Risk level | Human checkpoint needed? |
|---|---|---|
| Internal summarization / research | Low | No |
| Drafting (human sends) | Low | Yes, before send |
| Customer email (auto-send) | High | Yes, with escalation rules |
| Finance / payments | High | Yes, above threshold |
| Scheduling / calendar | Medium | Depends on whether it is external-facing |
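If you want the table above to be enforceable rather than aspirational, one option is to keep it as data that your setup consults before an agent goes live. This is a sketch only: the key names are invented labels for the table rows, and the strict default for unknown agent types is our assumption, not something from the Times reporting.

```python
# Each entry mirrors one table row: (risk level, human checkpoint).
CHECKPOINTS = {
    "internal_summarization": ("low", "no"),
    "drafting": ("low", "yes, before send"),
    "customer_email_auto_send": ("high", "yes, with escalation rules"),
    "finance_payments": ("high", "yes, above threshold"),
    "scheduling": ("medium", "depends on whether it is external-facing"),
}

# Assumption: anything not in the table gets the strictest treatment.
DEFAULT_CHECKPOINT = ("high", "yes, always")


def checkpoint_for(agent_type):
    """Look up the risk level and checkpoint rule for an agent type."""
    return CHECKPOINTS.get(agent_type, DEFAULT_CHECKPOINT)


print(checkpoint_for("finance_payments"))
print(checkpoint_for("inventory_reordering"))  # not in the table
```

Defaulting unknown agent types to the strictest rule means a new agent has to be classified deliberately before it gets any autonomy.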

What does this mean for SMB owners who are earlier in the process?

If you haven't deployed agents yet, you have the advantage of learning from these failures without living them. The owners in the Times piece were early movers who figured it out by breaking things. You don't have to.

The practical implication is that agent deployment is a systems design problem, not a software problem. The AI part is mostly solved. The hard part is mapping your existing workflows well enough to know where an autonomous decision causes damage versus where it creates leverage.

According to McKinsey's 2024 State of AI report, organizations that invest in AI governance and workflow integration see adoption success rates roughly 2x higher than those that deploy tools without structured implementation. That gap is visible in the Times reporting: the owners who built controls first are scaling. The ones who deployed first and patched later are still patching.

What we'd actually do

  • Map before you build. Before deploying any agent with action permissions, document every decision point in that workflow: what it can approve, what it can send, and what it can modify. If you can't map it, you can't govern it.
  • Set approval thresholds on day one. For finance and customer-facing agents, define the specific conditions that require a human in the loop before the agent takes action. Build those rules in before the agent goes live, not after you see the first mistake.
  • Run a weekly 20-minute agent review. Pull the action logs from your agents once a week and spot-check 10–15 decisions. This is the minimum viable oversight layer. It catches drift, surfaces surprises, and keeps you from discovering a problem six weeks after it started.
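The weekly review works best when the sample is random, so you aren't only looking at the actions that happen to sit at the top of the log. A minimal sketch, assuming your agent logs can be exported as a list of records (the `weekly_sample` helper and the log format are hypothetical):

```python
import random


def weekly_sample(actions, k=10, seed=None):
    """Pick up to k logged actions at random for human review.

    If the log has k or fewer entries, review all of them.
    Pass a seed only when you need a repeatable sample.
    """
    if len(actions) <= k:
        return list(actions)
    return random.Random(seed).sample(actions, k)


# Illustrative log: 40 action records for one week.
log = [{"id": i, "agent": "support", "action": "sent_reply"} for i in range(40)]
for record in weekly_sample(log, k=10):
    print(record["id"], record["action"])
```

Raise `k` toward 15 for agents with write access to money or customers, and keep it at the low end for read-only ones.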

FAQ

How many AI agents can a small business realistically manage?

It depends less on the count and more on whether your agents share context. Owners running 10-plus isolated agents reported more coordination problems than time saved. If your agents touch the same customers, finances, or communication channels, they need a shared layer. If they're fully siloed by function, you can manage them independently with basic weekly reviews.

What's the biggest mistake SMB owners make when deploying AI agents?

Skipping the escalation design. Most owners set up agents that work fine in testing and then deploy with no rules for what happens at the edge cases: a payment above a certain size, a complaint with a specific tone, a first-contact customer. Those gaps don't show up immediately. They show up six weeks later when something breaks in a way that's hard to explain.

Do I need technical staff to run AI agents in my business?

Not necessarily, but you do need someone who owns the process. The owners in the NYT piece who succeeded weren't all technical. They were operationally rigorous. They reviewed logs, set clear rules, and treated agent outputs like employee outputs. The platforms have gotten accessible enough that the bottleneck is now judgment, not code.
