I Write a Newsletter About AI. AI Writes Most of It. Here's the Whole System.
Each article in this newsletter used to take me roughly eight hours of focused work, spread across two or three evenings. Today, the same article takes about forty-five minutes of my time, and most of that is reading.
The rest is handled by a pipeline of seven agents that I built over the last few months, each of which does one specific job before passing its output to the next.
I'm a product manager who writes about AI and product management, which made the situation increasingly awkward: I was spending most of my week doing the kind of repetitive, structured work I keep telling other people to automate.
So I sat down and built the pipeline I had been writing about.
It has now produced thirty-nine articles, and the rest of this post is a complete walkthrough of how it actually works: what each agent does, the design decisions that made it usable, and the parts I got wrong.
Let’s dig in!
The Seven Agents
The pipeline is deliberately simple.
Each agent has a single, narrow responsibility, and it stops as soon as that job is done. There is no clever orchestration layer deciding what should happen next.
The agents always run in the same order, because the order of work in content production is itself fixed.

7 Agents. Each with a specific job.
Discovery searches a curated list of blogs, subreddits, and X accounts, scores candidate topics against a rubric, and returns the five strongest options.
Research takes the topic I select and gathers the source article plus four to six supporting pieces: papers, interviews, follow-ups, anything that adds depth.
Writer loads my voice samples (previously published articles I'm happy with), reads the research, and produces the full draft.
Fact-Checker reads the draft as if it had never seen the research, then re-verifies every date, statistic, name, and technical claim against the sources, so hallucinations are caught before anything ships.
Twitter Article writes the long-form X companion piece. It draws on the same research but takes a different angle, so the X version isn't just a summary.
Image Agent plans every visual the article needs and writes the Gemini prompts to generate them.
Packager assembles the final Word document: the article, the X companion, the images, and a fact-check log, and copies it to the right Google Drive folder.
Of those seven, only two ever stop and ask me anything. Discovery pauses so I can pick a topic, and the Writer pauses after the draft so I can approve or send it back.
The other five run end-to-end without input.
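If you compressed the whole loop into code, the control flow would look something like this. A minimal sketch with hypothetical function names, not the real orchestrator:

```python
# Minimal sketch of the pipeline's control flow. Every function name here is
# a hypothetical stand-in; the real orchestrator's API looks different.

def run_pipeline(project_type: str) -> None:
    candidates = discovery(project_type)        # searches sources, returns 5 scored topics
    topic = ask_me_to_pick(candidates)          # human gate 1: I choose the topic

    sources = research(topic)                   # source article + 4-6 supporting pieces
    draft = writer(topic, sources, voice_samples(project_type))
    if not ask_me_to_approve(draft):            # human gate 2: approve or send back
        return

    checked = fact_checker(draft, sources)      # re-verifies every claim against the sources
    companion = twitter_article(topic, sources) # different angle on the same research
    images = image_agent(checked)               # plans visuals, writes the Gemini prompts
    packager(checked, companion, images)        # assembles the Word doc, copies to Drive
```

Notice there is no branching beyond the two gates. The sequence is hard-coded.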
A Real Run
To make this concrete, here's what a session looks like in practice. I kicked off Discovery while drafting this article. The instruction I gave was straightforward:
Run Discovery for an ai_case_study. Companies that shipped something interesting with LLMs. Skip Netflix, Airbnb, Stripe, Meta, Spotify, Shopify, LinkedIn, Grab, Uber, Figma, Notion, Snap, Anthropic.

Three minutes and eight tool calls later, the agent returned the five candidates it considered strongest, each with a score against the rubric:

I replied with go with option 2, and roughly forty-five minutes later a finished draft of the DoorDash piece (researched, written in my voice, fact-checked, illustrated, and packaged into a Word document) was sitting in my output folder waiting for review.
That is the entire loop, every time.
The interesting part is not the loop itself. It's the design decisions that make the loop reliable enough to trust. Those are what the rest of this article is about.
The CLAUDE.md File That Holds It Together
The single most important file in the project isn't a Python script. It's a Markdown file at the project root called CLAUDE.md. Claude reads it at the start of every session.
And in a few seconds, it has learned the entire system: the project types, the pipeline order, the rules around fact-checking, the conventions for image generation, and so on.
That file is the contract between the agents and me.
The most important sentence in it is this:
» Claude handles judgment. Python handles execution.

That single line is the design principle the whole system is organised around. Every time I tried to have Claude do something mechanical, such as managing file paths, calling an image API directly, or parsing structured data, it failed in subtle, frustrating ways.
And every time I had a Python script try to make a judgment, such as picking a topic or evaluating whether a draft was any good, the output was confidently wrong.
Agents are good at thinking. Scripts are good at doing. Once I stopped trying to blur that line, the system stopped breaking.
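For reference, the shape of the file is roughly this. Abridged and paraphrased, not the literal contents:

```markdown
# CLAUDE.md (abridged sketch)

## Principle
Claude handles judgment. Python handles execution.

## Project types
ai_case_study, educational_explainer, product_case_study, ai_tutorial

## Pipeline order (fixed)
Discovery -> Research -> Writer -> Fact-Checker -> Twitter Article -> Image Agent -> Packager

## Rules
- Fact-Checker re-verifies every date, statistic, name, and technical claim against the sources.
- Image prompts specify exact text, exact dimensions, exact layout.
- Discovery and Writer pause for my input; the other five run end-to-end.
```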
Two Layers: Configuration and Engine
I publish four kinds of content:
AI case studies
Educational explainers
Product case studies
AI tutorials
Each one has its own writing guidelines, its own structural conventions, and its own voice samples: a small set of previously published articles I'm happy with, used as a tone reference for the Writer.
What I realised early on is that those four formats don't actually need different pipelines. They need a different configuration. The agents themselves don't change.
Discovery still searches and scores, the Writer still drafts, the Fact-Checker still verifies. They load a different prompt and a different set of samples depending on the format.
One pipeline, four modes. The order in which the agents run never changes either: Discovery, Research, Writer, Fact-Checker, Twitter Article, Image Agent, Packager.
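In code, the configuration layer is little more than a lookup table. A hypothetical sketch; the real file names and keys differ:

```python
# Hypothetical sketch of the configuration layer: one engine, four modes.
# Each format swaps in its own guidelines and voice samples; the agents never change.
from pathlib import Path

FORMATS = {
    "ai_case_study":         "prompts/ai_case_study.md",
    "educational_explainer": "prompts/educational_explainer.md",
    "product_case_study":    "prompts/product_case_study.md",
    "ai_tutorial":           "prompts/ai_tutorial.md",
}

def load_config(project_type: str) -> dict:
    """Load the writing guidelines and voice samples for one format."""
    prompt = Path(FORMATS[project_type]).read_text()
    samples = [p.read_text() for p in sorted(Path(f"samples/{project_type}").glob("*.md"))]
    return {"prompt": prompt, "samples": samples}
```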
Early on, I considered building a meta-agent that would decide which step to run next, because that's the kind of thing that sounds elegant in a system diagram.
I killed the idea quickly. Spending tokens and latency on a model deciding something I already know is just a waste. The rule I ended up with is the same as the one above: use LLMs for judgment, use code for sequencing.
Three Decisions That Made It Work
A few specific choices, more than anything else, are what turned this from a flaky prototype into something I actually use every week.
The first was being deliberate about which model runs which agent.
The Writer uses Opus because it's producing content I'll publish under my own name, and the difference in voice is genuinely worth the cost.
Every other agent (Discovery, Research, Fact-Checker, Packager) runs on Sonnet, which is faster, cheaper, and more than accurate enough for the work it's doing.
Putting Opus on fact-checking would be like hiring a novelist to proofread a spreadsheet: technically possible, mostly wasted.
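The routing itself is one small table. Hypothetical identifiers; the actual model IDs change as new versions ship:

```python
# Hypothetical model routing: Opus only where voice matters, Sonnet everywhere else.
MODEL_FOR_AGENT = {
    "discovery":       "sonnet",
    "research":        "sonnet",
    "writer":          "opus",    # publish-quality voice is worth the cost here
    "fact_checker":    "sonnet",
    "twitter_article": "sonnet",
    "image_agent":     "sonnet",
    "packager":        "sonnet",
}
```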
The second decision was to give every agent an explicit tool budget. Discovery's prompt, for instance, contains a single line: "BUDGET: 10 tool calls maximum."
Without that line, the agent would happily run twenty or thirty searches and rarely find better topics for the extra effort.
Tight budgets force the agent to make sharper decisions earlier, just as real deadlines force humans to. The constraint is the feature.
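The budget line lives in the prompt, but the same constraint is cheap to enforce in code as a backstop. A sketch, assuming an agent loop you control:

```python
# Sketch of a hard tool budget, assuming you own the agent loop.
# `step` runs one model turn and reports how many tool calls it made.
MAX_TOOL_CALLS = 10  # mirrors the prompt line "BUDGET: 10 tool calls maximum"

def run_with_budget(step, state):
    used = 0
    while not state.done:
        if used >= MAX_TOOL_CALLS:
            state.force_final_answer()  # out of budget: decide with what you have
            break
        state, new_calls = step(state)
        used += new_calls
    return state
```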
The third, and the one I underestimated most at the start, was making the Writer load voice samples before it writes a single word.
Same model, same prompt: with the samples in context, the output reads like the newsletter; without them, it reads like a competent but generic AI draft.
The samples are the difference between a draft I rewrite from scratch and a draft I lightly edit. I now treat sample selection as a first-class part of the system rather than an afterthought.
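Mechanically there is nothing clever about it; the samples are simply placed in context before the research. A hypothetical sketch of the prompt assembly:

```python
# Hypothetical sketch: voice samples enter the Writer's context ahead of the research.
def build_writer_prompt(guidelines: str, samples: list[str], research: str) -> str:
    sample_block = "\n\n---\n\n".join(samples)
    return (
        f"{guidelines}\n\n"
        f"Published articles whose voice you must match:\n\n{sample_block}\n\n"
        f"Research for the new article:\n\n{research}\n\n"
        "Write the full draft in the same voice as the samples."
    )
```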
The Part That Didn't Work
I'll be honest: I built a proper web interface for the pipeline.
Real-time streaming of agent output, approval gates rendered in the browser, all the visible polish you'd expect. It works. I demoed it to a few people.
And I have essentially never used it for actual writing.
The reason is simple. When something breaks (and in a system with seven agents, something breaks regularly), what I want is the terminal. I want to see the error, describe the problem to Claude in plain language, get a fix, and keep going.
The browser made that loop slower, because every debug step had to round-trip through a UI that wasn't designed for debugging.
The git log is the honest record: there was one commit to build the UI, and twenty-one subsequent commits to fix things that broke because of it. The interface that looks most impressive in a screenshot is rarely the one you reach for under time pressure.
The Visual Layer
Every article needs images, and Gemini generates them. The catch with Gemini is that the prompts have to be unusually specific: exact text, exact dimensions, exact layout.
If you leave any of those vague, Gemini will invent its own version, and the invented version is almost always wrong in some small but visible way.
Most of the rules I follow now exist because something went wrong in a previous article.
Never put more than three boxes in a single horizontal row, because Gemini will clip the fourth.
Never style footer text as handwriting, because Gemini will hallucinate the words.
Never assume the model will preserve a specific font weight without being told.
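To make "unusually specific" concrete, here's the shape a prompt ends up taking. An invented example, not one from a real article:

```python
# Invented example of how specific a Gemini image prompt has to be.
IMAGE_PROMPT = """\
Create a 1600x900 diagram on a plain white background.
Three rounded boxes in a single horizontal row (never four), each 400px wide,
labelled with this exact text: "Discovery", "Research", "Writer".
Solid dark-grey arrows (#333333, 4px) pointing left to right between the boxes.
All text in a bold geometric sans-serif. No handwriting styles anywhere.
Footer caption in the same printed font, exact text: "One pipeline, four modes."
"""
```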
Each of those rules is a scar. For a long time, I carried all of them in my head and re-typed them at the start of every session, which was both annoying and unreliable.
The fix was to turn them into a Claude Code skill, a single Markdown file that Claude loads when I type /japm-visuals.
The skill contains five image templates, six visual themes, and a final review step where Claude re-reads each finished visual as if it were a first-time reader and flags anything that looks unclear or inconsistent.
That last step alone has caught problems on every article I've run it on, including several I would otherwise have shipped.
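The skill file itself is just structured Markdown, roughly this shape. Paraphrased, not the real file:

```markdown
# japm-visuals (abridged, paraphrased sketch)

## Image templates
Five templates, one per recurring visual type in the newsletter.

## Visual themes
Six themes; each fixes colours, typography, and layout conventions.

## Hard rules
- Never more than three boxes in a single horizontal row.
- Never style footer text as handwriting.
- Always state font weights explicitly.

## Final review
Re-read each finished visual as a first-time reader and flag anything
unclear or inconsistent.
```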

One skill that decides what to visualise and creates it in 6 themes
The Result (So Far)
A few facts about the system as it stands today.
Thirty-nine finished articles have come out of this pipeline so far: pieces on Netflix, Airbnb, Stripe, Meta, Spotify, Shopify, Grab, DoorDash, Figma, Notion, Snap, Anthropic, Perplexity, Temu, Peloton, and roughly two dozen others.
About 3,960 lines of Python. That covers the seven agents, one orchestrator, a small tool registry, four shared utilities, and the FastAPI layer behind the web UI I no longer use.
Forty-five minutes of my time per article, and two decisions on my part. I pick the topic, and I approve the draft. Everything else happens without me.

What used to take 8 hours now takes 45 mins.
The Broader Point
The reason any of this works is that the mechanical parts of content production (finding topics, gathering sources, fact-checking, sourcing images, and assembling the document) don't actually require judgment. They look like they do, because they involve language and taste, but most of the work in each step is structured and repeatable.
Once you accept that, those steps can all be automated, and surprisingly cleanly.
The parts that need judgment are smaller than they appear from the outside.
In my case, there are exactly two:
Deciding what to write about
Deciding whether the draft is good enough to publish
Everything else, however creative it felt while I was doing it manually, was just time.
The question I'd encourage anyone running a weekly workflow to ask is a version of the same one I had to ask myself: which two or three decisions in this process are actually mine to make, and which of the surrounding steps are infrastructure I haven't built yet?
— Sid
P.S. Here is a list of all the things I am/was building with AI. Let me know which teardown you would like to see next.
